Introduction
Overview

The Genotate platform allows the automatic annotation of transcript sequences. Annotations can be predicted based on sequence homology and functional analyses at both the transcript and amino acid levels. Identified annotations can be easily visualized using interactive viewers. Furthermore, users can search for transcripts having specific features among their annotation results.

In this tutorial, the main functionalities of the Genotate transcript annotation platform are described, such as:

  • the annotation of a single transcript sequence
  • the annotation of multiple transcript sequences
  • the visualization of annotation results
  • the exploration and identification of transcript sequences based on their identified annotations

Additionally, several Genotate management functionalities are described, such as:

  • the management of annotation results
  • the management of homology references
  • the configuration of the Genotate database
  • the configuration of the Genotate parameters

The algorithms, tools and databases used by Genotate are described at the end of this tutorial.

Background

High-throughput technologies generate large quantities of complex, high-dimensional biological data. These techniques are increasingly precise, and acquisition costs are steadily decreasing. In particular, RNA-seq, a next-generation sequencing (NGS) technique, can be used to characterize the transcriptome of new animal species or specific cell types.

RNA-seq technologies generally produce fragments of transcriptomic sequences, named reads, which need to be assembled. Illumina is one of the most widely used RNA-seq technologies and produces reads of up to a few hundred bases. The PacBio and Nanopore technologies can produce much longer reads, from tens to hundreds of kilobases. Reads are usually assembled into transcripts using one of several assembly algorithms.

Once assembled, transcripts must be annotated. Transcript annotations can be defined at either the homology or the functional level. First, transcripts can be annotated based on their homology with annotated transcriptomic references. Second, proteins translated from transcript sequences can be annotated based on their homology with annotated proteomic references and on their peptide domains.

Genotate functionalities

The Genotate web platform has been developed to allow non-bioinformaticians to annotate their transcript sequences automatically. Annotation results can be visualized, and users can search for specific transcripts within their annotation results.

The platform allows users to upload transcripts, specify annotation options, predict the transcript annotations, visualize the identified annotations, search for transcripts having specific annotations, and download the computed results.

This platform also provides administrative interfaces to manage annotation results and homology references. Finally, the platform provides interfaces to configure the software dependencies and database parameters.

Overview of the annotation workflow

The Genotate annotation pipeline takes as input one FASTA file containing a single transcript or multiple transcripts. The annotation steps and options are defined using the web interface. For each reconstructed transcript, Genotate first detects the set of all possible ORFs in order to annotate them. All ORFs detected by Genotate are annotated based on: (i) their homology with other reference sequences, also named homology annotations; and (ii) the presence of peptidic functional elements on their translated proteins, also named functional annotations. Homology annotations are computed based on any reference dataset of transcriptomic or proteomic sequences specified by users or available by default in Genotate. Functional annotations are computed based on a compendium of publicly available computational tools and databases specified by the user.

A large collection of annotation services and databases is available in Genotate. Reference transcriptomic and proteomic datasets from the NONCODE, UniRef, and Ensembl databases are available, covering more than 100 animal species. Additionally, around 30 different protein annotation algorithms are available. Non-coding transcripts can also be analyzed with Genotate.

Annotation of single or multiple transcripts
Transcript annotation

Transcripts can be annotated using the single or multiple transcript annotation interfaces. Sequences can be submitted either directly as text or as a FASTA file. Sequences must contain only the characters 'A', 'T', 'G', 'C', and 'N'. Example sequences are available in the single transcript annotation interface.
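
As an illustration of the character rule above, the following Python sketch parses FASTA text and reports any character outside 'A', 'T', 'G', 'C', and 'N'. It is not Genotate's own validation code: the function names `parse_fasta` and `invalid_characters` are hypothetical, and accepting lowercase input by uppercasing it is an assumption.

```python
ALLOWED_BASES = set("ATGCN")


def parse_fasta(text):
    """Parse FASTA text into a dict mapping each header to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequence lines are uppercased (assumption) and concatenated.
            records[name].append(line.upper())
    return {header: "".join(parts) for header, parts in records.items()}


def invalid_characters(sequence):
    """Return the sorted list of characters that are not A, T, G, C, or N."""
    return sorted(set(sequence.upper()) - ALLOWED_BASES)
```

For example, `invalid_characters("ATGU")` returns `['U']`, flagging a base outside the accepted alphabet.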

Each interface provides the different options to parametrize the ORF detection, homology, and functional analyses.


ORF identification

For each transcript to analyze, Genotate first detects the set of all possible ORFs based on the parameters selected in the ORF panel. ORFs are then translated to obtain the associated protein sequences. The start and stop codons (which begin and end the ORFs) can be specified by users. By default, the start codon is 'ATG' and the stop codons are 'TAG', 'TGA', and 'TAA'. ORFs shorter than a length threshold can be filtered out to avoid interpreting sequences with no biological meaning. Inner ORFs (ORFs nested within a larger ORF) can also be identified, as well as outside ORFs (ORFs lacking either the start or the stop codon). By default, the complete transcript sequence is retained so that it can be annotated as a non-coding RNA. By default, ORFs are also identified on the reverse-complemented transcript sequence.


In detail, the proteins associated with a transcript are obtained by detecting all possible ORFs on the transcript. A reading frame is a succession of nucleotide triplets called codons. The transcript sequence can be read in three frames, each shifted by one base along the strand. The transcript sequence can also be reversed and its bases complemented to obtain the reverse-complemented sequence, which provides three more frames. An Open Reading Frame (ORF) begins with a start codon and ends with a stop codon. A codon is translated either to an amino acid or to an end-of-translation signal. A codon that initiates translation, such as 'ATG', is called a start codon. A codon that ends translation, such as 'TAG', 'TGA', or 'TAA', is called a stop codon. A protein is obtained by translating the sequence of an Open Reading Frame.
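
The ORF definitions above can be sketched in Python as follows. This is a minimal illustration, not the Genotate implementation: it only reports complete ORFs running from the first start codon to the next in-frame stop codon, and it ignores inner and outside ORFs. The function names and the `min_length` parameter (a length in nucleotides) are hypothetical, and translation uses the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ATGCN", "TACGN")


def reverse_complement(sequence):
    """Complement every base, then reverse the sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def find_orfs(sequence, starts=("ATG",), stops=("TAG", "TGA", "TAA"), min_length=0):
    """Return (frame, begin, end) for complete ORFs on one strand.

    Positions are 0-based, `end` is exclusive, and the stop codon is included.
    """
    orfs = []
    for frame in range(3):
        begin = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if begin is None and codon in starts:
                begin = pos  # first start codon opens the ORF
            elif begin is not None and codon in stops:
                end = pos + 3
                if end - begin >= min_length:
                    orfs.append((frame, begin, end))
                begin = None  # look for the next ORF in this frame
    return orfs


def translate(orf):
    """Translate an ORF into a protein, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(orf) - 2, 3):
        # Codons containing 'N' are not in the table and become 'X'.
        amino_acid = CODON_TABLE.get(orf[pos:pos + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

ORFs on the opposite strand can be obtained by calling `find_orfs(reverse_complement(sequence))`; the returned positions then refer to the reverse-complemented sequence.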

Identification of homology annotations

Homology annotations are computed based on any reference dataset of transcriptomic or proteomic sequences specified by users or available by default in Genotate.


Sequence homologies are identified using the BLAST algorithm. Homology results can be filtered based on the percentage of identity, the percentage of query sequence coverage, and the percentage of reference sequence coverage.
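
The three filters can be illustrated with a short Python sketch. How Genotate computes coverage internally is not described here, so the formulas below are an assumption: coverage is taken as the aligned span divided by the full sequence length. The field names mirror the BLAST+ tabular output specifiers (`pident`, `qstart`, `qend`, `qlen`, `sstart`, `send`, `slen`); the `Hit` class and the `passes_filters` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    pident: float  # percentage of identical positions in the alignment
    qstart: int    # alignment start on the query (1-based)
    qend: int      # alignment end on the query
    qlen: int      # full query length
    sstart: int    # alignment start on the reference (subject)
    send: int      # alignment end on the reference
    slen: int      # full reference length


def coverage(start, end, length):
    """Percentage of a sequence covered by the aligned span.

    abs() handles hits on the reverse strand, where start > end.
    """
    return 100.0 * (abs(end - start) + 1) / length


def passes_filters(hit, min_identity, min_query_coverage, min_reference_coverage):
    """Keep a hit only if it satisfies all three percentage thresholds."""
    return (
        hit.pident >= min_identity
        and coverage(hit.qstart, hit.qend, hit.qlen) >= min_query_coverage
        and coverage(hit.sstart, hit.send, hit.slen) >= min_reference_coverage
    )
```

With this definition, a hit aligning 90 of 100 query bases to a 200-base reference has a query coverage of 90% but a reference coverage of only 45%, so the two coverage thresholds can reject different hits.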

Identification of functional annotations

Genotate can annotate transcripts based on the functional domains of their associated proteins, using multiple algorithms. These annotation algorithms can be selected in the functional annotation panel. For each algorithm, a threshold or e-value parameter is available to filter the annotation results.


In detail, functional annotations are computed based on a large set of publicly available computational algorithms and databases. In particular, InterProScan identifies conserved functional domains on a protein by unifying multiple protein family databases and alignment algorithms. The databases it unifies include Pfam, SUPERFAMILY, and PANTHER.

Functional annotations are also computed using several other prediction algorithms, such as TMHMM, SignalP, and ProP.


Visualization of annotation results
Annotation results

A result interface is displayed once the annotations have been computed. For each transcript, a panel represents the elements identified on the transcript. Moreover, a result summary panel provides the number of ORFs identified and the number of annotations identified, and allows users to download the associated sequences.

Annotation panel viewer

Once annotated, the transcript annotation viewer panel provides a graphical representation of the identified ORFs. The transcript sequence is represented in blue at the top of the representation. ORFs identified on the transcript sequence are represented under the '> > >' symbols. ORFs identified on the reverse-complemented transcript sequence are represented under the '< < <' symbols. Identified inner ORFs (ORFs nested in a larger ORF), outside ORFs (ORFs lacking a start or stop codon), and ncRNAs are also represented in this overview viewer.


Each annotated ncRNA and ORF is represented by an interactive annotation viewer. This panel provides an interactive representation of the annotations, together with descriptions of the functional and homology annotations. Furthermore, the viewer makes it possible to search the transcript, ORF, and protein sequences in the NCBI databases.


Multiple actions are available through the annotation viewer, such as:

  • Detach the panel in a new window
  • Display or hide on the graph the annotations identified by each algorithm
  • Select the start and end positions of the displayed transcript region
  • Display the panels containing annotation details for either functional annotations or similarity annotations
  • Run an NCBI BLAST search using either the nucleic sequence or the protein sequence, if available
  • Download the sequences of the transcript, coding or non-coding regions, associated protein, and the identified annotations
Search identified annotations
Search annotations

The search interface allows users to explore the available annotated transcripts based on specific criteria. By default, searches are made on the whole set of annotation result datasets. Specific datasets can be selected to limit the exploration of annotated transcripts. For each annotation algorithm, a specific annotation can be selected. It is also possible to search for any annotation from an algorithm, constrained by a minimum and a maximum number of annotations.
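
The search criteria described above can be sketched in Python. The record layout (a dict holding a list of annotations, each with an `algorithm` and a `name`) and the `search` function are hypothetical and only illustrate the filtering logic. They do not reflect how Genotate stores or queries its results.

```python
def search(records, algorithm, name=None, min_count=1, max_count=None):
    """Return the records whose annotations match the search criteria.

    algorithm: only annotations produced by this algorithm are counted.
    name:      if given, only annotations with this exact name are counted.
    min_count: minimum number of matching annotations required.
    max_count: maximum number of matching annotations allowed (None = no limit).
    """
    matches = []
    for record in records:
        hits = [
            annotation
            for annotation in record["annotations"]
            if annotation["algorithm"] == algorithm
            and (name is None or annotation["name"] == name)
        ]
        if len(hits) < min_count:
            continue
        if max_count is not None and len(hits) > max_count:
            continue
        matches.append(record)
    return matches
```

For instance, `search(records, "TMHMM", min_count=2)` would keep only the ncRNAs and ORFs carrying at least two annotations from that algorithm.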


A summary panel provides the number of ncRNAs and ORFs matching the annotation filters, and allows users to download the sequences and the annotations.


The ncRNAs and ORFs matching the annotation filters are displayed in the result panel, with 20 results per page. They can be ordered by length, start position, or end position.
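
The ordering and paging behavior can be sketched as follows. This is an illustration under assumptions: results are modeled as dicts with hypothetical `begin` and `end` fields (inclusive positions), the order is ascending, and the `page_of_results` function is not part of Genotate.

```python
PAGE_SIZE = 20  # number of results shown per page

# Sort keys for the three supported orderings.
SORT_KEYS = {
    "length": lambda result: result["end"] - result["begin"] + 1,
    "begin": lambda result: result["begin"],
    "end": lambda result: result["end"],
}


def page_of_results(results, order_by="length", page=1, page_size=PAGE_SIZE):
    """Return one page of results (pages are numbered from 1)."""
    ordered = sorted(results, key=SORT_KEYS[order_by])
    first = (page - 1) * page_size
    return ordered[first:first + page_size]
```

With 45 matching results, pages 1 and 2 hold 20 results each and page 3 holds the remaining 5.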

Administration of annotation results
Manage annotations

The annotation management interface lists the annotated transcript datasets with their current computation status, annotation parameters, and results, and makes it possible to rename or delete them. For each dataset, the transcript sequences, the ORF sequences, and the annotations can be downloaded.


The transcript dataset details panel provides the dataset information, the ORF identification parameters, the functional annotation algorithms with their threshold or e-value, and the similarity annotation references with their identity and coverage thresholds.


Administration of homology references
Create a homology reference

Datasets of nucleic or proteomic sequences can be used as references for annotating submitted transcripts by homology. Admin users can create a homology reference by providing a FASTA file or an FTP link.


Manage homology references

The list of all available homology references can be displayed with their current computation status and sequence. Through this interface, it is possible to rename or delete the homology reference. The details of each homology reference can be displayed to provide the release, the species, the sequence type, and the description of the reference.

Import homology references from public databases

Transcriptomic and proteomic datasets from the NONCODE, UniRef, and Ensembl databases can be easily imported as homology references. For each dataset, a description and an external link to the public database are provided.


Configuration of the Genotate platform
Parallelization of annotation computations

Genotate annotation computations can be parallelized to process large transcript datasets efficiently. The pipeline can execute multiple processes, called workers, simultaneously. The number of workers is set to 8 by default and can be changed by users.


For each annotation query, the whole pool of identified ncRNAs and ORFs is split into subsets of 100 sequences. For each subset, sequences are annotated by the different algorithms. Each algorithm annotation is computed by multiple workers. The annotations obtained are merged into a common result file.
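
The splitting and merging steps can be sketched in Python. This is a simplified illustration, not the Genotate pipeline (which is launched with Java): a thread pool stands in for the worker processes, each algorithm is modeled as a hypothetical function that takes a subset of sequences and returns a list of annotations, and scheduling one task per (algorithm, subset) pair is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_subsets(sequences, size=100):
    """Split a list of sequences into consecutive subsets of at most `size`."""
    return [sequences[i:i + size] for i in range(0, len(sequences), size)]


def annotate_all(sequences, algorithms, workers=8, subset_size=100):
    """Run every algorithm on every subset using a pool of workers."""
    # One task per (algorithm, subset) pair.
    tasks = [
        (algorithm, subset)
        for subset in split_into_subsets(sequences, subset_size)
        for algorithm in algorithms
    ]
    merged = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves task order, so the merged result is deterministic.
        for annotations in pool.map(lambda task: task[0](task[1]), tasks):
            merged.extend(annotations)
    return merged
```

With 250 sequences and the default subset size, three subsets of 100, 100, and 50 sequences are produced, and each algorithm is run once per subset.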

Dependencies configuration

To annotate transcript datasets, the web platform uses a local annotation pipeline, which is launched with Java. The web platform allows users to create reference datasets for similarity annotation, and BLAST is required to generate a sequence database for each reference dataset. The web platform also requires several folders to store the uploaded transcripts and reference datasets, as well as the files generated by the annotation pipeline, including the annotation result files.

The web platform configuration file 'web/genotateweb.config' is required to use the web platform dependencies interface. It contains the paths to the folders and binaries that the platform needs to run properly. The annotation pipeline binaries are automatically downloaded from GitHub into a binaries folder. The folders are automatically created if they do not exist.


The annotation pipeline configuration file 'binaries/genotate.config' is required to use the annotation pipeline dependencies panel. The annotation pipeline requires annotation algorithms and similarity annotation datasets to be installed. The Genotate annotation pipeline dependencies can be installed by following the instructions available at https://github.com/tchitchek-lab/genotate.life.


Annotation colors

This interface allows users to modify the color associated with each homology and functional annotation in the graphical representations of transcript sequences.

Database configuration

Genotate requires a database management system (DBMS) to store several kinds of information, such as annotation results, associated parameters, algorithm information, homology reference dataset information, and user information. The database configuration interface allows users to initialize and configure the Genotate database by providing the hostname, database name, user name, and password. The database can be reset using this interface, and it can be created if it does not already exist.


Configuration of algorithms and databases
Algorithms used for the identification of functional annotations

Many public algorithms are used by Genotate to identify functional annotations on the transcript sequences. These algorithms are described in the table below.

names | description | hosting institutes
Interproscan | InterProScan uses protein family patterns to search for functional domains on proteins. Protein families group proteins with the same function, and conserved domains can be identified within a family. Protein family databases provide a large number of patterns and conserved domains. | EMBL-EBI, Wellcome Genome Campus, Hinxton
Tmhmm | TMHMM predicts transmembrane domains and the cellular location of the inter-transmembrane domains, based on hidden Markov models. Transmembrane domains play a central role in membrane biochemical processes. | National Center for Biotechnology Information, NLM/NIH, Bethesda; Department of Biochemistry, Arrhenius Laboratory, Stockholm University; Center for Biological Sequence Analysis, Technical University of Denmark
Signalp | SignalP predicts the secretory signal peptide, a ubiquitous signal that targets proteins for translocation across the membrane, based on neural networks. | Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Lyngby, Denmark; Novo Nordisk Foundation, Center for Protein Research, Health Sciences Faculty, University of Copenhagen, Copenhagen, Denmark; Center for Biomembrane Research, Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden; Science for Life Laboratory, Stockholm University, Solna, Sweden
Prop | ProP predicts arginine and lysine propeptide cleavage sites, which characterize inactive peptide precursors. The precursors undergo post-translational processing to become biologically active polypeptides. | Center for Biological Sequence Analysis, BioCentrum DTU, Technical University of Denmark
Coils | Predicts coiled-coil conformations. | SIB Swiss Institute of Bioinformatics
NETCGLYC | Glycosylation covalently attaches a carbohydrate to proteins and lipids. Some proteins must be glycosylated to fold correctly. NetCGlyc produces neural network predictions of C-mannosylation sites in mammalian proteins. | Department of Medical Biochemistry and Biophysics, Karolinska Institutet, SE-171 77 Stockholm, Sweden and Stockholm Bioinformatics Center
NETNGLYC | Glycosylation covalently attaches a carbohydrate to proteins and lipids. Some proteins must be glycosylated to fold correctly. NetNGlyc predicts N-glycosylation sites in human proteins using artificial neural networks. | Center for Biological Sequence Analysis, The Technical University of Denmark, Lyngby, Denmark
BEPIPRED | An epitope, also known as an antigenic determinant, is the part of an antigen that is recognized by the immune system, specifically by antibodies or B cells. BepiPred predicts the location of linear B-cell epitopes using a combination of a hidden Markov model and a propensity scale method. | Center for Biological Sequence Analysis, BioCentrum-DTU, Building 208, Technical University of Denmark
MHCI | An epitope, also known as an antigenic determinant, is the part of an antigen that is recognized by the immune system, specifically by antibodies or T cells. The MHC I tool from the IEDB database determines the ability of each subsequence to bind to a specific MHC class I molecule. | Division of Vaccine Discovery, La Jolla Institute for Allergy and Immunology
MHCII | An epitope, also known as an antigenic determinant, is the part of an antigen that is recognized by the immune system, specifically by antibodies or T cells. The MHC II tool from the IEDB database predicts MHC class II epitopes, including a consensus approach that combines the NN-align, SMM-align, and combinatorial library methods. | Division of Vaccine Discovery, La Jolla Institute for Allergy and Immunology
rnammer | Annotates ribosomal RNA genes. | Centre for Molecular Biology and Neuroscience and Institute of Medical Microbiology, University of Oslo
tRNAscan | Predicts transfer RNA genes. | Biomolecular Engineering, University of California Santa Cruz
CATH-Gene3D | The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein family clusters are identified according to sequence identity. For each protein family conserved domain, a protein structure is available. | University College, London, UK
CDD | CDD is a collection of annotated multiple sequence alignment models. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI-curated domain models, as well as domain models imported from a number of external source databases. | National Center for Biotechnology Information (NCBI), part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health
MobiDB | MobiDB offers a centralized resource for the annotation of intrinsic protein disorder. The database features three levels of annotation: manually curated, indirect, and predicted. | Biocomputing UP research group, Department of Biomedical Sciences, University of Padua, Italy
HAMAP | HAMAP stands for High-quality Automated and Manual Annotation of Proteins. HAMAP profiles are manually created by expert curators. They identify proteins that are part of well-conserved protein families or subfamilies. | SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
Panther | PANTHER is a large collection of protein families that have been subdivided into functionally related subfamilies, using human expertise. These subfamilies have a more specific function and conserved domain, and provide models for classifying additional protein sequences. | University of Southern California, CA, US
Pfam | Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains. | EMBL-EBI, Hinxton, UK
PIRSF | The PIRSF protein classification system is a network with multiple levels of sequence diversity, from superfamilies to subfamilies, that reflects the evolutionary relationships of full-length proteins and domains. | Protein Information Resource, Georgetown University Medical Centre, Washington DC, US
PRINTS | PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterize a protein family or domain. | University of Manchester, UK
ProDom | The ProDom protein domain database consists of an automatic compilation of homologous domains. Current versions of ProDom are built using a novel procedure based on recursive PSI-BLAST searches. | PRABI Villeurbanne, France
PROSITE | PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns, and profiles that help to reliably identify to which known protein family a new sequence belongs. | Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland
SFLD | SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities. | UC San Francisco, Babbitt Lab, SFLD Team
SMART | SMART (Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. | EMBL, Heidelberg, Germany
SUPERFAMILY | SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure. | University of Bristol, UK
Tigrfam | TIGRFAMs is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs), and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. | J. Craig Venter Institute, Rockville, MD, US
Homology reference databases

Annotated sequences from the genomes, transcriptomes, and proteomes of different species can be used as references to annotate transcripts at the homology level.

names | descriptions
Ensembl | Transcriptomes (cds, cdna, ncrna) and proteomes for a large number of species.
Uniprot | UniProtKB/TrEMBL contains high-quality computationally analyzed records that are enriched with automatic annotation and classification.
Swissprot | Swiss-Prot is a high-quality, manually annotated, and non-redundant protein sequence database.
NONCODE | Database dedicated to non-coding RNAs (excluding tRNAs and rRNAs), by species.

Genotate is supported by the IDMIT infrastructure and funded by the ANR.
IDMIT is part of the French Alternative Energies and Atomic Energy Commission (CEA) | Terms of use (GPL license)