The Genotate platform allows the automatic annotation of transcript sequences. Annotations can be predicted based on sequence homology and functional analyses at both the transcript and amino acid levels. Identified annotations can be easily visualized using interactive viewers. Furthermore, users can search for transcripts having specific features among their annotation results.
In this tutorial, the main functionalities of the Genotate transcript annotation platform are described, such as:
- the annotation of a single transcript sequence
- the annotation of multiple transcript sequences
- the visualization of annotation results
- the exploration and identification of transcript sequences based on their identified annotations
Additionally, several Genotate management functionalities are described, such as:
- the management of annotation results
- the management of homology references
- the configuration of the Genotate database
- the configuration of the Genotate parameters
The algorithms, tools and databases used by Genotate are described at the end of this tutorial.
High-throughput technologies generate large quantities of complex, high-dimensional biological data. These techniques are increasingly precise, and acquisition costs are constantly decreasing. In particular, RNA-seq, a next-generation sequencing (NGS) technique, can be used to characterize the transcriptome of new animal species or specific cell types.
RNA-seq technologies generally produce fragments of transcriptomic sequences, named reads, which need to be assembled. Illumina is one of the most widely used RNA-seq techniques and can sequence reads of up to hundreds of bases. The PacBio and Nanopore techniques can sequence reads of up to hundreds of kilobases. Reads are usually assembled into transcripts using different algorithms.
Once assembled, transcripts must be annotated. Transcript annotations can be defined at either the homology or the functional level. First, transcripts can be annotated based on their homology with annotated transcriptomic references. Second, proteins translated from transcript sequences can be annotated based on their homology with annotated proteomic references and on their peptidic domains.
The Genotate web platform has been developed to allow non-bioinformaticians to automatically annotate their transcript sequences. Annotation results can be visualized, and users can search for specific transcripts within their annotation results.
The platform allows users to upload transcripts, specify annotation options, predict the transcript annotations, visualize the identified transcript annotations, search for transcripts having specific annotations, and download the computed results.
This platform also provides administrative interfaces to manage annotation results and homology references. Finally, the platform provides interfaces to configure the software dependencies and database parameters.
The Genotate annotation pipeline takes as input one FASTA file containing a single transcript or multiple transcripts. The annotation steps and options are defined using the web interface. For each reconstructed transcript, Genotate first separates coding and non-coding transcripts. Coding transcripts contain one or multiple coding ORFs. Genotate detects the set of all possible ORFs, and the ORFs with a protein coding potential over a defined threshold are translated for further annotation. Transcripts without any protein-coding ORF are further annotated as non-coding transcripts. All transcripts are annotated based on: (i) their homology with other reference sequences, also named homology annotations; and (ii) for coding transcripts, the presence of peptidic functional elements on their translated proteins, also named functional annotations. Homology annotations are computed based on any reference dataset of nucleic, transcriptomic or proteomic sequences specified by users or available by default in Genotate. Functional annotations are computed based on a compendium of publicly available computational tools and databases specified by the user.
A large collection of annotation services and databases is available in Genotate. Reference transcriptomic and proteomic datasets from the NONCODE, UniRef, and Ensembl databases are available (covering more than 100 animal species). Additionally, multiple protein annotation tools are available (around 30 different algorithms). Non-coding transcripts can also be analyzed with Genotate.
Transcripts can be annotated using the single or multiple transcript annotation interfaces. Sequences can be submitted either as raw sequences or as a FASTA file. Sequences must not contain characters other than 'A', 'T', 'G', 'C', or 'N'. Example sequences are available in the single transcript annotation interface.
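The character restriction above can be checked before submitting a sequence. The following Python lines are a minimal sketch and not Genotate's own validation code: the function names `invalid_characters` and `is_valid_transcript` are hypothetical, and the check is deliberately strict (uppercase letters only, no whitespace), which is an assumption about how the rule is applied.

```python
# Characters accepted in a submitted transcript sequence, as stated in this tutorial.
ALLOWED_BASES = frozenset("ATGCN")

def invalid_characters(sequence):
    """Return the sorted list of characters that are not A, T, G, C, or N."""
    return sorted(set(sequence) - ALLOWED_BASES)

def is_valid_transcript(sequence):
    """Return True if the sequence is non-empty and uses only allowed characters."""
    return bool(sequence) and not invalid_characters(sequence)
```

For instance, `invalid_characters("ATGXUAG")` reports `X` and `U`, which helps locate the offending characters in a rejected sequence.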
Each interface provides the different options to parametrize the ORF detection, homology, and functional analyses.
For each transcript to analyze, Genotate first evaluates its coding potential with CPAT. By default, CPAT uses the longest ORF, which is kept if its coding potential is above the selected threshold. ORFs are then translated to obtain the associated protein sequences.
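The decision rule described above can be illustrated with a short Python sketch. It does not call CPAT: the coding potential scores are assumed to have been computed beforehand, and the function name `classify_transcript` and the input layout (a list of pairs) are hypothetical.

```python
def classify_transcript(orfs, threshold):
    """Classify a transcript from its candidate ORFs.

    orfs: list of (orf_sequence, coding_potential) pairs, where the coding
          potential is assumed to come from an external tool such as CPAT.
    Returns ("coding", orf_sequence) or ("non-coding", None).
    """
    if not orfs:
        return ("non-coding", None)
    # Only the longest ORF is considered, mirroring the default behaviour described above.
    longest_sequence, potential = max(orfs, key=lambda orf: len(orf[0]))
    if potential > threshold:
        return ("coding", longest_sequence)
    return ("non-coding", None)
```

The threshold value is left to the caller, since the default used by the platform is selected in the interface.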
The user can also choose to disable CPAT and use a custom detection of all possible ORFs based on parameters selected in the ORF panel. The start and stop codons (which initiate and end the ORFs) can be specified by users. By default, the start codon is set to 'ATG' and the stop codons to 'TAG', 'TGA', and 'TAA'. ORFs shorter than a given length threshold can be filtered out to avoid interpreting sequences with no biological meaning. Inner ORFs (which consist of nested ORF sequences) can also be identified, as well as outside ORFs (which consist of ORFs lacking either the start or the stop codon).
By default, transcripts without any coding ORF are annotated as non-coding RNAs, and ORFs are also identified on the reverse-complemented transcript sequence.
In detail, the proteins associated with a transcript can be obtained from multiple ORFs encoded on the transcript. A frame is composed of nucleotide triplets called codons. The transcript sequence can be read in three frames, each shifted by one base along the strand. The transcript sequence can also be reversed and its nucleic bases complemented to obtain the reverse-complemented sequence, which provides three additional frames. An Open Reading Frame (ORF) begins with a start codon and ends with a stop codon. A codon can be translated into an amino acid or into an end-of-translation signal. A codon encoding the beginning of the translation, such as 'ATG', is called a start codon. A codon encoding the end of the translation, such as 'TAG', 'TGA', or 'TAA', is called a stop codon. A protein is obtained by translating the sequence of an Open Reading Frame. Moreover, due to sequencing errors, the start or the stop codon can be truncated and missing from the input transcript sequence.
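The notions of frame, start codon, stop codon, and reverse complement can be made concrete with a small Python sketch. It is an illustration under stated assumptions rather than the detection code used by Genotate: the function names are hypothetical, only complete ORFs are reported (inner and outside ORFs are ignored), the standard genetic code is assumed, and codons containing 'N' are translated as 'X'.

```python
COMPLEMENT = str.maketrans("ATGCN", "TACGN")

# Standard genetic code, built from the conventional TCAG ordering (assumption).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def reverse_complement(sequence):
    """Complement each base, then reverse the sequence."""
    return sequence.translate(COMPLEMENT)[::-1]

def find_orfs(sequence, starts=("ATG",), stops=("TAG", "TGA", "TAA"), min_length=0):
    """Return (frame, begin, end) for each complete ORF on one strand.

    begin is the 0-based index of the start codon and end is the index just
    past the stop codon. Within a frame, only the first start codon after a
    stop is used, so nested (inner) ORFs are not reported.
    """
    orfs = []
    for frame in range(3):
        begin = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if begin is None and codon in starts:
                begin = pos
            elif begin is not None and codon in stops:
                if pos + 3 - begin >= min_length:
                    orfs.append((frame, begin, pos + 3))
                begin = None
    return orfs

def translate(orf):
    """Translate an ORF codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE.get(orf[pos:pos + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

Calling `find_orfs(reverse_complement(sequence))` scans the three frames of the opposite strand; the returned positions are then relative to the reverse-complemented sequence.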
Homology annotations are computed based on any reference dataset of transcriptomic or proteomic sequences specified by users or available by default in Genotate.
Sequence homologies are identified using the BLAST algorithm. Homology results can be filtered based on the percentage identity match, the percentage of query sequence coverage, and the percentage of reference sequence coverage.
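The three filters can be sketched in Python as follows. This is an illustration only: the `Hit` record and its field names are hypothetical and do not correspond to a BLAST output format, and coverage is assumed here to be the aligned span divided by the full sequence length.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """Hypothetical record describing one alignment between a query and a reference."""
    reference_id: str
    identity: float      # percentage of identical positions in the alignment
    query_start: int     # 1-based, inclusive
    query_end: int
    query_length: int
    ref_start: int       # 1-based, inclusive
    ref_end: int
    ref_length: int

def coverage(start, end, length):
    """Percentage of a sequence covered by the aligned span (assumed definition)."""
    return 100.0 * (abs(end - start) + 1) / length

def filter_hits(hits, min_identity=0.0, min_query_cov=0.0, min_ref_cov=0.0):
    """Keep the hits that pass all three percentage thresholds."""
    return [
        hit for hit in hits
        if hit.identity >= min_identity
        and coverage(hit.query_start, hit.query_end, hit.query_length) >= min_query_cov
        and coverage(hit.ref_start, hit.ref_end, hit.ref_length) >= min_ref_cov
    ]
```

Filtering on reference coverage in addition to query coverage is useful when a short query aligns perfectly to a small part of a much longer reference.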
Genotate can annotate transcripts based on the functional domains of their associated proteins, using multiple algorithms. These annotation algorithms can be selected in the functional annotation panel. For each algorithm, a threshold or e-value parameter is available to filter the annotation results.
In detail, functional annotations are computed based on a large set of publicly available computational algorithms and databases. In particular, InterProScan identifies conserved functional domains on a protein and unifies multiple protein family databases and alignment algorithms. InterProScan unifies protein functional domains from different databases such as PFAM, SUPERFAMILY, and PANTHER.
Functional annotations are also computed using multiple other prediction algorithms, such as TMHMM, SIGNALP, and PROP.
A result interface is displayed once the annotations are computed. For each transcript, a panel represents the elements identified on the transcript. Moreover, a result summary panel provides the number of identified ORFs and the number of identified annotations, and allows users to download the associated sequences.
Once a transcript is annotated, the transcript annotation viewer panel provides a graphical representation of the identified ORFs. The transcript sequence is represented in blue at the top of the representation. ORFs identified on the transcript sequence are represented under the '> > >' symbols. ORFs identified on the reverse-complemented transcript sequence are represented under the '< < <' symbols.
For each transcript, either the ncRNA (possibly two, one for each strand) or the identified ORF(s) are represented by an interactive annotation viewer. This panel provides an interactive annotation representation, functional annotation descriptions, and homology annotation descriptions. Furthermore, the viewer makes it possible to search transcript, ORF, and protein sequences in NCBI databases.
Multiple actions are available through the annotation viewer, such as:
- Detach the panel in a new window
- Display or hide on the graph the annotations identified by each algorithm
- Select the start and end positions of the displayed transcript region
- Display the panels containing annotation details either for functional annotations or similarity annotations
- Run an NCBI BLAST search using either the nucleic sequence or, if available, the protein sequence
- Download the sequences of the transcript, coding or non-coding regions, associated protein, and the identified annotations
The search interface allows users to explore the available annotated transcripts based on specific criteria. By default, searches are made on the whole set of annotation result datasets. Specific datasets can be selected to limit the exploration of annotated transcripts. For each identified annotation, a specific annotation can be selected. It is also possible to search for any annotation of an algorithm, with a minimal and a maximal number of annotations.
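The count-based criterion can be read as follows: an ORF or ncRNA matches when the number of annotations produced by a given algorithm falls between the minimal and maximal values. The Python sketch below illustrates this reading; the function name, the default bounds, and the representation of annotations as (algorithm, annotation name) pairs are assumptions made for the example.

```python
def matches_count_filter(annotations, algorithm, minimum=1, maximum=None):
    """Check whether the number of annotations from one algorithm is within bounds.

    annotations: list of (algorithm, annotation_name) pairs for one ORF or ncRNA.
    maximum=None means that no upper bound is applied.
    """
    count = sum(1 for name, _ in annotations if name == algorithm)
    return count >= minimum and (maximum is None or count <= maximum)
```

With `minimum=1` and no maximum, the filter simply selects sequences carrying at least one annotation from the chosen algorithm.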
A summary panel provides the number of ncRNAs and ORFs matching the annotation filters, and allows users to download the sequences and the annotations.
The ncRNAs and ORFs matching the annotation filters are displayed in the result panel, with 20 results per page. They can be ordered by length, start position, or end position.
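The ordering and paging behaviour can be sketched in Python as follows. The function name `result_page`, the dictionary keys `begin` and `end`, and the length formula (inclusive positions) are assumptions made for the example, not the platform's internal representation.

```python
SORT_KEYS = {
    "length": lambda result: result["end"] - result["begin"] + 1,
    "begin": lambda result: result["begin"],
    "end": lambda result: result["end"],
}

def result_page(results, page, sort_by="length", page_size=20):
    """Return one page of results (pages are numbered from 1) after sorting."""
    ordered = sorted(results, key=SORT_KEYS[sort_by])
    first = (page - 1) * page_size
    return ordered[first:first + page_size]
```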
The annotation management interface lists the annotated transcript datasets with their current computation status, annotation parameters, and results, and makes it possible to rename or delete them. For each dataset, the transcript sequences, the ORF sequences, and the annotations can be downloaded.
The transcript dataset details panel provides the dataset information, the ORF identification parameters, the functional annotation algorithms with their threshold or e-value, and the similarity annotation references with their identity and coverage thresholds.
Datasets of nucleic or proteomic sequences can be used as references for annotating submitted transcripts by homology. Admin users can create a homology reference by providing a FASTA file or an FTP link.
The list of all available homology references can be displayed with their current computation status and sequence. Through this interface, it is possible to rename or delete a homology reference. The details of each homology reference can be displayed to provide the release, the species, the sequence type, and the description of the reference.
Transcriptomic and proteomic datasets from the NONCODE, UniRef and Ensembl databases can be easily imported as homology references. For each dataset, a description and an external link to the public database are provided.
Genotate annotation computations can be parallelized to efficiently process large transcript datasets. The pipeline can execute multiple processes, called workers, simultaneously. The number of workers is set to 8 by default and can be changed by users.
For each annotation query, the whole pool of identified ncRNAs and ORFs is split into subsets of 100 sequences. For each subset, sequences are annotated by the different algorithms. Each algorithm annotation is computed by multiple workers. The annotations obtained are merged into a common result file.
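The splitting and merging scheme can be sketched in Python as follows. This is an illustration of the idea, not the pipeline's implementation (which is launched with Java): threads stand in for the worker processes, the results are merged into an in-memory list rather than a result file, and the function names and the per-sequence callables in `algorithms` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_subsets(sequences, subset_size=100):
    """Split the pool of sequences into consecutive subsets of at most subset_size."""
    return [sequences[i:i + subset_size]
            for i in range(0, len(sequences), subset_size)]

def annotate_all(sequences, algorithms, workers=8, subset_size=100):
    """Annotate every sequence with every algorithm and merge the results.

    algorithms: dict mapping an algorithm name to a callable that takes one
                sequence and returns its annotation (hypothetical interface).
    Returns a list of (sequence, algorithm_name, annotation) tuples.
    """
    merged = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for subset in split_into_subsets(sequences, subset_size):
            for name, annotate in algorithms.items():
                # The sequences of the subset are distributed over the workers;
                # pool.map returns the annotations in the input order.
                for sequence, annotation in zip(subset, pool.map(annotate, subset)):
                    merged.append((sequence, name, annotation))
    return merged
```

Working on fixed-size subsets keeps the amount of work in flight bounded, whatever the size of the submitted dataset.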
To annotate transcript datasets, the web platform uses a local annotation pipeline. The annotation pipeline is launched with Java. The web platform allows users to create reference datasets for similarity annotation, and BLAST is required to generate a sequence database for each reference dataset. The web platform requires several folders to store the uploaded transcripts and reference datasets, as well as the files generated by the annotation pipeline, including the annotation result files.
The web platform configuration file 'web/genotateweb.config' is required to use the web platform dependencies interface and contains the paths to the folders and binaries that the platform requires to run properly. The annotation pipeline binaries are automatically downloaded from GitHub into a binaries folder. The folders are automatically created if they do not exist.
The annotation pipeline configuration file 'binaries/genotate.config' is required to use the annotation pipeline dependencies panel. The annotation pipeline requires annotation algorithms and similarity annotation datasets to be installed. Genotate annotation pipeline dependencies can be installed by following the instructions available at https://github.com/tchitchek-lab/genotate.life.
This interface allows users to modify the color associated with each homology and functional annotation in the graphical representations of transcript sequences.
Genotate requires a database management system (DBMS) to store several kinds of information (such as annotation results, associated parameters, algorithm information, homology reference dataset information, and user information). The database configuration interface allows users to initialize and configure the Genotate database. Users can provide the hostname, database name, user name, and password there. The database can be reset using this interface, and it can be created if it does not already exist.
Many public algorithms are used by Genotate to identify functional annotations on the transcript sequences. These algorithms are described in the table below.
Annotated sequences of the genomes, transcriptomes and proteomes of different species can be used as references to annotate transcripts at the homology level.