Genome Biol. 2008;9(12):R175.
doi: 10.1186/gb-2008-9-12-r175. Epub 2008 Dec 16.

Annotating genomes with massive-scale RNA sequencing

France Denoeud et al. Genome Biol. 2008.

Abstract

Next generation technologies enable massive-scale cDNA sequencing (so-called RNA-Seq). Mainly because of the difficulty of aligning short reads on exon-exon junctions, no attempts have been made so far to use RNA-Seq for building gene models de novo, that is, in the absence of a set of known genes and/or splicing events. We present G-Mo.R-Se (Gene Modelling using RNA-Seq), an approach aimed at building gene models directly from RNA-Seq and demonstrate its utility on the grapevine genome.

Figures

Figure 1
G-Mo.R-Se method for building gene models from short reads. The five black boxes show the five steps of the approach. Step 1 (covtig construction) is the construction of covtigs (coverage contigs), which are built from positions where short reads are mapped above a given depth threshold. Step 2 (candidate exons) is the definition of a list of stranded candidate exons derived from each covtig. Splice sites are searched for within 100 nucleotides of each covtig boundary, which allows the orientation of the candidate exons on the forward or the reverse strand, as shown in the second box. Step 3 (junction validation) consists of the validation of junctions between candidate exons using a word dictionary built from the unmapped reads. During step 4 (graph of candidate exons linked by validated junctions), a graph is created where nodes are candidate exons (black boxes) and oriented edges (purple arrows) between two nodes represent validated junctions. The last two connected components show an example of a split gene that can be corrected using open reading frame detection between the last exon of the first model and the first exon of the second model. In the final step, step 5 (model construction and coding sequence detection), we go through the previous graph and extract all possible paths between each source and each sink. Each path then represents a predicted transcript, and a CDS is identified for each transcript. Models M1, M2, M5 and M7 (untranslated regions are in grey, introns in black and coding exons in red) correctly model real transcripts T1, T2, T3 and T5 (untranslated regions are in grey, and introns and exons are indicated by black lines and boxes, respectively). As all possible paths are extracted from the graph, some of them may not correspond to real transcripts (for example, models M3, M4 and M6).
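As a rough illustration of steps 1 and 5 described in the caption above, the following Python sketch builds covtigs from a per-position depth array and enumerates every source-to-sink path in a graph of candidate exons. It is not the G-Mo.R-Se implementation: the function names `build_covtigs` and `enumerate_paths`, the half-open interval convention, the choice of `>=` for the threshold test, and the dictionary representation of the graph are all assumptions made for this example. Splice-site search, junction validation against unmapped reads, and CDS detection are left out.

```python
from typing import Dict, List, Tuple


def build_covtigs(depth: List[int], threshold: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) runs of positions with depth >= threshold.

    Assumption: the threshold is inclusive; the caption only says "above".
    """
    covtigs: List[Tuple[int, int]] = []
    start = None  # start of the run currently being extended, if any
    for pos, d in enumerate(depth):
        if d >= threshold:
            if start is None:
                start = pos
        elif start is not None:
            covtigs.append((start, pos))
            start = None
    if start is not None:  # close a run that reaches the end of the sequence
        covtigs.append((start, len(depth)))
    return covtigs


def enumerate_paths(edges: Dict[str, List[str]]) -> List[List[str]]:
    """List every path from a source (no incoming edge) to a sink (no outgoing edge).

    Assumption: `edges` maps each candidate exon to the exons it is joined to
    by a validated junction, and the graph is acyclic.
    """
    targets = {t for successors in edges.values() for t in successors}
    nodes = set(edges) | targets
    sources = sorted(n for n in nodes if n not in targets)
    paths: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        successors = edges.get(node, [])
        if not successors:  # sink reached: the path is one candidate transcript
            paths.append(path)
            return
        for nxt in successors:
            walk(nxt, path + [nxt])

    for source in sources:
        walk(source, [source])
    return paths


if __name__ == "__main__":
    print(build_covtigs([0, 1, 5, 6, 7, 1, 0, 4, 4], threshold=4))
    print(enumerate_paths({"E1": ["E2", "E3"], "E2": ["E4"], "E3": ["E4"]}))
```

Because every path is reported, the number of models can grow quickly when a locus has many alternative junctions, which is consistent with the caption's remark that some extracted paths may not correspond to real transcripts.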
Figure 2
Read coverage depth for reference genes overlapped by G-Mo.R-Se models and Velvet contigs. The distribution of the average depth (log) on all exonic nucleotides of the genes is plotted for genes overlapped on ≥ 75% of their nucleotides by G-Mo.R-Se models (red line) and Velvet contigs (dashed purple line). The y-axis corresponds to the percentage of reference genes in each bin (bin width is 0.2).
Figure 3
Proportion of unique 32-mers in cDNA clusters. The percentage of unique 32-mers is shown for cDNA clusters overlapped by models on more than 75% of their nucleotides (green) and cDNA clusters not overlapped by models (red). The y-axis corresponds to the percentage of cDNA clusters in each bin (bin width is 10% of unique 32-mers among all 32-mers in the cluster).
Figure 4
Read coverage depth for models overlapping cDNA loci and models not overlapping cDNAs. The distribution of the average depth (log) on all exonic nucleotides of the models is plotted for models overlapping cDNAs on ≥ 50% of their nucleotides (green) and models not overlapping cDNAs (black). The y-axis corresponds to the percentage of models in each bin (bin width is 0.2).
Figure 5
Example of alternatively spliced models built from short reads. The figure shows a capture of a 4 kb genomic region from V. vinifera chromosome 12 between 3,836,500 bp and 3,840,500 bp. The first track (Genoscope annotations) contains the automatic annotation from [44]. The green models are GeneWise alignments of UniProt proteins. Alignments of V. vinifera cDNAs from [44] are in red, and public V. vinifera ESTs are in light green. The next track displays the models predicted by G-Mo.R-Se (untranslated regions in grey, CDS in red). Initial covtigs are displayed as brown boxes (the average depth of each covtig is written below it). Alignments of Velvet contigs are displayed in purple. Ab initio models produced by geneID [51] and SNAP [52] are displayed in blue and pink, respectively. The short-read coverage depth is plotted on the last track (black); the dashed red line shows the threshold used to build covtigs. Model M2 is confirmed by numerous resources, model M3 seems to be a minor alternative splice form (it is supported only by two public ESTs, E1 and E2), and model M1 is a novel alternative splice form.

References

    1. Holt RA, Jones SJ. The new paradigm of flow cell sequencing. Genome Res. 2008;18:839–846. doi: 10.1101/gr.073262.107.
    2. Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008;9:387–402. doi: 10.1146/annurev.genom.9.081307.164359.
    3. Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M, Gerstein MB. Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proc Natl Acad Sci USA. 2007;104:10110–10115. doi: 10.1073/pnas.0703834104.
    4. Chen W, Kalscheuer V, Tzschach A, Menzel C, Ullmann R, Schulz MH, Erdogan F, Li N, Kijas Z, Arkesteijn G, Pajares IL, Goetz-Sothmann M, Heinrich U, Rost I, Dufke A, Grasshoff U, Glaeser B, Vingron M, Ropers HH. Mapping translocation breakpoints by next-generation sequencing. Genome Res. 2008;18:1143–1149. doi: 10.1101/gr.076166.108.
    5. Cokus SJ, Feng S, Zhang X, Chen Z, Merriman B, Haudenschild CD, Pradhan S, Nelson SF, Pellegrini M, Jacobsen SE. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature. 2008;452:215–219. doi: 10.1038/nature06745.
