Annotating genomes with massive-scale RNA sequencing

doi:10.1186/gb-2008-9-12-r175

Genome Biol. 2008;9(12):R175.

doi: 10.1186/gb-2008-9-12-r175. Epub 2008 Dec 16.

France Denoeud¹, Jean-Marc Aury, Corinne Da Silva, Benjamin Noel, Odile Rogier, Massimo Delledonne, Michele Morgante, Giorgio Valle, Patrick Wincker, Claude Scarpelli, Olivier Jaillon, François Artiguenave

PMID: 19087247
PMCID: PMC2646279
DOI: 10.1186/gb-2008-9-12-r175

France Denoeud et al. Genome Biol. 2008.

Affiliation

¹ CEA, DSV, Institut de Génomique, Genoscope, 2 rue Gaston Crémieux, CP5706, 91057 Evry, France. [email protected]

Abstract

Next generation technologies enable massive-scale cDNA sequencing (so-called RNA-Seq). Mainly because of the difficulty of aligning short reads on exon-exon junctions, no attempts have been made so far to use RNA-Seq for building gene models de novo, that is, in the absence of a set of known genes and/or splicing events. We present G-Mo.R-Se (Gene Modelling using RNA-Seq), an approach aimed at building gene models directly from RNA-Seq and demonstrate its utility on the grapevine genome.

Figures

**Figure 1**
*G-Mo.R-Se* method for building gene models from short reads. The five black boxes show the five steps of the approach. Step 1 (covtig construction) is the construction of covtigs (coverage contigs), which are built from positions where short reads are mapped above a given depth threshold. Step 2 (candidate exons) is the definition of a list of stranded candidate exons derived from each covtig. Splice sites are searched for 100 nucleotides around each covtig boundary, which allows the orientation of the candidate exons on the forward or the reverse strand, as shown in the second box. Step 3 (junction validation) consists of the validation of junctions between candidate exons using a word dictionary built from the unmapped reads. During step 4 (graph of candidate exons linked by validated junctions), a graph is created where nodes are candidate exons (black boxes) and oriented edges (purple arrows) between two nodes represent validated junctions. The last two connected components show an example of a split gene that can be corrected using open reading frame detection between the last exon of the first model and the first exon of the second model. In the final step, step 5 (model construction and coding sequence detection), we go through the previous graph and extract all possible paths between each source and each sink. Each path then represents a predicted transcript, and a CDS is identified for each transcript. Models M₁, M₂, M₅ and M₇ (untranslated regions are in grey, introns in black and coding exons in red) correctly model real transcripts T₁, T₂, T₃ and T₅ (untranslated regions are in grey, and introns and exons are indicated by black lines and boxes, respectively). As all possible paths are extracted from the graph, some of them may not correspond to real transcripts (for example, models M₃, M₄ and M₆).
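The steps described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the half-open coordinate convention, the strict `>` comparison for "above a given depth threshold", and the way a junction word is assembled from the two exon ends are all assumptions made for illustration. Step 2 (splice-site search and strand assignment) and CDS detection are omitted.

```python
from collections import defaultdict


def build_covtigs(depth, threshold):
    """Step 1 (sketch): return half-open (start, end) runs where depth > threshold."""
    covtigs = []
    start = None
    for pos, d in enumerate(depth):
        if d > threshold:
            if start is None:
                start = pos          # a new covtig opens here
        elif start is not None:
            covtigs.append((start, pos))
            start = None
    if start is not None:            # covtig running to the end of the sequence
        covtigs.append((start, len(depth)))
    return covtigs


def kmer_dictionary(reads, k):
    """Step 3 (sketch): set of all k-letter words found in the unmapped reads."""
    words = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            words.add(read[i:i + k])
    return words


def junction_is_validated(genome, donor_end, acceptor_start, words, k):
    """Assumed rule: a junction is validated if the word spanning it occurs in the reads.

    donor_end is the exclusive end of the upstream exon, acceptor_start the
    first base of the downstream exon (hypothetical convention).
    """
    half = k // 2
    word = genome[donor_end - half:donor_end] + genome[acceptor_start:acceptor_start + k - half]
    return word in words


def enumerate_paths(edges):
    """Steps 4-5 (sketch): all source-to-sink paths in an acyclic exon graph."""
    graph = defaultdict(list)
    nodes, has_incoming = set(), set()
    for src, dst in edges:
        graph[src].append(dst)
        nodes.update((src, dst))
        has_incoming.add(dst)
    sources = sorted(n for n in nodes if n not in has_incoming)
    paths = []

    def walk(node, path):
        path = path + [node]
        if not graph.get(node):      # sink: no outgoing validated junction
            paths.append(path)
            return
        for nxt in graph[node]:
            walk(nxt, path)

    for source in sources:
        walk(source, [])
    return paths
```

Because every source-to-sink path is emitted, a graph with two alternative middle exons yields two candidate transcripts, which mirrors the caption's remark that some extracted paths may not correspond to real transcripts.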

**Figure 2**
**Read coverage depth for reference genes overlapped by *G-Mo.R-Se* models and Velvet contigs**. The distribution of the average depth (log) on all exonic nucleotides of the genes is plotted for genes overlapped on ≥ 75% of their nucleotides by *G-Mo.R-Se* models (red line) and Velvet contigs (dashed purple line). The y-axis corresponds to the percentage of reference genes in each bin (bin width is 0.2).
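A distribution of this kind could be computed roughly as follows. This is a sketch only: the caption does not state the base of the logarithm, so base 10 is an assumption, and `depth_histogram` is a hypothetical helper name.

```python
import math
from collections import Counter


def depth_histogram(mean_depths, bin_width=0.2):
    """Percentage of genes per log10(average depth) bin, keyed by bin lower edge."""
    counts = Counter()
    for depth in mean_depths:
        if depth <= 0:
            continue                 # log is undefined for uncovered genes; skip them
        counts[math.floor(math.log10(depth) / bin_width)] += 1
    total = sum(counts.values())
    return {
        round(index * bin_width, 10): 100.0 * count / total
        for index, count in sorted(counts.items())
    }
```

For example, `depth_histogram([3, 4, 5, 50])` places the depths 4 and 5 in the same bin (lower edge 0.6), so that bin holds half of the genes.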

**Figure 3**
**Proportion of unique 32-mers in cDNA clusters**. The percentage of unique 32-mers is shown for cDNA clusters overlapped by models on more than 75% of their nucleotides (green) and cDNA clusters not overlapped by models (red). The y-axis corresponds to the percentage of cDNA clusters in each bin (bin width is 10% of unique 32-mers among all 32-mers in the cluster).
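The quantity plotted here can be sketched in Python as below. The caption does not define "unique", so this sketch assumes that a 32-mer is unique when it occurs exactly once in a reference sequence. The reverse strand is ignored for brevity, and both function names are hypothetical.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every k-mer in a reference sequence (forward strand only)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def percent_unique_kmers(cluster_seq, reference_counts, k=32):
    """Percentage of the cluster's k-mers that occur exactly once in the reference."""
    kmers = [cluster_seq[i:i + k] for i in range(len(cluster_seq) - k + 1)]
    if not kmers:
        return 0.0                   # sequence shorter than k
    unique = sum(1 for kmer in kmers if reference_counts.get(kmer, 0) == 1)
    return 100.0 * unique / len(kmers)
```

With a toy reference and `k=3`, a cluster that shares a repeated word with the reference scores below 100%, which is the situation the figure associates with cDNA clusters not overlapped by models.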

**Figure 4**
**Read coverage depth for models overlapping cDNA loci and models not overlapping cDNAs**. The distribution of the average depth (log) on all exonic nucleotides of the models is plotted for models overlapping cDNAs on ≥ 50% of their nucleotides (green) and models not overlapping cDNAs (black). The y-axis corresponds to the percentage of models in each bin (bin width is 0.2).
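The overlap criterion used in these captions (a feature overlapped on at least a given fraction of its nucleotides) can be sketched as follows. Half-open integer coordinates and the helper names are assumptions for illustration, not the authors' code.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_fraction(exons, other):
    """Fraction of exonic nucleotides covered by the other feature set."""
    exons = merge_intervals(exons)
    other = merge_intervals(other)   # merged, so no nucleotide is counted twice
    total = sum(end - start for start, end in exons)
    if total == 0:
        return 0.0
    covered = 0
    for start, end in exons:
        for o_start, o_end in other:
            covered += max(0, min(end, o_end) - max(start, o_start))
    return covered / total
```

A model with exons `[(0, 100), (200, 300)]` and a single cDNA alignment `[(50, 250)]` gives a fraction of 0.5, which would just meet a ≥ 50% threshold.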

**Figure 5**
**Example of alternatively spliced models built from short reads**. The figure shows a capture of a 4 kb genomic region from *V. vinifera* chromosome 12 between 3,836,500 bp and 3,840,500 bp. The first track (Genoscope annotations) contains the automatic annotation from [44]. The green models are GeneWise alignments of Uniprot proteins. Alignments of *V. vinifera* cDNAs from [44] are in red, and public *V. vinifera* ESTs are in light green. The next track displays the models predicted by *G-Mo.R-Se* (untranslated region in grey, CDS in red). Initial covtigs are displayed as brown boxes (the average depth of each covtig is written below it). Alignments of Velvet contigs are displayed in purple. *Ab initio* models produced by geneID [51] and SNAP [52] are displayed in blue and pink, respectively. The short-read coverage depth is plotted on the last track (black): the dashed red line shows the threshold used to build covtigs. Model M₂ is confirmed by numerous resources, model M₃ seems to be a minor alternative splice form (it is only supported by two public ESTs: E₁ and E₂), and model M₁ is a novel alternative splice form.

Cited by

Species and condition specific adaptation of the transcriptional landscapes in Candida albicans and Candida dubliniensis.
Grumaz C, Lorenz S, Stevens P, Lindemann E, Schöck U, Retey J, Rupp S, Sohn K. BMC Genomics. 2013 Apr 2;14:212. doi: 10.1186/1471-2164-14-212. PMID: 23547856.
Cutoffs and k-mers: implications from a transcriptome study in allopolyploid plants.
Gruenheit N, Deusch O, Esser C, Becker M, Voelckel C, Lockhart P. BMC Genomics. 2012 Mar 14;13:92. doi: 10.1186/1471-2164-13-92. PMID: 22417298.
Incorporating RNA-seq data into the zebrafish Ensembl genebuild.
Collins JE, White S, Searle SM, Stemple DL. Genome Res. 2012 Oct;22(10):2067-78. doi: 10.1101/gr.137901.112. Epub 2012 Jul 12. PMID: 22798491.
RNA-Seq Atlas of Glycine max: a guide to the soybean transcriptome.
Severin AJ, Woody JL, Bolon YT, Joseph B, Diers BW, Farmer AD, Muehlbauer GJ, Nelson RT, Grant D, Specht JE, Graham MA, Cannon SB, May GD, Vance CP, Shoemaker RC. BMC Plant Biol. 2010 Aug 5;10:160. doi: 10.1186/1471-2229-10-160. PMID: 20687943.
RNA-seq: from technology to biology.
Marguerat S, Bähler J. Cell Mol Life Sci. 2010 Feb;67(4):569-79. doi: 10.1007/s00018-009-0180-6. Epub 2009 Oct 27. PMID: 19859660. Review.

References

1. Holt RA, Jones SJ. The new paradigm of flow cell sequencing. Genome Res. 2008;18:839–846. doi: 10.1101/gr.073262.107.
2. Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008;9:387–402. doi: 10.1146/annurev.genom.9.081307.164359.
3. Korbel JO, Urban AE, Grubert F, Du J, Royce TE, Starr P, Zhong G, Emanuel BS, Weissman SM, Snyder M, Gerstein MB. Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proc Natl Acad Sci USA. 2007;104:10110–10115. doi: 10.1073/pnas.0703834104.
4. Chen W, Kalscheuer V, Tzschach A, Menzel C, Ullmann R, Schulz MH, Erdogan F, Li N, Kijas Z, Arkesteijn G, Pajares IL, Goetz-Sothmann M, Heinrich U, Rost I, Dufke A, Grasshoff U, Glaeser B, Vingron M, Ropers HH. Mapping translocation breakpoints by next-generation sequencing. Genome Res. 2008;18:1143–1149. doi: 10.1101/gr.076166.108.
5. Cokus SJ, Feng S, Zhang X, Chen Z, Merriman B, Haudenschild CD, Pradhan S, Nelson SF, Pellegrini M, Jacobsen SE. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature. 2008;452:215–219. doi: 10.1038/nature06745.
