Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls

Zook, Justin M; Chapman, Brad; Wang, Jason; Mittelman, David; Hofmann, Oliver; Hide, Winston; Salit, Marc

doi:10.1038/nbt.2835

Analysis
Published: 16 February 2014

Justin M Zook¹ (ORCID: orcid.org/0000-0003-2309-8402),
Brad Chapman²,
Jason Wang³,
David Mittelman³,⁴,
Oliver Hofmann²,
Winston Hide² (ORCID: orcid.org/0000-0002-8621-3271) &
Marc Salit¹

Nature Biotechnology volume 32, pages 246–251 (2014)

Abstract

Clinical adoption of human genome sequencing requires methods that output genotypes with known accuracy at millions or billions of positions across a genome. Because of substantial discordance among calls made by existing sequencing methods and algorithms, there is a need for a highly accurate set of genotypes across a genome that can be used as a benchmark. Here we present methods to make high-confidence, single-nucleotide polymorphism (SNP), indel and homozygous reference genotype calls for NA12878, the pilot genome for the Genome in a Bottle Consortium. We minimize bias toward any method by integrating and arbitrating between 14 data sets from five sequencing technologies, seven read mappers and three variant callers. We identify regions for which no confident genotype call could be made, and classify them into different categories based on reasons for uncertainty. Our genotype calls are publicly available on the Genome Comparison and Analytic Testing website to enable real-time benchmarking of any method.
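
As a rough illustration of how a benchmark of this kind might be used, the following Python sketch scores a test call set against high-confidence genotypes, counting only sites inside regions where a confident call exists. It is not the authors' integration method or the GCAT implementation; the function names (`in_regions`, `benchmark`), the dictionary-based data layout and the toy coordinates are all assumptions made for the example. Because high-confidence regions include homozygous reference calls, a variant reported inside such a region but absent from the benchmark is counted as a false positive.

```python
from bisect import bisect_right


def in_regions(regions, chrom, pos):
    """Return True if pos lies in a half-open [start, end) interval for chrom.

    Assumes the intervals for each chromosome are sorted and non-overlapping,
    and use the same coordinate system as the variant positions.
    """
    intervals = regions.get(chrom, [])
    starts = [start for start, _ in intervals]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]


def benchmark(truth, calls, regions):
    """Compare calls to truth inside confident regions.

    truth and calls map (chrom, pos, ref, alt) -> genotype string.
    A call is a true positive only if both the site and the genotype match,
    so a genotype mismatch counts as one false positive and one false negative.
    """
    truth_in = {k: g for k, g in truth.items() if in_regions(regions, k[0], k[1])}
    calls_in = {k: g for k, g in calls.items() if in_regions(regions, k[0], k[1])}
    tp = sum(1 for k, g in calls_in.items() if truth_in.get(k) == g)
    fp = len(calls_in) - tp
    fn = len(truth_in) - tp
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "sensitivity": tp / len(truth_in) if truth_in else 0.0,
        "precision": tp / len(calls_in) if calls_in else 0.0,
    }


# Toy data (hypothetical, not taken from the NA12878 call set).
regions = {"chr1": [(100, 200), (300, 400)]}
truth = {
    ("chr1", 120, "G", "A"): "0/1",
    ("chr1", 150, "A", "G"): "0/1",
    ("chr1", 350, "C", "T"): "1/1",
}
calls = {
    ("chr1", 150, "A", "G"): "0/1",  # matches the benchmark
    ("chr1", 350, "C", "T"): "0/1",  # right site, wrong genotype
    ("chr1", 180, "T", "C"): "0/1",  # benchmark is homozygous reference here
    ("chr1", 250, "G", "A"): "0/1",  # outside confident regions, not scored
}
result = benchmark(truth, calls, regions)
print(result)
```

Restricting the comparison to confident regions is the design choice that matters here: calls in regions without a confident genotype are left unscored rather than treated as errors.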

**Figure 1: Integration process used to develop high-confidence genotypes.**

**Figure 2: Complex variants have multiple representations.**
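
One common source of multiple representations is that adjacent substitutions can be written either as a single multi-nucleotide change or as separate SNPs, and a deletion inside a repeat can be placed at more than one position. A minimal Python sketch, assuming toy reference strings and a hypothetical helper `apply_variants` (neither is from the paper), checks equivalence by applying each representation to the reference and comparing the resulting sequences.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants to a reference string.

    pos is 1-based, as in VCF. Raises ValueError if a REF allele does not
    match the reference or if two variants overlap.
    """
    pieces = []
    cursor = 0  # 0-based index of the next unconsumed reference base
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError(f"overlapping variant at position {pos}")
        if reference[start:start + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        pieces.append(reference[cursor:start])  # unchanged bases before the variant
        pieces.append(alt)                      # substituted allele
        cursor = start + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


# Two adjacent substitutions: one multi-nucleotide record versus two SNP records.
reference = "ACGTACGT"
as_mnp = [(3, "GT", "AC")]
as_snps = [(3, "G", "A"), (4, "T", "C")]
print(apply_variants(reference, as_mnp) == apply_variants(reference, as_snps))

# A two-base deletion in an AG repeat, written at two different positions.
repeat_reference = "CAGAGT"
left_placed = [(1, "CAG", "C")]
right_placed = [(3, "GAG", "G")]
print(apply_variants(repeat_reference, left_placed)
      == apply_variants(repeat_reference, right_placed))
```

Comparing the sequences that the records produce, rather than the records themselves, avoids counting equivalent representations as discordant calls.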

**Figure 3: GCAT website can be used to generate performance metrics versus our high-confidence genotypes (GIB v2.18 WGS and GIBv2.18).**

Accession codes

Primary accessions

European Nucleotide Archive

Acknowledgements

We thank J. Johnson and A. Varadarajan from the Archon Genomics X PRIZE and EdgeBio for contributing their whole-genome sequencing data from SOLiD and Illumina; Complete Genomics and Life Technologies for providing BAM files for NA12878; and the Broad Institute and 1000 Genomes Project for making BAM and VCF files for NA12878 publicly available. The Illumina exome data on GCAT were given to the Mittelman laboratory by M. Linderman at the Icahn Institute of Genomics and Multiscale Biology of the Icahn School of Medicine at Mount Sinai. We thank the US Food and Drug Administration High Performance Computing staff for their support in running the bioinformatics analyses. Harvard School of Public Health contributions were funded by the Archon Genomics X PRIZE. Certain commercial equipment, instruments or materials are identified in this document. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the products identified are necessarily the best available for the purpose.

Author information

Authors and Affiliations

Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, Maryland, USA
Justin M Zook & Marc Salit
Department of Biostatistics, Bioinformatics Core, Harvard School of Public Health, Cambridge, Massachusetts, USA
Brad Chapman, Oliver Hofmann & Winston Hide
Arpeggi, Inc., Austin, Texas, USA
Jason Wang & David Mittelman
Virginia Bioinformatics Institute and Department of Biological Sciences, Blacksburg, Virginia, USA
David Mittelman

Contributions

J.M.Z., M.S., B.C., O.H. and W.H. conceived the integration methods. J.M.Z. wrote the code for the integration methods and wrote the main manuscript. D.M. and J.W. designed the GCAT platform, implemented comparison to our genotype calls, and generated figures.

Corresponding author

Correspondence to Justin M Zook.

Ethics declarations

Competing interests

D.M. and J.W. are partners and equity holders in Gene by Gene Ltd., which offers clinical and direct-to-consumer genetic testing.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–38, Supplementary Discussion and Supplementary Tables 1–7 (PDF 6582 kb)

About this article

Cite this article

Zook, J., Chapman, B., Wang, J. et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol 32, 246–251 (2014). https://doi.org/10.1038/nbt.2835

Received: 14 December 2013
Accepted: 27 January 2014
Published: 16 February 2014
Issue Date: March 2014
DOI: https://doi.org/10.1038/nbt.2835