  • Analysis

Normalization of RNA-seq data using factor analysis of control genes or samples

Abstract

Normalization of RNA-sequencing (RNA-seq) data has proven essential to ensure accurate inference of expression levels. Here, we show that usual normalization approaches mostly account for sequencing depth and fail to correct for library preparation and other more complex unwanted technical effects. We evaluate the performance of the External RNA Control Consortium (ERCC) spike-in controls and investigate the possibility of using them directly for normalization. We show that the spike-ins are not reliable enough to be used in standard global-scaling or regression-based normalization procedures. We propose a normalization strategy, called remove unwanted variation (RUV), that adjusts for nuisance technical effects by performing factor analysis on suitable sets of control genes (e.g., ERCC spike-ins) or samples (e.g., replicate libraries). Our approach leads to more accurate estimates of expression fold-changes and tests of differential expression compared to state-of-the-art normalization methods. In particular, RUV promises to be valuable for large collaborative projects involving multiple laboratories, technicians, and/or sequencing platforms.
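The control-gene idea in the abstract can be illustrated with a minimal sketch. It assumes a log-linear model in which negative control genes are affected only by unwanted variation, so a singular value decomposition of their centered log counts yields estimates of the nuisance factors, which are then regressed out of every gene. The function name `ruvg_sketch`, the pseudo-count of 1, and the noise-free toy data are illustrative assumptions; this is not the RUVSeq package API or the authors' implementation.

```python
import numpy as np

def ruvg_sketch(counts, control_idx, k=1):
    """Hypothetical sketch of control-gene factor analysis.

    counts      : (samples x genes) array of non-negative counts
    control_idx : indices of genes assumed unaffected by the biology of interest
    k           : number of unwanted factors to estimate
    Returns the estimated factors W (samples x k) and corrected log counts.
    """
    log_y = np.log(counts + 1.0)                       # pseudo-count is an assumption
    z = log_y - log_y.mean(axis=0, keepdims=True)      # center each gene across samples
    # Factor analysis step: SVD restricted to the control genes.
    u, s, _ = np.linalg.svd(z[:, control_idx], full_matrices=False)
    w = u[:, :k] * s[:k]                               # estimated unwanted factors
    # Regress every gene on W and subtract the fitted nuisance component.
    alpha, *_ = np.linalg.lstsq(w, z, rcond=None)
    return w, log_y - w @ alpha

# Toy data: 6 samples x 40 genes; genes 0-19 act as negative controls.
rng = np.random.default_rng(0)
n_genes = 40
batch = np.array([1.0, -1.0, 0.0, 1.0, -1.0, 0.0])    # nuisance factor (e.g. library prep)
group = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])      # factor of interest
base = rng.uniform(3.0, 6.0, size=n_genes)             # baseline log expression
alpha_true = rng.uniform(-0.5, 0.5, size=n_genes)      # per-gene nuisance loadings
beta_true = np.zeros(n_genes)
beta_true[20:] = rng.uniform(-1.0, 1.0, size=20)       # only non-control genes respond
log_mu = base + np.outer(batch, alpha_true) + np.outer(group, beta_true)
counts = np.exp(log_mu) - 1.0                          # noise-free pseudo-counts

w_hat, corrected = ruvg_sketch(counts, control_idx=np.arange(20), k=1)
```

In this toy design the nuisance factor is orthogonal to the factor of interest, so the correction removes the batch term and leaves the group effect intact. With real data the two are generally correlated to some degree, which is why the choice of control genes and of `k` matters.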

Figure 1: Unwanted variation in the SEQC RNA-seq data set.
Figure 2: Unwanted variation in the zebrafish RNA-seq data set.
Figure 3: RUVg normalization using in silico empirical control genes.
Figure 4: Behavior of the ERCC spike-in controls.
Figure 5: Using the ERCC spike-in controls for normalization, zebrafish data set.
Figure 6: Impact of normalization on differential expression analysis.

Accession codes

Primary accessions

Gene Expression Omnibus

Acknowledgements

We thank the SEQC Consortium for granting us early access to the SEQC pilot data, L. Jacob for his help with the RUV methodology and its software implementation, and J. Gagnon-Bartsch, J. Choi, and W. Shi for helpful discussions. J.N. was supported by a grant from the National Institute on Deafness and Other Communication Disorders. T.P.S. was supported by a National Health and Medical Research Council (NHMRC) Australia Fellowship.

Author information

Contributions

D.R., S.D. and T.P.S. developed the statistical methods; D.R. and S.D. analyzed the data; J.N. designed the zebrafish experiment; D.R. and S.D. wrote the manuscript; all authors read and approved the manuscript.

Corresponding authors

Correspondence to Davide Risso or Sandrine Dudoit.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–20 and Supplementary Table 1 (PDF 3382 kb)

Supplementary Software

RUVSeq_0.1.1.tar.gz (ZIP 135 kb)

About this article

Cite this article

Risso, D., Ngai, J., Speed, T. et al. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol 32, 896–902 (2014). https://doi.org/10.1038/nbt.2931

  • DOI: https://doi.org/10.1038/nbt.2931
