Abstract
Single-cell RNA-sequencing (scRNA-seq) allows observation of different cells at multi-tiered complexity in the same microenvironment. To gain insight into cell identity from scRNA-seq data, we present Cepo, which generates cell-type-specific gene statistics of differentially stable genes to define cell identity. When applied to multiple datasets, Cepo outperforms current methods in assigning cell identity and enhances several cell identification applications such as cell-type characterisation, spatial mapping of single cells and lineage inference of single cells.
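As a rough illustration of what a cell-type-specific "differential stability" statistic could look like, the sketch below scores each gene within a cell type and then contrasts that score against the other cell types. It is not the Cepo implementation. The stability definition (few zeros and a low coefficient of variation, combined by rank) and the function names `rank_fraction`, `stability` and `differential_stability` are assumptions made for this example only.

```python
import numpy as np


def rank_fraction(values):
    """Rank values in ascending order and scale the ranks to (0, 1].

    Ties are broken by position (stable sort); this is a simplification.
    """
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks / len(values)


def stability(expr):
    """Per-gene stability within one cell type (genes x cells matrix).

    Assumption for this sketch: a gene is 'stable' when it has a low
    proportion of zeros and a low coefficient of variation (CV).
    """
    zero_prop = (expr == 0).mean(axis=1)
    mean = expr.mean(axis=1)
    sd = expr.std(axis=1)
    # Genes with zero mean get an infinite CV, i.e. the worst rank.
    safe_mean = np.where(mean > 0, mean, 1.0)
    cv = np.where(mean > 0, sd / safe_mean, np.inf)
    # Low zero proportion and low CV both give a high stability score.
    return 1.0 - (rank_fraction(zero_prop) + rank_fraction(cv)) / 2.0


def differential_stability(expr, labels):
    """Stability in each cell type minus the mean stability elsewhere.

    Returns a dict mapping each cell-type label to a per-gene score.
    Requires at least two distinct labels.
    """
    labels = np.asarray(labels)
    types = sorted(set(labels.tolist()))
    stab = {t: stability(expr[:, labels == t]) for t in types}
    scores = {}
    for t in types:
        others = np.mean([stab[o] for o in types if o != t], axis=0)
        scores[t] = stab[t] - others
    return scores
```

Under these assumptions, genes with the highest score for a cell type are those expressed consistently in that type and inconsistently elsewhere, which is one way to read "differentially stable". The published package should be consulted for the actual statistic.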
Data availability
All the datasets used in this study are publicly available. The Molecular Signatures Database gene sets were downloaded from http://www.gsea-msigdb.org/gsea/msigdb/. The Tabula Muris data collection was downloaded from https://tabula-muris.ds.czbiohub.org/. The CellBench data collection was downloaded from https://github.com/LuyiTian/sc_mixology/. The Embryogenesis atlas data, which profile 48 h of mouse embryonic development, were downloaded from https://github.com/MarioniLab/EmbryoTimecourse2018. The parsed Gastrulation data, sequenced using scNMT-seq, were downloaded from the link provided in https://github.com/rargelaguet/scnmt_gastrulation. The processed Gastrulation data were downloaded from http://www.human-gastrula.net. The hematopoietic stem cell differentiation data were downloaded from https://cytotrace.stanford.edu/. The Fetal tissue atlas data were downloaded from NCBI Gene Expression Omnibus under accession number GSE156793. The spatial embryo data were downloaded from NCBI Gene Expression Omnibus under accession number GSE120963.
Code availability
The Cepo R package, the source code to generate the figures, and a detailed vignette covering applications such as its use with scRNA-seq data normalisation, batch correction and integration pipelines are available from https://github.com/PYangLab/Cepo (ref. 50).
References
Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. 34, 1145–1160 (2016).
Kotliar, D. et al. Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq. eLife https://doi.org/10.7554/eLife.43803 (2019).
Morris, S. A. The evolving concept of cell identity in the single cell era. Development 146, dev169748 (2019).
Wang, T., Li, B., Nelson, C. E. & Nabavi, S. Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data. BMC Bioinformatics 20, 40 (2019).
Soneson, C. & Robinson, M. D. Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods 15, 255–261 (2018).
Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Finak, G. et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16, 278 (2015).
Korthauer, K. D. et al. A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biol. 17, 222 (2016).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Tian, L. et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat. Methods 16, 479–487 (2019).
Segal, E., Friedman, N., Koller, D. & Regev, A. A module map showing conditional activity of expression modules in cancer. Nat. Genet. 36, 1090–1098 (2004).
Cao, J. et al. A human cell atlas of fetal gene expression. Science 370, aba7721 (2020).
Pijuan-Sala, B. et al. A single-cell molecular map of mouse gastrulation and early organogenesis. Nature 566, 490–495 (2019).
Argelaguet, R. et al. Multi-omics profiling of mouse gastrulation at single-cell resolution. Nature 576, 487–491 (2019).
Tyser, R. C. V. et al. Single-cell transcriptomic characterization of a gastrulating human embryo. Nature https://doi.org/10.1038/s41586-021-04158-y (2021).
Peng, G. et al. Molecular architecture of lineage allocation and tissue organization in early mouse embryo. Nature 572, 528–532 (2019).
Akashi, K., Traver, D., Miyamoto, T. & Weissman, I. L. A clonogenic common myeloid progenitor that gives rise to all myeloid lineages. Nature 404, 193–197 (2000).
Weinreb, C., Rodriguez-Fraticelli, A., Camargo, F. D. & Klein, A. M. Lineage tracing on transcriptional landscapes links state to fate during differentiation. Science 367, aaw3381 (2020).
Olsson, A. et al. Single-cell analysis of mixed-lineage states leading to a binary cell fate choice. Nature 537, 698–702 (2016).
Schaum, N. et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).
Lun, A. T. L., Bach, K. & Marioni, J. C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).
Clark, S. J. et al. ScNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat. Commun. 9, 781 (2018).
Picelli, S. et al. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat. Methods 10, 1096–1100 (2013).
Cao, J. et al. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357, aam8940 (2017).
Peng, G. et al. Spatial transcriptome for the molecular annotation of lineage fates and cell identity in mid-gastrula mouse embryo. Dev. Cell 36, 681–697 (2016).
McCarthy, D. J., Campbell, K. R., Lun, A. T. L. & Wills, Q. F. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).
Lin, Y. et al. Evaluating stably expressed genes in single cells. GigaScience 8, giz106 (2019).
Massey, F. J. The Kolmogorov–Smirnov test for goodness of fit. J. Am. Stat. Assoc. 46, 68–78 (1951).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).
Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).
Kuhn, M. & Vaughan, D. Yardstick: Tidy Characterizations of Model Performance (Yardstick, 2020).
Pagès, H. HDF5Array: HDF5 Backend for DelayedArray Objects. R package version 1.22.1, https://bioconductor.org/packages/HDF5Array (2020).
Su, S. et al. CellBench: R/Bioconductor software for comparing single-cell RNA-seq analysis methods. Bioinformatics 36, 2288–2290 (2020).
Van der Laan, M. J. & Pollard, K. S. A new algorithm for hybrid hierarchical clustering with visualization and the bootstrap. J. Stat. Plann. Inference 117, 275–303 (2003).
Kim, T. et al. Impact of similarity metrics on single-cell RNA-seq data clustering. Brief. Bioinform. 20, 2316–2326 (2019).
Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Research https://doi.org/10.12688/f1000research.9501.2 (2016).
Kolde, R. pheatmap: Pretty Heatmaps. R package version 1.0.8 (2015).
Gómez-Rubio, V. ggplot2—elegant graphics for data analysis (2nd edition). J. Stat. Softw. https://doi.org/10.18637/jss.v077.b02 (2017).
Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37, 547–554 (2019).
Street, K. et al. Slingshot: Cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19, 477 (2018).
Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
duVerle, D. A., Yotsukura, S., Nomura, S., Aburatani, H. & Tsuda, K. CellTree: an R/bioconductor package to infer the hierarchical structure of cell populations from single-cell RNA-seq data. BMC Bioinformatics 17, 363 (2016).
Taddy, M. A. On estimation and selection for topic models. In Proc. 15th International Conference on Artificial Intelligence and Statistics (AISTATS) (AISTATS, 2012).
Sergushichev, A. A. An algorithm for fast preranked gene set enrichment analysis using cumulative statistic calculation. Preprint at https://www.biorxiv.org/content/10.1101/060012v1 (2016).
Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).
Yu, G., Wang, L., Han, Y. & He, Q. Y. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS J. Integr. Biol. 16, 284–287 (2012).
Avila Cobos, F., Alquicira-Hernandez, J., Powell, J. E., Mestdagh, P. & de Preter, K. Benchmarking of cell type deconvolution pipelines for transcriptomics data. Nat. Commun. 11, 5650 (2020).
Kim, H., Yang, P. & Wang, K. PYangLab/Cepo: Release of Cepo (Zenodo, 2021); https://doi.org/10.5281/ZENODO.5652243
Acknowledgements
We thank all of our colleagues—particularly at the School of Mathematics and Statistics, The University of Sydney and Sydney Precision Bioinformatics Alliance—for their support and intellectual engagement. This work is supported by an Australian Research Council (ARC)/Discovery Early Career Researcher Award (DE170100759) and a National Health and Medical Research Council (NHMRC) Investigator Grant (1173469) to P.Y., an Australian Research Council Discovery Project grant (DP170100654) to P.Y. and J.Y.H.Y., and an Australian Research Council (ARC) Postgraduate Research Scholarship and Children’s Medical Research Institute Postgraduate Scholarship to H.J.K.
Author information
Contributions
P.Y. and H.J.K. conceived the study with input from J.Y.H.Y. and D.M.L. H.J.K. and K.W. developed the method and software with input from P.Y. H.J.K., P.Y. and K.W. performed data analyses with input from C.C. and Y.L. H.J.K., P.Y., K.W. and J.Y.H.Y. interpreted the results with input from P.P.L.T. H.J.K., P.Y. and K.W. wrote the manuscript with input from J.Y.H.Y. All of the authors read and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review information
Nature Computational Science thanks the anonymous reviewers for their contribution to the peer review of this work. Handling editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–21.
Source data
Source Data Fig. 1
Statistical Source Data for Fig. 1
Source Data Fig. 2
Statistical Source Data for Fig. 2
About this article
Cite this article
Kim, H.J., Wang, K., Chen, C. et al. Uncovering cell identity through differential stability with Cepo. Nat Comput Sci 1, 784–790 (2021). https://doi.org/10.1038/s43588-021-00172-2