Abstract
Clustering and cell type classification are important steps in single-cell RNA-seq (scRNA-seq) analysis. As more scRNA-seq data become available, supervised cell type classification methods that utilize external, well-annotated source data have started to gain popularity over unsupervised clustering algorithms. However, the performance of existing supervised methods is highly dependent on source data quality, and they often have limited accuracy in classifying cell types that are missing from the source data. To overcome these limitations, we developed ItClust, a transfer learning algorithm that borrows ideas from supervised cell type classification algorithms but also leverages information in the target data to ensure sensitivity in classifying cells that are only present in the target data. Through extensive evaluations using data from different species and tissues generated with diverse scRNA-seq protocols, we show that ItClust considerably improves clustering and cell type classification accuracy over popular unsupervised clustering and supervised cell type classification algorithms.
Data availability
We analysed multiple scRNA-seq datasets. Publicly available data were acquired using the accession numbers provided by the original publications: Baron et al.19 (GSE84133), Xin et al.26 (GSE81608), Grün et al.17 (GSE81076), Muraro et al.18 (GSE85241), Lawlor et al.15 (GSE86469), Segerstolpe et al.16 (E-MTAB-5061), Park et al.24 (GSE107585), Peng et al.27 (GSE118480), Paul et al.33 (GSE727857) and Tusi et al.34 (GSE89754). Details of the datasets analysed in this paper are described in Supplementary Table 1.
Code availability
An open-source implementation of the ItClust algorithm can be downloaded from https://github.com/jianhuupenn/ItClust and is archived at https://doi.org/10.5281/zenodo.3976061 (ref. 35).
References
Sandberg, R. Entering the era of single-cell transcriptomics in biology and medicine. Nat. Methods 11, 22–24 (2014).
Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).
Wang, B., Zhu, J., Pierson, E., Ramazzotti, D. & Batzoglou, S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat. Methods 14, 414 (2017).
Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).
Huang, M. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat. Methods 15, 539–542 (2018).
Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 1–14 (2019).
Kiselev, V. Y., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods 15, 359 (2018).
Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Preprint at https://doi.org/10.1101/532895 (2020).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Wagner, F. & Yanai, I. Moana: a robust and scalable cell type classification framework for single-cell RNA-Seq data. Preprint at https://doi.org/10.1101/456129 (2018).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, e1821 (2019).
Donahue, J. et al. DeCAF: a deep convolutional activation feature for generic visual recognition. In Proc. 31st International Conference on Machine Learning Vol. 32, 647–655 (JMLR, 2014).
Li, X. et al. Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nat. Commun. 11, 2338 (2020).
Wang, J. et al. Data denoising with transfer learning in single-cell transcriptomics. Nat. Methods 16, 875–878 (2019).
Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res. 27, 208–222 (2017).
Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metabol. 24, 593–607 (2016).
Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).
Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, e383 (2016).
Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, e344 (2016).
Franzén, O., Gan, L.-M. & Björkegren, J. L. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019, baz046 (2019).
Zhang, X. et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucl. Acids Res. 47, D721–D728 (2019).
Abdelaal, T. et al. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol. 20, 194 (2019).
Grapin-Botton, A. Ductal cells of the pancreas. Int. J. Biochem. Cell Biol. 37, 504–510 (2005).
Park, J. et al. Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science 360, 758–763 (2018).
Young, M. D. et al. Single-cell transcriptomes from human kidneys reveal the cellular identity of renal tumors. Science 361, 594–599 (2018).
Xin, Y. et al. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metabol. 24, 608–615 (2016).
Peng, Y.-R. et al. Molecular classification and comparative taxonomics of foveal and peripheral cells in primate retina. Cell 176, e1222 (2019).
Mohammadi, S., Davila-Velderrain, J. & Kellis, M. Multi-resolution single-cell state characterization via joint archetypal/network analysis. Preprint at https://doi.org/10.1101/746339 (2019).
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y. & Manzagol, P.-A. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010).
Xie, J., Girshick, R. & Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proc. 33rd International Conference on Machine Learning Vol. 48, 478–487 (PMLR, 2016).
Maaten, L. V. D. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).
Tusi, B. K. et al. Population snapshots predict early haematopoietic and erythroid hierarchies. Nature 555, 54–60 (2018).
Hu, J. jianhuupenn/ItClust: ItClust V1.0.0 (Zenodo, 2020); https://doi.org/10.5281/zenodo.3976061.
Acknowledgements
This work was supported by the following grants: NIH R01GM108600, R01GM125301, R01HL113147, R01HL150359, R01EY030192 and R01EY031209 (to M.L.), and R01DK076077 (to K.S.).
Author information
Contributions
This study was conceived and led by M.L. J.H., X.L., G.H. and M.L. designed the model and algorithm. J.H. implemented the ItClust software and led the data analysis with input from M.L., X.L., G.H., Y.L. and K.S. J.H. and M.L. wrote the paper with feedback from X.L., G.H., Y.L. and K.S.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 scVI’s latent space of Lawlor et al. data.
UMAP plot of scVI’s latent space when Baron human data were used as source data and Lawlor et al. data were used as target data. The plot indicates that scVI failed to remove batch effect between the source and target data, which led to low cell type annotation accuracy in the target data.
Extended Data Fig. 2 Dot plots for human kidney data.
Dot plots of known marker genes used for cell type identification in the human kidney data (data we generated, together with data from Young et al.). The marker genes used to label the cell types are: SLC13A3 and SLC34A1 for PT (Proximal Tubule); CLDN16 and SLC12A for Loop of Henle; PTPRB and KDR for Endo_AVR_1 (Endothelial Ascending Vasa Recta); PTPRB and SLC14A1 for Endo_AVR_2; PTPRB, KDR, and SLC14A1 for Endo_DVR (Endothelial Descending Vasa Recta); KCNJ1 and SLC8A1 for Distal Tubules; SLC4A1 and CLCNKB for CD_IC_A; SLC26A4 and CLCNKB for CD_IC_B; GZMA and GZMB for NK_cells; CD3D, CD3E, and CD3G for T_cells; CD14, S100A8, and S100A9 for Macrophage_1; CD14 and FCER1A for Macrophage_2; CD79A and CD79B for B_cells.
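The marker list in this legend can be written down as a lookup table. The sketch below is an illustration only, not the authors' labelling procedure: the function names, the input format (a mapping from gene symbol to mean expression in one cluster) and the cutoff of 1.0 are all assumptions made for the example. Gene symbols are copied from the legend as printed.

```python
# Marker genes per cell type, copied from the Extended Data Fig. 2 legend.
# "SLC12A" is kept exactly as it appears in the legend.
MARKERS = {
    "PT": ["SLC13A3", "SLC34A1"],
    "Loop of Henle": ["CLDN16", "SLC12A"],
    "Endo_AVR_1": ["PTPRB", "KDR"],
    "Endo_AVR_2": ["PTPRB", "SLC14A1"],
    "Endo_DVR": ["PTPRB", "KDR", "SLC14A1"],
    "Distal Tubules": ["KCNJ1", "SLC8A1"],
    "CD_IC_A": ["SLC4A1", "CLCNKB"],
    "CD_IC_B": ["SLC26A4", "CLCNKB"],
    "NK_cells": ["GZMA", "GZMB"],
    "T_cells": ["CD3D", "CD3E", "CD3G"],
    "Macrophage_1": ["CD14", "S100A8", "S100A9"],
    "Macrophage_2": ["CD14", "FCER1A"],
    "B_cells": ["CD79A", "CD79B"],
}


def marker_scores(cluster_means, threshold=1.0):
    """Return, per cell type, the fraction of its markers whose mean
    expression in the cluster is at or above `threshold`.

    `cluster_means` maps gene symbol -> mean expression (hypothetical input);
    `threshold` is an arbitrary illustrative cutoff, not a value from the paper.
    """
    return {
        cell_type: sum(cluster_means.get(g, 0.0) >= threshold for g in genes) / len(genes)
        for cell_type, genes in MARKERS.items()
    }


def candidate_labels(cluster_means, threshold=1.0):
    """Cell types for which every listed marker passes the cutoff."""
    scores = marker_scores(cluster_means, threshold)
    return sorted(ct for ct, score in scores.items() if score == 1.0)
```

Because the endothelial marker sets overlap (Endo_DVR lists every marker of Endo_AVR_1 and Endo_AVR_2), a cluster expressing PTPRB, KDR and SLC14A1 returns all three labels as candidates. Telling them apart requires looking at relative expression in the dot plot, which this sketch does not attempt.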
Extended Data Fig. 3 Computing cost of ItClust.
Memory usage and CPU time for the kidney data analysis.
Extended Data Fig. 4 Analyzing Tusi et al. data using ItClust.
UMAP (a) and Sankey (b) plots of Tusi et al. data based on ItClust embedding and predicted cell types.
Extended Data Fig. 5 Classification accuracies for combined source data and read depth down-sampling experiments.
(a) The classification accuracies of ItClust, Seurat 3.0, Moana, scmap, and scVI for the Segerstolpe human pancreatic islet data, using different source datasets as input. Source data 1 is the reduced Baron human pancreatic islet data as in Fig. 5(b) and source data 2 is the Xin human pancreatic islet data, which only include alpha, beta, gamma, and delta cells. (b) The classification accuracies of ItClust before and after fine-tuning, Seurat 3.0, Moana, scmap, and scVI for the macaque retina data across different down-sampling efficiencies. Cells from macaques 1, 2, and 3 were used as the source data, and cells from macaque 4 were used as the target data.
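The two quantities in this legend, read depth down-sampling and classification accuracy, can be sketched in a few lines. The legend does not say how the reads were down-sampled, so binomial thinning (keeping each read independently with probability equal to the efficiency) is an assumption here, as are the function names. Accuracy is taken to be the fraction of target cells whose predicted label matches the reference label.

```python
import random


def downsample_counts(counts, efficiency, seed=0):
    """Binomial thinning of a count matrix (assumed scheme, see lead-in).

    `counts` is a list of cells, each a list of non-negative integer counts.
    Each read is kept independently with probability `efficiency`.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    rng = random.Random(seed)
    return [
        [sum(rng.random() < efficiency for _ in range(c)) for c in cell]
        for cell in counts
    ]


def classification_accuracy(true_labels, predicted_labels):
    """Fraction of cells whose predicted label equals the reference label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one cell is required")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)
```

In an experiment of the kind described in panel (b), one would thin the target counts at several efficiencies, rerun each classifier on the thinned data, and report `classification_accuracy` at each efficiency.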
Extended Data Fig. 6 Sankey plots for Segerstolpe et al. data analysis.
The Sankey plots of ItClust, Seurat 3.0, Moana, scmap, and scVI cell type classification results for the Segerstolpe et al. dataset using the combined source data.
Supplementary information
Supplementary Information
Supplementary Tables 1–5 and Notes 1–4.
About this article
Cite this article
Hu, J., Li, X., Hu, G. et al. Iterative transfer learning with neural network for clustering and cell type classification in single-cell RNA-seq analysis. Nat Mach Intell 2, 607–618 (2020). https://doi.org/10.1038/s42256-020-00233-7
DOI: https://doi.org/10.1038/s42256-020-00233-7