Main

Single-cell omics provides a detailed view of gene expression patterns in individual cells. This level of granularity is used to decode complex biological processes and disease states. It facilitates the identification of distinct types and states of cells1, the inference of cellular interactions within a tissue or organism2, and the characterization of temporal dynamics during processes such as development3, disease progression4 or treatment-induced response5.

To organize the vast, noisy and high-dimensional raw data from single-cell omics experiments into interpretable patterns, individual cells can be labeled or annotated in terms of their type, state, location or phenotypic association. The resulting annotations serve as a linchpin in unraveling the intricate details of cellular identities, function and responses to various stimuli.

Cell annotations can be assigned directly on the basis of experimental conditions (for example, samples originating from patients versus healthy controls) or inferred by computational analysis of cellular gene expression and additional measured attributes6. While continuous, hierarchical7 or otherwise structured annotations exist in some cases8,9,10,11, the computational annotation process can be inherently ambiguous, compounded by the assignment of discrete labels to heterogeneous cell populations based on noisy, sparse and high-dimensional data. As a result, annotations must often be refined manually on the basis of prior information relevant to a particular system, followed by arduous verification based on independent statistical analyses, expert knowledge or additional experimental measurements6. Even then, due to their ambiguous nature, a substantial subset of the assigned annotations may be fully or partially incongruent with the cells they aim to describe12.

Here, we show that the level of congruence between cells and their original annotations carries critical information for downstream interpretation of single-cell data. To evaluate the match between a cell population and its given annotations, we exploit the dynamics of training deep neural networks (DNNs) on such data—an information source that is usually discarded. Specifically, we do not use DNNs for direct predictions of cell types, states or other types of annotation, as has been done in many recent works13,14. Instead, we focus on the time it takes and the stability with which a DNN learns to predict the original input annotation for each cell in the data.

DNNs were shown, in many cases, to have the capacity to memorize practically any training dataset, including noisy data points and inaccurate labels15. Although DNNs that are expressive enough will eventually learn all input labels or annotations provided to them, empirical evidence suggests a progressive learning pattern across epochs. DNNs tend to first learn decision boundaries that capture the general structure underlying the annotated data. This implies that the DNN will first learn data points that were correctly annotated and associated with low noise and then progress to those characterized by high noise and/or incorrect annotations16,17,18. Indeed, in the context of image classification, the learning time of an input data point has been successfully used to determine its correct labeling17,19. In the context of natural language processing, ‘data maps’ were introduced to analyze a model’s behavior during training, an approach that was shown to be useful for predicting the reliability of the input label for each data point18.

Here, we leverage the structure of annotated data revealed by neural network training dynamics to identify meaningful patterns in single-cell omics. We present Annotatability, a method for quantifying the congruence between a cell and its input annotation by annotation-trainability analysis. Annotatability is equipped with several modules that address diverse challenges of single-cell omics analysis, including auditing and rectifying erroneous annotations, identifying ambiguous or intermediate cell states, identifying cellular communities with shared gene expression profiles and annotation trainability via graph embedding, and detecting annotation-associated genes (Fig. 1). We demonstrate the utility of Annotatability in different real-world scenarios, including the identification of false cell type annotations and intermediate cell types in human peripheral blood mononuclear cells (PBMCs) (Fig. 2); analysis of different annotations in the mouse small intestine (Supplementary Section A.1 and Extended Data Fig. 1); correction of false cell type annotations in spatial transcriptomics of the MERFISH (multiplexed error-robust fluorescence in situ hybridization) mouse hypothalamic preoptic region (Fig. 3); embedding of cells along the epithelial-to-mesenchymal transition (EMT) (Fig. 4); and embedding of pancreatic β cells according to disease-related cell states, with applications for screening of disease-associated genes, evaluation of treatment effectiveness and sensitive detection of rare subpopulations of healthy-like cells (Fig. 5).

Fig. 1: Annotatability schematic workflow.
figure 1

a, Given an annotated dataset consisting of observations (for example, single-cell gene expression profiles) and corresponding annotations per cell (for example, cell types) (step 1), Annotatability trains a DNN to predict the input annotations (step 2), analyzes its predictions along the training procedure (step 3) and computes confidence and variability scores per cell (step 4). b,c, Each cell is subsequently classified as easy to learn (high confidence/low variability), hard to learn (low confidence/low variability) or ambiguous (mid-confidence/high variability) (b), corresponding to correctly annotated, erroneously annotated or ambiguously annotated cells, respectively (c). d, Annotatability can potentially amend the annotations of cells identified as erroneously annotated using its reannotation module. e,f, Annotatability includes a trainability-aware graph-embedding module, which incorporates similarity between cells in both training dynamics statistics and gene expression (e), enabling signal-specific downstream analyses (f). g, Annotatability is also equipped with a training-dynamics-based score that captures either positive or negative association of genes relative to a given annotation, corresponding to a given biological signal.

Fig. 2: Identification of erroneous and ambiguous annotations in human PBMC scRNA-seq data.
figure 2

a,b, Two-dimensional Uniform Manifold Approximation and Projection (UMAP) of cells colored by either the input cell type annotations (a) or the confidence of the cell type annotation (b) inferred by Annotatability25. Megakaryocytes were filtered out due to low cell counts (88/11,990). c, Two-dimensional confidence–variability map of annotated cells. Subpopulations of cells identified as correctly, ambiguously or erroneously annotated are colored in orange, blue and green, respectively. One-dimensional histograms of each corresponding category appear across the confidence and variability axes. d, A dot plot of the expression of cell type marker genes in each cluster of the annotated cell types. e,f, Dot plots of the expression of cell type marker genes of the annotated cell types in either CD14+ monocytes (e, n = 2,227) or B cells (f, n = 1,621) identified as correctly annotated (top row) or erroneously annotated (bottom row). g, A heatmap of the expression of marker genes of NK cell subtypes (CD56bright and CD56dim) in NK cells (n = 457) identified as ambiguously annotated (top row) or correctly annotated (bottom row). h, A heatmap of the mean expression of marker genes of monocyte subtypes (classical, intermediate and nonclassical) in monocytes (n = 2,578), corresponding to cell groups with different initial annotations and trainability characterization. i, A plot of the mean inferred probability of initially annotated CD14+ monocytes as a function of training epoch, for cells identified as correctly annotated (blue), ambiguously annotated (orange) or erroneously annotated (red) CD14+ cells, and for cells ambiguously annotated as FCGR3A+ (green).

Source data

Fig. 3: Reannotation of erroneous annotations in spatial transcriptomics MERFISH dataset of mouse hypothalamic preoptic region (n = 64,373).
figure 3

a, Two-dimensional UMAP of cells colored by cell type annotations29. b,c, A spatial scatter plot of MERFISH data colored by either cell type annotations (b) or cell type annotation confidence inferred by Annotatability (c). d, Two-dimensional confidence–variability map of the annotated cells. e, AUROC (area under the receiver operating characteristic curve) versus percentage of mislabeled cells, for four different methods: Annotatability (green), or scReClassify with either SVM classifier (red), KNN classifier (brown) or RF classifier (blue). The dots represent the mean value, and the vertical lines mark the standard deviation. f–i, Heatmaps of the mean expression of marker genes of the annotated cell types, in inhibitory neurons (f, n = 24,761), excitatory neurons (g, n = 11,757), astrocytes (h, n = 8,393) or endothelial cells (i, n = 5,749) identified as correctly annotated (top row) or erroneously annotated (bottom row). The cell type marker genes29 are: Gad1 (inhibitory neurons), Slc17a6 (excitatory neurons), Myh11 (pericytes), Fn1 (endothelial), Cd24a (ependymal), Selplg (microglia), Aqp4 (astrocytes), Pdgfra (immature oligodendrocytes), Ttyh2 and Mbp (mature oligodendrocytes). j,k, Heatmaps of the mean expression of marker genes of cells that were initially annotated as either inhibitory neurons (j) or excitatory neurons (k) and identified by Annotatability as either correctly annotated (first row) or erroneously annotated (second row), as well as their expression after reannotation by Annotatability (next rows).

Source data

Fig. 4: Capturing the epithelial to mesenchymal transition via trainability-aware graph analysis, based on EMT scRNA-seq data (n = 25,806).
figure 4

a,b, Two-dimensional Uniform Manifold Approximation and Projection (UMAP) of cells colored by either cell line and treatment (a) or cell phenotype (b)36. c, Two-dimensional confidence–variability map of the annotated cells, colored by treatment. d, Two-dimensional UMAP of cells colored by inferred confidence. e,f, The mean expression of epithelial (E, top row) and mesenchymal (M, bottom row) marker genes as a function of the inferred confidence rank in epithelial (e) and mesenchymal (f) annotations. g–i, Two-dimensional UMAPs for (left to right columns) the baseline expression graph of mesenchymal-annotated cells (from the MCF10A cell line), the trainability-aware graph of mesenchymal-annotated cells, the baseline expression graph of epithelial-annotated cells, and the trainability-aware graph of epithelial-annotated cells. The graphs are colored by (top to bottom rows) Louvain clustering (g), treatment (h) and inferred confidence by Annotatability (i). j, Heatmaps of scaled and centered expression of epithelial and mesenchymal marker genes in cells belonging to the two clusters as in g. The columns, from left to right, correspond to the graphs as in g–i.

Source data

Fig. 5: Inferring disease progression, state heterogeneity and treatment responses, based on pancreatic islets scRNA-seq data (n = 32,888).
figure 5

a, A two-dimensional map of the ranks of the confidence and variability scores, computed across all mSTZ cells (regardless of follow-up treatment) with respect to a disease annotation, colored by Ins1 expression38. b, Two-dimensional Uniform Manifold Approximation and Projection (UMAP) showing the trainability-aware graph colored by Ins1 expression; inset: the same UMAP colored by the Louvain clustering of the graph. h, healthy-like; i, intermediate; d, disease. c, A heatmap showing the mean scaled expression of nine marker genes for β cell activity and maturation and β cell dedifferentiation in the Louvain clusters obtained from the trainability-aware graph (disease, intermediate and healthy-like clusters), and in the healthy control. d, A bar plot showing the mean scaled gene expression for all β cell activity/maturation markers (blue) and dedifferentiation markers (orange) in the Louvain clusters and the healthy control. e, Two-dimensional UMAP of the trainability-aware graph colored by the ranks of the confidence score, computed across all mSTZ cells. Inset: the same UMAP colored by the raw confidence scores. f, The same heatmap as in c for five clusters; the number of clusters was increased by tuning the Louvain clustering hyperparameters (Supplementary Fig. 7h–m). g, The same as in a, split by the different treatments. h, A normalized stacked bar plot showing the share of cells from each Louvain cluster for each treatment. i, A two-dimensional confidence–variability map of all mSTZ cells either treated with insulin (alone or with other compounds; red) or not treated with insulin (gray). The rectangle indicates the 43 hard-to-learn cells, of which 41 were treated with insulin (variance <0.15, confidence <0.15). One-dimensional histograms of each corresponding category appear across the confidence and variability axes.
j, Dot plots showing the mean scaled expression of the same marker genes as in c, for either the easy-to-learn + ambiguous mSTZ cells (top row), the hard-to-learn mSTZ cells (middle row) or the control healthy cells (bottom row). Inter., intermediate.

Source data

Results

Characterizing cells and annotations with Annotatability

We developed Annotatability, a framework for annotation-trainability analysis for single-cell omics data, achieved by monitoring the training dynamics of DNNs. Annotatability quantifies these training dynamics as follows. First, it takes as input an annotated dataset consisting of observations such as gene expression profiles of single cells and corresponding annotations such as cell types or disease states per cell (Fig. 1a, step 1). Second, it trains a DNN, in our case, a simple multilayer perceptron, to predict the input annotation of each cell (‘Annotatability workflow’ section in Methods; Fig. 1a, step 2). Third, it analyzes the DNN’s predictions along the training procedure (Fig. 1a, step 3). Finally, it computes two scores for each cell18: (1) the cell’s confidence score is the mean probability assigned to its input annotation across epochs, and (2) the cell’s variability score is the standard deviation of the probability assigned to its input annotation across epochs (‘Annotatability workflow’ section in Methods; Fig. 1a, step 4).
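In code, the score computation reduces to recording, at every epoch, the probability the network assigns to each cell's input annotation, then taking the mean and standard deviation across epochs. The sketch below is a minimal stand-in on synthetic data, replacing the multilayer perceptron with a plain softmax classifier; it illustrates the idea, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an annotated dataset: 60 "cells" x 5 "genes", two classes.
X = rng.normal(size=(60, 5))
y = (X[:, 0] > 0).astype(int)   # input annotations, driven by "gene" 0
y[:5] = 1 - y[:5]               # deliberately flip a few annotations

n_epochs, n_classes = 50, 2
W = np.zeros((5, n_classes))                   # softmax classifier weights
probs_per_epoch = np.zeros((n_epochs, len(y)))

for epoch in range(n_epochs):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # record the probability assigned to each cell's *input* annotation
    probs_per_epoch[epoch] = p[np.arange(len(y)), y]
    # one full-batch gradient step on the cross-entropy loss
    W -= 0.5 * X.T @ (p - np.eye(n_classes)[y]) / len(y)

confidence = probs_per_epoch.mean(axis=0)   # mean probability across epochs
variability = probs_per_epoch.std(axis=0)   # std of probability across epochs
```

On this toy data, the deliberately flipped cells end up with markedly lower confidence than the rest, mirroring the expected behavior for erroneously annotated cells.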

Based on previous results18, we expect learning to become easier, quicker and less volatile as the fit between a cell and its input annotation increases, and vice versa. Consequently, we expect correctly annotated cells to have high confidence and low variability scores, erroneously annotated cells to have low confidence and low variability scores, and ambiguously annotated cells to have mid-confidence and high variability scores. By setting appropriate thresholds (‘Classifying cells on the basis of confidence and variability scores and setting thresholds’ section in Methods), we classify each cell to one of these three categories: correctly annotated (easy to learn), erroneously annotated (hard to learn) or ambiguously annotated (Fig. 1b,c).
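Given the two scores, the three-way classification is a pair of thresholding rules. A minimal sketch follows; the numeric thresholds here are illustrative placeholders, not the values used in the paper, whose Methods describe how thresholds are actually set:

```python
import numpy as np

def classify_cells(confidence, variability, conf_hi=0.8, conf_lo=0.2, var_hi=0.1):
    """Assign each cell to a trainability category.

    The thresholds are illustrative placeholders only.
    """
    labels = np.full(len(confidence), "ambiguous", dtype=object)
    labels[(confidence >= conf_hi) & (variability < var_hi)] = "correct"    # easy to learn
    labels[(confidence <= conf_lo) & (variability < var_hi)] = "erroneous"  # hard to learn
    return labels

conf = np.array([0.95, 0.05, 0.50, 0.90, 0.10])
var = np.array([0.02, 0.03, 0.30, 0.05, 0.04])
categories = classify_cells(conf, var)
```

High-confidence/low-variability cells come out "correct", low-confidence/low-variability cells "erroneous", and everything else "ambiguous".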

To reduce the sensitivity of Annotatability to the details of the DNN, we use a general, nonspecialized architecture, specifically a fully connected network20. Moreover, the known tendency of DNNs to overfit is a key aspect of our approach: the data will eventually be memorized by the network without needing to impose particular assumptions on its structure or on the network hyperparameters. Indeed, Annotatability was found to be robust across a range of hyperparameter and architecture settings (Supplementary Section A.5 and Supplementary Fig. 1).

Annotatability uses the trainability-based cell classifications for downstream biological analysis in several ways. First, it allows us to focus any downstream analysis solely on correctly annotated cells and to study ambiguously annotated cells as candidates for intermediate cell states. It also allows us to identify distinct subpopulations of cells that do not fit their input annotations, such as healthy-like cells in a diseased sample. Second, the identification of correctly and erroneously annotated cells can be used to amend erroneous annotations; we equipped Annotatability with an optional reannotation module that trains a DNN solely on correctly annotated cells; the network is then used to reannotate the entire dataset (‘Reannotation of erroneously annotated cells’ section in Methods; Fig. 1d). Third, Annotatability incorporates a trainability-aware graph-embedding module, utilizing the trainability of a given annotation as a proxy for the biological variation, or signal, encoded by the cell. Recent years have seen the development of many single-cell RNA sequencing (scRNA-seq) graph-based analysis pipelines, where nodes represent cells and edges between cell pairs are weighted by their proximity in gene expression space10,21,22,23,24. Annotatability extends this approach by also considering the similarity of cells’ Annotatability confidence scores when weighting edges (‘Trainability-aware graph embedding’ section in Methods). This reweighting enhances signals of interest, such as a cell’s developmental stage or disease state, enabling signal-specific downstream analyses (‘Trainability-aware graph embedding’ section in Methods; Fig. 1e,f).
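One plausible way to make an expression graph trainability-aware is to damp edges between cells whose confidence scores disagree. The sketch below uses a Gaussian kernel on confidence-score differences as an assumed reweighting; the module's actual formula is given in the Methods:

```python
import numpy as np

def trainability_aware_weights(expr_weights, confidence, sigma=0.1):
    """Reweight expression-graph edges by confidence similarity.

    Assumption: a Gaussian kernel on confidence differences, one
    plausible choice; not necessarily the paper's exact reweighting.
    """
    c = np.asarray(confidence)
    sim = np.exp(-np.abs(c[:, None] - c[None, :]) / sigma)
    return expr_weights * sim

# Three cells: 0 and 1 share similar confidence, 2 differs sharply.
w = np.ones((3, 3))            # uniform expression-graph weights
conf = np.array([0.9, 0.88, 0.2])
w2 = trainability_aware_weights(w, conf)
```

Edges between cells with similar confidence (0 and 1) keep most of their weight, while edges to the dissimilar cell (2) are strongly damped, so the reweighted graph emphasizes the annotation-related signal.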
Finally, Annotatability is equipped with a training-dynamics-based score that captures either positive or negative association of genes relative to a given biological signal, revealed by their correlation or anticorrelation with the confidence in a particular annotation (‘Annotation-trainability score’ section in Methods; Fig. 1g).
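The gene-association idea can be sketched as a per-gene correlation between expression and the per-cell confidence score. The paper's exact annotation-trainability score is defined in its Methods; the Pearson correlation below on synthetic data is only an illustrative proxy:

```python
import numpy as np

def annotation_trainability_scores(X, confidence):
    """Per-gene Pearson correlation of expression with annotation confidence.

    Positive scores mark positively associated genes, negative scores
    mark anticorrelated genes (a proxy for the paper's score).
    """
    Xc = X - X.mean(axis=0)
    cc = confidence - confidence.mean()
    denom = np.sqrt((Xc**2).sum(axis=0) * (cc**2).sum())
    return (Xc * cc[:, None]).sum(axis=0) / denom

rng = np.random.default_rng(1)
conf = rng.uniform(size=200)                  # per-cell confidence scores
X = np.column_stack([
    conf + 0.1 * rng.normal(size=200),        # positively associated gene
    -conf + 0.1 * rng.normal(size=200),       # negatively associated gene
    rng.normal(size=200),                     # unassociated gene
])
scores = annotation_trainability_scores(X, conf)
```

The first gene scores strongly positive, the second strongly negative, and the random gene near zero.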

Identifying erroneous annotations and ambiguous cell states in human PBMCs

We begin by demonstrating how annotation-trainability analysis via Annotatability can be used to identify both erroneously and ambiguously annotated cells, the latter potentially corresponding to intermediate cell types. We apply Annotatability to a scRNA-seq dataset of human PBMCs25. In the original dataset, each cell was preannotated to one of multiple cell types (Fig. 2a). We used Annotatability to train a DNN classifier that predicts cell types from the annotated PBMC scRNA-seq data, and assign confidence and variability scores to each cell (Fig. 2b,c).

We first assess the set of cells classified as erroneously annotated by Annotatability on the basis of their low confidence and low variability scores (‘Classifying cells on the basis of confidence and variability scores and setting thresholds’ section in Methods). These erroneously annotated cells indeed exhibit underexpression (or no expression) of marker genes corresponding to their input cell type annotation (Fig. 2d). For example, cells classified as erroneously annotated CD14+ monocytes underexpress CD14, yet ~80% of them express the T cell marker IL7R and ~60% express the natural killer (NK)/T cell marker NKG7. Similarly, cells classified as erroneously annotated B cells underexpress the B cell marker CD79A, yet a substantial fraction expresses various T cell or dendritic cell markers (Fig. 2d–f and Supplementary Fig. 2a,b).

We then assessed whether cells classified as ambiguously annotated indeed correspond to intermediate or otherwise ambiguous cell types, focusing first on the monocyte population. Monocytes are broadly divided into three major subtypes: classical (CD14++CD16−), intermediate (CD14++CD16+) and nonclassical (CD14+CD16++)26. However, in the original scRNA-seq data, monocytes were annotated as either classical (CD14+) or nonclassical (FCGR3A+), where FCGR3A is a common alias for CD16. Thus, we expect intermediate cells in the PBMC dataset to be classified as ambiguously annotated. To support our approach, we used previously established marker genes25, specifically, the top two highly expressed marker genes for each monocyte subtype that were not filtered out during preprocessing (classical monocyte markers: S100A12/ALOX5AP; intermediate monocyte markers: GPR35 and MARCO; nonclassical monocyte markers: ICAM4/CD79b). Indeed, CD14+ monocytes classified by Annotatability as ambiguously annotated express lower levels of classical monocyte markers and higher levels of intermediate monocyte markers than CD14+ monocytes classified as correctly annotated (Fig. 2d,h, Supplementary Fig. 3 and Supplementary Section A.2.2). Similarly, FCGR3A+ monocytes classified as ambiguously annotated express lower levels of nonclassical monocyte markers and higher levels of intermediate monocyte markers than FCGR3A+ monocytes classified as correctly annotated (Fig. 2d,h, Supplementary Fig. 3 and Supplementary Section A.2.2). In a semi-simulated dataset in which cells from the PBMC dataset were randomly blended, Annotatability outperformed other classifiers in identifying ambiguous cell states (Supplementary Section A.4). Similar analysis of peripheral blood NK cells revealed intermediate cell states classified by Annotatability as ambiguously annotated (Fig. 2g and Supplementary Section A.2.1).

Having validated Annotatability’s categorization of cells as correctly, erroneously or ambiguously annotated, we next directly evaluate differences in training dynamics among these three categories as a function of the learning epoch. We focus on the group of cells originally annotated as classical (CD14+) monocytes. For each epoch, we compute the mean probability that cells in each category are recognized as classical monocytes by the neural network. We identify three distinct regimes along training: early (0–5 epochs), middle (5–50 epochs) and late (50–100 epochs). For cells in the correctly annotated category, the mean inferred probability of retaining the classical (CD14+) annotation increases to a value close to 1.0 during the early regime and remains high throughout the training process (Fig. 2i, blue). In that sense, correctly annotated cells are easy to learn. For cells in the erroneously annotated category, the mean inferred probability decreases to a value close to 0.0 during the early regime and remains low throughout the middle regime, yet gradually increases during the late regime, reaching a value of 0.25 (Fig. 2i, red). In that sense, erroneously annotated cells are hard to learn, but the network begins to memorize their incorrect annotations during the late regime, which is therefore also termed the overfitting regime. For cells in the ambiguously annotated category, the mean inferred probability of retaining the classical (CD14+) annotation increases gradually during the early and middle regimes, saturating only upon reaching the late regime (Fig. 2i, orange). For the same set of cells, the mean inferred probability that they are in fact nonclassical monocytes (FCGR3A+) increases during the early regime, but reverts to 0.0 during the middle regime.
This example demonstrates how the training dynamics patterns of ambiguously annotated cells differ from those of correctly and erroneously annotated cells, and why they can inform an intermediate transcriptional state that is neither classical nor nonclassical (Fig. 2i).
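Curves like those in Fig. 2i follow directly from the recorded training dynamics: average the per-epoch probabilities within each trainability category. The toy dynamics below are synthetic sigmoids chosen to mimic the described regimes, not real data:

```python
import numpy as np

def mean_probability_curves(probs_per_epoch, categories):
    """Mean inferred probability of the input annotation per training
    epoch, for each trainability category (as in the curves of Fig. 2i)."""
    return {cat: probs_per_epoch[:, categories == cat].mean(axis=1)
            for cat in np.unique(categories)}

# Toy dynamics: easy cells are learned early; hard cells are only
# memorized during the late (overfitting) regime, plateauing near 0.25.
epochs = np.arange(100)
easy = 1 / (1 + np.exp(-(epochs - 3)))            # saturates early
hard = 0.25 / (1 + np.exp(-(epochs - 80) / 5))    # rises only late
probs = np.column_stack([easy, easy, hard])       # epochs x cells
cats = np.array(["correct", "correct", "erroneous"])
curves = mean_probability_curves(probs, cats)
```

With real data, `probs_per_epoch` would hold the probabilities recorded during DNN training and `categories` the confidence/variability-based classification.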

Reannotating spatial transcriptomics data

After identifying erroneously annotated cells using Annotatability, one option is to filter them from their corresponding single-cell dataset, based on the assumption that the remaining cells retain sufficiently rich information. However, in some cases, discarding all misclassified cells may disrupt the biological interpretation of the data. For example, in spatially aware single-cell transcriptomic measurements27, preserving comprehensive spatial maps of the measured regions is crucial for correctly interpreting spatial gene expression and for inferring associated collective cell behavior. Therefore, we equipped Annotatability with a reannotation module for the robust reannotation of cells identified as erroneously annotated (‘Reannotation of erroneously annotated cells’ section in Methods). Furthermore, the annotation of spatial data poses unique challenges compared with nonspatial data, as many of the spatially aware transcriptomics experimental methods are limited in the number of genes that are measured and/or in their spatial resolution, where each spatial location can contain mixed measurements of multiple cells28. While Annotatability, in its current form, does not simultaneously infer confidence for mixtures of cell types, the confidence score in a cell type annotation appears to be strongly correlated with the relative contribution of the corresponding type to the gene expression mixture (Supplementary Fig. 4e,f).

We evaluated the capability of Annotatability to identify erroneously annotated cells and reannotate them in spatial transcriptomics data for a MERFISH dataset of the mouse hypothalamic preoptic region29 (Fig. 3a–d). To benchmark our approach, we first used a semi-synthetic dataset, created by randomly perturbing the annotation of varying fractions of cells, with the underlying assumption that the majority of cells were correctly classified. We used Annotatability to identify the erroneously annotated cells in the perturbed MERFISH data, and compared its performance with scReClassify, a state-of-the-art method for the identification of misclassified cells30 based on either a random forest (RF) or a support vector machine (SVM) classifier, in addition to comparing with a k-nearest neighbors (KNN) classifier based on the scReClassify framework (Supplementary Section A.4). Annotatability consistently outperformed baselines in distinguishing between correctly and erroneously annotated cells in the perturbed data, its advantage increasing with the fraction of synthetically misclassified cells (Fig. 3e). Similarly, Annotatability outperformed baselines in identifying misclassified cells in three additional synthetically perturbed spatial datasets: a 10x Genomics Visium dataset of a coronal section of mouse brain31, a 4i dataset of subcellular-resolution human tissue cell culture32 and a seqFISH (sequential fluorescence in situ hybridization) mouse embryo dataset33 (Supplementary Fig. 5). The primary advantage of Annotatability becomes evident in ‘noisy’ scenarios, for example, in cases with a large proportion of misclassified cells (Fig. 3 and Supplementary Fig. 5).
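The benchmark's core loop is easy to reproduce: perturb some annotations, score each cell by how poorly its annotation is learned, and compute the AUROC for recovering the perturbed cells. A sketch with a rank-based AUROC and simulated confidence scores follows; the Beta distributions are assumptions for illustration, not fitted to any dataset:

```python
import numpy as np

def auroc(scores, is_mislabeled):
    """AUROC for detecting mislabeled cells from a per-cell score
    (higher score = more suspect), via the Mann–Whitney rank formula."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = is_mislabeled.astype(bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Semi-synthetic benchmark: perturb ~20% of annotations, then score each
# cell by (1 - confidence) and measure separation of the perturbed cells.
rng = np.random.default_rng(2)
n = 500
mislabeled = rng.uniform(size=n) < 0.2
confidence = np.where(mislabeled,
                      rng.beta(2, 8, size=n),    # mislabeled -> low confidence
                      rng.beta(8, 2, size=n))    # correct -> high confidence
score = auroc(1 - confidence, mislabeled)
```

With confidence distributions this well separated, the AUROC is close to 1; in the paper's benchmark the curves of Fig. 3e play this role.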

Progressing to a real-world setting, we inspected cells identified as erroneously annotated in the unperturbed hypothalamic MERFISH dataset (1,138 out of 62,880 cells; Fig. 3f–i and Supplementary Fig. 5a–d), of which the majority were initially annotated as either inhibitory neurons (587/24,761 erroneously annotated; Fig. 3f,j) or excitatory neurons (314/11,757 erroneously annotated; Fig. 3g,k). To reannotate these erroneously annotated cells, we trained Annotatability’s reannotation module and used the trained classifier to predict annotations for the set of erroneously annotated cells (‘Reannotation of erroneously annotated cells’ section in Methods). Indeed, following this procedure, the annotations of cells identified as erroneously annotated were transformed into annotations overall consistent with their marker gene characterization, based on established marker genes for the different cell types29 (Fig. 3j,k and Supplementary Table 1). For example, the mean normalized expression of Gad1, a marker gene for inhibitory neurons29, is 6.830 in the correctly annotated inhibitory neurons, 1.458 in the erroneously annotated inhibitory neurons and 6.795 in cells reannotated as inhibitory neurons. More generally, for cells that Annotatability identifies as erroneously annotated, marker gene expression, on average, aligns with the new cell type annotations following reannotation, rather than the original input annotations (Supplementary Table 1). Thus, error detection based on training dynamics combined with retraining on a subset of high-confidence/low-variability cells enables robust reannotation of spatial transcriptomics data.
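The logic of the reannotation module can be sketched as: fit a classifier only on cells identified as correctly annotated, then predict annotations for the remaining cells. Annotatability trains a DNN for this step; the nearest-centroid classifier below is a deliberately simplified stand-in on synthetic data:

```python
import numpy as np

def reannotate(X, labels, is_correct):
    """Fit on correctly annotated cells only, then relabel all cells.

    A nearest-centroid classifier stands in for the DNN used by the
    actual reannotation module.
    """
    classes = np.unique(labels[is_correct])
    centroids = np.stack([X[is_correct & (labels == c)].mean(axis=0)
                          for c in classes])
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

rng = np.random.default_rng(3)
# Two well-separated "cell types"; cell 0 carries a wrong input label.
X = np.vstack([rng.normal(0, 0.3, (20, 4)), rng.normal(3, 0.3, (20, 4))])
labels = np.array(["A"] * 20 + ["B"] * 20, dtype=object)
labels[0] = "B"                        # erroneous annotation
is_correct = np.ones(40, bool)
is_correct[0] = False                  # flagged as hard to learn
new_labels = reannotate(X, labels, is_correct)
```

The mislabeled cell is pulled back to the type matching its expression profile, while all correctly annotated cells keep their labels.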

Uncovering the epithelial-to-mesenchymal pseudotime trajectory

Going beyond reasoning about annotations of static cell properties and their interpretation, we next demonstrate that annotation-trainability analysis can reveal dynamic trajectories and temporal shifts in cellular states. We illustrate this concept using single-cell data related to the EMT34. Traditionally perceived as a binary transition, the EMT has been conceptually generalized to a continuous temporal transition through intermediate cellular phenotypes35. Computationally, these intermediate states can be inferred using techniques such as graph-based clustering and pseudotime reconstruction35. However, graph-based clustering that relies solely on gene expression, as commonly done21, can be sensitive to sources of variation other than the EMT, such as treatment effects. In contrast, we use Annotatability to construct a trainability-aware gene expression graph that specifically enhances the EMT signal.

We used Annotatability to analyze scRNA-seq datasets of the MCF10A and HuMEC cell lines36, which serve as models for benign mammary epithelial cells. Our analysis included both untreated cells (mock condition) and cells treated with the cytokine TGF-β, which, among other effects37, induces the EMT36 (Fig. 4a). To generate initial input labels for Annotatability, we clustered the cells on the basis of their gene expression using Louvain clustering and labeled cells in the resulting 22 clusters as either epithelial (E) or mesenchymal (M) on the basis of known marker genes36 (epithelial: CDH1, CRB3 and DSP; mesenchymal: VIM, FN1 and CDH2) (Fig. 4b). Following standard preprocessing (‘Data preprocessing’ section in Methods) and without requiring batch integration, we then used Annotatability to compute confidence and variability scores for each cell (‘Annotatability workflow’ section in Methods; Fig. 4c,d). We also used Annotatability to identify erroneously annotated cells and filter them out.
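The initial labeling step, clustering followed by marker-based cluster labeling, can be sketched as follows; the marker lists match the text, but the expression data and the precomputed clustering (standing in for Louvain) are synthetic:

```python
import numpy as np

def label_clusters(expr, genes, clusters, epi_markers, mes_markers):
    """Label each cluster 'E' or 'M' by comparing mean expression of
    epithelial versus mesenchymal marker genes (a simplified version
    of the cluster-labeling step)."""
    gi = {g: i for i, g in enumerate(genes)}
    e_idx = [gi[g] for g in epi_markers]
    m_idx = [gi[g] for g in mes_markers]
    out = {}
    for c in np.unique(clusters):
        sub = expr[clusters == c]
        out[c] = "E" if sub[:, e_idx].mean() > sub[:, m_idx].mean() else "M"
    return out

genes = ["CDH1", "CRB3", "DSP", "VIM", "FN1", "CDH2"]
rng = np.random.default_rng(4)
# Synthetic expression: one epithelial-like and one mesenchymal-like cluster.
expr = np.vstack([
    np.column_stack([rng.gamma(5, 1, (30, 3)), rng.gamma(1, 1, (30, 3))]),
    np.column_stack([rng.gamma(1, 1, (30, 3)), rng.gamma(5, 1, (30, 3))]),
])
clusters = np.array([0] * 30 + [1] * 30)   # stand-in for Louvain clusters
cluster_labels = label_clusters(expr, genes, clusters,
                                ["CDH1", "CRB3", "DSP"],
                                ["VIM", "FN1", "CDH2"])
```

In practice the clusters would come from Louvain clustering of the expression graph (for example, via scanpy), with the same marker-based vote deciding each cluster's E/M label.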

In principle, the assigned confidence scores provide a continuous measure of each cell’s congruence with its input label. Therefore, we hypothesized that these scores reflect a cell’s state along the gradual EMT. Indeed, for cells that were originally labeled as epithelial, the mean expression of epithelial (mesenchymal) marker genes increases (decreases) monotonically with Annotatability’s confidence score (Spearman’s rank correlations: 0.410 and −0.421, respectively) (Fig. 4e,f). In comparison, the correlation of the mean expression of the same genes to pseudotimes computed using the diffusion pseudotime trajectory inference algorithm21,22 is 0.260 and −0.127, respectively (Supplementary Section A.4). Similarly, the mean expression of mesenchymal (epithelial) marker genes increases (decreases) monotonically with the confidence score for cells that were originally labeled as mesenchymal (Spearman’s rank correlations: 0.606 and −0.210, respectively, compared with 0.555 and 0.031 for diffusion pseudotimes, respectively).
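The reported Spearman's rank correlations can be computed with a tie-ignoring rank correlation; the data below are synthetic, so the coefficients here are illustrative, not the paper's values:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation via Pearson correlation of ranks
    (ties are ignored in this minimal sketch)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return (rx * ry).sum() / np.sqrt((rx**2).sum() * (ry**2).sum())

rng = np.random.default_rng(5)
conf = rng.uniform(size=300)                              # confidence scores
epithelial_marker = conf + 0.5 * rng.normal(size=300)     # rises with confidence
mesenchymal_marker = -conf + 0.5 * rng.normal(size=300)   # falls with confidence
r_e = spearman(conf, epithelial_marker)
r_m = spearman(conf, mesenchymal_marker)
```

With real data, `conf` would be the per-cell confidence in the E (or M) label and the marker vectors the mean expression of the corresponding marker genes; `scipy.stats.spearmanr` is a tie-aware alternative.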

To focus on the EMT signal, we applied our trainability-aware graph embedding algorithm (‘Trainability-aware graph embedding’ section in Methods) to the MCF10A cell line scRNA-seq data, individually for the epithelial and mesenchymal cells, followed by Louvain clustering of each graph (analogous results were obtained for the HuMEC cell line; Supplementary Fig. 6). The trainability-aware graphs, integrating information about both training dynamics statistics and gene expression, indeed capture the progression of cells along the EMT, as the confidence in the input E/M labels changes gradually along the graphs (Fig. 4g–i). This is in contrast to baseline expression graphs, that is, graphs constructed on the basis of gene expression alone, which result in graph structures that capture treatment effects (mock versus TGF-β) instead of the EMT process (‘Trainability-aware graph embedding’ section in Methods; Fig. 4g–i and Supplementary Fig. 6a–c). Specifically, in the mesenchymal (epithelial) baseline expression graph, 99% (100%) of cells in one cluster and 98% (99%) of cells in the second cluster were treated with mock or TGF-β, respectively. This is in contrast to 69% (69%) and 77% (81%) of cells in each of the corresponding clusters in the trainability-aware graph that were treated with mock or TGF-β, respectively. Furthermore, E/M marker genes are differentially expressed at different stages along the trainability-aware graph structure, but not along the baseline expression graph structure, further supporting the correspondence between the trainability-aware graph and the EMT process (Methods, Fig. 4j and Supplementary Fig. 6d).

Inference of disease-related cell states and treatment responses

We next used Annotatability to infer disease states, disease-associated genes and individual cell responses to treatment. We analyzed scRNA-seq data collected from pancreatic islets isolated from healthy mice and from multiple low-dose streptozotocin (mSTZ)-induced diabetic models38. In the mSTZ-induced diabetic models, STZ is applied in multiple low doses, partially degrading β cell activity39; the level of degradation may differ between individual cells40. To quantify the damage to individual α, β and δ cells in the mSTZ mice, we used Annotatability to infer confidence and variability scores with respect to the mSTZ label for each cell, against the backdrop of control cells labeled as healthy (‘Annotatability workflow’ section in Methods). We hypothesized that learning the mSTZ label is easiest for the most severely damaged cells and hardest for the least severely damaged cells. Indeed, the inferred confidence scores are negatively correlated with the expression of β cell activity markers (Ins1, Ins2, Slc2a2, Trpm5, G6pc2, Slc30a8 and Ucn3; Spearman’s ρ = −0.662 for the sum of their expression; Fig. 5a and Supplementary Fig. 7a) and positively correlated with the expression of β cell dedifferentiation markers (ρ = 0.573; Supplementary Fig. 7b,c)41. In comparison, for the diffusion pseudotime trajectory inference algorithm21,22 (Supplementary Section A.4), the corresponding correlation coefficients are −0.021 for the β cell activity markers and −0.222 for the dedifferentiation markers. Conversely, Annotatability’s confidence score also enabled us to identify genes associated with the disease state of individual mSTZ β cells, in the sense that their expression levels are correlated (positively associated) or anticorrelated (negatively associated) with the confidence score assigned to each cell’s disease annotation (‘Annotation-trainability score’ section in Methods; Fig. 1g, Supplementary Section A.3.1 and Supplementary Table 2).

To refine the classification of mSTZ cells by disease level, we constructed a trainability-aware graph (‘Trainability-aware graph embedding’ section in Methods) followed by Louvain clustering, similarly to the above EMT analysis. Initially, we tuned the clustering resolution to guarantee an output of three clusters (Fig. 5b). The three clusters are distinguished by the expression patterns of the nine disease progression markers (Fig. 5b–d and Supplementary Fig. 7d–f): a disease cluster expressing low levels of activity and maturation markers and high levels of dedifferentiation markers; a healthy-like cluster expressing high levels of activity and maturation markers and low levels of dedifferentiation markers, resembling those in the healthy control; and an intermediate cluster expressing intermediate levels of all markers relative to the disease and healthy-like clusters. We verified that the difference in the expression of disease-associated genes across clusters indeed corresponds to changes in the training dynamics scores, specifically, decreasing confidence in the disease annotation along the disease, intermediate and healthy-like clusters (Fig. 5e and Supplementary Fig. 7g). The clustering quality is robust with respect to clustering granularity, consistently manifesting a graded progression from a disease-like to a healthy-like expression pattern of marker genes when the cells are divided into three, four or five clusters (Fig. 5f and Supplementary Fig. 7h–m). Thus, the clustering of the trainability-aware graph provides a refined quantification of the damage to individual mSTZ β cells.

Finally, the islets dataset includes six subgroups of mSTZ-induced diabetic models, each treated with a different compound or combinations thereof38. The distribution of the confidence and variability score ranks for β cells in each treatment subgroup reveals both treatment-independent and treatment-dependent cellular heterogeneity (Fig. 5g,h, Supplementary Fig. 7a–c and Supplementary Section A.3.2). Specifically, each treatment group can be subdivided into disease, intermediate and healthy-like clusters using a trainability-aware graph as above (Fig. 5h and Supplementary Fig. 7n). However, insulin-treated cells have a larger proportion of healthy-like cells, and the combination of glucagon-like peptide-1 and estrogen was shown to be more effective than either treatment alone, in line with previous studies on treatment effects42. Furthermore, our analysis revealed a small subset of 43 hard-to-learn β cells, almost all of which are insulin treated. Marker gene expression for these hard-to-learn cells aligns particularly closely with that of healthy cells, to a degree that is not observed in other treatment groups and cannot be explained by treatment-independent heterogeneity (Fig. 5i,j and Supplementary Section A.3.2). Thus, annotation-trainability analysis provides a highly sensitive tool for evaluating treatment effectiveness and inferring individual cell responses.

Discussion

Annotations of single cells enrich raw biological data with descriptive information that provides context, explanation and detail. However, discrete annotations also force heterogeneous cell populations into rigid molds, whose interpretation is inherently subjective. Consequently, a cell can be incongruent with its input annotation for reasons ranging from simple assignment errors to fundamental ambiguities associated with underlying biological processes. In this work, we introduced Annotatability, which identifies annotation mismatches by tapping into the commonly neglected signal encoded in a DNN’s training dynamics, and, subsequently, relies on these mismatches to enhance the interpretation of single-cell and spatial omics data.

Annotatability is a generic approach in the sense that, given a dataset of cells and their corresponding annotations (or, in principle, any biological entities and their annotations), it leverages training dynamics to categorize cells according to their ‘goodness-of-match’ to their annotations. While fully connected DNNs are utilized for Annotatability’s internal implementation, the approach itself is oblivious to how the input annotations were assigned externally.

At the dataset level, the convergence rate of training a DNN can be influenced by various properties of the data and annotations, such as geometric structure and noise levels. Consequently, these differences may manifest as variations in the distribution of annotated data points in terms of their confidence and variability scores, even for the same dataset using different annotation classes. For example, when the mSTZ dataset (‘Inference of disease-related cell states and treatment responses’ section in Results)38 was annotated by experimental condition (healthy versus mSTZ mouse model) or treatment type, the confidence and variability scores displayed a relatively large spread and a relatively low proportion of easy-to-learn cells (Extended Data Fig. 2a,c), in line with the marked heterogeneity introduced by different disease states and treatment effectiveness (Fig. 5). In contrast, when the same dataset was annotated by cell type, the spread was much lower, with a high proportion of easy-to-learn cells (Extended Data Fig. 2b), probably due to the expression of distinct marker genes that define differentiated α, β and δ cells. Changes in the distribution of confidence and variability scores may reveal biological differences even within a single annotation class. For example, in a dataset of mouse small-intestine cells43 annotated by cell type, the DNN had more difficulty learning stem cell labels than labels associated with more differentiated cells (Extended Data Fig. 1). In future work, we aim to extend Annotatability to enable comparative analysis of the global properties of annotated datasets based on their collective training dynamics. For example, training dynamics can serve as a practical proxy for complexity measures developed in the context of machine learning theory44.

We next discuss several limitations of Annotatability and how to address them in the future. First, Annotatability does not directly handle convolved or aggregated data, such as that obtained from certain spatial transcriptomics technologies, and future work can extend the methodology explicitly to fractional or probabilistic annotations. Second, while Annotatability is robust to the precise choice of hyperparameter values (Supplementary Fig. 1), the determination of hyperparameters such as training time needs to be standardized, as they may influence the resulting confidence and variability scores. Third, the use of thresholding (‘Classifying cells on the basis of confidence and variability scores and setting thresholds’ section in Methods) on these scores for categorization introduces arbitrary cutoffs that could influence the interpretation of the data; this step will be further automated in the future. Fourth, Annotatability currently treats input annotations as nominal categories. For the special but common binary case (for example, epithelial versus mesenchymal, healthy versus diseased tissue), the confidence in a cell’s annotation naturally gives rise to ordinal or numerical interpretations, where lack of confidence (and low variability) in one annotation may be considered as increased confidence in its complement. In future work, we aim to generalize Annotatability to probabilistic and continuous annotations, which will enable us to leverage the internal structure of such annotated datasets, including those continuously annotated according to a developmental or disease trajectory.
Furthermore, while the measures of confidence and variability of DNN predictions along the training process provided ample information to reveal cellular heterogeneity, structure and other patterns related to label incongruence, future work can incorporate additional measures derived from the training dynamics of a DNN to characterize further aspects of the congruence of cells with their input annotations, potentially providing a wider interpretive lens into single-cell heterogeneity. Finally, in this work we demonstrated the application of annotation-trainability analysis to scRNA-seq and spatial transcriptomics datasets. However, our approach is general and can readily extend to other single-cell multi-omic techniques and annotated biological datasets.

Methods

Annotatability workflow

Given input data that include single-cell or spatial omics observables (for example, scRNA-seq gene expression profiles) and corresponding annotations (for example, cell type annotation per cell), the Annotatability general-case workflow is as follows:

  1. preprocess the data;

  2. train a DNN on the input data, monitoring the training dynamics and recording the prediction of the DNN after each epoch;

  3. calculate the training dynamics scores, confidence and variability;

  4. classify each observable (for example, cell) into one of three categories: easy to learn (correctly annotated), hard to learn (erroneously annotated) or ambiguous;

  5. optional step: filter out erroneously annotated cells;

  6. optional step: reannotate erroneously annotated cells using the reannotation module;

  7. optional step: construct a trainability-aware graph embedding, integrating information from gene expression and training dynamics statistics, followed by signal-specific analysis;

  8. optional step: score and rank genes according to their (positive or negative) annotation-trainability association.

Rationale

One of the main aims of Annotatability is to detect cells that are erroneously annotated and cells that are in ambiguous cell states. For both of these tasks, we train a deep learning-based classifier to assign each cell to its input annotation and monitor the training dynamics. We set the model’s level of complexity to be sufficiently high to ensure that the network will eventually assign the input annotations to all provided examples; the underlying assumption is that a complex network such as the one we are using is capable of memorizing any input annotation, even if it is incorrect45. Formally, our approach is based on a model that involves selecting parameters to minimize empirical risk by a stochastic gradient-based optimization procedure over E epochs. The model is assumed to establish a probability distribution over labels for a given observation (cell). The training dynamics of a given instance i, following previous work in the context of natural language processing18, are described by statistical metrics computed over E epochs, which are then used as coordinates on a map. The first metric quantifies the level of confidence with which the learning algorithm assigns the correct label (yi) to an observation (the gene expression of the cell, xi) based on its probability distribution. Specifically, we define confidence, \(\widehat{{u}_{i}}\), as the average probability assigned by the model to the correct label (yi) across all epochs:

$$\widehat{{u}_{i}}=\frac{1}{E}\mathop{\sum }\limits_{e=1}^{E}{p}_{{\theta }^{e}}({y}_{i}| {x}_{i}),$$
(1)

where \({p}_{{\theta }^{e}}\) denotes the model’s probability with parameters θe for the eth epoch. In addition, variability is defined using the standard deviation of \({p}_{{\theta }^{e}}({y}_{i}| {x}_{i})\) across epochs:

$${\hat{\sigma}}_{i}=\sqrt{\frac{\mathop{\sum }\nolimits_{e = 1}^{E}{({p}_{{\theta }^{e}}({y}_{i}| {x}_{i})-{\hat{u}}_{i})}^{2}}{E}}.$$
(2)

Based on previous studies in the fields of computer vision46,47 and natural language processing18, our underlying assumption is that cells with high confidence and low variability (easy to learn) probably have correctly labeled annotations, which the network learns early in the training process. Conversely, cells with low confidence and low variability (hard to learn) are probably mislabeled and would be learned only at later stages of the training procedure. Cells with moderate confidence and high variability are in an ambiguous or intermediate state associated with at least two different annotations.
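Given the per-epoch probabilities recorded during training, equations (1) and (2) reduce to a mean and a standard deviation across epochs. A minimal sketch, using a toy probability matrix as a stand-in for recorded training dynamics:

```python
import numpy as np

# p_correct[e, i] is the probability that the model with epoch-e parameters
# assigns to cell i's input label (toy values standing in for a real run)
p_correct = np.array([
    [0.20, 0.90, 0.30],  # epoch 1
    [0.30, 0.95, 0.70],  # epoch 2
    [0.25, 0.99, 0.40],  # epoch 3
])

confidence = p_correct.mean(axis=0)   # equation (1): mean over epochs
variability = p_correct.std(axis=0)   # equation (2): population SD over epochs

# Cell 2 (index 1) is learned early and confidently (easy to learn);
# cell 1 stays low (hard to learn); cell 3 fluctuates (ambiguous).
print(confidence)
print(variability)
```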

Initial training phase

For training, we constructed a simple DNN consisting of three fully connected layers, each utilizing the ReLU activation function48. For the output layer, we used a log softmax activation function. To calculate the confidence and variability scores, we exponentiate the resulting log probabilities to transform them into a standard probability distribution. As a loss function, we used cross-entropy loss, which is minimized using a stochastic gradient descent optimizer with an initial learning rate of 0.001 and momentum of 0.9. We used PyTorch49 to perform the training procedure described above. To mitigate the effects of class imbalance (for example, rare cell types), we used a weighted sampler, ensuring that rare annotations are given higher importance during the training process. The number of epochs is chosen to be sufficiently large that the empirical risk is minimized, reaching a value close to zero by the end of the training procedure. In cases where the training process fails to converge, often characterized by high loss and only a small fraction of easy-to-learn samples, it is advisable to deepen the neural network. This option is exemplified in the Annotatability code package. The full procedure can be found in the publicly available code via GitHub at https://github.com/nitzanlab/Annotatability.
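A minimal PyTorch sketch of this setup is shown below; the layer widths and input dimensions are illustrative assumptions, not the package’s actual defaults, and training-loop bookkeeping (weighted sampling, per-epoch recording) is omitted.

```python
import torch
import torch.nn as nn

class AnnotationNet(nn.Module):
    """Three fully connected layers with ReLU, log-softmax output."""
    def __init__(self, n_genes, n_labels, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_genes, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_labels), nn.LogSoftmax(dim=1),
        )

    def forward(self, x):
        return self.net(x)

model = AnnotationNet(n_genes=2000, n_labels=5)   # toy dimensions
# NLLLoss on log-probabilities is equivalent to cross-entropy on logits
criterion = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

x = torch.randn(8, 2000)       # toy batch of 8 cells
log_probs = model(x)
probs = log_probs.exp()        # exponentiate back to a probability distribution
```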

Annotation-trainability score

We define the annotation-trainability positive association score (gj per gene j), which utilizes training dynamics to identify genes with higher expression levels in cell states associated with a given annotation. Using the annotation-trainability association score, genes can be ranked on the basis of the correlation between their expression in each cell and the corresponding confidence scores:

$$\mathbf{g}={A}^{T}\cdot \mathbf{f}(\mathbf{\hat{u}}),$$
(3)

where \(\mathbf{g}\) is the vector of gene association scores, A is the gene expression matrix (rows correspond to cells and columns correspond to genes), f: R → R is a monotonically increasing function for rescaling confidence scores (by default, Annotatability uses the identity function \(f(\hat{{u}_{i}})=\hat{{u}_{i}}\)), and \(\mathbf{f}\) is a vector-valued function applying f(x) to every entry in the vector of cell confidence scores \(\mathbf{\hat{u}}\). As a preprocessing step, the expression level of each gene is scaled to unit variance.

Similarly, we define an annotation-trainability negative association (\({\widetilde{g}}_{j}\)), which identifies genes with lower expression levels in cell states associated with a given annotation based on training dynamics:

$$\mathbf{\widetilde{g}}={A}^{T}\cdot \mathbf{f}(1-\mathbf{\hat{u}}).$$
(4)

For the special case where \(\mathbf{f}(x)\) is the identity function, \(\mathbf{\widetilde{g}}\) equals \(C-\mathbf{g}\) for some constant C and, thus, \(\mathbf{\widetilde{g}}\) ranks genes in the reverse order relative to g. However, for the general case where f(x) may be a nonlinear monotonically increasing function, genes ranked by \(\mathbf{\widetilde{g}}\) may be sorted differently from the reverse of their rankings by g.
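Equations (3) and (4) amount to a matrix–vector product after the per-gene scaling. A minimal sketch on random toy data (the counts and confidence scores are made up); with the identity f, the two scores sum to the per-gene totals, so their rankings are exact reverses:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(2.0, size=(100, 4)).astype(float)   # toy cells x genes matrix
u_hat = rng.uniform(0.0, 1.0, size=100)             # toy per-cell confidence scores

A = A / A.std(axis=0)            # scale each gene to unit variance

f = lambda x: x                  # default rescaling function: identity
g_pos = A.T @ f(u_hat)           # equation (3): positive association score
g_neg = A.T @ f(1.0 - u_hat)     # equation (4): negative association score

# For identity f, g_neg equals the per-gene column sums minus g_pos,
# so ranking by g_neg reverses the ranking by g_pos.
```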

Classifying cells on the basis of confidence and variability scores and setting thresholds

To classify cells as easy to learn, hard to learn or ambiguous, Annotatability applies a threshold on the computed confidence and variability scores. This threshold is dataset dependent, similarly to the case of area under the margin ranking statistic in the context of computer vision17.

In many cases, the training process generates well-separated groups of cells in the confidence–variability plane, corresponding to low confidence/low variability, mid-confidence/high variability and high confidence/low variability; in such cases, a threshold can be set manually to assign these groups to hard to learn, ambiguous and easy to learn, respectively. We next discuss how to set each of the thresholds when the cells cannot be easily clustered as described above.

Threshold for cells that are hard to learn

In the general case, we infer the threshold for hard-to-learn cells as follows. (1) Randomly sample c cells (5–10% of the dataset) and change their annotations (for each, sample a different annotation out of the set of input annotations uniformly at random). (2) Train a DNN classifier (as described in ‘Initial training phase’ section) and record the training dynamics. Set the threshold over the confidence and variability scores for hard-to-learn cells as the qth percentile of the confidence and variability scores over the c reannotated cells. q is chosen in a dataset-dependent manner, between 25% (for datasets with a relatively low fraction of erroneously annotated cells, such as the PBMC scRNA-seq dataset) and 90% (for datasets with a relatively high fraction of erroneously annotated cells, such as spatial transcriptomics).
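This corruption-and-percentile procedure can be sketched as follows; `train_and_score` is a hypothetical stand-in for the actual DNN training run, returning random toy confidence scores rather than real training dynamics.

```python
import numpy as np

rng = np.random.default_rng(1)
labels = rng.integers(0, 3, size=1000)   # toy input annotations (3 classes)

frac, q = 0.05, 75                       # corrupt 5% of cells; 75th percentile
corrupt = rng.choice(len(labels), size=int(frac * len(labels)), replace=False)
shuffled = labels.copy()
for i in corrupt:
    # step (1): sample a *different* annotation uniformly at random
    shuffled[i] = rng.choice([c for c in range(3) if c != labels[i]])

def train_and_score(annotations):
    """Hypothetical stand-in for step (2): in practice, train the DNN on
    the corrupted annotations and return per-cell confidence scores."""
    return rng.uniform(0.0, 1.0, size=len(annotations))

confidence = train_and_score(shuffled)
# Threshold = q-th percentile of the corrupted cells' confidence scores
threshold = np.percentile(confidence[corrupt], q)
hard_to_learn = confidence < threshold
```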

Threshold between easy-to-learn and ambiguously annotated cells

When the underlying process captured by the annotations is continuous, we can bypass the need for a threshold altogether by ranking the cells to capture the transition between the easy-to-learn and ambiguous regions in the confidence–variability plane (such as in the case of the EMT process; ‘Uncovering the epithelial-to-mesenchymal pseudotime trajectory’ section in Results). A threshold can be chosen manually if those groups are separated in the confidence–variability plane (as in Fig. 3d). Alternatively, if marker genes for these groups are available, their expression and transition over the confidence–variability plane can be used to inform thresholding. In this manuscript, the threshold for easy-to-learn cells was set consistently for all datasets to be at confidence score 0.95 and variability score 0.15. The threshold can also be tuned, keeping in mind that, with more epochs, the confidence will increase and the variability will decrease. An alternative way to set the threshold is to use the signal-aware graph embedding (‘Trainability-aware graph embedding’ section) and perform graph-based clustering on the inferred graph (as in ‘Characterizing cells and annotations with Annotatability’ and ‘Inference of disease-related cell states and treatment responses’ sections in Results), to generate structure that can potentially distinguish cells by integrating both gene expression and trainability-based measures.
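Applied as a simple rule, the categorization looks as follows. The easy-to-learn cutoffs (0.95 on confidence, 0.15 on variability) are the manuscript’s defaults; the hard-to-learn cutoff shown here is an illustrative assumption (in practice it is set by the percentile procedure of the previous subsection).

```python
import numpy as np

def categorize(confidence, variability,
               conf_easy=0.95, var_easy=0.15, conf_hard=0.3):
    """Three-way categorization in the confidence-variability plane.
    conf_hard is an illustrative placeholder, not a package default."""
    cats = np.full(len(confidence), "ambiguous", dtype=object)
    cats[(confidence >= conf_easy) & (variability <= var_easy)] = "easy to learn"
    cats[(confidence <= conf_hard) & (variability <= var_easy)] = "hard to learn"
    return cats

conf_scores = np.array([0.98, 0.10, 0.55])   # toy per-cell scores
var_scores = np.array([0.05, 0.08, 0.30])
print(categorize(conf_scores, var_scores))   # ['easy to learn' 'hard to learn' 'ambiguous']
```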

Reannotation of erroneously annotated cells

To correct the annotations of cells that were identified by Annotatability as erroneously annotated, a DNN classifier (as described in ‘Initial training phase’ section) is trained exclusively on the subset of cells identified as correctly annotated. The erroneously annotated cells are then reannotated according to the predictions of the newly trained DNN.
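A minimal sketch of this two-stage scheme on synthetic data, with a nearest-centroid rule standing in for the DNN classifier (the actual module retrains the network described in ‘Initial training phase’):

```python
import numpy as np

rng = np.random.default_rng(2)
# Two well-separated toy populations of 50 cells x 10 genes each
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(4, 1, (50, 10))])
y = np.array([0] * 50 + [1] * 50)
y_noisy = y.copy()
y_noisy[0] = 1                                # one erroneous annotation
flagged = np.zeros(100, bool)
flagged[0] = True                             # flagged as erroneously annotated

# Fit only on cells deemed correctly annotated (stand-in for DNN training)
centroids = np.vstack([X[~flagged & (y_noisy == c)].mean(axis=0) for c in (0, 1)])
# Reannotate flagged cells with the fitted classifier's predictions
dists = ((X[flagged, None, :] - centroids[None]) ** 2).sum(-1)
y_noisy[flagged] = dists.argmin(axis=1)
```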

Trainability-aware graph embedding

To construct a trainability-aware gene expression graph, we first compute pairwise distances between cells by integrating gene expression information and training dynamics statistics. Specifically, the trainability-aware gene expression distance matrix \(\widetilde{W}\) is computed as

$${\widetilde{W}}_{ij}=\alpha \times {W}_{ij}+(1-\alpha )\times \frac{(| {\hat{u}}_{i}-{\hat{u}}_{j}| )}{N},$$
(5)

where W is the Euclidean distance matrix, Wij is the Euclidean distance between cells i and j in gene expression space following dimensionality reduction using PCA (default number of principal components: 50), \({\hat{u}}_{i}\) is the confidence score for cell i, N is the mean value of \(| {\hat{u}}_{i}-{\hat{u}}_{j}|\) across all pairs 1 ≤ i, j ≤ n (where n is the total number of cells) and α is a tunable parameter 0 ≤ α ≤ 1, which interpolates between a gene expression-based distance matrix (α = 1) and a trainability-based distance matrix (α = 0).

Next, the trainability-aware expression distance matrix \(\widetilde{W}\) is transformed into an affinity matrix M using a Gaussian kernel:

$${M}_{ij}=\exp (-{\widetilde{W}}_{ij}/\bar{\widetilde{W}}),$$
(6)

where \(\bar{\widetilde{W}}\) is the mean over all values in \(\widetilde{W}\). Finally, the trainability-aware expression graph is constructed by computing a KNN graph (k = 15) over the affinity matrix M.
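The blending and kernel steps can be sketched on toy data as follows; the kernel is written with a negative sign so that affinity decays with distance, and the final KNN-graph construction (k = 15) is omitted. The dimensions and α value are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
X_pca = rng.normal(size=(30, 5))     # toy cells after PCA (30 cells, 5 PCs)
u_hat = rng.uniform(0, 1, size=30)   # toy per-cell confidence scores
alpha = 0.5                          # interpolation parameter, 0 <= alpha <= 1

# Equation (5): blend Euclidean expression distances with confidence differences
W = np.linalg.norm(X_pca[:, None] - X_pca[None], axis=-1)   # pairwise Euclidean
D = np.abs(u_hat[:, None] - u_hat[None])                    # |u_i - u_j|
N = D.mean()                                                # mean over all pairs
W_tilde = alpha * W + (1 - alpha) * D / N

# Equation (6): Gaussian-type kernel; affinity decays with distance
M = np.exp(-W_tilde / W_tilde.mean())
```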

Data preprocessing

We used standard scRNA-seq preprocessing, which includes per-cell normalization (to 10,000 counts) and a log transformation applied to stabilize variance (\(\log ({\rm{normalized}}\,{\rm{expression}}\,{\rm{value}}+1)\)). For the EMT dataset, we used the raw data and filtered out low-quality cells21: cells with a high fraction of mitochondrial counts (≥5% of total counts), cells with a high total count number (more than 4,000 expressed genes) and cells expressing fewer than 200 genes. In addition, we retained only genes that were expressed in more than 3 cells and were highly variable (3,000 genes; sc.pp.filter_genes_dispersion21). The remaining datasets were already preprocessed, so we did not perform feature selection on them. For the PBMC dataset, we excluded cells annotated as megakaryocytes owing to their low number (88/11,990).
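The normalization and log steps (mirroring scanpy’s sc.pp.normalize_total and sc.pp.log1p) can be mimicked directly in NumPy; a minimal sketch on a toy count matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
counts = rng.poisson(1.0, size=(5, 100)).astype(float)   # toy cells x genes counts

# Per-cell normalization to 10,000 counts
totals = counts.sum(axis=1, keepdims=True)
normalized = counts / totals * 1e4

# Variance-stabilizing log transformation: log(normalized value + 1)
logged = np.log1p(normalized)
```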

Distribution across the confidence–variability plane

The distribution across the confidence–variability plane produced by Annotatability for a given dataset of annotated cells can reflect the difficulty of learning that annotated dataset and the relative fractions of cells in ambiguous and hard-to-learn states within it. Therefore, we suggest a top-down measure for comparing the hardness of learning different classes of annotations for the same dataset, specifically, comparing those classes, following identical training procedures, by calculating the variance of both the confidence and variability scores, as inferred by Annotatability, across all cells.
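This measure reduces to a variance computation per score. A minimal sketch with made-up score vectors, contrasting a cell-type-like annotation class (tight, mostly easy to learn) with a condition-like class (broad spread):

```python
import numpy as np

def score_spread(confidence, variability):
    """Variance of the confidence and variability scores across all cells,
    as a top-down measure of how hard an annotation class is to learn."""
    return float(np.var(confidence)), float(np.var(variability))

# Toy scores: tight cluster of easy-to-learn cells vs broad spread
tight = score_spread(np.array([0.97, 0.98, 0.96, 0.99]),
                     np.array([0.02, 0.03, 0.02, 0.01]))
broad = score_spread(np.array([0.20, 0.90, 0.50, 0.75]),
                     np.array([0.10, 0.30, 0.25, 0.05]))
# The broad class has larger variance in both scores, indicating a
# harder-to-learn set of annotations.
```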

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.