Abstract
Electronic Health Record (EHR) systems are particularly valuable in pediatrics due to high barriers in clinical studies, but pediatric EHR data often suffer from low content density. Existing EHR code embeddings tailored for the general patient population fail to address the unique needs of pediatric patients. To bridge this gap, we introduce a transfer learning approach, MUltisource Graph Synthesis (MUGS), aimed at accurate knowledge extraction and relation detection in pediatric contexts. MUGS integrates graphical data from both pediatric and general EHR systems, along with hierarchical medical ontologies, to create embeddings that adaptively capture both the homogeneity and heterogeneity between hospital systems. These embeddings enable refined EHR feature engineering and nuanced patient profiling, proving particularly effective in identifying pediatric patients similar to specific profiles, with a focus on pulmonary hypertension (PH). MUGS embeddings, resistant to negative transfer, outperform other benchmark methods in multiple applications, advancing evidence-based pediatric research.
Introduction
Real-world biomedical data is difficult to obtain but crucial to studying disease diagnosis and management. Clinical trials and prospective observational studies are the gold-standard methods for collecting such data but are time- and resource-intensive1,2,3. Pediatric trials and studies are particularly difficult due to factors including additional legal regulations, ethical dilemmas, and commercial profitability concerns4,5,6. The proportion of clinical drug trials that are pediatric lags significantly behind the pediatric proportion of studied disease burdens4,6,7,8,9. Moreover, many pediatric clinical trials suffer from poor enrollment and high termination rates5,9,10.
The difficulty and dearth of pediatric trials and studies contribute to a lack of real-world evidence for the pediatric population. Given the persistent shortfall of pediatric research in spite of governmental incentivization and regulation11, data stored in electronic health record (EHR) systems provide a valuable alternative source of real-world evidence. EHR data in the form of diagnostic and procedure billing codes, laboratory test records, and medication prescriptions push the frontier of clinical research. For example, EHR data have been used to evaluate treatments and patient care models as well as to develop predictive models for diagnosis, treatment, and clinical outcomes12,13,14,15,16,17,18,19,20,21,22.
The wealth of data in EHR systems expands the breadth of biomedical modeling and research, but the scale of the data also poses challenges. A critical step in studying a disease of interest using EHR data is curating a cohort from the clinical population that satisfies specific study criteria. Computable phenotypes are a standardized, machine-readable approach for defining and identifying cases of a disease of interest using patients’ demographic profiles, symptoms, laboratory tests, and other clinical information. Given their broad consideration of clinical characteristics, computable phenotypes are also particularly useful in the study and development of precision medicine. Manually selecting or creating the relevant medical features for computable phenotypes from thousands of EHR features is time-intensive and requires extensive domain knowledge. For poorly understood diseases, feature selection is even more difficult23,24. This has opened another area of EHR clinical research into medical feature extraction and concept representations25,26, such as embeddings taking the form of lower dimensional vector representations. Algorithmic methods for developing such representations can efficiently synthesize patient data and existing knowledge.
Several automated and semi-automated methods for knowledge extraction have been developed to guide knowledge graph construction, feature selection, and modeling strategies22,27,28,29,30,31,32,33. Many of these representations, however, are trained on general population or adult EHR data and are thus not directly relevant to or accurate for pediatric populations. Child-specific diseases, treatments, and preventative care create fundamental vocabulary differences between adult and pediatric EHR data5,34. Even for shared EHR features, knowledge from the general population may not precisely reflect pediatric disease progression or management. Pediatric patients often require higher standards of precision or different rounding strategies in numeric calculations that are not reflected in adult EHR data or studies5,34. Metabolic pathways and rates, receptor functions, and homeostatic mechanisms change from childhood to adulthood. In addition to physiological differences, pharmacological factors such as dosing protocols, drug efficacy and side effects, and therapeutic windows for pediatric patients also differ from those of adult patients35,36,37,38. The relative importance of features such as social history and caregiver information may also differ between pediatric and general population patients5,34.
Knowledge extraction from pediatric EHR data is particularly difficult, however, due to the relative health of children and the resulting sparsity of pediatric EHR data. Long before the introduction of EHR systems, pediatric healthcare recordkeeping developed to document well-check visits while the problem-oriented evolution of general healthcare records reflected the general medical care model39. Approximately half of pediatric healthcare visits are well-check visits, and many developmental screening tests and vaccinations are unique to pediatric patients5,34. The preventative care model for children renders pediatric EHR data uniquely sparse as compared to EHR data for the general population. As such, existing methods for knowledge extraction from general EHR data are ill-equipped to handle the unique sparsity of pediatric EHR data.
In this study, we introduce the MUltisource Graph Synthesis (MUGS) algorithm, designed to accurately learn pediatric-specific EHR code embeddings by leveraging large-scale EHR data of the general patient population from Mass General Brigham (MGB) to enhance the sparse EHR data from Boston Children’s Hospital (BCH). Utilizing existing hierarchical medical ontology as prior general knowledge, MUGS facilitates efficient transfer learning by grouping similar codes, thereby enhancing the transferability of knowledge from general to pediatric systems. To address the heterogeneity within code groups and between sites, we propose to decompose a code embedding into three components: the group effect, defined based on the hierarchical medical ontology; the site-nonspecific code effect, capturing characteristics of a code that differ from its group effect and are shared between health systems; and the code-site effect, identifying site-specific characteristics of a code. Importantly, this decomposition, coupled with penalty functions applied to the code and code-site effects, provides adaptability to varying degrees of heterogeneity within code groups and between sites and protects against the adverse effects of negative knowledge transfer via hyperparameter tuning, marking a significant advancement in the field.
We demonstrate the superior quality of MUGS embeddings, which outperform benchmarks by more effectively identifying related code pairs and accurately quantifying code relatedness in pediatric populations. Notably, the MUGS algorithm discerns homogeneous and heterogeneous codes between two EHR systems and can potentially improve the understanding of disparities in disease manifestation, progression, and comorbidity development between pediatric and general populations, as well as variations in medical practices between health systems. The MUGS embeddings can be used to facilitate feature engineering for downstream predictive modeling tasks, especially for underrepresented patient groups or rare diseases with scant literature. We demonstrate their exceptional performance in feature selection via examples of pediatric epilepsy and PH. Another primary benefit of high-quality EHR code embeddings is their contribution to patient representation learning and detailed patient-level analysis. By aggregating these embeddings, we create comprehensive patient profiles that aid in the precise and unbiased development of computable phenotypes, the curation of study cohorts, and ‘patient-like-me’ identification tailored to the pediatric population. The MUGS embeddings, rich in information about the interconnectedness among EHR concepts, also provide a vital foundation for developing pediatric knowledge graphs.
Results
Data preprocessing
We utilized EHR data from MGB encompassing 2.5 million patients and BCH encompassing 0.25 million patients in four codified domains: PheCode for diagnoses, RxNorm for medications, Logical Observation Identifiers Names and Codes (LOINC)40 for laboratory measurements, and Clinical Classifications Software (CCS) for procedures. The demographics of these patients are provided in Supplementary Table 1. After frequency control, 3055 codes are shared between the two hospital systems, while 1221 codes are unique to BCH and 2350 codes are unique to MGB. We then generated a summary-level co-occurrence matrix of EHR codes for each site as described in25. The co-occurrence matrix of MGB data is the same as that used in27. To allow for further analysis of the dependency between codes, we constructed a shifted positive pointwise mutual information (SPPMI) matrix41 from each co-occurrence matrix. We further utilized the existing top-down hierarchical structure of medical ontologies shared across different hospital systems to create code groups. We obtained 337 groups for PheCodes, 304 groups for RxNorms, and 700 groups for LOINCs.
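As a rough sketch, the SPPMI transformation applied to each site's co-occurrence matrix can be written in a few lines of numpy (this is the standard shifted positive PMI construction; the shift parameter k and any site-specific preprocessing here are illustrative, not taken from this paper):

```python
import numpy as np

def sppmi(C, k=1):
    """Shifted positive PMI from a symmetric code co-occurrence count matrix C.

    SPPMI(i, j) = max(PMI(i, j) - log k, 0), with PMI computed from marginal counts.
    """
    C = np.asarray(C, dtype=float)
    total = C.sum()
    row = C.sum(axis=1, keepdims=True)   # marginal count of code i
    col = C.sum(axis=0, keepdims=True)   # marginal count of code j
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(C * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0         # zero co-occurrence -> PMI of -inf -> clipped
    return np.maximum(pmi - np.log(k), 0.0)
```

The resulting sparse, nonnegative matrix is what the downstream factorization steps operate on.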
Model overview: knowledge transfer via MUGS
MUGS co-trains embeddings for EHR codes using SPPMI matrices from MGB and BCH. Due to the imbalance of EHR data between MGB and BCH, it transfers useful knowledge from the information-rich MGB site to enhance the quality of embeddings in the information-sparse BCH site. Each embedding is decomposed into the sum of three distinct components: the group effect, the code effect, and the code-site effect. The group effect incorporates prior knowledge from the hierarchical medical ontologies, which is shared between sites. This approach allows for the grouping of codes based on their similarity, aiming to improve knowledge transferability between sites. The code effect represents the knowledge of a code that differs from the effect of the group it belongs to and is shared between health systems. The code-site effect depicts site-specific knowledge, which is crucial for capturing potential heterogeneity between sites. To illustrate the roles of the three components, consider type 1 diabetes and type 2 diabetes. Both belong to the same group, diabetes mellitus, and share the same group effect. However, the two code effects are distinct as type 1 diabetes involves the immune system attacking insulin-producing cells, while type 2 diabetes is characterized by insulin resistance or insufficient insulin production over time. Moreover, the progression and comorbidities of type 1 diabetes and type 2 diabetes differ between pediatric and adult patients. Consequently, the MGB code-site effects will differ from the BCH code-site effects. In contrast, for conditions that manifest similarly in pediatric and adult patients, the code-site effects would be very close to zero.
To adaptively capture the degree of heterogeneity within groups and between sites, we employ two penalties: one for the code effects and the other for the code-site effects. In the extreme scenario where the penalty for code-site effects approaches infinity, all code-site effects shrink to zero, indicating complete homogeneity between sites. In this case, any code will have the same embedding regardless of the site. Furthermore, if the penalties for both the code effects and the code-site effects simultaneously approach infinity, both effects shrink to zero. Consequently, codes within the same ontology-defined group will share a uniform embedding between sites. Conversely, when both penalties are set to zero, MUGS reduces to the single-site benchmark Skip-Gram method42, resulting in no knowledge transfer. Consequently, MUGS is guaranteed to perform no worse than the single-site benchmark and is thus robust against negative transfer. The key advantage of MUGS is its ability to adaptively capture varying levels of heterogeneity for each code through hyperparameter tuning, effectively identifying both homogeneous and heterogeneous codes. Specifically, as depicted in Fig. 1, Panel (b), MUGS assigns the same embedding to codes with similar meanings between general and pediatric patients, such as bacterial enteritis, while assigning different embeddings to codes that function differently between these patient groups, such as epilepsy.
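The limiting behavior of the two penalties can be illustrated with a hypothetical group-lasso-style soft-thresholding step (the actual penalized optimization is specified in the ‘Methods’ section; the `shrink` helper and the penalty form below are illustrative only):

```python
import numpy as np

def shrink(x, lam):
    """Group-lasso soft-thresholding: zero out the whole effect vector if its
    norm falls below the penalty level, otherwise shrink it toward zero."""
    n = np.linalg.norm(x)
    return np.zeros_like(x) if n <= lam else (1.0 - lam / n) * x

def code_embedding(u_group, v_code, w_code_site, lam_code, lam_site):
    """Embedding = group effect + penalized code effect + penalized code-site effect."""
    return u_group + shrink(v_code, lam_code) + shrink(w_code_site, lam_site)
```

With both penalties at zero no shrinkage occurs (the Skip-Gram limit); with both penalties effectively infinite, the embedding collapses to the ontology-defined group effect shared across sites.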
The MUGS algorithm has four main steps: (i) conducting singular value decomposition (SVD) of the SPPMI matrix within each site to create two sets of embeddings, (ii) performing an orthogonal transformation to align the two sets of embeddings, (iii) training initial estimators for group effects, code effects, and code-site effects by pooling aligned embeddings between sites, (iv) updating the group effects, code effects, and code-site effects through alternating and iterative solutions of penalized linear regressions until convergence. Figure 1, Panel (a) outlines the procedures with MGB and BCH data, and a more detailed and comprehensive explanation can be found in the ‘Knowledge Transfer via MUGS’ subsection of the ‘Methods’ section. As depicted in Fig. 1, Panel (c), using MUGS embeddings as input, we can perform a wide range of downstream tasks, such as pediatric knowledge graph construction, pediatric patient embedding generation, and more.
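Steps (i) and (ii) can be sketched in numpy, assuming the orthogonal transformation is an orthogonal Procrustes fit on the rows corresponding to codes shared by both sites (the exact alignment procedure is detailed in the ‘Methods’ section):

```python
import numpy as np

def spectral_embed(S, dim):
    """Step (i): rank-`dim` factorization of a symmetric SPPMI matrix into
    code embeddings (rows)."""
    U, s, _ = np.linalg.svd(S, hermitian=True)
    return U[:, :dim] * np.sqrt(s[:dim])

def procrustes_rotation(X_src, X_tgt):
    """Step (ii): orthogonal matrix R minimizing ||X_src @ R - X_tgt||_F,
    fit on the embeddings of shared codes; apply as X_src @ R."""
    U, _, Vt = np.linalg.svd(X_src.T @ X_tgt)
    return U @ Vt
```

Steps (iii) and (iv) then initialize and alternately refit the group, code, and code-site effects by penalized linear regression on the aligned embeddings.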
Accuracy, adaptivity, and robustness of MUGS compared with benchmark methods
We compared our MUGS method with four benchmark methods: SapBert43, Coder44, Skip-Gram, and PreTrain. Coder and SapBert are two pre-trained large language models that solely utilize the descriptions of each medical code from BCH, without considering EHR data. Skip-Gram, a single-site word embedding method, is equivalent to conducting SVD on the SPPMI matrix from BCH41. PreTrain aggregates co-occurrence information from both MGB and BCH into a single co-occurrence matrix by summing the counts of co-occurrences of EHR codes and then applies SVD to the aggregated SPPMI to obtain embeddings for each code. With this approach, codes present in both sites are assigned the same embedding. The embedding dimension for Coder and SapBert was set at 768, as used in43,44, while the dimension for the other methods was set at 500. A detailed discussion on the selection of embedding dimensions is available in the ‘Discussion’ section.
As illustrated in Fig. 2, we evaluated the pediatric EHR code embeddings trained using different methods through labels curated from three primary references: pediatric articles from reputable websites, surveys conducted with pediatric physicians, and outputs from the generative pre-trained transformer (GPT) models. We refer to these labels as ‘silver-standard’ due to potential inaccuracies arising from natural language processing (NLP) and the mapping of unified medical language system (UMLS) clinical concepts to EHR codes, conflicting survey results, and the variability introduced by prompt engineering in GPT models. Furthermore, we corrected the survey labels with conflicting results and curated 1156 gold-standard positive labels by manual review. Silver-standard labels curated from pediatric articles show high reliability with a 91% concordance rate. Details on the curation procedures and the quality assessment of silver-standard labels are available in the ‘Label Curation’ subsection of the ‘Methods’ section.
a Semi-automatic positive label curation from pediatric articles on the websites of BCH, Cincinnati Children’s Hospital, and UpToDate; b labels categorized into three levels: ‘strongly related’, ‘maybe related’, and ‘not related’, by pediatric physicians through a survey on six target diseases; c automatic relatedness score generation by GPT-3.5 and GPT-4 for the six target diseases.
We first assessed the quality of the embeddings by measuring the accuracy of the cosine similarity between two embeddings of each code pair. This measurement was used to classify gold-standard positive pairs versus random pairs. Pairs manually curated based on pediatric literature indicate a strong relationship between two codes, whereas the sparsity of the network allows us to assume that each random pair represents a weak or no relationship. Table 1 summarizes the accuracy of different methods using the area under the ROC curve (AUC) in classifying these pairs. Given the resource-intensive nature of manual label curation, we also calculated the AUC of different methods in classifying a substantially larger number of silver-standard positive pairs curated from pediatric articles versus random pairs, as detailed in Fig. 3. Our results indicate that MUGS outperforms all four benchmark methods in identifying the gold-standard and silver-standard positive pairs curated from different expert sources, as evidenced by higher AUCs. The poor performance of Coder and SapBert, which were pre-trained based on the UMLS medical knowledge graph developed for the general patient population, underscores the heterogeneity between the general patient population and the pediatric patient population. It also suggests that EHR data from BCH is a valuable source for enhancing the quality of pediatric embeddings. Additionally, PreTrain, which leverages EHR data from MGB, beats Skip-Gram, which only utilizes EHR data from BCH in identifying silver-standard pairs. This suggests a degree of similarity between MGB and BCH, enabling knowledge transfer from MGB to improve the quality of pediatric embeddings.
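This evaluation reduces to scoring each code pair by the cosine similarity of its embeddings and computing a Mann–Whitney-style AUC over positive versus random pairs; a minimal sketch:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def pairwise_auc(pos_scores, neg_scores):
    """Mann-Whitney AUC: probability that a known-related pair scores higher
    than a random pair (ties counted as half)."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()
```

An AUC of 1.0 means every positive pair outscores every random pair; 0.5 means the similarities are uninformative.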
The large volume of silver-standard labels has allowed for a more comprehensive evaluation, revealing insights into why MUGS consistently outperforms other EHR-based benchmarks. To further delineate the advantages of MUGS, we categorized EHR codes into two groups based on their occurrence frequencies within patient records: rare codes, seen in less than 0.1% of patients, and frequent codes, observed in 0.1% or more of patients. This categorization threshold ensures a sufficient sample of rare codes for accurate evaluation. The accuracies of different configurations are summarized in Fig. 4. Compared with the single-site method Skip-Gram, both PreTrain and MUGS exhibit higher accuracies in terms of rare codes by leveraging knowledge from MGB. More importantly, MUGS surpasses PreTrain by enabling the adaptive identification of heterogeneous codes, especially frequent ones, which possess distinct embeddings at MGB and BCH.
Comparison of AUC between MUGS and four benchmark methods, utilizing silver-standard labels curated from two distinct types of pediatric content: a articles from children’s hospital websites (BCH and CCH) and b pediatric articles on the UpToDate website. The analysis differentiates between ‘rare pairs’, which include at least one code seen in less than 0.1% of patients, and ‘frequent pairs’, which include at least one code observed in 0.1% or more of patients at BCH.
In addition to the above binary classification, we provided a more detailed evaluation utilizing ordinal labels, which allow for grading relevance in varying degrees. We evaluated the pediatric embeddings’ quality by calculating the rank correlation (Kendall’s tau) between the cosine similarities, survey majority votes, and relatedness scores from GPT-3.5 and GPT-4. Figure 5 shows that MUGS significantly outperforms benchmark methods, achieving the highest rank correlations with both manually corrected survey labels and GPT-generated silver-standard labels. Further details on evaluation methods are provided in the ‘Evaluation Methods’ subsection of the ‘Methods’ section.
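For reference, Kendall’s tau over paired relatedness scores can be sketched as follows (a naive tau-a; tau-b, which adjusts for tied ordinal labels, is the more common choice when survey votes repeat):

```python
def kendall_tau(x, y):
    """Kendall's tau-a between two paired score lists: (concordant - discordant)
    pair counts divided by the total number of pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            s += (dx > 0) * (dy > 0) + (dx < 0) * (dy < 0) \
                 - (dx > 0) * (dy < 0) - (dx < 0) * (dy > 0)
    return s / (n * (n - 1) / 2)
```

A value near 1 indicates that the cosine similarities rank code pairs in the same order as the ordinal labels.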
Homogeneity and heterogeneity in epilepsy and PH between general and pediatric populations
Figures 6 and 7 illustrate the 15 codes with the highest cosine similarity for epilepsy and PH, respectively, in MGB and BCH based on MUGS embeddings. This comparative analysis demonstrates that MUGS effectively captures both the homogeneity and heterogeneity between general and pediatric patients for these two diseases. Notably, 13 of the top codes for epilepsy and 12 for PH, spanning conditions, medications, procedures, and laboratory tests, appear in both the MGB and BCH lists, highlighting the substantial similarity between the two sites for the respective diseases.
For conciseness, parent PheCodes are omitted if their child codes are presented. The LOINC codes for Valproate, levETIRAcetam, lamoTRIgine, Topiramate, Zonisamide, and OXcarbazepine are derived by moving up one layer in the LOINC hierarchy. For example, Topiramate (LP19239-0) is the parent code encompassing various leaf codes, such as Topiramate [Mass/volume] in Blood (LOINC: 60192-2) and Topiramate [Mass/volume] in Urine (LOINC: 60193-0).
MUGS embeddings also effectively capture expected heterogeneity between populations. For instance, infantile cerebral palsy, a group of intellectual and movement disorders that emerge in early childhood, ranks as the ninth related feature concerning epilepsy at BCH, with a cosine similarity of 0.50, while at MGB, it is the 15th-most related feature with respect to epilepsy, with a cosine similarity of 0.41. Topiramate is among the drugs most closely associated with epilepsy at BCH but has lower cosine similarity at MGB, perhaps reflecting the more diverse uses of the drug in an adult population, including prevention of migraines, treatment of psychiatric disorders, and management of weight45. Shifting to the realm of PH-related features, notable differences between BCH and MGB include rheumatic heart disease being selected only at MGB as a highly associated condition, while complications of cardiac/vascular device and nonrheumatic tricuspid valve disorders are only selected at BCH. These contrasts reflect expected differences in populations where BCH cares primarily for patients with congenital heart disease while MGB cares primarily for patients with acquired heart disease. Similarly, ambrisentan was not selected as a highly associated medication at BCH but was at MGB, reflecting known practice variations between centers.
Additional illustrations of homogeneity and heterogeneity between MGB and BCH, depicted through knowledge network structures, were created using a web API for both diseases. These can be found in Supplementary Figs. 1 and 2.
Epilepsy and PH-related EHR codes identification for pediatric patients
In disease studies, identifying related EHR features from thousands of potential candidates is crucial. Figures 8 and 9 depict feature clouds for epilepsy and PH, respectively. The clouds were created using MUGS and Skip-Gram embeddings based on cosine-similarity cutoffs for four categories: PheCode-PheCode, PheCode-RxNorm, PheCode-LOINC, and PheCode-CCS. Detailed information on how these cutoffs were selected is provided in the ‘Feature Selection: Cutoff of Cosine Similarity’ subsection of the ‘Methods’ section.
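A quantile-based cutoff of this kind can be sketched as follows (the quantile `q` and the single shared cutoff are illustrative; the paper selects cutoffs separately for each of the four code-pair categories, as described in ‘Methods’):

```python
import numpy as np

def select_by_quantile(E, target, candidates, q=0.5):
    """Keep candidate codes whose cosine similarity to the target code's
    embedding meets or exceeds the q-th quantile of candidate similarities."""
    t = E[target] / np.linalg.norm(E[target])
    sims = E[candidates] @ t / np.linalg.norm(E[candidates], axis=1)
    cutoff = np.quantile(sims, q)
    return [c for c, s in zip(candidates, sims) if s >= cutoff]
```

Feature clouds are then drawn from the selected codes, with word size proportional to each code’s similarity to the target disease.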
a The feature cloud derived from MUGS embeddings; b the feature cloud derived from Skip-Gram embeddings. Codes selected by both methods are highlighted in purple. Codes exclusively selected by MUGS are shown in blue, and codes exclusively selected by Skip-Gram are in green. The size of each word is proportional to the cosine similarity between the corresponding code’s embedding and the epilepsy embedding, with larger words indicating higher similarity.
a The feature cloud derived from MUGS embeddings; b the feature cloud derived from Skip-Gram embeddings. Codes selected by both methods are highlighted in purple. Codes exclusively selected by MUGS are shown in blue, and codes exclusively selected by Skip-Gram are in green. The size of each word is proportional to the cosine similarity between the corresponding code’s embedding and the PH embedding, with larger words indicating higher similarity.
For epilepsy, with the same cutoff selection method, the number of selected features using Skip-Gram embeddings is 16, compared to 130 using MUGS embeddings, primarily due to the sparsity of BCH EHR data. Consequently, important features such as infantile cerebral palsy and lacosamide, an antiepileptic medication, were not selected by the Skip-Gram method but were captured by our MUGS method. MUGS embeddings also identified detailed CSF testing, including CSF amino acid profiling and neurotransmitter measurements, important components of diagnosing epilepsy that are particular to the pediatric population.
Similarly, in the context of PH, Skip-Gram embeddings identified 33 related features, whereas MUGS embeddings identified 58. For example, Bosentan, which is primarily used to treat PH, was solely selected by our MUGS method. MUGS also uniquely identified the association between PH and pulmonary embolism. In contrast, the Skip-Gram method identified degeneration of the macula and posterior pole of the retina as a relevant condition, but the association between this condition and PH is not readily apparent.
Case study: phenotyping and sub-phenotyping pediatric PH
We further validated the utility of MUGS embeddings by (i) first classifying PH status among patients who received any PH diagnostic code (PheCode:415, PheCode:415.2, or PheCode:415.21) before age 18; and then (ii) sub-phenotyping PH for predicting progression. The study cohort consisted of 13,510 pediatric patients from BCH, among whom 419 were manually annotated by domain experts46, identifying 256 as true PH cases and the others as not having PH. The remaining 13,091 patients were unlabeled. Demographic information of the cohort can be found in Supplementary Table 2. We trained and validated the PH classification algorithms using the labeled dataset and then classified the PH status of the unlabeled patients. For those classified as PH-positive, we proceeded with sub-phenotyping through unsupervised clustering and validated the findings against mortality risks. Unlike existing computable phenotyping methods, we represented each patient using the weighted sum of the embeddings of codes the patient received, with weights determined by the code frequency and the cosine similarity between each code’s embedding and the embedding of PH (PheCode:415.2).
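A sketch of this patient representation, with the caveat that the exact functional form of the frequency weight is specified in the ‘Methods’ section (the log(1 + count) scaling below is an illustrative choice, not the paper’s):

```python
import numpy as np

def patient_embedding(code_counts, E, ph_idx):
    """Weighted sum of code embeddings for one patient; each code's weight is
    log(1 + count) times its cosine similarity with the PH embedding."""
    ph = E[ph_idx] / np.linalg.norm(E[ph_idx])
    v = np.zeros(E.shape[1])
    for code, count in code_counts.items():
        e = E[code]
        w = np.log1p(count) * (e @ ph) / np.linalg.norm(e)
        v += w * e
    return v
```

Codes unrelated to PH receive weights near zero, so the resulting patient vector is concentrated on the PH-relevant portion of the record.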
PH classification
We compared embedding-based PH classification methods with three embedding-free computable phenotyping benchmarks. For embedding-based algorithms, we assembled patient embeddings with code embeddings learned by Coder, SapBert, Skip-Gram, and MUGS. For embedding-free methods, we represented patients only based on their feature counts and considered three approaches to selecting features: (i) the ICD benchmark relying solely on the count of PheCode:415.2, (ii) 70 EHR codes selected via the KESER feature selection algorithm27, and (iii) 1140 EHR codes without any feature selection. With patients represented either by their embeddings or feature counts, we trained support vector machine (SVM) algorithms to classify PH status and reported their cross-validated accuracies. The classification results, including AUC, sensitivity, and positive predictive value (PPV) at specificity = 0.95, are summarized in Table 2. Notably, Coder and SapBert are significantly worse than other methods, highlighting the value of EHR data in representation learning once again. MUGS and Skip-Gram outperform the three computable phenotyping benchmarks, exhibiting the efficiency gained through capitalizing on the interconnectedness between medical codes learned with EHR data. These two methods perform similarly due to the straightforward nature of the phenotyping task at hand, though MUGS’s advantages become more apparent in the more complex sub-phenotyping task on PH progression prediction. Further details on the patient embeddings and classification methodology are provided in the ‘Patient Embedding Generation and Classification Methods’ subsection of the ‘Methods’ section.
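Reporting sensitivity and PPV at a fixed specificity amounts to choosing the classification threshold from the controls’ score distribution; a minimal sketch (independent of the classifier that produced the scores):

```python
import numpy as np

def sens_ppv_at_specificity(scores, labels, spec=0.95):
    """Pick the lowest threshold achieving at least the requested specificity,
    then report sensitivity and PPV at that operating point."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    neg = np.sort(scores[labels == 0])
    thr = neg[int(np.ceil(spec * len(neg))) - 1]   # spec fraction of controls below
    pred = scores > thr
    tp = int(np.sum(pred & (labels == 1)))
    fp = int(np.sum(pred & (labels == 0)))
    sens = tp / int(np.sum(labels == 1))
    ppv = tp / max(tp + fp, 1)
    return sens, ppv
```

Fixing specificity at 0.95, as in Table 2, makes the sensitivity and PPV of the different methods directly comparable.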
Stratifying disease progression among PH-positive patients
We compared the survival curves of PH-positive patients by clustering their profiles using Skip-Gram embeddings and MUGS embeddings. As illustrated in Fig. 10, the two groups of PH-positive patients clustered by MUGS are more distinctly separable in terms of survival probability, with a hazard ratio of 3.00 and a 95% confidence interval of (2.15, 4.17); in contrast, the hazard ratio using Skip-Gram embeddings is only 2.03 with a 95% confidence interval of (1.45, 2.84). In practice, patients identified with severe PH progression are subjected to more intensive monitoring and aggressive treatments. The superior clustering performance of MUGS highlights its effectiveness in capturing nuanced aspects of patient profiles, crucial for precise sub-phenotyping.
Kaplan–Meier survival curves comparing mortality between two groups of pediatric patients classified with PH, either manually or through computational algorithms. The groups are created based on patient embeddings derived from Skip-Gram and MUGS code embeddings via the K-means algorithm. The follow-up period begins at the time of the first received PH diagnostic code, with the x-axis representing time in years.
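For reference, the Kaplan–Meier estimate underlying these curves can be computed directly from follow-up times and event indicators; a minimal sketch (the K-means group assignment and hazard-ratio estimation are omitted):

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.  times: follow-up in years from the first
    PH diagnostic code; events: 1 = death observed, 0 = censored.
    Returns (event_time, S(event_time)) pairs; survival drops only at event times."""
    t = np.asarray(times, dtype=float)
    e = np.asarray(events, dtype=int)
    surv, curve = 1.0, []
    for ti in np.unique(t[e == 1]):
        d = np.sum((t == ti) & (e == 1))   # deaths at ti
        n = np.sum(t >= ti)                # patients still at risk at ti
        surv *= 1.0 - d / n
        curve.append((float(ti), float(surv)))
    return curve
```

Computing one curve per cluster and comparing their separation (e.g., via a hazard ratio) is what distinguishes the MUGS and Skip-Gram sub-phenotypes above.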
Discussion
The MUGS approach efficiently and robustly summarizes patient-level EHR data into site-specific code embeddings by leveraging existing hierarchical medical ontologies and co-training EHR data from multiple institutions. The novel integration of these data sources empowers the generation of high-quality embeddings, even from sparse EHR data, adaptively capturing the inherent homogeneity and heterogeneity across patient populations. MUGS outperforms existing methods in several aspects. First, it enhances knowledge transfer across sites via a shared hierarchical medical ontology. Second, it can adaptively capture the homogeneous and heterogeneous codes across sites as well as the varying levels of heterogeneity within code groups through hyperparameter tuning. Finally, it maintains robustness against negative transfer, ensuring no performance degradation when introducing data from another site.
MUGS embeddings are instrumental for several downstream tasks that rely on accurate code embeddings tailored to specific patient populations. This becomes especially pivotal when dealing with information-scarce populations or sites, and it is critical in studying rare diseases and conditions that manifest or progress differently across patient populations. An essential application is automated feature selection, a process utilizing code embeddings to identify relevant concepts for studying a disease. This can range from straightforward cosine similarity cutoffs to more complex feature selection algorithms. Effective feature selection plays an important role in constructing clinical knowledge graphs and enhancing phenotyping efforts. The knowledge network structures built from MUGS embeddings form the basis for knowledge graphs with semantic edges. We conducted a preliminary analysis to classify the ‘may treat’ relationship between medications and phenotypes, using PheCode-RxNorm pairs curated from treatment sections of pediatric articles as labels. By concatenating the embeddings of each pair along with their cosine similarity as predictors, we achieved an AUC of 0.90. This demonstrates the feasibility of constructing pediatric knowledge graphs using MUGS embeddings. Additionally, the inherent knowledge transfer capabilities of the MUGS algorithm enhance the ability of these graphs to integrate information from diverse sources, contributing to more accurate and comprehensive graph structures. Another important application is to create high-quality patient embeddings tailored to target diseases and a time window for pediatric patients and other understudied patient groups. High-quality patient embeddings are crucial for enhancing precision medicine, as they enable the detailed characterization of individual health profiles, facilitating targeted treatment and intervention strategies. 
These embeddings also support the robust analysis of large-scale medical data, improving diagnostic accuracy and the personalization of healthcare by identifying subtle patterns that might be missed with traditional analyses.
While the training of MUGS using BCH and MGB data has already achieved strong performance compared to existing methods, several aspects can be improved. One limitation is that MUGS forces the embedding dimension to be the same across sites, whereas the optimal dimension may vary by site; in practice, we use the largest of these as the shared dimension. Future work could address this limitation by allowing different dimensions for the group, code, and code-site effects at different sites and then concatenating them into a longer embedding for each code at each site. Incorporating code-site weights based on the frequencies of different codes across sites, readily available in EHR data, is another direction worth studying; it could improve the efficiency of identifying heterogeneous codes, especially frequent ones. Additionally, the current evaluation and validation rely on a limited set of pediatric labels, focusing on PheCode-PheCode and PheCode-RxNorm relationships. This could be expanded to include more diverse relationship types, such as those between diseases and laboratory tests or between diseases and procedures; a larger label set would further strengthen the assessment of pediatric embeddings. Moreover, when identifying epilepsy- and PH-related codes for pediatric patients, we used an ad-hoc quantile-based cutoff selection method for feature extraction. Other systematic feature extraction methods, such as sparse embedding regression27, could also be employed and might offer additional insights and improvements in the identification process. Regarding pediatric knowledge graph construction, it is crucial to gather more pediatric edge labels and to further investigate feature engineering methods and classifier configurations.
To create more comprehensive knowledge graphs, incorporating unstructured or uncoded concepts from clinical notes as concept unique identifiers (CUIs) in the construction of co-occurrence matrices would be beneficial. These steps will enhance the depth and utility of the knowledge graphs, enabling more nuanced insights into pediatric health.
In conclusion, we have demonstrated how MUGS excels in versatile and robust knowledge transfer learning. It co-trains EHR data from MGB and BCH and effectively transfers knowledge from general patients to pediatric patients, overcoming the challenges of heterogeneity between the populations and sparse pediatric EHR data. MUGS not only facilitates research on understudied pediatric populations but also provides a transfer learning framework for diverse healthcare populations.
Methods
Data preprocessing
BCH is a quaternary referral center for pediatric care and also serves as a primary and specialty care site for the local community. MGB is a Boston-based non-profit hospital system serving primarily an adult population, although MGB also provides neonatal and some general and subspecialty pediatric care. The analysis included data from a total of 0.25 million patients at BCH and 2.5 million patients at MGB, each of whom had at least one visit with codified medical records.
We gathered four domains of codified data from BCH and MGB, including diagnosis, medication, lab measurements, and procedures. For diagnoses, International Classification of Disease (ICD) codes representing the same general diagnosis were aggregated into a single PheCode using the ICD-to-Phecode mapping from the PheWAS catalog (https://phewascatalog.org/phecodes)47. Local medication codes were standardized to RxNorm codes48, and local laboratory measurement codes were consolidated under the corresponding LOINC codes. Due to differences in coding systems between MGB and BCH, we combined LOINC codes with the same meaning into a new code. For example, LOINC codes 30341-2 (ESR Bld Qn) and 4537-7 (ESR Bld Qn Westergren) are exclusively used at MGB and BCH, respectively. We leveraged the hierarchical ontology of LOINC codes and consolidated these codes into a single representative code, LP16409-2 (Erythrocyte Sedimentation Rate), to enhance the knowledge transferability between sites. Procedure codes, including Current Procedural Terminology (CPT-4), Healthcare Common Procedure Coding System (HCPCS), and ICD-9-PCS and ICD-10-PCS codes, were categorized into Clinical Classifications Software (CCS) categories using the CCS mapping software (https://www.hcup-us.ahrq.gov/toolssoftware/ccs_svcsproc/ccssvcproc.jsp). The top-down hierarchical structures of PheCodes, RxNorm codes, and LOINC codes allow a comprehensive aggregation of the codified data into a common ontology.
We generated a summary-level co-occurrence matrix of EHR codes for each site as described in ref. 25. Initially, we created an individual co-occurrence matrix for the EHR codes extracted from each patient’s records. These square matrices have rows and columns representing EHR codes, and the entries indicate the number of times each pair of codes co-occurred within 30 days in a patient’s EHR. We summed the individual matrices for all patients at each site to produce a summary-level co-occurrence matrix. Entries with co-occurrence numbers less than 200 were set to zero, and we removed any columns and rows composed entirely of zeros to control for frequency. To further analyze the dependency between two codes, we constructed the shifted positive pointwise mutual information (SPPMI) matrix for BCH and MGB.
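The co-occurrence construction above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the event-list representation and all function names are our assumptions.

```python
import numpy as np
from itertools import combinations

def patient_cooccurrence(events, codes, window=30):
    """Count, for one patient, how often each pair of codes co-occurs within
    `window` days. `events` is a list of (code, day) tuples from the records."""
    idx = {c: i for i, c in enumerate(codes)}
    C = np.zeros((len(codes), len(codes)), dtype=int)
    for (c1, d1), (c2, d2) in combinations(events, 2):
        if abs(d1 - d2) <= window:
            C[idx[c1], idx[c2]] += 1
            C[idx[c2], idx[c1]] += 1
    return C

def site_cooccurrence(patients, codes, window=30, min_count=200):
    """Sum per-patient matrices, zero out rare pairs, drop all-zero rows/columns."""
    C = sum(patient_cooccurrence(ev, codes, window) for ev in patients)
    C[C < min_count] = 0                    # frequency control
    keep = ~np.all(C == 0, axis=0)          # symmetric, so rows == columns
    return C[np.ix_(keep, keep)], [c for c, k in zip(codes, keep) if k]
```

With `min_count` lowered for a toy example, a patient whose codes "A" and "B" occur 10 days apart while "C" occurs 100 days later yields a single A-B co-occurrence, and code "C" is dropped entirely.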
We denote the co-occurrence matrix at site \(k\) as \({C}^{\left(k\right)},\) \(k\in \left\{1,\ldots ,K\right\}.\) In this analysis, we have two sites, BCH and MGB, and \(K=2\). We use \({{\mathbb{V}}}_{k}\) to denote the vocabulary, i.e., the set of all EHR codes, at site \(k\), and \({n}_{k}\) to denote the cardinality of \({{\mathbb{V}}}_{k}\). For the \(k\)-th site, the \(\left(i,j\right)\)-th entry of the SPPMI matrix, \({S}^{\left(k\right)}\), is obtained as \({S}_{{ij}}^{\left(k\right)}=\max \left[0,\log \left\{{C}_{{ij}}^{\left(k\right)}{\left({C}_{i,\cdot }^{\left(k\right)}\right)}^{-1}{\left({C}_{\cdot,j}^{\left(k\right)}\right)}^{-1}{C}_{\cdot,\cdot }^{\left(k\right)}\right\}-\log \left(l\right)\right]\), where \(l\) is the negative sampling value, often set to 1 (i.e., no shifting), \({C}_{i,\cdot}^{\left(k\right)}\) and \({C}_{\cdot,j}^{\left(k\right)}\) are the \(i\)-th row sum and \(j\)-th column sum of the co-occurrence matrix \({C}^{\left(k\right)}\), respectively, and \({C}_{\cdot,\cdot}^{\left(k\right)}\) is the sum of all elements in \({C}^{\left(k\right)}\).
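The entrywise SPPMI formula above translates directly into a vectorized computation; this sketch assumes a dense NumPy matrix and an illustrative function name.

```python
import numpy as np

def sppmi(C, l=1):
    """Shifted positive PMI of a co-occurrence matrix C, following the
    entrywise formula: max(0, log{C_ij * C_.. / (C_i. * C_.j)} - log(l))."""
    total = C.sum()
    row = C.sum(axis=1, keepdims=True)
    col = C.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(C * total / (row * col))
    pmi[~np.isfinite(pmi)] = -np.inf   # zero co-occurrence maps to PMI -inf
    return np.maximum(pmi - np.log(l), 0.0)
```

With \(l=1\) the shift vanishes, and entries with zero co-occurrence are truncated to 0 by the outer maximum.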
Knowledge transfer via MUGS
Embedding decomposition
The proposed MUGS algorithm utilizes SPPMI matrices from multiple sites and incorporates shared prior hierarchical medical ontology knowledge. It achieves this by decomposing the embedding of each code at a given site into three component embeddings: group effect, code effect, and code-site effect. Formally, we decompose the \(p\)-dimensional embedding vector for code \(i\) at site \(k\), denoted by \({u}_{{ik}}\), as
\({u}_{{ik}}={\beta }_{{g}_{i}}I\left(\left|{G}_{i}\right| > 1\right)+{\zeta }_{i}+{\delta }_{{ik}}I\left(i\in {{\mathbb{V}}}_{o}\right).\)
In the first term of the decomposition, index \({g}_{i}\in \left\{1,\ldots ,m\right\}\) denotes the group of code \(i\), defined by the hierarchical medical ontologies, \({G}_{i}\) is the corresponding group containing \(\left|{G}_{i}\right|\) codes, and \({\beta }_{{g}_{i}}\) is the embedding of the effect of group \({g}_{i}\). The second term, \({\zeta }_{i}\), is the site-nonspecific effect of code \(i\). In the third term, \({\delta }_{{ik}}\) is the code-site effect of code \(i\) at site \(k\) that can capture heterogeneity across sites, and \({{\mathbb{V}}}_{o}\) is the set of overlapping codes, present in at least two sites, with cardinality \(n\). Note that if a group only contains one code, then the group effect and the code effect are not separable. Similarly, for non-overlapping codes that only exist in one site, code effect and code-site effect are not separable. To ensure model identifiability, we introduce the indicator function \(I\left(\cdot \right)\) and assume the group effect is non-zero only if the code belongs to a nontrivial group (i.e., a group consisting of two or more codes), and code-site effect is non-zero only if the code exists in at least two sites.
Penalized loss function
The MUGS algorithm learns \({\beta }_{{g}_{i}}\), \({\zeta }_{i}\), and \({\delta }_{{ik}}\) for all codes across sites from SPPMI matrices using the following model:
\({S}_{{ij}}^{\left(k\right)}={u}_{{ik}}^{T}{u}_{{jk}}+{\left({\varepsilon }_{k}\right)}_{{ij}},\)
where \({u}_{{ik}}^{T}\) is the transpose of \({u}_{{ik}}\), and the error term \({\varepsilon }_{k}\) represents the noise of the SPPMI matrix observed at site \(k\). To estimate the group, code, and code-site effects \(\theta =\left({\beta }_{1},\ldots ,{\beta }_{m},{\zeta }_{1},\ldots ,{\zeta }_{N},{\delta }_{11},\ldots ,{\delta }_{n1},\ldots ,{\delta }_{1K},\ldots ,{\delta }_{{nK}}\right)\), where \(N\) is the total number of different codes across \(K\) sites and \({\delta }_{{ik}}=0\) if code \(i\) does not exist at site \(k\), we employ a loss function with penalties for the code and code-site effects:
\(L\left(\theta \right)={\sum }_{k=1}^{K}{\sum }_{i,j\in {{\mathbb{V}}}_{k}}{\left({S}_{{ij}}^{\left(k\right)}-{u}_{{ik}}^{T}{u}_{{jk}}\right)}^{2}+{\lambda }_{1}{\sum }_{i}{\left\Vert {\zeta }_{i}\right\Vert }_{2}^{2}+{\lambda }_{2}{\sum }_{i,k}{\left\Vert {\delta }_{{ik}}\right\Vert }_{2},\)
where \({\lambda }_{1}\) and \({\lambda }_{2}\) are two tunable hyperparameters. We highlight that the hyperparameters \({\lambda }_{1}\) and \({\lambda }_{2}\) can be tuned to adaptively capture the degree of heterogeneity within groups and across sites. We can see this by considering the following extreme cases. When the underlying populations of the \(K\) sites share little similarity, both \({\lambda }_{1}\) and \({\lambda }_{2}\) would be selected as zeros. Consequently, the penalty terms would disappear, and the method would reduce to direct matrix factorization with respect to the SPPMI matrix within each site. In this case, no knowledge would be transferred. When the underlying populations of the \(K\) sites are exactly the same, λ2 would approach infinity. Consequently, the code-site effects would be shrunk to zeros, and the resulting embeddings would be dominated by the group and code effects that are shared across sites. When the underlying populations of the \(K\) sites are exactly the same and codes in a group also act in the same way, both λ1 and λ2 would approach infinity. In this case, both code and code-site effects would be shrunk towards zeros and the resulting embeddings would be dominated by group effects. In the analysis with BCH and MGB data, we observed that some codes behave similarly within both pediatric and adult patients, while others exhibit distinct patterns. Certain groups of codes act as cohesive units, while other groups show variability in their individual code behaviors. By tuning λ1 and λ2 carefully, we can effectively identify homogeneous and heterogeneous codes, and control the signal strength of different group effects.
Additionally, we prioritize the code-site effect in the optimization process, which is key to capturing potential heterogeneity across sites. With the unsquared \({L}_{2}\) penalty, code-site effects can be shrunk to exactly zero, implying that these codes are homogeneous across sites. In contrast, the penalty on code effects is a ridge penalty that prevents overfitting and controls the signal strength of group effects. The design of our loss function thus ensures adaptability and robust performance that is no worse than the baseline single-site matrix factorization method, the Skip-Gram method.
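The exact-zero shrinkage induced by the unsquared \({L}_{2}\) penalty can be made concrete through its proximal operator, block soft-thresholding. The sketch below is illustrative (a single code-site effect in isolation, with hypothetical names), not the full MUGS update.

```python
import numpy as np

def prox_group_l2(delta, threshold):
    """Proximal operator of threshold * ||.||_2 (block soft-thresholding).
    The whole vector is shrunk toward zero and becomes exactly zero when its
    norm falls below the threshold -- flagging that code as homogeneous."""
    norm = np.linalg.norm(delta)
    if norm <= threshold:
        return np.zeros_like(delta)
    return (1.0 - threshold / norm) * delta
```

A ridge penalty, by contrast, only rescales a vector and never produces exact zeros, which is why MUGS reserves the unsquared penalty for the code-site effects.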
Optimization algorithm
The non-convex nature of the loss function (3) prevents the use of standard optimization algorithms. To address this, we propose an alternating and iterative method to minimize it within a broader context. Although this paper focuses on SPPMI matrices from EHR data, our MUGS approach can be applied to broader settings where \({S}^{\left(k\right)}\) is asymmetric. For example, \({S}^{\left(k\right)}\in {{\mathbb{R}}}^{{n}_{r,k}\times {n}_{c,k}}\) can be a utility matrix of the \(k\)-th recommendation system, with \({n}_{r,k}\) users and \({n}_{c,k}\) items. Hence, from a methodological perspective, we consider the following more general model:
\({S}_{{ij}}^{\left(k\right)}={u}_{{ik}}^{T}{v}_{{jk}}+{\left({\varepsilon }_{k}\right)}_{{ij}}.\)
Embedding vector \({v}_{{jk}}\) can be similarly decomposed into \({v}_{{jk}}={\beta }_{{g}_{j}^{{\prime} }}^{{\prime} }I\left(\left|{G}_{j}^{{\prime} }\right| > 1\right)+{\zeta }_{j}^{{\prime} }+{\delta }_{{jk}}^{{\prime} }I\left(j\in {{\mathbb{V}}}_{o}^{{\prime} }\right)\), where \({g}_{j}^{{\prime} }\in \{1,\ldots ,{m}^{{\prime} }\}\) denotes the group of item \(j\), \({G}_{j}^{{\prime} }\) is the corresponding group containing \(|{G}_{j}^{{\prime} }|\) items, and \({{\mathbb{V}}}_{o}^{{\prime} }\) is the set of overlapping items, present in at least two sites, with cardinality \({n}^{{\prime} }\). Let \({\theta }^{{\prime} }=\left({\beta }_{1}^{{\prime} },\,\ldots ,\,{\beta }_{{m}^{{\prime} }}^{{\prime} },\,{\zeta }_{1}^{{\prime} },\,\ldots ,\,{\zeta }_{{N}^{{\prime} }}^{{\prime} },\,{\delta }_{11}^{{\prime} },\,\ldots ,\,{\delta }_{{n}^{{\prime} }1}^{{\prime} },\,\ldots ,\,{\delta }_{1K}^{{\prime} },\,\ldots ,\,{\delta }_{{n}^{{\prime} }K}^{{\prime} }\right),\) where \({N}^{{\prime} }\) is the total number of different items across \(K\) sites. The loss function (3) can be correspondingly extended to
\(L\left(\theta ,{\theta }^{{\prime} }\right)={\sum }_{k=1}^{K}{\sum }_{i,j}{\left({S}_{{ij}}^{\left(k\right)}-{u}_{{ik}}^{T}{v}_{{jk}}\right)}^{2}+{\lambda }_{1}\left({\sum }_{i}{\left\Vert {\zeta }_{i}\right\Vert }_{2}^{2}+{\sum }_{j}{\left\Vert {\zeta }_{j}^{{\prime} }\right\Vert }_{2}^{2}\right)+{\lambda }_{2}\left({\sum }_{i,k}{\left\Vert {\delta }_{{ik}}\right\Vert }_{2}+{\sum }_{j,k}{\left\Vert {\delta }_{{jk}}^{{\prime} }\right\Vert }_{2}\right).\)
To optimize the non-convex loss function (5), we first need to create relatively good initial estimators for the effects of interest. The following Steps 1-3 construct initial estimators for \(\theta\) and \({\theta }^{{\prime} }\), while Step 4 updates these effects alternately and iteratively until convergence.
Step 1: Perform SVD on \({S}^{(k)}\) and select the top \(p\) singular values and the corresponding \(p\) left and right singular vectors. Let \({\Sigma }^{(k)}\in {{\mathbb{R}}}^{p\times p}\) be a diagonal matrix whose diagonal elements consist of these top \(p\) singular values, arranged in descending order. Let \({W}^{(k)}\in {{\mathbb{R}}}^{{n}_{r,k}\times p}\) represent the matrix containing the corresponding \(p\) left singular vectors and let \({X}^{(k)}\in {{\mathbb{R}}}^{{n}_{c,k}\times p}\) represent the matrix containing the corresponding \(p\) right singular vectors at site \(k\). The initial embeddings for user \(i\) and item \(j\) at site \(k\) are denoted by \({\widetilde{u}}_{{ik}}={W}_{i}^{(k)}{\left({\Sigma }^{(k)}\right)}^{1/2}\) and \({\widetilde{v}}_{{jk}}={X}_{j}^{(k)}{\left({\Sigma }^{(k)}\right)}^{1/2}\), where \({W}_{i}^{(k)}\) and \({X}_{j}^{(k)}\) are the \(i\)-th row of \({W}^{(k)}\) and \(j\)-th row of \({X}^{(k)}\), respectively.
Step 2: Align the directions of \(K\) sets of initial embeddings via orthogonal transformation using overlapping codes. Let \({\widetilde{U}}^{(k)}={[{\widetilde{u}}_{1k},\ldots ,{\widetilde{u}}_{{n}_{r,k}k}]}^{T}\in {{\mathbb{R}}}^{{n}_{r,k}\times p}\) denote the initial user embedding matrix at site \(k\), and \({\widetilde{U}}_{{{\mathbb{V}}}_{1}\cap {{\mathbb{V}}}_{k}}^{(k)}\) denote the initial embedding matrix at site \(k\) of the users present at both site \(1\) and site \(k\). The estimated orthogonal transformation matrix is \({\hat{Q}}^{(k)}={{argmin}}_{Q\in {{\mathbb{O}}}^{p\times p}}\parallel {\widetilde{U}}_{{{\mathbb{V}}}_{1}\cap {{\mathbb{V}}}_{k}}^{\left(1\right)}-{\widetilde{U}}_{{{\mathbb{V}}}_{1}\cap {{\mathbb{V}}}_{k}}^{\left(k\right)}Q{\parallel }_{F}\), for \(k\in \left\{2,\,\ldots ,{K}\right\}\), where \({{\mathbb{O}}}^{p\times p}\) is the set of all \(p\times p\) orthogonal matrices, and \(\parallel \cdot {\parallel }_{F}\) denotes the Frobenius norm. This can be solved as an orthogonal Procrustes problem with an explicit-form solution49. Then the aligned user and item embedding matrices at site \(k\) are defined as \({\widetilde{U}}^{\left(k\right),{ot}}={\widetilde{U}}^{\left(k\right)}{\hat{Q}}^{(k)}\), \({\widetilde{V}}^{\left(k\right),{ot}}={\widetilde{V}}^{\left(k\right)}{\hat{Q}}^{(k)}\), for \(k\in \{2,\,\ldots ,{K}\}\), \({\widetilde{U}}^{\left(1\right),{ot}}={\widetilde{U}}^{\left(1\right)}\) and \({\widetilde{V}}^{\left(1\right),{ot}}={\widetilde{V}}^{\left(1\right)}\). Since \({\hat{Q}}^{\left(k\right)}{({\hat{Q}}^{(k)})}^{T}\) is equal to the identity matrix, then \({\widetilde{U}}^{\left(k\right),{ot}}{({\widetilde{V}}^{\left(k\right),{ot}})}^{T}={\widetilde{U}}^{\left(k\right)}{({\widetilde{V}}^{\left(k\right)})^{T}}\).
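Step 2 is a standard orthogonal Procrustes problem with a closed-form SVD solution. A minimal sketch (function name is illustrative):

```python
import numpy as np

def procrustes_align(U_ref, U_k):
    """Closed-form solution of min_Q ||U_ref - U_k Q||_F over orthogonal Q:
    if U_k^T U_ref = W diag(s) V^T is an SVD, then Q = W V^T."""
    W, _, Vt = np.linalg.svd(U_k.T @ U_ref)
    return W @ Vt
```

Applying `U_k @ procrustes_align(U_ref, U_k)` rotates site \(k\)'s embeddings into the reference site's orientation; because the transformation is orthogonal, inner products within site \(k\) are unchanged, mirroring the identity at the end of Step 2.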
Step 3: Calculate initial estimators for \({\beta }_{{g}_{i}}\), \({\zeta }_{i}\), and \({\delta }_{{ik}}\), denoted by \({\widetilde{\beta }}_{{g}_{i}}\), \({\widetilde{\zeta }}_{i}\), and \({\widetilde{\delta }}_{{ik}}\), via pooling the aligned initial user embeddings across sites. Specifically, \({\widetilde{\beta }}_{{g}_{i}}={n}_{{G}_{i}}^{-1}\sum _{k}\sum _{j\in {G}_{i}}{\widetilde{u}}_{{jk}}^{{ot}}\), where \({n}_{{G}_{i}}\) is the sum of number of codes in group \({G}_{i}\) across \(K\) sites and \({\widetilde{u}}_{{jk}}^{{ot}}\) is the \(j\)-th row of \({\widetilde{U}}^{\left(k\right),{ot}}\), \({\widetilde{\zeta }}_{i}={n}_{{ik}}^{-1}\sum _{k}{\widetilde{u}}_{{ik}}^{{ot}}-{\widetilde{\beta }}_{{g}_{i}}I\left(\left|{G}_{i}\right| > 1\right)\), where \({n}_{{ik}}\) represents the number of sites where code \(i\) is present, and \({\widetilde{\delta }}_{{ik}}={\widetilde{u}}_{{ik}}^{{ot}}-{\widetilde{\beta }}_{{g}_{i}}I\left(\left|{G}_{i}\right| > 1\right)-{\widetilde{\zeta }}_{i}\), for \(i\in {\cup }_{k=1}^{K}{{\mathbb{V}}}_{k}\) and \(k\in \left\{1,\,\ldots ,{K}\right\}.\) For symmetric \({S}^{(k)}\), \({\widetilde{\beta }}_{{g}_{i}^{{\prime} }}^{{\prime} }={\widetilde{\beta }}_{{g}_{i}}\), \({\widetilde{\zeta }}_{i}^{{\prime} }={\widetilde{\zeta }}_{i}\), and \({\widetilde{\delta }}_{{ik}}^{{\prime} }={\widetilde{\delta }}_{{ik}}\). Otherwise, \({\widetilde{\beta }}_{{g}_{i}^{{\prime} }}^{{\prime} }\), \({\widetilde{\zeta }}_{i}^{{\prime} }\), and \({\widetilde{\delta }}_{{ik}}^{{\prime} }\) can be obtained similarly using \({\widetilde{V}}^{\left(k\right),{ot}}\), \(k\in \left\{1,\,\ldots ,{K}\right\}\).
Step 4: For each iteration \(t\in\{1, 2, ...\}\), update \(\theta\) and \({\theta }^{{\prime} }\) in an alternating fashion. Treat the initial estimators as the results for \(t=0\).
Step 4.1: Fixing \({\widetilde{V}}^{\left(k\right),{ot},(t-1)}\), update \({\beta }_{{g}_{i}}\), \({\zeta }_{i}\), and \({\delta }_{{ik}}\) sequentially and iteratively, thereby updating \({\widetilde{U}}^{\left(k\right),{ot},(t-1)}\) component-wise.
When the stopping condition \(\left|L\left({\widetilde{\theta }}^{\left(t-1\right)},{\widetilde{\theta }}^{{\prime} \left(t-1\right)}\right)-L\left({\widetilde{\theta }}^{\left(t\right)},{\widetilde{\theta }}^{{\prime} \left(t-1\right)}\right)\right|/L\left({\widetilde{\theta }}^{\left(t-1\right)},{\widetilde{\theta }}^{{\prime} \left(t-1\right)}\right)\le {tol}\), where \({tol}\) is the pre-specified tolerance, is met, set \({\widetilde{u}}_{{ik}}^{{ot},\left(t\right)}={\widetilde{\beta }}_{{g}_{i}}^{\left(t\right)}I\left(\left|{G}_{i}\right| > 1\right)+{\widetilde{\zeta }}_{i}^{(t)}+{\widetilde{\delta }}_{{ik}}^{(t)}I\left(i\in {{\mathbb{V}}}_{o}\right)\).
Step 4.2: In the same manner as Step 4.1, fix the updated \({\widetilde{U}}^{\left(k\right),{ot},(t)}\) and update \({\beta }_{{g}_{j}^{{\prime} }}^{{\prime} }\), \({\zeta }_{j}^{{\prime} }\), and \({\delta }_{{jk}}^{{\prime} }\) sequentially and iteratively until the stopping condition \(\left|L\left({\widetilde{\theta }}^{\left(t\right)},{\widetilde{\theta }}^{{\prime} \left(t-1\right)}\right)-L\left({\widetilde{\theta }}^{\left(t\right)},{\widetilde{\theta }}^{{\prime} \left(t\right)}\right)\right|/L\left({\widetilde{\theta }}^{\left(t\right)},{\widetilde{\theta }}^{{\prime} \left(t-1\right)}\right)\le {tol}\) is met. Then set \({\widetilde{v}}_{{jk}}^{{ot},\left(t\right)}={\widetilde{\beta }}_{{g}_{j}^{{\prime} }}^{{\prime} \left(t\right)}I\left(\left|{G}_{j}^{{\prime} }\right| > 1\right)+{\widetilde{\zeta }}_{j}^{{\prime} (t)}+{\widetilde{\delta }}_{{jk}}^{{\prime} (t)}I\left(j\in {{\mathbb{V}}}_{o}^{{\prime} }\right)\).
Step 4.3: Repeat Step 4.1 and Step 4.2 until the stopping condition \(\left|L\left({\widetilde{\theta }}^{\left(t-1\right)},{\widetilde{\theta }}^{{\prime} \left(t-1\right)}\right)-L\left({\widetilde{\theta }}^{\left(t\right)},{\widetilde{\theta }}^{{\prime} \left(t\right)}\right)\right|/L\left({\widetilde{\theta }}^{\left(t-1\right)},{\widetilde{\theta }}^{{\prime} \left(t-1\right)}\right)\le {tol}\)
is met. Output \({\widetilde{U}}^{\left(k\right),{ot},(t)}\) and \({\widetilde{V}}^{\left(k\right),{ot},(t)}\), \(k\in \left\{1,\ldots ,K\right\}\), as the final MUGS embedding matrices.
Tuning procedure
We randomly selected a set of silver-standard positive PheCode-PheCode pairs and PheCode-RxNorm pairs curated from pediatric articles to tune \(({\lambda }_{1},{\lambda }_{2})\). Note that these pairs were excluded from the labels used to evaluate the performance of various embeddings. In the tuning procedure, we randomly selected an equal number of random pairs as controls. For embeddings trained with a given \(({\lambda }_{1},{\lambda }_{2})\), the AUC of the cosine similarity in distinguishing silver-standard positive pairs from random pairs was calculated. We selected \(({\lambda }_{1},{\lambda }_{2})\) that maximized the AUC as the final hyperparameters.
Label curation
Silver-standard positive label curation from pediatric articles
To evaluate the quality of the MUGS embeddings specifically for the pediatric population, we semi-automatically curated pediatric silver-standard labels by performing named entity recognition on disease-specific articles from three authoritative sources: BCH website (https://www.childrenshospital.org/conditions), Cincinnati Children’s Hospital (CCH) website (https://www.cincinnatichildrens.org/search/health-library), and UpToDate website (https://www.wolterskluwer.com/en/solutions/uptodate). Our process began by using a web crawler to collect paragraphs on symptoms and treatments from each disease page on these websites. For UpToDate, which includes articles on various medical subjects, we selected only those articles whose titles contain terms like ‘child’, ‘neonate’, or ‘infant’. We then applied the Narrative Informative Linear Extraction (NILE) algorithm50 to identify key CUIs representing diseases, and their corresponding symptoms and treatments. For each disease, we generated CUI pairs in two general forms: disease-condition and disease-medication. These CUI pairs were translated into two types of EHR code pairs, PheCode-PheCode, and PheCode-RxNorm, using an industry-standard CUI-code dictionary. This completed the curation of silver-standard positive pediatric code pairs. We refer to these pairs as ‘silver-standard’ to acknowledge the potential errors that may arise from ignoring the semantics of sentences and from noise in the CUI-code dictionary. We randomly selected 183 unique code-code pairs from an initial set of 200 and manually verified their correctness by reviewing the sentences which they were curated from as well as the pediatric literature. Of these, 166 pairs were confirmed as correct.
Survey label curation by pediatric physicians
A survey was conducted among pediatric physicians focusing on six target diseases: epilepsy, PH, asthma, type 1 diabetes, ulcerative colitis, and Crohn’s disease. For each disease, 10 additional conditions and 10 medications were randomly selected from the BCH code book. Respondents were asked to indicate their perception of the relationship between each disease and the selected conditions/medications by choosing from ‘strongly related,’ ‘maybe related,’ or ‘not related.’ In total, there were 120 questions categorized into six disease categories. The survey received responses from 31 pediatric physicians, with their expertise distributed as follows: 15 responses for epilepsy, 8 for PH, 9 for asthma, 3 for type 1 diabetes, and 2 each for ulcerative colitis and Crohn’s disease. We encountered 14 instances of conflicting responses, where pairs received both ‘strongly related’ and ‘not related’ assessments. To resolve these conflicts, we reviewed the pediatric literature and sought further confirmation from a domain expert to correct these 14 labels. For the remaining questions without conflicting responses, we employed majority voting to determine the final annotations.
Relatedness scores from GPT models
For each target disease in the survey mentioned above, we tasked GPT-3.5 and GPT-4, as AI models equipped with medical knowledge, with assigning a score to assess the degree of relatedness between the target disease and each of the 20 EHR codes paired with it. We provided the following specific instruction to both GPT-3.5 and GPT-4: ‘As an AI with medical knowledge, your task is to evaluate the degree of relatedness between two clinical concepts. The objective is to aid in feature selection specific for pediatric patients, implying that the concepts should ideally bear some clinical or medical connection. Please provide your evaluation as a numerical value, rounded to two decimal points, ranging from 0 (no correlation) to 1 (highly correlated). Note: Only respond with a SINGLE numerical value, NO textual explanations.’
Gold-standard positive label curation
NLP techniques facilitate the curation of a large volume of silver-standard positive labels, albeit with a slight reduction in accuracy. To provide more precise evaluations, we randomly selected 32 diseases with moderate frequencies at BCH and curated gold-standard positive labels by manually reviewing related pediatric articles from BCH, CCH, and UpToDate websites. Together with the survey pairs whose final annotation was ‘strongly related’, we curated 1156 gold-standard positive labels on pediatric-specific code-code pairs.
Evaluation methods
We first evaluated the quality of the embeddings by assessing their accuracy in identifying gold-standard and silver-standard positive pairs, focusing on two types of pairs: PheCode-PheCode and PheCode-RxNorm. Given the predominance of unrelated pairs within each type (i.e., the sparsity of the relatedness network), we treated each random pair as representing a weak or nonexistent relationship between the codes. We encoded the curated pairs from a specific source as ones and generated an equal number of random pairs coded as zeros. In this binary vector, one indicates a strong relationship, and zero denotes its absence. We then calculated the cosine similarity between the embeddings of the two codes in each pair and summarized the accuracy using the AUC score, which compares these cosine similarities against the binary vector.
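The pairwise AUC evaluation can be sketched as follows, with illustrative embeddings and pair lists (the same statistic also drives the \(({\lambda }_{1},{\lambda }_{2})\) tuning described above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def pair_auc(emb, curated_pairs, random_pairs):
    """AUC of cosine similarity in separating curated (label 1) pairs
    from random (label 0) pairs."""
    pairs = curated_pairs + random_pairs
    sims = [cosine(emb[a], emb[b]) for a, b in pairs]
    labels = [1] * len(curated_pairs) + [0] * len(random_pairs)
    return roc_auc_score(labels, sims)
```

An AUC of 1 means every curated pair is more similar than every random pair; 0.5 means the embeddings carry no relational signal.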
For survey labels, we assigned ‘strongly related’ a value of 1, ‘maybe related’ 0.5, and ‘not related’ 0. GPT labels provide numeric relatedness scores ranging from zero to one. These ordinal labels, richer in detail than binary labels, necessitated the use of Kendall’s tau instead of AUC, as it better captures the nuances in ordinal data. We calculated the cosine similarity for each code pair using embeddings from four benchmark methods and MUGS. Kendall’s tau was then computed to compare these cosine similarities with majority votes from survey labels and GPT relatedness scores. The average Kendall’s tau across six target diseases was used as the criterion for final evaluation.
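The ordinal comparison uses Kendall's tau; a sketch with hypothetical label encodings and illustrative similarity values:

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical example: six code pairs with ordinal survey labels
# (strongly related = 1, maybe related = 0.5, not related = 0)
labels = np.array([1, 1, 0.5, 0.5, 0, 0])
# cosine similarities of the same six pairs under some embedding (illustrative)
sims = np.array([0.8, 0.7, 0.5, 0.4, 0.2, 0.1])

# kendalltau uses the tau-b variant, which accounts for the ties
# that are inevitable with three-level ordinal labels
tau, _ = kendalltau(sims, labels)
```

Because the labels contain ties, a perfectly concordant ordering still yields a tau below 1 under tau-b, which is the appropriate behavior for ordinal survey data.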
Feature selection: cutoff of cosine similarity
When using a PheCode as the target, we consider four types of pairs: PheCode-PheCode, PheCode-RxNorm, PheCode-LOINC, and PheCode-CCS. To establish suitable cutoffs for feature selection, we randomly generated 5000 pairs for each pair type and calculated the 99th percentile of their cosine similarities based on the code embeddings. We then used these four percentiles as cutoffs to filter codes. Specifically, we retained codes whose cosine similarity with the target code was at or above the respective cutoff for each type.
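The percentile cutoff can be sketched as follows, assuming the embeddings for one pair type are stacked in a matrix; the function name and seeding are illustrative.

```python
import numpy as np

def null_cutoff(emb, n_pairs=5000, q=99, seed=0):
    """q-th percentile of cosine similarity over randomly drawn distinct
    code pairs, used as the retention cutoff for one pair type."""
    rng = np.random.default_rng(seed)
    n = emb.shape[0]
    i = rng.integers(0, n, n_pairs)
    j = rng.integers(0, n, n_pairs)
    mask = i != j                       # exclude self-pairs (similarity 1)
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = np.einsum("ij,ij->i", normed[i[mask]], normed[j[mask]])
    return np.percentile(sims, q)
```

Codes whose cosine similarity with the target meets or exceeds this null 99th percentile are retained for that pair type.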
Patient embedding generation and classification methods
Feature selection is an important step in many computable phenotyping methods. We compared our method with three phenotyping benchmarks, using one code (PheCode:415.2), 70 codes chosen through the KESER algorithm, and all 1140 codes from the labeled patients’ records. All counts were transformed by \(x\to \log (x+1)\) to stabilize the training process. To avoid overfitting, we conducted a principal component analysis (PCA) on the transformed count data, selecting the minimal number of leading principal components (PCs) that accounted for at least 85% of the total variation. We then used these PCs to train an SVM with a radial basis function (RBF) kernel.
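The count transform, PCA, and SVM steps can be sketched as a scikit-learn pipeline; this is a simplified illustration of the setup described above, not the exact training code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC

# A float n_components asks PCA for the minimal number of leading PCs
# explaining at least that fraction of the total variation.
clf = make_pipeline(
    FunctionTransformer(np.log1p),   # x -> log(x + 1) to stabilize the counts
    PCA(n_components=0.85),
    SVC(kernel="rbf"),
)
```

The pipeline is then fit on the transformed code counts of the labeled patients and their binary phenotype labels.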
Unlike traditional feature selection, code embeddings allow us to quantify the interconnectedness between various medical codes and PH by measuring the cosine similarity between the respective code embeddings, which can be viewed as continuous feature weighting. To utilize code embeddings in patient-level tasks, we propose constructing an embedding for each patient as the weighted sum of the code embeddings:
where \({z}_{i}\) is the embedding of patient \(i\), \({\mathbb{V}}\) is the set of codes documented in the labeled patients’ records, \({d}_{{ij}}\) is the count of code \(j\) in patient \(i\)’s records, \({p}_{j}\) is the number of patients in the study cohort who received code \(j\), \({u}_{j}\) is the embedding of code j, and \({u}_{{j}_{0}}\) is the embedding of the target code \({j}_{0}\). For a given code, the weight is a product of two terms. The first term is the term frequency-inverse document frequency (TF-IDF) of the code, and the second term is the cosine similarity between the code embedding and the target code embedding. The greater the count number and the cosine similarity, the more significant the specific code in constructing the patient embedding.
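A sketch of this weighted-sum construction follows. The exact TF-IDF variant (raw count times log inverse document frequency) is our assumption, as are all names below.

```python
import numpy as np

def patient_embedding(counts, p, emb, target, n_cohort):
    """Patient embedding as a weighted sum of code embeddings, each weighted
    by a TF-IDF term and by its cosine similarity to the target code."""
    u0 = emb[target]
    z = np.zeros_like(u0, dtype=float)
    for code, d_ij in counts.items():
        u = emb[code]
        tfidf = d_ij * np.log(n_cohort / p[code])   # assumed TF-IDF form
        cos = u @ u0 / (np.linalg.norm(u) * np.linalg.norm(u0))
        z += tfidf * cos * u
    return z
```

Codes orthogonal to the target contribute nothing, so the patient embedding is dominated by frequently recorded, rarely shared codes that are close to the target in embedding space.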
Here, the target code is the PheCode of PH, and we used the full records of each labeled patient to calculate the TF-IDF. We then performed PCA on patient embeddings derived from various sets of code embeddings, including Coder, SapBert, Skip-Gram, and MUGS. Similarly, for each method, we selected the minimal number of leading PCs that explained at least 85% of the total variation and fed them together with the binary labels into an SVM with an RBF kernel.
To assess the performance of diverse classification methods, we partitioned the data randomly into ten folds, employing nine for model training and the remaining one for evaluation. To mitigate the impact of random sampling, we repeated this data splitting process 50 times and reported the mean and standard deviation of 500 AUCs, sensitivities, and PPVs with specificity set to 0.95 as our final outcomes.
Embedding-based patient clustering
To further demonstrate the effectiveness of MUGS embeddings in capturing nuanced aspects of a patient’s profile, we conducted an unsupervised patient clustering task. This task aimed to differentiate between patients with severe PH progression and those with mild PH progression among PH-positive individuals. We focused on patients labeled as PH-positive by domain experts and unlabeled patients classified as PH-positive using the above SVM-based supervised classifiers trained with Skip-Gram embeddings and MUGS embeddings, respectively. To ensure a relatively high PPV, we chose the minimal probability cutoff such that the specificity is no less than 95%.
Next, we constructed patient embeddings for the PH-positive patients using Eq. (8). Unlike in the PH classification task, here we only used EHR data from two years prior and one year subsequent to the patients receiving their first PH diagnostic code. We then applied PCA to the newly formed patient embeddings and selected the minimal number of PCs that accounted for at least 85% of the total variance. These selected PCs were used to train a K-Means clustering algorithm, which divided the PH-positive patients into two distinct groups. Finally, we gathered data on each patient’s status at their last visit (whether they were deceased or right-censored) and the time until death or censoring from the EHR. We then calculated the proportion of patients in each group who survived for a specific duration 1 year after receiving their first PH diagnostic code, using the Kaplan–Meier estimator.
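The PCA-plus-K-Means step can be sketched as below; the function name is illustrative, and the subsequent Kaplan–Meier comparison of the two groups is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_ph_patients(Z, seed=0):
    """Reduce patient embeddings to the minimal PCs covering >= 85% of the
    variance, then split PH-positive patients into two groups with K-Means."""
    pcs = PCA(n_components=0.85).fit_transform(Z)
    return KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(pcs)
```

The resulting group labels are then cross-tabulated with each patient's survival status and time to death or censoring.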
Data availability
Due to privacy constraints, the visit-level dataset used to calculate the embedding vectors and perform patient classification and clustering is not shareable. However, the hierarchical medical ontology can be accessed at https://shiny.parse-health.org/hierarchies/, and the knowledge network structures constructed based on our MUGS embeddings can be accessed at https://shiny.parse-health.org/multi-view-net/.
Code availability
We conducted all data analyses using R version 4.1.1. The code for MUGS embedding construction is available at https://github.com/MengyanLi1992/MUGS.git.
References
Thiese, M. S. Observational and interventional study design types; an overview. Biochem. Med. 24, 199–210 (2014).
Barría, R. M. Cohort Studies in Health Sciences. (InTech, 2018).
Schriger, D. L. Modern epidemiology. Ann. Emerg. Med. 52, 480 (2008).
Bourgeois, F. T. et al. Pediatric versus adult drug trials for conditions with high pediatric disease burden. Pediatrics 130, 285–292 (2012).
Joseph, P. D., Craig, J. C. & Caldwell, P. H. Y. Clinical trials in children. Br. J. Clin. Pharmacol. 79, 357–369 (2015).
Sollo, N. et al. Perceived barriers to clinical trials participation: a survey of pediatric caregivers. Kans. J. Med. 15, 139–143 (2022).
Pasquali, S. K., Lam, W. K., Chiswell, K., Kemper, A. R. & Li, J. S. Status of the pediatric clinical trials enterprise: an analysis of the US ClinicalTrials.gov Registry. Pediatrics 130, e1269–e1277 (2012).
Hill, K. D., Chiswell, K., Califf, R. M., Pearson, G. & Li, J. S. Characteristics of pediatric cardiovascular clinical trials registered on ClinicalTrials.gov. Am. Heart J. 167, 921–929.e2 (2014).
Thomson, D. et al. Controlled trials in children: quantity, methodological quality and descriptive characteristics of pediatric controlled trials published 1948–2006. PLoS ONE 5, e13106 (2010).
Awerbach, J. D., Krasuski, R. A. & Hill, K. D. Characteristics of pediatric pulmonary hypertension trials registered on ClinicalTrials.gov. Pulm. Circ. 7, 348–360 (2017).
Bourgeois, F. T. & Hwang, T. J. The pediatric research equity act moves into adolescence. J. Am. Med. Assoc. 317, 259–260 (2017).
Cowie, M. R. et al. Electronic health records to facilitate clinical research. Clin. Res. Cardiol. 106, 1–9 (2017).
Bennett, C. C., Doub, T. W. & Selove, R. EHRs connect research and practice: Where predictive modeling, artificial intelligence, and clinical decision support intersect. Health Policy Technol. 1, 105–114 (2012).
Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1110 (2013).
Kohane, I. S. Using electronic health records to drive discovery in disease genomics. Nat. Rev. Genet. 12, 417–428 (2011).
Goldstein, B. A., Navar, A. M., Pencina, M. J. & Ioannidis, J. P. A. Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review. J. Am. Med. Inform. Assoc. 24, 198–208 (2017).
Lin, K. J. & Schneeweiss, S. Considerations for the analysis of longitudinal electronic health records linked to claims data to study the effectiveness and safety of drugs. Clin. Pharmacol. Ther. 100, 147–159 (2016).
Lipton, Z. C., Kale, D. C., Elkan, C. & Wetzel, R. Learning to Diagnose with LSTM Recurrent Neural Networks. Preprint at arXiv https://arxiv.org/abs/1511.03677v7 (2015).
Choi, E., Bahadori, M. T., Schuetz, A., Stewart, W. F. & Sun, J. Doctor AI: predicting clinical events via recurrent neural networks. JMLR Workshop Conf. Proc. 56, 301–318 (2016).
Rajkomar, A. et al. Scalable and accurate deep learning with electronic health records. NPJ Digit. Med. 1, 18 (2018).
Federico, P., Unger, J., Amor-Amorós, A. & Sacchi, L. Gnaeus: utilizing clinical guidelines for knowledge-assisted visualisation of EHR cohorts. EuroVA https://doi.org/10.2312/eurova.20151108 (2015).
Rotmensch, M., Halpern, Y., Tlimat, A., Horng, S. & Sontag, D. Learning a health knowledge graph from electronic medical records. Sci. Rep. 7, 5994 (2017).
He, T. et al. Trends and opportunities in computable clinical phenotyping: a scoping review. J. Biomed. Inform. 140, 104335 (2023).
Pendergrass, S. A. & Crawford, D. C. Using electronic health records to generate phenotypes for research. Curr. Protoc. Hum. Genet. 100, e80 (2019).
Beam, A. L. et al. Clinical concept embeddings learned from massive sources of multimodal medical data. Pac. Symp. Biocomput. 25, 295–306 (2020).
Choi, Y., Chiu, C. Y.-I. & Sontag, D. Learning low-dimensional representations of medical concepts. AMIA Jt Summits Transl. Sci. Proc. 2016, 41–50 (2016).
Hong, C. et al. Clinical knowledge extraction via sparse embedding regression (KESER) with multi-center large scale electronic health record data. NPJ Digit. Med. 4, 151 (2021).
Zhou, D. et al. Multiview Incomplete Knowledge Graph Integration with application to cross-institutional EHR data harmonization. J. Biomed. Inform. 133, 104147 (2022).
Miotto, R., Li, L., Kidd, B. A. & Dudley, J. T. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci. Rep. 6, 26094 (2016).
Shang, Y. et al. EHR-oriented knowledge graph system: toward efficient utilization of non-used information buried in routine clinical practice. IEEE J. Biomed. Health Inf. 25, 2463–2475 (2021).
Chandak, P., Huang, K. & Zitnik, M. Building a knowledge graph to enable precision medicine. Sci. Data 10, 67 (2023).
Harnoune, A. et al. BERT based clinical knowledge extraction for biomedical knowledge graph construction and analysis. Comput. Methods Prog. Biomed. Update 1, 100042 (2021).
Santos, A. et al. A knowledge graph to interpret clinical proteomics data. Nat. Biotechnol. 40, 692–702 (2022).
Gracy, D., Weisman, J., Grant, R., Pruitt, J. & Brito, A. Content barriers to pediatric uptake of electronic health records. Adv. Pediatr. 59, 159–181 (2012).
Bavdekar, S. B. Pediatric clinical trials. Perspect. Clin. Res. 4, 89–99 (2013).
Gerstle, R. S., Lehmann, C. U. & the Council on Clinical Information Technology. Electronic prescribing systems in pediatrics: the rationale and functionality requirements. Pediatrics 119, e1413–e1422 (2007).
Cramer, K. et al. Children in reviews: methodological issues in child-relevant evidence syntheses. BMC Pediatr. 5, 38 (2005).
Johnson, K. & Lehmann, C. Electronic prescribing in pediatrics: toward safer and more effective medication management. Pediatrics 131, 1350–1356 (2013).
Wasserman, R. C. The patient record and the rise of the pediatric EHR. Curr. Probl. Pediatr. Adolesc. Health Care 52, 101108 (2022).
McDonald, C. J. et al. LOINC, a universal standard for identifying laboratory observations: a 5-year update. Clin. Chem. 49, 624–633 (2003).
Levy, O. & Goldberg, Y. Neural word embedding as implicit matrix factorization. Adv. Neural Inf. Process. Syst. 27, 2177–2185 (2014).
Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. Preprint at arXiv https://arxiv.org/abs/1301.3781 (2013).
Liu, F., Shareghi, E., Meng, Z., Basaldella, M. & Collier, N. Self-alignment pretraining for biomedical entity representations. Preprint at arXiv https://arxiv.org/abs/2010.11784 (2020).
Yuan, Z. et al. CODER: knowledge-infused cross-lingual medical term embedding for term normalization. J. Biomed. Inform. 126, 103983 (2022).
Fariba, K. A. & Saadabadi, A. Topiramate. (StatPearls Publishing, 2023).
Geva, A. et al. A computable phenotype improves cohort ascertainment in a pediatric pulmonary hypertension registry. J. Pediatr. 188, 224–231.e5 (2017).
Denny, J. C. et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics 26, 1205–1210 (2010).
Liu, S., Ma, W., Moore, R., Ganesan, V. & Nelson, S. RxNorm: prescription for electronic drug information exchange. IT Prof. 7, 17–23 (2005).
Shi, X., Li, X. & Cai, T. Spherical regression under mismatch corruption with application to automated knowledge translation. J. Am. Stat. Assoc. 116, 1953–1964 (2021).
Yu, S., Cai, T. & Cai, T. NILE: fast natural language processing for electronic health records. Preprint at arXiv https://arxiv.org/abs/1311.6063 (2013).
Acknowledgements
This work was additionally supported in part by U01TR002623 from the National Center for Advancing Translational Sciences/NIH, by the PrecisionLink Project at Boston Children’s Hospital, and by the Center for Health and Business at Bentley University.
Author information
Authors and Affiliations
Contributions
M.L.: Contributed to model development, performed coding and data analyses, and wrote the initial draft of the manuscript. X.L.: Developed the core model and the MUGS algorithm and edited the subsection ‘Knowledge Transfer via MUGS’ in the ‘Methods’ section. K.P.: Curated silver-standard labels using NLP techniques, contributed to gold-standard label curation, and created Figs. 1–4. A.G.: Provided expert clinical pediatric advice, contributed to the development and refinement of the survey, assisted in the interpretation and explanation of data analysis results, and participated in manuscript editing. D.Y.: Wrote the Introduction and Discussion, contributed to gold-standard label curation, and participated in the final manuscript editing. S.M.S.: Ran CODER and SapBERT and contributed to label curation and data collection. C.L.B.: Contributed to data collection and edited the ‘Data Preprocessing’ subsections in both the ‘Results’ and ‘Methods’ sections. V.A.P.: Organized hierarchical medical ontologies and contributed to refining the map between CUIs and EHR codes. X.X.: Developed a website for the survey and organized the survey data. K.M.: Conceived the study design and provided constructive suggestions on the scope of this study as well as expert clinical advice. T.C.: Conceived the study design, led the project, and contributed to each step of this study.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, M., Li, X., Pan, K. et al. Multisource representation learning for pediatric knowledge extraction from electronic health records. npj Digit. Med. 7, 319 (2024). https://doi.org/10.1038/s41746-024-01320-4
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41746-024-01320-4