SODA: A Natural Language Processing Package to Extract Social Determinants of Health for Cancer Studies

Yu, Zehao; Yang, Xi; Dang, Chong; Adekkanattu, Prakash; Patra, Braja Gopal; Peng, Yifan; Pathak, Jyotishman; Wilson, Debbie L.; Chang, Ching-Yuan; Lo-Ciganic, Wei-Hsuan; George, Thomas J.; Hogan, William R.; Guo, Yi; Bian, Jiang; Wu, Yonghui

doi:10.1016/j.jbi.2024.104642

arXiv:2212.03000 (cs)

[Submitted on 6 Dec 2022 (v1), last revised 18 May 2023 (this version, v2)]

Authors: Zehao Yu, Xi Yang, Chong Dang, Prakash Adekkanattu, Braja Gopal Patra, Yifan Peng, Jyotishman Pathak, Debbie L. Wilson, Ching-Yuan Chang, Wei-Hsuan Lo-Ciganic, Thomas J. George, William R. Hogan, Yi Guo, Jiang Bian, Yonghui Wu

Abstract: Objective: We aim to develop an open-source natural language processing (NLP) package, SODA (i.e., SOcial DeterminAnts), with pre-trained transformer models to extract social determinants of health (SDoH) for cancer patients, examine the generalizability of SODA to a new disease domain (i.e., opioid use), and evaluate the extraction rate of SDoH using cancer populations.
Methods: We identified SDoH categories and attributes and developed an SDoH corpus using clinical notes from a general cancer cohort. We compared four transformer-based NLP models for SDoH extraction, examined the generalizability of the NLP models to a cohort of patients prescribed opioids, and explored customization strategies to improve performance. We applied the best NLP model to extract 19 categories of SDoH from the breast (n=7,971), lung (n=11,804), and colorectal cancer (n=6,240) cohorts.
Results and Conclusion: We developed a corpus of 629 cancer patients' notes annotated with 13,193 SDoH concepts/attributes from 19 categories of SDoH. The Bidirectional Encoder Representations from Transformers (BERT) model achieved the best strict/lenient F1 scores: 0.9216 and 0.9441 for SDoH concept extraction, and 0.9617 and 0.9626 for linking attributes to SDoH concepts. Fine-tuning the NLP models using new annotations from opioid use patients improved the strict/lenient F1 scores from 0.8172/0.8502 to 0.8312/0.8679. The extraction rates varied greatly across the 19 categories of SDoH: 10 SDoH could be extracted from >70% of cancer patients, but 9 SDoH had a low extraction rate (<70% of cancer patients). The SODA package with pre-trained transformer models is publicly available at this https URL.
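
The abstract reports strict and lenient F1 scores without defining them. The sketch below illustrates one common convention in clinical concept extraction, which is an assumption here and not taken from the paper: a strict match requires identical span boundaries and category, while a lenient match only requires overlapping boundaries with the same category. The `Span` layout, the function names, and the example category labels are hypothetical, and this is not the SODA evaluation code.

```python
from typing import List, Tuple

# Hypothetical span layout: (start offset, end offset exclusive, category).
Span = Tuple[int, int, str]


def _overlaps(a: Span, b: Span) -> bool:
    """True when two spans share a category and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def span_f1(gold: List[Span], pred: List[Span], lenient: bool = False) -> float:
    """Span-level F1 under strict (exact) or lenient (overlap) matching."""
    if lenient:
        # A span counts as matched if it overlaps any same-category span.
        matched_pred = sum(any(_overlaps(p, g) for g in gold) for p in pred)
        matched_gold = sum(any(_overlaps(g, p) for p in pred) for g in gold)
    else:
        # A span counts as matched only on identical boundaries and category.
        gold_set, pred_set = set(gold), set(pred)
        matched_pred = sum(p in gold_set for p in pred)
        matched_gold = sum(g in pred_set for g in gold)

    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (invented for illustration): one exact match, one partial
# overlap, and one spurious prediction.
gold = [(0, 7, "Smoking"), (20, 28, "Employment")]
pred = [(0, 7, "Smoking"), (22, 28, "Employment"), (40, 45, "Education")]

strict_f1 = span_f1(gold, pred)
lenient_f1 = span_f1(gold, pred, lenient=True)
```

On this toy input the partially overlapping "Employment" span counts only under lenient matching, so the lenient score is higher than the strict one. The same pattern appears in each strict/lenient pair reported above.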

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
ACM classes:	I.2.7
Cite as:	arXiv:2212.03000 [cs.CL]
	(or arXiv:2212.03000v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2212.03000
Journal reference:	Journal of Biomedical Informatics, April 2024, 104642
Related DOI:	https://doi.org/10.1016/j.jbi.2024.104642

Submission history

From: Yonghui Wu
[v1] Tue, 6 Dec 2022 14:23:38 UTC (351 KB)
[v2] Thu, 18 May 2023 18:39:20 UTC (406 KB)
