
M1 DS Project LungCancerPrediction

This document summarizes 5 research papers on lung cancer prediction models. The first paper achieved 94% accuracy using the XGBoost algorithm and ensemble learning techniques. The second introduced a machine learning model with 100% and 98.7% accuracy on full and reduced datasets. The third used a convolutional neural network achieving an impressive 94.5% AUC but requiring further validation and sensitivity improvement. The fourth focused on using data mining techniques for diagnosis while the fifth highlighted the importance of comprehensive risk factors in predictive models.

Uploaded by

Mizna Amousa
Copyright
© © All Rights Reserved

Lung Cancer Prediction

Ghaida Alsaeed
College of Computer Sciences and Information Technology, King Faisal University
Alahsa, Saudi Arabia
220011321@student.kfu.edu.sa

Zahra Fawzi
College of Computer Sciences and Information Technology, King Faisal University
Alahsa, Saudi Arabia
220021681@student.kfu.edu.sa

Meznah AlMousa
College of Computer Sciences and Information Technology, King Faisal University
Alahsa, Saudi Arabia
220017925@student.kfu.edu.sa

Fatima Ibraheim
College of Computer Sciences and Information Technology, King Faisal University
Alahsa, Saudi Arabia
220032204@student.kfu.edu.sa

Abstract—To aid medical professionals and hospitals in the detection of lung cancer, this report presents machine learning-based prediction models for the disease. These models make use of data gathered from patients who have already received a diagnosis; this data may include vital signs, lung images, or results from other medical tests designed to identify lung cancer. Depending on the circumstances of each patient, the use of prediction models can be beneficial in decision-making, risk reduction, and damage limitation. We present and analyze the model's output in this report, along with the team members' predictions. A summary of the methods and algorithms employed is also given.

Keywords—component, formatting, style, styling, insert (key words)

I. INTRODUCTION

Lung cancer is a pressing issue in today's healthcare landscape, with its incidence rates steadily rising and its impact affecting both men and women. As a condition characterized by uncontrolled cell growth in the lungs, lung cancer poses significant challenges in terms of prevention and early detection. While prevention measures are limited, early detection plays a crucial role in reducing risks and improving patient outcomes.

In recent years, the field of machine learning has gained tremendous momentum, offering promising solutions to various healthcare problems. By leveraging data and algorithms, machine learning models have demonstrated the ability to predict diseases accurately and efficiently. This advancement holds great significance in saving lives by enabling early intervention and timely treatment.

Our project focuses on the prediction of lung cancer using machine learning techniques. Specifically, we aim to develop a classification algorithm capable of accurately detecting lung cancer at an early stage. By training the model on diverse datasets and employing algorithms like Naïve Bayes, Support Vector Machines, Decision Trees, and K-nearest neighbors, we seek to provide swift and reliable predictions for this devastating disease.

In this paper, we will delve into the details of our methodology, data collection, and model development. We will present the results of our lung cancer prediction model and discuss its implications for the medical community. Ultimately, our objective is to provide a fast and easily accessible solution for predicting lung cancer using a specific algorithm, thereby enhancing the efficiency and effectiveness of healthcare interventions in tackling this critical health issue.

II. LITERATURE REVIEW

For our research, we read five papers related to the prediction of lung cancer. This section discusses each of them in turn.

The article “Lung Cancer Prediction Model Using Ensemble Learning Techniques and Systematic Review Analysis” highlights the strong performance of XGBoost in a lung cancer prediction model, which achieved 94% accuracy along with high precision, recall, and AUC values. It underscores the potential for improving predictive accuracy through ensemble learning techniques, including XGBoost, LightGBM, AdaBoost, and Bagging. In terms of research methodology, the study involved a literature review of machine learning-based lung cancer prediction models and the development of ensemble methods using a dataset of 309 individuals, featuring attributes such as age, smoking, and symptoms. However, the article lacks discussion of limitations, dataset representativeness, and the influence of the selected features. The justification for this approach is based on prior research, which demonstrated the effectiveness of ensemble methods in lung cancer prediction and the significance of various machine learning algorithms. The use of common performance metrics supports a comprehensive evaluation of the predictive models.[1]

Among cancers, lung cancer is one of the most dreaded conditions, and it is the leading cause of cancer-related deaths worldwide. Early cancer identification and prediction help prevent and treat cancer efficiently, especially at an early stage. The second study therefore presents a prediction model for lung cancer level based on machine learning, with machine learning algorithms applied as the primary methods. Firstly, the dataset is collected; then, feature selection algorithms are used to identify essential features. Secondly, the proposed model applies the machine learning algorithms to two datasets (the full dataset and the dataset of essential features). Finally, experimental results demonstrate that this proposed

system has an excellent performance, with 100% and 98.7% accuracy on the full dataset and the dataset of the top three essential features, respectively.[2]

The third paper is the study by Heuvelmans et al. titled "Lung cancer prediction by Deep Learning to identify benign lung nodules." The research investigates the performance of the Lung Cancer Prediction Convolutional Neural Network (LCP-CNN), trained on US screening data and validated on an independent European multicenter dataset. The study's key findings highlight an AUC of 94.5% across European centers, with a sensitivity of 99.0% in ruling out malignancy for 22.1% of nodules, reducing unnecessary follow-up scans. The chosen approach of utilizing deep learning algorithms, supported by existing literature, proves effective in improving lung nodule classification. However, gaps in current knowledge include further validation on larger nodules and diverse patient populations, as well as improving sensitivity for challenging cases. Addressing these gaps can enhance lung cancer diagnosis and patient outcomes.[3]

The literature on “Diagnosis of Lung Cancer Prediction System Using Data Mining Classification Techniques” demonstrates a growing interest in leveraging computational techniques to improve diagnostic accuracy and patient outcomes. V. Krishnaiah et al. developed a prototype system for predicting lung cancer, employing various classification methods such as Naive Bayes, IF-THEN rules, Decision Trees, and Neural Networks. Their findings suggest that Naive Bayes is the most effective model, outperforming the others in prediction accuracy. This is consistent with other research, such as Sang Min Park et al. and Yongqian Qiang et al., which underscores the importance of predictive analytics in oncology. The study by Krishnaiah et al. also highlights the interpretability of Decision Trees and the robustness of Neural Networks despite their complex representations. The research acknowledges the challenge of lung cancer prevalence in rural India, emphasizing the need for targeted screening facilitated by data mining techniques. This is in line with the broader literature that recognizes the potential of data mining in identifying disease patterns and risk factors, as discussed by Harleen Kaur and Siri Krishan Wasan, Murat Karabhatak, and M.Cevdet Ince. The literature review indicates a consensus on the utility of data mining in healthcare, focusing on predictive systems that can aid in early diagnosis and treatment planning for diseases such as lung cancer.[4]

The development of cancer risk prediction models has been a cornerstone in the evolution of personalized medicine, especially in oncology. The article “A Risk Model for Prediction of Lung Cancer” states that the logistic risk score introduced by Breslow and Day (1980) has been foundational, with subsequent models integrating a range of risk factors, from genetic predispositions to lifestyle choices (Breslow & Day, 1980; Gail et al., 1989). Notably, the adjustment constants for smoking status-specific incidence rates (Appendix Table 1) reflect the intricate link between smoking habits and lung cancer risk, echoing findings from the Office on Smoking and Health (2004) and the Centers for Disease Control (2005). Furthermore, the age and sex-stratified lung cancer incidence and mortality rates provided by the SEER Program (2005) offer essential insights into demographic disparities in cancer risk. These statistics are crucial for enhancing the precision of models like those by Rockhill et al. (2003) and Tice et al. (2005), which incorporate mammographic density and familial risk factors into breast cancer risk assessments. The breadth of research, from the Harvard report on cancer prevention (Colditz et al., 2000) to the exploration of genetic associations by Wu et al. (2002), demonstrates the multifaceted nature of cancer etiology and the critical role of comprehensive risk assessment in crafting effective predictive models. This body of literature highlights the dynamic nature of risk prediction models, which are essential for tailored screening and prevention measures, with the ultimate goal of mitigating cancer's impact through early detection and personalized care (Colditz et al., 2000; Wu et al., 2002).[5]

The five research papers on lung cancer prediction offer diverse insights. The first emphasizes XGBoost's 94% accuracy and the use of ensemble learning techniques, but lacks depth in discussing limitations. The second introduces a machine learning-based model with accuracies of 100% and 98.7%. The third presents the Lung Cancer Prediction Convolutional Neural Network, achieving an AUC of 94.5% but identifying gaps in validation and sensitivity. The fourth focuses on the effectiveness of Naive Bayes in predicting lung cancer, addressing the challenge of prevalence in rural India. Finally, the fifth provides an overview of personalized medicine in oncology, emphasizing the intricate link between smoking habits and lung cancer risk. These studies collectively enrich our understanding of lung cancer prediction, diagnosis, and personalized treatment.

III. DATASET AND APPROACH

Many illnesses can be reliably predicted by machine learning models. By anticipating illness, machine learning algorithms and techniques can help people live longer. Cancer is a disease in which certain body cells proliferate uncontrollably and spread to other regions of the body. It can arise almost anywhere in the human body. This report discusses lung cancer, which is becoming more common and can affect both men and women. Once it begins, lung cancer cannot be stopped, but the risks can be decreased, and it can be identified early. A classification algorithm will produce the lung cancer prediction. Many classification algorithms exist. The objective of this report is to predict lung cancer quickly, easily, and as early as possible using a specific algorithm.

A. Data Collection
The dataset appears to have been collected through a survey. Each row represents an individual and contains various attributes such as gender, age, smoking habits, symptoms (e.g., coughing, shortness of breath), and the presence or absence of lung cancer.

B. Description
The dataset used in this Kaggle competition is a public dataset of lung cancer patients from the University of California, Irvine (UCI) Machine Learning Repository. The data set has 309 rows and 16 variables, all containing information about the surveyed individuals.
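The size and class make-up stated in the Description can be checked right after loading the file. The following is a minimal sketch using only the Python standard library, not the project's actual code; the helper name `summarize_survey` and the three sample rows are hypothetical and exist only so the example runs without the real data.

```python
import csv
import io
from collections import Counter


def summarize_survey(csv_file):
    """Return (row_count, column_names, class_counts) for a survey CSV.

    `csv_file` is any open text file object. The target column is assumed
    to be named LUNG_CANCER, as in the feature list of this report.
    """
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    columns = reader.fieldnames or []
    class_counts = Counter(row["LUNG_CANCER"] for row in rows)
    return len(rows), columns, class_counts


# Hypothetical sample with only a few of the 16 columns, used here
# so that the sketch runs without the real file.
SAMPLE = """GENDER,AGE,SMOKING,LUNG_CANCER
M,69,1,YES
F,59,2,NO
M,63,2,YES
"""

n_rows, columns, class_counts = summarize_survey(io.StringIO(SAMPLE))
print(n_rows, columns, dict(class_counts))
```

With the real data, the file object would come from `open("survey_lung_cancer.csv", newline="")`, and the report's figures suggest the result should be 309 rows and 16 columns.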

The dataset consists of several features, including:

● GENDER: Gender of the individual (M for male, F for female)

● AGE: Age of the individual

● SMOKING: Smoking habits (1 for current smoker, 2 for non-smoker)

● YELLOW_FINGERS: Presence of yellow fingers (1 for yes, 2 for no)

● ANXIETY: Anxiety level (1 for low, 2 for high)

● PEER_PRESSURE: Peer pressure to smoke (1 for low, 2 for high)

● CHRONIC DISEASE: Presence of chronic disease (1 for yes, 2 for no)

● FATIGUE: Fatigue level (1 for low, 2 for high)

● ALLERGY: Presence of allergies (1 for yes, 2 for no)

● WHEEZING: Wheezing symptoms (1 for yes, 2 for no)

● ALCOHOL CONSUMING: Alcohol consumption habits (1 for yes, 2 for no)

● COUGHING: Presence of coughing symptoms (1 for yes, 2 for no)

● SHORTNESS OF BREATH: Presence of shortness of breath symptoms (1 for yes, 2 for no)

● SWALLOWING DIFFICULTY: Difficulty in swallowing (1 for yes, 2 for no)

● CHEST PAIN: Presence of chest pain (1 for yes, 2 for no)

● LUNG_CANCER: Target variable indicating the presence or absence of lung cancer (YES for lung cancer present, NO for lung cancer absent)

C. Data Preprocessing
Data preprocessing is a crucial step in the data science pipeline that involves transforming raw data into a suitable format for analysis and modeling. It helps clean and prepare the data, ensuring its quality and compatibility with the chosen data science techniques. Some common data preprocessing steps are[6]:

1. Data Cleaning

Data cleaning is a critical component of the data preprocessing phase, where the focus is on identifying and rectifying errors, inconsistencies, and inaccuracies in the dataset. The objective of data cleaning is to improve data quality, ensuring that the dataset is reliable and suitable for analysis[7]. Typical data cleaning tasks include:

● Handling missing values
● Outlier detection and handling
● Encoding categorical variables
● Removing duplicate records
● Scaling numerical features (feature scaling)

2. Data Integration

Data integration is the process of combining data from multiple sources or different data sets to create a unified and consolidated view of the data. It involves merging, transforming, and harmonizing data from various sources into a single coherent dataset. The goal of data integration is to enable efficient and effective analysis by providing a comprehensive and integrated view of the data[7].

3. Data Transformation

After data cleaning, the quality data is combined in different ways using the transformation strategies listed below, which alter the data's value, structure, or format[7]:

● Generalization
● Normalization
● Attribute Selection
● Aggregation
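Of the transformation strategies listed above, normalization is the simplest to illustrate. The sketch below shows min-max normalization, one common form that rescales a numeric column to the range [0, 1]. The report does not say which form of normalization, if any, was applied to this dataset, so the function name and the sample AGE values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: map everything to 0.0
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical AGE values, not taken from the real dataset.
ages = [40, 55, 70]
normalized_ages = min_max_normalize(ages)
print(normalized_ages)  # [0.0, 0.5, 1.0]
```

Rescaling of this kind matters mainly for distance-based classifiers such as K-nearest neighbors, where a wide-ranging column like AGE would otherwise dominate the 1/2-coded symptom columns.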

4. Data Reduction

Data reduction is the process of reducing the size or dimensionality of a dataset while preserving its essential information. It aims to eliminate redundant, irrelevant, or noisy data, making the dataset more manageable and efficient for analysis or modeling tasks[7].

D. Techniques chosen for analysis

The dataset can be used with various data science techniques to analyze and predict lung cancer. Some possible techniques include:

● Exploratory Data Analysis (EDA): This involves visualizing and summarizing the dataset to gain insights into the relationships between variables and identify patterns or trends.

● Classification Algorithms: Since the target variable is binary (lung cancer present or absent), classification algorithms such as Logistic Regression, Decision Trees, Random Forest, or Support Vector Machines can be employed to build predictive models.

● Feature Selection: Identifying the most relevant features that contribute to lung cancer prediction can be done using techniques like correlation analysis or feature importance ranking.

● Model Evaluation: Assessing the performance of the trained models using appropriate evaluation metrics such as accuracy, precision, recall, and F1-score to determine the effectiveness of the predictive models.

IV. PRELIMINARY ANALYSIS

A. Dataset Preprocessing
Data pre-processing is required before a predictive machine learning model can be built, and it is a crucial stage in creating the model. Compiling a dataset is a difficult task, and raw data sets often contain errors that must be addressed before modeling. Checking the format of the data set and the way the information is presented helps identify these errors in the first place. We should look for duplicate and empty values. Some columns may contain empty values, indicating that there is no data in them, while others may contain duplicate values, indicating that several rows carry the same information and add nothing new. In our dataset we found no null values. There are, however, duplicate rows, and we deal with them by dropping the rows that contain duplicate values. The data set also contains categorical data, and since most machine learning models cannot accept categorical data, we encode the categorical data and convert it to numeric form to make it easier to work with. In addition, to evaluate the performance of models accurately, the dataset is usually divided into training and testing subsets. The training set is used to train the model, while the test set is used to evaluate its performance. Common split ratios are 70:30 or 80:20, but this can vary depending on the size of the dataset and specific requirements.

After that, we check whether the dataset is balanced or imbalanced. In our case, we found that the dataset is imbalanced. Imbalanced classification refers to a classification problem in which the classes are unequally distributed. To handle the imbalance, we first decided to use under-sampling, which deletes some rows from the majority class to obtain an equal distribution. After applying it, we noticed that we got different accuracy results for the model every time we ran the code. We then decided to use another technique, resampled(), which up-samples the dataset so that the minority class matches the majority class.

B. Data Exploration & Preliminary Analysis
To explore the data set, we need the following steps:
Import data: We load the data from the “survey_lung_cancer.csv” file into a suitable data analysis tool or programming environment, such as Python with libraries like pandas and numpy.
Data exploration: We get an overview of the data set by examining its structure and contents, using pandas functions and attributes such as head(), shape, dtypes, and describe() to understand columns, data types, summary statistics, and any missing values.
Data cleaning: We check for missing values, duplicate records, and outliers in the data set. If necessary, we handle missing values through imputation or removal and address any data quality issues.
Identify relationships: We explore relationships between variables to identify any patterns or associations.
To conduct the initial analysis, we calculate summary statistics and explore variables separately for the two groups of individuals, with (LUNG_CANCER = YES) representing people with lung cancer and (LUNG_CANCER = NO) representing people without lung cancer.
Formulating hypotheses: Based on the preliminary analysis described above, we formulate hypotheses about potential factors associated with lung cancer. For example:
● Hypothesis 1: Smoking is positively associated with the presence of lung cancer.
● Hypothesis 2: Age is an important factor in the development of lung cancer.
● Hypothesis 3: Chronic illness increases the probability of developing lung cancer.
● Hypothesis 4: Lung cancer is accompanied by shortness of breath, but shortness of breath alone does not necessarily indicate lung cancer.

C. Visualizations
Data visualization is an essential component of data science. Visualization tools help people understand difficult ideas more easily. Visualization has several advantages, including the ability to explore, observe, and explain data. Both technical and non-technical people can better comprehend and interpret anything linked to data with the use of visualization. Technical personnel look for patterns in the data, establish correlations between variables, and identify anomalies during the pre-processing phase. Non-technical people try to understand the data and how it might help them solve problems.

In this project, the data was visualized using bar charts to show the relationship between age and developing lung cancer, and the relationship between chronic disease and the likelihood of developing lung cancer. In addition, we used pie charts to show the class proportions of the imbalanced and balanced data. Here we provide an overview of the visualizations we used.

Figure 1: Lung Cancer Factors: Relationship with Gender

Figure 1 shows a set of plots, each presenting the relationship between a specific factor and lung cancer by gender. Each subplot displays the count of individuals with lung cancer ("YES") categorized by gender and the presence or absence of a particular factor. Factors include yellow fingers, anxiety, chronic disease, chest pain, fatigue, wheezing, coughing, shortness of breath, swallowing difficulty, and allergy.

Figure 2: Age distribution by lung cancer

Figure 2 shows the age distribution of individuals with and without lung cancer, comparing age against the presence or absence of the disease. The chart also shows that the largest number of people with lung cancer is between the ages of 55 and 65.

Figure 3: imbalanced data pie chart

Figure 3 shows the class distribution of the data as a pie chart. Samples labeled “yes” for lung cancer make up 87.4% of the data, and samples labeled “no” make up 12.6%. The graph reveals a major imbalance in the data set, as the number of samples classified as having lung cancer far exceeds the number classified as unaffected. This severe class imbalance means the affected group is disproportionately represented compared to the unaffected group.

Figure 4: balanced data pie chart

Figure 4 shows a pie chart of the data after balancing, which removes the bias seen in Figure 3 so that the number of people with and without lung cancer is equal.
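The up-sampling step that leads from the imbalanced distribution in Figure 3 to the balanced one in Figure 4 can be sketched as follows. The report does not name the library behind its resampling call; scikit-learn's `sklearn.utils.resample` is a function commonly used for this, and the standard-library sketch below only imitates the idea of drawing minority-class rows with replacement. The function name `upsample_minority`, the seed value, and the eight sample rows are hypothetical.

```python
import random
from collections import Counter


def upsample_minority(rows, label_key="LUNG_CANCER", seed=42):
    """Return a shuffled copy of rows in which every class has as many
    rows as the largest class, by sampling with replacement."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    target = max(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed so repeated runs give the same rows

    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw extra rows (with replacement) until this class reaches target.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical sample: 7 YES rows and 1 NO row, mimicking a strong imbalance.
rows = [{"AGE": 50 + i, "LUNG_CANCER": "YES"} for i in range(7)]
rows.append({"AGE": 45, "LUNG_CANCER": "NO"})

balanced = upsample_minority(rows)
counts = Counter(row["LUNG_CANCER"] for row in balanced)
print(dict(counts))
```

Fixing the random seed is one way to avoid the run-to-run variation in accuracy that the report observed with under-sampling. Up-sampling is usually applied to the training subset only, so that duplicated rows do not leak into the test set.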

Figure 5: Relationship between chronic diseases and lung cancer

Figure 5 shows a bar chart that illustrates the relationship between chronic disease and the likelihood of developing lung cancer. On the x-axis, we categorize individuals into two groups: those without a chronic disease (labeled as '0') and those with a chronic disease (labeled as '1'). The y-axis represents the proportion of individuals diagnosed with lung cancer within each group.

The height of each bar indicates the prevalence of lung cancer in the respective group. A higher bar for one group than the other would suggest a greater incidence of lung cancer in that group, providing insight into whether the presence of chronic disease is associated with an increased risk of developing lung cancer.

This visualization is useful for understanding potential risk factors for lung cancer and can be instrumental in guiding both clinical research and public health policies.

CONCLUSION
In conclusion, this research project aims to predict lung cancer using machine learning techniques, focusing on early detection and improved patient outcomes. Milestone 1 provided background information on lung cancer and highlighted the significance of data science in healthcare. We discussed the potential of algorithms like Naïve Bayes, Support Vector Machines, Decision Trees, and K-nearest neighbors for accurate predictions.

Moving forward to Milestone 2, we aim to refine and optimize the predictive model. We will fine-tune algorithms, optimize feature selection, and evaluate performance using appropriate metrics. Expanding the dataset will enhance the model's generalizability and help validate its effectiveness in real-world scenarios.

ACKNOWLEDGMENT
We extend our gratitude to King Faisal University for offering the informative data science course and other resources. We are also thankful to our instructor, Prof. Suresh Sankaranarayanan, for his guidance and expertise. Their contributions have been invaluable in shaping this research project. Thank you for making this research possible.

REFERENCES

[1] Mamun, M., Farjana, A., Al Mamun, M., & Ahammed, M. S. (2022, June). Lung cancer prediction model using ensemble learning techniques and a systematic review analysis. In 2022 IEEE World AI IoT Congress (AIIoT) (pp. 187-193). IEEE.

[2] Ngo, H., & Le, H. (2022). A Prediction Model for Lung Cancer Levels Based on Machine Learning.

[3] Heuvelmans, M. A., van Ooijen, P. M., Ather, S., Silva, C. F., Han, D., Heussel, C. P., ... & Oudkerk, M. (2021). Lung cancer prediction by Deep Learning to identify benign lung nodules. Lung Cancer, 154, 1-4.

[4] Krishnaiah, V., Narsimha, G., & Chandra, N. S. (2013). Data mining classification techniques for lung cancer disease prediction. International Journal of Computer Science and Information Technologies, 4(1), 39–45.

[5] Breslow, N. E., & Day, N. E. (1980). Statistical methods in cancer research. Volume I - The analysis of case-control studies. IARC Scientific Publications.

[6] Geisler Mesevage, T. (2021, May 24). What Is Data Preprocessing & What Are The Steps Involved? MonkeyLearn Blog. https://monkeylearn.com/blog/data-preprocessing/

[7] Data Preprocessing in Machine Learning [Steps & Techniques]. (n.d.). Www.v7labs.com. https://www.v7labs.com/blog/data-preprocessing-guide#h3
