The Relationship Between Lung Cancer Prevalence and Air
Volume 99 (2024)
1. Introduction
Lung cancer is the second most common cancer in the world and causes the most cancer-related
deaths [1, 2]. Based on available data, the World Health Organization estimates that 239,340 new
cases of lung cancer will occur in 2023. The estimated number of deaths is as high as 127,070, or
20.8% of all cancer deaths [3]. Lung cancer ranks among the top causes of cancer-related fatalities in
93 countries. Compared with patients with other cancers, lung cancer patients still have a low
probability of surviving more than five years, with a relative survival rate of only 25.4 percent [4].
This poses a serious threat to society as a whole. Researchers have conducted a large number of
studies on lung cancer and have broadly divided it into two main categories: non-small cell lung
cancer (NSCLC) and small cell lung cancer (SCLC). NSCLC accounts for about 80% of lung cancer
cases [5]. A third, less common type of lung cancer is called carcinoid. By understanding the
pathogenic factors that cause lung cancer, people can prevent it or detect it as early as possible,
which can increase the survival rate of patients and potential patients. Fortunately, with the
continuous development of medical technology, the death rate of lung cancer patients is decreasing
year by year.
Smoking and air pollution are the two main causes of lung cancer [6]. According to researchers,
polluted air contains at least two classes of carcinogens: polycyclic aromatic hydrocarbons (PAHs)
and N-nitroso compounds. Particulate matter, especially PM2.5, can promote lung cancer. It has a
high PAH concentration, which can induce oxidative stress and aryl hydrocarbon receptor activation
in human bronchial epithelial cells. Aryl hydrocarbon receptor activation allows cancer cells to
invade normal cells and metastasize to other parts of the body [7]. Smoking directly increases the
risk of lung cancer because smoke releases nitrosamines and other known carcinogens [8]. Air
pollution and smoking act together as causes of lung cancer. In addition to these two factors, age,
gender, diet and other factors can also affect the probability of developing lung cancer [9]. If lung
cancer is detected while it is less severe, the survival rate of patients improves greatly.
At present, the models used in clinical practice consider a limited set of factors, and their
prediction accuracy needs to be improved. Meanwhile, the rapid development of machine learning
technology makes it possible to build more accurate prediction models for lung cancer
Highlights in Science, Engineering and Technology BEEFM 2024
severity. Lung cancer caused by smoking is increasing every year and will continue to increase in the
future [10]. If an individual estimate of disease probability alerts people to their physical condition
and prompts them to take preventive measures early, the number of people who develop lung cancer
may be reduced. Therefore, this paper aims to establish a lung cancer risk prediction model based on
machine learning algorithms. In addition, this paper selects some potentially related factors and
compares the predictive performance of various algorithms in order to establish a more accurate and
appropriate prediction model for lung cancer severity. Data on 152 lung cancer patients came from
the Kaggle website and included 24 related characteristics such as air pollution, alcohol
consumption, dust allergies, genetics and smoking status. The models were built with logistic
regression, K-nearest neighbors and support vector machines. Part of the dataset is held out as a
test dataset to examine the strengths and weaknesses of each model.
2. Methodology
2.1. Data Source
The lung cancer patient data used in this paper come from the Kaggle website. The original data
were saved in .CSV format.
2.2. Data Introduction
The data used in this article contain 152 instances and 26 variables without any missing values.
All 26 variables are listed in Table 1. Index, Patient.Id and Age are numeric; the remaining
variables are categorical, most of them coded as ordinal levels.
Table 1. Different types of variables
Term                        Type          Value range
Index [1]                   Numeric       1-152
Patient.Id                  Numeric       152
Age                         Numeric       14-73
Gender                      Categorical   Male/Female
Air.Pollution               Categorical   1-8
Alcohol.use                 Categorical   1-8
Dust.Allergy                Categorical   1-8
OccuPational.Hazards        Categorical   1-8
Genetic.Risk                Categorical   1-7
Chronic.Lung.Disease        Categorical   1-7
Balanced.Diet               Categorical   1-7
Obesity                     Categorical   1-7
Smoking                     Categorical   1-8
Passive.Smoker              Categorical   1-8
Chest.Pain                  Categorical   1-9
Coughing.of.Blood           Categorical   1-9
Fatigue                     Categorical   1-9
Weight.Loss                 Categorical   1-8
Shortness.of.Breath         Categorical   1-9
Wheezing                    Categorical   1-8
Swallowing.Difficulty       Categorical   1-8
Clubbing.of.Finger.Nails    Categorical   1-9
Frequent.Cold               Categorical   1-7
Dry.Cough                   Categorical   1-7
Snoring                     Categorical   1-7
Level                       Categorical   1-3
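A summary such as Table 1 can be produced directly from the .CSV file. The following is a minimal sketch in Python using only the standard library; it is not the code used in this study. The function names and the three sample rows are hypothetical and made up for illustration, and they are not taken from the dataset. Note that this sketch labels any column whose values all parse as numbers as numeric, so ordinal levels such as 1-8 would need to be marked as categorical by hand, as Table 1 does.

```python
import csv
import io


def summarize_columns(csv_text):
    """Return {column: (kind, summary)} for CSV text with a header row.

    A column is reported as numeric (with its min and max) when every
    value parses as a number; otherwise its distinct values are listed.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        try:
            numbers = [float(value) for value in values]
            summary[column] = ("numeric", (min(numbers), max(numbers)))
        except (TypeError, ValueError):
            summary[column] = ("categorical", sorted(set(values)))
    return summary


def count_missing(csv_text):
    """Count cells that are absent or contain only whitespace."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sum(
        1
        for row in rows
        for value in row.values()
        if value is None or value.strip() == ""
    )


# Made-up sample rows (assumption, not real patient records).
SAMPLE_CSV = "Age,Gender,Smoking,Level\n33,Male,3,1\n17,Female,2,2\n35,Male,7,3\n"

if __name__ == "__main__":
    for name, (kind, info) in summarize_columns(SAMPLE_CSV).items():
        print(name, kind, info)
    print("missing cells:", count_missing(SAMPLE_CSV))
```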
Figure 1 shows the gender distribution of this dataset. From the bar chart, we can see that there is
little difference in the ratio of men to women in this data set, which is a relatively balanced data set.
Fig 3. Each variable was associated with the severity of lung cancer
2.3. Method Introduction
In this study, several classification models are fitted to the variables described above. The
methods include K-Nearest Neighbors (KNN), support vector machines (SVM), decision tree, random
forest and multiple logistic regression. For every method, the first step is to divide the data set
into a training set and a test set. The model is then fitted to the training set, its predictions are
verified on the test set, and the percentage of correct predictions (the accuracy) is calculated for
each model. Finally, the accuracies are compared to identify the best model.
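The split, fit, verify and compare procedure can be sketched as follows. This is a minimal illustration in plain Python, not the code used in this study. The 30% test fraction, the random seed, the function names and the `MajorityClassifier` placeholder are all assumptions made for the example; any model object with `fit` and `predict` methods could be passed in its place.

```python
import random
from collections import Counter


def train_test_split(rows, labels, test_fraction=0.3, seed=0):
    """Shuffle the row indices and split rows and labels into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


def accuracy(predicted, actual):
    """Share of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


class MajorityClassifier:
    """Placeholder model (assumption): always predicts the most common training label."""

    def fit(self, rows, labels):
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, rows):
        return [self.label_ for _ in rows]


def compare_models(models, rows, labels, test_fraction=0.3, seed=0):
    """Fit every model on the same training split and score it on the same test split."""
    x_train, x_test, y_train, y_test = train_test_split(
        rows, labels, test_fraction, seed
    )
    scores = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        scores[name] = accuracy(model.predict(x_test), y_test)
    best = max(scores, key=scores.get)
    return scores, best
```

Using one shared split for all models keeps the accuracies comparable, since every model is then scored on exactly the same held-out observations.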
of active smoking, exerting a more pronounced effect on the severity of lung cancer. This observation
may be attributed to the fact that passive smokers are unable to excrete the harmful substances
produced by smoking immediately after inhaling them. Another plausible explanation might be that
individuals exposed to passive smoke are not as frequently immersed in smoke, potentially leading
to a heightened sensitivity or reaction to the occasional exposure.
3.4. KNN Model
For larger datasets where efficiency is a concern, the K-Nearest Neighbors (KNN) algorithm is also
a reasonable choice. It classifies a new observation according to the labels of nearby observations,
which gives it some resistance to noise: a few outliers in the dataset have little impact on the
overall predictive results. However, the efficacy of KNN depends heavily on the parameter 'k', the
number of neighbors considered during classification. Because KNN looks only at points close to the
new observation, both the labels of the surrounding points and the number of points consulted are
pivotal for accurate predictions. The value of 'k' therefore strongly influences the model's
accuracy, and tuning it through cross-validation can maximize predictive performance. Figure 6
shows how model accuracy varies with 'k'. The KNN model reaches its peak accuracy of 0.8621 at a
'k' value of 6. Despite this result, KNN has an inherent limitation: its localized focus tends to
overlook the broader structure of the data.
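The neighbor vote and the search over 'k' can be illustrated with a short sketch. The code below is plain Python and is not the implementation used in this study. It assumes Euclidean distance on numerically coded features, and for brevity it scores each candidate 'k' on a single held-out validation set rather than by cross-validation. The function names are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training rows."""
    neighbors = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance (assumption)
    )
    votes = Counter(label for _, label in neighbors[:k])
    # On a tied vote, the label seen first (the one with the nearest neighbor) wins.
    return votes.most_common(1)[0][0]


def accuracy_for_k(train_rows, train_labels, val_rows, val_labels, k):
    """Share of validation rows whose predicted label matches the true label."""
    hits = sum(
        knn_predict(train_rows, train_labels, row, k) == label
        for row, label in zip(val_rows, val_labels)
    )
    return hits / len(val_labels)


def best_k(train_rows, train_labels, val_rows, val_labels, candidates):
    """Score every candidate k and return the best one together with all scores."""
    scores = {
        k: accuracy_for_k(train_rows, train_labels, val_rows, val_labels, k)
        for k in candidates
    }
    best = max(scores, key=scores.get)  # the first candidate wins a tie
    return best, scores
```

Plotting the returned scores against the candidate values gives an accuracy curve of the kind described for Figure 6. Replacing the single validation set with several folds and averaging the scores would turn this into the cross-validated tuning mentioned above.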
4. Conclusion
This study underscores the importance of adopting healthy behaviors to minimize the risk of severe
lung cancer. Recommendations include reducing smoking, avoiding second-hand smoke, and
residing in areas with good air quality whenever possible. In instances of poor air quality, it is advised
to limit outdoor activities, and if necessary, wearing a mask can help mitigate the inhalation of PM2.5.
These practices, as suggested by the model, can contribute to a decrease in the likelihood of
developing severe lung cancer. Furthermore, the model provides individuals with a valuable tool to
assess their risk and potentially seek timely medical attention for early intervention, thereby
improving the chances of successful treatment. However, it's crucial to acknowledge that the uneven
distribution of individuals living in different air quality levels within the dataset may limit the model's
general applicability. Future studies should focus on collecting more diverse clinical data to enhance
the model's universality and ensure its relevance across a broader range of situations. This approach
will contribute to the model's effectiveness in providing personalized insights and guidance for
individuals from various backgrounds and living conditions.
References
[1] Nash J, Brims F. International standards of care in thoracic oncology: A narrative review of clinical quality
indicators. Lung Cancer, 2023, 186: 107421.
[2] Sung H, Ferlay J, Siegel R L, et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence
and Mortality Worldwide for 36 Cancers in 185 Countries. CA: a Cancer Journal for Clinicians, 2021,
71(3): 209-249.
[3] Al Bakir M, Huebner A, Martinez Ruiz C, et al. The evolution of non-small cell lung cancer
metastases in TRACERx. Nature, 2023, 616: 534-542.
[4] Padinharayil H, Varghese J, John M C, et al. Non-small cell lung carcinoma (NSCLC):
Implications on molecular pathology and advances in early diagnostics and therapeutics. Working paper,
2022.
[5] Traub B, Link K H, Kornmann M. Curing pancreatic cancer. Seminars in Cancer Biology, 2021, 76:
232-246.
[6] Berg C D, Schiller J H, Boffetta P, et al. Air pollution and lung cancer: A Review by International
Association for the Study of Lung Cancer Early Detection and Screening Committee. Journal of Thoracic
Oncology, 2023.
[7] Samet J M. Particulate Air Pollution and Mortality. Epidemiology, 1995, 6(5): 471-472.
[8] Jia X, Sheng C, Han X, Li M, Wang K. Global burden of stomach cancer attributable to smoking from
1990 to 2019 and predictions to 2044. Public Health, 2023, 226: 182-189.
[9] Loomis D, Grosse Y, Lauby-Secretan B, et al. The carcinogenicity of outdoor air pollution. The Lancet
Oncology, 2013, 14(13): 1262-1263.
[10] Liu Y, Lu L, Yang H, et al. Dysregulation of immunity by cigarette smoking promotes inflammation and
cancer: A review. Environmental Pollution, 2023, 339: 122730.