Liver Disease Prediction Using Machine Learning Final
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted by:
Department of Computer Science and Engineering
CERTIFICATE
This is to certify that the project work entitled "LIVER DISEASE
PREDICTION USING MACHINE LEARNING" submitted by Sk Ashar
Uddin, Sabarna Maity, Rameshwar Mangal, Saswata Mondal, in fulfillment of
the requirements for the degree of Bachelor of Technology in Computer Science
and Engineering at Future Institute of Technology, was carried out under the
guidance and supervision of Professor Dr. Soumadip Ghosh and is a bona fide
work done by them.
This report has not been submitted for the award of any other degree at this or
any other Institute/University.
ACKNOWLEDGEMENT
---------------------------------------
Sabarna Maity
(University Roll Number:34200121020)
---------------------------------------
Rameshwar Mangal
(University Roll Number:34200121002)
---------------------------------------
Saswata Mondal
(University Roll Number:34200121042)
LIVER DISEASE PREDICTION USING
MACHINE LEARNING
ABSTRACT
Liver disease is a significant global health concern, with various conditions such
as cirrhosis, hepatitis, and liver cancer affecting millions of people worldwide.
Early diagnosis is crucial to prevent severe complications and improve
treatment outcomes. However, traditional diagnostic methods can be invasive,
costly, and time-consuming. This project aims to develop a machine learning
(ML)-based prediction system to automate the diagnosis of liver disease. Using
a dataset containing various medical parameters such as age, gender, bilirubin
levels, and liver enzymes, several machine learning algorithms, including
Logistic Regression, Decision Trees, Random Forest, and Support Vector
Machines (SVM), were applied to predict the presence or absence of liver
disease. The dataset was pre-processed to handle missing values, scale features,
and encode categorical data. Multiple models were trained and evaluated using
performance metrics such as accuracy, precision, recall, and F1 score. The best-
performing model was selected based on its ability to accurately predict liver
disease. The final model demonstrated promising results, providing a reliable,
non-invasive tool for early liver disease detection. This machine learning-based
approach can assist healthcare professionals in making quicker, more accurate
diagnoses, potentially improving patient outcomes and reducing healthcare
costs. Future work may focus on improving model accuracy, integrating
additional features, and deploying the model in real-world clinical settings for
broader use.
CONTENTS:
INTRODUCTION
OBJECTIVES
SCOPE OF THE SYSTEM
LITERATURE REVIEW
ABOUT THE DATASET
METHODOLOGY
SYSTEMS REQUIREMENTS
EXPECTED OUTCOMES
WORK PROGRESS REPORT
REFERENCES
INTRODUCTION
Liver diseases, including cirrhosis, hepatitis, and liver cancer, are major health
concerns globally, with early diagnosis crucial for effective treatment.
Traditional diagnostic methods, while effective, can be costly, time-consuming,
and sometimes invasive. Moreover, many liver diseases show few early
symptoms, leading to delays in diagnosis.
This project explores the use of machine learning (ML) techniques to predict
liver disease using medical data such as age, gender, bilirubin levels, and liver
enzyme values. By leveraging ML algorithms, we aim to develop a non-
invasive, accurate, and efficient system for early liver disease detection. The
goal is to assist healthcare professionals in making timely, data-driven
decisions, ultimately improving patient outcomes and reducing healthcare costs.
OBJECTIVES
Data Analysis and Exploration: To analyse and understand the dataset,
including identifying key features that influence liver disease diagnosis.
Data Preprocessing: To clean and preprocess the data by handling missing
values, normalizing numerical features, and encoding categorical variables for
model training.
Model Development: To apply and compare multiple machine learning
algorithms (e.g., Logistic Regression, Decision Trees, Random Forest, Support
Vector Machines) to predict the presence or absence of liver disease.
Model Evaluation: To evaluate the performance of the models using metrics
such as accuracy, precision, recall, F1 score, and confusion matrix.
Model Selection: To select the best-performing machine learning model based
on evaluation metrics for liver disease prediction.
Deployment Considerations: To explore the potential of deploying the model as
a practical tool for assisting healthcare professionals in making early and
accurate liver disease diagnoses.
Future Enhancements: To suggest future improvements in model accuracy,
scalability, and integration into real-world clinical settings.
SCOPE OF THE SYSTEM
5. Deployment and Integration (Future Scope):
o The current scope of the system is to build and evaluate the model
using existing datasets.
o Future work can focus on deploying the model for real-time
predictions in clinical environments, integrating it with healthcare
systems, or using it as a decision support tool for healthcare
professionals.
6. Real-World Application:
o The system is designed to assist in early detection and diagnosis of
liver diseases, reducing dependency on invasive tests and
improving diagnostic efficiency.
o The tool can be used in healthcare settings to predict liver disease
risk based on patient data, assisting doctors in decision-making.
7. Limitations:
o The model's performance depends on the quality and
representativeness of the data used. The current dataset may not
fully account for all potential variables and real-world
complexities.
o The model will not replace clinical judgment but will act as an
auxiliary tool to help in early diagnosis.
8. Future Enhancements:
o The system can be expanded by incorporating more advanced
techniques such as deep learning or additional data sources (e.g.,
imaging data or genetic factors) to improve prediction accuracy.
o Integration with electronic health records (EHR) or mobile
applications for real-time diagnostic assistance can be explored.
This scope focuses on building a predictive tool for liver disease diagnosis using
machine learning, evaluating its potential impact in healthcare, and identifying
areas for future development.
LITERATURE REVIEW
The use of machine learning (ML) in medical diagnostics has been widely
explored in recent years, and liver disease prediction is one of the many
applications where ML techniques can play a significant role. The literature on
this subject highlights various methodologies, datasets, and performance
evaluations of different ML models in predicting liver diseases.
1. Liver Disease Diagnosis and Challenges
Liver disease encompasses a broad range of conditions, including cirrhosis,
hepatitis, fatty liver disease, and liver cancer. According to the World Health
Organization (WHO), liver diseases are among the leading causes of global
morbidity and mortality. However, early diagnosis of liver diseases is often
challenging due to the absence of clear symptoms in the early stages, leading to
delayed interventions. Traditional diagnostic methods such as blood tests,
imaging, and liver biopsy can be invasive or costly, and they may not always
provide accurate results, especially in the early stages of the disease.
2. Traditional Diagnostic Methods
• Blood Tests: Liver function tests (LFTs) are commonly used to assess
liver health. These tests measure enzymes such as ALT, AST, ALP, and
bilirubin, which can indicate liver damage. However, abnormal levels do
not always correlate with the severity or type of liver disease, leading to
diagnostic ambiguity.
• Imaging: Techniques like ultrasound, CT scans, and MRI are effective for
diagnosing liver abnormalities but are expensive and not suitable for
routine screening, especially in low-resource settings.
• Liver Biopsy: Although the gold standard for confirming liver disease,
liver biopsy is invasive, carries risk, and is expensive.
3. Machine Learning in Medical Diagnosis
Machine learning techniques have gained traction in healthcare for their ability
to analyze large datasets and uncover patterns that are difficult for humans to
detect. In liver disease prediction, several ML models have been used, including
classification and regression techniques.
• Decision Trees and Random Forests: Decision trees have been frequently
used for medical diagnosis due to their interpretability. Random Forest,
an ensemble method of decision trees, has been employed to improve
accuracy and robustness. Studies such as those by Dinesh et al. (2020)
have shown that Random Forest classifiers can achieve high accuracy in
liver disease prediction.
• Support Vector Machines (SVM): SVM has been widely used in binary
classification tasks like liver disease detection. Shinde et al. (2019) found
that SVM performs well in distinguishing between liver disease and non-
disease classes, especially when combined with feature selection
techniques.
• Logistic Regression: Simple yet effective, logistic regression has been
applied to predict liver diseases based on clinical parameters. However,
its performance can be limited when dealing with complex and non-linear
relationships in the data.
4. Datasets Used for Liver Disease Prediction
Several datasets have been used in research on liver disease prediction. The
Indian Liver Patient Dataset (ILPD) is one of the most frequently used,
containing various features such as age, gender, bilirubin levels, and liver
enzyme values. The dataset has been used in multiple studies to train ML
models for liver disease prediction, as seen in the work of Gonzalez et al.
(2020). However, issues such as missing data, class imbalance (ILPD contains
far fewer non-disease records than liver disease records), and the need for data
normalization are challenges that researchers often encounter.
5. Evaluation Metrics
To evaluate the performance of machine learning models in liver disease
prediction, metrics such as accuracy, precision, recall, F1-score, and AUC-ROC
(Area Under the Receiver Operating Characteristic Curve) are commonly
used. Kumar et al. (2018) showed that models with high accuracy, precision,
and recall perform well in the clinical context, providing a reliable means of
early detection. However, precision usually trades off against recall, and recall
deserves particular attention because false negatives result in missed diagnoses.
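The trade-off between precision and recall, and the meaning of AUC-ROC, can be illustrated with a minimal sketch. It assumes scikit-learn is available. The four labels and scores are invented for illustration and are not taken from any cited study, and the helper `precision_recall_at` is a hypothetical name.

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Illustrative labels (1 = liver disease) and predicted probabilities.
# These values are invented for the example, not real patient data.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC-ROC: probability that a random positive is scored above a random negative.
auc = roc_auc_score(y_true, y_score)


def precision_recall_at(threshold):
    """Turn scores into hard predictions at a threshold, then score them."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    return (
        precision_score(y_true, y_pred, zero_division=0),
        recall_score(y_true, y_pred, zero_division=0),
    )


strict = precision_recall_at(0.5)   # fewer positives flagged
lenient = precision_recall_at(0.3)  # more positives flagged

print(f"AUC-ROC: {auc:.2f}")
print(f"threshold 0.5 -> precision {strict[0]:.2f}, recall {strict[1]:.2f}")
print(f"threshold 0.3 -> precision {lenient[0]:.2f}, recall {lenient[1]:.2f}")
```

Lowering the threshold recovers the missed positive case at the cost of one false alarm, which is the kind of choice a screening tool has to make explicitly.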
6. Recent Advancements in ML for Liver Disease
Recent advancements in deep learning have also been applied to liver disease
prediction. Convolutional Neural Networks (CNNs) and Recurrent Neural
Networks (RNNs) have shown promising results when combined with medical
imaging data such as MRI scans and ultrasound images. Although these
methods require large datasets and significant computational power, they have
demonstrated high potential for improving prediction accuracy.
7. Key Findings and Insights from the Literature
• Effectiveness of ML Algorithms: Random Forest and SVM have
consistently outperformed other algorithms in liver disease prediction,
particularly when dealing with imbalanced datasets and noisy data.
• Data Preprocessing Importance: Proper handling of missing data,
normalization, and feature engineering is crucial for improving model
performance and ensuring reliable predictions.
• Challenges with Datasets: The lack of a large, diverse dataset and class
imbalance are recurring challenges in liver disease prediction, and these
issues can affect the generalization of models.
8. Gaps in Existing Research
While machine learning models have shown promise in liver disease prediction,
there are still several gaps in existing research:
• Limited Use of Real-World Data: Many studies use publicly available
datasets, but these datasets may not accurately represent the diversity of
real-world patient populations.
• Model Generalization: Most models are trained on specific datasets and
may not generalize well to different populations or healthcare settings.
• Integration into Clinical Practice: There is limited research on the
practical deployment of these models in clinical environments for real-
time prediction and decision-making support.
ABOUT THE DATASET
1. Context
The number of patients with liver disease has been continuously increasing
because of excessive consumption of alcohol, inhalation of harmful gases, and
intake of contaminated food, pickles and drugs. This dataset was used to
evaluate prediction algorithms in an effort to reduce the burden on doctors.
2. Content
This data set contains 416 liver patient records and 167 non-liver patient
records collected from the north-east of Andhra Pradesh, India. The
"Dataset" column is a class label used to divide the records into liver patient
(liver disease) or not (no disease). This data set contains 441 male patient
records and 142 female patient records.
3. Columns:
• Age of the patient
• Gender of the patient
• Total Bilirubin
• Direct Bilirubin
• Alkaline Phosphatase
• Alanine Aminotransferase
• Aspartate Aminotransferase
• Total Proteins
• Albumin
• Albumin and Globulin Ratio
• Dataset: field used to split the data into two sets (patient with liver
disease, or no disease)
Any patient whose age exceeded 89 is listed as being of age "90".
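A quick sanity check of this structure can be sketched with pandas. The column identifiers below are assumptions modelled on the list above, and the actual CSV header may be spelled differently. The three demo rows are invented only to show the call, and the 1/0 label coding is also an assumption. The helper `summarize` is a hypothetical name.

```python
import pandas as pd

# Hypothetical column names modelled on the column list; adjust to the real header.
EXPECTED_COLUMNS = [
    "Age", "Gender", "Total_Bilirubin", "Direct_Bilirubin",
    "Alkaline_Phosphatase", "Alanine_Aminotransferase",
    "Aspartate_Aminotransferase", "Total_Proteins", "Albumin",
    "Albumin_and_Globulin_Ratio", "Dataset",
]


def summarize(df: pd.DataFrame) -> dict:
    """Check that all expected columns exist and report basic counts."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return {
        "rows": len(df),
        "class_counts": df["Dataset"].value_counts().to_dict(),
        "gender_counts": df["Gender"].value_counts().to_dict(),
        "max_age": int(df["Age"].max()),  # should never exceed 90 (ages are capped)
    }


# Three invented rows, only to demonstrate the call; not taken from the dataset.
demo = pd.DataFrame(
    [
        [58, "Female", 0.8, 0.2, 190, 20, 25, 6.9, 3.4, 0.95, 1],
        [47, "Male", 4.2, 2.0, 410, 70, 95, 7.1, 3.0, 0.70, 1],
        [23, "Male", 0.9, 0.3, 175, 22, 21, 7.3, 4.2, 1.30, 0],
    ],
    columns=EXPECTED_COLUMNS,
)
summary = summarize(demo)
print(summary)
```

Run against the full dataset, the same function should report 583 rows, the 416/167 class split, and the 441/142 gender split described in the Content section.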
4. Acknowledgements
This dataset was downloaded from the UCI ML Repository:
METHODOLOGY
1. Data Collection
• Dataset Source: The dataset used for this project is the Indian Liver
Patient Dataset (ILPD), which contains medical records of patients
including both liver disease and non-liver disease cases. It includes
features such as age, gender, bilirubin levels, liver enzymes (AST, ALT,
etc.), albumin levels, and others.
• Dataset Overview: The dataset consists of 583 instances (416 liver patient
and 167 non-liver patient records) with 10 features (attributes)
representing various aspects of liver health, plus the target variable
indicating whether the patient has liver disease (1) or not (0).
2. Data Preprocessing
Data preprocessing is a critical step in machine learning to prepare the dataset
for model training and to improve the performance of the algorithms.
• Handling Missing Values: Missing values in the dataset were identified
and imputed using appropriate techniques. For numerical features,
missing values were replaced with the mean or median, while for
categorical features, the mode was used.
• Feature Encoding: Categorical features such as gender were converted
into numerical values using techniques like Label Encoding or One-Hot
Encoding.
• Data Normalization/Scaling: Since the dataset contains features with
varying ranges, feature scaling (Standardization or Min-Max scaling) was
applied to normalize the data and bring all features to a similar scale.
• Feature Selection: Irrelevant or redundant features were removed using
techniques like correlation analysis and feature importance to ensure that
only the most significant attributes were included in the model training.
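The imputation, encoding, and scaling steps above could be combined into a single scikit-learn pipeline, as in the following sketch. It is one possible arrangement under stated assumptions, not necessarily the exact configuration used in this project. The column names are hypothetical, the demo rows are invented, median imputation and one-hot encoding are picked from the options listed above, and feature selection is left out.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; adjust to the actual CSV header.
numeric_cols = ["Age", "Total_Bilirubin", "Albumin_and_Globulin_Ratio"]
categorical_cols = ["Gender"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill gaps with the median
    ("scale", StandardScaler()),                   # zero mean, unit variance
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),  # fill gaps with the mode
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", categorical_steps, categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Invented rows with deliberate gaps, only to exercise the pipeline.
demo = pd.DataFrame({
    "Age": [45, 60, 30, 52],
    "Total_Bilirubin": [0.8, 3.1, np.nan, 1.2],
    "Albumin_and_Globulin_Ratio": [1.0, np.nan, 1.3, 0.9],
    "Gender": pd.Series(["Male", "Female", np.nan, "Male"], dtype=object),
})
features = preprocessor.fit_transform(demo)
print(features.shape)  # 3 scaled numeric columns + 2 one-hot gender columns
```

Fitting the preprocessor on the training split only, and then applying it to the test split, keeps test-set statistics from leaking into the imputation and scaling parameters.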
3. Exploratory Data Analysis (EDA)
• Data Visualization: Various plots such as histograms, boxplots, and
correlation matrices were generated to understand the distribution of
features and the relationships between them.
• Class Distribution: The class distribution (liver disease vs. non-liver
disease) was analyzed to check for class imbalance. If an imbalance was
detected, strategies like oversampling (SMOTE) or undersampling were
considered to balance the dataset.
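The class-distribution check and a simple balancing step can be sketched as follows. SMOTE itself lives in the separate imbalanced-learn package, so this sketch substitutes plain random oversampling with `sklearn.utils.resample`, which duplicates minority rows rather than synthesizing new ones. The helper name `oversample_minority` and the 1/0 label coding are assumptions.

```python
import pandas as pd
from sklearn.utils import resample


def oversample_minority(df, label_col, random_state=0):
    """Duplicate minority-class rows at random until both classes are equal."""
    counts = df[label_col].value_counts()
    if counts.max() == counts.min():
        return df.copy()  # already balanced
    minority = df[df[label_col] == counts.idxmin()]
    extra = resample(
        minority,
        replace=True,
        n_samples=int(counts.max() - counts.min()),
        random_state=random_state,
    )
    return pd.concat([df, extra], ignore_index=True)


# Labels mirroring the 416 / 167 split reported for the dataset (1 = liver disease).
labels = pd.DataFrame({"Dataset": [1] * 416 + [0] * 167})

before = labels["Dataset"].value_counts(normalize=True).round(3).to_dict()
balanced = oversample_minority(labels, "Dataset")
after = balanced["Dataset"].value_counts().to_dict()

print("class share before:", before)
print("class counts after:", after)
```

Any resampling of this kind should be applied to the training split only, so that the test set keeps the original class ratio.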
4. Model Development
• Algorithm Selection: Several machine learning algorithms were selected
to develop predictive models for liver disease prediction:
o Logistic Regression: A simple but effective algorithm used for
binary classification tasks.
o Decision Tree: A decision tree-based model that provides
interpretable results by splitting the data based on feature values.
o Random Forest: An ensemble method of decision trees that
improves prediction accuracy by averaging the results of multiple
trees.
o Support Vector Machine (SVM): A powerful classification
algorithm that tries to find the optimal hyperplane separating the
classes.
• Training and Testing Split: The dataset was split into a training set
(80%) and a testing set (20%) using random sampling to ensure the model
could generalize well on unseen data.
• Cross-Validation: K-fold cross-validation (e.g., 5-fold) was employed
during model training to detect overfitting and to obtain a more robust
estimate of model performance.
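The split and cross-validation procedure for the four algorithms could look like the sketch below. It runs on synthetic data from `make_classification` as a stand-in for the preprocessed dataset, so the printed scores say nothing about real performance. The stratified split, the scaling step for Logistic Regression and SVM, and all hyperparameter values are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 583 rows, 10 features, roughly 71% positive labels.
X, y = make_classification(
    n_samples=583, n_features=10, n_informative=5,
    weights=[0.29, 0.71], random_state=42,
)

# 80/20 split, stratified so both parts keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

# 5-fold cross-validation on the training split only; the test split stays untouched.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_accuracy = {
    name: cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in cv_accuracy.items():
    print(f"{name}: mean 5-fold accuracy = {score:.3f}")
```

Stratifying the split is a design choice rather than a requirement. With an imbalanced dataset it keeps the class ratio comparable between the training and test sets.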
5. Model Training and Hyperparameter Tuning
• Model Training: Each selected machine learning algorithm was trained
on the preprocessed training data. The training phase involves adjusting
the model parameters to learn the underlying patterns in the data.
• Hyperparameter Tuning: Hyperparameters such as the depth of the
decision tree, number of trees in the random forest, and kernel function in
SVM were optimized using techniques like Grid Search or Randomized
Search to find the best combination of parameters.
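A grid search over Random Forest hyperparameters might be set up as in this sketch. The grid values, the F1 scoring choice, and the synthetic data are all assumptions for illustration. They are not the settings or results of this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=0)

# Candidate values to try; every combination is evaluated (2 x 3 = 6 here).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,          # 5-fold cross-validation per combination
    scoring="f1",  # rank combinations by F1 instead of plain accuracy
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print("best parameters:", best_params)
print(f"best mean cross-validated F1: {best_score:.3f}")
```

`RandomizedSearchCV` takes a similar form and samples a fixed number of combinations instead of trying them all, which helps when the grid is large.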
6. Model Evaluation
After training the models, the performance was evaluated using various metrics
to assess their ability to predict liver disease accurately:
• Accuracy: The overall percentage of correct predictions made by the
model.
• Precision: The proportion of true positive predictions among all positive
predictions made.
• Recall: The proportion of true positive predictions among all actual
positive cases.
• F1-Score: The harmonic mean of precision and recall, providing a
balance between the two metrics.
• Confusion Matrix: A confusion matrix was generated to visualize the
performance of each model in terms of true positives, true negatives, false
positives, and false negatives.
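The metrics above can all be derived from the four cells of the confusion matrix. The sketch below does so on ten invented labels, which are not project results, and computes each metric by hand so that the definitions are visible.

```python
from sklearn.metrics import confusion_matrix

# Ten illustrative labels (1 = liver disease); invented for the example.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)          # correct / all predictions
precision = tp / (tp + fp)                          # correct among predicted positives
recall = tp / (tp + fn)                             # found among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.700 precision=0.800 recall=0.667 f1=0.727
```

In this example the two false negatives are the missed diagnoses, and they lower recall without affecting precision.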
7. Model Comparison and Selection
• Comparison of Models: After evaluating all models on the metrics
mentioned above, the model with the best overall balance of accuracy,
precision, recall, and F1 score was selected as the best-performing
model.
• Model Selection: The final model chosen for liver disease prediction was
based on the results of the evaluation, ensuring it provided the most
reliable predictions.
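Since one model rarely leads on every metric at once, a selection rule has to be stated explicitly. The sketch below collects several cross-validated metrics per model and then ranks by recall, with F1 as the tie-breaker. That rule is an assumption chosen for illustration, as are the two candidate models and the synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the preprocessed dataset.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=1)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=1),
}
metrics = ["accuracy", "precision", "recall", "f1"]

# Mean 5-fold score per metric for each candidate model.
results = {}
for name, model in candidates.items():
    scores = cross_validate(model, X, y, cv=5, scoring=metrics)
    results[name] = {m: scores[f"test_{m}"].mean() for m in metrics}

# Assumed selection rule: highest recall first, F1 as the tie-breaker.
best_name = max(results, key=lambda n: (results[n]["recall"], results[n]["f1"]))

for name, row in results.items():
    print(name, {m: round(v, 3) for m, v in row.items()})
print("selected:", best_name)
```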
8. Result Analysis
• The results of the selected model were analyzed, and key insights were
derived from the confusion matrix and other evaluation metrics.
• Key conclusions were drawn regarding the effectiveness of the model in
predicting liver disease and its potential use in real-world applications.
• Improvements: Further work could involve improving the model by
incorporating additional features, using deep learning techniques, or
integrating medical imaging data for better prediction.
• Deployment: The model could eventually be deployed as a tool in
healthcare systems for real-time liver disease prediction, helping
clinicians make informed decisions.
SYSTEMS REQUIREMENTS
1. Hardware:
o Computer/Laptop with at least 8 GB RAM, 2.5 GHz processor,
and 500 GB storage.
o GPU (optional) for faster model training if using deep learning
techniques.
o Stable Internet Connection for accessing datasets, research
materials, and online resources.
2. Software:
o Operating System: Windows, Linux, or macOS.
o Programming Language: Python for machine learning
implementation.
o IDE: Jupyter Notebook for interactive coding or PyCharm/VS
Code for script-based development.
o Libraries/Tools:
▪ Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn for
data analysis, visualization, and machine learning.
▪ Keras/TensorFlow (optional) for advanced deep learning.
o Version Control: Git for managing code versions and
collaboration.
3. Dataset:
o Indian Liver Patient Dataset (ILPD) or similar publicly available
liver disease datasets.
4. Cloud Platforms (optional):
o Google Colab or Kaggle Kernels for running models on cloud-
based machines.
o AWS or Azure for scalable computing resources.
5. Additional Tools (optional):
o Anaconda for managing Python environments and dependencies.
o Flask/Django (future work) for deploying the model as a web
application.
6. Documentation Tools:
o Microsoft Word or Google Docs for writing the report.
o Microsoft PowerPoint or Google Slides for preparing
presentations.
These facilities provide the necessary infrastructure to carry out the machine
learning-based liver disease prediction project successfully.
EXPECTED OUTCOMES
WORK PROGRESS REPORT
1. Problem Understanding and Objective Definition
o We have identified the primary objective of the project: to develop
a machine learning model that can predict liver disease based on
patient data.
o The project scope and specific goals have been defined, including
selecting relevant evaluation metrics (accuracy, precision, recall,
and F1-score).
2. Dataset Acquisition and Analysis
o The dataset for liver disease detection has been sourced and
preprocessed.
o Initial exploratory data analysis (EDA) has been performed to
understand data characteristics, including distribution, missing
values, and outliers.
o Preliminary insights into feature relationships and their relevance
to liver disease have been identified using visualizations and
statistical methods.
3. Current Phase: Design Phase
In the design phase, we are laying the foundation for the implementation
phase. This includes:
A. Data Processing Pipeline Design
• Planning a robust data preprocessing pipeline, which includes:
o Handling missing values and outliers.
o Standardizing or normalizing numerical features.
REFERENCES
1) Chronic Liver Disease Dataset (UCI Machine Learning Repository).
(n.d.). Retrieved from:
https://archive.ics.uci.edu/ml/datasets/Chronic+Liver+Disease
2) Kaggle - Chronic Liver Disease Prediction Dataset. (n.d.). Retrieved
from: https://www.kaggle.com/datasets/uciml/chronic-liver-disease
3) Analytics Vidhya - Machine Learning Projects. (n.d.). Retrieved from:
https://www.analyticsvidhya.com
4) Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of
Statistical Learning: Data Mining, Inference, and Prediction. Springer.
5) Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
https://doi.org/10.1023/A:1010933404324
6) Cortes, C., & Vapnik, V. (1995). Support-Vector Networks. Machine
Learning, 20(3), 273-297. https://doi.org/10.1007/BF00994018
7) Lime (Local Interpretable Model-Agnostic Explanations). (n.d.).
Retrieved from: https://github.com/marcotcr/lime
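8) Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002).
SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial
Intelligence Research, 16, 321-357. https://arxiv.org/abs/1106.1813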
9) Kotsiantis, S. B., & Pintelas, P. E. (2004). Recent Advances in
Classification Algorithms. Proceedings of the International Conference on
Artificial Intelligence (ICAI), 435-439.
10) Rashidi, H., & Behdad, S. (2020). Machine Learning Applications in
Medical Diagnosis. Journal of Computational Medicine, 1(1), 34-42.