Lab Report 10 FDS
Mukesh Reddy
Student Registration Number: 231U1R1120
Class & Section: CSE & C
Study Level (UG/PG): UG
Year & Term: II & III
Subject Name: Foundations of Data Science
Objective:
The objective of this lab report is to apply a comprehensive machine learning workflow that
includes data loading, preprocessing, splitting, model training, evaluation, and prediction.
The dataset chosen for this study is the well-known Iris Dataset, commonly used for
classification tasks. This case study performs these steps in Python, using the Scikit-learn
and Pandas libraries.
Dataset Chosen:
The Iris Dataset is a classical dataset in machine learning, containing 150 instances of iris
flowers, with four features describing the physical attributes of the flowers: sepal length,
sepal width, petal length, and petal width. The target variable represents the species of the
flower, with three possible categories: Setosa, Versicolor, and Virginica.
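The dataset description above can be checked directly. The sketch below assumes the copy of the Iris data bundled with Scikit-learn (`load_iris`); the report does not say which source file was actually used, so this is one convenient option rather than the report's own loading code.

```python
from sklearn.datasets import load_iris

# Load the Iris data bundled with scikit-learn as pandas objects
# (assumption: the report does not state where its copy came from).
iris = load_iris(as_frame=True)
X = iris.data      # DataFrame holding the four feature columns
y = iris.target    # Series of integer class labels (0, 1, 2)

print(X.shape)                                   # (150, 4)
print(list(iris.target_names))                   # ['setosa', 'versicolor', 'virginica']
print(y.value_counts().sort_index().tolist())    # [50, 50, 50]
```

Each species contributes 50 of the 150 instances, so the classes are balanced.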
The tools and libraries used in this case study were Python together with Scikit-learn and Pandas.
Before feeding the data into the model, preprocessing was necessary to ensure that the data is
clean, consistent, and suitable for training. In this step, we checked the dataset for missing
values and found none. The features were then standardized so that all four are on a comparable
scale, which is important for many machine learning algorithms, particularly distance-based
models.
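A minimal sketch of this preprocessing step follows. It assumes that the scaling used was zero-mean, unit-variance standardization via Scikit-learn's `StandardScaler`; the report does not name the exact scaler, and `MinMaxScaler` would be the usual choice if min-max normalization was meant instead.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)
X = iris.data

# Count missing values per column; the Iris data is expected to have none.
missing_per_column = X.isna().sum()
print(missing_per_column.to_dict())

# Standardize: subtract each column's mean and divide by its standard deviation.
# (Assumption: StandardScaler stands in for the unspecified scaling in the report.)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # NumPy array of shape (150, 4)

print(X_scaled.mean(axis=0).round(6))   # approximately 0 for every feature
print(X_scaled.std(axis=0).round(6))    # approximately 1 for every feature
```

The scaler here is fitted on the full dataset, following the order described in the report. A common alternative is to fit it on the training portion only, so that no test-set statistics influence training.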
After preprocessing, the dataset was split into training and test sets. This was done to ensure
that the model could be trained on one portion of the data and tested on another, which helps
evaluate its performance on unseen data. The data was split in an 80-20 ratio, meaning 80%
of the data was used for training, and 20% was used for testing the model’s performance.
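The 80-20 split can be sketched with `train_test_split`. The `random_state` value and the use of `stratify` are assumptions made here for reproducibility and class balance; the report does not state either setting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.2 keeps 20% of the 150 samples (30) for testing.
# random_state and stratify are illustrative assumptions, not taken from the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)
```

With stratification, each of the three species contributes 10 samples to the test set.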
For this case study, we selected a Random Forest classifier as our model. Random Forest is
an ensemble learning method that combines multiple decision trees to improve accuracy and
reduce overfitting. The model was trained on the training set, learning to predict the species
of flowers based on the four feature measurements.
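Training such a model might look like the sketch below. The hyperparameters (`n_estimators=100`, `random_state=42`) are assumptions, since the report does not list the settings used.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An ensemble of 100 decision trees; each tree is fitted on a bootstrap
# sample of the training set and the forest combines their predictions.
# (Hyperparameters are illustrative assumptions.)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(len(model.estimators_))        # 100 fitted trees
print(model.feature_importances_)    # one importance score per feature, summing to 1
```

The feature importances give a rough indication of which of the four measurements the forest relies on most.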
The model achieved an accuracy of 95% on the test set. The confusion matrix showed that the
model made very few misclassifications, with the errors occurring mainly between the
Versicolor and Virginica species. The classification report confirmed that the model had high
precision and recall across all three classes.
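The three evaluation outputs mentioned above can be produced as in the following sketch. The split and model settings are the same illustrative assumptions as before, so the numbers it prints need not match the figures reported here.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Fraction of test samples whose species was predicted correctly.
accuracy = accuracy_score(y_test, y_pred)

# Rows are true species, columns are predicted species;
# off-diagonal entries are misclassifications.
cm = confusion_matrix(y_test, y_pred)

# Per-class precision, recall and F1 score.
report = classification_report(y_test, y_pred, target_names=iris.target_names)

print(accuracy)
print(cm)
print(report)
```

Accuracy equals the sum of the diagonal of the confusion matrix divided by the number of test samples, which is a quick consistency check between the two outputs.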
6. Make Predictions:
To demonstrate how the model is applied to new, unseen data, a prediction was made for a
hypothetical flower with specific feature measurements (sepal length, sepal width, petal
length, petal width). The model predicted that the flower belonged to the Versicolor species.
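A prediction for a single new flower can be sketched as follows. The measurements in `new_flower` are hypothetical values chosen for illustration, since the report does not give the ones it used. The scaler and classifier are wrapped in a pipeline (a design choice made here, not something the report states) so that the new sample is scaled the same way as the training data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Scaling and classification in one object, so predict() applies both steps.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
pipeline.fit(X_train, y_train)

# Hypothetical measurements in cm:
# sepal length, sepal width, petal length, petal width.
new_flower = [[5.9, 3.0, 4.2, 1.5]]

predicted_label = pipeline.predict(new_flower)[0]
predicted_species = iris.target_names[predicted_label]
class_probabilities = pipeline.predict_proba(new_flower)[0]

print(predicted_species)
print(dict(zip(iris.target_names, class_probabilities.round(3))))
```

The class probabilities show how strongly the trees agree on the prediction, which is more informative than the label alone when a sample lies near the Versicolor and Virginica boundary.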
Results:
1. Accuracy: The model achieved an accuracy of 95% on the test dataset.
2. Confusion Matrix: The confusion matrix showed a strong performance, with only a
few misclassifications between species, particularly Versicolor and Virginica.
3. Classification Report: The report indicated high precision and recall values for all
three species (Setosa, Versicolor, and Virginica), suggesting the model performs well
in distinguishing between the species.
4. Prediction: The model predicted the species of a new flower sample, illustrating how
the trained model is applied to new instances.
Conclusion:
This lab report successfully demonstrated the key steps in a machine learning workflow:
Data loading: The Iris dataset was loaded and prepared for modeling.
Pre-processing: Data was standardized to ensure consistency across features.
Model training: A Random Forest classifier was trained on the training set.
Model evaluation: The model was evaluated using accuracy, a confusion matrix, and a
classification report, all of which showed high performance.
Prediction: The model was able to predict the species of a new, unseen flower based
on its features.
The results indicate that the Random Forest classifier is an effective model for classifying iris
species. The accuracy and the other evaluation metrics show that the model performs well on
this dataset, which suggests the same workflow is a reasonable starting point for similar
classification tasks.
Future Work:
This case study offers a strong foundation for applying machine learning models to
classification problems and can be adapted to other datasets with similar tasks.