Ashwin Kumar REPORT - 1BI21IS019
Internship Report on
"SLEEP EFFICIENCY DATASET ANALYSIS"
Submitted in partial fulfillment of the requirement for the Degree of
Bachelor of Engineering in Information Science and Engineering
Submitted by:
ASHWIN KUMAR (1BI21IS019)
CERTIFICATE
Certified that the project work entitled "SLEEP EFFICIENCY DATASET ANALYSIS" carried out by
ASHWIN KUMAR (1BI21IS019), a bona fide student of Bangalore Institute of Technology,
Bangalore, is in partial fulfillment of the requirement of the V semester Machine Learning Project for the degree of
Bachelor of Engineering in Information Science and Engineering of Visvesvaraya Technological University, Belagavi,
during the year 2023 – 2024. It is certified that all corrections/suggestions indicated for Internal
Assessment have been incorporated in the report deposited in the departmental library. The internship
project report has been approved as it satisfies the academic requirements in respect of the Mini
Project work prescribed for the said degree.
External Viva
1.
2.
ACKNOWLEDGEMENT
I would like to thank our Principal, Dr. M. U. Aswath, Bangalore Institute of Technology, for
his support throughout this project.
I express my wholehearted gratitude to Dr. Asha T, our respected Head of the Department of
Information Science, and wish to acknowledge her valuable help and encouragement.
ASHWIN KUMAR
[1BI21IS019]
ABSTRACT
In this project, we have taken a sleep-related dataset sourced from Kaggle, which is a comprehensive
collection of information that spans a wide array of metrics related to sleep. It includes data on various
aspects such as sleep efficiency scores, patterns, biometric data, environmental factors, and
demographics. This rich and multifaceted dataset offers a valuable resource for researchers and
professionals in the fields of sleep science, healthcare, and technology. The dataset's inclusivity of
diverse metrics allows for a holistic understanding of sleep-related phenomena. Sleep efficiency scores,
for instance, provide a quantitative measure of how effectively an individual utilizes their time in bed
for actual sleep. Sleep patterns encompass information about the structure and organization of sleep,
including details about the duration and distribution of different sleep stages.
CONTENTS
1. Introduction
   1.1. Problem Statement
   1.2. Objective
   1.3. Future Scope
2. Requirement Specification
   2.1. Software Requirements
   2.2. Hardware Requirements
3. System Definition
   3.1. Project Description
   3.2. Libraries Used
   3.3. Technology Used
   3.4. Dataset
   3.5. Advantages
   3.6. Disadvantages
4. Implementation (Code)
5. Snapshots
6. Declaration
7. Conclusion/Future Enhancement
8. References
CHAPTER - 1
INTRODUCTION
Arthur Samuel, a pioneer in the field of artificial intelligence and computer gaming,
coined the term "Machine Learning". He defined machine learning as a "field of study
that gives computers the capability to learn without being explicitly programmed". The
process starts with feeding good-quality data and then training our machines (computers)
by building machine learning models using the data and different algorithms. The choice
of algorithm depends on what type of data we have and what kind of task we are
trying to automate. A typical machine learning workflow involves the following steps:
1. Study the Problem: The first step is to study the problem. This step involves
understanding the business problem and defining the objectives of the model.
2. Data Collection: When the problem is well-defined, we can collect the relevant data
required for the model. The data could come from various sources such as databases,
APIs, or web scraping.
3. Data Preparation: Once the problem-related data is collected, it is a good idea
   to check the data properly and convert it into the desired format so that it can be used by
   the model to find hidden patterns. This can be done in the following steps:
   Data cleaning
   Data transformation
4. Model Selection: The next step is to select the appropriate machine learning
algorithm that is suitable for our problem. This step requires knowledge of the
strengths and weaknesses of different algorithms. Sometimes we use multiple models
and compare their results and select the best model as per our requirements.
5. Model building and Training: After selecting the algorithm, we have to build the
model.
6. Model Evaluation: Once the model is trained, it can be evaluated on the test dataset to
determine its accuracy and performance using different techniques like classification
report, F1 score, precision, recall, ROC Curve, Mean Square error, absolute error, etc.
7. Model Tuning: Based on the evaluation results, the model may need to be tuned or
optimized to improve its performance. This involves tweaking the hyperparameters of
the model.
8. Deployment: Once the model is trained and tuned, it can be deployed in a production
environment to make predictions on new data. This step requires integrating the
model into an existing software system or creating a new system for the model.
9. Monitoring and Maintenance: Finally, it is essential to monitor the model’s
performance in the production environment and perform maintenance tasks as
required. This involves monitoring for data drift, retraining the model as needed,
and updating the model as new data becomes available. Implement logging
mechanisms to record predictions and outcomes. This facilitates the analysis of any
unexpected behavior or issues that may arise over time.
Whenever a machine learning model is trained, we cannot train it on a single dataset;
even if we did, we would not be able to assess the performance of the model. For that
reason, we split our source data into training, testing, and validation datasets. The
data-splitting procedure is used to estimate the performance of machine learning algorithms
when they are used to make predictions on data not used to train the model. Splitting the
dataset is essential for an unbiased evaluation of prediction performance.
Training Data: The part of data we use to train our model. This is the data that your
model actually sees (both input and output) and learns from.
Validation Data: The part of data that is used for frequent evaluation of the
model as it is fit on the training dataset, and for tuning the involved hyperparameters
(parameters set before the model begins learning). This data plays its part while the
model is actually being trained.
Testing Data: Once our model is completely trained, the testing data provides an unbiased
evaluation. When we feed in the inputs of the testing data, our model predicts some
values without seeing the actual output. After prediction, we evaluate the model by
comparing its predictions with the actual outputs present in the testing data. This is how
we evaluate how much our model has learned from the experiences fed in as training
data at the time of training.
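To make the idea of splitting concrete, here is a minimal sketch using scikit-learn's train_test_split; the toy arrays and the split ratios are assumptions chosen only for illustration:

import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (50 samples, 2 features) and target, for illustration only
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First hold out 20% of the samples as the test set
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split the remainder into training and validation sets (here 75:25)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10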
Based on the methods and way of learning, machine learning is divided into mainly four
types: Supervised Learning, Unsupervised Learning, Semi-Supervised Learning, and Reinforcement Learning.
Supervised Machine Learning
In supervised learning, the algorithm is trained on a labelled dataset, where each training example is paired with its
corresponding output label, and it learns to generalize from this data to make predictions
on new, unseen data. There are two main categories of supervised learning that are
mentioned below:
Classification
Classification deals with predicting categorical target variables, which represent discrete
classes or labels, for example classifying an email as spam or not spam.
Here are some classification algorithms:
o Logistic Regression
o Random Forest
o Decision Tree
o Naive Bayes
Regression
Regression, on the other hand, deals with predicting continuous target variables, which
represent numerical values. For example, predicting the price of a house based on its
size, location, and amenities, or forecasting the sales of a product. Regression
algorithms learn to map the input features to a continuous numerical value.
Here are some regression algorithms:
o Linear Regression
o Polynomial Regression
o Ridge Regression
o Lasso Regression
o Decision tree
o Random Forest
Advantages of Supervised Machine Learning
o Supervised learning models can have high accuracy as they are trained on labelled data.
o It can often make use of pre-trained models, which saves time and resources when
  developing new models from scratch.
Disadvantages of Supervised Machine Learning
o It has limitations in recognizing patterns and may struggle with unseen or unexpected
  patterns that are not present in the training data.
o It can be time-consuming and costly as it relies on labelled data only.
Unsupervised Machine Learning
In unsupervised learning, the algorithm is trained on unlabelled data and must discover
patterns, structures, or relationships in the data on its own. There are two main categories
of unsupervised learning that are mentioned below:
Clustering
Clustering is the process of grouping data points into clusters based on their
similarity. This technique is useful for identifying patterns and relationships in
data without the need for labelled examples.
Here are some clustering algorithms:
o Mean-shift algorithm
o DBSCAN Algorithm
Association
Association rule learning discovers interesting relationships between variables in large
datasets, such as items that are frequently purchased together.
Here are some association algorithms:
o Apriori Algorithm
o Eclat
o FP-growth Algorithm
Advantages of Unsupervised Machine Learning
o It helps to discover hidden patterns and various relationships between the data.
o It is used for tasks such as customer segmentation, anomaly detection, and data
  exploration.
o It does not require labelled data and reduces the effort of data labelling.
o It offers techniques such as autoencoders and dimensionality reduction that can
  be used to extract meaningful features from raw data.
Disadvantages of Unsupervised Machine Learning
o Without labels, it may be difficult to assess the quality of the model's output.
o Clusters may not be clearly interpretable and may not have meaningful
  interpretations.
Semi-Supervised Machine Learning
We use these techniques when the data is only partially labelled and the larger
portion is unlabelled. We can use unsupervised techniques to predict labels and then
feed these labels to supervised techniques. This approach is most applicable to image
datasets, where usually not all images are labelled.
There are a number of different semi-supervised learning methods each with its own
characteristics. Some of the most common ones include:
Graph-based semi-supervised learning: This approach uses a graph to represent
the relationships between the data points. The graph is then used to propagate
labels from the labelled data points to the unlabelled data points.
Label propagation: This approach iteratively propagates labels from the labelled
data points to the unlabelled data points, based on the similarities between the
data points.
Co-training: This approach trains two different machine learning models on
different views (feature subsets) of the labelled data. Each model then labels
unlabelled examples for the other, and these newly labelled examples are added
to the other model's training set.
Self-training: This approach trains a machine learning model on the labelled data
and then uses the model to predict labels for the unlabelled data. The model is
then retrained on the labelled data and the predicted labels for the unlabelled data.
Generative adversarial networks (GANs): GANs are a type of deep learning
algorithm that can be used to generate synthetic data. GANs can be used to
generate unlabelled data for semi-supervised learning by training two neural
networks.
Advantages of Semi-Supervised Machine Learning
o It leads to better generalization than purely supervised learning because it makes
  use of both labelled and unlabelled data.
o It reduces the cost and effort of labelling, since only a small portion of the data
  needs labels.
Reinforcement Learning
Reinforcement learning is a learning method in which an agent interacts with the
environment by producing actions and discovering errors. Trial, error, and delayed
reward are the most relevant characteristics of reinforcement learning. In this technique,
the model keeps improving its performance using reward feedback to learn the behaviour or
pattern. These algorithms are tailored to a specific problem, e.g., Google's self-driving
car, or AlphaGo, where a bot competes with humans and even with itself to become a better
and better Go player. Each time we feed in data, the agent learns and adds the data to its
knowledge, which serves as training data. The more it learns, the better trained, and hence
more experienced, it becomes. There are two main types of reinforcement learning:
Positive reinforcement: a desirable stimulus is added after the desired behaviour,
increasing the likelihood of that behaviour being repeated.
o Examples: Giving a treat to a dog for sitting, providing a point in a game for
  a correct answer.
Negative reinforcement: an undesirable stimulus is removed after the desired behaviour,
which also increases the likelihood of that behaviour being repeated.
o Example: Turning off a loud alarm once the correct action has been taken.
Advantages of Reinforcement Learning
o Its autonomous decision-making is well-suited for tasks that require learning a
  sequence of decisions, such as robotics and game-playing.
o This technique is preferred for achieving long-term results that are otherwise very
  difficult to achieve.
o It can be used to solve complex problems that cannot be solved by conventional
  techniques.
Disadvantages of Reinforcement Learning
o It needs a lot of data and a lot of computation, which can make it impractical and
  costly.
Machine learning is important because it allows computers to learn from data and improve
their performance on specific tasks without being explicitly programmed. This ability to
learn from data and adapt to new situations makes machine learning particularly useful
for tasks that involve large amounts of data, complex decision-making, and dynamic
environments.
Here are some specific areas where machine learning is being used:
o Predictive modelling: Machine learning can be used to build predictive models
that can help businesses make better decisions. For example, machine learning can
be used to predict which customers are most likely to buy a particular product, or
which customers are most likely to stop using a service.
o Computer vision: Machine learning is used to build systems that can recognize
and interpret images and videos. This is important for applications such as self-
driving cars, surveillance systems, and medical imaging.
o Fraud detection: Machine learning can be used to detect fraudulent behaviour in
financial transactions, online advertising, and other areas.
o Recommendation systems: Machine learning can be used to build
recommendation systems that suggest products, services, or content to users based
on their past behaviour and preferences.
Overall, machine learning has become an essential tool for many businesses and
industries, as it enables them to make better use of data, improve their decision-making
processes, and deliver more personalized experiences to their customers.
Comparison of Machine Learning Algorithms
Comparing machine learning algorithms is important in itself, but there are some
not-so-obvious benefits of comparing various experiments effectively.
Better performance
High performance can be short-lived if the chosen model is tightly coupled with
the training data and fails to interpret unseen data. So, it’s also important to find
the model that understands underlying data patterns so that the predictions are long-
lasting and the need for re-training is minimal.
Easier retraining
When models are evaluated and prepared for comparisons, minute details, and
metadata get recorded which come in handy during retraining. For example, if a
developer can clearly retrace the reasons behind choosing a model, the causes of
model failure will immediately pop out and re-training can start with equal speed.
Speedy production
With the model details available at hand, it’s easy to narrow down on models that
can offer high processing speed and use memory resources optimally. Also,
during production several parameters are required to configure the machine
learning solutions. Having production-level data can be useful for easily aligning
with the production engineers. Moreover, knowing the resource demands of
different algorithms, it will also be easier to check their compliance and feasibility
with respect to the organization’s allocated assets.
Evaluating the performance of a Machine learning model is one of the important steps
while building an effective ML model. To evaluate the performance or quality of the
model, different metrics are used, and these metrics are known as performance metrics or
evaluation metrics. These performance metrics help us understand how well our model has
performed for the given data. In this way, we can improve the model's performance by
tuning the hyper-parameters. Each ML model aims to generalize well on unseen/new data,
and performance metrics help determine how well the model generalizes on the new
dataset. In machine learning, each task or problem is divided into classification and
Regression. Not all metrics can be used for all types of problems; hence, it is important to
know and understand which metrics should be used. Different evaluation metrics are used
for both Regression and Classification tasks.
Performance Metrics for Classification
In a classification problem, the model learns from the given dataset and then classifies new data into classes or groups
based on the training. It predicts class labels as the output, such as Yes or No, 0 or 1, Spam or
Not Spam, etc. To evaluate the performance of a classification model, different metrics are
used, some of which are as follows:
o Accuracy
The accuracy metric is one of the simplest classification metrics to implement, and
it can be determined as the ratio of the number of correct predictions to the total number of
predictions:
Accuracy = (Number of correct predictions) / (Total number of predictions)
o Confusion Matrix
A confusion matrix is a tabular representation of prediction outcomes of any binary
classifier, which is used to describe the performance of the classification model on
a set of test data when true values are known.
In the matrix, columns are for the prediction values, and rows specify
the Actual values. Here Actual and prediction give two possible
classes, Yes or No. So, if we are predicting the presence of a disease in a
patient, the Prediction column with Yes means, Patient has the disease,
and for NO, the Patient doesn't have the disease.
In this example, the total number of predictions is 165, of which the model
predicted Yes 110 times and No 55 times.
In general, the table is divided into four terminologies, which are as follows:
True Positive (TP): In this case, the prediction outcome is true, and it is
true in reality, also.
True Negative (TN): In this case, the prediction outcome is false, and it
is false in reality, also.
False Positive (FP): In this case, prediction outcomes are true, but they
are false in actuality.
False Negative (FN): In this case, predictions are false, and they are
true in actuality.
o Precision
The precision metric is used to overcome the limitation of accuracy. Precision
determines the proportion of positive predictions that were actually correct. It is
calculated as the number of true positives divided by the total number of positive
predictions (true positives plus false positives):
Precision = TP / (TP + FP)
o Recall
Recall (also called sensitivity) determines the proportion of actual positive cases that
the model identified correctly. It is calculated as the number of true positives divided
by the total number of actual positives (true positives plus false negatives):
Recall = TP / (TP + FN)
o F-Score
The F-score combines Precision and Recall into a single measure. The F1 Score is
calculated as the harmonic mean of Precision and Recall, assigning equal weight to
each of them:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
TPR, or True Positive Rate, is a synonym for Recall and hence can be calculated as:
TPR = TP / (TP + FN)
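As a brief illustration of these classification metrics, the following sketch computes them with scikit-learn on made-up labels and predictions (the values are invented for demonstration only):

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true labels and model predictions (1 = Yes, 0 = No)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))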
Performance Metrics for Regression
Regression is a supervised learning technique that aims to find the relationship between
variables; the model predicts a numeric or continuous value. The metrics used for regression
are different from the classification metrics. This means we cannot use the Accuracy metric
(explained above) to evaluate a regression model; instead, the performance of a regression
model is reported as errors in the prediction. The following are popular metrics used to
evaluate regression models:
o Mean Squared Error (MSE)
Mean Squared Error, or MSE, is one of the most suitable metrics for regression
evaluation. It measures the average of the squared differences between the values
predicted by the model and the actual values:
MSE = (1/n) Σ (actual − predicted)²
Since the errors are squared, MSE only takes non-negative values, and it is usually
positive and non-zero. Moreover, because of the squared differences it also penalizes
small errors, which can lead to an over-estimation of how bad the model is. MSE is a
much-preferred metric compared to other regression metrics because it is differentiable
and hence can be optimized better.
o R2 Score
The R squared (R²) score, also known as the coefficient of determination, measures how
much of the variance in the target variable is explained by the model, independent of the
scale of the values. The R squared score will always be less than or equal to 1, regardless
of whether the values are large or small.
o Adjusted R2
To overcome a limitation of the R squared score, adjusted R squared is used, which will
always show a lower value than R². This is because it adjusts for the number of
predictors and only shows an improvement if there is a real improvement.
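A short sketch of computing MSE, R², and adjusted R² with scikit-learn; the sample values and the number of predictors p are assumptions made only for demonstration:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values from a regression model
y_true = np.array([7.5, 6.2, 8.1, 5.9, 7.0, 6.8])
y_pred = np.array([7.2, 6.5, 7.8, 6.1, 7.3, 6.6])

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n is the number of samples and p the number of predictors (p = 2 assumed here)
n, p = len(y_true), 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MSE: {mse:.4f}, R2: {r2:.4f}, Adjusted R2: {adj_r2:.4f}")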
Data Pre-Processing
Pre-processing refers to the transformations applied to our data before feeding it to the
algorithm. Data preprocessing is a technique that is used to convert the raw data into
a clean data set. In other words, whenever the data is gathered from different sources
it is collected in raw format, which is not feasible for the analysis. The main reasons why
data preprocessing matters are the following:
Improving Data Quality: Data preprocessing is essential for enhancing the quality of
data by handling inconsistencies, inaccuracies, and errors, which is critical for ensuring
reliable and robust analytics.
Dealing with Missing Values: Data preprocessing includes techniques like imputation
that are critical for dealing with missing data effectively, as datasets often have
missing values which can significantly hinder the performance of machine learning
models.
Normalizing and Scaling: Data preprocessing helps in normalizing or scaling
features, which is especially important for algorithms that are sensitive to the scale of
the input. This ensures that all the features are on a comparable scale, which is
crucial for the accurate performance of many machine learning algorithms.
Handling Outliers: Through data preprocessing, outliers can be identified and managed
appropriately. This is important as outliers can have a disproportionate effect on the
modelling process and can lead to misleading results.
Dimensionality Reduction: Data preprocessing includes techniques such as Principal
Component Analysis (PCA) for reducing the number of input features, which not
only helps in improving the performance of models but also makes the dataset more
manageable and computationally efficient.
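Since PCA is mentioned above as a dimensionality-reduction technique, here is a minimal sketch of how it is typically applied with scikit-learn; the toy data and the choice of two components are assumptions for illustration:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data: 6 samples with 4 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# Scale first, then project onto the top 2 principal components
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=2).fit_transform(X_scaled)

print(X_reduced.shape)  # (6, 2)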
Data preprocessing is a step that involves transforming raw data so that issues owing to
the incompleteness, inconsistency, and/or lack of appropriate representation of trends are
resolved so as to arrive at a dataset that is in an understandable format. The steps used in
data preprocessing include the following:
1. Data profiling. Data profiling is the process of examining, analysing and reviewing data
to collect statistics about its quality. It starts with a survey of existing data and its
characteristics. Data scientists identify data sets that are pertinent to the problem at hand,
inventory its significant attributes, and form a hypothesis of features that might be relevant
for the proposed analytics or machine learning task. They also relate data sources to the
relevant business concepts and consider which preprocessing libraries could be used.
2. Data cleansing. The aim here is to find the easiest way to rectify quality issues, such as
eliminating bad data, filling in missing data or otherwise ensuring the raw data is suitable
for feature engineering.
3. Data reduction. Raw data sets often include redundant data that arise from
characterizing phenomena in different ways or data that is not relevant to a particular ML,
AI or analytics task. Data reduction uses techniques like principal component analysis to
transform the raw data into a simpler form suitable for particular use cases.
4. Data transformation. Here, data scientists think about how different aspects of the data
need to be organized to make the most sense for the goal. This could include things
like structuring unstructured data, combining salient variables when it makes sense or
identifying important ranges to focus on.
5. Data enrichment. In this step, data scientists apply the various feature engineering
libraries to the data to effect the desired transformations. The result should be a data set
organized to achieve the optimal balance between the training time for a new model and
the required compute.
6. Data validation. At this stage, the data is split into two sets. The first set is used to train
a machine learning or deep learning model. The second set is the testing data that is used
to gauge the accuracy and robustness of the resulting model. This second step helps
identify any problems in the hypothesis used in the cleaning and feature engineering of the
data. If the data scientists are satisfied with the results, they can push the preprocessing
task to a data engineer who figures out how to scale it for production. If not, the data
scientists can go back and make changes to the way they implemented the data cleansing
and feature engineering steps.
Feature Encoding
Feature encoding is the process of transforming data into a format that can be used by
machine learning algorithms. This is often necessary when working with real-world data,
which can be messy and unstructured.
Machine learning models can only work with numerical values. For this reason, it is
necessary to transform the categorical values of the relevant features into numerical ones.
This process is called feature encoding.
Here are some of the more well-known and widely used encoding techniques:
One-hot encoding: One-hot encoding converts each category of a categorical variable
into a separate binary (0/1) column, producing a form that can be used by ML algorithms.
Binary encoding: Binary encoding first assigns each category an integer and then
represents that integer in binary code, i.e., as a combination of 0s and 1s.
For Example:
o 0000 - 0
o 0001 - 1
o 0010 - 2
o 0011 - 3
o 0100 - 4
o 0101 - 5
o 0110 - 6
o 0111 - 7
o 1000 - 8
o 1001 - 9
o 1010 – 10
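A small sketch of one-hot encoding with pandas; the example column and its categories are made up for illustration:

import pandas as pd

# Hypothetical categorical feature
data = pd.DataFrame({"Smoking status": ["Yes", "No", "No", "Yes"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(data, columns=["Smoking status"])
print(one_hot)  # produces the columns 'Smoking status_No' and 'Smoking status_Yes'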
1.1. PROBLEM STATEMENT
• To show the implementation, classification, and regression analysis on the Sleep Efficiency
Dataset from Kaggle.
1.2. OBJECTIVE
Regression: The purpose of regression analysis of the sleep dataset is to predict the
continuous variable "sleep duration" based on the various variables provided in the
dataset. Regression aims to understand and evaluate the relationship between the
independent variables (features) and the dependent variable (sleep duration).
Classification: The purpose of classification analysis is to predict a categorical outcome,
such as grouping sleep efficiency into Low, Medium, and High classes, from the
demographic and lifestyle variables in the dataset.
1.3. FUTURE SCOPE
During the course of this project, I came to recognize the critical importance of feature
scaling in the performance of machine learning models. The fundamental concept revolves
around ensuring that all features are on a consistent scale, which significantly impacts the
models' effectiveness. As a future direction, we can further enhance the existing system by
delving into advanced feature scaling techniques and exploring their impact on model
accuracy and robustness.
CHAPTER - 2
REQUIREMENTS SPECIFICATION
2.1. SOFTWARE REQUIREMENTS
Python 3 with the NumPy, Pandas, Matplotlib, Seaborn, and Scikit-learn libraries,
run in a Jupyter Notebook / Google Colab environment
2.2. HARDWARE REQUIREMENTS
Processor – i3 Processor
Memory – 2 GB RAM
Keyboard
CHAPTER - 3
SYSTEM DEFINITION
3.1. PROJECT DESCRIPTION
Supervised Machine Learning algorithms can be broadly classified into Regression and
Classification algorithms. Regression algorithms predict the output for continuous
values, whereas Classification algorithms are needed to predict categorical values.
Classification
Classification is a technique for determining which class the dependent variable belongs to,
based on one or more independent variables.
A classifier is a type of machine learning algorithm that assigns a label to a data input.
Classifier algorithms use labelled data and statistical methods to produce predictions about
data input classifications. Here, we employ logistic regression as the primary
classification algorithm.
Logistic Regression
First, a linear combination of the input variables is computed, as in linear regression.
The logistic regression model then transforms this continuous output of the linear
regression function into a categorical output by passing it through a sigmoid function.
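A minimal, self-contained sketch of logistic regression with scikit-learn; the synthetic data and the feature values are assumptions for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data: one feature, two classes
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns the sigmoid-transformed probability of each class
print(clf.predict_proba([[3.5]]))
print(clf.predict([[5.5]]))  # categorical output, e.g. [1]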
Regression
Linear regression, decision tree regression, and random forest regression are the chosen
algorithms in this project.
Linear regression
The linear regression algorithm models a linear relationship between a dependent (y)
variable and one or more independent (x) variables, hence the name linear regression.
Since linear regression models a linear relationship, it finds how the value of the
dependent variable changes according to the value of the independent variable(s). For a
single feature, the model can be written as y = a0 + a1·x, where a0 is the intercept and
a1 is the slope.
The linear regression model provides a sloped straight line representing the relationship
between the variables. The values of the x and y variables are the training data to which
the Linear Regression model is fit.
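A brief sketch of fitting a linear regression model with scikit-learn; the numbers are synthetic and chosen only to illustrate the slope/intercept interpretation:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data roughly following y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

model = LinearRegression()
model.fit(X, y)

print("slope a1:", model.coef_[0])        # close to 2
print("intercept a0:", model.intercept_)  # close to 1
print("prediction for x = 6:", model.predict([[6.0]]))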
Decision Tree is a decision-making tool that uses a flowchart-like tree structure, i.e., a model
of decisions and all of their possible results, including outcomes, input costs, and utility.
The process of splitting starts at the root node and is followed by a branched tree that
finally leads to a leaf node (terminal node) that contains the prediction or the final
outcome of the algorithm. Construction of decision trees usually works top-down, by
choosing a variable at each step that best splits the set of items. Each sub-tree of the
decision tree model can be represented as a binary tree where a decision node splits into
two nodes based on the conditions.
Decision tree regression observes features of an object and trains a model in the
structure of a tree to predict data in the future to produce meaningful continuous output.
Continuous output means that the output/result is not discrete, i.e., it is not represented
just by a discrete, known set of numbers or values.
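A small sketch of decision tree regression with scikit-learn; the synthetic data and the max_depth value are assumptions for illustration:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data
X = np.arange(0, 10, 0.5).reshape(-1, 1)
y = np.sin(X).ravel()

# Limiting the depth keeps the tree from simply memorising the training data
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)

print(tree.predict([[2.5], [7.5]]))  # continuous outputs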
Random forests or random decision forests are an ensemble learning method that uses
multiple learning algorithms to obtain better predictive performance than could be
obtained from any of the constituent learning algorithms mostly for solving classification
and regression problems.
Random Forest Regression algorithms are a class of Machine Learning algorithms that use
the combination of multiple random decision trees each trained on a subset of data. The
use of multiple trees gives stability to the algorithm and reduces variance. The random
forest regression algorithm is a commonly used model due to its ability to work well for
large and most kinds of data.
The algorithm creates each tree from a different sample of input data. At each node, a
different sample of features is selected for splitting and the trees run in parallel without
any interaction. The predictions from each of the trees are then averaged to produce a
single result which is the prediction of the Random Forest.
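A minimal sketch of random forest regression with scikit-learn; the synthetic data and the number of trees are assumptions for illustration:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic noisy data
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=100)

# Each tree is trained on a bootstrap sample of the data; the forest's
# prediction is the average of the individual tree predictions
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X, y)

print(forest.predict([[3.0], [8.0]]))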
WORKING DESCRIPTION
The sleep efficiency dataset catalogues test subjects' sleep behaviors, demographic details, and
lifestyle factors. It provides a rich source for investigating correlations between sleep patterns
and various influences, offering insights for research in sleep science, healthcare applications,
and advancements in sleep-tracking technologies. However, potential biases and ethical
considerations warrant careful interpretation of the data.
At the core of this analysis are variables such as bedtime, wakeup time, total sleep duration,
sleep efficiency, and the percentages of time spent in different sleep stages, including REM,
deep, and light sleep. These quantitative measures offer insights into the overall sleep
architecture of the test subjects. Additionally, the dataset includes demographic details such as
age and gender, providing a broader context for understanding sleep patterns.
Lifestyle factors play a pivotal role in shaping sleep outcomes, and the dataset captures key
elements such as caffeine and alcohol consumption, smoking status, and exercise frequency.
Exploring these factors in conjunction with sleep-related metrics allows researchers to uncover
potential correlations and patterns. For instance, researchers may investigate whether increased
exercise frequency correlates with improved sleep efficiency or if caffeine consumption is
associated with changes in sleep duration.
The implications of this analysis extend into various domains. In the realm of sleep science,
researchers can gain valuable insights into the complex interplay between lifestyle choices and
sleep quality. Healthcare applications may benefit from understanding how certain lifestyle
factors contribute to sleep-related issues, informing personalized interventions for improved sleep
health.
Context of Dataset
The sleep efficiency dataset contains details on test subjects, including age, gender,
bedtime, wakeup time, sleep duration, and various sleep-related factors such as REM sleep
percentage, deep sleep percentage, and light sleep percentage. Additional information includes
the number of awakenings, caffeine and alcohol consumption, smoking status, and exercise
frequency.
Data Pre-Processing
Data preprocessing is an important step before the data is used. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of
data preprocessing is to improve the quality of the data and to make it more suitable for
the specific model to train on. In this dataset, there are both numerical and categorical
features. The categorical features need to be converted to numerical ones, as the models
accept only numerical values.
Key steps in this process include handling missing data, encoding categorical variables,
and scaling numerical features.
Handling Missing Data:
Identify and address missing values in the dataset, employing strategies such as
imputation or removal to maintain data integrity.
Encoding Categorical Variables:
Utilize one-hot encoding to convert categorical variables into a format suitable for
machine learning models, enhancing their interpretability and effectiveness.
Scaling Numerical Features:
Standardize the numerical features (for example with StandardScaler) so that they are on a
comparable scale and no single feature dominates the models because of its magnitude.
By implementing these pre-processing steps, it is ensured that the dataset is primed for
meaningful analysis and model development.
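A hedged sketch of these pre-processing steps on the sleep dataset; the column names follow the dataset description above, but the file path and the specific choices (median imputation, which columns to encode, leaving the target unscaled) are assumptions:

import pandas as pd
from sklearn.preprocessing import StandardScaler

sleep_df = pd.read_csv('Sleep_Efficiency.csv')  # path assumed

# Handle missing values: fill numeric gaps with the column median
num_cols = sleep_df.select_dtypes(include='number').columns
sleep_df[num_cols] = sleep_df[num_cols].fillna(sleep_df[num_cols].median())

# One-hot encode categorical variables such as Gender and Smoking status
sleep_df = pd.get_dummies(sleep_df, columns=['Gender', 'Smoking status'], drop_first=True)

# Scale the numerical predictors (the target 'Sleep efficiency' is left unscaled)
predictors = [c for c in num_cols if c != 'Sleep efficiency']
sleep_df[predictors] = StandardScaler().fit_transform(sleep_df[predictors])

print(sleep_df.head(3))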
To gauge the performance and generalizability of our machine learning models, we employ a
training and testing split. This involves partitioning the dataset into subsets dedicated to model
training and evaluation.
Before splitting the data for training and testing, we have to assign the response variable and
the predictor variables to Y and X respectively. Then we split the data in an 80:20 ratio: 80%
of the data will be used for training the models and 20% of the data will be used for testing.
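A short sketch of this 80:20 split with scikit-learn; X and Y are assumed to have been assigned as described above:

from sklearn.model_selection import train_test_split

# X (predictor columns) and Y (response column) are assumed to be defined already
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # roughly an 80:20 split of the rows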
Performing Classification
The classification aspect of our project involves predicting sleep efficiency categories based on a
combination of demographic and lifestyle factors. A Random Forest Classifier is employed as the
primary classification algorithm.
Performing Regression
The regression component of our project focuses on predicting the continuous sleep efficiency
score using a suite of features. Linear regression is the chosen algorithm.
With the prepared model, test it on the 20% testing data (X_test) and assign the result to the
y_pred variable. Then assess the performance of the model using the root mean squared error
and the R² score.
3.2. LIBRARIES USED
In this project, several essential Python libraries are harnessed to seamlessly process,
analyse, and model the sleep-related data.
NumPy
NumPy is a general-purpose array-processing package. NumPy served as the backbone for
scientific computing, facilitating numerical operations and array manipulations critical for
data preprocessing. It provides a high-performance multidimensional array object, and
tools for working with these arrays. It is the fundamental package for scientific computing
with Python. Besides its obvious scientific uses, NumPy can also be used as an efficient
multidimensional container of generic data.
Pandas
Pandas is an open-source library that is built on top of the NumPy library. It is a Python
package that offers various data structures and operations for manipulating numerical data
and time series. It is popular mainly because it makes importing and analysing data much
easier. Pandas is fast and offers high performance and productivity for users. It allows us
to efficiently handle and explore the 'Sleep_Efficiency.csv' dataset, organizing the
information into a structured and analyzable format.
Matplotlib
Matplotlib is a plotting library that is ideal for creating visualizations in Python. It
provides a range of tools for creating line plots, scatter plots, histograms, and more.
Scikit-learn
Scikit-learn is a popular library for machine learning in Python. It provides a wide range
of tools for classification, regression, and clustering. Scikit-learn also includes tools for
data preprocessing, model selection, and model evaluation. It provides the tools that
enable us to predict sleep efficiency categories and scores based on various demographic
and lifestyle features.
3.3. TECHNOLOGY USED
Machine Learning
Machine learning (ML) constitutes the exploration of computer algorithms designed to
enhance their performance autonomously through experience and the utilization of data.
Positioned within the realm of artificial intelligence, ML algorithms construct models
based on sample data, commonly referred to as "training data." This process enables them
to make predictions or decisions without explicit programming for each scenario.
The application of machine learning spans a diverse range of fields, including medicine,
email filtering, speech recognition, and computer vision. Its significance lies in its ability
to tackle tasks that are challenging or impractical to address using traditional algorithms.
In essence, machine learning empowers systems to learn and adapt, offering valuable
solutions across various domains.
Python
Python stands out as a high-level, versatile, and exceedingly popular programming
language. Widely employed in diverse domains, the latest iteration, Python 3, finds
applications in web development, machine learning, and various cutting-edge technologies
within the software industry. Its adaptability makes it an ideal choice for beginners
entering the programming landscape and proves equally advantageous for seasoned
programmers with expertise in other languages such as C++ and Java. The language's
versatility and broad adoption contribute to its standing as a key player in contemporary
software development.
3.4. DATASET
The sleep efficiency dataset compiles information on a group of test subjects and their
sleep behaviors. Each subject is uniquely identified by a Subject ID and characterized by
age, gender, bedtime, wakeup time, sleep duration, sleep efficiency, REM sleep
percentage, deep sleep percentage, light sleep percentage, awakenings, caffeine and
alcohol consumption, smoking status, and exercise frequency. This dataset serves as a
comprehensive resource for studying the relationships between lifestyle factors and sleep
patterns. It allows researchers to explore how variables such as bedtime habits, substance
consumption, and exercise frequency may influence sleep efficiency, duration, and quality
among the test subjects. Snapshot of part of the dataset is given below:
3.5. ADVANTAGES
Insights into Sleep Patterns: The dataset provides valuable insights into the sleep patterns of the
test subjects, including duration, efficiency, and various sleep stages. This information can be
beneficial for researchers studying sleep science and behavior.
Holistic Understanding: With a diverse set of variables such as caffeine and alcohol
consumption, exercise frequency, and smoking status, the dataset allows for a more holistic
understanding of factors influencing sleep. Researchers can explore the interplay of lifestyle
choices on sleep quality.
Research in Sleep Science: The dataset serves as a valuable resource for research in sleep
science, enabling investigations into the relationships between demographic factors, lifestyle
choices, and sleep outcomes. This can contribute to the development of interventions and
strategies for improving sleep health.
Healthcare Applications: Understanding sleep patterns can have implications for healthcare, as
sleep quality is often linked to overall health. The dataset could be used to identify correlations
between sleep habits and health outcomes.
3.6. DISADVANTAGES
Limited Generalization: Findings from the dataset may have limited generalization due to
potential variations in individual sleep behaviors and preferences. Results may not apply
universally to different populations.
Subjective Reporting: Some data, such as caffeine and alcohol consumption, smoking status,
and exercise frequency, relies on self-reporting, which can introduce biases and inaccuracies.
Participants may not always provide accurate information.
Incomplete Context: The dataset may lack information on certain contextual factors that could
impact sleep, such as stress levels, work schedules, or medical conditions. Without a complete
context, the analysis may not capture all relevant influences.
Small Sample Size: If the dataset has a small sample size, it may limit the statistical power and
generalizability of the findings. Larger and more diverse datasets are often preferred for robust
research conclusions.
Ethical Considerations: The collection and use of sensitive data related to sleep habits and
lifestyle choices raise ethical considerations. Ensuring privacy and obtaining informed consent
are crucial aspects that must be addressed.
Data Quality: The accuracy and reliability of the dataset depend on the quality of data collection
methods. Inconsistent or incomplete data may compromise the validity of analyses and findings.
Temporal Variability: Sleep patterns can vary over time, and a single snapshot of data may not
capture the dynamic nature of sleep behaviors. Longitudinal data would provide a more
comprehensive understanding of changes over time.
In utilizing the sleep efficiency dataset, researchers and analysts should carefully consider these
advantages and disadvantages to draw meaningful conclusions and ensure the responsible use of
the data.
CHAPTER - 4
IMPLEMENTATION (CODE)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load Dataset
df = pd.read_csv('/content/Sleep_Efficiency.csv')
df
Data preprocessing
Removing Outliers
# Assuming 'Sleep efficiency' is the target variable, drop it before calculating the IQR
# (only numeric features are considered for the IQR computation)
features = df.drop('Sleep efficiency', axis=1).select_dtypes(include='number')
# Calculate the IQR for each feature in the dataframe
Q1 = features.quantile(0.25)
Q3 = features.quantile(0.75)
IQR = Q3 - Q1
# Print the shape of the dataframe before removing the outliers
print("Shape of the dataframe before removing outliers: " + str(df.shape))
# Remove the outliers from the dataframe
df_no_outliers = df[~((features < (Q1 - 1.5 * IQR)) | (features > (Q3 + 1.5 * IQR))).any(axis=1)]
# Print the shape of the dataframe after removing the outliers
print("Shape of the dataframe after removing outliers: " + str(df_no_outliers.shape))
Finding Correlations
corrmat = df_no_outliers.corr(numeric_only=True)  # numeric_only avoids errors from non-numeric columns
plt.figure(figsize = (10,10))
sns.heatmap(corrmat, annot = True,cmap='coolwarm')
# As deep sleep and light sleep have a high negative correlation, light sleep will be dropped
df_no_outliers = df_no_outliers.drop('Light sleep percentage', axis=1)
# Dropping columns whose correlation with sleep efficiency lies between -0.25 and 0.25
df_no_outliers = df_no_outliers.drop(['Smoking status', 'Caffeine consumption', 'Age', 'Gender',
                                      'Bedtime', 'Wakeup time', 'Sleep duration',
                                      'Exercise frequency'], axis=1)
df_no_outliers.head(3)
Scaling Features
# Initialize the scaler (X is assumed to hold the predictor columns of df_no_outliers)
from sklearn.preprocessing import StandardScaler
scaler_linear = StandardScaler()
# Compute the mean and standard deviation of the training set, then transform it
X = scaler_linear.fit_transform(X)
Visualizing Data
sns.relplot(
data=df, kind="line",
x="Age", y="Sleep efficiency", style="Gender", color="black"
)
plt.show()
sns.boxplot(data=df, x="Smoking status", y="Sleep efficiency", color="green")
plt.xlabel("Smoking status (Yes or No)", color="green", fontsize=10)
plt.ylabel("Sleep efficiency", color="green", fontsize=10)
plt.title("Sleep efficiency of smokers and non-smokers", color="green", fontsize=10)
plt.show()
sns.relplot(
    data=df,
    x="Awakenings", y="Sleep efficiency"  # arguments assumed; the original snippet is truncated at this point
)
plt.show()
Classification
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
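The classifier code itself does not appear in this extract; the following is a hedged sketch of the approach described in Chapter 5 (binning sleep efficiency into Low/Medium/High and training a Random Forest Classifier). The bin edges, the feature set, and the hyperparameters are assumptions for illustration:

# --- hedged sketch: bin edges, features and hyperparameters are assumed ---
# (assumes missing values in df_no_outliers were handled earlier)
labels = pd.cut(df_no_outliers['Sleep efficiency'],
                bins=[0, 0.6, 0.8, 1.0], labels=['Low', 'Medium', 'High'])

features_clf = df_no_outliers.drop('Sleep efficiency', axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    features_clf, labels, test_size=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))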
Regression
yhat_test = linear_model.predict(X_test)
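Only the prediction line survives in this extract; a hedged sketch of the surrounding regression workflow described in Chapter 5 (the variable names other than linear_model, X_test, yhat_test, and the Final_Output DataFrame are assumptions) might look like this:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# X holds the scaled predictors; the continuous target is assumed to be sleep efficiency
y = df_no_outliers['Sleep efficiency']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

yhat_train = linear_model.predict(X_train)
yhat_test = linear_model.predict(X_test)

Final_Output = pd.DataFrame({
    'Method': ['Linear Regression'],
    'Train MSE': [mean_squared_error(y_train, yhat_train)],
    'Test MSE': [mean_squared_error(y_test, yhat_test)],
    'Train R2': [r2_score(y_train, yhat_train)],
    'Test R2': [r2_score(y_test, yhat_test)],
})
print(Final_Output)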
CHAPTER - 5
SNAPSHOTS
These lines import the necessary libraries, including NumPy, pandas, and scikit-learn
modules for various machine learning algorithms.
In this code snippet, a linear regression model is trained and evaluated using scikit-learn in
Python. The dataset is split into training and testing sets using the train_test_split function. The
linear regression model is initialized, trained on the training set, and then used to make
predictions. Mean Squared Error (MSE) and R-squared (R2) scores are calculated for both the
training and test sets, providing insights into the model's performance. The results are presented
in a Pandas DataFrame named Final_Output, summarizing the method used (Linear Regression),
along with the corresponding MSE and R2 scores for both the training and test datasets. This
structured output facilitates a quick comparison of the model's performance metrics.
In this code snippet, a classification model is built using a Random Forest Classifier to predict sleep
efficiency categories (Low, Medium, High) based on selected features. The dataset is split into
training and testing sets, and the classifier is trained on the former. Subsequently, predictions are
made on the testing set, and the model's performance is evaluated using the accuracy score, which
measures the proportion of correctly predicted categories. The output provides a concise assessment
of the model's accuracy in classifying sleep efficiency, offering insights into its effectiveness in
handling the given data and feature set.
CHAPTER - 6
DECLARATION
I, Ashwin Kumar, a student of 5th semester BE, Information Science and Engineering
department, Bangalore Institute of Technology, Bengaluru, hereby declare that the internship project
work entitled "SLEEP EFFICIENCY DATASET ANALYSIS" has been carried out by me at
Prinston Smart Engineers, Bengaluru, and is submitted in partial fulfilment of the course requirement
for the award of the degree of Bachelor of Engineering in Information Science and Engineering
of Visvesvaraya Technological University, Belagavi, during the academic year 2023-2024.
I also declare that, to the best of my knowledge and belief, the work reported here does not
form part of any other dissertation on the basis of which a degree or award was conferred on an
earlier occasion to any other student.
Place: Bengaluru
Ashwin Kumar
[1BI21IS019]
CHAPTER - 7
CONCLUSION
In conclusion, the exploration of the Sleep Efficiency Dataset from Kaggle through
implementation, classification, and regression analyses offers a nuanced understanding of the
multifaceted factors influencing sleep patterns. Utilizing a Random Forest Classifier, the
classification analysis demonstrates the dataset's predictive potential by accurately categorizing
sleep efficiency into Low, Medium, and High classes. This insight contributes to the broader
understanding of how demographic details and lifestyle choices collectively shape sleep outcomes.
Simultaneously, the regression analysis, employing a Linear Regression model, uncovers
quantitative relationships between predictors and continuous sleep-related metrics. Mean Squared
Error (MSE) and R-squared (R2) scores provide measures of model performance, offering
valuable insights into the variability and generalization capabilities of the regression model.
These analyses collectively highlight the interdisciplinary significance of such datasets, bridging
insights from sleep science, healthcare, and technology. The Random Forest Classifier's accuracy
showcases the model's effectiveness in classifying sleep efficiency categories, providing a
practical tool for predicting sleep outcomes. The Linear Regression model, by quantifying
relationships between specific predictors and sleep-related metrics, contributes to the broader
understanding of the nuanced dynamics of sleep duration and efficiency.
However, it is crucial to approach these findings with consideration of potential biases and ethical
concerns associated with self-reported lifestyle information. Rigorous data collection and
interpretation practices are necessary to ensure the robustness and reliability of the conclusions
drawn from the dataset.
The analyses on the Sleep Efficiency Dataset not only contribute to the scientific understanding of
sleep but also offer practical applications for healthcare professionals, researchers, and developers.
By delving into the complexities of sleep patterns, these analyses provide actionable insights that
can inform interventions and technological advancements aimed at enhancing sleep quality and
overall well-being.
CHAPTER - 8
REFERENCES
https://www.kaggle.com/code/hexenmeiser/sleep-efficiency-dataset-eda-and-scoring
https://rstudio-pubs-static.s3.amazonaws.com/1009332_d1d1e3612f374e48ad0dc1e07583ca99.html
https://medium.com/@larissa.tsuda.s/a-linear-regression-model-to-predict-sleep-efficiency-on-subjects-fac9b94443a5
https://www.researchgate.net/publication/283734463_ISRUC-Sleep_A_comprehensive_public_dataset_for_sleep_researchers
https://www.gigasheet.com/sample-data/sleep-health-and-lifestyle-dataset