
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

“Jnana Sangama”, Belagavi-590018, Karnataka

Internship Report on

“SLEEP EFFICIENCY DATASET ANALYSIS”


Submitted in partial fulfillment of the requirements for the award of the

Degree of

Bachelor of Engineering in Information Science and Engineering

Submitted by:

ASHWIN KUMAR (1BI21IS019)

MACHINE LEARNING INTERNSHIP AT


Prinston Smart Engineers

Under the Guidance of


Mr. Akash
Project Lead
Prinston Smart Engineers
Department of Information Science & Engineering
BANGALORE INSTITUTE OF TECHNOLOGY
K. R. Road, V. V. Pura, Bengaluru – 560004
2023-24
BANGALORE INSTITUTE OF TECHNOLOGY
V.V. Puram, K.R. Road, Bengaluru-560004
Department of Information Science & Engineering

CERTIFICATE

Certified that the project work entitled “SLEEP EFFICIENCY DATASET ANALYSIS” has been carried
out by ASHWIN KUMAR (1BI21IS019), a bona fide student of Bangalore Institute of Technology,
Bangalore, in partial fulfillment of the requirements of the V semester (Machine Learning Project) of
Bachelor of Engineering in Information Science and Engineering of Visvesvaraya Technological
University, Belagavi, during the year 2023 – 2024. It is certified that all corrections/suggestions
indicated for Internal Assessment have been incorporated in the report deposited in the departmental
library. The internship project report has been approved as it satisfies the academic requirements in
respect of the mini project work prescribed for the said degree.

Dr. Asha T Dr. M. U. Aswath

HOD, Dept. of IS&E Principal

External Viva

Name of the examiners: Signature with date

1.

2.
ACKNOWLEDGEMENT

While presenting this Machine Learning Project on “SLEEP EFFICIENCY DATASET


ANALYSIS”, I feel that it is my duty to acknowledge the help rendered to me by various people.

I would like to thank our Principal, Dr. M. U. Aswath, Bangalore Institute of Technology, for
his support throughout this project.

I express my wholehearted gratitude to Dr. Asha T, our respected Head of the Department of
Information Science & Engineering, and wish to acknowledge her valuable help and encouragement.

I sincerely acknowledge the guidance and constant encouragement of my internship guide,
Mr. Akash, Project Lead, Prinston Smart Engineers, whose valuable advice at every stage of
my project helped in its successful completion.

ASHWIN KUMAR
[1BI21IS019]
ABSTRACT

In this project, we have taken a sleep-related dataset sourced from Kaggle, which is a comprehensive
collection of information that spans a wide array of metrics related to sleep. It includes data on various
aspects such as sleep efficiency scores, patterns, biometric data, environmental factors, and
demographics. This rich and multifaceted dataset offers a valuable resource for researchers and
professionals in the fields of sleep science, healthcare, and technology. The dataset's inclusivity of
diverse metrics allows for a holistic understanding of sleep-related phenomena. Sleep efficiency scores,
for instance, provide a quantitative measure of how effectively an individual utilizes their time in bed
for actual sleep. Sleep patterns encompass information about the structure and organization of sleep,
including details about the duration and distribution of different sleep stages.
CONTENTS

1. Introduction
1.1. Problem Statement
1.2. Objective
1.3. Future Scope
2. Requirement Specification
2.1. Software Requirements
2.2. Hardware Requirements
3. System Definition
3.1. Project Description
3.2. Libraries Used
3.3. Technology Used
3.4. Dataset
3.5. Advantages
3.6. Disadvantages
4. Implementation (Code)
5. Snapshots
6. Declaration
7. Conclusion/Future Enhancement
8. Reference

CHAPTER - 1

INTRODUCTION

Arthur Samuel, a pioneer in the field of artificial intelligence and computer gaming,
coined the term “Machine Learning”. He defined machine learning as a “field of study
that gives computers the capability to learn without being explicitly programmed”. The
process starts with feeding good-quality data and then training our machines (computers)
by building machine learning models using the data and different algorithms. The choice
of algorithm depends on the type of data we have and the kind of task we are trying to
automate.

The performance of ML algorithms improves adaptively as the number of available samples
increases during the ‘learning’ process. For example, deep learning is a sub-domain of
machine learning that trains computers to imitate natural human traits like learning from
examples. It offers better performance parameters than conventional ML algorithms.

Machine learning is used in many different applications, from image and speech
recognition to natural language processing, recommendation systems, fraud detection,
portfolio optimization, automated tasks, and so on. Machine learning models are also used
to power autonomous vehicles, drones, and robots, making them more intelligent and
adaptable to changing environments.

How machine learning algorithms work


The lifecycle of a machine learning project involves a series of steps that include:

1. Study the Problem: The first step is to study the problem. This step involves
understanding the business problem and defining the objectives of the model.

2. Data Collection: Once the problem is well-defined, we can collect the relevant data
required for the model. The data could come from various sources such as databases,
APIs, or web scraping.


3. Data Preparation: Once the problem-related data is collected, it is a good idea
to check the data properly and put it into the desired format so that it can be used by
the model to find hidden patterns. This can be done in the following steps:

 Data cleaning

 Data transformation

 Exploratory data analysis and feature engineering

 Splitting the dataset for training and testing.

4. Model Selection: The next step is to select the appropriate machine learning
algorithm that is suitable for our problem. This step requires knowledge of the
strengths and weaknesses of different algorithms. Sometimes we use multiple models,
compare their results, and select the best model as per our requirements.
5. Model Building and Training: After selecting the algorithm, we have to build the
model.

1. In the case of traditional machine learning, building the model is easy; it is
just a matter of a few hyperparameter tunings.
2. In the case of deep learning, we have to define the layer-wise architecture along
with the input and output sizes, the number of nodes in each layer, the loss
function, the gradient descent optimizer, etc.
3. After that, the model is trained using the pre-processed dataset.

6. Model Evaluation: Once the model is trained, it can be evaluated on the test dataset to
determine its accuracy and performance using different techniques such as the classification
report, F1 score, precision, recall, ROC curve, mean squared error, mean absolute error, etc.
7. Model Tuning: Based on the evaluation results, the model may need to be tuned or
optimized to improve its performance. This involves tweaking the hyperparameters of
the model.
8. Deployment: Once the model is trained and tuned, it can be deployed in a production
environment to make predictions on new data. This step requires integrating the
model into an existing software system or creating a new system for the model.
9. Monitoring and Maintenance: Finally, it is essential to monitor the model’s
performance in the production environment and perform maintenance tasks as
required. This involves monitoring for data drift, retraining the model as needed,
and updating the model as new data becomes available. Implement logging
mechanisms to record predictions and outcomes. This facilitates the analysis of any
unexpected behavior or issues that may arise over time.

Figure 1.1 Machine Learning Life Cycle

How do we split data in Machine Learning?

Whenever a machine learning model is trained, we cannot train and evaluate it on one and
the same dataset; if we did, we would not be able to assess the performance of our model
on unseen data. For that reason, we split our source data into training, testing,
and validation datasets. The data splitting procedure is used to estimate the performance of
machine learning algorithms when they are used to make predictions on data not used to
train the model. Splitting the dataset is essential for an unbiased evaluation of prediction
performance.
 Training Data: The part of data we use to train our model. This is the data that your
model actually sees (both input and output) and learns from.
 Validation Data: The part of data that is used for frequent evaluation of the
model while it is being fit on the training dataset, and for tuning the hyperparameters
(parameters set before the model begins learning). This data plays its part
while the model is actually training.

 Testing Data: Once our model is completely trained, testing data provides an unbiased
evaluation. When we feed in the inputs of the testing data, our model predicts
values without seeing the actual outputs. After prediction, we evaluate the model by
comparing its predictions with the actual outputs present in the testing data. This is how
we evaluate how much our model has learned from the experience fed in as training
data at the time of training.

Figure 1.2 Data Splitting
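
As an illustration, the following is a minimal sketch of such a split using scikit-learn's train_test_split; the DataFrame df and the column name "Sleep efficiency" are assumptions for illustration, not taken verbatim from the dataset:

# A minimal sketch of a train/validation/test split with scikit-learn.
# The DataFrame df and the column name "Sleep efficiency" are assumed.
from sklearn.model_selection import train_test_split

X = df.drop(columns=["Sleep efficiency"])   # predictor variables
y = df["Sleep efficiency"]                  # response variable

# Hold out 20% as the test set, then carve a validation set out of the
# remaining 80% (25% of it, i.e. 20% of the total).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)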

Types of Machine Learning

Based on the methods and way of learning, machine learning is divided mainly into four
types, which are:

 Supervised Machine Learning

Supervised learning is a type of machine learning in which the algorithm is trained on a
labelled dataset. It learns to map input features to targets based on labelled training
data. In supervised learning, the algorithm is provided with input features and
corresponding output labels, and it learns to generalize from this data to make predictions
on new, unseen data. There are two main categories of supervised learning, which are
mentioned below:

 Classification

Classification deals with predicting categorical target variables, which represent
discrete classes or labels. For instance, classifying emails as spam or not spam, or
predicting whether a patient has a high risk of heart disease. Classification algorithms
learn to map the input features to one of the predefined classes.
Here are some classification algorithms:

o Logistic Regression

o Support Vector Machine

o Random Forest

o Decision Tree

o K-Nearest Neighbours (KNN)

o Naive Bayes

 Regression

Regression, on the other hand, deals with predicting continuous target variables, which
represent numerical values. For example, predicting the price of a house based on its
size, location, and amenities, or forecasting the sales of a product. Regression
algorithms learn to map the input features to a continuous numerical value.
Here are some regression algorithms:

o Linear Regression

o Polynomial Regression

o Ridge Regression

o Lasso Regression

o Decision tree

o Random Forest

Advantages of Supervised Machine Learning

 Supervised Learning models can have high accuracy as they are trained on labelled data.

 The process of decision-making in supervised learning models is often interpretable.

 Pre-trained supervised models can often be reused, which saves time and resources
compared with developing new models from scratch.

Disadvantages of Supervised Machine Learning

 It may struggle with unseen or unexpected patterns that are not present in the
training data.

 It can be time-consuming and costly, as it relies on labelled data.

 It may generalize poorly to new data.

 Unsupervised Machine Learning

Unsupervised learning is a type of machine learning technique in which an algorithm
discovers patterns and relationships using unlabelled data. Unlike supervised learning,
unsupervised learning doesn’t involve providing the algorithm with labelled target
outputs. The primary goal of unsupervised learning is often to discover hidden patterns,
similarities, or clusters within the data, which can then be used for various purposes,
such as data exploration, visualization, dimensionality reduction, and more.

There are two main categories of unsupervised learning that are mentioned below:

 Clustering

Clustering is the process of grouping data points into clusters based on their
similarity. This technique is useful for identifying patterns and relationships in
data without the need for labelled examples.
Here are some clustering algorithms (note that the last two below are, strictly speaking, dimensionality-reduction techniques often used alongside clustering):

o K-Means Clustering algorithm

o Mean-shift algorithm


o DBSCAN Algorithm

o Principal Component Analysis

o Independent Component Analysis

 Association

Association rule learning is a technique for discovering relationships between items
in a dataset. It identifies rules indicating that the presence of one item implies the
presence of another item with a specific probability.
Here are some association rule learning algorithms:

o Apriori Algorithm

o Eclat

o FP-growth Algorithm

Advantages of Unsupervised Machine Learning

 It helps to discover hidden patterns and various relationships between the data.

 Used for tasks such as customer segmentation, anomaly detection, and data
exploration.
 It does not require labelled data and reduces the effort of data labelling.

Disadvantages of Unsupervised Machine Learning

 Without using labels, it may be difficult to assess the quality of the model’s
output.

 Cluster interpretability may not be clear, and clusters may not have meaningful
interpretations.
 Techniques such as autoencoders and dimensionality reduction, often needed to
extract meaningful features from raw data, add complexity.


 Semi-Supervised Machine Learning

Semi-supervised learning is a machine learning approach that sits between supervised and
unsupervised learning, so it uses both labelled and unlabelled data. It’s particularly
useful when obtaining labelled data is costly, time-consuming, or resource-intensive,
or when labelling data requires skills and relevant resources in order to train or learn
from it.

We use these techniques when a small portion of the data is labelled and the rest, a
large portion of it, is unlabelled. We can use unsupervised techniques to predict labels
and then feed these labels to supervised techniques. This technique is mostly applicable
to image datasets, where usually not all images are labelled.
There are a number of different semi-supervised learning methods each with its own
characteristics. Some of the most common ones include:
 Graph-based semi-supervised learning: This approach uses a graph to represent
the relationships between the data points. The graph is then used to propagate
labels from the labelled data points to the unlabelled data points.
 Label propagation: This approach iteratively propagates labels from the labelled
data points to the unlabelled data points, based on the similarities between the
data points.
 Co-training: This approach trains two different machine learning models on
different views (feature subsets) of the data. Each model then labels unlabelled
examples for the other.
 Self-training: This approach trains a machine learning model on the labelled data
and then uses the model to predict labels for the unlabelled data. The model is
then retrained on the labelled data and the predicted labels for the unlabelled data.
 Generative adversarial networks (GANs): GANs are a type of deep learning
algorithm that can be used to generate synthetic data. GANs can be used to
generate unlabelled data for semi-supervised learning by training two neural
networks.



Advantages of Semi-Supervised Machine Learning

 It leads to better generalization compared with supervised learning, as it
takes both labelled and unlabelled data.
 It can be applied to a wide range of data.

Disadvantages of Semi-Supervised Machine Learning

 Semi-supervised methods can be more complex to implement compared to
other approaches.
 It still requires some labelled data, which might not always be available or easy
to obtain.
 Poor-quality unlabelled data can degrade the model’s performance.

 Reinforcement Learning

Reinforcement learning is a method in which an algorithm learns by interacting with the
environment, producing actions and discovering errors. Trial, error, and delayed reward
are the most relevant characteristics of reinforcement learning. In this technique, the
model keeps improving its performance using reward feedback to learn the behaviour or
pattern. These algorithms are specific to a particular problem, e.g., Google’s self-driving
car, or AlphaGo, where a bot competes with humans and even itself to become a better
and better Go player. Each time we feed in data, the agent learns and adds the data to its
knowledge, which becomes training data. So, the more it learns, the better trained, and
hence experienced, it gets. There are two main types of reinforcement learning:

 Positive reinforcement

o Rewards the agent for taking a desired action.

o Encourages the agent to repeat the behaviour.

o Examples: Giving a treat to a dog for sitting, providing a point in a game for
a correct answer.


 Negative reinforcement

o Removes an undesirable stimulus to encourage a desired behaviour.

o Discourages the agent from repeating the behaviour.

o Examples: Turning off a loud buzzer when a lever is pressed, avoiding a
penalty by completing a task.

Advantages of Reinforcement Machine Learning

 It supports autonomous decision-making, and is well-suited to tasks that require
learning a sequence of decisions, such as robotics and game-playing.

 This technique is preferred for achieving long-term results that are otherwise
very difficult to achieve.
 It can be used to solve complex problems that cannot be solved by conventional
techniques.

Disadvantages of Reinforcement Machine Learning

 Training Reinforcement Learning agents can be computationally expensive and
time-consuming.

 Reinforcement learning is not preferable for solving simple problems.

 It needs a lot of data and a lot of computation, which makes it impractical and
costly.

 Balancing exploration (trying new actions to discover their outcomes) and
exploitation (choosing known actions for immediate rewards) is a challenging
trade-off in RL, and finding the right balance can be complex.

 RL algorithms often have various hyperparameters that need careful tuning.
Sensitivity to these hyperparameters can make them challenging to use without
extensive experimentation.

 Many RL algorithms require a substantial amount of data to learn effectively. This
can be problematic in situations where data collection is expensive, time-consuming,
or limited.

 Handling continuous action spaces in RL poses challenges. Discretizing these spaces
can lead to inefficiencies, and direct optimization in continuous spaces can be
computationally demanding.

Figure 1.3 Types of Machine Learning

Need for machine learning:

Machine learning is important because it allows computers to learn from data and improve
their performance on specific tasks without being explicitly programmed. This ability to
learn from data and adapt to new situations makes machine learning particularly useful
for tasks that involve large amounts of data, complex decision-making, and dynamic
environments.


Here are some specific areas where machine learning is being used:
o Predictive modelling: Machine learning can be used to build predictive models
that can help businesses make better decisions. For example, machine learning can
be used to predict which customers are most likely to buy a particular product, or

which patients are most likely to develop a certain disease.


o Natural language processing: Machine learning is used to build systems that can
understand and interpret human language. This is important for applications such as
voice recognition, chatbots, and language translation.

o Computer vision: Machine learning is used to build systems that can recognize
and interpret images and videos. This is important for applications such as self-
driving cars, surveillance systems, and medical imaging.
o Fraud detection: Machine learning can be used to detect fraudulent behaviour in
financial transactions, online advertising, and other areas.
o Recommendation systems: Machine learning can be used to build
recommendation systems that suggest products, services, or content to users based
on their past behaviour and preferences.

Overall, machine learning has become an essential tool for many businesses and
industries, as it enables them to make better use of data, improve their decision-making
processes, and deliver more personalized experiences to their customers.
Comparison of Machine Learning Algorithms

Comparing machine learning algorithms is important in itself, but there are some
not-so-obvious benefits of comparing various experiments effectively.

 Better performance

The primary objective of model comparison and selection is better performance of the
machine learning software/solution: to narrow down the algorithms that best suit both
the data and the business requirements.
 Longer lifetime

High performance can be short-lived if the chosen model is tightly coupled with
the training data and fails to interpret unseen data. So, it’s also important to find
the model that understands underlying data patterns so that the predictions are long-
lasting and the need for re-training is minimal.

 Easier retraining

When models are evaluated and prepared for comparison, minute details and
metadata get recorded, which come in handy during retraining. For example, if a
developer can clearly retrace the reasons behind choosing a model, the causes of
model failure will immediately stand out, and re-training can start with equal speed.
 Speedy production

With the model details available at hand, it’s easy to narrow down on models that
can offer high processing speed and use memory resources optimally. Also,
during production, several parameters are required to configure the machine
learning solution, and having production-level data at hand makes it easy to align
with the production engineers. Moreover, knowing the resource demands of
different algorithms makes it easier to check their compliance and feasibility
with respect to the organization’s allocated assets.

Performance Metrics in Machine Learning

Evaluating the performance of a Machine learning model is one of the important steps
while building an effective ML model. To evaluate the performance or quality of the
model, different metrics are used, and these metrics are known as performance metrics or
evaluation metrics. These performance metrics help us understand how well our model has
performed for the given data. In this way, we can improve the model's performance by
tuning the hyper-parameters. Each ML model aims to generalize well on unseen/new data,
and performance metrics help determine how well the model generalizes on the new
dataset. In machine learning, each task or problem is divided into classification and
Regression. Not all metrics can be used for all types of problems; hence, it is important to
know and understand which metrics should be used. Different evaluation metrics are used
for both Regression and Classification tasks.

 Performance metrics for Classification


In a classification problem, the category or class of the data is identified based on the training data.
The model learns from the given dataset and then classifies the new data into classes or groups
based on the training. It predicts class labels as the output, such as Yes or No, 0 or 1, Spam or
Not Spam, etc. To evaluate the performance of a classification model, different metrics are
used, and some of them are as follows:

o Accuracy
The accuracy metric is one of the simplest classification metrics to implement. It is
the ratio of the number of correct predictions to the total number of predictions:

Accuracy = Number of correct predictions / Total number of predictions

o Confusion Matrix
A confusion matrix is a tabular representation of prediction outcomes of any binary
classifier, which is used to describe the performance of the classification model on
a set of test data when true values are known.

Consider an example confusion matrix for a disease-prediction model. We can determine the following from such a matrix:

 In the matrix, the columns are for the predicted values, and the rows specify
the actual values. Here, actual and predicted each take two possible
classes, Yes or No. So, if we are predicting the presence of a disease in a
patient, Yes in the prediction column means the patient has the disease,
and No means the patient doesn't have the disease.

 In this example, the total number of predictions is 165, out of which the model
predicted Yes 110 times and No 55 times.

 However, in reality, there are 60 cases in which patients don't have the disease
and 105 cases in which patients do have the disease.

In general, the table is divided into four terminologies, which are as follows:

 True Positive (TP): The model predicted positive, and the actual value is also
positive.

 True Negative (TN): The model predicted negative, and the actual value is also
negative.

 False Positive (FP): The model predicted positive, but the actual value is
negative.

 False Negative (FN): The model predicted negative, but the actual value is
positive.

o Precision

The precision metric is used to overcome the limitation of accuracy. Precision
determines the proportion of positive predictions that were actually correct. It is
calculated as the ratio of true positives to the total number of positive predictions
(true positives plus false positives):

Precision = TP / (TP + FP)

o Recall

Recall is similar to the precision metric; however, it aims to calculate the
proportion of actual positives that were identified correctly. It is calculated as
the ratio of true positives to the total number of actual positives, whether correctly
predicted as positive or incorrectly predicted as negative (true positives plus false
negatives):

Recall = TP / (TP + FN)

o F-Score

F-score or F1 Score is a metric to evaluate a binary classification model on the
basis of predictions that are made for the positive class. It is calculated with the
help of Precision and Recall. It is a single score that represents both Precision and
Recall, calculated as their harmonic mean, assigning equal weight to each of them.

The formula for calculating the F1 score is given below:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
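
As a hedged illustration, the metrics above can be computed with scikit-learn once a model's predictions are available; y_test and y_pred are assumed to exist from an earlier fit/predict step, with binary 0/1 labels:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# y_test holds the true labels, y_pred the predicted labels (assumed).
print(confusion_matrix(y_test, y_pred))          # counts of TN/FP/FN/TP
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))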


o AUC (Area Under the Curve)-ROC

Sometimes we need to visualize the performance of the classification model on
charts; then, we can use the AUC-ROC curve. It is one of the popular and
important metrics for evaluating the performance of a classification model.

Firstly, let's understand the ROC (Receiver Operating Characteristic) curve.
ROC represents a graph showing the performance of a classification model at
different threshold levels. The curve is plotted between two parameters, which are:

 True Positive Rate

 False Positive Rate

TPR, or True Positive Rate, is a synonym for Recall, and hence can be calculated as:

TPR = TP / (TP + FN)

FPR, or False Positive Rate, can be calculated as:

FPR = FP / (FP + TN)
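
A minimal sketch of plotting the ROC curve with scikit-learn, assuming y_score holds predicted probabilities for the positive class (e.g. model.predict_proba(X_test)[:, 1] for some fitted classifier):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# y_test: true binary labels; y_score: positive-class probabilities (assumed).
fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.plot(fpr, tpr, label="AUC = %.3f" % roc_auc_score(y_test, y_score))
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()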

 Performance metrics for Regression


Regression is a supervised learning technique that aims to find the relationships
between the dependent and independent variables. A predictive regression model
predicts a numeric or discrete value. The metrics used for regression are different from
the classification metrics. It means we cannot use the Accuracy metric (explained above)
to evaluate a regression model; instead, the performance of a regression model is reported
as errors in the prediction. The following are the popular metrics used to evaluate
regression models.

o Mean Absolute Error


Mean Absolute Error, or MAE, is one of the simplest metrics; it measures the
absolute difference between actual and predicted values, where absolute means
taking the magnitude of the difference regardless of sign:

MAE = (1/N) Σ |y − ŷ|

To understand MAE, let's take the example of linear regression, where the
model draws a best-fit line between the dependent and independent variables. To
measure the MAE, we calculate the difference between the actual and predicted
value for each point, take its absolute value, and then average these absolute
errors over the complete dataset.

o Mean Squared Error

Mean Squared Error, or MSE, is one of the most suitable metrics for regression
evaluation. It measures the average of the squared differences between the values
predicted by the model and the actual values:

MSE = (1/N) Σ (y − ŷ)²

Since the errors are squared, MSE only takes non-negative values, and it is usually
positive and non-zero. Moreover, because of the squaring, large errors are penalized
much more heavily than small ones, so MSE can overstate how bad the model is when
outliers are present. MSE is a much-preferred metric compared to other regression
metrics, as it is differentiable and hence can be optimized better.

o R2 Score

R squared error is also known as the Coefficient of Determination, which is another
popular metric used for regression model evaluation. The R-squared metric
enables us to compare our model with a constant baseline to determine the
performance of the model. To select the constant baseline, we take the mean of
the data and draw a line at the mean.

The R squared score will always be less than or equal to 1, regardless of how
large or small the values are.

o Adjusted R2

Adjusted R squared, as the name suggests, is the improved version of R squared.
R squared has the limitation that its score improves as more terms are added, even
when the model is not actually improving, which can mislead data scientists.

To overcome this issue, adjusted R squared is used, which is always lower than or
equal to R². It adjusts for the number of predictors and only shows improvement
when there is a real improvement.
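
As an illustrative sketch, the regression metrics above can be computed with scikit-learn and NumPy; y_test, y_pred, and X_test are assumed to exist from an earlier model fit:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                 # root mean squared error
r2 = r2_score(y_test, y_pred)

# Adjusted R2: n = number of samples, p = number of predictors.
n, p = len(y_test), X_test.shape[1]
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)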

Data Pre-Processing

Pre-processing refers to the transformations applied to our data before feeding it to the
algorithm. Data preprocessing is a technique that is used to convert raw data into
a clean data set. In other words, whenever data is gathered from different sources,
it is collected in a raw format that is not feasible for analysis.

Why do We Need Data Preprocessing?

 Improving Data Quality: Data preprocessing is essential for enhancing the quality of
data by handling inconsistencies, inaccuracies, and errors, which is critical for ensuring
reliable and robust analytics.
 Dealing with Missing Values: Data preprocessing includes techniques like imputation
that are critical for dealing with missing data effectively, as datasets often have
missing values which can significantly hinder the performance of machine learning
models.
 Normalizing and Scaling: Data preprocessing helps in normalizing or scaling
features, which is especially important for algorithms that are sensitive to the scale of
the input. This ensures that all the features are on a comparable scale, which is
crucial for the accurate performance of many machine learning algorithms.
 Handling Outliers: Through data preprocessing, outliers can be identified and managed
appropriately. This is important as outliers can have a disproportionate effect on the
modelling process and can lead to misleading results.
 Dimensionality Reduction: Data preprocessing includes techniques such as Principal
Component Analysis (PCA) for reducing the number of input features, which not
only helps in improving the performance of models but also makes the dataset more
manageable and computationally efficient.

Steps in Data Preprocessing

Data preprocessing is a step that involves transforming raw data so that issues owing to
incompleteness, inconsistency, and/or a lack of appropriate representation of trends are
resolved, so as to arrive at a dataset in an understandable format. The steps used in
data preprocessing include the following:

1. Data profiling. Data profiling is the process of examining, analysing, and reviewing data
to collect statistics about its quality. It starts with a survey of existing data and its
characteristics. Data scientists identify the data sets that are pertinent to the problem at hand,
inventory their significant attributes, and form a hypothesis about features that might be relevant
for the proposed analytics or machine learning task. They also relate data sources to the
relevant business concepts and consider which preprocessing libraries could be used.

2. Data cleansing. The aim here is to find the easiest way to rectify quality issues, such as
eliminating bad data, filling in missing data or otherwise ensuring the raw data is suitable
for feature engineering.

3. Data reduction. Raw data sets often include redundant data that arise from
characterizing phenomena in different ways or data that is not relevant to a particular ML,
AI or analytics task. Data reduction uses techniques like principal component analysis to
transform the raw data into a simpler form suitable for particular use cases.

4. Data transformation. Here, data scientists think about how different aspects of the data
need to be organized to make the most sense for the goal. This could include things
like structuring unstructured data, combining salient variables when it makes sense, or
identifying important ranges to focus on.

5. Data enrichment. In this step, data scientists apply the various feature engineering
libraries to the data to effect the desired transformations. The result should be a data set
organized to achieve the optimal balance between the training time for a new model and
the required compute.

6. Data validation. At this stage, the data is split into two sets. The first set is used to train
a machine learning or deep learning model. The second set is the testing data that is used
to gauge the accuracy and robustness of the resulting model. This second step helps
identify any problems in the hypothesis used in the cleaning and feature engineering of the
data. If the data scientists are satisfied with the results, they can push the preprocessing
task to a data engineer who figures out how to scale it for production. If not, the data
scientists can go back and make changes to the way they implemented the data cleansing
and feature engineering steps.

Figure 1.5. Steps in Data Pre-Processing


Feature Encoding
Feature encoding is the process of transforming data into a format that can be used by
machine learning algorithms. This is often necessary when working with real-world data,
which can be messy and unstructured.

Machine learning models can only work with numerical values. For this reason, it is
necessary to transform the categorical values of the relevant features into numerical ones.
This process is called feature encoding.

Here are some of the more well-known and widely used encoding techniques:

 Label encoding: Label encoding is a method of encoding variables or features in a
dataset. It involves converting categorical variables into numerical variables.
Suppose we have a column Height in some dataset that has the elements Tall, Medium,
and Short. To convert this categorical column into a numerical column, we apply
label encoding. After applying label encoding, the Height column is converted into a
numerical column with elements 0, 1, and 2, where 0 is the label for Tall, 1 for
Medium, and 2 for Short.

Figure 1.6. Label Encoding
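
A minimal sketch of the Height example in pandas; the explicit mapping reproduces the labels described above, while scikit-learn's LabelEncoder (commented out) would assign codes in alphabetical order, so its exact numbers may differ:

import pandas as pd

df = pd.DataFrame({"Height": ["Tall", "Medium", "Short", "Medium"]})
# Explicit mapping reproducing the example (Tall=0, Medium=1, Short=2).
df["Height"] = df["Height"].map({"Tall": 0, "Medium": 1, "Short": 2})

# Alternative:
# from sklearn.preprocessing import LabelEncoder
# df["Height"] = LabelEncoder().fit_transform(df["Height"])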

 One-hot encoding: One-hot encoding is the process by which categorical variables are
converted into a form that can be used by ML algorithms.


Figure 1.7. One-Hot Encoding
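
A minimal one-hot encoding sketch using pandas; the Gender column is an illustrative assumption:

import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})
# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["Gender"])
print(encoded)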

 Binary encoding: Binary encoding is the process of encoding data using binary
code. In binary encoding, each value is represented by a combination of 0s and 1s.

For Example:

o 0000 - 0
o 0001 - 1
o 0010 - 2
o 0011 - 3
o 0100 - 4
o 0101 - 5
o 0110 - 6
o 0111 - 7
o 1000 - 8
o 1001 - 9
o 1010 - 10


1.1. PROBLEM STATEMENT

• To demonstrate the implementation of classification and regression analysis on the Sleep
Efficiency dataset from Kaggle.

1.2. OBJECTIVE

 Classification: The main goal of the classification analysis is to create predictive models
that classify people into different sleep-quality categories based on demographics, sleep
patterns, and lifestyle. Sleep efficiency is an important indicator of sleep quality,
measuring the percentage of time in bed actually spent asleep.

 Visualization: The purpose of the Sleep Efficiency dataset visualization is to gain
insight, discover patterns, and communicate effectively. Visualization plays an
important role in understanding the relationships between the different features and
the targets, and helps identify trends and patterns in the data.

 Regression: The purpose of the regression analysis of the sleep dataset is to predict the
continuous variable “sleep duration” based on the various variables provided in the
dataset. Regression aims to understand and evaluate the relationship between the
independent variables (features) and the dependent variable (sleep duration).

1.3. FUTURE SCOPE

During the course of this project, I came to recognize the critical importance of feature
scaling in the performance of machine learning models. The fundamental concept revolves
around ensuring that all features are on a consistent scale, which significantly impacts the
models' effectiveness. As a future direction, we can further enhance the existing system by
delving into advanced feature scaling techniques and exploring their impact on model
accuracy and robustness.


CHAPTER - 2

REQUIREMENTS SPECIFICATION

2.1. SOFTWARE REQUIREMENTS

 Operating system – Windows 7/8/10/11

 Google Colab environment

 Libraries – NumPy, Scikit-Learn, Matplotlib and Pandas

 Language used is Python

2.2. HARDWARE REQUIREMENTS

 Processor – i3 Processor

 Processor Speed – 1 GHz

 Memory – 2 GB RAM

 1TB Hard Disk Drive

 Mouse or any other pointing device

 Keyboard

 Display device: Color Monitor


CHAPTER - 3

SYSTEM DEFINITION

3.1. PROJECT DESCRIPTION

Supervised machine learning algorithms can be broadly classified into regression and
classification algorithms. Regression algorithms predict outputs for continuous values;
to predict categorical values, we need classification algorithms.

Classification

Classification is a technique for determining which class the dependent variable belongs
to, based on one or more independent variables.

A classifier is a type of machine learning algorithm that assigns a label to a data input.
Classifier algorithms use labelled data and statistical methods to produce predictions about
data input classifications. Here, we employ logistic regression as the primary
classification algorithm.

Logistic Regression

Logistic regression is a supervised machine learning algorithm mainly used for
classification tasks, where the goal is to predict the probability that an instance
belongs to a given class. Although it is used for classification, it is called logistic
regression because it takes the output of a linear regression function as input and
uses a sigmoid function to estimate the probability of the given class.

Firstly, linear regression is performed on the relationship between the variables to get the model.


The logistic regression model then transforms the continuous output of the linear
regression function into a categorical output using a sigmoid function.

Figure 3.1. Logistic Regression
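
A minimal sketch of fitting scikit-learn's LogisticRegression; X_train, y_train, and X_test are assumed to come from the train/test split described later, with y as a binary sleep-quality label (an assumption for illustration):

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]   # sigmoid output: class probability
labels = clf.predict(X_test)              # probabilities thresholded at 0.5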


Regression

Regression is a technique in machine learning that predicts outcomes by finding
relationships between dependent and independent variables. It is a supervised learning
approach that uses labelled training data to create models. A regression problem is when
the output variable is a real or continuous value, such as “salary” or “weight”.

Linear regression, decision tree regression, and random forest regression are the chosen
algorithms in this project.

Linear regression

The linear regression algorithm shows a linear relationship between a dependent variable
(y) and one or more independent variables (x), hence the name linear regression. Since
linear regression shows a linear relationship, it finds how the value of the dependent
variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the image below:

Figure 3.2. Linear Regression


Mathematically, we can represent a linear regression as:

y = a0 + a1x + ε

Here,

y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error

The values for the x and y variables are training datasets for the linear regression model
representation.
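
As a sketch, the same relationship can be fitted with scikit-learn, with the intercept and coefficient corresponding to a0 and a1 above; x and y are assumed to be 1-D NumPy arrays:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x.reshape(-1, 1), y)   # sklearn expects a 2-D feature matrix
a0 = model.intercept_            # intercept of the line
a1 = model.coef_[0]              # linear regression coefficient
y_hat = model.predict(x.reshape(-1, 1))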

Decision Tree Regression

Decision Tree is a decision-making tool that uses a flowchart-like tree structure; it is a
model of decisions and all of their possible results, including outcomes, input costs, and utility.

The process of splitting starts at the root node and is followed by a branched tree that
finally leads to a leaf node (terminal node) that contains the prediction or the final
outcome of the algorithm. Construction of decision trees usually works top-down, by
choosing a variable at each step that best splits the set of items. Each sub-tree of the
decision tree model can be represented as a binary tree where a decision node splits into
two nodes based on the conditions.

Decision tree regression observes features of an object and trains a model in the
structure of a tree to predict data in the future to produce meaningful continuous output.
Continuous output means that the output/result is not discrete, i.e., it is not represented
just by a discrete, known set of numbers or values.


Figure 3.3. Decision Tree Regression
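
A minimal decision tree regression sketch with scikit-learn; X_train, y_train, and X_test are assumed, and max_depth is an illustrative choice:

from sklearn.tree import DecisionTreeRegressor

# max_depth limits how far the top-down splitting goes, which helps keep
# the tree from memorizing the training data.
tree = DecisionTreeRegressor(max_depth=5, random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)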

Random Forest Regression

Random forests, or random decision forests, are an ensemble learning method that combines
multiple learners to obtain better predictive performance than could be obtained from any
of the constituent learning algorithms alone; they are mostly used for solving classification
and regression problems.

Random Forest Regression algorithms are a class of Machine Learning algorithms that use
the combination of multiple random decision trees each trained on a subset of data. The
use of multiple trees gives stability to the algorithm and reduces variance. The random
forest regression algorithm is a commonly used model due to its ability to work well for
large and most kinds of data.

The algorithm creates each tree from a different sample of input data. At each node, a
different sample of features is selected for splitting and the trees run in parallel without
any interaction. The predictions from each of the trees are then averaged to produce a
single result which is the prediction of the Random Forest.


Figure 3.4. Random Forest Regression
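
A minimal random forest regression sketch with scikit-learn; the data variables are assumed, and n_estimators is an illustrative choice:

from sklearn.ensemble import RandomForestRegressor

# n_estimators is the number of trees whose predictions are averaged.
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)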

WORKING DESCRIPTION

The sleep efficiency dataset catalogues test subjects' sleep behaviors, demographic details, and
lifestyle factors. It provides a rich source for investigating correlations between sleep patterns
and various influences, offering insights for research in sleep science, healthcare applications,
and advancements in sleep-tracking technologies. However, potential biases and ethical
considerations warrant careful interpretation of the data.

SLEEP EFFICIENCY DATASET ANALYSIS

The Sleep Efficiency Dataset Analysis involves a comprehensive examination of data


encompassing the sleep patterns, demographic characteristics, and lifestyle factors of a group of
test subjects. This dataset provides a valuable opportunity for researchers and analysts to delve
into the intricate relationships between various variables and understand the factors influencing
sleep quality.


At the core of this analysis are variables such as bedtime, wakeup time, total sleep duration,
sleep efficiency, and the percentages of time spent in different sleep stages, including REM,
deep, and light sleep. These quantitative measures offer insights into the overall sleep
architecture of the test subjects. Additionally, the dataset includes demographic details such as
age and gender, providing a broader context for understanding sleep patterns.

Lifestyle factors play a pivotal role in shaping sleep outcomes, and the dataset captures key
elements such as caffeine and alcohol consumption, smoking status, and exercise frequency.
Exploring these factors in conjunction with sleep-related metrics allows researchers to uncover
potential correlations and patterns. For instance, researchers may investigate whether increased
exercise frequency correlates with improved sleep efficiency or if caffeine consumption is
associated with changes in sleep duration.

The implications of this analysis extend into various domains. In the realm of sleep science,
researchers can gain valuable insights into the complex interplay between lifestyle choices and
sleep quality. Healthcare applications may benefit from understanding how certain lifestyle
factors contribute to sleep-related issues, informing personalized interventions for improved sleep
health.
Context of Dataset

The sleep efficiency dataset contains details on test subjects, including age, gender,
bedtime, wakeup time, sleep duration, and various sleep-related factors such as REM sleep
percentage, deep sleep percentage, and light sleep percentage. Additional information includes
the number of awakenings, caffeine and alcohol consumption, smoking status, and exercise
frequency.
Data Pre-Processing
Data preprocessing is an important step before the data is used. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of
data preprocessing is to improve the quality of the data and to make it more suitable for
the specific model to train on. In this dataset, there are both numerical and categorical
features. The categorical features need to be converted to numerical ones, as the models
take only numerical values.


Key steps in this process include handling missing data, encoding categorical variables,
and scaling numerical features.

 Handling Missing Data:

Identify and address missing values in the dataset, employing strategies such as
imputation or removal to maintain data integrity.
 Encoding Categorical Variables:

Utilize one-hot encoding to convert categorical variables into a format suitable for
machine learning models, enhancing their interpretability and effectiveness.
 Scaling Numerical Features:

Standardize numerical features to a common scale using techniques like Standard


Scaler, preventing certain variables from dominating the modelling process.

By implementing these pre-processing steps, it is ensured that the dataset is primed for
meaningful analysis and model development.
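
A hedged sketch of these three steps on the sleep dataset; the filename and column names are assumptions for illustration and may differ from the actual CSV headers:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("Sleep_Efficiency.csv")          # hypothetical filename

# 1. Handle missing data: fill numeric gaps with the column mean.
df = df.fillna(df.mean(numeric_only=True))

# 2. Encode categorical variables with one-hot encoding.
df = pd.get_dummies(df, columns=["Gender", "Smoking status"])

# 3. Scale numerical features to zero mean and unit variance.
num_cols = ["Age", "Sleep duration", "Caffeine consumption"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])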

Training and Testing split

To gauge the performance and generalizability of our machine learning models, we employ a
training and testing split. This involves partitioning the dataset into subsets dedicated to model
training and evaluation.

Before splitting the data for training and testing, we have to assign the response variable and
predictor variable to Y and X respectively. Now we have to split the data in an 80:20 ratio. 80%
of the data will be used for training the models and 20% of the data will be used for testing.
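
A minimal sketch of this 80:20 split, assuming X and Y have been assigned as described:

from sklearn.model_selection import train_test_split

# 80% of rows for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)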

Performing Classification

The classification aspect of our project involves predicting sleep-quality categories based on a
combination of demographic, lifestyle, and sleep-pattern factors. The Random Forest Classifier is
employed as the primary classification algorithm.

Through classification modelling, it is aimed to uncover the intricate relationships between the
predictor variables and sleep quality, offering a predictive framework for sleep research.
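
A minimal sketch of this classification step with scikit-learn's RandomForestClassifier; the data variables are assumed from the split above, and this is an illustration, not the report's verbatim code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))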


Performing Regression

The regression component of our project focuses on predicting sleep duration using a suite
of features. Linear regression is the chosen algorithm.

With the prepared model, test it on the 20% testing data (X_test) and assign the predictions
to the y_pred variable. Then assess the performance of the model using the root mean
squared error and the r2 score.
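
A minimal sketch of that evaluation, under the same assumptions about the split variables:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # root mean squared error
r2 = r2_score(y_test, y_pred)
print("RMSE:", rmse, "R2:", r2)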

3.2. LIBRARIES USED

In this project, several essential Python libraries are harnessed to seamlessly process,
analyse, and model the sleep efficiency data.

 NumPy
NumPy is a general-purpose array-processing package. NumPy served as the backbone for
scientific computing, facilitating numerical operations and array manipulations critical for
data preprocessing. It provides a high-performance multidimensional array object, and
tools for working with these arrays. It is the fundamental package for scientific computing
with Python. Besides its obvious scientific uses, NumPy can also be used as an efficient
multidimensional container of generic data.

 Pandas
Pandas is an open-source library that is built on top of the NumPy library. It is a Python
package that offers various data structures and operations for manipulating numerical data
and time series. It is mainly popular for making importing and analysing data much easier.
Pandas is fast and has high performance and productivity for users. It allows us to
efficiently handle and explore the sleep efficiency dataset, organizing the information
into a structured and analyzable format.

 Matplotlib
Matplotlib is a plotting library that is ideal for creating visualizations in Python. It
provides a range of tools for creating line plots, scatter plots, histograms, and more.


 Scikit-learn

Scikit-learn is a popular library for machine learning in Python. It provides a wide range
of tools for classification, regression, and clustering. Scikit-learn also includes tools for
data preprocessing, model selection, and model evaluation. Its wide array of tools enables
us to predict sleep quality and sleep duration based on various demographic and lifestyle
features.

3.3. TECHNOLOGIES USED

Machine Learning
Machine learning (ML) constitutes the exploration of computer algorithms designed to
enhance their performance autonomously through experience and the utilization of data.
Positioned within the realm of artificial intelligence, ML algorithms construct models
based on sample data, commonly referred to as "training data." This process enables them
to make predictions or decisions without explicit programming for each scenario.

The application of machine learning spans a diverse range of fields, including medicine,
email filtering, speech recognition, and computer vision. Its significance lies in its ability
to tackle tasks that are challenging or impractical to address using traditional algorithms.
In essence, machine learning empowers systems to learn and adapt, offering valuable
solutions across various domains.

Python
Python stands out as a high-level, versatile, and exceedingly popular programming
language. Widely employed in diverse domains, the latest iteration, Python 3, finds
applications in web development, machine learning, and various cutting-edge technologies
within the software industry. Its adaptability makes it an ideal choice for beginners
entering the programming landscape and proves equally advantageous for seasoned
programmers with expertise in other languages such as C++ and Java. The language's
versatility and broad adoption contribute to its standing as a key player in contemporary
software development.


3.4. DATASET

The sleep efficiency dataset compiles information on a group of test subjects and their
sleep behaviors. Each subject is uniquely identified by a Subject ID and characterized by
age, gender, bedtime, wakeup time, sleep duration, sleep efficiency, REM sleep
percentage, deep sleep percentage, light sleep percentage, awakenings, caffeine and
alcohol consumption, smoking status, and exercise frequency. This dataset serves as a
comprehensive resource for studying the relationships between lifestyle factors and sleep
patterns. It allows researchers to explore how variables such as bedtime habits, substance
consumption, and exercise frequency may influence sleep efficiency, duration, and quality
among the test subjects. Snapshot of part of the dataset is given below:


3.5. ADVANTAGES

Insights into Sleep Patterns: The dataset provides valuable insights into the sleep patterns of the
test subjects, including duration, efficiency, and various sleep stages. This information can be
beneficial for researchers studying sleep science and behavior.

Holistic Understanding: With a diverse set of variables such as caffeine and alcohol
consumption, exercise frequency, and smoking status, the dataset allows for a more holistic
understanding of factors influencing sleep. Researchers can explore the interplay of lifestyle
choices on sleep quality.

Research in Sleep Science: The dataset serves as a valuable resource for research in sleep
science, enabling investigations into the relationships between demographic factors, lifestyle
choices, and sleep outcomes. This can contribute to the development of interventions and
strategies for improving sleep health.

Healthcare Applications: Understanding sleep patterns can have implications for healthcare, as
sleep quality is often linked to overall health. The dataset could be used to identify correlations
between sleep habits and health outcomes.

Technology Enhancement: The dataset may support the enhancement of sleep-tracking
technologies. By analyzing the data, developers can gain insights into the effectiveness of current
technologies and explore opportunities for improvement.

3.6. DISADVANTAGES

Limited Generalization: Findings from the dataset may have limited generalization due to
potential variations in individual sleep behaviors and preferences. Results may not apply
universally to different populations.

Subjective Reporting: Some data, such as caffeine and alcohol consumption, smoking status,
and exercise frequency, relies on self-reporting, which can introduce biases and inaccuracies.
Participants may not always provide accurate information.


Incomplete Context: The dataset may lack information on certain contextual factors that could
impact sleep, such as stress levels, work schedules, or medical conditions. Without a complete
context, the analysis may not capture all relevant influences.

Small Sample Size: If the dataset has a small sample size, it may limit the statistical power and
generalizability of the findings. Larger and more diverse datasets are often preferred for robust
research conclusions.

Ethical Considerations: The collection and use of sensitive data related to sleep habits and
lifestyle choices raise ethical considerations. Ensuring privacy and obtaining informed consent
are crucial aspects that must be addressed.

Data Quality: The accuracy and reliability of the dataset depend on the quality of data collection
methods. Inconsistent or incomplete data may compromise the validity of analyses and findings.

Temporal Variability: Sleep patterns can vary over time, and a single snapshot of data may not
capture the dynamic nature of sleep behaviors. Longitudinal data would provide a more
comprehensive understanding of changes over time.

In utilizing the sleep efficiency dataset, researchers and analysts should carefully consider these
advantages and disadvantages to draw meaningful conclusions and ensure the responsible use of
the data.


CHAPTER - 4

IMPLEMENTATION (CODE)

Import Necessary Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Model building, scaling, and evaluation utilities

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Load Dataset
df = pd.read_csv('/content/Sleep_Efficiency.csv')
df

Data preprocessing

# Dropping the unnecessary ID column
df = df.drop('ID', axis=1)

# Converting categorical data to numerical
# Smoking status
df['Smoking status'] = df['Smoking status'].map({'Yes': 1, 'No': 0})
# Gender
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})


# Converting bedtime and wakeup time to epoch timestamps
df['Bedtime'] = pd.to_datetime(df['Bedtime'], format='%Y-%m-%d %H:%M:%S')
df['Bedtime'] = df['Bedtime'].apply(lambda x: int(x.timestamp()))
df['Wakeup time'] = pd.to_datetime(df['Wakeup time'], format='%Y-%m-%d %H:%M:%S')
df['Wakeup time'] = df['Wakeup time'].apply(lambda x: int(x.timestamp()))
df.head(10)

# Finding null values
print(df.isnull().sum())

# Imputing null values
df['Awakenings'].fillna(df['Awakenings'].min(), inplace=True)
df['Caffeine consumption'].fillna(df['Caffeine consumption'].mean(), inplace=True)
df['Alcohol consumption'].fillna(df['Alcohol consumption'].mean(), inplace=True)
df['Exercise frequency'].fillna(df['Exercise frequency'].mean(), inplace=True)

# Checking duplicates
print(f'Duplicate count = {df.duplicated().sum()}')

# Splitting feature and target variables first
X = df.drop('Sleep efficiency', axis=1)
y = df['Sleep efficiency']

Removing Outliers
# Assuming 'Sleep efficiency' is the target variable, drop it before calculating IQR
features = df.drop('Sleep efficiency', axis=1)
# Calculate the IQR for each feature in the dataframe
Q1 = features.quantile(0.25)
Q3 = features.quantile(0.75)
IQR = Q3 - Q1
# Print the shape of the dataframe before removing the outliers
print("Shape of the dataframe before removing outliers: " + str(df.shape))
# Remove the outliers from the dataframe
df_no_outliers = df[~((features < (Q1 - 1.5 * IQR)) |
                      (features > (Q3 + 1.5 * IQR))).any(axis=1)]
# Print the shape of the dataframe after removing the outliers


print("Shape of the dataframe after removing outliers: " + str(df_no_outliers.shape))

Finding Correlations
corrmat = df_no_outliers.corr()
plt.figure(figsize = (10,10))
sns.heatmap(corrmat, annot = True,cmap='coolwarm')

# As deep sleep and light sleep have a strong negative correlation, light sleep is dropped
df_no_outliers = df_no_outliers.drop('Light sleep percentage', axis=1)

# Dropping columns whose correlation with sleep efficiency lies between -0.25 and 0.25
df_no_outliers = df_no_outliers.drop(['Smoking status', 'Caffeine consumption', 'Age',
                                      'Gender', 'Bedtime', 'Wakeup time',
                                      'Sleep duration', 'Exercise frequency'], axis=1)
df_no_outliers.head(3)
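The same filter can also be expressed programmatically rather than with a hard-coded
column list; a minimal sketch (applied to the frame before the manual drops above, with
the 0.25 threshold from the comment; df_filtered is a hypothetical name):

# Keep only features whose absolute correlation with the target is at least 0.25
corr_with_target = df_no_outliers.corr()['Sleep efficiency'].drop('Sleep efficiency')
weak_features = corr_with_target[corr_with_target.abs() < 0.25].index.tolist()
df_filtered = df_no_outliers.drop(columns=weak_features)  # hypothetical result frame
print(f"Dropped weak features: {weak_features}")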

Scaling Features
# Initialize the scaler
scaler_linear = StandardScaler()
# Compute the mean and standard deviation of the features, then transform them
X = scaler_linear.fit_transform(X)
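Note that fitting the scaler on the full feature matrix before splitting lets test-set
statistics leak into the scaling; a leakage-free sketch would fit the scaler on the
training split only (assuming X and y here are the unscaled features and target):

# Split first, then fit the scaler on the training data alone
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=50)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test = scaler.transform(X_test)        # apply the same transformation to test data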

Visualizing Data
sns.relplot(
    data=df, kind="line",
    x="Age", y="Sleep efficiency", style="Gender", color="black"
)
plt.show()
sns.boxplot(data=df,x="Smoking status",y="Sleep efficiency", color="green")
plt.xlabel("Yes or No", color="green",fontsize=10)
plt.ylabel("Count", color="green",fontsize=10)
plt.title("number of smokers and non-smokers", color="green",fontsize=10)
plt.show()
sns.relplot(
    data=df,
    x="Gender", y="Sleep efficiency", col="Smoking status",
    hue="REM sleep percentage", size="Smoking status",
)
plt.show()
sns.boxplot(data=df,x="Alcohol consumption",y="Sleep efficiency", color="red")
plt.title("What is the effect of drinking alcohol on sleep efficiency?",
color="red",fontsize=10)
plt.show()
sns.boxplot(data=df,x="Caffeine consumption",y="Sleep efficiency", color="green")
plt.title("Does caffeine consumption affect sleep?", color="red",fontsize=10)
plt.show()
sns.pairplot(df)


Classification

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Assuming 'Sleep efficiency' is the target variable


X = df_no_outliers.drop('Sleep efficiency', axis=1)
y = pd.cut(df_no_outliers['Sleep efficiency'],
           bins=[0, 0.33, 0.66, 1], labels=['Low', 'Medium', 'High'])

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Initialize the Random Forest Classifier


clf = RandomForestClassifier(random_state=42)

# Train the classifier on the training data


clf.fit(X_train, y_train)

# Make predictions on the testing data


y_pred = clf.predict(X_test)

# Evaluate the model


accuracy = accuracy_score(y_test, y_pred)

# Print the evaluation metrics


print(f"Accuracy: {accuracy:.2f}")


Regression

# Training the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=50)

# Initialize the class


linear_model = LinearRegression()

# Train the model


linear_model.fit(X_train, y_train)

# Feed the scaled training set and get the predictions


yhat = linear_model.predict(X_train)

# Use scikit-learn's MSE utility, halved to match the conventional 1/(2m) cost form
train_mse = mean_squared_error(y_train, yhat) / 2
train_r2 = r2_score(y_train, yhat)

print(f"Training MSE: {train_mse}")
print(f"Training R2: {train_r2}")
# Test data predictions
yhat_test = linear_model.predict(X_test)

# Same halved-MSE convention on the held-out test set
test_mse = mean_squared_error(y_test, yhat_test) / 2
test_r2 = r2_score(y_test, yhat_test)

print(f"Test MSE: {test_mse}")
print(f"Test R2: {test_r2}")


# Comparing scores in a dataframe
Final_Output = pd.DataFrame(['Linear Regression', train_mse, train_r2,
                             test_mse, test_r2]).transpose()
Final_Output.columns = ['Method', 'Training MSE', 'Training R2', 'Test MSE', 'Test R2']
Final_Output
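A single train/test split can be sensitive to the chosen random seed; a minimal sketch of
k-fold cross-validation as a more robust check (5 folds chosen arbitrarily):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R2 for the linear model on the scaled features
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(f"CV R2: {cv_r2.mean():.3f} +/- {cv_r2.std():.3f}")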


CHAPTER - 5

SNAPSHOTS

These lines import the necessary libraries, including NumPy, pandas, and scikit-learn
modules for various machine learning algorithms.

The data looks like:


In this code snippet, a linear regression model is trained and evaluated using scikit-learn in
Python. The dataset is split into training and testing sets using the train_test_split function. The
linear regression model is initialized, trained on the training set, and then used to make
predictions. Mean Squared Error (MSE) and R-squared (R2) scores are calculated for both the
training and test sets, providing insights into the model's performance. The results are presented
in a Pandas DataFrame named Final_Output, summarizing the method used (Linear Regression),
along with the corresponding MSE and R2 scores for both the training and test datasets. This
structured output facilitates a quick comparison of the model's performance metrics.


In this code snippet, a classification model is built using a Random Forest Classifier to predict
sleep efficiency categories (Low, Medium, High) based on selected features. The dataset is split
into training and testing sets, and the classifier is trained on the former. Subsequently,
predictions are made on the testing set, and the model's performance is evaluated using the
accuracy score, which measures the proportion of correctly predicted categories. The output
provides a concise assessment of the model's accuracy in classifying sleep efficiency, offering
insights into its effectiveness in handling the given data and feature set.


CHAPTER – 6

DECLARATION

I, Ashwin Kumar, a student of 5th semester BE, Information Science and Engineering
department, Bangalore Institute of Technology, Bengaluru, hereby declare that the internship
project work entitled "SLEEP EFFICIENCY DATASET ANALYSIS" has been carried out by me at
Prinston Smart Engineers, Bengaluru and submitted in partial fulfilment of the course requirement
for the award of the degree of Bachelor of Engineering in Information Science and Engineering
of Visvesvaraya Technological University, Belagavi, during the academic year 2023-2024.

I also declare that, to the best of my knowledge and belief, the work reported here does
not form part of any dissertation on the basis of which a degree or award was conferred on an
earlier occasion to any other student.

Place: Bengaluru

Ashwin Kumar

[1BI21IS019]


CHAPTER - 7

CONCLUSION

In conclusion, the exploration of the Sleep Efficiency Dataset from Kaggle through
implementation, classification, and regression analyses offers a nuanced understanding of the
multifaceted factors influencing sleep patterns. Utilizing a Random Forest Classifier, the
classification analysis demonstrates the dataset's predictive potential by accurately categorizing
sleep efficiency into Low, Medium, and High classes. This insight contributes to the broader
understanding of how demographic details and lifestyle choices collectively shape sleep outcomes.
Simultaneously, the regression analysis, employing a Linear Regression model, uncovers
quantitative relationships between predictors and continuous sleep-related metrics. Mean Squared
Error (MSE) and R-squared (R2) scores provide measures of model performance, offering
valuable insights into the variability and generalization capabilities of the regression model.

These analyses collectively highlight the interdisciplinary significance of such datasets, bridging
insights from sleep science, healthcare, and technology. The Random Forest Classifier's accuracy
showcases the model's effectiveness in classifying sleep efficiency categories, providing a
practical tool for predicting sleep outcomes. The Linear Regression model, by quantifying
relationships between specific predictors and sleep-related metrics, contributes to the broader
understanding of the nuanced dynamics of sleep duration and efficiency.

However, it is crucial to approach these findings with consideration of potential biases and ethical
concerns associated with self-reported lifestyle information. Rigorous data collection and
interpretation practices are necessary to ensure the robustness and reliability of the conclusions
drawn from the dataset.

The analyses on the Sleep Efficiency Dataset not only contribute to the scientific understanding of
sleep but also offer practical applications for healthcare professionals, researchers, and developers.
By delving into the complexities of sleep patterns, these analyses provide actionable insights that
can inform interventions and technological advancements aimed at enhancing sleep quality and
overall well-being.


CHAPTER - 8

REFERENCES

 https://www.kaggle.com/code/hexenmeiser/sleep-efficiency-dataset-eda-and-scoring
 https://rstudio-pubs-static.s3.amazonaws.com/1009332_d1d1e3612f374e48ad0dc1e07583ca99.html
 https://medium.com/@larissa.tsuda.s/a-linear-regression-model-to-predict-sleep-efficiency-on-subjects-fac9b94443a5
 https://www.researchgate.net/publication/283734463_ISRUC-Sleep_A_comprehensive_public_dataset_for_sleep_researchers
 https://www.gigasheet.com/sample-data/sleep-health-and-lifestyle-dataset
