Data Science Assignment 2


Predicting the Probability of Someone Visiting a Dansala

Introduction

This document describes the findings of a data science project focused on forecasting dansala participation. The main objective is to predict whether or not someone visited a dansala; for this we used Orange, a user-friendly data analysis and machine learning tool. To collect data we used a Google sheet with the following questions and received 176 responses.

1. Do you have a boyfriend or a girlfriend?
2. Is your partner in this university?
3. Do you like to stay in a queue at a dansala?
4. Is it fine to stay outside in bad weather conditions?
5. Do you regularly attend dansala?
6. Did you attend the last IMSSA events?
7. What time would you prefer to attend a dansala?
8. How many dansalas did you visit last Vesak?
9. Distance from residence?
10. What is your way of traveling?
11. What is the type of food at this dansala?
12. Did you attend or not?

Machine Learning Pipeline

The process of creating a machine learning model is referred to as the "machine learning pipeline". It includes every step of the procedure, from data collection and preparation to model deployment and evaluation.

Here is a basic description of the standard machine learning pipeline:

Data Collection: Collect relevant and representative data on which the model will be trained and tested. Depending on the situation, data collection may rely on hand labelling, APIs, databases, or other sources.

Data Preprocessing: Clean and preprocess the gathered data to guarantee it is in an appropriate format for analysis. This step may include activities such as encoding categorical variables, addressing outliers, normalizing or scaling features, and handling missing values.
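As a rough illustration of this step outside Orange, here is a minimal sketch of categorical encoding and missing-value handling with pandas and scikit-learn; the column names and values are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical survey-style data with a missing value and mixed casing
df = pd.DataFrame({
    "attends_regularly": ["Yes", "no", "Yes", None],
    "distance_km": [1.5, 0.5, None, 2.0],
})

# Fill the numeric column with its mean and the categorical column
# with its most frequent value, after normalizing the casing
df["distance_km"] = df["distance_km"].fillna(df["distance_km"].mean())
df["attends_regularly"] = df["attends_regularly"].str.lower()
df["attends_regularly"] = df["attends_regularly"].fillna(
    df["attends_regularly"].mode()[0])

# Encode the categorical column as numbers for the learning algorithm
encoder = OrdinalEncoder()
df[["attends_regularly"]] = encoder.fit_transform(df[["attends_regularly"]])
print(df)
```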

Feature Engineering: To increase the performance of the model, create or transform features from the raw data. This could entail methods like feature selection, dimensionality reduction, or creating new features using formulas or domain expertise.

Model Selection: Select a machine learning algorithm or model architecture that is
suitable for the current problem. The type of data, the nature of the task (such as
classification or regression), and the computational resources at hand all play a role in
this choice.

Model Training: Utilize the preprocessed data to train the chosen model. This entails
feeding the model the training data and tweaking its parameters to reduce the
discrepancy between expected and actual results. Techniques like gradient descent are
frequently used in the optimisation process.

Model Evaluation: Use evaluation criteria such as accuracy, precision, recall, or mean squared error to rate the trained model's effectiveness. This step assesses how well the model generalizes to new data and whether further improvements are required.
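For illustration, the short sketch below computes accuracy, precision, and recall with scikit-learn; the label vectors are invented purely for demonstration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true labels and model predictions (1 = attended, 0 = did not)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
```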

Model Deployment: When you are happy with the model's performance, put it in a production setting so it can make forecasts on fresh, unseen data. This may entail incorporating the model into a website, a smartphone application, or any other system that needs real-time predictions.

Monitoring and Maintenance: Keep an eye on the performance of the deployed model to make sure it remains accurate and dependable over time. This could entail regularly updating the model with fresh data or refining it to account for evolving needs or trends.

Methodology
First, we input the above CSV file into the Orange software and used the widgets below for prediction.

Data from Google Sheets:

https://docs.google.com/spreadsheets/d/1idG-LqlR-fcO_Op4s27aBvyktuIMsNe6KpgYFAja4SY/edit?usp=sharing

As before, the same type of dansala was recorded in several different ways, and the distances were given in various units. Our group decided to postpone standardizing these values until it became necessary.

The empty cells in that dataset were then filled with random values.

The Select Columns widget in Orange lets us choose specific columns from a dataset while ignoring others. Here we selected 13 columns and ignored one column: the Timestamp column.
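Outside Orange, the same column selection can be sketched with pandas as below; the file name is a hypothetical stand-in for our exported responses.

```python
import pandas as pd

# Load the survey responses (hypothetical file name)
df = pd.read_csv("dansala_survey.csv")

# Keep the remaining 13 columns and ignore the Timestamp column
df = df.drop(columns=["Timestamp"])
print(df.columns.tolist())
```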

We can use the Edit Domain widget to correct data that has been entered incorrectly. Here, we fixed misspelled words, converted lowercase and capitalized entries to the same format, and so on.
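A rough code equivalent of this cleanup is to normalize casing and map misspellings to a canonical value; the example answers and the mapping are assumptions, not our actual survey data.

```python
import pandas as pd

# Hypothetical answers typed with inconsistent casing and spelling
answers = pd.Series(["Yes", "yes", "YES", "noo", "No"])

# Lower-case everything, then map a known misspelling to a canonical form
cleaned = answers.str.strip().str.lower().replace({"noo": "no"})
print(cleaned.tolist())  # ['yes', 'yes', 'yes', 'no', 'no']
```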

We then used the Impute widget, which handles missing values in the dataset. Here, we used the Average/Most frequent value option to replace instances with unknown values.

We also used the Rank widget, which ranks the variables in a dataset based on their importance or relevance.
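As a hedged sketch of both widgets' ideas in scikit-learn, the snippet below imputes missing values with the most frequent value and ranks features by mutual information; the tiny data matrix is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import mutual_info_classif

# Hypothetical encoded answers with missing values (np.nan)
X = np.array([[1.0, 0.0], [np.nan, 1.0], [1.0, 1.0], [0.0, np.nan]])
y = np.array([1, 0, 1, 0])

# Impute missing values with the most frequent value in each column,
# mirroring Orange's "Average/Most frequent" option
X_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)

# Rank the features by their mutual information with the target
scores = mutual_info_classif(X_filled, y, discrete_features=True)
print("Feature scores:", scores)
```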

Using the Data Sampler widget, we divided the collected data into training data and testing data: 80% of the dataset is training data, and the remaining 20% is testing data.
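A minimal sketch of the same 80/20 split with scikit-learn's train_test_split; the feature matrix and labels are placeholders standing in for the 176 responses.

```python
from sklearn.model_selection import train_test_split

# Placeholder features and labels standing in for the 176 survey responses
X = [[i] for i in range(176)]
y = [i % 2 for i in range(176)]

# 80% of the data for training, the remaining 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(len(X_train), "training rows,", len(X_test), "testing rows")
```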

We used cross-validation for accuracy estimation instead of random sampling. Cross-validation is a technique used in machine learning to evaluate the performance of a model on unseen data.

The three steps in cross-validation are:

1. Reserve a portion of the sample dataset.
2. Train the model using the rest of the dataset.
3. Test the model using the reserved portion of the dataset.

It divides the available data into multiple folds or subsets, using one of these folds as a
validation set, and training the model on the remaining folds. This process is repeated
multiple times, each time using a different fold as the validation set. Finally, the results
from each validation step are averaged to produce a more robust estimate of the
model’s performance.

The main purpose of cross-validation is to prevent overfitting, which occurs when a model fits the training data too closely and performs poorly on new, unseen data. Cross-validation improves the model's ability to perform well on new, unseen data.

There are several types of cross-validation techniques. We used stratified cross-validation with 10 as the number of folds, as sketched below.
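A minimal sketch of stratified 10-fold cross-validation with scikit-learn, assuming a logistic regression learner and synthetic data in place of the real survey features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the survey data (176 rows, a handful of features)
X, y = make_classification(n_samples=176, n_features=8, random_state=0)

# Stratified 10-fold cross-validation keeps the class ratio in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("Mean CV accuracy:", scores.mean())
```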

Another method is the bootstrap. Bootstrapping is a resampling technique that helps in estimating the uncertainty of a statistical model. It involves sampling the original dataset with replacement and generating multiple new datasets of the same size as the original.

Bootstrap is used in machine learning to estimate the accuracy of a model, validate its
performance, and identify areas that need improvement.
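The sketch below illustrates the bootstrap idea on synthetic data: resample with replacement, refit the model, and look at how much the accuracy estimate varies. Evaluating each refit on the full dataset is a simplification chosen only for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=176, n_features=8, random_state=0)
rng = np.random.RandomState(0)

accuracies = []
for i in range(100):
    # Draw a bootstrap sample of the same size as the original dataset
    X_boot, y_boot = resample(X, y, replace=True, random_state=rng)
    model = LogisticRegression(max_iter=1000).fit(X_boot, y_boot)
    # Score the refit model on the full dataset (rough choice for illustration)
    accuracies.append(model.score(X, y))

print("Accuracy estimate: %.3f +/- %.3f" % (np.mean(accuracies), np.std(accuracies)))
```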

For the prediction part, we used the data sample (the training data) and the remaining data (the testing data). Below is the predicted model; its prediction accuracy is nearly 85.4%, its precision is 83.9%, and the top value is 79.1%.

If the AUC value is close to 100%, we can say that our model is good.
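As a sketch (not the exact Orange workflow), the held-out 20% can be scored as follows; the data is synthetic and the metric values will differ from ours.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data, split 80/20 as in our workflow
X, y = make_classification(n_samples=176, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))
```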

Following are the confusion matrices we can use to see each model's prediction errors. We used logistic regression, KNN, neural network, and constant models; a code sketch for computing such a matrix follows the list below.

● The confusion matrix obtained from the KNN model

● The confusion matrix obtained from the logistic regression model

● The confusion matrix obtained from the constant model

● The confusion matrix obtained from the neural network model
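A confusion matrix like the ones above can be reproduced in code as in the sketch below; the test labels and predictions are placeholders.

```python
from sklearn.metrics import confusion_matrix

# Placeholder test labels and predictions (1 = attended, 0 = did not)
y_test = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))
```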

Conclusion

Below is the output we obtained from the Test and Score widget.

Here, CA means the classification accuracy of each model:

KNN Model - 89.2%
Logistic Regression Model - 90.4%
Constant Model - 83.4%
Neural Network Model - 89.2%

From the above, the logistic regression model has the highest accuracy, 90.4%.

AUC values:

KNN Model - 86.2%
Logistic Regression Model - 88.8%
Constant Model - 45.1%
Neural Network Model - 82.7%

The logistic regression model also has the highest AUC value. Therefore, it is the best model we can use for this prediction.
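As a hedged, code-level analogue of the Test and Score comparison, the sketch below cross-validates the same four model types (KNN, logistic regression, a constant majority-class baseline, and a small neural network) on synthetic stand-in data and reports CA and AUC; the numbers it prints will not match the values above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 176 survey responses
X, y = make_classification(n_samples=176, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Constant": DummyClassifier(strategy="most_frequent"),
    "Neural Network": MLPClassifier(max_iter=2000, random_state=0),
}

# Report mean classification accuracy (CA) and AUC over the 10 folds
for name, model in models.items():
    res = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "roc_auc"])
    print("%-20s CA=%.3f  AUC=%.3f" % (
        name, res["test_accuracy"].mean(), res["test_roc_auc"].mean()))
```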

Group members:

IM/2020/049 - Gimhan Perera
IM/2020/071 - Umesha Silva
IM/2020/079 - Thanusha Withana
IM/2020/097 - Vishmi Ramesha
IM/2020/038 - Pramodya Nethmini

