Data Science Assignment 2


Predicting the Probability of Someone Visiting a Dansala

Introduction

This document describes the findings of a data science project focused on forecasting dansala participation. The main objective is to predict whether or not someone visited a dansala; for this we used Orange, a user-friendly data analysis and machine learning tool. To collect data we used a Google sheet with the following questions and received 176 responses.

1. Do you have a boyfriend or a girlfriend?
2. Is your partner in this university?
3. Do you like to stay in a queue at a dansala?
4. Is it fine to stay outside in bad weather conditions?
5. Do you regularly attend dansala?
6. Did you attend the last IMSSA events?
7. What time would you prefer to attend a dansala?
8. How many dansalas did you visit last Vesak?
9. Distance from residence?
10. What is your way of traveling?
11. What is the type of food at this dansala?
12. Did you attend or not?

Machine Learning Pipeline

The process of creating a machine learning model is referred to as the "machine learning pipeline". It includes every step of the procedure, from data collection and preparation to model deployment and evaluation.

Here is a basic description of the standard machine learning pipeline:

Data Collection: Collect relevant and representative data on which the model will be trained and tested. Depending on the situation, data collection may rely on hand labelling, APIs, databases, or other sources.

Data Preprocessing: Clean and preprocess the gathered data to guarantee it is in an appropriate format for analysis. This step may include activities such as encoding categorical variables, addressing outliers, normalizing or scaling features, and handling missing values.
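As a rough illustration of this step outside Orange, here is a minimal sketch of categorical encoding and missing-value handling with pandas and scikit-learn; the column names and values are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical survey-style data with a missing value and mixed casing
df = pd.DataFrame({
    "attends_regularly": ["Yes", "no", "Yes", None],
    "distance_km": [1.5, 0.5, None, 2.0],
})

# Fill the numeric column with its mean and the categorical column
# with its most frequent value, after normalizing the casing
df["distance_km"] = df["distance_km"].fillna(df["distance_km"].mean())
df["attends_regularly"] = df["attends_regularly"].str.lower()
df["attends_regularly"] = df["attends_regularly"].fillna(
    df["attends_regularly"].mode()[0])

# Encode the categorical column as numbers for the learning algorithm
encoder = OrdinalEncoder()
df[["attends_regularly"]] = encoder.fit_transform(df[["attends_regularly"]])
print(df)
```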

Feature Engineering: To increase the performance of the model, create or transform features from the raw data. This could entail methods like feature selection, dimensionality reduction, or creating new features using formulas or domain expertise.

Model Selection: Select a machine learning algorithm or model architecture that is
suitable for the current problem. The type of data, the nature of the task (such as
classification or regression), and the computational resources at hand all play a role in
this choice.

Model Training: Utilize the preprocessed data to train the chosen model. This entails
feeding the model the training data and tweaking its parameters to reduce the
discrepancy between expected and actual results. Techniques like gradient descent are
frequently used in the optimisation process.

Model Evaluation: Use evaluation criteria such as accuracy, precision, recall, or mean squared error to rate the trained model's effectiveness. This step assesses how well the model generalizes to new data and whether further improvements are required.
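For illustration, the short sketch below computes accuracy, precision, and recall with scikit-learn; the label vectors are invented purely for demonstration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true labels and model predictions (1 = attended, 0 = did not)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
```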

Model Deployment: When you are happy with the model's performance, put it in a production setting so it can make forecasts on fresh, unseen data. This may entail incorporating the model into a website, a smartphone application, or any other system that needs real-time predictions.

Monitoring and Maintenance: Keep an eye on the performance of the deployed model to make sure it remains accurate and dependable over time. This could entail regularly updating the model with fresh data or refining it to account for evolving needs or trends.

Methodology
First, we input the above CSV file into the Orange software and used the widgets below for prediction.

Data from Google Sheets:

https://docs.google.com/spreadsheets/d/1idG-LqlR-fcO_Op4s27aBvyktuIMsNe6KpgYFAja4SY/edit?usp=sharing

As before, the same type of dansala was recorded in several different ways, and the distances were given in various units. Our group decided to postpone standardizing these values until it became necessary.

The empty cells in that dataset were then filled with random values.

The Select Columns widget in Orange lets us choose specific columns from a dataset while ignoring others. Here we selected 13 columns and ignored one column: the Timestamp column.
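Outside Orange, the same column selection can be sketched with pandas as below; the file name is a hypothetical stand-in for our exported responses.

```python
import pandas as pd

# Load the survey responses (hypothetical file name)
df = pd.read_csv("dansala_survey.csv")

# Keep the remaining 13 columns and ignore the Timestamp column
df = df.drop(columns=["Timestamp"])
print(df.columns.tolist())
```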

We can use the Edit Domain widget to correct data that has been entered incorrectly. Here, we fixed misspelled words, converted lowercase and capitalized entries to the same format, and so on.
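A rough code equivalent of this cleanup is to normalize casing and map misspellings to a canonical value; the example answers and the mapping are assumptions, not our actual survey data.

```python
import pandas as pd

# Hypothetical answers typed with inconsistent casing and spelling
answers = pd.Series(["Yes", "yes", "YES", "noo", "No"])

# Lower-case everything, then map a known misspelling to a canonical form
cleaned = answers.str.strip().str.lower().replace({"noo": "no"})
print(cleaned.tolist())  # ['yes', 'yes', 'yes', 'no', 'no']
```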

We then used the Impute widget, which handles missing values in the dataset. Here, we used the Average/Most frequent value option to replace instances with unknown values.

We also used the Rank widget, which ranks the variables in a dataset based on their importance or relevance.
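As a hedged sketch of both widgets' ideas in scikit-learn, the snippet below imputes missing values with the most frequent value and ranks features by mutual information; the tiny data matrix is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import mutual_info_classif

# Hypothetical encoded answers with missing values (np.nan)
X = np.array([[1.0, 0.0], [np.nan, 1.0], [1.0, 1.0], [0.0, np.nan]])
y = np.array([1, 0, 1, 0])

# Impute missing values with the most frequent value in each column,
# mirroring Orange's "Average/Most frequent" option
X_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)

# Rank the features by their mutual information with the target
scores = mutual_info_classif(X_filled, y, discrete_features=True)
print("Feature scores:", scores)
```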

Using the Data Sampler widget, we divided the collected data into training data and testing data: 80% of the dataset is training data, and the remaining 20% is testing data.
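A minimal sketch of the same 80/20 split with scikit-learn's train_test_split; the feature matrix and labels are placeholders standing in for the 176 responses.

```python
from sklearn.model_selection import train_test_split

# Placeholder features and labels standing in for the 176 survey responses
X = [[i] for i in range(176)]
y = [i % 2 for i in range(176)]

# 80% of the data for training, the remaining 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(len(X_train), "training rows,", len(X_test), "testing rows")
```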

We used cross-validation for accuracy estimation instead of random sampling. Cross-validation is a technique used in machine learning to evaluate the performance of a model on unseen data.

The three steps in cross-validation are:

1. Reserve a portion of the sample dataset.
2. Train the model using the rest of the dataset.
3. Test the model using the reserved portion of the dataset.

It divides the available data into multiple folds or subsets, using one of these folds as a
validation set, and training the model on the remaining folds. This process is repeated
multiple times, each time using a different fold as the validation set. Finally, the results
from each validation step are averaged to produce a more robust estimate of the
model’s performance.

The main purpose of cross-validation is to prevent overfitting, which occurs when a model fits the training data too closely and performs poorly on new, unseen data. Cross-validation improves the model's ability to perform well on new, unseen data.

There are several types of cross-validation techniques. We used stratified cross-validation with 10 as the number of folds, as sketched below.
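A minimal sketch of stratified 10-fold cross-validation with scikit-learn, assuming a logistic regression learner and synthetic data in place of the real survey features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the survey data (176 rows, a handful of features)
X, y = make_classification(n_samples=176, n_features=8, random_state=0)

# Stratified 10-fold cross-validation keeps the class ratio in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("Mean CV accuracy:", scores.mean())
```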

Another method is the bootstrap. Bootstrapping is a resampling technique that helps in estimating the uncertainty of a statistical model. It involves sampling the original dataset with replacement and generating multiple new datasets of the same size as the original.

Bootstrap is used in machine learning to estimate the accuracy of a model, validate its
performance, and identify areas that need improvement.
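The sketch below illustrates the bootstrap idea on synthetic data: resample with replacement, refit the model, and look at how much the accuracy estimate varies. Evaluating each refit on the full dataset is a simplification chosen only for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=176, n_features=8, random_state=0)
rng = np.random.RandomState(0)

accuracies = []
for i in range(100):
    # Draw a bootstrap sample of the same size as the original dataset
    X_boot, y_boot = resample(X, y, replace=True, random_state=rng)
    model = LogisticRegression(max_iter=1000).fit(X_boot, y_boot)
    # Score the refit model on the full dataset (rough choice for illustration)
    accuracies.append(model.score(X, y))

print("Accuracy estimate: %.3f +/- %.3f" % (np.mean(accuracies), np.std(accuracies)))
```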

For the prediction part, we used the data sample (the training data) and the remaining data (the testing data). Below is the predicted model; its prediction accuracy is nearly 85.4%, its precision is 83.9%, and the top value is 79.1%.

If the AUC value is close to 100%, we can say that our model is good.
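As a sketch (not the exact Orange workflow), the held-out 20% can be scored as follows; the data is synthetic and the metric values will differ from ours.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data, split 80/20 as in our workflow
X, y = make_classification(n_samples=176, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))
```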

Following are the confusion matrices we can use to see each model's prediction errors. We used logistic regression, KNN, neural network, and constant models; a code sketch for computing such a matrix follows the list below.

● The confusion matrix obtained from the KNN model

● The confusion matrix obtained from the logistic regression model

● The confusion matrix obtained from the constant model

● The confusion matrix obtained from the neural network model
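A confusion matrix like the ones above can be reproduced in code as in the sketch below; the test labels and predictions are placeholders.

```python
from sklearn.metrics import confusion_matrix

# Placeholder test labels and predictions (1 = attended, 0 = did not)
y_test = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))
```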

Conclusion

Below is the output we obtained from the Test and Score widget.

Here, CA means the classification accuracy of each model:

KNN Model - 89.2%
Logistic Regression Model - 90.4%
Constant Model - 83.4%
Neural Network Model - 89.2%

From the above, the logistic regression model has the highest accuracy, 90.4%.

AUC values:

KNN Model - 86.2%
Logistic Regression Model - 88.8%
Constant Model - 45.1%
Neural Network Model - 82.7%

The logistic regression model also has the highest AUC value. Therefore, it is the best model we can use for this prediction.
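As a hedged, code-level analogue of the Test and Score comparison, the sketch below cross-validates the same four model types (KNN, logistic regression, a constant majority-class baseline, and a small neural network) on synthetic stand-in data and reports CA and AUC; the numbers it prints will not match the values above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 176 survey responses
X, y = make_classification(n_samples=176, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Constant": DummyClassifier(strategy="most_frequent"),
    "Neural Network": MLPClassifier(max_iter=2000, random_state=0),
}

# Report mean classification accuracy (CA) and AUC over the 10 folds
for name, model in models.items():
    res = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "roc_auc"])
    print("%-20s CA=%.3f  AUC=%.3f" % (
        name, res["test_accuracy"].mean(), res["test_roc_auc"].mean()))
```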

Group members:

IM/2020/049 - Gimhan Perera
IM/2020/071 - Umesha Silva
IM/2020/079 - Thanusha Withana
IM/2020/097 - Vishmi Ramesha
IM/2020/038 - Pramodya Nethmini

