CP1407 Practials1to Practials 5

The document outlines a practical guide on Data Science and Machine Learning, covering fundamentals, WEKA introduction, clustering methods, data preprocessing, and classification techniques. It includes detailed explanations of hierarchical and non-hierarchical clustering methods, as well as practical exercises and laboratory questions. Additionally, it emphasizes the skills required for data scientists and provides references for further learning.

Uploaded by

tzhang2024


Table of Contents
1. Practical 1: Fundamentals of Data Science and Machine Learning
1.1 Introduction to Data Science
1.2 Overview of Machine Learning
1.3 Case Studies and Discussion
2. Practical 2: Introduction to WEKA
2.1 Launching and Starting WEKA
2.2 Data Preprocessing in WEKA
2.3 Attribute Selection and Filtering
3. Practical 3: Clustering Methods Review
3.1 Hierarchical Clustering
3.1.1 Single Linkage Method
3.1.2 Complete Linkage Method
3.2 Non-Hierarchical Clustering (K-Means)
3.2.1 Algorithm Overview
3.2.2 Experimental Results and Discussion
4. Practical 4: Data Preprocessing Review
4.1 Classification of Descriptive Attributes
4.2 Data Discretization Techniques
4.3 Handling Missing Values
4.4 Case Analysis
5. Practical 5: Fundamentals of Classification
5.1 Basics of Classification vs. Regression
5.2 Confusion Matrix and Evaluation Metrics
5.3 1R Algorithm Practice
5.4 Overfitting and Prevention Measures
Practical 1: Fundamentals of Data Science and Machine Learning

1. Data Science is a process of extracting knowledge and utilizing data. It involves collecting, analyzing, and interpreting data to uncover patterns, gain insights, and make informed decisions.
Machine Learning refers to a broad class of methods that revolve around data modeling to algorithmically make predictions and decipher patterns. It also involves developing algorithms that allow computers to automatically learn from data, and perform tasks or take actions based on that learning.
The differences:

DATA SCIENCE | MACHINE LEARNING
Data science has a broad focus that encompasses various techniques for extracting insights and meaning from data, e.g. statistical analysis and data visualization. | Machine learning primarily focuses on building algorithms that enable computers to learn from data and make predictions.
Data science encompasses a wider range of expertise and applications, e.g. data visualization, data engineering, and statistical analysis for business insights across domains. | Machine learning expertise lies in designing and fine-tuning algorithms for specific tasks, e.g. image recognition, natural language processing, or predictive analytics.
References:
https://learning.linkedin.com/resources/learning-tech/machine-learning-vs-data-science
https://www.coursera.org/articles/data-science-vs-machine-learning

2. (a) Data query, because the answer can be extracted directly from the dataset, such as the number of males and females in the population.

(b) Machine learning, because this needs predictive modelling, which identifies patterns in previous data to make predictions.

(c) Data query, because it involves retrieving and comparing existing data on the golf activities of married men and single men. No prediction or pattern discovery is required.

(d) Machine learning, because determining characteristics requires pattern discovery, which is a machine learning task.

(e) Machine learning, because we would need to determine the pattern in the data and classify each credit card transaction as fraudulent or not.

3. Data scientists work at the raw database level to derive insights and build
data products while analysts may interact with data at both the database
level or the summarized report level.

Reference:
Provost, F., & Fawcett, T. (2013). Data Science for Business: What You Need to Know about Data-Driven Decision Making. O'Reilly Media.
Davenport, T. H., & Patil, D. J. (2012). Data scientist: The sexiest job of the 21st century. Harvard Business Review, 90(10), 70-76.
McKinsey Global Institute. (2011). Big data: The next frontier for innovation, competition, and productivity.
4. A data scientist needs mathematics expertise, technology and hacking skills, and business/strategy acumen. However, the skill most data scientists will find most useful is programming, for example in Python or R, which can be used to sort, analyze, and manage large amounts of "big data". To improve your skills, you can sign up for an online course or get involved in communities. Furthermore, the image by Tayo (2019) shows 10 essential skills for a data scientist.

Reference

Staff, C. (2024, April 5). 7 skills every data scientist should have. Coursera. https://www.coursera.org/articles/data-scientist-skills

Tayo, O. (2019). Data Science Minimum: 10 Essential Skills You Need to Know to Start Doing Data Science [Online image]. Towards Data Science. https://towardsdatascience.com/data-science-minimum-10-essential-skills-you-need-to-know-to-start-doing-data-science-e5a5a9be5991

5. (a) Classification (supervised). The data already has labels, so a classification model can be trained to predict what type of property a new customer might buy.
It can be used for prediction: once trained, the model can predict property preferences for new customers.

(b) Pattern discovery (unsupervised). There are no predefined labels. The goal is to discover hidden patterns, such as whether married customers with kids prefer houses with garages.
It cannot be used directly for prediction. The patterns found can help with future predictions but cannot directly make decisions.

(c) Classification (supervised). The data includes algae population records, so a model can be trained to predict algae levels based on chemical data.
It can be used for prediction: the trained model can predict algae growth in new water samples.

Reference
GeeksforGeeks. (n.d.). Real-life examples of supervised learning and unsupervised learning. Retrieved February 4, 2025, from https://www.geeksforgeeks.org/real-life-examples-of-supervised-learning-and-unsupervised-learning/

IBM. (n.d.). Supervised vs. unsupervised learning. Retrieved February 4, 2025, from https://www.ibm.com/think/topics/supervised-vs-unsupervised-learning?utm_source=chatgpt.com

Laboratory Questions (for these questions, you guys can answer both on your own!)

1. Top tools: Python, Anaconda, scikit-learn, TensorFlow, Keras, and Apache Spark
Supporting tools: SQL, Excel, and Tableau

Reference

Piatetsky, G. (2018, June 6). The 6 components of Open-Source Data Science/Machine Learning Ecosystem; Did Python declare victory over R? KDnuggets. https://www.kdnuggets.com/2018/06/ecosystem-data-science-python-victory.html

2.
- What problem do you target to solve using this data set?
The problem I aim to solve as a data scientist is to predict whether or not a person has heart disease, based on the medical attributes given as input.
- What part of the data would be used as input for your model?
The data used as input for my model are age, chest pain type, resting blood pressure, maximum heart rate achieved, and exercise-induced angina.
- What machine learning methods (e.g., classification, clustering) can be applied?
Classification methods such as Logistic Regression, Support Vector Machine, Decision Trees, and Artificial Neural Networks can be used to sort data into categories (Keita, 2024).
- What would be the output of this analysis?
The output of this analysis would be whether or not a person has heart disease (yes or no).

Reference:

Keita, Z. (2024, August 8). Classification in Machine Learning: An Introduction. DataCamp. https://www.datacamp.com/blog/classification-machine-learning
Practical 2: Introduction to WEKA
1. Launching and Starting WEKA

2. Data Pre-Processing using WEKA

Loading the Data
Selecting or Filtering Attributes
Discretization
Practical 3: Clustering Methods Review

Practical 3: Clustering Review – K-Means & Hierarchical Clustering

1. Introduction
The purpose of this practical is to gain a deeper understanding of two
clustering algorithms—hierarchical clustering and non-hierarchical clustering
(K-Means). Through manual calculations and graphical illustrations, we explore
the working principles of these methods using an example dataset. In
addition, we conduct experiments with WEKA on a real dataset
(bank_data.csv) to observe how different linkage methods and distance
metrics affect clustering outcomes.
2. Hierarchical Clustering
2.1 Overview
For hierarchical clustering, we consider eight data points (P1 through P8) with
two dimensions (x and y). The algorithm employs the Manhattan (City-block)
distance and the single-linkage method. The clustering process is performed
by gradually increasing the distance threshold (starting at 0 and incrementing
by 1 at each level) and merging clusters whose minimum pairwise distance
falls within the current threshold.
The Manhattan distance between two points A = (a_1, ..., a_n) and B = (b_1, ..., b_n) is given by:
$d(A, B) = \sum_{i=1}^{n} |a_i - b_i|$
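As a minimal sketch of the formula above, the function below sums the absolute coordinate differences. The function name `manhattan_distance` is illustrative only; the two sample points are the coordinates this practical quotes for P5 and P7 in the K-Means section.

```python
def manhattan_distance(a, b):
    """Return the Manhattan (city-block) distance between two points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Sum of absolute differences, one term per dimension.
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


p5 = (5, 2)  # coordinates quoted for P5 in the K-Means section
p7 = (1, 2)  # coordinates quoted for P7 in the K-Means section
print(manhattan_distance(p5, p7))  # 4
```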
2.2 Clustering Steps
1. Step 1: Initial State
Each data point is treated as an individual cluster.
Clusters: {P1}, {P2}, {P3}, {P4}, {P5}, {P6}, {P7}, {P8}
2. Step 2 (Distance Level 1):
The minimum Manhattan distance is 1, occurring between P1 & P7 and
between P2 & P4.
o Merge P2 and P4 to form the cluster {P2, P4}
o (The merge for P1 and P7 is processed similarly.)
New clusters: {P1, P7}, {P2, P4}, {P3}, {P5}, {P6}, {P8}
3. Step 3 (Distance Level 2):
The next minimum distance is 2. For example, the distance between P3
and the cluster {P2, P4} is 2.
o Merge P3 into {P2, P4} to form {P2, P3, P4}
o Similarly, merge other points based on the single-linkage
strategy.
4. Step 4 (Distance Level 3):
At distance 3, further merges occur; for instance, P1 and P6 may be
merged, resulting in a cluster {P1, P6}.
5. Step 5 (Distance Level 5 – skipping level 4):
Finally, clusters such as {P1, P5, P6, P7, P8} merge with {P2, P3, P4} at
distance 5, resulting in a single cluster containing all points.
2.3 Dendrogram for the First Six Data Points
For a simplified example, consider only the first six points (P1 to P6) with the
following Manhattan distance matrix:
      P1   P2   P3   P4   P5   P6
P1     0   11    8   10    5    3
P2          0    3    1    6    8
P3               0    2    5    5
P4                    0    5    7
P5                         0    4
P6                              0
Merging Process:
• Step 1 (Distance = 1):
Merge P2 and P4 into cluster A = {P2, P4}.
• Step 2 (Distance = 2):
The minimum distance between cluster A and P3 is 2 (since min(P2-P3, P4-P3) = min(3, 2) = 2).
Merge to form cluster B = {P2, P3, P4}.
• Step 3 (Distance = 3):
Merge P1 and P6 (distance = 3) into cluster C = {P1, P6}.
• Step 4 (Distance = 4):
Merge P5 into cluster C (minimum distance between P5 and points in C is 4), forming cluster D = {P1, P5, P6}.
• Step 5 (Distance = 5):
Finally, merge clusters B and D (minimum inter-cluster distance = 5) to obtain a single cluster containing all six points.
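The merging process above can be checked with a short single-linkage sketch. It uses the pairwise distances from the six-point matrix; the helper names (`point_distance`, `single_link`, `single_linkage`) are illustrative, and the code merges the closest pair of clusters each round rather than stepping a threshold, which gives the same merge order for this data.

```python
from itertools import combinations

# Manhattan distances for P1..P6, taken from the distance matrix above.
DIST = {
    ("P1", "P2"): 11, ("P1", "P3"): 8, ("P1", "P4"): 10,
    ("P1", "P5"): 5, ("P1", "P6"): 3,
    ("P2", "P3"): 3, ("P2", "P4"): 1, ("P2", "P5"): 6, ("P2", "P6"): 8,
    ("P3", "P4"): 2, ("P3", "P5"): 5, ("P3", "P6"): 5,
    ("P4", "P5"): 5, ("P4", "P6"): 7,
    ("P5", "P6"): 4,
}


def point_distance(a, b):
    """Look up the distance between two points in either order."""
    return DIST[(a, b)] if (a, b) in DIST else DIST[(b, a)]


def single_link(c1, c2):
    """Single linkage: the smallest distance between any two members."""
    return min(point_distance(a, b) for a in c1 for b in c2)


def single_linkage(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: single_link(*pair))
        d = single_link(c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((d, sorted(c1 | c2)))
    return merges


merges = single_linkage(["P1", "P2", "P3", "P4", "P5", "P6"])
for d, members in merges:
    print(d, members)
```

The printed history lists one merge per line, with the merge distance followed by the members of the newly formed cluster.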
Textual Representation of the Dendrogram:
             _______5_______
            /               \
    {P2, P3, P4}       {P1, P5, P6}
(P2 & P4 merged at 1,  (P1 & P6 merged at 3,
 then with P3 at 2)     then with P5 at 4)
Note:
• The horizontal lines represent the merging distance thresholds.
• The left branch shows how P2 and P4 merge at distance 1 and then combine with P3 at distance 2, while the right branch shows the merge of P1 and P6 at distance 3 followed by the addition of P5 at distance 4.
• At distance 5, the two branches merge into one final cluster.
3. Non-Hierarchical Clustering (K-Means)
3.1 Overview
Using the same dataset (P1 through P8), the K-Means algorithm aims to
partition the data into two clusters by minimizing the intra-cluster variance.
The process is iterative and includes the following steps:
1. Step 1: Initial Selection
o Two initial cluster centers are selected arbitrarily.
o In this example, P5 (coordinates: 5,2) and P7 (coordinates: 1,2)
are chosen as the initial centers.
2. Step 2: Distance Calculation and Assignment
o Compute the Euclidean distance between each data point and
the two cluster centers.
o Assign each data point to the nearest center based on the
smallest distance.
3. Step 3: Recalculate Cluster Centers
o For each cluster, calculate the new center by taking the mean of
the coordinates of all points in that cluster.
o For example, if cluster C1 consists of points {P1, P5, P6, P7, P8}, the new center is computed as:
• $x_{center} = \frac{x_{P1} + x_{P5} + x_{P6} + x_{P7} + x_{P8}}{5}$
• $y_{center} = \frac{y_{P1} + y_{P5} + y_{P6} + y_{P7} + y_{P8}}{5}$
4. Step 4: Iteration
o Re-calculate the distances from each data point to the new
centers.
o Reassign data points if necessary and update the cluster centers
until no point changes its cluster membership.
5. Final Result:
o In this case, the final clusters are:
• C1 = {P1, P5, P6, P7, P8}
• C2 = {P2, P3, P4}
Note: The initial choice of centers might vary, but the algorithm generally
converges to a similar final clustering when the clusters are well separated.
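The four steps above can be sketched in plain Python as follows. The full P1 to P8 coordinates are not listed in this text, so the six points and the two starting centers below are hypothetical values chosen only to show the assign-and-update loop; the function names are also illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_centers, max_iter=100):
    """Alternate assignment and center update until assignments stop changing."""
    centers = list(initial_centers)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # Step 4: no point changed cluster, so stop.
        assignment = new_assignment
        # Step 3: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centers[k] = mean_point(members)
    return assignment, centers


# Hypothetical data: two visually separate groups of three points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
initial_centers = [(1, 1), (2, 1)]  # deliberately poor starting centers
assignment, centers = k_means(points, initial_centers)
print(assignment)  # [0, 0, 0, 1, 1, 1]
print(centers)
```

Even with both starting centers placed in the same group, the loop here settles on the two separated groups after a few iterations, which is the behaviour the note above describes for well-separated clusters.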
4. Practice on Clustering
As part of the practical exercise, you are required to draw the dendrogram for the first six data points (P1, P2, P3, P4, P5, P6) using the hierarchical clustering algorithm described above. You can:
• Hand-draw the dendrogram based on the merging sequence provided.
• Use software tools (e.g., Excel, R, or Python’s matplotlib) to generate the dendrogram.
• Include the image in your final document, or present a textual version as shown above.
5. Laboratory Questions (WEKA Experiment)
5.1 Experimental Steps
1. Hierarchical Clustering in WEKA:
o Open the bank_data.csv file in WEKA.
o Run hierarchical clustering experiments using both single-
linkage and complete-linkage methods.
o Compare the dendrograms produced by the two methods. Note
that:
• Single Linkage tends to form “chain-like” clusters where points merge sequentially.
• Complete Linkage typically produces more compact clusters with a smaller maximum intra-cluster distance.
2. K-Means Clustering in WEKA:
o Using the same dataset (bank_data.csv), perform K-Means
clustering experiments.
o Run the clustering algorithm using both Euclidean and
Manhattan distance metrics.
o Compare the results to observe how different distance metrics
affect the assignment of data points to clusters.
5.2 Discussion
• Hierarchical Clustering:
o The dendrograms can illustrate different clustering structures depending on the linkage method. Compare the level at which clusters merge and the overall shape of the dendrogram.
• K-Means Clustering:
o Different distance metrics (Euclidean vs. Manhattan) may lead to different cluster centroids and cluster assignments. Discuss the differences in the clusters formed and provide possible reasons based on the underlying distance calculations.
• Documentation:
o Include screenshots of the dendrograms and clustering results from WEKA.
o Provide a detailed discussion on the impact of different methods and distance metrics on the clustering outcomes.
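To illustrate why the K-Means discussion point about distance metrics matters, the sketch below shows one point that is assigned to different centers under Euclidean and Manhattan distance. The point and the two centers are hypothetical values picked to make the difference visible; they are not taken from bank_data.csv.

```python
import math


def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def manhattan(a, b):
    """City-block distance."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def nearest(point, centers, dist):
    """Index of the center closest to the point under the given metric."""
    return min(range(len(centers)), key=lambda k: dist(point, centers[k]))


point = (0, 0)
centers = [(3, 3), (5, 0)]  # hypothetical cluster centers

# Euclidean: about 4.24 to center 0 versus 5.0 to center 1.
print(nearest(point, centers, euclidean))  # 0
# Manhattan: 6 to center 0 versus 5 to center 1.
print(nearest(point, centers, manhattan))  # 1
```

A diagonal offset is penalised more heavily by Manhattan distance than by Euclidean distance, so points lying diagonally from a center are the ones most likely to switch clusters when the metric changes.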
6. Conclusion
This practical exercise has provided insight into two fundamental clustering techniques:
• Hierarchical Clustering:
o Generates a nested set of clusters (dendrogram) where each level represents a different granularity.
o The choice of linkage method (single vs. complete) has a significant impact on the clustering structure.
• K-Means Clustering:
o Iteratively partitions the data into a predetermined number of clusters by optimizing cluster centroids.
o The algorithm is sensitive to the initial choice of centers and the distance metric used (Euclidean vs. Manhattan).
Overall, by comparing the theoretical computations and the WEKA
experimental results, we gain a better understanding of how various
clustering methods can be applied and the importance of selecting
appropriate parameters and metrics for the data at hand.
Practical 4: Data Preprocessing Review

Practical 4: Data Pre-Processing Review Answers

Self-Review Questions
Question 1
(a) Classify Each Descriptive Attribute
• Policy Holder ID:
o Type: Nominal data
o Explanation: Although this is numeric in appearance, it functions as a unique identifier with no inherent arithmetic meaning.
• Occupation:
o Type: Nominal data
o Explanation: Occupation is drawn from a fixed set of categories (e.g., “Professional,” “Managerial,” etc.) without any intrinsic order.
• Gender:
o Type: Nominal data
o Example: {male, female}
• Age:
o Type: Numeric data
o Explanation: Age is measured on a continuous (or discrete) scale.
• Value of Car:
o Type: Numeric data
o Explanation: This represents a monetary amount and can be used for quantitative analysis.
• Type of Insurance Policy:
o Type: Nominal data
o Explanation: The policy types are selected from a set of predefined categories (e.g., “Comprehensive,” “Third Party,” etc.).
• Preferred Contact Channel:
o Type: Nominal data
o Explanation: This attribute represents a choice among fixed options (e.g., “Telephone,” “Email,” “Postal Mail”).
(b) Pre-defined Data Values for Nominal/Ordinal Attributes
• Policy Holder ID:
o Each value is unique to an individual record; therefore, there is no fixed set of pre-defined values in the traditional sense.
• Occupation:
o Assumed: Typically, there might be 4–5 predefined categories (e.g., “Professional,” “Managerial,” “Clerical,” “Skilled,” etc.).
o Note: Verify the exact number from your lecture materials.
• Gender:
o Pre-defined Values: 2 (male, female)
• Type of Insurance Policy:
o Assumed: Commonly, there might be around 3 categories (for example, “Comprehensive,” “Third Party,” “Third Party Fire and Theft”).
o Note: Refer to the provided dataset details for the exact list.
• Preferred Contact Channel:
o Assumed: Likely 3 predefined options (e.g., “Telephone,” “Email,” “Postal Mail”).
o Note: Confirm with the course or dataset documentation.

Self-Review Question 2
(a) Describe the Dataset’s Common Properties
• Type:
o This is a structured (tabular) dataset containing both numeric and nominal attributes.
• Size:
o The dataset includes 8 employee records (instances) with 6 attributes each (Emp ID, Name, Year of Birth, Gender, Status, Salary).
• Dimensionality:
o With 6 attributes, the dataset is low-dimensional.
• Sparsity:
o The dataset appears dense (no missing values are evident), meaning nearly all entries are populated.
(b) Discretising the ‘Salary’ Attribute into Three Pay Bands
A simple yet sensible approach is to use one of the following binning methods:
• Equal Width Binning:
o Divide the range of salary values into three intervals of equal size.
• Equal Frequency Binning:
o Sort the salary values and partition them into three groups, each containing roughly the same number of employees.
Example:
Given the salaries (e.g., $32,000, $34,000, $36,000, $66,000, $70,000, $160,000, $200,000), you might set:
• Low Pay: up to $36,000
• Mid Pay: above $36,000 and up to $70,000
• High Pay: above $70,000
These bands correspond to an equal-frequency split of the seven listed salaries (3, 2, and 2 employees). Equal frequency binning is often preferred in small datasets to ensure each band is well represented.
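The two binning methods can be compared on the example salaries with the sketch below. The function names are illustrative, and giving any leftover values to the earliest bins in the equal-frequency split is one reasonable convention, not the only one.

```python
salaries = [32000, 34000, 36000, 66000, 70000, 160000, 200000]


def equal_width_bins(values, n_bins):
    """Split the value range into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins groups of near-equal size."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        size = base + (1 if i < extra else 0)  # earliest bins take the remainder
        bins.append(ordered[start:start + size])
        start += size
    return bins


print(equal_width_bins(salaries, 3))
# [[32000, 34000, 36000, 66000, 70000], [], [160000, 200000]]
print(equal_frequency_bins(salaries, 3))
# [[32000, 34000, 36000], [66000, 70000], [160000, 200000]]
```

With equal width, the two large salaries stretch the range so far that five of the seven values share one band and the middle band is empty, which is why equal frequency suits this small, skewed sample better.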
(c) Imputing Mr Dujevic’s Unknown Salary
Since Mr Dujevic is a Technician, a sensible replacement is to impute his salary with the average salary of his peer group.
• Calculation (for example):
o Technician salaries in the dataset are $36,000 (Jones), $32,000 (Millins), and $34,000 (Isovic).
o Mean: ($36,000 + $32,000 + $34,000) / 3 = $34,000
Thus, replacing his unknown salary with $34,000 is appropriate because it reflects the typical compensation for that role.
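The peer-group imputation can be written as a small helper. This is a sketch only: it includes just the four Technician records named above, and the dictionary layout and the function name `impute_by_group_mean` are assumptions made for illustration.

```python
from statistics import mean

# Only the Technician records mentioned above; None marks the unknown salary.
employees = [
    {"name": "Jones", "status": "Technician", "salary": 36000},
    {"name": "Millins", "status": "Technician", "salary": 32000},
    {"name": "Isovic", "status": "Technician", "salary": 34000},
    {"name": "Dujevic", "status": "Technician", "salary": None},
]


def impute_by_group_mean(rows, group_key, value_key):
    """Return copies of rows with missing values replaced by the group mean."""
    known = {}
    for row in rows:
        if row[value_key] is not None:
            known.setdefault(row[group_key], []).append(row[value_key])
    filled = []
    for row in rows:
        new_row = dict(row)
        group_values = known.get(row[group_key])
        # Leave the value missing if the group has no known values at all.
        if new_row[value_key] is None and group_values:
            new_row[value_key] = mean(group_values)
        filled.append(new_row)
    return filled


imputed = impute_by_group_mean(employees, "status", "salary")
print(imputed[3]["salary"])  # 34000
```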
(d) Identifying an Outlier and Its Impact
• Outlier Identification:
o The record for the Director (Emp ID 100, Salary $200,000) stands out as an outlier when compared to the rest of the employee records, which mainly include Technicians and Senior Technicians with much lower salaries.
• Potential Harm:
o Outliers can distort statistical analyses by skewing measures such as the mean and standard deviation.
o They may lead to misinterpretations about the overall salary distribution and adversely affect data models that are sensitive to extreme values.
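A quick calculation shows the size of this distortion. The sketch below reuses the example salary list from part (b) and compares the mean and sample standard deviation with and without the Director's $200,000 salary; it assumes that list is representative of the dataset.

```python
from statistics import mean, stdev

# Example salaries from part (b); the last value is the Director's.
salaries = [32000, 34000, 36000, 66000, 70000, 160000, 200000]
without_director = [s for s in salaries if s != 200000]

mean_all = mean(salaries)
mean_without = mean(without_director)
print(round(mean_all))      # 85429
print(round(mean_without))  # 66333

# The spread shrinks as well once the extreme value is removed.
print(stdev(salaries) > stdev(without_director))  # True
```

Removing a single record lowers the mean by roughly $19,000, which is more than half of a typical Technician salary in this example.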

Laboratory Questions
Question 1: Creating and Exploring an ARFF File
1. Creating the ARFF File:
o Store the example data in MS Excel.
o Save the file as a CSV.
o Open the CSV file in WEKA.
o Save the data as an ARFF file.
2. Tasks in WEKA:
(a) Observing Summary Data and Visualizations:
o On the Preprocess tab, review the dataset summary which
shows details such as the number of instances, attribute types,
and any missing values.
o Examine the histograms for all attributes to understand the
distribution of each variable.
o Use the Visualize tab to generate scatter plots. These plots
help reveal any relationships (e.g., correlation between
homework scores and exam marks).
(b) Applying the Unsupervised Discretize Filter:
o Apply the Discretize filter to the exam marks to transform continuous scores into discrete intervals.
o This process groups similar scores together, which can help simplify further analysis.
(c) Handling Missing Values:
o Practice filling in missing values manually using the Viewer
window (via the “Edit” menu).
o Also, use the ReplaceMissingValues filter where numeric
missing values are replaced by the mean and nominal ones by
the mode.
o For nominal attributes where an “unknown” code is used, the
AddValues filter can be applied to add this new category to the
attribute’s domain.
Question 2: [Optional Extension – Principal Component Analysis (PCA)]
• Performing PCA in WEKA:
o Open a dataset provided in WEKA (e.g., cpu.arff).
o Navigate to the Select Attributes tab.
o Click the Choose button under Attribute Evaluator and select PrincipalComponents (WEKA pairs this evaluator with the Ranker search method).
o Click Start to run the PCA.
• Discussion of Findings:
o The PCA process transforms the original attributes into new components (eigenvectors) that are linear combinations of the original features.
o The output lists these new attributes in order of their significance (i.e., the percentage of variance they explain).
o Typically, the first few principal components capture most of the variance, suggesting that the intrinsic dimensionality of the data may be lower than the original number of features.
o This dimensionality reduction can simplify data visualization and improve the performance of subsequent analyses by reducing noise.

Practical 5: Fundamentals of Classification
Self-Review Questions
1. (a) Classification = predicting discrete categorical values, such as dog or cat

Regression = predicting continuous numerical values, such as house price, income, or age

(b) Confusion matrix (Contingency table), Accuracy, and Error rate

Confusion matrix (Contingency table)
A table that summarizes the performance of a classification model by showing the counts of true positives, true negatives, false positives, and false negatives.

Accuracy
The proportion of correctly classified instances out of the total instances.

Error Rate
The proportion of incorrectly classified instances out of the total instances.

(c) Cross-validation and Bootstrap

Cross-validation is a technique for evaluating model performance by splitting the dataset into multiple subsets, training on some, and testing on others to ensure generalization. Bootstrap is a resampling method that randomly samples with replacement to create multiple datasets, estimating model accuracy and variability.
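The two resampling schemes can be sketched as index generators. The function names are illustrative, and using the never-sampled ("out-of-bag") instances as the bootstrap test set is a common convention assumed here, not something the answer above specifies.

```python
import random


def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices) once for each of k folds."""
    indices = list(range(n_instances))
    folds = [indices[i::k] for i in range(k)]  # interleaved, near-equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def bootstrap_split(n_instances, rng):
    """Sample n indices with replacement; unsampled indices form the test set."""
    train = [rng.randrange(n_instances) for _ in range(n_instances)]
    sampled = set(train)
    test = [i for i in range(n_instances) if i not in sampled]
    return train, test


for train, test in k_fold_splits(10, 5):
    print("test fold:", test, "train size:", len(train))

rng = random.Random(0)  # fixed seed so the sample is repeatable
boot_train, boot_test = bootstrap_split(10, rng)
print(len(boot_train), sorted(set(boot_train)), boot_test)
```

In cross-validation every instance is tested exactly once, whereas a bootstrap training sample usually contains repeats and leaves some instances out.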

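2. With examples, explain the difference between sensitivity and precision.

Sensitivity measures how well a model identifies actual positive cases. It is calculated as: Sensitivity = TP / (TP + FN)

Precision measures how many of the predicted positive cases are actually correct. It is calculated as: Precision = TP / (TP + FP)

For example, in disease screening, sensitivity is the share of sick patients that the test flags, while precision is the share of flagged patients who are actually sick.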
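The definitions above can be computed directly from a list of actual and predicted labels. The ten labels below are made-up illustration data, and the function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN and TN for one positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return {
        "accuracy": ratio(tp + tn, total),
        "error_rate": ratio(fp + fn, total),
        "sensitivity": ratio(tp, tp + fn),  # TP / (TP + FN)
        "precision": ratio(tp, tp + fp),    # TP / (TP + FP)
    }


# Made-up labels: 4 actual positives and 6 actual negatives.
actual = ["yes"] * 4 + ["no"] * 6
predicted = ["yes", "yes", "yes", "no", "yes", "yes", "no", "no", "no", "no"]

counts = confusion_counts(actual, predicted, positive="yes")
print(counts)  # (3, 2, 1, 4)
scores = evaluate(*counts)
print(scores)
```

Here the model finds 3 of the 4 actual positives (sensitivity 0.75) but only 3 of its 5 positive predictions are right (precision 0.6), so the two measures answer different questions.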

3. Why is a confusion-matrix-based evaluation

In unbalanced data sets, accuracy can be misleading.

For example: if 95% of cases are negative, then a model that predicts all cases to be "negative" has 95% accuracy, but it cannot detect actual positives.

Confusion matrices provide more insight into true positives, false positives, false negatives, and true negatives to help evaluate model performance.

They enable better metrics such as Precision, Recall, and F1-score, which provide more information than raw accuracy.
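The 95% example can be reproduced in a few lines. The sketch assumes 100 cases with 5 positives and a model that always predicts "neg"; the label names are arbitrary.

```python
# 95 negative and 5 positive cases; the model always predicts "neg".
actual = ["neg"] * 95 + ["pos"] * 5
predicted = ["neg"] * 100

pairs = list(zip(actual, predicted))
tp = sum(a == "pos" and p == "pos" for a, p in pairs)
fp = sum(a == "neg" and p == "pos" for a, p in pairs)
fn = sum(a == "pos" and p == "neg" for a, p in pairs)
tn = sum(a == "neg" and p == "neg" for a, p in pairs)

accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn) if tp + fn else 0.0
precision = tp / (tp + fp) if tp + fp else 0.0  # no positive predictions here
f1 = (2 * precision * recall / (precision + recall)
      if precision + recall else 0.0)

print(accuracy)  # 0.95
print(recall)    # 0.0
print(f1)        # 0.0
```

Accuracy looks strong, yet recall and F1 are zero because none of the five positives is detected.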

4. In the 1R algorithm, how to choose the best rule (attribute)?

The 1R algorithm generates a rule for each attribute in the data set and then evaluates each rule to determine which rule has the fewest errors.

Selection Method: A typical way to choose the best rule is to calculate the error rate of each generated rule. Choose the rule with the lowest error rate as the best rule.

Error rate: This is the number of incorrect predictions divided by the total number of instances evaluated. For a more balanced view, especially for unbalanced datasets, you can also consider other metrics such as precision, recall, or F1 score.
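The selection method can be sketched as follows. The 14 rows are the nominal weather (play) data as commonly distributed with WEKA, typed in here by hand, so treat the exact rows as an assumption; the function name `one_r` is illustrative, and ties between attributes are broken by keeping the first one found.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

# (outlook, temperature, humidity, windy, play); assumed standard weather data.
ROWS = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def one_r(rows, attributes):
    """Return (attribute, rule, errors) for the attribute with fewest errors."""
    best = None
    for col, name in enumerate(attributes):
        # Count class labels for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[col]][row[-1]] += 1
        # Each value predicts its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the instances outside the majority class of their value.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (name, rule, errors)
    return best


attribute, rule, errors = one_r(ROWS, ATTRIBUTES)
print(attribute, rule, f"{errors}/{len(ROWS)} errors")
```

On this data both outlook and humidity make 4 errors out of 14, so the tie-break decides which one is reported; this sketch keeps outlook because it is evaluated first.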

5. Overfitting occurs when the model learns patterns that are too specific to the training data (including noise) and cannot be generalized to new data.

Symptoms:

High accuracy on training data, but poor performance on test data.

The model captures noise rather than underlying trends.

Prevention:

Use cross-validation to verify model performance.

Prune the decision tree.

Use regularization techniques such as L1/L2 (Lasso/Ridge regression).

Collect more training data.
Practical Tasks: Classification through WEKA

Running 1R in Weka with the weather data

2.
From the training data analysis, the rules derived are listed below. Strictly, 1R tests only a single attribute (Outlook for this data); the rules here additionally refine the Sunny and Rainy branches using Humidity and Windy.

Outlook = Overcast → Play = Yes

Outlook = Sunny:
If Humidity = High → Play = No
If Humidity = Normal → Play = Yes

Outlook = Rainy:
If Windy = False → Play = Yes
If Windy = True → Play = No

Outlook | Temperature | Humidity | Windy | Play? (Prediction)
Sunny | Cool | High | TRUE | No (Sunny + High Humidity = No)
Rainy | Cool | Normal | FALSE | Yes (Rainy + Not Windy = Yes)
Overcast | Mild | High | TRUE | Yes (Overcast = Yes)
Sunny | Hot | Normal | FALSE | Yes (Sunny + Normal Humidity = Yes)

No ratings yet
Analysis of Effectiveness Particle Swarm Optimization in Improving The Performance of Naïve Bayes Algorithm
5 pages
Introduction to Machine Learning and Neural Classification
From Everand
Introduction to Machine Learning and Neural Classification
Trilokesh Khatri
No ratings yet
DATA MINING and MACHINE LEARNING. CLASSIFICATION PREDICTIVE TECHNIQUES: SUPPORT VECTOR MACHINE, LOGISTIC REGRESSION, DISCRIMINANT ANALYSIS and DECISION TREES: Examples with MATLAB
From Everand
DATA MINING and MACHINE LEARNING. CLASSIFICATION PREDICTIVE TECHNIQUES: SUPPORT VECTOR MACHINE, LOGISTIC REGRESSION, DISCRIMINANT ANALYSIS and DECISION TREES: Examples with MATLAB
César Pérez López
No ratings yet
"Big Data Science" Basic Concepts and Applications
From Everand
"Big Data Science" Basic Concepts and Applications
Sukanta Bhattacharya
No ratings yet
DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES: ENSEMBLE METHODS, BOOSTING, BAGGING, RANDOM FOREST, DECISION TREES and REGRESSION TREES.: Examples with MATLAB
From Everand
DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES: ENSEMBLE METHODS, BOOSTING, BAGGING, RANDOM FOREST, DECISION TREES and REGRESSION TREES.: Examples with MATLAB
César Pérez López
No ratings yet



Table of Contents

1. Practical 1: Fundamentals of Data Science and Machine

1. Data Science is a process of extracting knowledge and utilizing data. It

DATA SCIENCE MACHINE LEARNING

Data science encompasses a wider Machine learning expertise lies in

(d) Machine learning because determining characteristics requires pattern

(e) Machine learning because we would need to determine the pattern of data

5. (a) Classification (Supervised); The data already has labels, so a

(b) Pattern Discovery (Unsupervised); There are no predefined labels. The goal

(c) Classification (Supervised); The data includes algae population records, so a

IBM. (n.d.). Supervised vs. unsupervised learning. Retrieved February 4, 2025,

1. Top tools: Python, Anaconda, scikit-learn, TensorFlow, Keras, and Apache

Piatetsky, G. (2018, June 6). The 6 components of Open-Source Data Science/

- What would be the output of this analysis?

Keita, Z. (2024, August 8). Classification in Machine Learning:

2. Data Pre-Processing using WEKA

Practical 3: Clustering Review – K-Means & Hierarchical Clustering

Practical 4: Data Pre-Processing Review Answers

o Apply the Discretize filter to the exam marks to transform

Regression = Continuous numerical values like house, income,

(b) Confusion matrix (Contingency table), Accuracy, and Error rate

Confusion matrix (Contingency table)

A table that summarizes the performance of a classification model by

The proportion of correctly classified instances out of the total instances.

The proportion of incorrectly classified instances out of the total instances.
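The two proportions above can be computed directly from the four cells of a binary confusion matrix. The following is a minimal sketch, not part of the original practical; the function name and the example counts are made up for illustration.

```python
def accuracy_and_error_rate(tp, fp, fn, tn):
    """Return (accuracy, error_rate) from binary confusion-matrix counts.

    tp/tn are correctly classified instances; fp/fn are misclassified ones.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total      # correctly classified / all instances
    error_rate = (fp + fn) / total    # incorrectly classified / all instances
    return accuracy, error_rate


# Hypothetical counts, chosen only to illustrate the arithmetic.
acc, err = accuracy_and_error_rate(tp=40, fp=10, fn=5, tn=45)
print(f"accuracy={acc:.2f}, error rate={err:.2f}")
```

Accuracy and error rate always sum to 1, so reporting either one is enough.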

(c) Cross-validation and Bootstrap

Cross-validation is a technique for evaluating model performance by

2. With examples, explain the difference between sensitivity and

Sensitivity measures how well a model identifies actual positive cases.

Precision measures how many of the predicted positive cases are
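The difference between the two measures is easiest to see with numbers. Below is a small sketch under assumed counts (the function names and the figures are illustrative, not taken from the practical): sensitivity divides the true positives by all actual positives, while precision divides them by all predicted positives.

```python
def sensitivity(tp, fn):
    """Share of actual positive cases that the model found (also called recall)."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of predicted positive cases that are actually positive."""
    return tp / (tp + fp)


# Hypothetical screening result: 50 actual positives, 60 predicted positives.
tp, fn, fp = 40, 10, 20
sens = sensitivity(tp, fn)   # 40 of the 50 actual positives were found
prec = precision(tp, fp)     # 40 of the 60 positive predictions were right
print(f"sensitivity={sens:.2f}, precision={prec:.2f}")
```

With these counts the model finds most positive cases but a third of its positive predictions are wrong, which is why the two numbers are reported separately.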

3. Why is a confusion-matrix-based evaluation

Confusion matrices provide more insight into true positives, false

It enables better metrics such as Precision, Recall, and F1-score, which

4. In the 1R algorithm, how to choose the best rule (attribute)?

Selection Method: A typical way to choose the best rule is to calculate

Error rate: This is the number of incorrect predictions divided by the
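The selection step can be sketched in a few lines of Python. This is an illustrative sketch rather than WEKA's own OneR implementation; it assumes the standard 14-instance nominal weather dataset that ships with WEKA, and the names `WEATHER`, `ATTRIBUTES`, and `one_r` are hypothetical.

```python
from collections import Counter, defaultdict

# Assumed data: the standard nominal weather dataset (last column is the class).
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
WEATHER = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def one_r(rows, attributes):
    """Return (attribute, rule, errors) for the attribute with the fewest errors."""
    best = None
    for idx, name in enumerate(attributes):
        # Count class labels for every value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[idx]][row[-1]] += 1
        # Each value predicts its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the instances that do not belong to the majority class.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        # Strict "<" keeps the earlier attribute when two attributes tie.
        if best is None or errors < best[2]:
            best = (name, rule, errors)
    return best


best_attribute, best_rule, best_errors = one_r(WEATHER, ATTRIBUTES)
error_rate = best_errors / len(WEATHER)
print(best_attribute, best_rule, f"{best_errors}/{len(WEATHER)}")
```

On this data, outlook and humidity both make 4 errors out of 14; the sketch keeps outlook only because it comes first in the attribute list, which is a tie-breaking assumption of this example.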

High accuracy on training data, but poor performance on test data.

Models capture noise rather than underlying trends.

Use cross-validation to verify model performance.

Prune the decision tree.

Regularization techniques such as L1/L2 (Lasso/Ridge regression).
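The cross-validation check listed among these measures can be sketched as a k-fold split: every instance is held out for testing exactly once, and the model is trained on the remaining folds each time. This is a minimal sketch using only the standard library; the function name `k_fold_indices`, the seed, and the choice of 14 instances with 7 folds are assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # shuffle reproducibly
    folds = [indices[i::k] for i in range(k)]     # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# Hypothetical setup: 14 instances split into 7 folds of 2.
splits = list(k_fold_indices(n_samples=14, k=7))
for train, test in splits:
    print(f"train on {len(train)} instances, test on {sorted(test)}")
```

A large gap between training accuracy and the accuracy averaged over the held-out folds is the symptom of overfitting described above.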

Running 1R in Weka with the weather data

Outlook = Overcast → Play = Yes

Outlook   Temperature  Humidity  Windy  Play? (Prediction)
Sunny     Cool         High      TRUE   No (Sunny + High Humidity =
Rainy     Cool         Normal    FALSE  Yes (Rainy + Not Windy = Yes)
Overcast  Mild         High      TRUE   Yes (Overcast = Yes)
Sunny     Hot          Normal    FALSE  Yes (Sunny + Normal Humidity
