CP1407 Practical 1 to Practical 5
1. Data science has a broad focus that encompasses various techniques for
extracting insights and meaning from data, e.g. statistical analysis and data
visualization.
Machine learning primarily focuses on building algorithms that enable
computers to learn from data and make predictions.
2. (a) Data Query, because the answer can be extracted directly from the
dataset, e.g. by counting the male and female population.
(b) Machine Learning, because this needs predictive modelling, which
identifies patterns in previous data to make predictions.
(c) Data Query, because it involves retrieving and comparing existing data on
the golf activities of married men and single men. No prediction or pattern
discovery is required.
3. Data scientists work at the raw database level to derive insights and build
data products, while analysts may interact with data at either the database
level or the summarized report level.
Reference:
Provost, F., & Fawcett, T. (2013). Data Science for Business: What You Need to
Know about Data Mining and Data-Analytic Thinking. O'Reilly Media.
Davenport, T. H., & Patil, D. J. (2012). Data scientist: The sexiest job of the
21st century. Harvard Business Review, 90(10), 70-76.
McKinsey Global Institute. (2011). Big data: The next frontier for innovation,
competition, and productivity.
4. A data scientist needs mathematics expertise, technology and hacking
skills, and business/strategy acumen. However, the skill that most data
scientists will find most useful is programming, for example in Python or R,
which can be used to sort, analyze, and manage large amounts of “big data”.
To improve these skills, you can sign up for an online course or get involved
in communities. Furthermore, here is an image I found that shows the 10
essential skills for a data scientist.
Reference
Staff, C. (2024, April 5). 7 skills every data scientist should have. Coursera.
https://www.coursera.org/articles/data-scientist-skills
Tayo, O. (2019). Data Science Minimum: 10 Essential Skills You Need to Know
to Start Doing Data Science [Online image]. Towards Data Science.
https://towardsdatascience.com/data-science-minimum-10-essential-skills-you-need-to-know-to-start-doing-data-science-e5a5a9be5991
Reference
GeeksforGeeks. (n.d.). Real-life examples of supervised learning and
unsupervised learning. Retrieved February 4, 2025, from
https://www.geeksforgeeks.org/real-life-examples-of-supervised-learning-and-unsupervised-learning/
Laboratory Questions (for these questions, each of you can answer both on
your own)
Reference
KDnuggets.
https://www.kdnuggets.com/2018/06/ecosystem-data-science-python-victory.html
2.
- What problem do you target to solve using this data set?
The problem I aim to solve as a data scientist is to predict whether a
person has heart disease or not, based on the medical attributes given as
input.
- What part of the data would be used as input for your model?
The data used as input for my model are age, chest pain type, resting blood
pressure, maximum heart rate achieved and exercise-induced angina.
- What machine learning methods (e.g., classification, clustering) can be
applied?
Classification methods such as Logistic Regression, Support Vector
Machines, Decision Trees and Artificial Neural Networks can be used to sort
data into categories (Keita, 2024).
Reference:
An Introduction. DataCamp.
https://www.datacamp.com/blog/classification-machine-learning
Practical 2: Introduction to WEKA
1. Launching and Starting WEKA
K-Means Clustering:
o Different distance metrics (Euclidean vs. Manhattan) may lead
to different cluster centroids and cluster assignments. Discuss
the differences in the clusters formed and provide possible
reasons based on the underlying distance calculations.
Documentation:
o Include screenshots of the dendrograms and clustering results
from WEKA.
o Provide a detailed discussion on the impact of different methods
and distance metrics on the clustering outcomes.
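To illustrate the K-Means point above about distance metrics, here is a minimal Python sketch (not WEKA's implementation) in which the same point is assigned to different centroids under Euclidean and Manhattan distance. The point and the two centroids are hypothetical values chosen only to show the effect.

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of the absolute differences per attribute."""
    return sum(abs(a - b) for a, b in zip(p, q))

def nearest_centroid(point, centroids, distance):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

# Hypothetical data: one point and two candidate centroids.
point = (0.0, 0.0)
centroids = [(3.0, 3.0), (5.0, 0.0)]

# Euclidean: sqrt(18) is about 4.24 for centroid 0 versus 5.0 for centroid 1.
print(nearest_centroid(point, centroids, euclidean))
# Manhattan: 6.0 for centroid 0 versus 5.0 for centroid 1.
print(nearest_centroid(point, centroids, manhattan))
```

Because Euclidean distance squares each difference, a diagonal offset costs less than it does under Manhattan distance, which is one possible reason the assignments (and therefore the centroids) can differ between the two runs.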
6. Conclusion
This practical exercise has provided insight into two fundamental clustering
techniques:
Hierarchical Clustering:
o Generates a nested set of clusters (dendrogram) where each
level represents a different granularity.
o The choice of linkage method (single vs. complete) has a
significant impact on the clustering structure.
K-Means Clustering:
o Iteratively partitions the data into a predetermined number of
clusters by optimizing cluster centroids.
o The algorithm is sensitive to the initial choice of centers and the
distance metric used (Euclidean vs. Manhattan).
Overall, by comparing the theoretical computations and the WEKA
experimental results, we gain a better understanding of how various
clustering methods can be applied and the importance of selecting
appropriate parameters and metrics for the data at hand.
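The effect of the linkage method mentioned in the conclusion can be sketched in a few lines of Python. This is an illustrative example only, using three hypothetical clusters on a line; it shows that single linkage and complete linkage can choose different pairs of clusters to merge next.

```python
import math
from itertools import combinations, product

def single_linkage(a, b):
    """Distance between the two closest members of clusters a and b."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Distance between the two farthest members of clusters a and b."""
    return max(math.dist(p, q) for p, q in product(a, b))

def closest_pair(clusters, linkage):
    """Return the names of the two clusters with the smallest linkage distance."""
    return min(
        combinations(clusters, 2),
        key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
    )

# Hypothetical clusters (not taken from the practical's data).
clusters = {
    "A": [(0.0, 0.0), (1.0, 0.0)],
    "B": [(3.0, 0.0), (7.0, 0.0)],
    "C": [(-4.0, 0.0)],
}

print(closest_pair(clusters, single_linkage))    # merge chosen by single linkage
print(closest_pair(clusters, complete_linkage))  # merge chosen by complete linkage
```

Single linkage merges A with B because their nearest members are only 2 apart, while complete linkage merges A with C because B's far member at 7 makes the A-B cluster too wide.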
Practical 4: Data Preprocessing
Review
Self-Review Question 2
(a) Describe the Dataset’s Common Properties
Type:
o This is a structured (tabular) dataset containing both numeric
and nominal attributes.
Size:
o The dataset includes 8 employee records (instances) with 6
attributes each (Emp ID, Name, Year of Birth, Gender, Status,
Salary).
Dimensionality:
o With 6 attributes, the dataset is low-dimensional.
Sparsity:
o The dataset appears dense (no missing values are evident),
meaning nearly all entries are populated.
(b) Discretising the ‘Salary’ Attribute into Three Pay Bands
A simple yet sensible approach is to use one of the following binning methods:
Equal Width Binning:
o Divide the range of salary values into three intervals of equal
size.
Equal Frequency Binning:
o Sort the salary values and partition them into three groups, each
containing roughly the same number of employees.
Example:
Given the salaries (e.g., $32,000, $34,000, $36,000, $66,000, $70,000,
$160,000, $200,000), you might set:
Low Pay: Up to $36,000
Mid Pay: Above $36,000 and up to $70,000
High Pay: Above $70,000
Equal frequency binning is often preferred in small datasets to ensure each
band is well represented.
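The two binning methods can be compared on the example salaries with a short Python sketch. The function names are illustrative, and the way leftover values are distributed in the equal frequency case (earlier bins receive the extras) is an assumption; WEKA's Discretize filter may place its cut points differently.

```python
salaries = [32000, 34000, 36000, 66000, 70000, 160000, 200000]

def equal_width_bins(values, k):
    """Split the value range into k intervals of equal size."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value would land in bin k, so clamp it into the last bin.
        index = min(int((v - lo) / width), k - 1)
        bins[index].append(v)
    return bins

def equal_frequency_bins(values, k):
    """Split the sorted values into k groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    bins, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)  # earlier bins take the extras
        bins.append(ordered[start:start + size])
        start += size
    return bins

print(equal_width_bins(salaries, 3))
print(equal_frequency_bins(salaries, 3))
```

With these values, equal width binning puts five salaries in the lowest band and leaves the middle band empty, whereas equal frequency binning gives groups of three, two and two, which matches the Low, Mid and High bands above.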
(c) Imputing Mr Dujevic’s Unknown Salary
Since Mr Dujevic is a Technician, a sensible replacement is to impute his
salary with the average salary of his peer group.
Calculation (for example):
o Technician salaries in the dataset are $36,000 (Jones), $32,000
(Millins), and $34,000 (Isovic).
o Mean: ($36,000 + $32,000 + $34,000) / 3 = $34,000
Thus, replacing his unknown salary with $34,000 is appropriate
because it reflects the typical compensation for that role.
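The same group-mean imputation can be written as a small Python sketch. The names and salaries come from the calculation above; the dictionary keys ("name", "status", "salary") and the function name are illustrative assumptions, not part of the dataset definition.

```python
from statistics import mean

# Technician records from the example; None marks the unknown salary.
records = [
    {"name": "Jones", "status": "Technician", "salary": 36000},
    {"name": "Millins", "status": "Technician", "salary": 32000},
    {"name": "Isovic", "status": "Technician", "salary": 34000},
    {"name": "Dujevic", "status": "Technician", "salary": None},
]

def impute_group_mean(rows, group_key, value_key):
    """Replace missing values with the mean of the known values in the same group."""
    known = {}
    for row in rows:
        if row[value_key] is not None:
            known.setdefault(row[group_key], []).append(row[value_key])
    group_means = {group: mean(values) for group, values in known.items()}
    # Note: a group with no known values at all would raise a KeyError here.
    return [
        {**row, value_key: group_means[row[group_key]]}
        if row[value_key] is None else dict(row)
        for row in rows
    ]

filled = impute_group_mean(records, "status", "salary")
print(filled[-1])
```

Using the peer-group mean rather than the overall mean keeps the Director's salary from inflating the imputed value.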
(d) Identifying an Outlier and Its Impact
Outlier Identification:
o The record for the Director (Emp ID 100, Salary $200,000)
stands out as an outlier when compared to the rest of the
employee records, which mainly include Technicians and Senior
Technicians with much lower salaries.
Potential Harm:
o Outliers can distort statistical analyses by skewing measures
such as the mean and standard deviation.
o They may lead to misinterpretations about the overall salary
distribution and adversely affect data models that are sensitive
to extreme values.
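A quick way to see the distortion is to compute the mean and standard deviation with and without the Director's salary. This sketch reuses the example salary values listed in part (b) and assumes they are representative of the dataset.

```python
from statistics import mean, pstdev

# Example salaries from part (b); 200000 is the Director's salary.
salaries = [32000, 34000, 36000, 66000, 70000, 160000, 200000]
without_outlier = [s for s in salaries if s != 200000]

mean_with = mean(salaries)
mean_without = mean(without_outlier)
spread_with = pstdev(salaries)          # population standard deviation
spread_without = pstdev(without_outlier)

print(f"mean with outlier:    {mean_with:,.0f}")
print(f"mean without outlier: {mean_without:,.0f}")
print(f"std dev with outlier:    {spread_with:,.0f}")
print(f"std dev without outlier: {spread_without:,.0f}")
```

Removing a single record lowers the mean from about $85,429 to about $66,333, which shows how strongly one extreme value can pull these summary statistics.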
Laboratory Questions
Question 1: Creating and Exploring an ARFF File
1. Creating the ARFF File:
o Store the example data in MS Excel.
o Save the file as a CSV.
o Open the CSV file in WEKA.
o Save the data as an ARFF file.
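WEKA performs the CSV to ARFF conversion itself, but a rough Python sketch helps show what the resulting file contains: a relation name, one attribute declaration per column, and the data rows. The column names and values below are hypothetical, and the sketch ignores details such as quoting names that contain spaces or marking missing values.

```python
import csv
import io

def is_number(text):
    """Return True if the text can be parsed as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def csv_to_arff(csv_text, relation):
    """Build ARFF text from CSV text, treating all-numeric columns as numeric."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            lines.append(f"@attribute {name} numeric")
        else:
            labels = ",".join(sorted(set(values)))
            lines.append(f"@attribute {name} {{{labels}}}")  # nominal attribute
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)

# Hypothetical example data, not the practical's actual table.
sample = "homework,exam,result\n80,75,pass\n40,35,fail\n"
arff = csv_to_arff(sample, "marks")
print(arff)
```

Opening the saved ARFF file in a text editor after the WEKA export is a useful check that numeric and nominal attributes were detected as intended.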
2. Tasks in WEKA:
(a) Observing Summary Data and Visualizations:
o On the Preprocess tab, review the dataset summary which
shows details such as the number of instances, attribute types,
and any missing values.
o Examine the histograms for all attributes to understand the
distribution of each variable.
o Use the Visualize tab to generate scatter plots. These plots
help reveal any relationships (e.g., correlation between
homework scores and exam marks).
(b) Applying the Unsupervised Discretize Filter:
Practical 5: Fundamentals of Classification
Self-Review Questions
1. (a) Classification = discrete categorical values such as dog and cat
Accuracy
Error Rate
For example: if 95% of cases are negative, a model that predicts every
case as "negative" achieves 95% accuracy, yet it never detects an
actual positive.
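The 95% example can be checked with a few lines of Python. The class labels and the 95/5 split are taken from the example above; the proportion of actual positives detected is commonly called recall, a term used here for illustration.

```python
# 100 cases: 95 negative and 5 positive, as in the example above.
actual = ["neg"] * 95 + ["pos"] * 5
# A trivial model that predicts "negative" for every case.
predicted = ["neg"] * len(actual)

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
error_rate = 1 - accuracy

# Share of the actual positives that the model detected (recall).
true_positives = sum(a == "pos" and p == "pos" for a, p in zip(actual, predicted))
recall = true_positives / actual.count("pos")

print(f"accuracy:   {accuracy:.2f}")
print(f"error rate: {error_rate:.2f}")
print(f"recall:     {recall:.2f}")
```

The accuracy looks high only because the negative class dominates, which is why accuracy alone can be misleading on imbalanced data.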
The 1R algorithm generates a rule for each attribute in the data set and
then evaluates each rule to determine which rule has the fewest errors.
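A minimal Python sketch of 1R follows. It assumes nominal attributes only and uses the well-known weather data set that ships with WEKA (weather.nominal); whether the practical uses exactly this data is an assumption. Ties between attributes are broken by taking the first one, which is a choice made for this sketch.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    """Return (attribute, rule, errors) for the attribute whose rule errs least."""
    best = None
    for attr in attributes:
        # Count the class labels seen for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # Each value predicts its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:  # first attribute wins a tie
            best = (attr, rule, errors)
    return best

columns = ["outlook", "temperature", "humidity", "windy", "play"]
raw = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
rows = [dict(zip(columns, r)) for r in raw]

attribute, rule, errors = one_r(rows, columns[:-1], "play")
print(attribute, rule, errors)
```

On this data both outlook and humidity produce rules with 4 errors out of 14, so the sketch reports outlook simply because it is evaluated first.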
5. Overfitting occurs when the model learns patterns that are too specific
to the training data (including noise) and cannot be generalized to new
data.
Symptoms:
Prevention:
Outlook = Sunny:
If Humidity = High → Play = No
If Humidity = Normal → Play = Yes
Outlook = Rainy:
If Windy = False → Play = Yes
If Windy = True → Play = No
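The rules listed above can be expressed as a small lookup in Python. This sketch encodes only the two branches shown; any other Outlook value is treated as not covered and returns None. The function name and the instance format (a dictionary keyed by attribute name) are illustrative assumptions.

```python
# Each Outlook value maps to the attribute tested next and its outcomes.
RULES = {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Rainy": ("Windy", {False: "Yes", True: "No"}),
}

def predict_play(instance):
    """Apply the listed rules; return None when no rule covers the instance."""
    branch = RULES.get(instance["Outlook"])
    if branch is None:
        return None  # Outlook value not covered by the rules listed above
    attribute, outcomes = branch
    return outcomes.get(instance[attribute])

print(predict_play({"Outlook": "Sunny", "Humidity": "High", "Windy": False}))
print(predict_play({"Outlook": "Rainy", "Humidity": "High", "Windy": True}))
```

Each prediction follows a single path from the Outlook test to one leaf, which is how a decision tree classifies an instance.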