
School of Computer Science and Electronic Engineering, University of Essex

Lecture 4: Data Exploration, Cleaning, and Wrangling


CE880: An Approachable Introduction to Data Science

Haider Raza
Tuesday, 7th February 2023

About Myself

• Name: Haider Raza
• Position: Senior Lecturer in Artificial Intelligence
• Research interests: AI, Machine Learning, Data Science
• Contact: h.raza@essex.ac.uk
• Academic Support Hours: 1-2 PM every Friday via Zoom; the Zoom link is available on Moodle
• Website: www.sagihaider.com

Dimensionality Reduction

Dimensionality reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.

Principal component analysis (PCA)

PCA is a technique used to emphasize variation and bring out strong patterns in a dataset. It's often used to make data easy to explore and visualize.

It is an orthogonal transformation that converts a set of correlated variables into a set of linearly uncorrelated variables called principal components, with the goal of finding the best summary of the data using a limited number of PCs.

PCA: 2D Example

First, consider a dataset in only two dimensions, such as height and weight. This dataset can be plotted as points in a plane. But if we want to tease out variation, PCA finds a new coordinate system in which every point has a new (x, y) value. The axes don't actually mean anything physical; they're combinations of height and weight, called "principal components", chosen to give one axis as much of the variation as possible.

PCA: 2D Example

PCA is useful for eliminating dimensions. Below, we've plotted the data along a pair of lines: one composed of the x-values and another of the y-values.

If we're going to only see the data along one dimension, though, it might be better to make that dimension the principal component with most variation. We don't lose much by dropping PC2, since it contributes the least to the variation in the data set.
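A minimal sketch of this idea, assuming scikit-learn; the height/weight numbers here are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative 2D data: weight loosely follows height, plus noise.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, 200)                    # cm
weight = 0.9 * height - 90 + rng.normal(0, 5, 200)   # kg
X = np.column_stack([height, weight])

# The variance ratios show how much each component contributes.
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # roughly [0.94, 0.06] here

# Keep only PC1: project to 1D, dropping the low-variation PC2.
X_1d = PCA(n_components=1).fit_transform(X)
```

With data like this, PC1 carries almost all of the variation, so the 1D projection loses very little information.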

PCA: 3D Example

With three dimensions, PCA is more useful, because it’s hard to see through a cloud
of data. In the example below, the original data are plotted in 3D, but you can project
the data into 2D through a transformation no different than finding a camera angle:
rotate the axes to find the best angle. The PCA transformation ensures that the
horizontal axis PC1 has the most variation, the vertical axis PC2 the second-most, and
a third axis PC3 the least. Obviously, PC3 is the one we drop.

PCA: Real Example

What if our data have way more than 3 dimensions? Like, 17 dimensions?! In the table is the average consumption of 17 types of food, in grams per person per week, for every country in the UK.

The table shows some interesting variations across different food types, but overall
differences aren’t so notable. Let’s see if PCA can eliminate dimensions to emphasize
how countries differ.
PCA: Real Example in 1D

Here’s the plot of the data along the first principal component. Already we can see
something is different about Northern Ireland.

PCA: Real Example in 2D

Now, looking at the first and second principal components, we see that Northern Ireland is a major outlier. Once we go back and look at the data in the table, this makes sense: the Northern Irish eat way more grams of fresh potatoes and way fewer of fresh fruits, cheese, fish, and alcoholic drinks.

Using PCA

• PCA is not the only algorithm for dimensionality reduction
• You can also use the output of PCA for making predictions
• Each PCA component is a linear combination of the input features
• We can plot a correlation plot to see how the components relate to the original features

Applying PCA on half-moon data
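A minimal sketch of this step, assuming scikit-learn's make_moons and PCA (the exact code and settings on the original slide may differ):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA

# Two interleaving half-moons in 2D.
X, y = make_moons(n_samples=200, noise=0.05, random_state=42)

# Rotate the data onto its principal components.
X_pca = PCA(n_components=2).fit_transform(X)

# Plot the data before and after the PCA rotation.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 0], X[:, 1], c=y)
ax1.set_title("Before PCA")
ax2.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
ax2.set_title("After PCA")
plt.show()
```

Since plain PCA is a linear rotation, the two moons remain interleaved after the transform; this is what motivates Kernel PCA later in the lecture.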

Before and After PCA

Correlation plot
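A sketch of one way to build such a correlation plot, relating each original feature to each principal-component score (printed as a matrix here for brevity):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA

X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
X_pca = PCA(n_components=2).fit_transform(X)

# Correlation of original features (rows) with components (columns):
# np.corrcoef stacks the two 2x200 arrays into one 4x4 matrix, from
# which we slice the feature-vs-component block.
corr = np.corrcoef(X.T, X_pca.T)[:2, 2:]
print(corr)
```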

Kernel PCA

Kernel PCA is an extension of PCA that uses techniques from kernel methods. Using a kernel, the originally linear operations of PCA are performed in a reproducing kernel Hilbert space.

• Kernel PCA just performs PCA in a new space
• It uses the kernel trick to find principal components in a different (possibly high-dimensional) space
• Kernel PCA has been demonstrated to be useful for novelty detection and image de-noising
• PCA allows us to reconstruct the pre-image; in KPCA this is sometimes not possible

Let's run Kernel PCA on the moon data
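A hedged sketch, assuming scikit-learn's KernelPCA with an RBF kernel; gamma=15 is an illustrative choice, not a tuned value:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

X, y = make_moons(n_samples=200, noise=0.05, random_state=42)

# The RBF kernel implicitly maps the moons into a space where
# linear PCA can pull them apart.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)

plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Kernel PCA on half-moon data")
plt.show()
```

In the resulting Kernel PCA plot, the two moons become close to linearly separable along the first component.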

Kernel PCA plot

What is Bias?

Most of you might have watched, or at least heard about, the popular Netflix series 'The Queen's Gambit'. The series excellently captured the struggles of women in society and offers one of the best examples of gender bias.

What is Data Bias?

Data bias in machine learning is a type of error in which certain elements of a dataset
are more heavily weighted and/or represented than others. A biased dataset does not
accurately represent a model’s use case, resulting in skewed outcomes, low accuracy
levels, and analytical errors.

How serious are the implications of neglecting bias in the data?

As data scientists, we know that if our data sample does not represent the whole population, then our results are not statistically significant, which means that we do not get accurate results.

Example: AI algorithms developed to detect skin cancer as accurately as an experienced dermatologist failed to detect skin cancers in people with dark skin.

How can AI bias occur?

• Missing diverse demographic categories
• Bias inherited from humans
• During the feature engineering phase

Let's dive into real-world data

This data set is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine:
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline'
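A minimal loading sketch with pandas; the column names mirror the attribute list above, and the first column of the file is the cultivar label (1-3):

```python
import pandas as pd

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine/wine.data")
columns = ["Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash",
           "Magnesium", "Total phenols", "Flavanoids",
           "Nonflavanoid phenols", "Proanthocyanins", "Color intensity",
           "Hue", "OD280/OD315 of diluted wines", "Proline"]

# The file has no header row, so supply the names explicitly.
wine = pd.read_csv(url, header=None, names=columns)
print(wine.head())
```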

Let's cluster the wine dataset

• Clustering on true labels
• Number of clusters = 3 (three types of wine)

Let's cluster the wine dataset

• Clustering on unlabeled data using two variables, 'Alcohol' and 'Ash' (see the sketch below)
• Number of clusters = 3 (three types of wine)
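A sketch of this step, assuming k-means as the clustering algorithm and continuing from the loading sketch above (which defines `wine`):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Cluster on just two variables, ignoring the true labels.
X2 = wine[["Alcohol", "Ash"]].values

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X2)

plt.scatter(X2[:, 0], X2[:, 1], c=labels)
plt.xlabel("Alcohol")
plt.ylabel("Ash")
plt.title("k-means clusters (k = 3) on two wine features")
plt.show()
```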

Let’s do PCA on wine data to reduce dimensionality

We will get the principal components on the X-axis and the explained variance ratio on the Y-axis.

Note: Here the data is split into a 70:30 ratio.
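A hedged sketch of this step; the 70:30 split comes from the slide, while the standardization and random seed are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# `wine` comes from the loading sketch above.
X = wine.drop(columns="Class").values
y = wine["Class"].values

# 70:30 train/test split, as noted on the slide.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# PCA is scale-sensitive, so standardize (fit on training data only).
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

pca = PCA().fit(X_train_std)

# Principal components on the X-axis, explained variance ratio on the Y-axis.
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1),
        pca.explained_variance_ratio_)
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.show()
```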

Let’s do PCA ...

We will get a plot of the two best principal components (unsupervised mode).

Let’s do PCA ...

We will get a plot of the two best principal components (supervised mode).
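A sketch covering both plots, reusing `pca`, `X_train_std`, and `y_train` from the previous sketch: the same component scores, drawn first without labels (unsupervised view) and then colored by the true cultivar (supervised view):

```python
import matplotlib.pyplot as plt

# Project the standardized training data onto the first two components.
X_train_2d = pca.transform(X_train_std)[:, :2]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_train_2d[:, 0], X_train_2d[:, 1])             # no labels
ax1.set_title("Unsupervised view")
ax2.scatter(X_train_2d[:, 0], X_train_2d[:, 1], c=y_train)  # true labels
ax2.set_title("Supervised view")
for ax in (ax1, ax2):
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
plt.show()
```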

Let’s train a Logistic Regression Model to classify the test data
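A minimal sketch, continuing from the PCA sketches above: fit logistic regression on the first two principal components and score it on the held-out 30%:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Use the first two principal components as input features.
X_train_2d = pca.transform(X_train_std)[:, :2]
X_test_2d = pca.transform(X_test_std)[:, :2]

clf = LogisticRegression(max_iter=1000).fit(X_train_2d, y_train)
y_pred = clf.predict(X_test_2d)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```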

Explanation

