
School of Computer Science and Electronic Engineering, University of Essex

Lecture 4: Data Exploration, Cleaning, and Wrangling


CE880: An Approachable Introduction to Data Science

Haider Raza
Tuesday, 7th February 2023

About Myself

• Name: Haider Raza
• Position: Senior Lecturer in Artificial Intelligence
• Research interests: AI, Machine Learning, Data Science
• Contact: h.raza@essex.ac.uk
• Academic Support Hours: 1-2 PM every Friday via Zoom; the Zoom link is available on Moodle
• Website: www.sagihaider.com

Dimensionality Reduction

Dimensionality reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.

Principal component analysis (PCA)

PCA is a technique used to emphasize variation and bring out strong patterns in a dataset. It's often used to make data easy to explore and visualize.

It is an orthogonal transformation that converts a set of correlated variables into a set of linearly uncorrelated variables called principal components, with the goal of finding the best summary of the data using a limited number of PCs.

PCA: 2D Example

First, consider a dataset in only two dimensions, such as height and weight. This dataset can be plotted as points in a plane. But if we want to tease out variation, PCA finds a new coordinate system in which every point has a new (x, y) value. The axes don't actually mean anything physical; they're combinations of height and weight, called "principal components", chosen to give one axis as much of the variation as possible.

PCA: 2D Example

PCA is useful for eliminating dimensions. Below, we've plotted the data along a pair of lines: one composed of the x-values and another of the y-values.

If we're going to only see the data along one dimension, though, it might be better to make that dimension the principal component with most variation. We don't lose much by dropping PC2, since it contributes the least to the variation in the data set.
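A minimal sketch of this idea, assuming scikit-learn; the height/weight numbers here are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative 2D data: weight loosely follows height, plus noise.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, 200)                    # cm
weight = 0.9 * height - 90 + rng.normal(0, 5, 200)   # kg
X = np.column_stack([height, weight])

# The variance ratios show how much each component contributes.
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # roughly [0.94, 0.06] here

# Keep only PC1: project to 1D, dropping the low-variation PC2.
X_1d = PCA(n_components=1).fit_transform(X)
```

With data like this, PC1 carries almost all of the variation, so the 1D projection loses very little information.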

PCA: 3D Example

With three dimensions, PCA is more useful, because it’s hard to see through a cloud
of data. In the example below, the original data are plotted in 3D, but you can project
the data into 2D through a transformation no different than finding a camera angle:
rotate the axes to find the best angle. The PCA transformation ensures that the
horizontal axis PC1 has the most variation, the vertical axis PC2 the second-most, and
a third axis PC3 the least. Obviously, PC3 is the one we drop.

PCA: Real Example

What if our data have way more than 3 dimensions? Like, 17 dimensions?! In the table is the average consumption of 17 types of food, in grams per person per week, for every country in the UK.

The table shows some interesting variations across different food types, but overall
differences aren’t so notable. Let’s see if PCA can eliminate dimensions to emphasize
how countries differ.
PCA: Real Example in 1D

Here’s the plot of the data along the first principal component. Already we can see
something is different about Northern Ireland.

PCA: Real Example in 2D

Now, looking at the first and second principal components, we see that Northern Ireland is a major outlier. Once we go back and look at the data in the table, this makes sense: the Northern Irish eat way more grams of fresh potatoes and way fewer of fresh fruits, cheese, fish, and alcoholic drinks.

Using PCA

• PCA is not the only algorithm for dimensionality reduction
• You can also use the output of PCA for making predictions
• Each PCA component is a linear combination of the input features
• We can plot a correlation plot to see how the components relate to the original features

Applying PCA on half-moon data
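A minimal sketch of this step, assuming scikit-learn's make_moons and PCA (the exact code and settings on the original slide may differ):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA

# Two interleaving half-moons in 2D.
X, y = make_moons(n_samples=200, noise=0.05, random_state=42)

# Rotate the data onto its principal components.
X_pca = PCA(n_components=2).fit_transform(X)

# Plot the data before and after the PCA rotation.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 0], X[:, 1], c=y)
ax1.set_title("Before PCA")
ax2.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
ax2.set_title("After PCA")
plt.show()
```

Since plain PCA is a linear rotation, the two moons remain interleaved after the transform; this is what motivates Kernel PCA later in the lecture.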

Before and After PCA

Correlation plot
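A sketch of one way to build such a correlation plot, relating each original feature to each principal-component score (printed as a matrix here for brevity):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA

X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
X_pca = PCA(n_components=2).fit_transform(X)

# Correlation of original features (rows) with components (columns):
# np.corrcoef stacks the two 2x200 arrays into one 4x4 matrix, from
# which we slice the feature-vs-component block.
corr = np.corrcoef(X.T, X_pca.T)[:2, 2:]
print(corr)
```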

Kernel PCA

Kernel PCA is an extension of PCA that uses techniques from kernel methods. Using a kernel, the originally linear operations of PCA are performed in a reproducing kernel Hilbert space.

• Kernel PCA just performs PCA in a new space
• It uses the kernel trick to find principal components in a different (possibly high-dimensional) space
• Kernel PCA has been demonstrated to be useful for novelty detection and image de-noising
• PCA allows us to reconstruct the pre-image; in KPCA this is sometimes not possible

Let's run Kernel PCA on the moon data
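A hedged sketch, assuming scikit-learn's KernelPCA with an RBF kernel; gamma=15 is an illustrative choice, not a tuned value:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

X, y = make_moons(n_samples=200, noise=0.05, random_state=42)

# The RBF kernel implicitly maps the moons into a space where
# linear PCA can pull them apart.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)

plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Kernel PCA on half-moon data")
plt.show()
```

In the resulting Kernel PCA plot, the two moons become close to linearly separable along the first component.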

Kernel PCA plot

What is Bias?

Most of you might have watched, or at least heard about, the popular Netflix series 'The Queen's Gambit'. The series excellently captured the struggles of women in society and offers one of the best examples of gender bias.

What is Data Bias?

Data bias in machine learning is a type of error in which certain elements of a dataset
are more heavily weighted and/or represented than others. A biased dataset does not
accurately represent a model’s use case, resulting in skewed outcomes, low accuracy
levels, and analytical errors.

How serious are the implications of neglecting bias in the data?

As data scientists, we know that if our data sample does not represent the whole population, then our results are not statistically significant, which means that we do not get accurate results.

Example: AI algorithms developed to detect skin cancer as accurately as an experienced dermatologist failed to detect skin cancers in people with dark skin.

How can AI bias occur?

• Missing diverse demographic categories
• Bias inherited from humans
• During the feature engineering phase

Let's dive into real-world data

This data set is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine:
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline'
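A minimal loading sketch with pandas; the column names mirror the attribute list above, and the first column of the file is the cultivar label (1-3):

```python
import pandas as pd

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine/wine.data")
columns = ["Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash",
           "Magnesium", "Total phenols", "Flavanoids",
           "Nonflavanoid phenols", "Proanthocyanins", "Color intensity",
           "Hue", "OD280/OD315 of diluted wines", "Proline"]

# The file has no header row, so supply the names explicitly.
wine = pd.read_csv(url, header=None, names=columns)
print(wine.head())
```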

Let's cluster the wine dataset

• Clustering on true labels
• Number of clusters = 3 (three types of wine)

Let's cluster the wine dataset

• Clustering on unlabeled data using two variables, 'Alcohol' and 'Ash' (see the sketch below)
• Number of clusters = 3 (three types of wine)
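A sketch of this step, assuming k-means as the clustering algorithm and continuing from the loading sketch above (which defines `wine`):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Cluster on just two variables, ignoring the true labels.
X2 = wine[["Alcohol", "Ash"]].values

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X2)

plt.scatter(X2[:, 0], X2[:, 1], c=labels)
plt.xlabel("Alcohol")
plt.ylabel("Ash")
plt.title("k-means clusters (k = 3) on two wine features")
plt.show()
```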

Let’s do PCA on wine data to reduce dimensionality

We will get the principal components on the X-axis and the explained variance ratio on the Y-axis.

Note: Here the data is split into a 70:30 ratio.
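A hedged sketch of this step; the 70:30 split comes from the slide, while the standardization and random seed are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# `wine` comes from the loading sketch above.
X = wine.drop(columns="Class").values
y = wine["Class"].values

# 70:30 train/test split, as noted on the slide.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# PCA is scale-sensitive, so standardize (fit on training data only).
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

pca = PCA().fit(X_train_std)

# Principal components on the X-axis, explained variance ratio on the Y-axis.
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1),
        pca.explained_variance_ratio_)
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.show()
```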

Let’s do PCA ...

We will get a plot of the two best principal components (unsupervised mode).

Let’s do PCA ...

We will get a plot of the two best principal components (supervised mode).
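A sketch covering both plots, reusing `pca`, `X_train_std`, and `y_train` from the previous sketch: the same component scores, drawn first without labels (unsupervised view) and then colored by the true cultivar (supervised view):

```python
import matplotlib.pyplot as plt

# Project the standardized training data onto the first two components.
X_train_2d = pca.transform(X_train_std)[:, :2]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_train_2d[:, 0], X_train_2d[:, 1])             # no labels
ax1.set_title("Unsupervised view")
ax2.scatter(X_train_2d[:, 0], X_train_2d[:, 1], c=y_train)  # true labels
ax2.set_title("Supervised view")
for ax in (ax1, ax2):
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
plt.show()
```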

Let’s train a Logistic Regression Model to classify the test data
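A minimal sketch, continuing from the PCA sketches above: fit logistic regression on the first two principal components and score it on the held-out 30%:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Use the first two principal components as input features.
X_train_2d = pca.transform(X_train_std)[:, :2]
X_test_2d = pca.transform(X_test_std)[:, :2]

clf = LogisticRegression(max_iter=1000).fit(X_train_2d, y_train)
y_pred = clf.predict(X_test_2d)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```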

Explanation

