
Introduction to Data Analytics
ITE 5201
Lecture 5: Data Visualization (Part 2)
Instructor: Parisa Pouladzadeh
Email: parisa.pouladzadeh@humber.ca
Statistical concepts of classification of Data

➢Classification is the process of arranging data into homogeneous (similar) groups according to their common characteristics.
➢Raw data cannot be easily understood and is not fit for further analysis and interpretation. Arranging data helps users compare and analyse it, and it is also important for statistical sampling.



Classification of Data
There are four types of classification. They are:
➢Geographical classification
➢ When data are classified on the basis of location or area, it is called geographical classification.

➢Chronological classification
➢ Chronological classification means classification on the basis of time, such as months, years, etc.

➢Qualitative classification
➢ In qualitative classification, data are classified on the basis of an attribute or quality such as gender, hair colour, literacy or religion. The attribute under study cannot be measured; one can only determine whether it is present or absent in the units of study.

➢Quantitative classification
➢ Quantitative classification refers to classifying data according to characteristics that can be measured, such as height, weight, income, profits, etc.



Quantitative classification

➢There are two types of quantitative classification of data: discrete frequency distribution and continuous frequency distribution.
➢In this type of classification there are two elements:
➢Variable
➢ A variable refers to a characteristic that varies in magnitude or quantity, e.g. the weight of students. A variable may be discrete or continuous.

➢Frequency
➢ Frequency refers to the number of times each value of the variable occurs. For example, if 50 students have a weight of 60 kg, then 50 is the frequency for that value.



Frequency Distributions

➢After collecting data, the first task for a researcher is to organize and
simplify the data so that it is possible to get a general overview of the
results.

➢This is the goal of descriptive statistical techniques.

➢One method for simplifying and organizing data is to construct a frequency distribution.



Frequency Distributions
➢The following technical terms are important when a continuous frequency distribution is formed:
➢Class limits: Class limits are the lowest and highest values that can be included in a class. For example, take the class 51–55: the lowest value of the class is 51 and the highest is 55, and no value less than 51 or greater than 55 can belong to it. 51 is the lower class limit and 55 is the upper class limit.
➢Class interval: The difference between the upper and lower limits of a class is known as the class interval of that class.
➢Class frequency: The number of observations corresponding to a particular class is known as the frequency of that class.
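For illustration, a frequency distribution with class limits like these can be built directly in Python with pandas. This is a minimal sketch; the weights and the class boundaries are made up for the example.

import pandas as pd

# hypothetical weights (kg) of 15 students
weights = pd.Series([52, 54, 55, 57, 58, 58, 60, 61, 61, 62, 64, 66, 67, 69, 70])

# classes of width 5; right=True means each class includes its upper class limit
classes = pd.cut(weights, bins=[50, 55, 60, 65, 70], right=True)

# class frequency = number of observations falling in each class
print(classes.value_counts().sort_index())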



Frequency Distributions

➢A frequency distribution is an organized tabulation showing exactly how many individuals are located in each category on the scale of measurement.

➢A frequency distribution presents an organized picture of the entire set of scores, and it shows where each individual is located relative to others in the distribution.

➢In a frequency distribution graph, the score categories (X values) are listed on the X axis and the frequencies are listed on the Y axis.



Histograms
➢A histogram shows a variable’s distribution as a set of
adjacent rectangles on a data chart.

➢A graphical display of data using bars of different heights.

➢It is similar to a bar chart, but a histogram groups numbers into ranges.

➢Histograms represent counts of data within a numerical range of values.
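A minimal histogram sketch in Python, assuming matplotlib is available; the scores are randomly generated for the example.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
scores = np.random.normal(loc=70, scale=10, size=200)  # made-up exam scores

# bins=10 groups the numbers into 10 adjacent ranges; adjacent bars have no gaps
plt.hist(scores, bins=10, edgecolor='black')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.title('Histogram of scores')
plt.show()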



Histograms graph

➢Frequency distribution graphs are useful because they show the entire set of scores.
➢At a glance, you can determine the highest score, the lowest
score, and where the scores are centered.
➢The graph also shows whether the scores are clustered together
or scattered over a wide range.



Smooth curve
➢If the scores in the population are measured on an interval or ratio
scale, it is customary to present the distribution as a smooth curve.

➢The smooth curve emphasizes the fact that the distribution is not
showing the exact frequency for each category.



Histograms and Bar graph

➢A histogram represents the frequency distribution of continuous variables.
➢A bar graph is a diagrammatic comparison of discrete variables.
➢A histogram presents numerical data, whereas a bar graph shows categorical data.
➢A histogram is drawn in such a way that there is no gap between the bars.



Shape

➢A graph shows the shape of the distribution.


➢A distribution is symmetrical if the left side of the graph is (roughly) a
mirror image of the right side.

➢On the other hand, distributions are skewed when scores pile up on one side
of the distribution, leaving a "tail" of a few extreme values on the other side.



Positively, Negatively Skewed Distributions

➢In a positively skewed distribution, the scores tend to pile up on the left side of the distribution and the tail points to the right.

➢In a negatively skewed distribution, the scores tend to pile up on the right side and the tail points to the left.



Positively, Negatively Skewed Distributions
[Figure: positively and negatively skewed distributions]

Example
[Figure]


Measures of Central Tendency
In statistics, a measure of central tendency is a descriptive summary of a dataset: it reflects the centre of the data distribution through a single value.
It does not provide information about the individual data points; rather, it summarizes the dataset as a whole. The central tendency of a dataset can be described using several measures in statistics.



Mean
The mean represents the average value of the dataset.
It is calculated as the sum of all the values in the dataset divided by the number of values; in general, this is the arithmetic mean.
Some other measures of mean used to find the central tendency
are as follows:
◦ Geometric Mean (nth root of the product of n numbers)
◦ Harmonic Mean (the reciprocal of the average of the reciprocals)
◦ Weighted Mean (where some values contribute more than others)

If all the values in the dataset are the same, the arithmetic, geometric and harmonic means are all equal. If there is variability in the data, the three means generally differ.
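A short sketch of these means in Python (scipy provides gmean and hmean; the numbers and weights are arbitrary):

import numpy as np
from scipy.stats import gmean, hmean

x = np.array([2.0, 4.0, 8.0])
w = np.array([0.5, 0.3, 0.2])              # example weights that sum to 1

print(np.mean(x))                          # arithmetic mean: (2 + 4 + 8) / 3
print(gmean(x))                            # geometric mean: (2 * 4 * 8) ** (1/3)
print(hmean(x))                            # harmonic mean: 3 / (1/2 + 1/4 + 1/8)
print(np.average(x, weights=w))            # weighted mean: 0.5*2 + 0.3*4 + 0.2*8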
Median

❑The median is the middle value of the dataset when the values are arranged in ascending or descending order.
❑When the dataset contains an even number of values, the median is found by taking the mean of the middle two values.
❑If you have a skewed distribution, the best measure of central tendency is the median.



Mode
The mode represents the most frequently occurring value in the dataset.
Sometimes a dataset may contain multiple modes, and in some cases it does not contain any mode at all.
If you have categorical data, the mode is the best choice for describing the central tendency.
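A minimal sketch of mean, median and mode with pandas (the values are arbitrary; the 100 acts as an outlier that pulls the mean but not the median):

import pandas as pd

s = pd.Series([1, 2, 2, 3, 4, 100])   # skewed by the outlier 100

print(s.mean())     # pulled upward by the outlier
print(s.median())   # middle of the sorted values: mean of 2 and 3 = 2.5
print(s.mode())     # most frequently occurring value: 2 (mode() returns a Series)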



Box Plotting

[Figure: example boxplot]

A boxplot is a standardized way of displaying the distribution of data based on a five-number summary ("minimum", first quartile (Q1), median, third quartile (Q3), and "maximum"). It can tell you about your outliers and what their values are. It can also tell you whether your data are symmetrical, how tightly they are grouped, and whether and how they are skewed.
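A minimal boxplot sketch with seaborn (assuming seaborn and matplotlib; the data are random, with two artificial outliers added):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(1)
data = np.concatenate([np.random.normal(50, 5, 100), [80, 85]])  # two outliers

# the box spans Q1..Q3, the line inside it is the median, the whiskers reach the
# "minimum"/"maximum" within 1.5*IQR, and points beyond them are drawn as outliers
sns.boxplot(x=data)
plt.show()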



Statistical plots

◦ Statistical plots allow viewers to:
◦ Identify outliers
◦ Visualize distributions
◦ Deduce variable types
◦ Discover relationships and correlations between variables in a dataset.



Scatter plots
➢Scatter plots are useful when you want to explore interrelations
or dependencies between two different variables.

➢These data graphics are ideal for visually spotting outliers and
trends in data.

➢Scatter plots are used when you want to show the relationship
between two variables.

➢Scatter plots are sometimes called correlation plots because they show how two variables are correlated.
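A minimal scatter-plot sketch with seaborn, assuming the mtcars.csv file that is used later in the lecture notebook:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

cars = pd.read_csv('Downloads/mtcars.csv')   # path assumed, as in the notebook

# each point is one car; a downward trend suggests a negative relationship
sns.scatterplot(x='wt', y='mpg', data=cars)
plt.show()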



Scatter plot
➢Direction:
◦ Positive: as one variable increases so does the other. Height and shoe size are an example; as
one's height increases so does the shoe size.

◦ Negative: as one variable increases, the other decreases. Time spent studying and time spent
on video games are negatively correlated; as your time studying increases, time spent on
video games decreases.

◦ No correlation: there is no apparent relationship between the variables. Video game scores
and shoe size appear to have no correlation; as one increases, the other one is not affected.

➢Form:
◦ Linear
◦ Nonlinear

➢Strength:
◦ Weak
◦ Moderate
◦ Strong

➢Outliers
Direction
[Figure: scatter plots showing positive and negative direction]

Form
[Figure: scatter plots showing linear and non-linear form]

Strength
[Figure: scatter plots showing strong, moderate and weak relationships]

Strength: Perfect
[Figure: scatter plots showing perfect relationships]

Strength: No association
[Figure: scatter plot showing no association]

Outlier
[Figure: scatter plot containing an outlier]

Example
[Figure]


Scatter plot Matrices
➢A scatterplot matrix is a collection of scatterplots organized
into a grid (or matrix).
➢A scatter plot matrix can show how multiple variables are
related.
➢After plotting all the two-way combinations of the
variables, the matrix can show relationships between
variables to highlight which relationships are likely to be
important.
➢The matrix can also identify outliers in multiple scatter
plots.
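A scatter plot matrix can be produced with seaborn's pairplot (used again in the lecture notebook); a minimal sketch, with an arbitrary subset of columns:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

cars = pd.read_csv('Downloads/mtcars.csv')   # path assumed, as in the notebook

# every pairwise combination of the chosen columns appears as a scatter plot;
# the diagonal shows each variable's own distribution
sns.pairplot(cars[['mpg', 'hp', 'wt', 'qsec']])
plt.show()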



Scatter plot Matrices
[Figure: scatterplot matrix example]


Introduction to Data Analytics
ITE 5201
Lecture 5: Data Types and Statistics
Instructor: Parisa Pouladzadeh
Email: parisa.pouladzadeh@humber.ca
Data Types
➢Ordinal variables
➢Binary data
➢Discrete and continuous data
➢Interval and ratio variables
➢Qualitative and quantitative traits/characteristics of data



Categorical data



Nominal Data
➢A type of categorical data in which objects fall into unordered categories.
Example:
➢Type of Bicycle
➢Mountain bike, road bike, chopper, folding, BMX.

➢Ethnicity
➢White British, Afro-Caribbean, Asian, Chinese, other, etc. (note problems with
these categories).

➢Smoking status
➢smoker, non-smoker



Ordinal Data
A type of categorical data in which order is important.

➢Class of degree: 1st class, 2:1, 2:2, 3rd class, fail
➢Degree of illness: none, mild, moderate, acute, chronic
➢Opinion of students about stats classes: very unhappy, unhappy, neutral, happy, ecstatic!



Binary Data
A type of categorical data in which there are only two categories.
Binary data can either be nominal or ordinal.
➢Smoking status: smoker, non-smoker
➢Attendance: present, absent
➢Class of mark: pass, fail
➢Status of student: undergraduate, postgraduate



Quantitative Data
The objects being studied are 'measured' on the basis of some quantitative trait.
The resulting data are a set of numbers.
Example:
➢Pulse rate
➢Height
➢Age
➢Exam marks
➢Size of bicycle frame
➢Time to complete a statistics test
➢Number of cigarettes smoked
Quantitative data can be classified as discrete or continuous.

Discrete Data
Only certain values are possible (there are gaps between the possible values).

Continuous Data
Theoretically, any value within an interval is possible with a fine enough measuring device.



Examples: Discrete Data
➢Number of children in a family
➢Number of students passing a stats exam
➢Number of crimes reported to the police
➢Number of bicycles sold in a day.
➢Generally, discrete data are counts.

We would not expect to find 2.2 children in a family, 88.5 students passing an exam, a fractional number of crimes reported to the police, or half a bicycle being sold in one day.



Examples: Continuous data
➢Size of bicycle frame
➢Height
➢Time to run 500 meters
➢Age
Generally, continuous data come from measurements (any value within an interval is possible with a fine enough measuring device).



Relationships between Variables



Absolute Measure of Dispersion
An absolute measure of dispersion is expressed in the same units as the original data set.

➢Range
➢Variance and Standard Deviation
➢Quartiles and Quartile Deviation
➢Mean and Mean Deviation



Range
It is the simplest measure of dispersion. It is defined as the difference between the largest and the smallest item in a given distribution:

Range = Largest item (L) – Smallest item (S)

Interquartile Range
It is defined as the difference between the upper quartile and the lower quartile of a given distribution:

Interquartile Range = Upper Quartile (Q3) – Lower Quartile (Q1)
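A minimal sketch computing the range and the interquartile range with numpy (the values are arbitrary):

import numpy as np

x = np.array([4, 7, 9, 11, 12, 15, 20])

data_range = x.max() - x.min()        # Range = L - S = 20 - 4 = 16
q1, q3 = np.percentile(x, [25, 75])   # lower and upper quartiles
iqr = q3 - q1                         # Interquartile Range = Q3 - Q1
print(data_range, q1, q3, iqr)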



Variance and Standard Deviation
Variance is a measure of how far the data points are spread out from their mean value.
Standard deviation is the square root of the variance; it is a measure of variability and, for a normal distribution, it determines the width of the curve.
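A minimal sketch with numpy and pandas. Note that numpy's default is the population formula (divide by n, ddof=0), while pandas defaults to the sample formula (divide by n-1, ddof=1), which is what cars.var() and cars.std() use later in the notebook.

import numpy as np
import pandas as pd

x = np.array([2.0, 4.0, 6.0, 8.0])

print(np.var(x), np.std(x))       # population variance and standard deviation
print(np.var(x, ddof=1))          # sample variance (divide by n - 1)

s = pd.Series(x)
print(s.var(), s.std())           # pandas defaults to the sample versions (ddof=1)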



Measures of strength of a relationship
(Correlation)
Pearson’s correlation coefficient (r)
Spearman’s rank correlation coefficient (rho, ρ)



[Figure: scatter plots illustrating a perfect linear relationship (no error), a strong linear relationship (small error), a weak linear relationship (large error), and no linear relationship]
Pearson's correlation coefficient is defined as below:

$$ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \;\sum_{i=1}^{n}(y_i-\bar{y})^2}} $$

Equivalently, r = Cov(X, Y) / (S_X S_Y), where Cov is the covariance and S is the standard deviation.

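A small sketch computing r directly from this definition and checking it against scipy (the data are arbitrary):

import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# S_xy, S_xx and S_yy exactly as in the formula above
sxy = np.sum((x - x.mean()) * (y - y.mean()))
sxx = np.sum((x - x.mean()) ** 2)
syy = np.sum((y - y.mean()) ** 2)
r_manual = sxy / np.sqrt(sxx * syy)

r_scipy, p_value = pearsonr(x, y)
print(r_manual, r_scipy)          # the two values agree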


Properties of Pearson’s correlation coefficient r

1. The value of r is always between –1 and +1.
2. If the relationship between X and Y is positive, then r will be positive.
3. If the relationship between X and Y is negative, then r will be negative.
4. If there is no relationship between X and Y, then r will be zero.
5. The value of r will be +1 if the points (xi, yi) lie on a straight line with positive slope.
6. The value of r will be –1 if the points (xi, yi) lie on a straight line with negative slope.



Spearman’s rank correlation coefficient ρ (rho)


Spearman’s rank correlation coefficient ρ (rho)
➢Spearman’s rank correlation coefficient is computed as follows:
➢Arrange the observations on X in increasing order and assign them the ranks 1, 2, 3, …, n.
➢Arrange the observations on Y in increasing order and assign them the ranks 1, 2, 3, …, n.
➢For each case $i$, let $(x_i, y_i)$ denote the observations on X and Y, and let $(r_i, s_i)$ denote the corresponding ranks on X and Y.



If the variables X and Y are strongly positively correlated, the ranks on X should generally agree with the ranks on Y (the largest X should be the largest Y, and the smallest X should be the smallest Y).
If the variables X and Y are strongly negatively correlated, the ranks on X should be in the reverse order to the ranks on Y (the largest X should be the smallest Y, and the smallest X should be the largest Y).
If the variables X and Y are uncorrelated, the ranks on X should be randomly distributed relative to the ranks on Y.



For each case, let $d_i = r_i - s_i$ be the difference in the two ranks. Then Spearman's rank correlation coefficient ρ is defined as follows:

$$ \rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$
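A minimal sketch of this rank-based formula, using pandas to assign the ranks (arbitrary data with no ties):

import pandas as pd

x = pd.Series([10, 20, 30, 40, 50])
y = pd.Series([12, 28, 15, 45, 60])

r = x.rank()                 # ranks on X
s = y.rank()                 # ranks on Y
d = r - s                    # rank differences d_i
n = len(x)

rho = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))
print(rho)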


Properties of Spearman’s rank correlation
coefficient r

1. The value of r is always between –1 and +1.
2. If the relationship between X and Y is positive, then r will be positive.
3. If the relationship between X and Y is negative, then r will be negative.
4. If there is no relationship between X and Y, then r will be zero.
5. The value of r will be +1 if the ranks of X completely agree with the ranks
of Y.
6. The value of r will be -1 if the ranks of X are in reverse order to the ranks
of Y.



Example

xi: 25.0  33.9  16.7  37.4  24.6  17.3  40.2
yi: 24.3  38.7  13.4  32.1  28.0  12.5  44.9

Ranking the X's and the Y's we get:
ri: 4  5  1  6  3  2  7
si: 3  6  2  5  4  1  7

Computing the differences in ranks gives us:
di: 1  -1  -1  1  -1  1  0

$$ \sum_{i=1}^{n} d_i^2 = 6 $$
$$ \rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)} = 1 - \frac{6(6)}{7(7^2-1)} = 1 - \frac{36}{7(48)} = \frac{25}{28} = 0.893 $$
Computing Pearson's correlation coefficient r for the same problem:

$$ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \;\sum_{i=1}^{n}(y_i-\bar{y})^2}} $$


Calculation
X      Y      x−x̄      y−ȳ      (x−x̄)(y−ȳ)      (x−x̄)²      (y−ȳ)²
25 24.3 -2.87 -3.4 9.758 8.2369 11.56
33.9 38.7 6.03 11 66.33 36.3609 121
16.7 13.4 -11.17 -14.3 159.731 124.7689 204.49
37.4 32.1 9.53 4.4 41.932 90.8209 19.36
24.6 28 -3.27 0.3 -0.981 10.6929 0.09
17.3 12.5 -10.57 -15.2 160.664 111.7249 231.04
40.2 44.9 12.33 17.2 212.076 152.0289 295.84

Sum 649.51 534.6343 883.38



$$ S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} $$

$$ S_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} $$

$$ S_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} $$
To compute $S_{xx}$, $S_{yy}$ and $S_{xy}$, first compute

$$ A = \sum_{i=1}^{n} x_i = 195.1, \quad B = \sum_{i=1}^{n} y_i = 193.9 $$
$$ C = \sum_{i=1}^{n} x_i^2 = 5972.35, \quad D = \sum_{i=1}^{n} y_i^2 = 6254.41 $$
$$ E = \sum_{i=1}^{n} x_i y_i = 6053.78 $$
Then

$$ S_{xx} = C - \frac{A^2}{n} = 5972.35 - \frac{195.1^2}{7} = 534.63 $$
$$ S_{yy} = D - \frac{B^2}{n} = 6254.41 - \frac{193.9^2}{7} = 883.38 $$
$$ S_{xy} = E - \frac{AB}{n} = 6053.78 - \frac{(195.1)(193.9)}{7} = 649.51 $$


and

$$ r = \frac{649.51}{\sqrt{534.63 \times 883.38}} = 0.945 $$

Compare this with Spearman's ρ = 0.893.
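This worked example can be reproduced with scipy, which gives the same two values:

from scipy.stats import pearsonr, spearmanr

x = [25.0, 33.9, 16.7, 37.4, 24.6, 17.3, 40.2]
y = [24.3, 38.7, 13.4, 32.1, 28.0, 12.5, 44.9]

r, _ = pearsonr(x, y)        # Pearson's r, approximately 0.945
rho, _ = spearmanr(x, y)     # Spearman's rho, approximately 0.893
print(round(r, 3), round(rho, 3))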
Comments

Spearman's ρ is less sensitive to extreme observations (outliers); the value of Pearson's r is much more sensitive to extreme outliers.

This is similar to the comparison between the median and the mean, or between the standard deviation and the pseudo-standard deviation: the mean and standard deviation are more sensitive to outliers than the median and pseudo-standard deviation.
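A small synthetic illustration of this point: a single extreme (but rank-preserving) outlier pulls Pearson's r down noticeably while leaving Spearman's ρ unchanged.

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1.0, 11.0)
y = x + np.array([0.2, -0.1, 0.3, 0.0, -0.2, 0.1, -0.3, 0.2, 0.0, 0.1])

print(pearsonr(x, y)[0], spearmanr(x, y)[0])        # both close to +1

y_out = y.copy()
y_out[-1] = 1000.0            # one extreme outlier (still the largest value)

# Pearson's r drops; Spearman's rho stays at 1.0 because the ranks are unchanged
print(pearsonr(x, y_out)[0], spearmanr(x, y_out)[0])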



Lecture 6-Part 2

Matrix Plots (Heatmap)

Matrix plots allow you to plot data as color-encoded matrices and can also be used to indicate clusters within the data.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
import seaborn as sns
%matplotlib inline

Import Dataset

In [3]:
cars = pd.read_csv('Downloads/mtcars.csv')
cars.columns

Out[3]: Index(['name', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am',
'gear', 'carb'],
dtype='object')

In [20]: cars.head()

Out[20]: name mpg cyl disp hp drat wt qsec vs am gear carb

0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4

1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4

2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1

3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1

4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

Heatmap
In order for a heatmap to work properly, your data should already be in matrix form; the sns.heatmap function basically just colors it in for you. For example:


In [4]:
# correlation data
cars.corr()

Out[4]: mpg cyl disp hp drat wt qsec vs

mpg 1.000000 -0.852162 -0.847551 -0.776168 0.681172 -0.867659 0.418684 0.664039 0.599

cyl -0.852162 1.000000 0.902033 0.832447 -0.699938 0.782496 -0.591242 -0.810812 -0.522

disp -0.847551 0.902033 1.000000 0.790949 -0.710214 0.887980 -0.433698 -0.710416 -0.591

hp -0.776168 0.832447 0.790949 1.000000 -0.448759 0.658748 -0.708223 -0.723097 -0.243

drat 0.681172 -0.699938 -0.710214 -0.448759 1.000000 -0.712441 0.091205 0.440278 0.712

wt -0.867659 0.782496 0.887980 0.658748 -0.712441 1.000000 -0.174716 -0.554916 -0.692

qsec 0.418684 -0.591242 -0.433698 -0.708223 0.091205 -0.174716 1.000000 0.744535 -0.229

vs 0.664039 -0.810812 -0.710416 -0.723097 0.440278 -0.554916 0.744535 1.000000 0.168

am 0.599832 -0.522607 -0.591227 -0.243204 0.712711 -0.692495 -0.229861 0.168345 1.000

gear 0.480285 -0.492687 -0.555569 -0.125704 0.699610 -0.583287 -0.212682 0.206023 0.794

carb -0.550925 0.526988 0.394977 0.749812 -0.090790 0.427606 -0.656249 -0.569607 0.057

In [7]: sns.heatmap(cars.corr())

Out[7]: <matplotlib.axes._subplots.AxesSubplot at 0x210394dd388>


In [8]: sns.heatmap(cars.corr(), cmap='coolwarm', annot=True)

Out[8]: <matplotlib.axes._subplots.AxesSubplot at 0x21039ce7b88>

In [26]:
import scipy
from scipy.stats.stats import pearsonr
from scipy.stats.stats import spearmanr


The Pearson Correlation

In [12]:
X = cars[['mpg', 'hp', 'qsec', 'wt']]
sns.pairplot(X)

Out[12]: <seaborn.axisgrid.PairGrid at 0x21039f58e08>

Here, we will calculate the Pearson correlation between different variables.

In [22]:
mpg = cars['mpg']
hp = cars['hp']
qsec = cars['qsec']
wt = cars['wt']


In [23]:
pearsonr_coefficient, p_value = pearsonr(mpg, hp)
print('Pearson Correlation Coefficient %0.3f' % (pearsonr_coefficient))

Pearson Correlation Coefficient -0.776

In [ ]:
# Example 1: Please use the empty code cells below to calculate the Pearson
# correlation coefficient for:
# 1- (mpg, qsec)
# 2- (mpg, wt)

Using pandas to calculate the Pearson correlation coefficient

In [24]:
corr = X.corr()
corr

Out[24]: mpg hp qsec wt

mpg 1.000000 -0.776168 0.418684 -0.867659

hp -0.776168 1.000000 -0.708223 0.658748

qsec 0.418684 -0.708223 1.000000 -0.174716

wt -0.867659 0.658748 -0.174716 1.000000

In [ ]: # Example 2: What do the dark (black) shades and light (white) shades indicate?


The Spearman Rank Correlation


In [28]:
X = cars[['cyl', 'vs', 'am', 'gear']]
sns.pairplot(X)

Out[28]: <seaborn.axisgrid.PairGrid at 0x2103a854388>

In [32]:
cyl = cars['cyl']
vs = cars['vs']
am = cars['am']
gear = cars['gear']
spearmanr_coefficient, p_value = spearmanr(cyl, vs)
print('Spearman Rank Correlation Coefficient %0.3f' % (spearmanr_coefficient))

Spearman Rank Correlation Coefficient -0.814


Chi-square test for independence


In [33]:
table = pd.crosstab(cyl, am)
from scipy.stats import chi2_contingency
chi2, p, dof, expected = chi2_contingency(table.values)
print('Chi-square statistic %0.3f p_value %0.3f' % (chi2, p))

Chi-square statistic 8.741 p_value 0.013

Math and Statistics


In [34]:
cars = pd.read_csv('Downloads/mtcars.csv')
cars.columns

Out[34]: Index(['name', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am',
'gear', 'carb'],
dtype='object')

Looking at summary statistics that describe a variable's numeric values

In [35]: cars.sum()

Out[35]: name Mazda RX4Mazda RX4 WagDatsun 710Hornet 4 Drive...


mpg 642.9
cyl 198
disp 7383.1
hp 4694
drat 115.09
wt 102.952
qsec 571.16
vs 14
am 13
gear 118
carb 90
dtype: object


In [36]: cars.sum(axis=1)

Out[36]: 0 328.980
1 329.795
2 259.580
3 426.135
4 590.310
5 385.540
6 656.920
7 270.980
8 299.570
9 350.460
10 349.660
11 510.740
12 511.500
13 509.850
14 728.560
15 726.644
16 725.695
17 213.850
18 195.165
19 206.955
20 273.775
21 519.650
22 506.085
23 646.280
24 631.175
25 208.215
26 272.570
27 273.683
28 670.690
29 379.590
30 694.710
31 288.890
dtype: float64

In [ ]: # Example: What is the difference between cars.sum() and cars.sum(axis=1)?

In [37]: cars.median()

Out[37]: mpg 19.200


cyl 6.000
disp 196.300
hp 123.000
drat 3.695
wt 3.325
qsec 17.710
vs 0.000
am 0.000
gear 4.000
carb 2.000
dtype: float64


In [38]: cars.mean()

Out[38]: mpg 20.090625


cyl 6.187500
disp 230.721875
hp 146.687500
drat 3.596563
wt 3.217250
qsec 17.848750
vs 0.437500
am 0.406250
gear 3.687500
carb 2.812500
dtype: float64

In [ ]: # Example: What is the value of the median and the mean for the gear variable?

In [39]: cars.max()

Out[39]: name Volvo 142E


mpg 33.9
cyl 8
disp 472
hp 335
drat 4.93
wt 5.424
qsec 22.9
vs 1
am 1
gear 5
carb 8
dtype: object

Looking at summary statistics that describe a variable's distribution

We will calculate the standard deviation and the variance for the cars dataset.


In [41]: cars.std()

Out[41]: mpg 6.026948


cyl 1.785922
disp 123.938694
hp 68.562868
drat 0.534679
wt 0.978457
qsec 1.786943
vs 0.504016
am 0.498991
gear 0.737804
carb 1.615200
dtype: float64

In [42]: cars.var()

Out[42]: mpg 36.324103


cyl 3.189516
disp 15360.799829
hp 4700.866935
drat 0.285881
wt 0.957379
qsec 3.193166
vs 0.254032
am 0.248992
gear 0.544355
carb 2.608871
dtype: float64

In [43]: cars.describe()

Out[43]: mpg cyl disp hp drat wt qsec vs

count 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000

mean 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 0.437500

std 6.026948 1.785922 123.938694 68.562868 0.534679 0.978457 1.786943 0.504016

min 10.400000 4.000000 71.100000 52.000000 2.760000 1.513000 14.500000 0.000000

25% 15.425000 4.000000 120.825000 96.500000 3.080000 2.581250 16.892500 0.000000

50% 19.200000 6.000000 196.300000 123.000000 3.695000 3.325000 17.710000 0.000000

75% 22.800000 8.000000 326.000000 180.000000 3.920000 3.610000 18.900000 1.000000

max 33.900000 8.000000 472.000000 335.000000 4.930000 5.424000 22.900000 1.000000


