Introduction To Data Analytics: ITE 5201 Lecture5-Data Visualization-2
Instructor: Parisa Pouladzadeh
Email: parisa.pouladzadeh@humber.ca
Statistical concepts of classification of Data
➢Chronological classification
➢ Chronological classification groups data on the basis of time, such as months or
years.
➢Qualitative classification
➢ In qualitative classification, data are classified on the basis of an attribute or
quality such as gender, colour of hair, literacy, or religion. In this type of
classification, the attribute under study cannot be measured. It can only be determined
whether it is present or absent in the units of study.
➢Quantitative classification
➢ Quantitative classification refers to the classification of data according to
characteristics that can be measured, such as height, weight, income, or profits.
➢Frequency
➢ Frequency refers to the number of times each value of a variable occurs. For
example, if 50 students have a weight of 60 kg, then 50 is the frequency of
that weight.
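As a minimal sketch of the frequency idea, the snippet below counts how often each value occurs in a small list. The weights are hypothetical values chosen for illustration, not data from the lecture.

```python
from collections import Counter

# Hypothetical student weights in kg (illustrative values only)
weights_kg = [60, 55, 60, 70, 55, 60]

# Counter maps each distinct value to the number of times it occurs
frequency = Counter(weights_kg)

for weight, count in sorted(frequency.items()):
    print(f"{weight} kg: {count} student(s)")
```

Here the frequency of 60 kg is 3, because that value appears three times in the list.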
➢After collecting data, the first task for a researcher is to organize and
simplify the data so that it is possible to get a general overview of the
results.
➢The smooth curve emphasizes the fact that the distribution is not
showing the exact frequency for each category.
➢On the other hand, distributions are skewed when scores pile up on one side
of the distribution, leaving a "tail" of a few extreme values on the other side.
If all the values in a dataset are the same, the arithmetic, geometric, and
harmonic means are all equal. If there is variability in the data, the three
means differ; for positive data the arithmetic mean is the largest and the
harmonic mean is the smallest.
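The statement about the three means can be checked with a short sketch using Python's standard `statistics` module (Python 3.8 or later is assumed for `geometric_mean`). The two datasets and the helper name `three_means` are hypothetical and chosen only for illustration.

```python
import statistics

def three_means(values):
    """Return (arithmetic, geometric, harmonic) means of positive values."""
    return (
        statistics.mean(values),
        statistics.geometric_mean(values),
        statistics.harmonic_mean(values),
    )

constant = [4, 4, 4]   # hypothetical data with no variability
varied = [2, 4, 8]     # hypothetical data with variability

print("constant:", three_means(constant))
print("varied:  ", three_means(varied))
```

For the constant data all three values agree (up to floating-point rounding); for the varied data they separate, with the arithmetic mean largest and the harmonic mean smallest.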
Copyright © 2018 Pearson Education, Inc. All Rights Reserved.
Median
➢These data graphics are ideal for visually spotting outliers and
trends in data.
➢Scatter plots are used when you want to show the relationship
between two variables.
◦ Negative: as one variable increases, the other decreases. Time spent studying and time spent
on video games are negatively correlated; as your time studying increases, time spent on
video games decreases.
◦ No correlation: there is no apparent relationship between the variables. Video game scores
and shoe size appear to have no correlation; as one increases, the other one is not affected.
➢Form:
◦ Linear
◦ Nonlinear
➢Strength:
◦ Weak
◦ Moderate
◦ Strong
Outliers
[Figure: scatter plots illustrating direction (positive, negative, perfect positive, perfect negative), no association, and an outlier]
➢Ethnicity
➢White British, Afro-Caribbean, Asian, Chinese, other, etc. (note problems with
these categories).
➢Smoking status
➢smoker, non-smoker
Discrete Data
Only certain values are possible (there are gaps between the possible values).
We would not expect to find 2.2 children in a family, 88.5 students passing an
exam, a fractional number of crimes reported to the police, or half a bicycle sold in one day.
Continuous Data
Theoretically, with a fine enough measuring device, any value within an interval is possible.
➢Range
➢Variance and Standard Deviation
➢Quartiles and Quartile Deviation
➢Mean and Mean Deviation
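The measures listed above can be sketched with Python's standard `statistics` module. The data are hypothetical, and the population forms of variance and standard deviation are used as an assumption, since the slides do not say whether the sample or population version is intended. Quartiles use the default method of `statistics.quantiles`; other quartile conventions give slightly different values.

```python
import statistics

# Hypothetical data (illustrative values only)
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: largest value minus smallest value
data_range = max(data) - min(data)

# Population variance and standard deviation (assumption: population form)
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

# Quartiles and quartile deviation (half the interquartile range)
q1, q2, q3 = statistics.quantiles(data, n=4)
quartile_deviation = (q3 - q1) / 2

# Mean deviation: average absolute distance from the mean
mean_value = statistics.mean(data)
mean_deviation = sum(abs(x - mean_value) for x in data) / len(data)

print("range:", data_range)
print("variance:", variance, "standard deviation:", std_dev)
print("quartiles:", q1, q2, q3, "quartile deviation:", quartile_deviation)
print("mean:", mean_value, "mean deviation:", mean_deviation)
```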
$$
r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$
Cov= covariance
S= Standard deviation
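The correlation formula above can be written out directly in code. This is a minimal sketch rather than a library implementation; the function name `pearson_r` and the sample values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squared deviations and of cross products
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Raises ZeroDivisionError if either variable is constant
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: y rises exactly in step with x, then falls in step
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
print(pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))
```

A perfectly increasing linear relationship gives r = 1 and a perfectly decreasing one gives r = -1, which matches the "perfect positive" and "perfect negative" cases.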
$$
r = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
$$

$$
\sum_{i=1}^{n} d_i^2 = 6
$$
$$
r = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
  = 1 - \frac{6(6)}{7(7^2 - 1)}
  = 1 - \frac{36}{7(48)}
  = 1 - \frac{3}{7(4)}
  = \frac{25}{28} = 0.893
$$
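The arithmetic in this worked example can be verified with exact fractions. The formula used here is the one for Spearman's rank correlation coefficient, where d_i is the difference between the two ranks of observation i. The function name `rank_correlation` is hypothetical; the inputs (sum of squared differences 6, n = 7) are taken from the example.

```python
from fractions import Fraction

def rank_correlation(sum_d_squared, n):
    """r = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), computed exactly."""
    return 1 - Fraction(6 * sum_d_squared, n * (n ** 2 - 1))

r_rank = rank_correlation(6, 7)
print(r_rank)                    # 25/28
print(round(float(r_rank), 3))   # 0.893
```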
Computing Pearson's correlation coefficient,
r, for the same problem:
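$$
r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

$$
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2
       = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}
$$

$$
S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2
       = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}
$$

$$
S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
       = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}
$$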
To compute $S_{xx}$, $S_{yy}$, and $S_{xy}$, first compute

$$
A = \sum_{i=1}^{n} x_i = 195.1 \qquad B = \sum_{i=1}^{n} y_i = 193.9
$$

$$
C = \sum_{i=1}^{n} x_i^2 = 5972.35 \qquad D = \sum_{i=1}^{n} y_i^2 = 6254.41
$$

$$
E = \sum_{i=1}^{n} x_i y_i = 6053.78
$$
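Then

$$
S_{xx} = C - \frac{A^2}{n} = 5972.35 - \frac{195.1^2}{7} = 534.63
$$

$$
S_{yy} = D - \frac{B^2}{n} = 6254.41 - \frac{193.9^2}{7} = 883.38
$$

$$
S_{xy} = E - \frac{A \cdot B}{n} = 6053.78 - \frac{(195.1)(193.9)}{7} = 649.51
$$

$$
r = \frac{649.51}{\sqrt{534.63 \times 883.38}} = 0.945
$$

Compare with
r = 0.893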
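The computation from the summary totals can be reproduced in a few lines. This is a sketch using the totals A to E and n = 7 from the worked example; the function name `pearson_from_sums` is hypothetical.

```python
import math

def pearson_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Return (S_xx, S_yy, S_xy, r) from the five summary totals."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    r = s_xy / math.sqrt(s_xx * s_yy)
    return s_xx, s_yy, s_xy, r

# Totals A, B, C, D, E from the worked example, with n = 7
s_xx, s_yy, s_xy, r_pearson = pearson_from_sums(
    7, 195.1, 193.9, 5972.35, 6254.41, 6053.78
)
print(round(s_xx, 2), round(s_yy, 2), round(s_xy, 2), round(r_pearson, 3))
```

The rounded results agree with the hand calculation: 534.63, 883.38, 649.51, and r = 0.945.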
Comments
The difference between the two correlation values is similar to the difference
between the mean and the median, or between the standard deviation and the
pseudo-standard deviation. The mean and standard deviation are more sensitive
to outliers than the median and pseudo-standard deviation.
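A small sketch makes the sensitivity to outliers concrete: adding one extreme value moves the mean and standard deviation a great deal while the median barely changes. The scores are hypothetical values chosen for illustration.

```python
import statistics

# Hypothetical scores (illustrative values only)
scores = [10, 12, 11, 13, 12]
with_outlier = scores + [100]   # same data plus one extreme value

# How far each summary moves when the outlier is added
mean_shift = statistics.mean(with_outlier) - statistics.mean(scores)
median_shift = statistics.median(with_outlier) - statistics.median(scores)
stdev_shift = statistics.stdev(with_outlier) - statistics.stdev(scores)

print("shift in mean:              ", round(mean_shift, 2))
print("shift in median:            ", round(median_shift, 2))
print("shift in standard deviation:", round(stdev_shift, 2))
```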
Lecture 6-Part 2
Import Dataset
In [3]: import pandas as pd
        # Load the mtcars dataset and list its column names
        cars = pd.read_csv('Downloads/mtcars.csv')
        cars.columns
Out[3]: Index(['name', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am',
'gear', 'carb'],
dtype='object')
In [20]: cars.head()
Heatmap
For a heatmap to work properly, your data should already be in matrix form; the
sns.heatmap function simply colors each cell according to its value. For example:
           mpg       cyl      disp        hp      drat        wt      qsec        vs     am
mpg   1.000000 -0.852162 -0.847551 -0.776168  0.681172 -0.867659  0.418684  0.664039  0.599
cyl  -0.852162  1.000000  0.902033  0.832447 -0.699938  0.782496 -0.591242 -0.810812 -0.522
disp -0.847551  0.902033  1.000000  0.790949 -0.710214  0.887980 -0.433698 -0.710416 -0.591
drat  0.681172 -0.699938 -0.710214 -0.448759  1.000000 -0.712441  0.091205  0.440278  0.712
qsec  0.418684 -0.591242 -0.433698 -0.708223  0.091205 -0.174716  1.000000  0.744535 -0.229
gear  0.480285 -0.492687 -0.555569 -0.125704  0.699610 -0.583287 -0.212682  0.206023  0.794
carb -0.550925  0.526988  0.394977  0.749812 -0.090790  0.427606 -0.656249 -0.569607  0.057
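In [7]: import seaborn as sns
        # numeric_only=True (pandas >= 1.5) skips the text column 'name'
        sns.heatmap(cars.corr(numeric_only=True))
In [8]: sns.heatmap(cars.corr(numeric_only=True), cmap='coolwarm', annot=True)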
In [ ]: #### Example1: Please use the empty code cells below to calculate the pear

        # 1- (mpg, qsec)
        # 2- (mpg, wt)
In [ ]: #### Example 2: What do the dark (black) shades and light (white) shades i
Out[34]: Index(['name', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am',
'gear', 'carb'],
dtype='object')
In [35]: cars.sum()
In [36]: # Row sums over the numeric columns only (skips the text column 'name')
         cars.sum(axis=1, numeric_only=True)
Out[36]: 0 328.980
1 329.795
2 259.580
3 426.135
4 590.310
5 385.540
6 656.920
7 270.980
8 299.570
9 350.460
10 349.660
11 510.740
12 511.500
13 509.850
14 728.560
15 726.644
16 725.695
17 213.850
18 195.165
19 206.955
20 273.775
21 519.650
22 506.085
23 646.280
24 631.175
25 208.215
26 272.570
27 273.683
28 670.690
29 379.590
30 694.710
31 288.890
dtype: float64
In [37]: cars.median(numeric_only=True)
In [38]: cars.mean(numeric_only=True)
In [39]: cars.max()
In [41]: cars.std(numeric_only=True)
In [42]: cars.var(numeric_only=True)
In [43]: cars.describe()
In [ ]: