IE5005 Lecture 02
Semester 1 AY2024/25
Course Outline
01 Descriptive analytics
• Categories of Descriptive Analytics Methods
• Fundamentals for descriptive analytics
• Hands-on demo: descriptive analytics with Python
02 Data visualization tools
• Introduction to data visualization tools
• Hands-on demo with Tableau Public (basics)
• Descriptive analytics encompasses the set of techniques that describe what has
happened in the past. Examples are data queries, reports, descriptive statistics, data
visualization including data dashboards, some data-mining techniques, and basic
what-if spreadsheet models.
• Diagnostic analytics is the process of using data to determine the causes of trends
and correlations between variables. It can be viewed as a logical next step after using
descriptive analytics to identify trends. Examples are hypothesis testing, diagnostic regression
analysis, and correlation/causation analysis.
*Some literature uses three categories instead, treating diagnostic analytics as part of descriptive analytics.
Categorization of analytical methods and models
• Predictive analytics consists of techniques that use models constructed from past data
to predict the future or ascertain the impact of one variable on another. For example,
past data on product sales may be used to construct a mathematical model to predict
future sales. Linear regression, time series analysis, some data-mining techniques,
and simulation, often referred to as risk analysis, all fall under the banner of predictive
analytics.
• Prescriptive analytics indicates a course of action to take; that is, the output of a
prescriptive model is a decision. Predictive models provide a forecast or prediction,
but do not provide a decision. However, a forecast or prediction, when combined with
a rule, becomes a prescriptive model.
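As a minimal sketch of that last point, the Python snippet below pairs a naive forecast with a decision rule. The sales figures, the mean-based forecast, and the reorder rule are all hypothetical and chosen only for illustration; they are not from the lecture.

```python
from statistics import mean

# Hypothetical past weekly sales (illustrative numbers only).
past_sales = [120, 135, 128, 140]


def forecast_next(sales):
    """Predictive step: a naive forecast, the mean of past sales."""
    return mean(sales)


def order_quantity(forecast, stock_on_hand):
    """Prescriptive step: a simple rule that turns the forecast into a decision."""
    return max(0, round(forecast) - stock_on_hand)


forecast = forecast_next(past_sales)  # a prediction, not yet a decision
decision = order_quantity(forecast, stock_on_hand=50)  # the output is a decision
print(f"Forecast: {forecast} units, decision: order {decision} units")
```

The forecast alone says what is expected to happen; only after the rule is applied does the output become an action to take.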
Reading
Check out the article “4 Examples of Business Analytics in Action” from Harvard
Business School. The article reveals how corporations use data insights to optimize their
decision-making process.
What is data?
❑ Data are facts and figures collected, analyzed, and summarized for presentation and
interpretation, including numbers, text, images, audio, video, and so on.
❑ Population vs. sample
[Figure: wellness of patients #1 to #5; values shown: 7.8, 6.1, 4.6, 3.9, 3.5]
Data types
❑ Time series data are repeated measurements of a single entity collected over multiple
points in time or a time period.
[Figure: wellness of patient #1 over time; values shown: 4.9, 4.2, 3.8, 3.9, 3.5]
Data types
❑ Panel data (or time-series cross-section) are repeated measurements of multiple entities
collected over multiple points in time.
[Figure: wellness over time for patients #1, #2, and #3]
Data types
❑ Longitudinal data are repeated observations of a certain measure collected from multiple
entities over some extended time period.
• Trend study: the repeated observations are sampled from …
• Cohort study: the repeated observations are sampled from a cohort of patients under certain …
[Figure: wellness over time; diagram labels: cross-sectional, time series, longitudinal, trend study, cohort study, panel study]
Types of measurement scales
❑ Qualitative variables are variables that can be placed into distinct categories in a
nominal or ordinal way.
• Nominal scale classifies data into mutually exclusive and exhaustive categories in
which no order or ranking can be imposed on the data. Example: gender (M, F).
• Ordinal scale classifies data into categories that can be ranked; however, precise
differences between the ranks do not exist. Example: rating (good, normal, poor).
Quiz 2
Name Weight (kg) Height (cm) Gender Year of Birth Performance
Andrew 77 175 M 1998 good
Bernhard 110 195 M 2003 average
Carolina 70 172 F 1999 average
Dennis 85 180 M 1998 poor
Eve 65 168 F 2002 good
… … … … … …
2. What are the respective scale types of the variables Name, Height, Year of Birth, and
Performance?
A. Nominal, Ordinal, Ordinal, Ordinal
B. Nominal, Ratio, Ordinal, Ordinal
C. Nominal, Ratio, Ratio, Interval
D. Nominal, Ratio, Interval, Ordinal
(Answer)
Name Weight (kg) Height (cm) Gender Year of Birth Performance
Andrew 77 175 M 1998 good
Bernhard 110 195 M 2003 average
Carolina 70 172 F 1999 average
Dennis 85 180 M 1998 poor
Eve 65 168 F 2002 good
Scale: nominal (Name), ratio (Weight), ratio (Height), nominal (Gender), interval (Year of Birth), ordinal (Performance), so the answer is D.
Case Study: 50 wealthiest people in the world
Suppose the ages of the 50 wealthiest people in the world are listed in Forbes magazine.
Here are the data in their original form (what we call raw data):
49 57 38 73 81
74 59 76 65 69
54 56 69 68 78
65 85 49 69 61
48 81 68 37 43
78 82 43 64 67
52 56 81 77 79
85 40 85 59 80
60 71 57 61 69
61 83 90 87 74
Little insight can be gained from these raw data until they are properly organized.
Descriptive univariate analytics
To describe a situation, draw conclusions, or make inferences about events, the data analyst
must organize the data in some meaningful way. The most convenient methods of
organizing data include constructing frequency distributions and computing statistical
measures.
After organizing the data, the analyst must present them to stakeholders. The most
intuitive way is to draw charts or plots.
Frequency table
A frequency table of ‘age’ can be constructed as
After organizing the data into a frequency table, we can analyze the peaks (the classes with the most data values
compared to other classes) and the outliers (extremely large or small values relative to the other data).
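A minimal Python sketch of how such a frequency table could be built from the 50 raw ages in the case study. The class limits (35-41 through 84-90, width 7) are taken from the age-group axis of the frequency distribution chart; the function name and the plain-dictionary layout are assumptions made for illustration, not the lecture's own code.

```python
# Ages of the 50 wealthiest people, copied row by row from the raw data.
ages = [
    49, 57, 38, 73, 81,
    74, 59, 76, 65, 69,
    54, 56, 69, 68, 78,
    65, 85, 49, 69, 61,
    48, 81, 68, 37, 43,
    78, 82, 43, 64, 67,
    52, 56, 81, 77, 79,
    85, 40, 85, 59, 80,
    60, 71, 57, 61, 69,
    61, 83, 90, 87, 74,
]

CLASS_WIDTH = 7
LOWER_LIMITS = range(35, 85, CLASS_WIDTH)  # 35, 42, ..., 84


def frequency_table(values):
    """Count how many values fall into each class (both limits inclusive)."""
    table = {}
    for lower in LOWER_LIMITS:
        upper = lower + CLASS_WIDTH - 1
        table[f"{lower}-{upper}"] = sum(lower <= v <= upper for v in values)
    return table


freq = frequency_table(ages)
for age_group, count in freq.items():
    print(f"{age_group:>6} {count:>3}")
```

Three classes (56-62, 63-69, and 77-83) tie for the peak with 10 values each, which is hard to see in the raw listing.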
Frequency distribution
[Figure: frequency (vertical axis, 0 to 10) by age group (horizontal axis): 35–41, 42–48, 49–55, 56–62, 63–69, 70–76, 77–83, 84–90]
Frequency table
A frequency table can also be constructed for a nominal- or ordinal-scale variable, for example, gender.
A cumulative frequency table, which shows the number of data values less than or equal to a specific
value, can also be constructed.
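A short sketch of a cumulative frequency table in Python, assuming the class frequencies of the age data have already been tallied. The counts below were tallied from the 50 raw ages in the case study, using the age groups shown on the frequency distribution chart.

```python
from itertools import accumulate

# Age groups and their class frequencies, tallied from the 50 raw ages.
age_groups = ["35-41", "42-48", "49-55", "56-62", "63-69", "70-76", "77-83", "84-90"]
frequencies = [3, 3, 4, 10, 10, 5, 10, 5]

# Running total: number of values less than or equal to each class's upper limit.
cumulative = list(accumulate(frequencies))

for group, f, cf in zip(age_groups, frequencies, cumulative):
    print(f"{group:>6} {f:>3} {cf:>3}")
```

The last cumulative frequency always equals the number of observations, which is a quick check that no value was missed.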
Many real-life situations follow a known, well-defined distribution function,
so in many cases we do not need to access all instances of a given
population.
Shape of frequency distribution
Univariate data visualization
Plot        Qualitative   Quantitative
Pie         yes           no
Bar         yes           Most of the time, no (yes only when there is a small, limited number of values)
Line        no            yes
Histogram   no            yes
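We can also summarize data using summary statistics.
Central tendency statistics identify the central position within the dataset.
• (Arithmetic) mean = \( \frac{x_1 + x_2 + \cdots + x_n}{n} \)
• Mode: the most frequent value, e.g. 3 for (1, 2, 3, 3, 3, 4, 5)
• Median: the middle value of the sorted data, e.g. 3 for (1, 2, 3, 3, 3, 4, 5)
• Midrange = \( \frac{x_{\max} + x_{\min}}{2} \)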
Measures of location
Location statistics identify a value in a certain position and tell us its relative position in
comparison with other data values. Some commonly used univariate location statistics
include:
Example
Data for bottled water sales at Queensland Amusement Park for a sample of 14 summer
days are available in “BottledWater.csv”.
Day   High Temperature (degrees F)   Bottled water sales
1     78                             23
2     79                             22
3     80                             24
4     80                             22
5     82                             24
6     83                             26
7     85                             27
8     86                             25
9     87                             28
10    87                             26
11    88                             29
12    88                             30
13    90                             31
14    92                             31
Convince yourself that the following statistics are correct for the variable
‘High Temperature (degrees F)’:
Descriptive analytics with Python
Dataset: BottledWater.csv
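A minimal sketch of the central tendency statistics for ‘High Temperature (degrees F)’, using only Python's standard library. The 14 values are typed in from the example table rather than read from BottledWater.csv, because the file's exact column headers are not shown here; treat this as one possible way to do the demo, not the lecture's own notebook.

```python
import statistics

# 'High Temperature (degrees F)' for the 14 sample days, from the example table.
high_temp = [78, 79, 80, 80, 82, 83, 85, 86, 87, 87, 88, 88, 90, 92]

mean = statistics.mean(high_temp)            # sum of values / number of values
median = statistics.median(high_temp)        # average of the 7th and 8th sorted values
modes = statistics.multimode(high_temp)      # every value tied for most frequent
midrange = (max(high_temp) + min(high_temp)) / 2

print(f"mean     = {mean:.2f}")
print(f"median   = {median}")
print(f"mode(s)  = {modes}")
print(f"midrange = {midrange}")
```

`multimode` is used instead of `mode` because three temperatures (80, 87, and 88) each occur twice, so the data have more than one mode.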
Boxplot
Boxplot presents a 5-number summary of the data: min, 1st quartile, median, 3rd quartile, and max.
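A short sketch of the 5-number summary behind a boxplot, computed for the ‘High Temperature (degrees F)’ values of the bottled water example. It assumes Python 3.8+ and the default "exclusive" method of `statistics.quantiles`; other tools use slightly different quartile conventions, so their Q1 and Q3 can differ a little.

```python
import statistics

# 'High Temperature (degrees F)' for the 14 sample days of the bottled water example.
high_temp = [78, 79, 80, 80, 82, 83, 85, 86, 87, 87, 88, 88, 90, 92]

# n=4 splits the data into quarters and returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(high_temp, n=4)

five_number_summary = (min(high_temp), q1, q2, q3, max(high_temp))
print(five_number_summary)  # (78, 80.0, 85.5, 88.0, 92)
```

The box of the boxplot spans Q1 to Q3 with a line at the median (Q2).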
What does ‘average’ or ‘mean’ refer to?
❑ Weighted mean = \( \frac{\sum w_i x_i}{\sum w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} \).
Arithmetic mean is a special case of weighted mean that assigns equal weight to
each observation.
Suggested answer
Course        Credits (\(w_i\))   Grade (\(x_i\))
Mathematics   3                   A (4 points)
Psychology    3                   C (2 points)
Biology       4                   B (3 points)
History       2                   D (1 point)
\[ \bar{x} = \frac{3 \cdot 4 + 3 \cdot 2 + 4 \cdot 3 + 2 \cdot 1}{3 + 3 + 4 + 2} = \frac{32}{12} \approx 2.7 \]
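The same calculation as a small Python sketch. The helper name `weighted_mean` is chosen for illustration; the credits and grade points are those of the four courses in the table.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of w_i * x_i divided by the sum of w_i."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


credits = [3, 3, 4, 2]        # w_i: Mathematics, Psychology, Biology, History
grade_points = [4, 2, 3, 1]   # x_i: A, C, B, D

gpa = weighted_mean(grade_points, credits)
print(f"weighted mean grade = {gpa:.1f}")

# With equal weights the weighted mean reduces to the arithmetic mean.
equal_weight_mean = weighted_mean(grade_points, [1, 1, 1, 1])
print(f"arithmetic mean     = {equal_weight_mean}")
```

The two results differ (about 2.7 versus 2.5) because the 4-credit course pulls the weighted mean towards its grade.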
Are you using ‘average’ or ‘mean’ correctly?
Since the stock price first decreased by 50% and then increased by 50%, the average
growth rate of the stock price is \( \frac{-50\% + 50\%}{2} = 0\% \).
Are you using ‘average’ or ‘mean’ correctly?
So the (arithmetic mean) growth rate computed as \( \frac{-50\% + 50\%}{2} = 0\% \) can be misleading in
this context.
Geometric mean
Therefore, we can tell that this statement is not right. If the average growth rate were 0%, we
should get back a stock price of $100 on day 3, shouldn't we? This is where the arithmetic
mean (AM) may not work well.
We should adopt the geometric mean (GM):
\[ GM = \left( \prod_{i=1}^{n} x_i \right)^{1/n} = \sqrt[n]{x_1 x_2 \cdots x_n} \]
If we denote the average growth rate as \(R\), we can compute it as \( 1 + R = \sqrt{(1 - 50\%)(1 + 50\%)} = \sqrt{0.75} \approx 0.866 \), so \( R \approx -13.4\% \).
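A small Python check of this reasoning, assuming the stock starts at $100, falls by 50%, and then rises by 50%. The growth factors 0.5 and 1.5 are the daily multipliers implied by those two changes; the variable names are chosen for illustration.

```python
import math

start_price = 100.0
growth_factors = [0.5, 1.5]  # -50% then +50%, written as multipliers (1 + rate)

# Actual price after both changes: 100 -> 50 -> 75.
final_price = start_price * math.prod(growth_factors)

# Geometric mean of the growth factors gives 1 + R.
gm = math.prod(growth_factors) ** (1 / len(growth_factors))
avg_rate = gm - 1

# Applying the average rate twice reproduces the actual final price.
price_from_gm = start_price * gm ** len(growth_factors)

print(f"final price         = {final_price}")
print(f"average growth rate = {avg_rate:.1%}")
print(f"price using GM rate = {price_from_gm:.2f}")
```

An arithmetic-mean rate of 0% would predict a final price of $100, while the geometric-mean rate recovers the actual $75.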
Are you using ‘average’ or ‘mean’ correctly?
Suppose the distance between your home and school is d. You drive from home to
school at a speed x = 60 km/h and return from school to home at a speed y = 20 km/h.
Is your average driving speed then (60 + 20)/2 = 40 km/h? Correct or wrong?
[Figure: home to school at 60 km/h; school to home at 20 km/h]
Harmonic mean
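We know the average driving speed can be computed as \( \frac{\text{total distance}}{\text{total time}} \).
For instance, let the distance d = 120 km. It takes you \( \frac{120}{60} = 2 \) hours one way and \( \frac{120}{20} = 6 \) hours
the other way, so the average speed is \( \frac{120 + 120}{2 + 6} = 30 \) km/h. This is the harmonic mean (HM) of the two speeds:
\[ HM = \frac{n}{\frac{1}{x_1} + \cdots + \frac{1}{x_n}} \]
\[ \text{Average speed} = \frac{2}{\frac{1}{60} + \frac{1}{20}} = 30 \text{ km/h} \]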
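The same result can be checked in Python two ways: from total distance over total time, and with `statistics.harmonic_mean`. The distance of 120 km is the illustrative value used above; any positive distance gives the same average speed.

```python
import statistics

speeds = [60, 20]   # km/h: home to school, then school to home
distance = 120      # km, one way

# Total distance divided by total time.
total_time = sum(distance / s for s in speeds)       # 2 h + 6 h
avg_speed = len(speeds) * distance / total_time

# Harmonic mean of the two speeds.
hm = statistics.harmonic_mean(speeds)

print(f"total time         = {total_time} h")
print(f"distance / time    = {avg_speed} km/h")
print(f"harmonic mean      = {hm:.1f} km/h")
print(f"arithmetic mean    = {statistics.mean(speeds)} km/h (overstates the speed)")
```

The slow leg takes three times as long as the fast leg, so it counts for more of the trip than the arithmetic mean allows.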
Summary
Arithmetic mean: use when dealing with additive relationships (e.g., heights, weights). Example: average height of students in a class.
Chebyshev’s theorem
The proportion of values from a data set that will fall within \(k\) standard deviations of the
mean will be at least \( 1 - \frac{1}{k^2} \), where \(k\) is a number greater than 1.
For example
• We can estimate that at least 75% of the data values will fall within 2 standard
deviations of the mean of any data set (\(k = 2\)).
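A minimal Python sketch of the bound, with an empirical check on the ‘High Temperature (degrees F)’ values from the bottled water example. The function name is hypothetical, and the check uses the sample standard deviation from the `statistics` module as an assumption about which standard deviation is meant.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k**2


for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%} of values")

# Empirical check on the 14 high temperatures from the bottled water example.
high_temp = [78, 79, 80, 80, 82, 83, 85, 86, 87, 87, 88, 88, 90, 92]
m = statistics.mean(high_temp)
s = statistics.stdev(high_temp)
within_2_sd = sum(abs(x - m) <= 2 * s for x in high_temp) / len(high_temp)
print(f"observed proportion within 2 sd: {within_2_sd:.0%}")
```

The theorem only gives a floor: the observed proportion can be, and here is, well above 75%.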
Exercise 1
Suppose the mean housing price in a certain district is $50,000, and the standard
deviation is estimated to be $10,000. Estimate the price range for which at least 75% of
the houses will sell.
Exercise 1 (Answer)
Suppose the mean housing price in a certain district is $50,000, and the standard
deviation is estimated to be $10,000. Find the price range for which at least 75% of the
houses will sell.
Based on Chebyshev’s theorem, at least 75% of data values will fall within k = 2 standard
deviations of the mean:
\[ [50000 - 2 \times 10000,\ 50000 + 2 \times 10000] = [30000, 70000] \]
So at least 75% of all houses sold are estimated to be in the range from $30,000 to $70,000.
Exercise 2
A survey of local companies found that the mean amount of travel allowance for
executives was $0.25 per mile. The standard deviation was $0.02. Using Chebyshev’s
theorem, find the minimum percentage of the data values that will fall between $0.20
and $0.30.
Exercise 2 (Answer)
A survey of local companies found that the mean amount of travel allowance for
executives was $0.25 per mile. The standard deviation was $0.02. Using Chebyshev’s
theorem, find the minimum percentage of the data values that will fall between $0.20
and $0.30.
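One way to work this out, sketched in Python: first express the interval's half-width in standard deviations to get k, then apply the bound 1 − 1/k². The variable names are chosen for illustration.

```python
mean_allowance = 0.25   # dollars per mile
sd_allowance = 0.02
lower, upper = 0.20, 0.30

# Number of standard deviations from the mean to each end of the interval.
k_lower = (mean_allowance - lower) / sd_allowance
k_upper = (upper - mean_allowance) / sd_allowance
k = min(k_lower, k_upper)   # the interval is symmetric, so both are about 2.5

# Chebyshev's theorem: at least 1 - 1/k^2 of the values lie within k sd of the mean.
min_proportion = 1 - 1 / k**2

print(f"k = {k:.1f}")
print(f"minimum percentage = {min_proportion:.0%}")
```

So under these assumptions at least 84% of the travel allowances fall between $0.20 and $0.30 per mile.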
Choice of Visualization
• Some commonly used plots
• Which plot should I use for visualization?
Let’s say you want to present the graduation frequency for a particular high school
between the years 2008-2012.
Some commonly used plots
Bar chart
Bar charts use size to contrast and compare two or more values, using height or length
to represent the specific values.
Below is example data on vehicle sales over the course of 5 months:
Some commonly used plots
Heatmap
Like bar charts, heatmaps compare categories in a data set, but they use color rather than size. They
are mainly used to show relationships between two variables and use a system of color-coding
to represent different values.
The following heatmap plots temperature changes for each city during the hottest and
coldest months of the year.
Some commonly used plots
Pie chart
The pie chart is a circular graph divided into segments, each sized in proportion
to the quantity it represents. It is especially useful when dealing with parts of a
whole. For example, let’s say you are determining favorite movie categories among avid
movie watchers.
Some commonly used plots
Scatter plot
Scatterplots show relationships between different variables. They are typically
used for two variables in a set of data, although additional variables can be displayed.
For example, you might want to show the relationship between temperature
changes and ice cream sales. It would resemble something like this:
After-class Reading
The data visualization catalogue: This catalogue features a range of different diagrams,
charts, and graphs to help you find the best fit for your project. As you navigate each
category, you will get a detailed description of each visualization as well as some related
code or software.
Which plot should I use?
With so many visualization options out there for you to choose from, how do you decide
what is the best way to represent your data?
A decision tree leading to the best chart
Here is a simple decision tree to help you choose the suitable type of plot to use:
More resources to help you decide which plot to use
Tableau Public
Looker Studio
Excel
Google Analytics 4
Power BI
ggplot2
Matplotlib
Seaborn
Excel/Google Sheets
• Types of charts and graphs in Google Sheets: a Google Help Center page with a list of
chart examples you can download.
• Excel Charts: a tutorial outlining all of the different chart types in Excel, including
some subcategories.
Tableau Public
Tableau Versions
Power BI
https://www.microsoft.com/en-us/power-platform/products/power-bi/getting-started-with-power-bi
Power BI (installation)
Windows Users
https://www.microsoft.com/en/power-platform/products/power-bi/desktop?market=af
Mac Users
https://app.powerbi.com/
Feel free to share your feedback with me via this
link/QR code throughout the whole semester.
https://app.sli.do/event/hUgiGrg7Ln8KeEFVyCT9o3
Thank You!