Lecture 7 - Analyze the Data Using Statistics

This document provides an overview of the importance of statistics in data analysis, emphasizing the need for a basic understanding of statistical concepts to effectively interpret data. It discusses descriptive and inferential statistics, detailing their applications and limitations, particularly in the context of big data analytics. Additionally, the document covers various types of data visualizations, including best practices for using line charts, column charts, bar charts, and pie charts to effectively communicate data insights.

Uploaded by

Muhammad Asad

5.0 Introduction
5.0.1 INTRODUCTION

• To do their jobs efficiently and effectively, data analysts must have a basic understanding of statistics.
• This is because data analytics relies heavily on statistics in the process of analyzing and interpreting data.
5.0.2 WHAT WILL I LEARN IN THIS MODULE?

• In this module, we will create some visualizations in Excel.
• But first, we need to understand some statistical concepts in order to make the most of visual interpretations.
• Upon completion of this module, you should be able to:
5.1 Using Statistics to Interpret Data
5.1.1 WHAT ARE STATISTICS, POPULATIONS, AND SAMPLES
5.1.2 PRACTICE ITEM
5.1.3 DESCRIPTIVE STATISTICS

• After the problem statement (also known as the question to be asked) and the population are determined, some form of statistical analysis is needed.
• There are two key branches of statistics that we will discuss in this course: descriptive statistics and inferential statistics.

• Descriptive statistics are used to describe or summarize the values and observations of a data set.
• For example, a fitness tracker logged a person’s daily steps and heart rate for a 10-day period.
• If the person met their fitness goals on 6 of the 10 days, then they were successful 60% of the time.
• Over that 10-day period, you could observe that the person’s heart rate reached a maximum of 140 beats per minute (bpm) but averaged 72 bpm.
• These observations are descriptive statistics that can be used to describe and simplify the data set.

• Basic descriptive statistics might include the total number of data points in a data set, the range of values that exist for those numeric data points, or the number of times a given value appears in a data set.
• Descriptive statistics may also answer questions about the occurrence of trends.
• The answers to these questions can be provided in numerical or graphical formats.
• Results of descriptive statistics are often represented in pie charts, bar charts, or histograms.
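
The descriptive statistics listed above (count, range, frequency, maximum, average, success rate) can be sketched in a few lines of Python. This is only an illustration: the daily readings are invented so that they match the figures in the fitness tracker example (6 of 10 goals met, maximum 140 bpm, average 72 bpm), and the helper name describe is an assumption, not part of the course material.

```python
from collections import Counter
from statistics import mean

# Hypothetical 10-day log; the values are invented for illustration.
goal_met = [True, True, False, True, False, True, True, False, True, False]
heart_rate_bpm = [62, 64, 60, 140, 66, 63, 65, 68, 67, 65]


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),                  # total number of data points
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),    # spread of the values
        "mean": mean(values),                  # the average
        "frequencies": Counter(values),        # times each value appears
    }


success_rate = sum(goal_met) / len(goal_met)   # share of days the goal was met
summary = describe(heart_rate_bpm)

print(f"Goal met on {success_rate:.0%} of days")
print(f"Max {summary['maximum']} bpm, average {summary['mean']} bpm")
```

Note that every number produced here only summarizes the ten recorded days; nothing in it says anything about other days or other people.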

• One important point to note is that while descriptive statistics describe the current or historical state of the observed population, they do not allow for:
• comparison of groups
• conclusions to be drawn
• predictions to be made about data sets that are not in the population

• In the fitness tracker example, we cannot infer that the person has poor health because they were only successful in meeting their goal 60% of the time.
• We also cannot use the data set for this one person to predict the fitness performance of others with similar characteristics.
• This is where inferential statistics becomes important.
5.1.4 PRACTICE ITEM
5.1.5 INFERENTIAL STATISTICS

• Descriptive statistics allow you to summarize findings based on data that you have already recorded or observed about a population.
• However, gathering data for a very large population may not always be practical or even possible.
• It is possible, however, to study a smaller representative sample of a population and use inferential statistics to test hypotheses and draw conclusions about the larger population.

• Inferential statistics is the process of collecting, analyzing, and interpreting the data gathered from a sample to generalize or predict something about a population.
• When a representative sample is used, methodological concerns may arise and must be addressed, such as whether the groups chosen for the study or the environment in which a study is carried out accurately reflect the characteristics of the larger group.

• Typically, these types of analyses will include different sampling techniques to reduce error and increase confidence in the generalized findings.
• The type of sampling technique used will depend on the type of data.
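
As a rough sketch of the idea, the Python below draws a simple random sample from a simulated population and uses it to estimate the population mean with an approximate 95% interval. Everything here is an assumption made for illustration: the population is generated artificially, simple random sampling is only one of many sampling techniques, and the interval uses the normal approximation (1.96 standard errors).

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: daily step counts for 100,000 people.
population = [random.gauss(8000, 2000) for _ in range(100_000)]

# Simple random sample: every member has the same chance of being chosen.
sample = random.sample(population, 400)

sample_mean = mean(sample)
standard_error = stdev(sample) / len(sample) ** 0.5

# Approximate 95% confidence interval (normal approximation).
margin = 1.96 * standard_error
interval = (sample_mean - margin, sample_mean + margin)

print(f"Sample mean: {sample_mean:.0f} steps")
print(f"Approximate 95% interval: {interval[0]:.0f} to {interval[1]:.0f}")
print(f"Population mean (normally unknown): {mean(population):.0f}")
```

In practice the population mean on the last line is exactly what cannot be measured, which is why the estimate and its interval are reported instead.
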
5.1.6 PRACTICE ITEM
5.1.7 STATISTICS AND BIG DATA

• Different statistical approaches are used in big data analytics.
• As we know, descriptive statistics describe a sample.
• This is useful for understanding the sample data and for determining the quality of the data.
• Problems can occur when dealing with large amounts of data that come from multiple sources.
• Data points can be corrupted, incomplete, or missing entirely.

• Descriptive statistics can help determine how much of the data in the sample is good for the analysis and identify criteria for removing data that is inappropriate or problematic.
• Graphs of descriptive statistics are a helpful way to make quick judgments about the quality of a sample.
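
A minimal sketch of such a quality check is shown below. The records, the field names, and the criteria in is_usable (no missing fields, no negative step count) are all hypothetical; real criteria depend on the data and on the question being asked.

```python
# Hypothetical records merged from several sources; None marks a missing value.
records = [
    {"id": 1, "steps": 8200, "heart_rate": 71},
    {"id": 2, "steps": None, "heart_rate": 69},
    {"id": 3, "steps": 7900, "heart_rate": None},
    {"id": 4, "steps": -50, "heart_rate": 75},    # corrupted: negative steps
    {"id": 5, "steps": 10400, "heart_rate": 72},
]


def is_usable(record):
    """Assumed criteria: no missing fields and no negative step count."""
    if any(value is None for value in record.values()):
        return False
    return record["steps"] >= 0


# Descriptive counts: how many values are missing in each field?
missing_per_field = {
    field: sum(1 for r in records if r[field] is None)
    for field in ("steps", "heart_rate")
}

usable = [r for r in records if is_usable(r)]
usable_share = len(usable) / len(records)

print(missing_per_field)
print(f"{usable_share:.0%} of records are usable")
```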

• For example, in a sample of tweets selected for analysis, some contain only text characters, while others contain both characters and images.
• The type of analysis, or the question to be answered by the analysis, will determine whether tweets that contain images or tweets with no images should be analyzed.

• A number of inferential analyses are very commonly used in big data analytics:
• Cluster analysis - Used to find groups of observations that are similar to each other
• Association analysis - Used to find co-occurrences of values for different variables
• Regression analysis - Used to quantify the relationship, if any, between the variations of one or more variables
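
To make the last item concrete, here is a small sketch of the simplest form of regression analysis: fitting a straight line to one predictor with ordinary least squares. The function name fit_line and the data are assumptions for illustration, and the data is deliberately exact (every point lies on one line) so the result is easy to check by hand; real data would scatter around the line.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: outside temperature (degrees C) and ice cream sales that day.
temperature = [18, 21, 24, 27, 30]
sales = [110, 140, 170, 200, 230]

slope, intercept = fit_line(temperature, sales)
print(f"Each extra degree is associated with {slope:.1f} more sales")
print(f"Predicted sales at 25 degrees: {slope * 25 + intercept:.0f}")
```

The slope quantifies the relationship between the two variables; it describes an association in the data and does not by itself show that one variable causes the other.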
5.1.8 PRACTICE ITEM
5.2 Choosing the Right Visualization for the Job
5.2.1 IMPORTANCE OF VISUALIZATIONS
5.2.2 PRACTICE ITEM
5.2.3 COMMON TYPES OF DATA VISUALIZATIONS

• There are many types of data visualizations.
• Determining the best option usually depends on the answers to the following questions, among others:
• How many variables are you going to show?
• How many data points are in each variable?
• Does your data show change over time, or are you comparing data points at a single point in time?

• Line Chart
• Line charts are one of the most commonly used types of comparison charts.
• Use line charts when you have a continuous set of data, the number of data points is high, and/or you would like to show a trend in the data over time.

• Some examples include:
• Quarterly sales for the past five years
• Number of customers per week in the first year of a new retail shop
• Change in a stock’s price on one day, from opening to closing bell

• Some best practices for line charts include:
• Label the axes.
• Plot time on the x-axis (horizontal) and the data values on the y-axis (vertical).
• Use a solid line (rather than a broken line) to emphasize continuity of the data.
• Keep the number of data sets to a minimum. There should be a very good reason for plotting more than four lines. If needed, add a legend to help the audience understand what they are viewing.
• Remove or minimize gridlines to reduce distraction. Consider using no gridlines except to emphasize certain values or time periods.
• Modify the y-axis starting point to obtain something close to a 45-degree slope in one or more of the lines. This ensures you emphasize the change in the data without introducing distortions that dramatize the visualization.

• Column Chart
• Column charts are positioned vertically.
• They are probably the most common chart type used to display the values of a specific variable across similar categories.

• Some examples include:
• Populations of the BRICS nations (Brazil, Russia, India, China, and South Africa)
• Last year’s sales for the top four car companies
• Average student test scores for six math classes

• Some best practices for column charts include:
• Label the axes.
• If changes over time are being shown, time should be plotted on the x-axis.
• If time is not part of the data, consider ordering the data so that column heights ascend or descend.
• Fill the columns with a solid color. To highlight one column, consider using an accent color and make all the other columns the same color.
• Column charts are best when there are no more than seven categories on the horizontal axis. This will help the viewer clearly see the value for each column.
• Start the value of the y-axis at zero to accurately reflect the full value of each column.
• The spacing between columns should ideally be roughly half the width of a column.

• Bar Chart
• Bar charts are similar to column charts except they are positioned horizontally and hence used slightly differently (for example, they do not usually show changes over time).
• Longer bars indicate larger values.
• They are best used when the name of each data point is long, because there is space to write the information.

• Some examples include:
• Gross domestic product (GDP) of the 25 highest-producing nations in a given year
• Number of cars sold by each sales representative in a group
• Exam scores for each student in a math class

• Some best practices for bar charts include:
• Label the axes.
• Consider ordering the bars so that the lengths go from longest to shortest. The meaning of the data shown will most likely determine whether the longest bar should be on the bottom or the top for greatest impact or easiest understanding.
• Fill the bars with a solid color. To highlight one bar, consider using an accent color and make all the other bars the same color.
• Start the value of the x-axis at zero to accurately reflect the full value of each bar.
• The spacing between bars should ideally be roughly half the width of a bar.

• Pie Chart
• Pie charts are used to show the composition of a total.
• Segments of different sizes visually represent percentages of that total.
• The sum of the segments must equal 100%.

• Some examples include:
• Annual expenses for a corporation (e.g., rent, administrative, utilities, production)
• A country’s energy sources (e.g., oil, coal, gas, solar, wind)
• Survey results for a group’s favorite type of movie (e.g., action, romance, comedy, drama, science fiction)

• Some best practices for pie charts include:
• Limit the number of categories so that the viewer can easily differentiate between segments and their meaning in relation to each other. With ten or more segments, the slices begin to lose meaning and impact.
• If necessary, consolidate smaller segments into one segment with a label such as “Other” or “Miscellaneous”.
• Use a different color or gray scale for each segment.
• Order the segments clockwise according to size.
• Make sure the values of all segments sum to 100%.
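
Several of these pie chart practices are plain arithmetic and can be sketched before any chart is drawn: convert values to percentages, order them by size, merge the smallest into “Other”, and confirm the total is 100%. The function pie_segments, its max_segments parameter, and the expense figures are hypothetical choices for this illustration.

```python
def pie_segments(values, max_segments=5, other_label="Other"):
    """Convert raw values to percentages, largest first, merging small ones."""
    total = sum(values.values())
    ranked = sorted(values.items(), key=lambda item: item[1], reverse=True)
    # Keep one slot free for the merged segment when there are too many.
    kept = ranked[: max_segments - 1] if len(ranked) > max_segments else ranked
    merged = ranked[len(kept):]
    segments = [(label, 100 * value / total) for label, value in kept]
    if merged:
        merged_total = sum(value for _, value in merged)
        segments.append((other_label, 100 * merged_total / total))
    return segments


# Hypothetical annual expenses, in thousands.
expenses = {
    "Production": 500, "Rent": 200, "Administrative": 150,
    "Utilities": 80, "Travel": 40, "Supplies": 30,
}

segments = pie_segments(expenses, max_segments=5)
for label, percent in segments:
    print(f"{label:15s} {percent:5.1f}%")
print("Total:", sum(percent for _, percent in segments))
```

Here the two smallest categories are combined so that five segments remain, and the percentages still add up to 100.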

• Scatter Plot
• Scatter plots are very popular for visualizing correlations, or to show the distribution of many data points.
• Scatter plots are also useful for demonstrating clustering or identifying outliers in the data.

• Some examples include:
• Comparing life expectancy to GDP for each country in a group
• Comparing the daily sales of ice cream at a given location to the average outside temperature
• Comparing the weight to the height of each person in a group

• Some best practices for scatter plots include:
• Label the axes.
• Make sure the data set is large enough to provide visualization for clustering or outliers.
• Start the value of the y-axis at zero to accurately reflect the full values of the data. The value of the x-axis will depend on the data. For example, age ranges of ice cream customers might be labeled on the x-axis, and there would be no need to start at zero years of age.
• If the scatter plot shows a correlation between values on the x- and y-axes, consider adding a trend line.
• Do not include more than two trend lines.
5.2.4 PRACTICE ITEM
5.3 Creating Visualizations with Excel
5.3.1 STEPS TO CREATE VISUALIZATIONS IN EXCEL
5.3.2 PRACTICE ITEM
5.3.3 LAB - CREATE VISUALIZATIONS IN EXCEL
5.3.4 PRACTICE ITEM
5.4 Addressing Anomalies in Data
5.4.1 DISCOVERING ANOMALIES THROUGH VISUALIZATION
5.4.2 PRACTICE ITEM
5.4.3 OUTLIERS AND ANOMALIES

• Before data analysis can begin, considerable time must be spent cleaning the data.
• During the data cleaning phase, you may find outliers, or anomalies, in the data.
• If so, they need to be investigated so that the data can be corrected or the meaning of the outlying data point can be understood.

• An outlier is defined as a value or data point that varies significantly from others, either much smaller or much greater.
• Sometimes outliers are mistakes and sometimes they represent an important piece of information.
• In the figure, the data point at the extreme bottom right is an outlier.
• All the other data points cluster along the trend line.

• In the data analysis process, outliers that are the result of mistakes can lead to anomalies in the results obtained, while outliers that are not errors can be very important to an analysis.
• This is why investigating anomalies is a very important part of the data cleaning process: it ensures that the data can be analyzed effectively and that the analysis generates accurate and valid results.

• With small data sets it may be relatively easy to spot outliers by sorting or filtering the data.
• But when it comes to large datasets and big data, other tools are required.
• Two common types of data visualization used to find outliers are scatter plots and box plots.
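
Box plots conventionally mark a point as an outlier when it falls more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. The sketch below applies that rule directly; it is one common convention rather than the only definition of an outlier, the function name iqr_outliers is hypothetical, and the heart rate readings are invented for illustration.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box plot fences."""
    q1, _, q3 = quantiles(values, n=4)   # quartiles of the data
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical heart rate readings (bpm) with one unusually high value.
heart_rate_bpm = [62, 64, 60, 140, 66, 63, 65, 68, 67, 65]

print(iqr_outliers(heart_rate_bpm))
```

The rule only flags the reading of 140 bpm; deciding whether it is a sensor error or a real event (for example, exercise) still requires investigation.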
5.4.4 PRACTICE ITEM
5.4.5 LAB - INTERPRET VISUALIZATIONS WITH RESPECT TO OUTLIERS
5.4.6 PRACTICE ITEM
5.5 Using Excel to Address Issues with Data
5.5.1 INTRODUCING VLOOKUP
5.5.2 PRACTICE ITEM
5.5.3 WHAT YOU CAN DO WITH VLOOKUP

• VLOOKUP is a very powerful data analysis tool within Excel and is great when you need to find information in a large spreadsheet or if you are consistently looking for the same type of information.

• VLOOKUP is an abbreviation of “vertical lookup,” and it’s a function that searches a (vertical) column in a table for a specified value.
• This means that the data must be organized in a table where each row has different but related forms of data in each column.
• If an approximate match is specified in the formula, the first column (the lookup column) must be sorted in numeric or alphabetic order.

• A VLOOKUP function consists of 4 key pieces of information:
• The value to search for
• The range to search in
• The column in the range that contains the value you want the function to return
• An indication of whether the function should accept an approximate match (TRUE, in the function) or only an exact match (FALSE) for the value being searched for. The default for VLOOKUP is an approximate match

• VLOOKUP searches for a value in the leftmost column of a table and, when the value is found, returns information from the same row but in another column.
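
In Excel the four pieces map onto the formula =VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup]). The Python below is a rough sketch of the same behavior, not Excel's implementation: the table is a list of rows, the column index is 1-based as in Excel, an approximate match returns the row with the largest key that does not exceed the lookup value (which is why the first column must be sorted), and None stands in for Excel's #N/A error. The grade table is hypothetical.

```python
from bisect import bisect_right


def vlookup(lookup_value, table, col_index, range_lookup=True):
    """Sketch of VLOOKUP over a list of rows; col_index is 1-based."""
    if range_lookup:
        # Approximate match: largest key <= lookup_value (keys must be sorted).
        keys = [row[0] for row in table]
        position = bisect_right(keys, lookup_value) - 1
        if position < 0:
            return None  # Excel would show #N/A here
        return table[position][col_index - 1]
    # Exact match: first row whose leftmost value equals lookup_value.
    for row in table:
        if row[0] == lookup_value:
            return row[col_index - 1]
    return None  # Excel would show #N/A here


# Hypothetical grade table, sorted by minimum score in the leftmost column.
grades = [(0, "F"), (60, "D"), (70, "C"), (80, "B"), (90, "A")]

print(vlookup(85, grades, 2))         # approximate match uses the 80 row
print(vlookup(85, grades, 2, False))  # exact match finds no row for 85
print(vlookup(90, grades, 2, False))  # exact match on 90
```

The first two calls show why the final argument matters: the same lookup value returns a grade with approximate matching and nothing with exact matching.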

• XLOOKUP, an alternative to VLOOKUP
• XLOOKUP is a newer lookup function, similar to VLOOKUP, that is not available in all versions of Excel currently in use.
• With XLOOKUP, you can look in any column (not only the leftmost in a table) for a search term and return a result from the same row.

• One difference is that XLOOKUP defaults to returning an exact match, whereas VLOOKUP defaults to closest match unless the FALSE keyword is used.
• In this course, you may use either VLOOKUP or XLOOKUP to obtain the desired results if they are both available in the spreadsheet tool you are using.
• Note: XLOOKUP is not backward compatible, so worksheets using XLOOKUP may not be usable in earlier versions of Excel.
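
The difference in shape can be sketched as follows. In Excel, XLOOKUP takes a lookup array and a separate return array, as in =XLOOKUP(lookup_value, lookup_array, return_array, [if_not_found]). The Python function below imitates only that default exact-match behavior; it is an illustrative sketch with hypothetical data, and it ignores XLOOKUP's optional match and search modes.

```python
def xlookup(lookup_value, lookup_column, return_column, if_not_found=None):
    """Exact-match lookup where the search column can be any column."""
    for key, result in zip(lookup_column, return_column):
        if key == lookup_value:
            return result
    return if_not_found


# Hypothetical table stored column by column; the name column is NOT leftmost.
employee_ids = [101, 102, 103]
names = ["Ana", "Ben", "Cy"]

print(xlookup("Ben", names, employee_ids))               # search by name, return ID
print(xlookup("Zed", names, employee_ids, "not found"))  # no exact match
```

Because the lookup and return columns are passed separately, the search column does not need to be to the left of the result column.
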
5.5.4 PRACTICE ITEM
5.5.5 LAB - USING VLOOKUP IN DATA ANALYSIS
5.5.6 PRACTICE ITEM
5.6 Analyze the Data Using Statistics Summary
5.6.1 WHAT DID I LEARN IN THIS MODULE?

• In this module, you learned that statistics can help the data analyst to interpret data correctly, identify patterns and trends in the data, and convert them into meaningful information.

• Using Statistics to Interpret Data
• Topic Objective: Describe the different types of statistics.
• Statistics is a tool used by data analysts to help analyze large quantities of data and to identify patterns and trends in that data.
• There are two key branches of statistics: descriptive statistics, used to describe or summarize the values and observations of a data set, and inferential statistics, used to make generalizations or predictions about a population.

• Choosing the Right Visualization for the Job
• Topic Objective: Select data visualizations to best explain analysis results.
• Three considerations when choosing the best visualization are how many variables need to be shown, how many data points are in each variable, and whether data over time or comparisons need to be shown.
• Line charts are good for continuous data and showing trends over time, while column and bar charts display comparisons of specific data points across similar categories.
• Pie charts are good for showing the percentages of a total, and scatter plots can show the distribution of many data points and are useful for identifying outliers in data.

• Creating Visualizations with Excel
• Topic Objective: Create visualizations with Excel.
• Creating visualizations with Excel is easily accomplished by selecting the data to be visualized and then selecting the desired chart type from the list of available charts. Excel has many options for customizing charts, such as adding a title, axis labels, legends, gridlines, and data labels.

• Addressing Anomalies in Data
• Topic Objective: Interpret visualizations.
• Data must be cleaned before data analysis can begin. Outliers are data points that vary significantly from the other data points.
• If outliers are found in the data, they need to be investigated and either verified or removed so they do not negatively impact the accuracy of the analysis.
• In small datasets, outliers can be identified by visually scanning sorted and filtered data.
• In larger datasets, visualization tools such as scatter plots and box plots are often used.

• Using Excel to Address Issues with Data
• Topic Objective: Use VLOOKUP in Excel to identify and fix issues.
• VLOOKUP is a built-in function in Excel that performs a vertical search for one piece of information in a table and then extracts a specified corresponding piece of information.
• VLOOKUP is useful for finding information in large spreadsheets. It is also useful for cleaning data. With VLOOKUP, a data analyst can compare the data values in two columns to identify duplicate values.
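
In a spreadsheet, this comparison is typically done with an exact-match lookup of each value from one column in the other, where a #N/A result means the value was not found. The same idea can be sketched in Python as a membership test; the function name flag_duplicates and the customer IDs are hypothetical.

```python
def flag_duplicates(column_a, column_b):
    """For each value in column_a, report whether it also appears in column_b."""
    lookup = set(column_b)  # fast exact-match membership test
    return [(value, value in lookup) for value in column_a]


# Hypothetical customer IDs exported from two systems.
crm_ids = ["C001", "C002", "C003", "C004"]
billing_ids = ["C003", "C005", "C001"]

flags = flag_duplicates(crm_ids, billing_ids)
duplicates = [value for value, found in flags if found]

print(flags)
print("Present in both columns:", duplicates)
```
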
5.6.2 ANALYZE THE DATA USING STATISTICS QUIZ
