
Unit II

Introduction to Probability
• One of the primary objectives in analytics is to measure the
uncertainty associated with an event or key performance indicator.
• Probability theory is the foundation on which descriptive and
predictive analytics models are built.
Probability Theory Terminologies
Random Experiment
• Random experiment is an experiment in which the outcome is not
known with certainty.
• That is, the output of a random experiment cannot be predicted with
certainty.
• Predictive analytics mainly deals with random experiments such as
predicting quarterly revenue of an organization, customer churn
(whether a customer is likely to churn or how many customers are
likely to churn before next quarter), demand for a product at a future
time period, number of views for a YouTube video, outcome of a
football match (win, draw, or loss), etc.
Sample Space
• Sample space is the universal set that consists of all possible
outcomes of an experiment.
• Sample space is usually represented using the letter ‘S’ and individual
outcomes are called the elementary events.
• The sample space can be finite or infinite.
Event
• Event (E) is a subset of a sample space and probability is usually
calculated with respect to an event.
• An event can be represented using a Venn diagram.
Probability Estimation using Relative
Frequency
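The standard estimate: if an experiment is repeated N times and the event E occurs n(E) times, then P(E) ≈ n(E)/N, and the estimate converges to the true probability as N grows (law of large numbers). A minimal Python sketch, using die rolling as an illustrative experiment:

import random

# Estimate P(rolling a six) by relative frequency: count(event) / trials.
trials = 100_000
hits = sum(1 for _ in range(trials) if random.randint(1, 6) == 6)
print(f"relative-frequency estimate: {hits / trials:.4f} (exact: {1/6:.4f})")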
Algebra of Events
• Assume that X, Y and Z are three events of a sample space. Then the
following algebraic relationships are valid and are useful while
deriving probabilities of events:
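The standard identities (for events X, Y, Z, with the complement denoted by ′):
• Commutative: X ∪ Y = Y ∪ X; X ∩ Y = Y ∩ X
• Associative: (X ∪ Y) ∪ Z = X ∪ (Y ∪ Z); (X ∩ Y) ∩ Z = X ∩ (Y ∩ Z)
• Distributive: X ∩ (Y ∪ Z) = (X ∩ Y) ∪ (X ∩ Z); X ∪ (Y ∩ Z) = (X ∪ Y) ∩ (X ∪ Z)
• De Morgan’s laws: (X ∪ Y)′ = X′ ∩ Y′; (X ∩ Y)′ = X′ ∪ Y′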
FUNDAMENTAL CONCEPTS IN PROBABILITY –
AXIOMS OF PROBABILITY
• In 1933, Andrey Kolmogorov, a Russian mathematician, laid the
foundation of the axiomatic theory of probability (Kolmogorov, 1956).
• According to axiomatic theory of probability, the probability of an
event E satisfies the following axioms:
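In Kolmogorov’s standard formulation, the three axioms are:
• Axiom 1: P(E) ≥ 0 for every event E.
• Axiom 2: P(S) = 1, where S is the sample space.
• Axiom 3: For mutually exclusive events E1, E2, …, P(E1 ∪ E2 ∪ …) = P(E1) + P(E2) + …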
FUNDAMENTAL CONCEPTS IN PROBABILITY –
AXIOMS OF PROBABILITY - Cont.,
• Using the aforementioned axioms of probability, one can derive
several mathematical relationships on probability of events using set
theory logic.
• The following elementary rules of probability are directly deduced
from the original three axioms of probability, using the set theory
relationships
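The standard elementary rules include:
• P(∅) = 0, where ∅ is the empty set.
• P(E′) = 1 − P(E), where E′ is the complement of E.
• P(E) ≤ 1 for every event E.
• If A ⊆ B, then P(A) ≤ P(B).
• Addition rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B).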
Joint Probability
• Let A and B be two events in a sample space. Then the joint
probability of the two events, written as P(A ∩ B), is given by
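In standard form: P(A ∩ B) = P(A) × P(B|A) = P(B) × P(A|B); when A and B are independent, this reduces to P(A ∩ B) = P(A) × P(B).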
Marginal Probability
• Marginal probability is simply a probability of an event X, denoted by
P(X), without any conditions.
Independent Events
• Two events A and B are said to be independent when occurrence of
one event (say event A) does not affect the probability of occurrence
of the other event (event B).
• Mathematically, two events A and B are independent when
P(A ∩ B) = P(A) × P(B).
Conditional Probability
• If A and B are events in a sample space, then the conditional
probability of the event B given that the event A has already
occurred, denoted by P(B|A), is defined as
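In standard form: P(B|A) = P(A ∩ B) / P(A), provided P(A) > 0.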
APPLICATION OF SIMPLE PROBABILITY RULES
– ASSOCIATION RULE LEARNING
• In general, association rule learning (also known as association rule
mining) is a method of finding association between different entities
in a database.
Cont.,
• The strength of association between two mutually exclusive subsets
can be measured using ‘support’, ‘confidence’, and ‘lift’.
• Support between two sets (of products purchased) is calculated using
the joint probability of those events.
• Confidence is the conditional probability of purchasing product Y given
that product X is purchased.
• Lift overcomes one of the disadvantages of using confidence.
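A minimal Python sketch of the three measures, using an illustrative transaction list (not the book’s Table 3.2): support(X → Y) = P(X ∩ Y), confidence(X → Y) = P(Y|X), and lift(X → Y) = confidence / P(Y).

# Support, confidence, and lift for the rule X -> Y over a toy basket list.
transactions = [
    {"Apple", "Banana"},
    {"Apple"},
    {"Banana", "Milk"},
    {"Apple", "Banana", "Milk"},
    {"Banana"},
]
X, Y = {"Apple"}, {"Banana"}

n = len(transactions)
p_x = sum(X <= t for t in transactions) / n         # P(X)
p_y = sum(Y <= t for t in transactions) / n         # P(Y)
p_xy = sum((X | Y) <= t for t in transactions) / n  # P(X and Y)

support = p_xy             # joint probability of X and Y
confidence = p_xy / p_x    # conditional probability P(Y | X)
lift = confidence / p_y    # confidence adjusted for how common Y is
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")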
Example

• In Table 3.2, assume that X = Apple and Y = Banana


BAYES’ THEOREM
• Bayes’ theorem is one of the most important concepts in analytics
since several problems are solved using Bayesian statistics.
• Consider two events A and B. We can write the following two
conditional probabilities
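Specifically, P(B|A) = P(A ∩ B)/P(A) and P(A|B) = P(A ∩ B)/P(B); eliminating the joint probability gives the standard form of Bayes’ theorem: P(B|A) = P(A|B) × P(B) / P(A).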
BAYES’ THEOREM Cont.,
• Bayes’ theorem helps data scientists update the probability of
an event (B) when additional information is provided.
• This makes Bayesian statistics a very attractive technique, since it
helps the decision maker fine-tune his/her belief with every
additional piece of data received.
• The following terminologies are used to describe various components
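In the standard terminology: P(B) is the prior probability (the belief about B before seeing the evidence), P(A|B) is the likelihood of observing the evidence A if B holds, P(A) is the evidence (the marginal probability of A), and P(B|A) is the posterior probability of B after observing A.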
BAYES’ THEOREM Cont.,
Solving the Monty Hall Problem Using Bayes’ Theorem
• Contestants of a game show are shown three doors.
• Behind one of the doors is an expensive item (such as a car or gold); while there
are inexpensive items behind the remaining two doors (such as a goat).
• The contestant is asked to choose one of the doors.
• Assume that the contestant chooses door 1; the game host would then open one
of the remaining two doors. Assume that the game host opens door 2, which has
a goat behind it.
• Now the contestant is given a chance to change his initial choice (from door 1 to
door 3).
• The problem is whether or not the contestant should change his/her initial
choice.
• Note that the contestant is given an option to switch doors irrespective of the item
behind his/her original choice of door.
Monty Hall Problem
Solving the Monty Hall Problem Using Bayes’ Theorem

• Posterior probability: the statistical probability that a hypothesis is true, calculated in the light of relevant observations.
Solving the Monty Hall Problem Using Bayes’ Theorem
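A sketch of the Bayesian argument, under the standard assumption that the host always opens a goat door and chooses at random when both remaining doors hide goats: with priors P(prize behind door i) = 1/3 and the contestant holding door 1, the likelihoods of the host opening door 2 are 1/2, 0, and 1 for the prize being behind doors 1, 2, and 3 respectively. Bayes’ theorem then gives posterior probabilities 1/3 for door 1 and 2/3 for door 3, so switching doubles the chance of winning. A small simulation confirming this:

import random

# Monte Carlo check of the Monty Hall posterior: switching wins ~2/3.
def play(switch: bool) -> bool:
    doors = [0, 1, 2]
    prize = random.choice(doors)
    choice = random.choice(doors)
    # Host opens a door that hides a goat and is not the contestant's pick.
    opened = random.choice([d for d in doors if d != choice and d != prize])
    if switch:
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == prize

trials = 100_000
for switch in (False, True):
    wins = sum(play(switch) for _ in range(trials))
    print(f"switch={switch}: win rate ~ {wins / trials:.3f}")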
Generalization of Bayes’ Theorem
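In standard form: if B1, B2, …, Bn partition the sample space, then for any event A,
P(Bi|A) = P(A|Bi) × P(Bi) / [P(A|B1)P(B1) + P(A|B2)P(B2) + … + P(A|Bn)P(Bn)].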
Random Variable
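In the standard definition, a random variable X is a function that assigns a real number to every outcome in the sample space of a random experiment.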
Discrete Random Variables
• If the random variable X can assume only a finite or countably infinite
set of values, then it is called a discrete random variable.
• There are many situations where the random variable X can
assume only a finite or countably infinite set of values.
Discrete Random Variables Examples
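Standard examples include the number of heads in 10 coin tosses, the number of customers who churn in a quarter, and the number of defective items in a batch.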
Continuous Random Variables
• A random variable X which can take a value from an infinite set of
values is called a continuous random variable.
• Examples of continuous random variables are listed below
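Standard examples include the time between customer arrivals, the percentage return on an investment, and the weight or height of an individual, all of which vary over a continuum.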
Probability Mass Function and Cumulative
Distribution Function of a Discrete
Random Variable
• For a discrete random variable, the probability that the random variable
X takes a specific value xi, P(X = xi), is called the probability mass
function, P(xi).
• That is, a probability mass function is a function that maps each
outcome of a random experiment to a probability.
Representation
Expected Value, Variance, and Standard
Deviation of a Discrete Random Variable
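In standard form, for a discrete random variable X with PMF P(xi): E[X] = Σ xi P(xi), Var(X) = E[X²] − (E[X])², and the standard deviation is σ = √Var(X). A minimal Python sketch, using a fair die as the illustrative example:

import math

# Expected value, variance, and standard deviation of a discrete random
# variable given as {value: probability} -- here a fair six-sided die.
pmf = {x: 1 / 6 for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())              # E[X]
var = sum(x**2 * p for x, p in pmf.items()) - mean**2  # E[X^2] - (E[X])^2
std = math.sqrt(var)
print(f"E[X]={mean:.3f}  Var(X)={var:.3f}  SD(X)={std:.3f}")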
PROBABILITY DENSITY FUNCTION (PDF) AND
CUMULATIVE DISTRIBUTION
FUNCTION (CDF) OF A CONTINUOUS RANDOM
VARIABLE
• The probability density function describes how densely the likelihood of a
continuous random variable X is concentrated in an infinitesimally small
interval around a value x.
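In standard form, for a density function f(x): the probability P(a ≤ X ≤ b) is the area under f(x) between a and b, the CDF F(x) = P(X ≤ x) is the area under f to the left of x, and P(X = x) = 0 for any single value x of a continuous random variable.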
BINOMIAL DISTRIBUTION
• Binomial distribution is one of the most important discrete probability
distributions due to its applications in several contexts.
• A random variable X is said to follow a Binomial distribution when:
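The standard conditions (Bernoulli trials) are:
• The experiment consists of a fixed number n of trials.
• Each trial results in one of only two outcomes, success or failure.
• The probability of success, p, is the same in every trial.
• The trials are independent of one another.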
Probability Mass Function (PMF) of Binomial
Distribution
• The PMF of the Binomial distribution (the probability that the number of
successes will be exactly x out of n trials) is given by
P(X = x) = C(n, x) pˣ (1 − p)ⁿ⁻ˣ, x = 0, 1, …, n
Cumulative Distribution Function (CDF) of
Binomial Distribution
• The CDF of a binomial distribution, F(a), representing the
probability that the random variable X takes a value less than or equal
to a, is given by
F(a) = P(X ≤ a) = Σ C(n, x) pˣ (1 − p)ⁿ⁻ˣ, summed over x = 0, 1, …, a
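A minimal Python sketch of both formulas, using only the standard library; the parameter values are illustrative:

from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    # P(X = x): probability of exactly x successes in n independent trials.
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_cdf(a: int, n: int, p: float) -> float:
    # F(a) = P(X <= a): probability of at most a successes in n trials.
    return sum(binom_pmf(x, n, p) for x in range(a + 1))

# e.g., 10 trials with success probability 0.3:
print(binom_pmf(3, 10, 0.3))  # ~0.2668
print(binom_cdf(3, 10, 0.3))  # ~0.6496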
Mean and Variance of Binomial Distribution
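In standard form, for X ~ Binomial(n, p): E[X] = np and Var(X) = npq, where q = 1 − p.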
Approximation of Binomial Distribution using
Normal Distribution
• If the number of trials (n) in a binomial distribution is large, then it
can be approximated by normal distribution with mean np and
variance npq, where q = 1 - p.
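A minimal sketch comparing the exact binomial CDF with its normal approximation (with the usual +0.5 continuity correction), using only the Python standard library; the parameter values are illustrative:

from math import comb, sqrt
from statistics import NormalDist

n, p, a = 100, 0.4, 45
q = 1 - p

# Exact binomial P(X <= a) versus the N(np, npq) approximation.
exact = sum(comb(n, x) * p**x * q**(n - x) for x in range(a + 1))
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * q)).cdf(a + 0.5)
print(f"exact={exact:.4f}  normal approx={approx:.4f}")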
POISSON DISTRIBUTION
• This is often known as the distribution of rare events.
• Firstly, a Poisson process is where DISCRETE events occur in a
CONTINUOUS, but finite interval of time or space.
• The following conditions must apply:
• For a small interval the probability of the event occurring is proportional to the size
of the interval.
• The probability of more than one occurrence in the small interval is negligible (i.e.
they are rare events). Events must not occur simultaneously
• Each occurrence must be independent of others and must be at random.
• The events are often defects, accidents or unusual natural happenings, such as
earthquakes, where in theory there is no upper limit on the number of events.
• The interval is on some continuous measurement such as time, length or area.
POISSON DISTRIBUTION
• In many situations, we may be interested in calculating the number of
events that may occur over a period of time (or corresponding unit of
measurement).
• For example, the number of cancellations of orders by customers at an e-
commerce portal, the number of customer complaints, the number of cash
withdrawals at an ATM, the number of typographical errors in a book,
the number of potholes on Bangalore roads, etc.
• When we have to find the probability of a given number of events
occurring, we use the Poisson distribution.
• The probability mass function of a Poisson distribution is given by
P(X = x) = e^(−λ) λˣ / x!, x = 0, 1, 2, …
where λ is the mean number of events in the interval.
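A minimal Python sketch of this PMF, using an illustrative rate of λ = 3 events per minute:

from math import exp, factorial

def poisson_pmf(x: int, lam: float) -> float:
    # P(X = x) for a Poisson random variable with mean lam.
    return exp(-lam) * lam**x / factorial(x)

# e.g., probability of exactly 5 ATM withdrawals in a minute when the
# average is 3 per minute:
print(poisson_pmf(5, 3.0))  # ~0.1008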
NORMAL DISTRIBUTION
• Normal distribution, also known as Gaussian distribution, is one of
the most popular continuous distributions in the field of analytics,
especially due to its use in multiple contexts.
• Normal distribution is observed across many naturally occurring
measures such as birth weight, height, intelligence, etc.
• The probability density function is given by
f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²)), −∞ < x < ∞
• The cumulative distribution function F(a) = P(X ≤ a) is the area under
f(x) to the left of a; it has no closed form and is evaluated numerically
or from tables.
Properties of Normal Distribution
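Standard properties include:
• The curve is bell-shaped and symmetric about the mean μ.
• Mean, median, and mode coincide.
• About 68.3%, 95.4%, and 99.7% of values lie within 1, 2, and 3 standard deviations of the mean, respectively.
• The distribution is completely specified by the two parameters μ and σ.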
Standard Normal Variable
• A normal random variable with mean μ = 0 and standard deviation σ = 1 is
called the standard normal variable and is usually represented by Z.
• The probability density function and cumulative distribution function
of a standard normal variable are given by
f(z) = (1/√(2π)) e^(−z²/2), and F(a) = P(Z ≤ a), the area under f(z) to the left of a.
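Any normal variable X with mean μ and standard deviation σ is standardized via Z = (X − μ)/σ. A minimal Python sketch using the standard library's NormalDist, with illustrative values:

from statistics import NormalDist

mu, sigma, x = 100, 15, 130
z = (x - mu) / sigma                  # Z = (X - mu) / sigma
print(f"z = {z:.2f}")                 # 2.00
print(f"P(X <= {x}) = {NormalDist().cdf(z):.4f}")  # ~0.9772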
Hypothesis Testing
The overall procedure, shown on the slide as a flowchart:
1. Formulate H0 and H1.
2. Select an appropriate test.
3. Choose the level of significance, α.
4. Collect data and calculate the test statistic, TSCAL.
5. Determine the probability associated with the test statistic, or determine the critical value of the test statistic, TSCR.
6. Compare the probability with the level of significance α, or determine whether TSCAL falls into the rejection or non-rejection region defined by TSCR.
7. Reject or do not reject H0, and draw the marketing research conclusion.
Step 1: Formulate the Hypothesis
• A null hypothesis is a statement of the status quo, one of no
difference or no effect. If the null hypothesis is not rejected, no
changes will be made.
• An alternative hypothesis is one in which some difference or effect is
expected.
• The null hypothesis refers to a specified value of the population
parameter (e.g., μ or σ), not a sample statistic (e.g., X̄).
Step 1: Formulate the Hypothesis
• A null hypothesis may be rejected, but it can never be accepted based
on a single test.
• In marketing research, the null hypothesis is formulated in such a way
that its rejection leads to the acceptance of the desired conclusion.
• A new Internet Shopping Service will be introduced if more than 40% of
people use it:

H0: π ≤ 0.40
H1: π > 0.40
Step 2: Select an Appropriate Test
• The test statistic measures how close the sample has come to the null
hypothesis.
• The test statistic often follows a well-known distribution (e.g., normal,
t, or chi-square).
• In our example, the z statistic, which follows the standard normal
distribution, would be appropriate.

z = (p̄ − π) / σp, where p̄ is the sample proportion, π the population
proportion under H0, and σp the standard deviation (standard error) of
the sample proportion.
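A minimal sketch of the corresponding one-tailed proportion z-test; the sample size and count below are illustrative assumptions, not the book's example data:

from math import sqrt
from statistics import NormalDist

# One-sample z-test for a proportion: H0: pi <= 0.40 vs H1: pi > 0.40.
n, successes, pi0 = 500, 220, 0.40
p_bar = successes / n                  # sample proportion
sigma_p = sqrt(pi0 * (1 - pi0) / n)    # standard error under H0
z = (p_bar - pi0) / sigma_p
p_value = 1 - NormalDist().cdf(z)      # upper-tail probability
print(f"z = {z:.2f}, one-tailed p = {p_value:.4f}")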
Step 3: Choose Level of Significance
Type I Error
• Occurs if the null hypothesis is rejected when it is in fact true.
• The probability of type I error ( α ) is also called the level of significance.

Type II Error
• Occurs if the null hypothesis is not rejected when it is in fact false.
• The probability of type II error is denoted by β .
• Unlike α, which is specified by the researcher, the magnitude of β
depends on the actual value of the population parameter (proportion).

It is necessary to balance the two types of errors.


Step 3: Choose Level of Significance
Power of a Test
• The power of a test is the probability (1 - β) of rejecting the null
hypothesis when it is false and should be rejected.
• Although β is unknown, it is related to α. An extremely low value of
α (e.g., = 0.001) will result in intolerably high β errors.
Step 4: Collect Data and Calculate Test
Statistic
• The required data are collected and the value of the test statistic
computed.
Step 5: Determine Probability Value/
Critical Value

For example, if zCAL = 1.88, the area under the standard normal curve to
the left of zCAL (the shaded area) is 0.9699, leaving an upper-tail
(unshaded) area of 0.0301; this tail area is the probability associated
with the calculated test statistic.
Steps 6 & 7: Compare Prob and Make the
Decision
• If the prob associated with the calculated value of the test statistic
(zCAL) is less than the level of significance (α), the null hypothesis is
rejected.
• Alternatively, if the calculated value of the test statistic is greater than
the critical value of the test statistic (zα), the null hypothesis is
rejected.
Broad Classification of Hypothesis Tests

Hypothesis tests divide into tests of association and tests of
differences; each can be applied to means or to proportions.
Hypothesis Testing for Differences
Hypothesis Tests
• Parametric tests (metric data)
  • One sample: t test, Z test
  • Two or more samples:
    • Independent samples: two-group t test, Z test
    • Paired samples: paired t test
• Non-parametric tests (nonmetric data)
T-Test
• t-tests are commonly used in inferential statistics for testing a
hypothesis on the basis of a difference between sample means.
• The t-test allows us to analyze one or two sample means, depending on
the type of t-test.
Types of T-Test
• One-sample t-test — compare the mean of one group against the
specified mean generated from a population.
• For example, a manufacturer of mobile phones promises that one of
their models has a battery that supports about 25 hours of video
playback on average.
• To find out if the manufacturer is right, a researcher can sample 15
phones, measure the battery life, and get an average of 23 hours.
• A t-test can then be used to determine whether this difference arose
merely by chance.
Types of T-Test
• Paired sample t-test — compares the means of two measurements
taken from the same individuals, objects, or related units. For
instance, if students completed an additional math course, it would
be interesting to find out whether their results improved after
completing it. One can take a sample from the same
group and use the paired t-test.
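Minimal sketches of both tests, assuming SciPy is available; the sample values are illustrative assumptions:

import numpy as np
from scipy import stats

# One-sample t-test: does mean battery life differ from the claimed 25 hours?
battery = np.array([23, 24, 22, 25, 23, 21, 24, 23, 22, 24, 23, 25, 22, 23, 21])
t_stat, p_val = stats.ttest_1samp(battery, popmean=25)
print(f"one-sample: t = {t_stat:.2f}, p = {p_val:.4f}")

# Paired t-test: the same students' scores before and after the course.
before = np.array([62, 70, 55, 68, 74, 60])
after = np.array([66, 72, 59, 71, 75, 65])
t_stat, p_val = stats.ttest_rel(after, before)
print(f"paired: t = {t_stat:.2f}, p = {p_val:.4f}")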
Formula
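In standard form, the one-sample statistic is t = (x̄ − μ0) / (s / √n), with n − 1 degrees of freedom, where x̄ is the sample mean, μ0 the hypothesized population mean, s the sample standard deviation, and n the sample size; the paired test applies the same formula to the differences between paired observations.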
Assumptions regarding t-test
• There are 5 main assumptions regarding the t-test:
• The data is collected from a representative, randomly selected portion of the
total population.
• Data should follow a continuous or ordinal scale of measurement (e.g.,
grades).
• The means should follow the normal distribution, as should the population;
it is the means and the population, not the raw sample data, as some
people may think, that must be normal.
• (for independent t-test) Independence of the observations. Each subject
should belong to only one group. There is no relationship between the
observations in each group.
• (for an independent t-test with equal variance) Homogeneity of
variances. Homogeneous, or equal, variance exists when the standard
deviations of samples are approximately equal.
Analysis of variance (ANOVA)
• ANOVA, or Analysis of Variance, is a test used to determine
differences between research results from three or more unrelated
samples or groups.
The basic ANOVA situation
• Two variables: one categorical, one quantitative.
• Main question: do the means of the quantitative variable depend on which
group (given by the categorical variable) the individual is in?
• If the categorical variable has only 2 values, a 2-sample t-test applies;
ANOVA allows for 3 or more groups.
Informal Investigation
Graphical investigation:
• side-by-side box plots
• multiple histograms
Whether the differences between the groups are significant depends on:
• the difference in the means
• the standard deviations of each group
• the sample sizes
ANOVA determines the P-value from the F statistic.
Side by Side Boxplots
[Figure: side-by-side boxplots of days (vertical axis, roughly 9 to 13) for treatment groups A, B, and P.]
What does ANOVA do?
At its simplest (there are extensions), ANOVA tests the following hypotheses:
• H0: The means of all the groups are equal.
• Ha: Not all the means are equal.
• ANOVA doesn’t say how or which means differ; one can follow up with
“multiple comparisons”.
• Note: we usually refer to the sub-populations as “groups” when doing
ANOVA.
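A minimal one-way ANOVA sketch, assuming SciPy is available; the three groups below are illustrative data echoing the days-by-treatment example:

from scipy import stats

group_a = [10, 12, 11, 13, 12]
group_b = [11, 13, 12, 14, 13]
group_p = [9, 10, 11, 10, 9]

# H0: all group means are equal; a small p-value rejects H0, and a
# follow-up multiple-comparisons procedure shows which means differ.
f_stat, p_val = stats.f_oneway(group_a, group_b, group_p)
print(f"F = {f_stat:.2f}, p = {p_val:.4f}")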
Assumptions of ANOVA
• Each group is approximately normal: check by looking at histograms
and/or normal quantile plots, or rely on assumptions; ANOVA can handle
some non-normality, but not severe outliers.
• Standard deviations of each group are approximately equal: rule of
thumb, the ratio of the largest to the smallest sample standard
deviation must be less than 2:1.
Normality Check
We should check for normality using:
• assumptions about population
• histograms for each group
• normal quantile plot for each group

With such small data sets, there isn’t a good way to check normality
from the data, but we make the common assumption that physical
measurements of people tend to be normally distributed.
Chi-square test
• A chi-square test is a statistical test used to compare observed results
with expected results. The purpose of this test is to determine if a
difference between observed data and expected data is due to
chance, or if it is due to a relationship between the variables you are
studying.
• The choice of test depends on the type of data: qualitative
(categorical) data calls for the chi-square test, while quantitative
data calls for the t-test.
• The most obvious difference between the chi-square
tests and the other hypothesis tests we have
considered (the t-test) is the nature of the data.
• For chi-square, the data are frequencies rather than
numerical scores.
Chi-squared (χ²) Tests
• The χ² test is used for testing the significance of patterns in
qualitative data.
• The test statistic is based on counts that represent the number of
items that fall in each category.
• The test statistic measures the agreement between actual (observed)
counts and the counts expected under the null hypothesis.
CHI SQUARE FORMULA:
χ² = Σ (O − E)² / E
where O is the observed count and E the expected count in each category.
The χ² distribution takes only positive values.
Steps of chi-square hypothesis testing
• 1. Data: counts or proportions.
• 2. Assumption: a random sample selected from a population.
• 3. H0: no significant difference in proportions (no significant association).
  HA: a significant difference in proportions (a significant association).
• 4. Level of significance:
  • df for the 1st application = k − 1 (k is the number of groups)
  • df for the 2nd and 3rd applications = (columns − 1)(rows − 1)
  • In the 2nd application (contingency table) with df = 1, the tabulated χ² at the 0.05 level is always 3.841.
  • The χ² graph is one-sided (only positive values).
• 5. Apply the appropriate test of significance.
• 6. Statistical decision and 7. Conclusion:
  • If calculated χ² < tabulated χ² (P > 0.05), do not reject H0 (it may be true).
  • If calculated χ² > tabulated χ² (P < 0.05), reject H0 and accept HA.
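Minimal sketches of both chi-square applications, assuming SciPy is available; the counts are illustrative:

from scipy import stats

# Goodness-of-fit: do observed category counts match expected counts?
observed = [48, 35, 17]
expected = [50, 30, 20]
chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"goodness-of-fit: chi2 = {chi2:.2f}, p = {p:.4f}")

# Test of association: 2x2 contingency table (df = 1, tabulated value 3.841).
table = [[30, 10],
         [20, 40]]
chi2, p, df, _ = stats.chi2_contingency(table)
print(f"association: chi2 = {chi2:.2f}, df = {df}, p = {p:.4f}")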
Correlation Analysis
• Correlation analysis is a tremendously useful tool for understanding how
one variable relates to another.

• Take customer experience, for example. Let's say you have the overall
customer satisfaction score you need.
• But then, you want to know how that correlates with other aspects of
the customer experience such as product price, shipping time, or
quality.
Correlation Analysis
• By providing a distinct perspective on which factors impact your
business the most, you can feel more confident in the actions you
take after the report.

• Definition
• Correlation analysis in market research is a statistical method that
identifies the strength of a relationship between two or more
variables. In a nutshell, the process reveals patterns within a
dataset’s many variables.
Correlation Analysis
• It's all about identifying relationships between variables, specifically
in research.
• Using one of several formulas, the end result will be a numerical
output between -1 and +1.
• A value near 0 in a correlation analysis indicates a less meaningful
relationship between Variable A and Variable B.
How to Measure Correlation
• You must first conduct an online survey to analyze the correlation
between two variables. The process includes writing, programming,
and fielding a survey. The results are later used to determine strength
scores.
• You are likely to find a useful application for them in customer
satisfaction surveys, employee surveys, customer experience (CX)
programs, or market surveys.
Step 1. Write the survey
• The first step in running a correlation analysis in market research is
designing the survey. You will need to plan ahead with questions in mind
for the analysis.

• This includes anything that yields data that is numerical or ordinal.

• Think of metrics such as:


• Agreement scales
• Importance scales
• Satisfaction scales
• Money
• Temperature
• Age
Step 2. Program + field the survey
• Once the survey is finalized, you will need to program and test it to
ensure the questions are functioning correctly.
• This is important because mislabeled scales or improper data
validation in the programming will taint the data used for correlation
analysis.
Step 3. Analyze the correlation between 2
variables
• Next, clean the survey data after the target number of responses is
reached. This protects the integrity of the data for analysis.

• The two most common ways to run a correlation include:


• The Pearson r correlation is best used when the relationship between the
variables is linear, both variables are quantitative, and there are no outliers.
• The Spearman rank correlation is best used when you want to see whether one
ranked variable increases or decreases as the other ranked variable increases
(a monotonic relationship).
Pearson r correlation
• Step 1: First, make a chart with the given data (subject, x, and y) and
add three more columns: xy, x², and y².
• Step 2: Now multiply the x and y columns to fill the xy column. For
example, if x is 24 and y is 65, then xy = 24 × 65 = 1560.
• Step 3: Now, take the square of the numbers in the x column and fill the x²
column.
• Step 4: Now, take the square of the numbers in the y column and fill the y²
column.
• Step 5: Now, add up all the values in the columns and put the result at the
bottom. Greek letter sigma (Σ) is the short way of saying summation.
• Step 6: Now, use the formula for Pearson’s correlation coefficient:-
∑xy = 1103
∑x = 74
∑y = 75
∑x² = 1144
∑y² = 1375
n = 5

Put all the values into Pearson’s correlation coefficient formula:

R = [n∑xy − (∑x)(∑y)] / √{[n∑x² − (∑x)²][n∑y² − (∑y)²]}
R = [5(1103) − (74)(75)] / √{[5(1144) − (74)²][5(1375) − (75)²]}
R = −35 / √(244 × 1250)
R = −35 / 552.26
R = −0.0634

The correlation coefficient is −0.063, indicating almost no linear relationship.
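A minimal pure-Python sketch of the same raw-sums formula; the x and y values are illustrative, not the slide's table:

from math import sqrt

def pearson_r(x, y):
    # Pearson's r via the raw-sums formula used in the steps above.
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sx2 - sx**2) * (n * sy2 - sy**2))

x = [24, 30, 35, 40, 45]
y = [65, 70, 72, 78, 85]
print(f"r = {pearson_r(x, y):.3f}")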


Example
• Calculate the correlation coefficient for the following table with the
help of Pearson’s correlation coefficient formula:
Example
• Make a table from the given data and add three more columns of XY,
X², and Y². Also add up all the values in the columns to get ∑xy, ∑x, ∑y,
∑x², and ∑y², with n = 4.
Example
• ∑x = 151, ∑y = 336, ∑xy = 12258, ∑x² = 5625, ∑y² = 28724, n = 4

• Put all the values into Pearson’s correlation coefficient formula:
R = [n∑xy − (∑x)(∑y)] / √{[n∑x² − (∑x)²][n∑y² − (∑y)²]}
R = [4(12258) − (151)(336)] / √{[4(5625) − (151)²][4(28724) − (336)²]}
R = −1704 / √[(−301)(2000)]

• Here n∑x² − (∑x)² = 22500 − 22801 = −301 is negative, which is
impossible for correctly tabulated data (n∑x² ≥ (∑x)² always holds), so
these sums contain a transcription error and no valid R can be computed
from them. A valid correlation coefficient always lies between −1 and +1.
Spearman’s Rank Correlation
• Spearman’s Correlation is a statistical measure of the strength and
direction of the monotonic relationship between two continuous or
ordinal variables.
• The attributes are therefore ranked, or put in the order of their
preference.
• It is denoted by the symbol “rho” (ρ) and can take values between -1 to +1.
• A positive value of rho indicates that there exists a positive relationship
between the two variables, while a negative value of rho indicates a
negative relationship.
• A rho value of 0 indicates no association between the two variables.
Spearman’s Rank Correlation

rs = 1 − (6 Σdᵢ²) / (n(n² − 1))
where,
• rs = Spearman correlation coefficient
• dᵢ = the difference in the ranks given to the two variables’ values for each item
of the data
• n = total number of observations
Steps and Example
• Step 1: Find the ranks of the two variables.
• Step 2: Calculate d² for each pair of ranks.
• Putting the overall sum of d² (20.5) and the n value (n = 10, so
n(n² − 1) = 990) into the formula:
rho/rs = 1 − ((6 × 20.5) / 990)
       = 1 − (123 / 990)
       = 1 − 0.1242
       = 0.8758 ≈ 0.88
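A minimal sketch, assuming SciPy is available; the paired values are illustrative:

from scipy import stats

# Spearman's rho ranks both variables and then correlates the ranks.
x = [35, 23, 47, 17, 10, 43, 9, 6, 28]
y = [30, 33, 45, 23, 8, 49, 12, 4, 31]
rho, p = stats.spearmanr(x, y)
print(f"rho = {rho:.2f}, p = {p:.4f}")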
How to Interpret Correlation Analysis
• Correlation coefficients range from −1 to +1; the larger the absolute
value of the coefficient, the stronger the correlation.
• When the absolute value is greater than 0.7, there is considered to be
a strong correlation between the two variables.
• All correlation strength scores and classifications are outlined below.
• Perfect: 0.80 to 1.00
• Strong: 0.50 to 0.79
• Moderate: 0.30 to 0.49
• Weak: 0.00 to 0.29
Data Interpretation
• Data interpretation is the process of reviewing data and arriving at
relevant conclusions using various analytical research methods.
• Data analysis assists researchers in categorizing, manipulating data,
and summarizing data to answer critical questions.
Common methods to Interpret Data
• Nominal Scale: non-numeric categories that cannot be ranked or
compared quantitatively; categories are mutually exclusive and
exhaustive (e.g., color or other non-quantitative attributes).
• Ordinal Scale: mutually exclusive and exhaustive categories
with a logical order. Quality ratings and agreement ratings are
examples of ordinal scales (i.e., good, very good, fair, etc., OR agree,
strongly agree, disagree, etc.).
• Interval: a measurement scale where data is grouped into categories
with orderly and equal distances between the categories; the zero
point is arbitrary.
• Ratio: contains features of all three, plus a true (absolute) zero point.
Qualitative Data Interpretation
• Observations: detailing behavioral patterns that occur within an
observation group. These patterns could be the amount of time spent in an
activity, the type of activity, and the method of communication employed.
• Focus groups: Group people and ask them relevant questions to generate a
collaborative discussion about a research topic.
• Secondary Research: much like how patterns of behavior can be observed,
various types of documentation resources can be coded and divided based
on the type of material they contain.
• Interviews: one of the best collection methods for narrative data. Inquiry
responses can be grouped by theme, topic, or category. The interview
approach allows for highly-focused data segmentation.
Quantitative Data Interpretation
• Mean: a mean represents a numerical average for a set of responses. When
dealing with a data set (or multiple data sets), a mean will represent a central
value of a specific set of numbers. It is the sum of the values divided by the
number of values within the data set. Other terms that can be used to describe
the concept are arithmetic mean, average and mathematical expectation.
• Standard deviation: this is another statistical term commonly appearing in
quantitative analysis. Standard deviation reveals the distribution of the responses
around the mean. It describes the degree of consistency within the responses;
together with the mean, it provides insight into data sets.
• Frequency distribution: this is a measurement gauging the rate at which a
response appears within a data set. When using a survey, for example,
frequency distribution can determine the number of times a specific
ordinal-scale response appears (i.e., agree, strongly agree, disagree, etc.).
Frequency distribution is particularly useful for determining the degree of
consensus among data points.
Data Interpretation Techniques and Methods
1) Ask the right data interpretation questions
• What are the goals and objectives of my analysis?
• What type of data interpretation method will I use?
• Who will use this data in the future?
• And most importantly, what general question am I trying to answer?

• Once all this information has been defined, you will be ready for the
next step, collecting your data.
2) Collect and assimilate your data
• Once your data is collected, you need to carefully assess it to
understand if the quality is appropriate to be used during a study.
• This means,
• is the sample size big enough?
• Were the procedures used to collect the data implemented correctly?
• Is the date range from the data correct?
• If coming from an external source, is it a trusted and objective one?
• With all the needed information in hand, you are ready to start the
interpretation process, but first, you need to visualize your data.
3) Use the right data visualization type
• Bar chart
• Line chart
• Pie chart
• Tables
4) Start interpreting
• The way you decide to interpret the data will depend solely on the
methods you initially decided to use.
• If you had initial research questions or hypotheses then you should
look for ways to prove their validity.
• If you are going into the data with no defined hypothesis, then start
looking for relationships and patterns that will allow you to extract
valuable conclusions from the information.
5) Keep your interpretation objective
• Being the person closest to the investigation, it is easy to become
subjective when looking for answers in the data. A good way to stay
objective is to show the information to other people related to the
study, for example, research partners or even the people that will use
your findings once they are done.
• Using a visualization tool such as a modern dashboard will make the
interpretation process much easier and more efficient, as the data can
be navigated and manipulated in an easy and organized way.
6) Mark your findings and draw conclusions
Data Interpretation Characteristics
• Data analysis and interpretation, regardless of the method and
qualitative/quantitative status, may include the following
characteristics:
• Data identification and explanation
• Comparing and contrasting data
• Identification of data outliers
• Future predictions
Importance of Data Interpretation
• Informed decision-making
• Anticipating needs with trends identification
• Cost efficiency
• Clear foresight
Scatter plot
• A scatter plot (aka scatter chart, scatter graph) uses dots to represent
values for two different numeric variables. The position of each dot
on the horizontal and vertical axis indicates values for an individual
data point. Scatter plots are used to observe relationships between
variables.
When you should use a scatter plot
• Scatter plots’ primary uses are to observe and show relationships
between two numeric variables. The dots in a scatter plot not only
report the values of individual data points, but also patterns when the
data are taken as a whole.
• Identification of correlational relationships is common with scatter
plots. In these cases, we want to know, if we were given a particular
horizontal value, what a good prediction would be for the vertical
value.
• You will often see the variable on the horizontal axis called the
independent variable, and the variable on the vertical axis the
dependent variable.
When you should use a scatter plot
• Relationships between
variables can be described in
many ways: positive or
negative, strong or weak,
linear or nonlinear.
When you should use a scatter plot
• A scatter plot can also be useful for identifying other patterns in data.
We can divide data points into groups based on how closely sets of
points cluster together.
• Scatter plots can also show if there are any unexpected gaps in the
data and if there are any outlier points.
Common issues when using scatter plots
• Overplotting
• When we have lots of data points to plot, this can run into the issue of overplotting.
Overplotting is the case where data points overlap to a degree where we have
difficulty seeing relationships between points and variables.
• Solution
• There are a few common ways to alleviate this issue. One alternative is to sample
only a subset of data points: a random selection of points should still give the general
idea of the patterns in the full data.
• We can also change the form of the dots, adding transparency to allow for overlaps
to be visible, or reducing point size so that fewer overlaps occur.
• As a third option, we might even choose a different chart type like the heatmap,
where color indicates the number of points in each bin. Heatmaps in this use case
are also known as 2-d histograms.
Interpreting correlation as causation
• Causation indicates that one event is the result of the occurrence of
the other event, i.e., there is a causal relationship between the two
events. A scatter plot can reveal correlation, but correlation alone
does not establish causation.
Common scatter plot options
• Add a trend line
• When a scatter plot is used to
look at a predictive or
correlational relationship
between variables, it is
common to add a trend line to
the plot showing the
mathematically best fit to the
data.
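A minimal Matplotlib sketch of a scatter plot with a least-squares trend line; the data are randomly generated for illustration:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + rng.normal(0, 2, 50)      # roughly linear with noise

slope, intercept = np.polyfit(x, y, 1)  # best-fit line y = mx + b
ends = np.array([x.min(), x.max()])
plt.scatter(x, y, alpha=0.7, label="observations")
plt.plot(ends, slope * ends + intercept, color="red", label="trend line")
plt.xlabel("x (independent variable)")
plt.ylabel("y (dependent variable)")
plt.legend()
plt.show()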
Categorical third variable
• A common modification of the
basic scatter plot is the addition
of a third variable.
• Values of the third variable can
be encoded by modifying how
the points are plotted.
• For a third variable that indicates categorical values (like
geographical region or gender), the most common encodings are point
shape and point color (hue).
Numeric third variable
• For third variables that have numeric values, a common encoding
comes from changing the point size.
• A scatter plot with point size based on a third variable actually goes
by a distinct name, the bubble chart. Larger points indicate higher
values.
Highlight using annotations and color
Related plots
• Scatter map
• When the two variables in a scatter plot are geographical coordinates –
latitude and longitude – we can overlay the points on a map to get a scatter
map (aka dot map). This can be convenient when the geographic context is
useful for drawing particular insights and can be combined with other third-
variable encodings like point size and color.
Heatmap
• As noted above, a heatmap can
be a good alternative to the
scatter plot when there are a lot
of data points that need to be
plotted and their density causes
overplotting issues. However, the
heatmap can also be used in a
similar fashion to show
relationships between variables
when one or both variables are
not continuous and numeric.
Connected scatter plot
• If the third variable we want to
add to a scatter plot indicates
timestamps, then one chart type
we could choose is the
connected scatter plot. Rather
than modify the form of the
points to indicate date, we use
line segments to connect
observations in order.
