Business and Market Research - Unit 4 - Final
Research - Unit 4
By: Tulika
Processing and analysis of data
• The data, after collection, has to be processed and analyzed in accordance with the outline laid
down for the purpose at the time of developing the research plan.
• This is essential for a scientific study and for ensuring that we have all relevant data for making
contemplated comparisons and analysis.
• Technically speaking, processing implies editing, coding, classification and tabulation of
collected data so that they are amenable to analysis.
• The term analysis refers to the computation of certain measures along with searching for
patterns of relationship that exist among data groups.
Software packages used:
• MS Excel
• SPSS (Statistical Package for the Social Sciences), Google Sheets, etc.
Data editing
• Field editing consists in the review of the reporting forms by the investigator for completing (translating or rewriting) what has been written in abbreviated and/or illegible form at the time of recording the
respondents’ responses.
• Central editing should take place when all forms or schedules have been completed and returned
to the office
• This type of editing implies that all forms should get a thorough editing by a single editor in a
small study and by a team of editors in case of a large inquiry. Here they correct errors like entry
in the wrong place, entry in wrong unit etc.
• Consistency (between questions/ values)
• Uniformity (Units / formats differ etc.)
• Completeness (Critical questions answered or not)
• Accuracy (major discrepancies, skewed responses)
• All the wrong answers should be dropped from the final results.
Graphing
https://www.youtube.com/watch?v=DWw1xWIPZW8
Coding
• Coding refers to the process of assigning numerals or other symbols to answers so that responses
can be put into a limited number of categories or classes.
• Coding is necessary for efficient analysis and through it the several replies may be reduced to a
small number of classes which contain the critical information required for the analysis.
• Coding decisions should usually be taken at the designing stage of the questionnaire, which makes it
possible to precode the questionnaire choices, thus making computer tabulation faster later on.
• Coding errors should be altogether eliminated or reduced to the minimum level.
Data coding example
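A minimal sketch, in Python, of how precoded questionnaire responses might be mapped to numeric codes; the question wording, response labels and code values below are illustrative assumptions, not taken from an actual questionnaire.

```python
# Hypothetical coding scheme for one questionnaire item:
# "How satisfied are you with the product?"
satisfaction_codes = {
    "Very dissatisfied": 1,
    "Dissatisfied": 2,
    "Neutral": 3,
    "Satisfied": 4,
    "Very satisfied": 5,
}

# Raw responses as they might appear after data entry
responses = ["Satisfied", "Neutral", "Very satisfied", "Satisfied"]

# Convert the text responses into numeric codes for tabulation/analysis
coded = [satisfaction_codes[r] for r in responses]
print(coded)   # [4, 3, 5, 4]
```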
Classification
• Most research studies result in a large volume of raw data which must be reduced into
homogeneous groups if we are to get meaningful relationships.
• Classification of data happens to be the process of arranging data in groups or classes on the
basis of common characteristics.
• Can be one of the following two types, depending upon the nature of the phenomenon
involved:
• Classification according to attributes
• Classification according to class intervals
Classification according to attributes
• Data are classified on the basis of common characteristics which can either be descriptive (such
as literacy, gender, honesty, etc.) or numerical (such as weight, height, income etc.).
• Descriptive characteristics refer to qualitative phenomenon which cannot be measured
quantitatively; only their presence or absence in an individual item can be noticed.
• Data obtained in this way on the basis of certain attributes are known as statistics of attributes
and their classification is said to be classification according to attributes.
Classification according to class-intervals
• Class limits may generally be stated in any of the following forms:
• Exclusive type class intervals:
• 10-20 (should be read as 10 and under 20)
• 20-30
• 30-40
• 40-50
• Under exclusive type class intervals, the upper limit of a class interval is excluded and items with value less than the
upper limit are put in the given class interval.
• Inclusive type class intervals:
• 11-20 (should be read as 11 and under 21)
• 21-30
• 31-40
• 41-50
• Here, the upper limit of a class interval is also included in that class interval.
• When the variable can be measured and stated only in integers, we should adopt inclusive-type
classification.
• But when the variable is continuous and can be measured in fractions as well, we can use exclusive-type
class intervals.
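A short sketch, assuming pandas is available, of how exclusive-type class intervals (upper limit excluded) might be formed from raw values; the data are invented for illustration.

```python
import pandas as pd

# Hypothetical raw observations
values = pd.Series([12, 19, 20, 27, 33, 40, 41, 48])

# Exclusive-type class intervals: 10-20, 20-30, 30-40, 40-50
# right=False means the upper limit is excluded (10 <= x < 20, and so on)
exclusive_bins = pd.cut(values, bins=[10, 20, 30, 40, 50], right=False)

# Frequency of items falling in each class interval
print(exclusive_bins.value_counts().sort_index())

# Inclusive-type intervals (upper limit included) would use right=True instead.
```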
TABULATION
• When a mass of data has been assembled, it becomes necessary for the researcher to arrange the
same in some kind of concise and logical order.
• This procedure is known as tabulation.
• Thus, tabulation is the process of summarizing raw data and displaying the same in compact form
(i.e., in the form of statistical tables) for further analysis.
• In a broader sense, tabulation is an orderly arrangement of data in columns and rows.
• Tabulation can be done by hand or by mechanical or electronic devices.
• The choice depends on the size and type of study, cost considerations, time pressures and the
availability of tabulating machines or computers.
• Mechanical or electronic tabulation is used in relatively large inquiries; hand tabulation is used in small
inquiries, where the number of questionnaires is small and they are of relatively short length.
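A minimal sketch of simple (one-way) tabulation using only Python's standard library; the coded responses are hypothetical.

```python
from collections import Counter

# Hypothetical coded responses to a single question (1 = Yes, 2 = No, 3 = Don't know)
responses = [1, 2, 1, 1, 3, 2, 1, 2, 1, 3]

# Build a simple frequency table
frequency_table = Counter(responses)
total = len(responses)

for code, count in sorted(frequency_table.items()):
    print(f"Code {code}: {count} responses ({100 * count / total:.0f}%)")
```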
Graphing of data
[Pie chart: "How Do You Spend the Holidays?" - At home with family 45%, Travel to visit family 38%, Other 7%, Vacation 5%, Catching up on work 5%.]
Histogram
A graph of the data in a frequency distribution is called a histogram.
[Histogram: frequency (0 to 4) on the y-axis against class values 5, 15, 25, 35, 45, 55 and "More" on the x-axis.]
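A short sketch, assuming matplotlib is installed, of how a frequency-distribution histogram like the one above could be drawn; the sample data are invented.

```python
import matplotlib.pyplot as plt

# Hypothetical raw observations
data = [8, 12, 15, 22, 24, 27, 31, 33, 35, 38, 42, 47, 51, 55]

# Class intervals 0-10, 10-20, ..., 50-60 (exclusive upper limits)
plt.hist(data, bins=range(0, 70, 10), edgecolor="black")
plt.xlabel("Class interval")
plt.ylabel("Frequency")
plt.title("Histogram of a frequency distribution")
plt.show()
```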
Analysis of Data
4th March
Analysis of Data
Analysis means the computation of certain indices or measures along with searching for patterns of
relationships that exist among the data groups.
https://www.youtube.com/watch?v=MXaJ7sa7q-8&list=PL0KQuRyPJoe6KjlUM6iNYgt8d0DwI-IGR (5 videos)
- Introduction to Statistics
- Bar charts/ Pie Charts/ Histograms
- Mean, Median, Mode, Standard deviation
MEASURES OF CENTRAL TENDENCY
• Measures of central tendency (or statistical averages) tell us the point about which items have a
tendency to cluster.
• Such a measure is considered as the most representative figure for the entire mass of data.
• Mean, median and mode are the most popular averages.
• Mean, also known as arithmetic average, is the most common measure of central tendency and
may be defined as the value which we get by dividing the total of the values of the various given items in
a series by the total number of items.
MEAN
X̄ = (X1 + X2 + … + Xn) / n = ΣXi / n
• Mean is the simplest measure of central tendency and is a widely used measure.
• Its chief use consists in summarizing the essential features of a series and in
enabling data to be compared.
• It is amenable to algebraic treatment and is used in further statistical calculations.
• It is a relatively stable measure of central tendency.
• But it suffers from some limitations, viz., it is unduly affected by extreme items; it
may not coincide with the actual value of any item in the series; and it may lead to
wrong impressions, particularly when the item values are not given along with the
average.
• However, mean is better than other averages, especially in economic and social
studies where direct quantitative measurements are possible.
MEDIAN
• Median is the value of the middle item of series when it is arranged in ascending
or descending order of magnitude.
• It divides the series into two halves: in one half all items are less than the median,
whereas in the other half all items have values higher than the median.
• If the values of the items arranged in ascending order are 60, 74, 80, 88, 90, 95,
100, then the value of the 4th item, viz. 88, is the median.
• We can also write as: Median(M) = Value of ((n+1)/2)th item.
MODE
• Mode is the most commonly or frequently occurring value in the series.
• The mode in a distribution is that item around which there is maximum concentration.
• In general, mode is the size of the item which has the maximum frequency, but at times such an
item may not be mode on account of the effect of the frequencies of the neighboring items.
• Like median, mode is a positional average and is not affected by the value of the extreme items.
• It is therefore, useful in all situations where we want to eliminate the effect of extreme variations.
• Mode is particularly useful in the study of popular sizes, for example, size of the shoe most in
demand.
• However, mode is not amenable to algebraic treatment and sometimes remains indeterminate
when we have two or more modal values in a series.
• It is considered unsuitable in cases where we want to give relative importance to items under
consideration.
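A quick sketch of computing the three averages with Python's standard statistics module; the series of marks is invented (88 appears twice so that a single mode exists).

```python
import statistics

# Hypothetical series of marks
marks = [60, 74, 80, 88, 88, 90, 95, 100]

print(statistics.mean(marks))     # arithmetic average: 84.375
print(statistics.median(marks))   # middle value of the ordered series: 88.0
print(statistics.mode(marks))     # most frequently occurring value: 88
```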
GEOMETRIC MEAN
• Geometric mean is the nth root of the product of the n items in a series: GM = (X1 · X2 · … · Xn)^(1/n). It is useful for averaging ratios, percentages and rates of growth.
HARMONIC MEAN
• Harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the values: HM = n / Σ(1/Xi).
• Harmonic mean is of limited application, particularly in cases where time and rate
are involved.
• It gives largest weight to the smallest item and smallest weight to the largest
item.
• As such, it is used in cases like time and motion study where time is a variable
and distance a constant.
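A tiny sketch of the harmonic mean in a time-and-rate setting; the speeds are invented. When the same distance is covered at different speeds, the harmonic mean gives the correct average speed.

```python
import statistics

# Hypothetical speeds (km/h) over two equal distances
speeds = [40, 60]

# Harmonic mean gives the largest weight to the smallest item
print(statistics.harmonic_mean(speeds))   # 48.0, the true average speed

# The arithmetic mean overstates it
print(statistics.mean(speeds))            # 50.0
```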
MEASURES OF DISPERSION
• An average can represent a series only as best as a single figure can, but it
certainly cannot reveal the entire story of any phenomenon under study.
• In particular, it fails to give any idea about the scatter of the values of items of a
variable in the series around the true value of the average.
• In order to measure this scatter, statistical devices called measures of dispersion
are calculated.
• Important measures of dispersion are : i) range, ii) mean deviation and iii)
standard deviation.
MEASURES OF DISPERSION
• Mean deviation is the average of differences of the values of items from some
average of the series.
• Such a difference is technically described as deviation.
• In calculating the deviation, we ignore the minus sign of the deviations, while
taking their total for obtaining the mean deviation.
MEASURES OF DISPERSION
• When mean deviation is divided by the average used in finding out the mean
deviation itself, the resulting quantity is described as the coefficient of mean
deviation.
• Coefficient of mean deviation is a relative measure of dispersion and is
comparable to similar measure of other series.
• Mean deviation and its coefficient are used in statistical studies for judging the
variability, and thereby render the study of central tendency of a series more
precise by throwing light on the typicalness of the average.
• It is a better measure of variability than the range, as it takes into consideration the
values of all items of a series.
• However, it is not a frequently used measure as it is not amenable to further algebraic
treatment.
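A brief sketch of the mean deviation (about the arithmetic mean) and its coefficient; the data are invented for illustration.

```python
# Hypothetical series
values = [4, 7, 9, 10, 15]

mean = sum(values) / len(values)                                      # 9.0

# Mean deviation: average of absolute deviations from the mean (signs ignored)
mean_deviation = sum(abs(x - mean) for x in values) / len(values)     # 2.8

# Coefficient of mean deviation: mean deviation divided by the average used
coefficient = mean_deviation / mean

print(mean, mean_deviation, round(coefficient, 3))   # 9.0 2.8 0.311
```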
MEASURES OF DISPERSION
• Standard deviation is the most widely used measure of dispersion in a series and is
commonly denoted by the symbol ‘σ’, pronounced as sigma.
• It is defined as the square-root of the average of squares of deviation, when such
deviations for the values of the individual items in a series are obtained from the
arithmetic average.
• When we divide the standard deviation by arithmetic average of the series, the resulting
quantity is known as coefficient of standard deviation, which happens to be the relative
measure and is often used for comparing with similar measure of other series.
• When this coefficient of standard deviation is multiplied by 100, the resulting figure is
known as coefficient of variation.
• Sometimes, the square of the standard deviation, known as variance, is frequently used
in the context of analysis of variation.
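A short sketch, assuming numpy is available, of the standard deviation, variance, coefficient of standard deviation and coefficient of variation for an invented series.

```python
import numpy as np

values = np.array([4, 7, 9, 10, 15])

mean = values.mean()
sigma = values.std()             # population standard deviation (divides by n)
variance = values.var()          # square of the standard deviation

coeff_of_sd = sigma / mean       # relative measure of dispersion
coeff_of_variation = coeff_of_sd * 100

print(round(sigma, 3), round(variance, 2), round(coeff_of_variation, 1))
```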
MEASURES OF DISPERSION
• The standard deviation (along with several other related measures like variance,
coefficient of variation, etc.) is used mostly in research studies and is regarded as a
very satisfactory measure of dispersion in a series.
• It is amenable to mathematical manipulation because the algebraic signs are not
ignored in its calculation.
• It is less affected by the fluctuations in sampling.
• It is popularly used in the context of estimation and testing of hypotheses.
MEASURES OF ASYMMETRY(SKEWNESS)
• Skewness is thus a measure of asymmetry and shows the manner in which the
items are clustered around the average.
• In a symmetrical distribution, the items show a perfect balance on either side of the
mode, but in a skewed distribution, the balance is thrown to one side.
• The amount by which the balance exceeds on one side measures the skewness of
the series.
• The difference between the mean, median and mode provides an easy way of
expressing skewness in a series.
• In case of positive skewness, we have Z < M < X̄ (mode < median < mean), and in case of negative
skewness, we have X̄ < M < Z, where Z denotes the mode, M the median and X̄ the mean.
MEASURES OF ASYMMETRY(SKEWNESS)
• The significance of the skewness lies in the fact that through it one can study the
formation of a series and can have the idea about the shape of the curve, whether
normal or otherwise, when the items of a given series are plotted on a graph.
• Kurtosis is the measure of flat-toppedness of a curve.
• A bell shaped curve or the normal curve is Mesokurtic because it is kurtic in the centre;
but if the curve is relatively more peaked than the normal curve, it is Leptokurtic.
• Similarly, if a curve is more flat than the normal curve, it is called Platykurtic.
• In brief, kurtosis is the humpedness of the curve and points to the nature of the
distribution of items in the middle of a series.
• Knowing the shape of the distribution curve is crucial to the use of statistical methods in
research analysis since most methods make specific assumptions about the nature of the
distribution curve.
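A small sketch, assuming scipy is installed, of how skewness and kurtosis can be computed for an invented right-skewed series.

```python
from scipy.stats import skew, kurtosis

# Hypothetical series with a long right tail (positively skewed)
values = [2, 3, 3, 4, 4, 4, 5, 5, 6, 15]

print(skew(values))                       # > 0 indicates positive skewness

# fisher=False reports kurtosis on the scale where a normal (mesokurtic) curve = 3
print(kurtosis(values, fisher=False))
```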
Data analysis
18 March
STATISTICS IN RESEARCH
The important statistical measures that are used to summarize the survey/research data are:
• Measures of central tendency or statistical averages
• arithmetic average or mean, median and mode, geometric mean and harmonic mean
• Measures of dispersion
• Variance and its square root – the standard deviation, range etc. For comparison purposes, mostly
the coefficient of standard deviation or the coefficient of variation.
• Measures of asymmetry (skewness)
• Measure of skewness and kurtosis are based on mean and mode or on mean and median.
• Other measures of skewness, based on quartiles or on the methods of moments, are also used
sometimes. Kurtosis is also used to measure the peakedness of the curve of frequency distribution.
Measure of dispersion
Measure of central tendency and symmetry
Exercise: Central tendency
• Why do you think the mean price for Delhi is higher than the mean price for Bangalore?
The mean for Delhi is distorted because of one posh restaurant in your dataset, which charges ₹1,800 for
its food.
To get a better picture of the data, you need to calculate the median of this dataset.
• <Refer to excel>
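A quick sketch of the point made above, using hypothetical restaurant prices (the ₹1,800 figure is taken from the exercise; the remaining values are invented).

```python
import statistics

# Hypothetical meal prices (₹) for Delhi restaurants, including one posh outlier
delhi_prices = [250, 300, 350, 400, 450, 1800]

print(statistics.mean(delhi_prices))     # 591.67 - pulled up by the ₹1,800 outlier
print(statistics.median(delhi_prices))   # 375.0 - closer to a typical price
```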
ELEMENTS/TYPES OF ANALYSIS
• Analysis involves estimating the values of unknown parameters of the population and testing of
hypotheses for drawing inferences.
• Analysis may, therefore, be categorized as descriptive analysis and inferential analysis.
• Descriptive analysis gives information about raw data which describes the data in some manner.
• In inferential analysis, inferences and predictions about a larger population are made from the group
(sample) of data in which you are interested.
https://www.youtube.com/watch?v=VHYOuWu9jQI&t=18s
22 Mar
Examples
• Descriptive
• Correlational / Causal: determine whether an increase in temperature (weather conditions) causes an increase in ice cream sales
Analysis of Data
• Descriptive analysis: uni-variate analysis, bivariate analysis, multi-variate analysis
• Inferential analysis:
  • Estimation of parameter values: point estimate, interval estimate
  • Testing of hypotheses: parametric tests, non-parametric tests
Descriptive Analysis
https://www.youtube.com/watch?v=gN0OQ6r78f4 (5 mins)
Uni-Variate Analysis
Univariate analysis refers to the analysis of one variable at a time. The commonest approaches include
frequency tables, charts (bar charts, pie charts, histograms) and measures of central tendency and dispersion.
Bi-Variate Analysis
Bivariate analysis is concerned with the analysis of two variables at a time in order to uncover
whether the two variables are related.
Main types:
• Simple Correlation
• Simple Regression
https://www.youtube.com/watch?v=IA0unflfvQE
Multi-Variate Analysis
Main Types:
• Multiple Correlation
• Multiple Regression
• Multi- ANOVA
https://www.youtube.com/watch?v=AmNqUu_e4nQ
Causal Analysis
• Causal analysis is concerned with the study of how one or more variables affect changes in
another variable.
https://www.youtube.com/watch?v=yfea6z_Y3Ec
Inferential analysis
24 March
Analysis of Data
• Descriptive analysis: uni-variate analysis, bivariate analysis, multi-variate analysis
• Inferential analysis:
  • Estimation of parameter values: point estimate, interval estimate
  • Testing of hypotheses: parametric tests, non-parametric tests
MEASURES OF RELATIONSHIP
• In case of bivariate and multi-variate populations, we often wish to know the relation of the two
and/or more variables in the data to one another.
• We may like to know, for example, whether the number of hours students devote to studies is
somehow related to their family income, to age, to gender or to similar other factors.
• We need to answer the following two types of questions in bivariate and/or multivariate
populations:
• Does there exist association or correlation between the two (or more) variables? If yes, of
what degree?
• Is there any cause and effect relationship between the two variables in case of bivariate
population or between one variable on one side and two or more variables on the other side
in case of multivariate population? If yes, of what degree and in which direction?
MEASURES OF RELATIONSHIP
• The first question is answered by the use of the correlation technique and the second question by
the technique of regression.
• There are several methods of applying the two techniques, but the important ones are as under:
• In case of bivariate population:
• Correlation can be studied through Karl Pearson’s coefficient of correlation.
• Cause and effect relationship can be studied through simple regression equations.
• In case of multivariate population:
• Correlation can be studied through coefficient of multiple correlation
• Cause and effect relationship can be studied through multiple regression equations.
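A minimal sketch, assuming numpy is available, of both techniques on invented data: Karl Pearson's coefficient of correlation and a simple (least-squares) regression line.

```python
import numpy as np

# Hypothetical data: hours of study per week vs. marks obtained
hours = np.array([2, 4, 6, 8, 10])
marks = np.array([50, 58, 65, 74, 80])

# Karl Pearson's coefficient of correlation
r = np.corrcoef(hours, marks)[0, 1]

# Simple regression: marks = a + b * hours, fitted by least squares
b, a = np.polyfit(hours, marks, 1)   # returns slope first, then intercept

print(round(r, 3))                   # close to +1: strong positive correlation
print(round(a, 2), round(b, 2))      # intercept and slope of the regression line
```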
Inferential Analysis
• Inferential analysis is concerned with testing hypotheses and estimating population
values on the basis of sample values.
• It is mainly on the basis of inferential analysis that the task of interpretation (i.e., the task of
drawing inferences and conclusions) is performed.
Point estimates and intervals
• The main purpose of statistics is to test a hypothesis. For example, you might run an experiment and find that a
certain drug is effective at treating headaches. But if you can’t repeat that experiment, no one will take your results
seriously.
• Hypothesis statement: A hypothesis is an educated guess about something in the world around you. It should be
testable, either by experiment or observation.
• If you are going to propose a hypothesis, it’s customary to write a statement. Your statement will look like this:
“If I…(do this to an independent variable)….then (this will happen to the dependent variable).”
For example: If I (decrease the amount of water given to herbs) then (the herbs will increase in size).
A good hypothesis statement should:
• Include an “if” and “then” statement
• Include both the independent and dependent variables.
• Be testable by experiment, survey or other scientifically sound technique.
• Be based on information in prior research (either yours or someone else’s).
• Have design criteria (for engineering or programming projects).
Hypothesis testing
• https://www.youtube.com/watch?v=Q1yu6TQZ79w
• https://www.youtube.com/watch?v=-FtlH4svqx4
Hypothesis testing
• The null and alternative hypotheses are perfect opposites of each other. Hence, they should cover
the entire range of possibilities that the hypothesised parameter can take.
• The null hypothesis always has the following signs: ‘=’ OR ‘≤’ OR ‘≥’
• The alternative hypothesis always has the following signs: ‘≠’ OR ‘>’ OR ‘<’
It is important to note that we always begin with the assumption that the null hypothesis is true.
Then:
• If we have sufficient evidence to prove that the null hypothesis is false, we ‘reject’ it. In this case,
the alternative hypothesis is proved to be true.
• If we do NOT have sufficient evidence to prove that the null hypothesis is false, we ‘fail to reject’
it. In this case, the assumption that the null hypothesis is true remains.
• Remember that in hypothesis testing parlance, we never “prove” the null hypothesis. We can
only say that we ‘fail to reject’ the null hypothesis based on the evidence that we have gathered.
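A short sketch, assuming scipy is installed, of the reject / fail-to-reject logic using a one-sample t-test on invented data (H0: population mean = 22; Ha: mean ≠ 22).

```python
from scipy.stats import ttest_1samp

# Hypothetical sample of student ages
ages = [20, 21, 23, 22, 24, 25, 21, 23, 22, 26]

alpha = 0.05
t_stat, p_value = ttest_1samp(ages, popmean=22)

if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject H0")
```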
Error in testing of Hypothesis
• Type I error is the rejection of a true null hypothesis (also
known as a "false positive" finding or conclusion; example: "an
innocent person is convicted"), while a type II error is the non-
rejection of a false null hypothesis (also known as a "false
negative" finding or conclusion; example: "a guilty person is
not convicted").
• The chance of a Type I error is controlled by the chosen significance (alpha) level:
selecting a lower cut-off (alpha) value reduces the risk of a false positive, though it
increases the risk of a Type II error.
• Type I errors can be thought of as errors of commission, i.e.
the researcher unluckily concludes that something is the fact.
For instance, consider a study where researchers compare a
drug with a placebo. If the patients who are given the drug get
better than the patients given the placebo by chance, it may
appear that the drug is effective, but in fact the conclusion is
incorrect. Conversely, Type II errors are errors of omission: a real
effect goes undetected.
Exercise
Let’s say you are collecting data on student age for your college in order to verify certain claims. For
this, you collect data from a sample of 40 students. Which of the following can function as a pair of
hypotheses?
• Ho: Average age of students = 22 years; Ha: Average age of students < 22 years
• Ho: Average age of students ≠ 23 years; Ha: Average age of students = 23 years
• Ho: Average age of students ≥ 22 years; Ha: Average age of students < 27 years
• Ho: Average age of students ≤ 21 years; Ha: Average age of students > 21 years
• Let’s say you are the COO of a shoe-manufacturing company. An employee has developed a new
sole and claims that incorporating it will decrease the wear after three years of use by more than
9%. Now, suppose you want to test this claim.
• What will be the null and alternative hypotheses in this scenario?
Ho: Decrease in wear after 3 years ≤ 9%; Ha: Decrease in wear after 3 years > 9%
Parametric and Non Parametric
tests
29 March
Parametric / non-parametric tests
These tests depend upon assumptions, typically that the population(s) from which data are
randomly sampled have a normal distribution. Types of parametric tests are:
• Chi square
• t- test
• z- test
• F- test
Z test
• https://www.youtube.com/watch?v=BWJRsY-G8u0
F test
• https://www.youtube.com/watch?v=FlIiYdHHpwU
T test
• https://www.youtube.com/watch?v=0Pd3dc1GcHc
Chi square test
• https://www.youtube.com/watch?v=ZjdBM7NO7bY
Chi-Square Test
Karl Pearson introduced a test to
distinguish whether an observed set
of frequencies differs from a
specified frequency distribution
χ² = Σ (O − E)² / E, where O is the observed frequency and E is the expected frequency.
Chi-Square Test as a Non-Parametric Test
2. As a Test of Goodness of Fit
It enables us to see how well the assumed theoretical
distribution (such as the Binomial distribution, Poisson
distribution or Normal distribution) fits the observed data.
When the calculated value of χ² is less than the table value at a
certain level of significance, the fit is considered to be a good
one, and if the calculated value is greater than the table value,
the fit is not considered to be good.
EXAMPLE
As personnel director, you want to test the perception of fairness of three methods of performance evaluation. Of 180 employees, 63 rated Method 1 as fair, 45 rated Method 2 as fair, and 72 rated Method 3 as fair. At the 0.05 level of significance, is there a difference in perceptions?
SOLUTION
H0: p1 = p2 = p3 = 1/3
H1: At least one proportion is different
α = 0.05; n1 = 63, n2 = 45, n3 = 72; expected frequency for each method = 180/3 = 60

Observed (O)   Expected (E)   (O − E)   (O − E)²   (O − E)²/E
63             60             3         9          0.15
45             60             −15       225        3.75
72             60             12        144        2.40

Test statistic: χ² = 0.15 + 3.75 + 2.40 = 6.3
Degrees of freedom = k − 1 = 2; critical value at α = 0.05 is 5.991.
Decision: 6.3 > 5.991, so reject H0 at the 0.05 significance level.
Conclusion: at least one proportion is different.
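The same goodness-of-fit calculation can be reproduced with scipy (a sketch, assuming scipy is installed); under H0 all three methods are equally likely to be rated fair, so each expected frequency is 180/3 = 60.

```python
from scipy.stats import chisquare

observed = [63, 45, 72]           # employees rating each method as fair
expected = [60, 60, 60]           # equal proportions under H0

chi2, p_value = chisquare(observed, f_exp=expected)
print(round(chi2, 2), round(p_value, 4))   # 6.3, p ≈ 0.043 -> reject H0 at α = 0.05
```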
3. As a Test of Independence
The χ² test enables us to explain whether or not two attributes
are associated. Testing independence determines whether two or
more observations across two populations are dependent on each
other (that is, whether one variable helps to estimate the other).
If the calculated value is less than the table value at a certain level of
significance for a given degree of freedom, we conclude that the null
hypothesis stands, which means that the two attributes are
independent (not associated). If the calculated value is greater than
the table value, we reject the null hypothesis.
Steps involved
• Compute the test statistic: χ² = Σ (O − E)² / E
• Determine degrees of freedom: df = (R − 1)(C − 1)
• Compare the computed test statistic against a tabled/critical value
BIVARIATE FREQUENCY TABLE OR CONTINGENCY TABLE

             Favor   Neutral   Oppose   f row
Democrat     10      10        30       50
Republican   15      15        10       40
f column     25      25        40       n = 90

(The row totals give the row frequencies; the column totals give the column frequencies.)
DETERMINE THE HYPOTHESIS
• H0: There is no difference between Democrats and Republicans in their opinion on the gun control issue.
• Computed test statistic: χ² = 11.03
DETERMINE DEGREES OF FREEDOM
df = (R-1)(C-1) =
(2-1)(3-1) = 2
COMPARE COMPUTED TEST STATISTIC AGAINST TABLE VALUE
α = 0.05
df = 2
Critical tabled value = 5.991
Test statistic, 11.03, exceeds critical value
Null hypothesis is rejected
Democrats & Republicans differ significantly in
their opinions on gun control issues
χ² TEST OF INDEPENDENCE: THINKING CHALLENGE

                       Diet Pepsi
Diet Coke      No      Yes      Total
No             84      32       116
Yes            48      122      170
Total          132     154      286
χ² TEST OF INDEPENDENCE: SOLUTION
• Expected frequencies under independence: E11 = 116·132/286 ≈ 53.5, E12 = 116·154/286 ≈ 62.5, E21 = 170·132/286 ≈ 78.5, E22 = 170·154/286 ≈ 91.5
• χ² = Σ (nij − Eij)² / Eij = (84 − 53.5)²/53.5 + (32 − 62.5)²/62.5 + (48 − 78.5)²/78.5 + (122 − 91.5)²/91.5 ≈ 54.29
• H0: No relationship; H1: Relationship
• α = 0.05; df = (2 − 1)(2 − 1) = 1; critical value = 3.841
• Decision: test statistic χ² = 54.29 > 3.841, so reject H0 at the 0.05 significance level
• Conclusion: there is a relationship between purchasing Diet Coke and Diet Pepsi
χ² TEST OF INDEPENDENCE: THINKING CHALLENGE 2
There is a statistically significant relationship between
purchasing Diet Coke and Diet Pepsi. So what do you
think the relationship is? Aren’t they competitors?
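A sketch of the same test of independence with scipy (assuming it is installed); correction=False turns off Yates' continuity correction so the statistic matches the hand calculation of 54.29.

```python
from scipy.stats import chi2_contingency

# Contingency table: rows = Diet Coke (No, Yes), columns = Diet Pepsi (No, Yes)
observed = [[84, 32],
            [48, 122]]

chi2, p_value, df, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), df, p_value)   # 54.29, df = 1, p << 0.05 -> reject H0
print(expected)                      # expected frequencies under independence
```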
ANOVA (Analysis of Variance)
• https://www.youtube.com/watch?v=-yQb_ZJnFXw
• Statistical technique specially designed to test whether the means of more than
2 quantitative populations are equal.
• ANOVA uses the F-test to determine whether the variability between group
means is larger than the variability of the observations within the groups.
EXAMPLE: A study conducted among men in the 18-25 year age group
in a community to assess the effect of SES on BMI:
• One-way ANOVA: effect of SES on BMI
• Two-way ANOVA: effect of age & SES on BMI
• Three-way ANOVA: effect of age, SES and diet on BMI
ANOVA with repeated measures: comparing ≥ 3 group means where the
participants are the same in each group, e.g. a group of subjects is measured
more than twice, generally over time, such as patients weighed at baseline and
every month after a weight loss program.
Data required
One-way ANOVA or single-factor ANOVA:
• Determines whether the means of ≥ 3 independent groups are
significantly different from one another.
1. State the null and alternative hypotheses
   H0: μ1 = μ2 = … = μk (all sample means are equal)
2. State alpha
3. Calculate degrees of freedom
4. State the decision rule
Calculation of MSC and MSE
• MSC (mean sum of squares between samples) = SSC / (k − 1)
• MSE (mean sum of squares within samples) = SSE / (n − k)
where k = number of samples and n = total number of observations.
Calculation of F statistic
F = variability between groups / variability within groups = MSC / MSE
Worked example. Data (three samples, 5 observations each):

Sample 1: 8, 10, 7, 14, 11 (mean = 10)
Sample 2: 7, 5, 10, 9, 9 (mean = 8)
Sample 3: 12, 9, 13, 12, 14 (mean = 12)

1. Null hypothesis: there is no significant difference in the means of the 3 samples.
2. Grand mean = 10. Sum of squares between samples: SSC = 5[(10 − 10)² + (8 − 10)² + (12 − 10)²] = 40, so MSC = SSC / (k − 1) = 40 / 2 = 20.
3. The sums of squared deviations within each sample (about its own mean) are 30, 16 and 14, so the sum of squares within samples SSE = 30 + 16 + 14 = 60 and MSE = SSE / (n − k) = 60 / 12 = 5.
4. F-statistic = MSC / MSE = 20 / 5 = 4.
5. The table value of F at the 5% level of significance for d.f. (2, 12) is 3.88. The calculated value of F > table value, so H0 is rejected. Hence there is a significant difference in the sample means.
Short-cut method:

        X1     X1²    X2     X2²    X3     X3²
        8      64     7      49     12     144
        10     100    5      25     9      81
        7      49     10     100    13     169
        14     196    9      81     12     144
        11     121    9      81     14     196
Total   50     530    40     336    60     734
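The same one-way ANOVA can be checked with scipy (a sketch, assuming scipy is installed), using the three samples from the worked example.

```python
from scipy.stats import f_oneway

sample1 = [8, 10, 7, 14, 11]
sample2 = [7, 5, 10, 9, 9]
sample3 = [12, 9, 13, 12, 14]

f_stat, p_value = f_oneway(sample1, sample2, sample3)
print(round(f_stat, 2), round(p_value, 4))   # F = 4.0; p < 0.05 -> reject H0
```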
• https://www.youtube.com/watch?v=2B_UW-RweSE
• https://www.youtube.com/watch?v=rR-jptLvhFw
Which test to use?
https://www.youtube.com/watch?v=ulk_JWckJ78
https://www.youtube.com/watch?v=I10q6fjPxJ0&t=350s
Non-parametric tests
• https://www.youtube.com/watch?v=IcLSKko2tsg
Important terms
• A critical region, also known as the rejection region, is a set of values for the test statistic for
which the null hypothesis is rejected. i.e. if the observed test statistic is in the critical region then
we reject the null hypothesis and accept the alternative hypothesis.
• Degrees of freedom equal your sample size minus the number of parameters you need
to calculate during an analysis. It is usually a positive whole number. Degrees of freedom is a
combination of how much data you have and how many parameters you need to estimate.
• The standard error (SE) of a statistic (usually an estimate of a parameter) is the standard
deviation of its sampling distribution or an estimate of that standard deviation. If the statistic is
the sample mean, it is called the standard error of the mean (SEM)
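A tiny sketch of the standard error of the mean for an invented sample, computed both by the formula s/√n and with scipy.

```python
import math
import statistics
from scipy.stats import sem

sample = [12, 15, 11, 14, 13, 16, 12, 15]

s = statistics.stdev(sample)             # sample standard deviation (n - 1 in the denominator)
se_manual = s / math.sqrt(len(sample))   # SEM = s / sqrt(n)

print(round(se_manual, 3), round(sem(sample), 3))   # the two values agree
```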
Using SPSS
• https://www.youtube.com/watch?v=Bku1p481z80&list=RDCMUCwM4EI8mqvsSU
R7Ou1D0qrA&start_radio=1&t=0
SPSS
1 Apr
Basics
• In Excel, you can perform some statistical analysis, but SPSS is more powerful. SPSS has built-in
data manipulation tools such as recoding and transforming variables; doing the same job in Excel
takes a lot more work.
• SPSS allows you to perform complex analytics such as factor analysis, logistic regression, cluster
analysis etc. etc.
• In SPSS every column is one variable; Excel does not treat columns and rows in that way (in
treating columns and rows, SPSS is more similar to Access than to Excel).
• Excel does not give you a paper trail where you can easily replicate the exact steps that you took.
It also starts becoming unwieldy to use when the number of variables and observations starts
getting really large.
Intro
• https://www.youtube.com/watch?v=_zFBUfZEBWQ
Activity 1: Understanding the SPSS Environment
- Demonstrate the 2 views in SPSS and create a data table in SPSS with the following three
variables: Name, Age, Gender.
- Explain each of the variable properties
- Assign 3 sample values
Getting more sample data for exploration
• IBM provides SPSS users with multiple practice datasets right within the SPSS software.
• Click Open in the SPSS window.
• Click on My Computer>Program Files>IBM>SPSS.
• Click on SPSS>Samples>English.
• This will open a list of various datasets (filenames ending with .sav)
• Click on the dataset you wish to use.
Activity 2: Exploring the Data (data analysis)
Running speed and ability is known to be correlated with both gender and
a person's general level of athleticism.
In the sample dataset (provided in Google classroom), there are several variables
relating to this question:
● Gender - The person's physical sex (Male or Female)
● Athlete - Are you an athlete? (Yes/No)
● MileMinDur - Time to run a mile (as a duration variable, hh:mm:ss)
• https://libguides.library.kent.edu/SPSS/CompareMeans
Conclusions
● There were nearly the same number of male non-
athletes and athletes. Among females, there were
more non-athletes than athletes.
● Among the athletes, the difference in average mile
times between males and females was only 14
seconds. Among non-athletes, the difference in
average mile time between males and females was
more than two minutes.
● Within the athlete and non-athlete groups, the
standard deviations are relatively close.
● Among the athletes, the slowest male mile time and
the slowest female mile time were very close
(within fifteen seconds). Among the non-athletes,
the difference between the slowest male mile time
and the slowest female mile time was much greater
(about 1 minute, 40 seconds).
Discussion
What is the difference between dependent and independent variable?
The values of dependent variables depend on the values of independent variables. The
dependent variables represent the output or outcome whose variation is being studied
Where can this analysis be used in a business scenario?
- Understanding Price sensitivity for different groups basis geography, age, gender
- Making customer preference cohorts
Thank you!
Extra Slides
What is Hypothesis?
Characteristics of Hypothesis
Null Hypothesis
𝐻0: 𝜇 = 𝜇0
Alternative Hypothesis
Level of significance and confidence
Risk of rejecting a Null Hypothesis when it is true
• Supercritical: risk α = 0.001 (0.1%), confidence 1 − α = 0.999 (99.9%); more than $100 million (large loss of life, e.g. nuclear disaster)
• Critical: risk α = 0.01 (1%), confidence 1 − α = 0.99 (99%); less than $100 million (a few lives lost)
• Important: risk α = 0.05 (5%), confidence 1 − α = 0.95 (95%); less than $100 thousand (no lives lost, injuries occur)
• Moderate: risk α = 0.10 (10%), confidence 1 − α = 0.90 (90%); less than $500 (no injuries occur)
Type I and Type II Error
• If the null hypothesis is true, accepting it is a correct decision, while rejecting it is a Type I error (α error).
• If the null hypothesis is false, rejecting it is a correct decision, while accepting it is a Type II error (β error).
Two-tailed test @ 5% significance level
• Rejection region (significance level) of α/2 = 0.025 (2.5%) in each tail; total acceptance region (confidence level) of 1 − α = 95% in the centre.
• H0: μ = μ0
Left-tailed test @ 5% significance level
• Rejection region (significance level) of α = 0.05 (5%) in the left tail; total acceptance region (confidence level) of 1 − α = 95%.
• H0: μ = μ0
Right-tailed test @ 5% significance level
• Total acceptance region (confidence level) of 1 − α = 95%; rejection region (significance level) of α = 0.05 (5%) in the right tail.
• H0: μ = μ0
Procedure for Hypothesis Testing
1. State the null (H0) and alternate (Ha) hypotheses.
2. State a significance level: 1%, 5%, 10%, etc.
3. Decide on a test statistic: z-test, t-test, F-test.
4. Calculate the value of the test statistic.
5. If the p-value is less than the significance level (or the calculated value exceeds the critical value), reject H0.
Hypothesis Testing of Means: Z-TEST AND T-TEST
Z-Test for testing means
T-Test for testing means
σs = √( Σ(Xi − X̄)² / (n − 1) )
Hypothesis Testing for Difference Between Means: Z-TEST, T-TEST
Z-Test for testing difference between means
T-Test for testing difference between means
Hypothesis Testing for Comparing Two Related Samples: PAIRED T-TEST
Paired T-Test for comparing two related samples (n = number of matched pairs)
Hypothesis Testing of Proportions: Z-TEST
Z-test for testing of proportions
Hypothesis Testing for Difference Between Proportions: Z-TEST
Z-test for testing difference between proportions
Hypothesis Testing of Equality of Variances of Two Normal Populations: F-TEST
F-Test for testing equality of variances of two normal populations
Limitations of the tests of Hypothesis
Testing of hypotheses is not decision making itself, but an aid to decision making.
A test does not explain the reasons why the difference exists; it only indicates whether
the difference is due to fluctuations of sampling or to other reasons, without telling us
which reason is causing the difference.
Tests are based on probabilities and as such cannot be expressed with full
certainty.
Statistical inferences based on significance tests cannot be said to be
entirely correct evidence concerning the truth of the hypothesis.