Business and Market Research - Unit 4 - Final

The document provides information on processing and analyzing collected data for business and market research. It discusses editing the data by reviewing forms for completeness and accuracy. It also covers coding the data by assigning numbers or symbols to responses, classifying the data into groups based on common attributes, and tabulating the data by summarizing it into tables. Finally, it discusses representing the tabulated data visually through graphs, charts, and other formats.


Business and Market Research - Unit 4
By: Tulika
Processing and analysis of data

• The data, after collection, has to be processed and analyzed in accordance with the outline laid
down for the purpose at the time of developing the research plan.
• This is essential for a scientific study and for ensuring that we have all relevant data for making
contemplated comparisons and analysis.
• Technically speaking, processing implies editing, coding, classification and tabulation of
collected data so that they are amenable to analysis.
• The term analysis refers to the computation of certain measures along with searching for
patterns of relationship that exist among data groups.
Software packages used:
• MS Excel
• SPSS (Statistical Package for the Social Sciences)
• Google Docs, etc.
Data editing

• Field editing consists of the review of the reporting forms by the investigator to complete
(translate or rewrite) abbreviated and/or illegible entries made at the time of recording the
respondents’ responses.
• Central editing takes place when all forms or schedules have been completed and returned
to the office.
• This type of editing implies that all forms get a thorough check by a single editor in a
small study, or by a team of editors in a large inquiry. Here the editors correct errors such as an
entry in the wrong place or an entry in the wrong unit.
• Editors check for:
• Consistency (between questions/values)
• Uniformity (differing units/formats, etc.)
• Completeness (whether critical questions are answered)
• Accuracy (major discrepancies, skewed responses)
• All clearly wrong answers should be dropped from the final results.
Data editing

• Editing is the process of examining the collected raw data (especially in surveys) to
detect errors and omissions and to correct these when possible.
• Furthermore, it involves a careful scrutiny of the completed questionnaires.
• Editing is done to assure that the data are accurate, consistent with other facts
gathered, uniformly entered, as complete as possible, and well arranged to
facilitate coding and tabulation.

Processing flow: Editing (Field editing / Central editing) → Coding → Classification → Tabulation → Graphing

https://www.youtube.com/watch?v=DWw1xWIPZW8
Coding

• Coding refers to the process of assigning numerals or other symbols to answers so that responses
can be put into a limited number of categories or classes.
• Coding is necessary for efficient analysis; through it, the many replies are reduced to a
small number of classes which contain the critical information required for the analysis.
• Coding decisions should usually be taken at the designing stage of the questionnaire, which
makes it possible to pre-code the questionnaire choices, thus making later computer
tabulation faster.
• Coding errors should be eliminated altogether or reduced to the minimum level.
Data coding example
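As a small sketch of what pre-coding can look like in practice, the snippet below maps answer categories to numerals; the questions, categories and codes are all invented for illustration.

```python
# Hypothetical codebook: each answer category is assigned a numeral
# so responses can be put into a limited number of classes.
CODEBOOK = {
    "gender": {"Male": 1, "Female": 2},
    "satisfaction": {"Very dissatisfied": 1, "Dissatisfied": 2,
                     "Neutral": 3, "Satisfied": 4, "Very satisfied": 5},
}

def code_response(question, answer):
    """Return the numeric code for a raw answer; None flags it for editing."""
    return CODEBOOK[question].get(answer)

raw = [("gender", "Female"), ("satisfaction", "Satisfied"),
       ("satisfaction", "Happy")]
coded = [code_response(q, a) for q, a in raw]
print(coded)  # "Happy" is not in the codebook, so it comes back as None
```

An answer that falls outside the codebook surfaces as `None`, which is exactly the kind of entry central editing is meant to catch before tabulation.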
Classification

• Most research studies result in a large volume of raw data which must be reduced into
homogeneous groups if we are to find meaningful relationships.
• Classification of data is the process of arranging data in groups or classes on the
basis of common characteristics.
• It can be one of the following two types, depending upon the nature of the phenomenon
involved:
• Classification according to attributes
• Classification according to class intervals
Classification according to attributes

• Data are classified on the basis of common characteristics which can either be descriptive (such
as literacy, gender, honesty, etc.) or numerical (such as weight, height, income etc.).
• Descriptive characteristics refer to qualitative phenomena which cannot be measured
quantitatively; only their presence or absence in an individual item can be noticed.
• Data obtained in this way on the basis of certain attributes are known as statistics of attributes
and their classification is said to be classification according to attributes.
Classification according to class intervals

• Unlike descriptive characteristics, numerical characteristics refer to quantitative phenomena
which can be measured in some statistical units.
• Data relating to income, production, age, weight, etc. come under this category.
• Such data are known as statistics of variables and are classified on the basis of class intervals.
• For instance, persons whose incomes are within Rs. 201 to Rs. 400 can form one group, those
whose incomes are within Rs. 401 to Rs. 600 can form another group, and so on.
• Each group or class interval thus has an upper limit as well as a lower limit, which are known as
class limits.
• The difference between the two class limits is known as the class magnitude.
• The number of items which fall in a given class is known as the frequency of that class.
Classification according to class intervals
This classification usually involves the following three
main problems:
• How many classes should be there? What should be
their magnitudes? (Typically, 5 to 15 classes, to the
extent possible, class interval should be of equal
magnitude – multiples of 2, 5 and 10 are generally
preferred).
• How to choose class limits? Mid-point of a class-
interval and the actual average of items of that class
interval should remain as close to each other as
possible. Consistent with this, the class limits should
be located at multiples of 2, 5, 10, 20, 100 and such
other figures.
• How to determine the frequency of each class? (by
using tally sheets or mechanical aids)

Classification according to class intervals
• Class limits may generally be stated in any of the following forms:
• Exclusive type class intervals:
• 10-20 (should be read as 10 and under 20)
• 20-30
• 30-40
• 40-50
• Under exclusive type class intervals, the upper limit of a class interval is excluded and items with value less than the
upper limit are put in the given class interval.
• Inclusive type class intervals:
• 11-20 (should be read as 11 up to and including 20)
• 21-30
• 31-40
• 41-50
• Here, the upper limit of a class interval is also included in that class interval.
• When the class can be measured and stated only in integers, then we should adopt inclusive
type classification.
• But when the class is continuous and can be measured in fractions as well, we can use exclusive
type class intervals.

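The exclusive-type rule described above (upper limit excluded) can be sketched in code; the income figures, starting point and class width below are illustrative.

```python
# Sketch: building a frequency distribution with exclusive class intervals,
# where 200-300 means "200 and under 300". Data are invented.
def frequency_distribution(values, lower, width, n_classes):
    freq = {}
    for i in range(n_classes):
        lo = lower + i * width
        freq[(lo, lo + width)] = 0
    for v in values:
        for (lo, hi) in freq:
            if lo <= v < hi:   # upper limit excluded (exclusive type)
                freq[(lo, hi)] += 1
                break
    return freq

incomes = [210, 250, 330, 340, 399, 400, 480, 500, 555, 590]
for (lo, hi), f in frequency_distribution(incomes, 200, 100, 4).items():
    print(f"{lo}-{hi}: {f}")
```

Note how the value 400 falls into the 400-500 class, not 300-400, because the upper limit of each interval is excluded; an inclusive-type scheme would use `lo <= v <= hi` with non-overlapping integer limits instead.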
TABULATION

• When a mass of data has been assembled, it becomes necessary for the researcher to arrange the
same in some kind of concise and logical order.
• This procedure is known as tabulation.
• Thus, tabulation is the process of summarizing raw data and displaying the same in compact form
(i.e., in the form of statistical tables) for further analysis.
• In a broader sense, tabulation is an orderly arrangement of data in columns and rows.
• Tabulation can be done by hand or by mechanical or electronic devices.
• The choice depends on the size and type of study, cost considerations, time pressures and the
availability of tabulating machines or computers.
• Mechanical or electronic tabulation is used in relatively large inquiries; hand tabulation in small
inquiries, where the number of questionnaires is small and they are of relatively short length.

TABULATION

• Tabulation may be classified as simple and complex tabulation.


• Simple tabulation gives information about one or more groups of independent questions,
whereas complex tabulation shows the division of data into two or more categories and as such is
designed to give information concerning one or more sets of inter-related questions.
• Simple tabulation generally results in one-way tables which supply answers to questions about
one characteristic of data only.
• Complex tabulation usually results in two-way tables (which give information about two
interrelated characteristics of data), three-way tables (giving information about three interrelated
characteristics of data) or still higher ordered tables, also known as manifold tables, which supply
information about several interrelated characteristics of data.
• Two-way, three-way and manifold tables are all examples of what is sometimes described as
cross-tabulation.

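A two-way table of the kind described above can be sketched with the standard library alone; the survey fields and categories below are invented for illustration.

```python
# Sketch of a cross-tabulation (two-way table) built from raw coded responses.
from collections import Counter

responses = [
    {"city": "Delhi", "prefers": "Tea"},
    {"city": "Delhi", "prefers": "Coffee"},
    {"city": "Mumbai", "prefers": "Tea"},
    {"city": "Mumbai", "prefers": "Tea"},
    {"city": "Delhi", "prefers": "Tea"},
]

# Count each (row category, column category) pair.
table = Counter((r["city"], r["prefers"]) for r in responses)
cities = sorted({r["city"] for r in responses})
drinks = sorted({r["prefers"] for r in responses})

print("city    " + "  ".join(f"{d:>7}" for d in drinks))
for c in cities:
    print(f"{c:<8}" + "  ".join(f"{table[(c, d)]:>7}" for d in drinks))
```

Each cell gives the frequency of one combination of the two interrelated characteristics, which is exactly what a complex (two-way) tabulation supplies.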
Graphing of data

• Visual representation of data
• Data are presented as absolute numbers or percentages
• Most informative, simple and self-explanatory
Bar chart
In a bar chart, a bar shows each category; the length of the bar represents the
amount, frequency or percentage of values falling into that category.

[Bar chart: “How Do You Spend the Holidays?” — At home with family 45%, Travel to visit family 38%, Vacation 5%, Catching up on work 5%, Other 7%]
Pie chart
The pie chart is a circle broken up into slices that represent categories. The
size of each slice varies according to the percentage in each category.

[Pie chart: “How Do You Spend the Holidays?” — At home with family 45%, Travel to visit family 38%, Vacation 5%, Catching up on work 5%, Other 7%]
Histogram
A graph of the data in a frequency distribution is called a histogram.

[Histogram: “Daily High Temperature” — frequency (0 to 7) plotted against temperature bins starting at 5, 15, 25, 35, 45, 55 and more]
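The histogram idea can be rendered roughly in plain text; the daily-high temperature readings below are invented to match the bins shown on the slide.

```python
# A rough text rendering of a histogram: frequency of readings per bin.
temps = [12, 18, 22, 24, 27, 31, 33, 34, 38, 41, 44, 47, 52, 56, 58]

bins = [(5, 15), (15, 25), (25, 35), (35, 45), (45, 55), (55, 65)]
for lo, hi in bins:
    count = sum(lo <= t < hi for t in temps)       # exclusive upper limit
    print(f"{lo:>2}-{hi:<2} | {'#' * count} ({count})")
```

Each row of `#` marks plays the role of one histogram bar, with adjacent bins covering the whole range of the data.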
Analysis of Data
4th March
Analysis of Data

Analysis means computation of certain indices or measures along with searching for patterns of
relationships that exists among the data groups.

https://www.youtube.com/watch?v=MXaJ7sa7q-8&list=PL0KQuRyPJoe6KjlUM6iNYgt8d0DwI-IGR (5 videos)

- Introduction to Statistics
- Bar charts/ Pie Charts/ Histograms
- Mean, Median, Mode, Standard deviation
MEASURES OF CENTRAL TENDENCY

• Measures of central tendency (or statistical averages) tell us the point about which items have a
tendency to cluster.
• Such a measure is considered as the most representative figure for the entire mass of data.
• Mean, median and mode are the most popular averages.
• Mean, also known as the arithmetic average, is the most common measure of central tendency and
may be defined as the value which we get by dividing the total of the values of the various given items
in a series by the total number of items.
MEAN

• Mean is the simplest measure of central tendency and is a widely used measure.
• Its chief use consists in summarizing the essential features of a series and in
enabling data to be compared.
• It is amenable to algebraic treatment and is used in further statistical calculations.
• It is a relatively stable measure of central tendency.
• But it suffers from some limitations: it is unduly affected by extreme items; it
may not coincide with the actual value of any item in the series; and it may lead to
wrong impressions, particularly when the item values are not given along with the
average.
• However, the mean is better than other averages, especially in economic and social
studies where direct quantitative measurements are possible.
MEDIAN

• Median is the value of the middle item of a series when it is arranged in ascending
or descending order of magnitude.
• It divides the series into two halves: in one half all items are less than the median,
whereas in the other half all items have values higher than the median.
• If the values of the items arranged in ascending order are 60, 74, 80, 88, 90, 95,
100, then the value of the 4th item, viz. 88, is the value of the median.
• We can also write: Median (M) = value of the ((n+1)/2)th item.
MEDIAN

• Median is a positional average and is used only in the context of qualitative
phenomena, for example, in estimating intelligence, etc., which are often
encountered in sociological fields.
• Median is not useful where items need to be assigned relative importance and
weights.
• It is not frequently used in sampling statistics.
MODE
• Mode is the most commonly or frequently occurring value in the series.
• The mode in a distribution is that item around which there is maximum concentration.
• In general, mode is the size of the item which has the maximum frequency, but at times such an
item may not be mode on account of the effect of the frequencies of the neighboring items.
• Like median, mode is a positional average and is not affected by the value of the extreme items.
• It is therefore, useful in all situations where we want to eliminate the effect of extreme variations.
• Mode is particularly useful in the study of popular sizes, for example, size of the shoe most in
demand.
• However, mode is not amenable to algebraic treatment and sometimes remains indeterminate
when we have two or more modal values in a series.
• It is considered unsuitable in cases where we want to give relative importance to the items under
consideration.

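The three averages discussed above can be checked directly with Python's `statistics` module; the seven scores are the ones from the median example, and the price list for the mode is invented.

```python
# Verifying the central-tendency examples with the statistics module.
import statistics

scores = [60, 74, 80, 88, 90, 95, 100]
print(statistics.mean(scores))    # arithmetic average of the series
print(statistics.median(scores))  # the ((n+1)/2)th = 4th item, i.e. 88
print(statistics.mode([350, 350, 450, 480, 220]))  # most frequent value: 350
```

Because the series has an odd number of items (n = 7), the median falls exactly on the 4th item; with an even n, `statistics.median` averages the two middle items instead.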
GEOMETRIC MEAN
HARMONIC MEAN

• Harmonic mean is of limited application, particularly in cases where time and rate
are involved.
• It gives largest weight to the smallest item and smallest weight to the largest
item.
• As such, it is used in cases like time and motion study where time is a variable
and distance a constant.

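Both means are available in the `statistics` module (`geometric_mean` requires Python 3.8+); the growth factors and speeds below are invented, with the speeds chosen to show the classic time-and-rate use of the harmonic mean.

```python
# Sketch: geometric and harmonic means with the statistics module.
import statistics

growth_factors = [1.10, 1.20, 1.05]      # e.g. year-on-year growth multipliers
print(statistics.geometric_mean(growth_factors))

speeds = [40, 60]                        # km/h over two equal distances
print(statistics.harmonic_mean(speeds))  # average speed, not the arithmetic 50
```

Averaging 40 km/h and 60 km/h over equal distances gives a true average speed of 48 km/h, because the harmonic mean gives the largest weight to the smallest item, exactly as described above.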
MEASURES OF DISPERSION

• An average can represent a series only as best as a single figure can, but it
certainly cannot reveal the entire story of any phenomenon under study.
• Specially, it fails to give any idea about the scatter of the values of items of a
variable in the series around the true value of average.
• In order to measure this scatter, statistical devices called measures of dispersion
are calculated.
• Important measures of dispersion are : i) range, ii) mean deviation and iii)
standard deviation.

MEASURES OF DISPERSION

• Range is the simplest possible measure of dispersion and is defined as the
difference between the values of the extreme items of a series.
• Thus,
• Range = (Highest value of an item in a series) – (Lowest value of an item in a series)
• The utility of range is that it gives an idea of the variability very quickly, but the
drawback is that range is affected very greatly by fluctuations of sampling.
• Its value is never stable, being based on only two values of a variable.
• Hence, it is used as a rough measure of variability and is not considered an
appropriate measure in serious studies.
MEASURES OF DISPERSION

• Mean deviation is the average of differences of the values of items from some
average of the series.
• Such a difference is technically described as deviation.
• In calculating the deviation, we ignore the minus sign of the deviations, while
taking their total for obtaining the mean deviation.

MEASURES OF DISPERSION

• When mean deviation is divided by the average used in finding out the mean
deviation itself, the resulting quantity is described as the coefficient of mean
deviation.
• Coefficient of mean deviation is a relative measure of dispersion and is
comparable to similar measure of other series.
• Mean deviation and its coefficient are used in statistical studies for judging the
variability, and thereby render the study of central tendency of a series more
precise by throwing light on the typicalness of the average.
• It is a better measure of variability than the range, as it takes into consideration the
values of all the items of a series.
• However, it is not a frequently used measure, as it is not amenable to algebraic
treatment.

MEASURES OF DISPERSION

• Standard deviation is the most widely used measure of dispersion of a series and is
commonly denoted by the symbol ‘σ’ (sigma).
• It is defined as the square root of the average of the squared deviations, when such
deviations for the values of the individual items in a series are obtained from the
arithmetic average.
• When we divide the standard deviation by the arithmetic average of the series, the resulting
quantity is known as the coefficient of standard deviation, which is a relative
measure and is often used for comparison with similar measures of other series.
• When this coefficient of standard deviation is multiplied by 100, the resulting figure is
known as the coefficient of variation.
• The square of the standard deviation, known as the variance, is frequently used
in the context of analysis of variation.

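The dispersion measures above can all be computed in a few lines; the five values below are invented for illustration.

```python
# Sketch: range, mean deviation, standard deviation and coefficient of variation.
import statistics

values = [10, 12, 15, 18, 20]
mean = statistics.mean(values)                                # 15
value_range = max(values) - min(values)                       # 20 - 10 = 10
mean_dev = sum(abs(v - mean) for v in values) / len(values)   # signs ignored
sigma = statistics.pstdev(values)                             # population std. deviation
coeff_var = sigma / mean * 100                                # coefficient of variation, %

print(value_range, mean_dev, round(sigma, 3), round(coeff_var, 2))
```

Note that `statistics.pstdev` treats the data as the whole population (dividing by n); `statistics.stdev` would divide by n - 1 for a sample, which matters once these measures feed into estimation.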
MEASURES OF DISPERSION

• The standard deviation (along with several other related measures like the variance and the
coefficient of variation) is used mostly in research studies and is regarded as a
very satisfactory measure of dispersion in a series.
• It is amenable to mathematical manipulation because the algebraic signs are not
ignored in its calculation.
• It is less affected by the fluctuations in sampling.
• It is popularly used in the context of estimation and testing of hypotheses.

MEASURES OF ASYMMETRY (SKEWNESS)

• When the distribution of items in a series is perfectly symmetrical, the
distribution has a bell-shaped curve.
• Such a curve is technically described as a normal curve and the related
distribution as a normal distribution.
• In such a perfectly bell-shaped curve, the values of the mean (X̄), the median (M) and
the mode (Z) are all the same, and skewness is altogether absent.
• But if the curve is distorted on the right side, we have positive skewness, and
when the curve is distorted towards the left, we have negative skewness.
MEASURES OF ASYMMETRY (SKEWNESS)

• Skewness is thus a measure of asymmetry and shows the manner in which the
items are clustered around the average.
• In a symmetrical distribution, the items show a perfect balance on either side of the
mode, but in a skewed distribution, the balance is thrown to one side.
• The amount by which the balance exceeds on one side measures the skewness of
the series.
• The difference between the mean, median and mode provides an easy way of
expressing skewness in a series.
• In case of positive skewness, we have Z < M < X̄, and in case of negative
skewness, we have X̄ < M < Z.
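The ordering of mean, median and mode can be checked on a small right-skewed series; the data are invented, and Pearson's first coefficient of skewness, (mean - mode) / σ, is used here as one common numeric summary.

```python
# Sketch: detecting the direction of skewness from mean, median and mode.
import statistics

data = [2, 3, 3, 3, 4, 5, 9, 12]   # a made-up series with a long right tail

mean = statistics.mean(data)        # pulled upwards by the extreme items
median = statistics.median(data)
mode = statistics.mode(data)

# Pearson's first coefficient of skewness: (mean - mode) / standard deviation.
skew = (mean - mode) / statistics.pstdev(data)

print(mean, median, mode)   # mode < median < mean, i.e. positive skewness
print(round(skew, 3))       # a positive value confirms the right skew
```

Because Z < M < X̄ here, the series is positively skewed, matching the rule stated above.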
MEASURES OF ASYMMETRY (SKEWNESS)

• The significance of the skewness lies in the fact that through it one can study the
formation of a series and can have the idea about the shape of the curve, whether
normal or otherwise, when the items of a given series are plotted on a graph.
• Kurtosis is the measure of flat-toppedness of a curve.
• A bell shaped curve or the normal curve is Mesokurtic because it is kurtic in the centre;
but if the curve is relatively more peaked than the normal curve, it is Leptokurtic.
• Similarly, if a curve is more flat than the normal curve, it is called Platykurtic.
• In brief, kurtosis is the humpedness of the curve and points to the nature of the
distribution of items in the middle of a series.
• Knowing the shape of the distribution curve is crucial to the use of statistical methods in
research analysis since most methods make specific assumptions about the nature of the
distribution curve.
Data analysis
18 March
STATISTICS IN RESEARCH

The important statistical measures that are used to summarize the survey/research data are:
• Measures of central tendency or statistical averages
• arithmetic average or mean, median and mode, geometric mean and harmonic mean
• Measures of dispersion
• Variance and its square root – the standard deviation, range etc. For comparison purposes, mostly
the coefficient of standard deviation or the coefficient of variation.
• Measures of asymmetry (skewness)
• Measures of skewness are based on the mean and mode, or on the mean and median.
• Other measures of skewness, based on quartiles or on the method of moments, are also used
sometimes. Kurtosis is used to measure the peakedness of the curve of a frequency distribution.
Measure of dispersion
Measure of central tendency and symmetry
Exercise: Central tendency

Prices in Bangalore (₹)    Prices in Delhi (₹)
480                        500
450                        330
350                        340
220                        1,800
350                        250
-                          210
Suppose you are the manager of a restaurant chain. You are planning to open new outlets in Bangalore and
New Delhi. You know the prices of the similar food from other outlets in each city. You are given the task of
setting competitive prices for your menu. How would you set the price?

What all measures of central tendency will you look at?


• What are the mean prices for Bangalore and Delhi?
₹370 and ₹572, respectively

• Why do you think the mean price for Delhi is higher than the mean price for Bangalore?
The mean for Delhi is distorted because of one posh restaurant in your dataset, which charges ₹1,800 for
its food.

To get a better picture of the data, you need to calculate the median of this dataset.

• What is the median price for Bangalore and Delhi, respectively?


₹350 and ₹335

• What is the mode for Bangalore and Delhi?


Bangalore: ₹350, Delhi: No mode
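The exercise can be reworked in a few lines of Python, using the prices from the table above (the Delhi list has no repeated value, so no mode is computed for it).

```python
# Reworking the restaurant-price exercise with the statistics module.
import statistics

bangalore = [480, 450, 350, 220, 350]
delhi = [500, 330, 340, 1800, 250, 210]

# Means: Delhi's is inflated by the single ₹1,800 outlier.
print(round(statistics.mean(bangalore)), round(statistics.mean(delhi)))

# Medians resist the outlier and give a better picture of typical prices.
print(statistics.median(bangalore), statistics.median(delhi))

# Mode: only Bangalore has a repeated price (₹350).
print(statistics.mode(bangalore))
```

The gap between Delhi's mean (≈ ₹572) and its median (₹335) is exactly the outlier effect discussed above, which is why the median is the safer basis for setting a competitive price here.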
Exercise: Dispersion/ Variability

• <Refer to excel>
ELEMENTS/TYPES OF ANALYSIS

• Analysis involves estimating the values of unknown parameters of the population and testing of
hypotheses for drawing inferences.
• Analysis may, therefore, be categorized as descriptive analysis and inferential analysis.

• Descriptive analysis gives information about raw data which describes the data in some manner.
• In inferential analysis, predictions are made by taking any group of data in which you are
interested.

https://www.youtube.com/watch?v=VHYOuWu9jQI&t=18s
22 Mar
Examples

Which type of analysis is this:


• Analyzing ice cream sales data over the last 12 months to find the average annual sales
and the highest and lowest numbers of ice creams sold.

Descriptive
• Determine if increase in temperature (weather conditions) causes an increase in ice cream sales
Correlational/ Causal
Analysis of Data

Descriptive & Causal Analysis:
• Uni-variate analysis
• Bivariate analysis
• Multivariate analysis

Inferential or Statistical Analysis:
• Estimation of parameter values
  • Point estimate
  • Interval estimate
• Testing of hypotheses
  • Parametric tests
  • Non-parametric tests
Descriptive Analysis

• Descriptive analysis is largely the study of distributions of one variable.
• This study provides us with profiles of companies, work groups, persons and other subjects on
a multitude of characteristics such as size, composition, efficiency, preferences, etc.
• As an example, consider a survey in which the height of 1,000 people is measured. In this case,
the mean would be a very helpful descriptive metric.
• The study of the distribution of variables is termed descriptive analysis.
• If we are studying one variable, it is termed uni-variate analysis; with two variables,
bi-variate analysis; and with three or more variables, multi-variate analysis.

https://www.youtube.com/watch?v=gN0OQ6r78f4 (5 mins)
Uni-Variate Analysis

Univariate analysis refers to the analysis of one variable at a time. The commonest approaches are
as follows:

Frequency tables

Diagrams:
• Bar charts
• Pie charts
• Histograms

Measures of central tendency:
• Arithmetic mean
• Median
• Mode

Measures of dispersion:
• Range
• Mean deviation
• Standard deviation
Bivariate Analysis

Bivariate analysis is concerned with the analysis of two variables at a time in order to uncover
whether the two variables are related

Main types:
• Simple Correlation
• Simple Regression

https://www.youtube.com/watch?v=IA0unflfvQE
Multi-Variate Analysis

Multivariate analysis entails the simultaneous analysis of three or more variables.

Main Types:
• Multiple Correlation
• Multiple Regression
• Multi- ANOVA

https://www.youtube.com/watch?v=AmNqUu_e4nQ
Causal Analysis

• Causal analysis is concerned with the study of how one or more variables affect changes in
another variable.

https://www.youtube.com/watch?v=yfea6z_Y3Ec
Inferential analysis
24 March
Analysis of Data

Descriptive & Causal Analysis:
• Uni-variate analysis
• Bivariate analysis
• Multivariate analysis

Inferential or Statistical Analysis:
• Estimation of parameter values
  • Point estimate
  • Interval estimate
• Testing of hypotheses
  • Parametric tests
  • Non-parametric tests
MEASURES OF RELATIONSHIP

• In case of bivariate and multi-variate populations, we often wish to know the relation of the two
and/or more variables in the data to one another.
• We may like to know, for example, whether the number of hours students devote for studies is
somehow related to their family income, to age, to gender or to similar other factor.
• We need to answer the following two types of questions in bivariate and\or multivariate
populations:
• Does there exist association or correlation between the two (or more) variables? If yes, of
what degree?
• Is there any cause and effect relationship between the two variables in case of bivariate
population or between one variable on one side and two or more variables on the other side
in case of multivariate population? If yes, of what degree and in which direction?

MEASURES OF RELATIONSHIP

• The first question is answered by the use of the correlation technique and the second question by
the technique of regression.
• There are several methods of applying the two techniques, but the important ones are as under:
• In case of bivariate population:
• Correlation can be studied through Karl Pearson’s coefficient of correlation.
• Cause and effect relationship can be studied through simple regression equations.
• In case of multivariate population:
• Correlation can be studied through coefficient of multiple correlation
• Cause and effect relationship can be studied through multiple regression equations.

Correlation / Regression: https://www.youtube.com/watch?v=xTpHD5WLuoA&t=415s


Karl Pearson’s coefficient of correlation (or simple correlation)
• It is the most widely used method for measuring the degree of relationship between two variables.
• It assumes the following:
• That there is a linear relationship between the two variables
• That the two variables are causally related which means that one of the variables is independent and the other one is
dependent.
• A large number of independent causes are operating in both variables so as to produce a normal distribution.
• The value of ‘r’ lies between -1 and +1.
• Positive values of r indicate positive correlation between the two variables (i.e., changes in both the variables
take place in the same direction), whereas negative values of ‘r’ indicate the negative correlation i.e., changes
in the two variables taking place in the opposite direction.
• A zero value of ‘r’ indicates that there is no association between the two variables.
• When r =(+)1, it indicates perfect positive correlation and when it is (-) 1, it indicates a perfect negative
correlation, meaning thereby that variations in independent variables (X) explain 100% of the variations in the
dependent variable (Y).
• For a unit change in the independent variable, if there happens to be a constant change in the dependent
variable in the same direction, then the correlation will be termed as perfect positive.
• But if such change occurs in the opposite direction, the correlation will be termed as perfect negative.
• The value of ‘r’ nearer to + 1 or -1 indicates higher degree of correlation between the two variables.

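The coefficient can be computed from first principles as the covariance of the two series divided by the product of their deviations' root sums of squares; the study-hours and exam-score data below are invented.

```python
# Sketch: Karl Pearson's coefficient of correlation from scratch.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Sum of products of deviations from the two means.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

hours_studied = [1, 2, 3, 4, 5]       # illustrative data
exam_score = [52, 55, 61, 64, 68]
r = pearson_r(hours_studied, exam_score)
print(round(r, 4))   # close to +1: a strong positive linear relationship
```

With these data both variables move in the same direction, so r comes out near +1; a value near 0 would indicate no linear association, and a value near -1 a perfect inverse one.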
Inferential Analysis

• Inferential analysis is concerned with testing hypotheses and estimating population
values based on sample values.
• It is mainly on the basis of inferential analysis that the task of interpretation (i.e., the task of
drawing inferences and conclusions) is performed.
Point estimates and intervals

• Point estimation is the process of finding an approximate value of some parameter, such as
the mean (average), of a population from random samples of the population. The accuracy of
any particular approximation is not known precisely, though probabilistic statements concerning
the accuracy of such numbers as found over many experiments can be constructed.
• Interval estimation, in statistics, is the evaluation of a parameter, for example
the mean (average), of a population by computing an interval, or range of values, within which
the parameter is most likely to be located. Intervals are commonly chosen such that the
parameter falls within them with a 95 or 99 percent probability, called the confidence coefficient.
Hence, the intervals are called confidence intervals; the end points of such an interval are called the
upper and lower confidence limits.
Confidence intervals - https://www.youtube.com/watch?v=KG921rfbTDw
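A 95% confidence interval for a population mean can be sketched with the normal approximation; the sample below is invented, and z = 1.96 is the usual critical value for 95% confidence (with a small sample a t value would strictly be more appropriate).

```python
# Sketch: a 95% confidence interval for a population mean (normal approximation).
import math
import statistics

sample = [22, 25, 19, 24, 23, 26, 21, 20, 24, 22, 25, 23]
n = len(sample)
mean = statistics.mean(sample)                  # the point estimate
se = statistics.stdev(sample) / math.sqrt(n)    # standard error of the mean
z = 1.96                                        # z value for 95% confidence

lower, upper = mean - z * se, mean + z * se     # lower and upper confidence limits
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

The point estimate (the sample mean) sits at the centre of the interval; widening the confidence coefficient to 99% (z ≈ 2.576) would widen the interval correspondingly.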
Hypothesis testing
25 Mar
Hypothesis testing

• The main purpose of statistics is to test a hypothesis. For example, you might run an experiment and find that a
certain drug is effective at treating headaches. But if you can’t repeat that experiment, no one will take your results
seriously.
• Hypothesis statement: A hypothesis is an educated guess about something in the world around you. It should be
testable, either by experiment or observation.
• If you are going to propose a hypothesis, it’s customary to write a statement. Your statement will look like this:
“If I…(do this to an independent variable)….then (this will happen to the dependent variable).”
For example: If I (decrease the amount of water given to herbs) then (the herbs will increase in size).
A good hypothesis statement should:
• Include an “if” and “then” statement
• Include both the independent and dependent variables.
• Be testable by experiment, survey or other scientifically sound technique.
• Be based on information in prior research (either yours or someone else’s).
• Have design criteria (for engineering or programming projects).
Hypothesis testing

• https://www.youtube.com/watch?v=Q1yu6TQZ79w
• https://www.youtube.com/watch?v=-FtlH4svqx4
Hypothesis testing

• The null and alternative hypotheses are perfect opposites of each other. Hence, they should cover
the entire range of possibilities that the hypothesised parameter can take.
• The null hypothesis always has the following signs: ‘=’ OR ‘≤’ OR ‘≥’
• The alternative hypothesis always has the following signs: ‘≠’ OR ‘>’ OR ‘<’
It is important to note that we always begin with the assumption that the null hypothesis is true.
Then:
• If we have sufficient evidence to prove that the null hypothesis is false, we ‘reject’ it. In this case,
the alternative hypothesis is proved to be true.
• If we do NOT have sufficient evidence to prove that the null hypothesis is false, we ‘fail to reject’
it. In this case, the assumption that the null hypothesis is true remains.
• Remember that in hypothesis testing parlance, we never “prove” the null hypothesis. We can
only say that we ‘fail to reject’ the null hypothesis based on the evidence that we have gathered.
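The reject / fail-to-reject logic above can be sketched with a simple one-sample z test under a normal approximation; the claimed mean and the sample values are invented for illustration.

```python
# Sketch of hypothesis testing: one-sample z test (normal approximation).
import math
import statistics

claimed_mean = 22          # H0: population mean = 22;  Ha: population mean != 22
sample = [23.1, 24.0, 22.8, 23.5, 24.2, 23.9, 22.5, 23.7, 24.1, 23.3]

n = len(sample)
z = (statistics.mean(sample) - claimed_mean) / (statistics.stdev(sample) / math.sqrt(n))

# Two-tailed p-value from the standard normal CDF, written via math.erf.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

alpha = 0.05               # significance level chosen before looking at the data
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(round(z, 3), decision)
```

Here the sample mean is far enough from the claimed 22 that the null hypothesis is rejected; had the evidence been weaker, we would only say we fail to reject H0, never that H0 is proved.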
Error in testing of Hypothesis
• Type I error is the rejection of a true null hypothesis (also
known as a "false positive" finding or conclusion; example: "an
innocent person is convicted"), while a type II error is the non-
rejection of a false null hypothesis (also known as a "false
negative" finding or conclusion; example: "a guilty person is
not convicted").
• By selecting a low threshold (cut-off) value and modifying the
alpha (p) level, the quality of the hypothesis test can be
increased.
• Type I errors can be thought of as errors of commission, i.e. the researcher wrongly concludes that something is a fact. For instance, consider a study where researchers compare a drug with a placebo. If the patients who are given the drug get better than the patients given the placebo by chance, it may appear that the drug is effective, but in fact the conclusion is incorrect. Conversely, type II errors are errors of omission: a real effect exists, but the study fails to detect it.
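The meaning of α as the type I error rate can be illustrated by simulation (a sketch, not from the slides): when the null hypothesis is true and we test at α = 0.05, about 5% of repeated experiments will falsely reject it.

```python
# Simulate many experiments where H0 is TRUE (population mean really is 0)
# and count how often a t-test at alpha = 0.05 falsely rejects -- a type I error.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n_experiments = 0.05, 5000

false_positives = 0
for _ in range(n_experiments):
    sample = rng.normal(loc=0.0, scale=1.0, size=30)  # H0 holds by construction
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    if p < alpha:
        false_positives += 1

print(false_positives / n_experiments)  # close to alpha, i.e. around 0.05
```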
Exercise

Let’s say you are collecting data on student age for your college in order to verify certain claims. For
this, you collect data from a sample of 40 students. Which of the following can function as a pair of
hypotheses?
• Ho: Average age of students = 22 years; Ha: Average age of students < 22 years
• Ho: Average age of students ≠ 23 years; Ha: Average age of students = 23 years
• Ho: Average age of students ≥ 22 years; Ha: Average age of students < 27 years
• Ho: Average age of students ≤ 21 years; Ha: Average age of students > 21 years

• Ans: Last option.


Exercise

• Let’s say you are the COO of a shoe-manufacturing company. An employee has developed a new
sole and claims that incorporating it will decrease the wear after three years of use by more than
9%. Now, suppose you want to test this claim.
• What will be the null and alternative hypotheses in this scenario?

Ho: Decrease in wear after 3 years ≤ 9%; Ha: Decrease in wear after 3 years > 9%
Parametric and Non Parametric
tests
29 March
Parametric / non – parametric tests

• Parametric tests assume a normal distribution of values, or a “bell-shaped curve.” For example, height is roughly normally distributed: if you were to graph the heights of a group of people, you would see a typical bell-shaped curve. Nonparametric tests are used in cases where parametric tests are not appropriate.
• If the mean more accurately represents the center of the distribution of your data, and your sample size is large enough, use a parametric test. If the median more accurately represents the center of the distribution of your data, use a nonparametric test even if you have a large sample size.
PARAMETRIC TESTS

These tests depend on assumptions, typically that the population(s) from which data are randomly sampled have a normal distribution. Types of parametric tests are:
• Chi square
• t- test
• z- test
• F- test
Z test

• https://www.youtube.com/watch?v=BWJRsY-G8u0
F test

• https://www.youtube.com/watch?v=FlIiYdHHpwU
T test

• https://www.youtube.com/watch?v=0Pd3dc1GcHc
Chi square test

• https://www.youtube.com/watch?v=ZjdBM7NO7bY
Chi-Square Test
Karl Pearson introduced a test to distinguish whether an observed set of frequencies differs from a specified frequency distribution. The chi-square test uses frequency data to generate a statistic.
A chi-square test is a statistical test commonly used for testing independence and goodness of fit.
• Testing independence determines whether two or more observations across two populations are dependent on each other (that is, whether one variable helps to estimate the other).
• Testing for goodness of fit determines if an observed frequency distribution matches a theoretical frequency distribution.
Conditions for the application of the χ² test
• Observations are recorded and collected on a random basis.
• All items in the sample must be independent.
• No group should contain very few items, say less than 10. Some statisticians take this number as 5, but 10 is regarded as better by most statisticians.
• The total number of items should be large, say at least 50.
The χ² distribution is not symmetrical and all the values are positive. For each number of degrees of freedom we have a different asymmetric curve.
1. Test for comparing variance

χ² = (n − 1) σs² / σp²,  with (n − 1) degrees of freedom
Chi-Square Test as a Non-Parametric Test
• Test of Goodness of Fit.
• Test of Independence.

χ² = Σ [ (O − E)² / E ]

where O = observed frequency and E = expected frequency.
2. As a Test of Goodness of Fit
It enables us to see how well the assumed theoretical distribution (such as the Binomial, Poisson or Normal distribution) fits the observed data. When the calculated value of χ² is less than the table value at a certain level of significance, the fit is considered to be a good one; if the calculated value is greater than the table value, the fit is not considered to be good.
EXAMPLE
As personnel director, you want to test the perception of fairness of three methods of performance evaluation. Of 180 employees, 63 rated Method 1 as fair, 45 rated Method 2 as fair, and 72 rated Method 3 as fair. At the 0.05 level of significance, is there a difference in perceptions?
SOLUTION

Observed frequency | Expected frequency | (O − E) | (O − E)² | (O − E)²/E
       63          |        60          |    3    |     9    |   0.15
       45          |        60          |  −15    |   225    |   3.75
       72          |        60          |   12    |   144    |   2.40
                                                     χ² total:   6.30

H0: p1 = p2 = p3 = 1/3
H1: At least 1 is different
α = 0.05;  n1 = 63, n2 = 45, n3 = 72
Test statistic: χ² = 6.3
Critical value (df = 2): 5.991
Decision: 6.3 > 5.991, so reject H0 at significance level 0.05
Conclusion: at least 1 proportion is different
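The same goodness-of-fit calculation can be checked with `scipy.stats.chisquare` (a sketch: the observed counts are taken from the example above, and the expected counts are 60 each under H0):

```python
# Goodness-of-fit test for the performance-evaluation example:
# observed ratings 63/45/72 vs. expected 60/60/60 under H0: p1 = p2 = p3 = 1/3
from scipy.stats import chisquare

observed = [63, 45, 72]
expected = [60, 60, 60]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(round(stat, 2))   # 6.3, matching the hand calculation
print(p_value < 0.05)   # True -> reject H0 at the 0.05 level
```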
3. As a Test of Independence
The χ² test enables us to explain whether or not two attributes are associated. Testing independence determines whether two or more observations across two populations are dependent on each other (that is, whether one variable helps to estimate the other). If the calculated value is less than the table value at a certain level of significance for a given degree of freedom, we conclude that the null hypothesis stands, which means that the two attributes are independent, i.e. not associated. If the calculated value is greater than the table value, we reject the null hypothesis.
Steps involved
1. Determine the hypothesis:
   Ho: The two variables are independent
   Ha: The two variables are associated
2. Calculate the expected frequencies.
3. Calculate the test statistic: χ² = Σ [ (O − E)² / E ]
4. Determine degrees of freedom: df = (R − 1)(C − 1)
Compare computed test statistic against a tabled/critical value
• The computed value of the Pearson chi-square statistic is compared with the critical value to determine if the computed value is improbable.
• The critical tabled values are based on sampling distributions of the Pearson chi-square statistic.
• If the calculated χ² is greater than the χ² table value, reject Ho.
Critical values of χ² (table of critical values shown on slide)
EXAMPLE
• Suppose a researcher is interested in voting preferences on gun control issues.
• A questionnaire was developed and sent to a random sample of 90 voters.
• The researcher also collects information about the political party membership of the sample of 90 respondents.
BIVARIATE FREQUENCY TABLE OR CONTINGENCY TABLE

             Favor   Neutral   Oppose   f row
Democrat       10       10        30      50
Republican     15       15        10      40
f column       25       25        40    n = 90

The row totals (f row) are the row frequencies; the column totals (f column) are the column frequencies.
DETERMINE THE HYPOTHESIS
• Ho: There is no difference between Democrats & Republicans in their opinion on the gun control issue.
• Ha: There is an association between responses to the gun control survey and party membership in the population.
CALCULATING TEST STATISTICS

             Favor        Neutral      Oppose       f row
Democrat     fo = 10      fo = 10      fo = 30        50
             fe = 13.9    fe = 13.9    fe = 22.2
Republican   fo = 15      fo = 15      fo = 10        40
             fe = 11.1    fe = 11.1    fe = 17.8
f column       25           25           40         n = 90

Each expected frequency is (row total × column total) / n; for example, the Republican–Favor cell has fe = 40 × 25 / 90 = 11.1.
χ² = (10 − 13.89)²/13.89 + (10 − 13.89)²/13.89 + (30 − 22.2)²/22.2
   + (15 − 11.11)²/11.11 + (15 − 11.11)²/11.11 + (10 − 17.8)²/17.8
   = 11.03
DETERMINE DEGREES OF FREEDOM
df = (R − 1)(C − 1) = (2 − 1)(3 − 1) = 2
COMPARE COMPUTED TEST STATISTIC AGAINST TABLE VALUE
• α = 0.05, df = 2
• Critical tabled value = 5.991
• Test statistic, 11.03, exceeds the critical value
• Null hypothesis is rejected
• Democrats & Republicans differ significantly in their opinions on gun control issues
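The same test of independence can be reproduced with `scipy.stats.chi2_contingency` (a sketch using the contingency table from the example; scipy computes the expected frequencies for you):

```python
# Test of independence for the gun-control example:
# rows = party (Democrat, Republican), columns = Favor / Neutral / Oppose
from scipy.stats import chi2_contingency

observed = [
    [10, 10, 30],   # Democrat
    [15, 15, 10],   # Republican
]

stat, p_value, df, expected = chi2_contingency(observed)
print(round(stat, 2))   # ~11.03, matching the hand calculation
print(df)               # 2 = (2-1)*(3-1)
print(p_value < 0.05)   # True -> reject H0: party and opinion are associated
```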
χ² TEST OF INDEPENDENCE THINKING CHALLENGE
You’re a marketing research analyst. You ask a random sample of 286 consumers if they purchase Diet Pepsi or Diet Coke. At the 0.05 level of significance, is there evidence of a relationship?

                 Diet Pepsi
Diet Coke     No     Yes    Total
No            84      32     116
Yes           48     122     170
Total        132     154     286
χ² TEST OF INDEPENDENCE SOLUTION

Eij ≥ 5 in all cells. Each expected count is (row total × column total) / 286:

                    Diet Pepsi
Diet Coke     No (Obs., Exp.)    Yes (Obs., Exp.)    Total
No            84, 53.5           32, 62.5             116
Yes           48, 78.5           122, 91.5            170
Total         132                154                  286

For example, E(No, No) = 116 × 132 / 286 = 53.5 and E(Yes, Yes) = 170 × 154 / 286 = 91.5.
χ² = Σ over all cells (nij − Eij)² / Eij
   = (84 − 53.5)²/53.5 + (32 − 62.5)²/62.5 + (48 − 78.5)²/78.5 + (122 − 91.5)²/91.5
   = 54.29
H0: No Relationship
H1: Relationship
Test statistic: χ² = 54.29
α = 0.05; df = (2 − 1)(2 − 1) = 1
Critical value: 3.841
Decision: 54.29 > 3.841, so reject H0 at significance level 0.05
Conclusion: there is evidence of a relationship
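This 2×2 result can also be checked in scipy (a sketch; `correction=False` disables the Yates continuity correction that scipy applies to 2×2 tables by default. scipy gives ≈54.1, slightly below the slide's 54.29, because the hand calculation used expected counts rounded to one decimal):

```python
# Test of independence for the Diet Coke / Diet Pepsi purchase data
from scipy.stats import chi2_contingency

observed = [
    [84, 32],    # Diet Coke: No  (Pepsi No, Pepsi Yes)
    [48, 122],   # Diet Coke: Yes
]

stat, p_value, df, expected = chi2_contingency(observed, correction=False)
print(df)               # 1
print(stat > 3.841)     # True -> reject H0 at the 0.05 level
```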
χ² TEST OF INDEPENDENCE THINKING CHALLENGE 2
There is a statistically significant relationship between purchasing Diet Coke and Diet Pepsi. So what do you think the relationship is? Aren’t they competitors?

                 Diet Pepsi
Diet Coke     No     Yes    Total
No            84      32     116
Yes           48     122     170
Total        132     154     286
YOU RE-ANALYZE THE DATA

High Income        Diet Pepsi
Diet Coke     No     Yes    Total
No             4      30      34
Yes           40       2      42
Total         44      32      76

Low Income         Diet Pepsi
Diet Coke     No     Yes    Total
No            80       2      82
Yes            8     120     128
Total         88     122     210

Data mining example: no need for statistics here!
Anova

• https://www.youtube.com/watch?v=-yQb_ZJnFXw
• Statistical technique specially designed to test whether the means of more than 2 quantitative populations are equal
• ANOVA uses the F-test to determine whether the variability between group means is larger than the variability of the observations within the groups
EXAMPLE: Study conducted among men of age group 18-25 year
in community to assess effect of SES on BMI

Lower SES Middle SES Higher SES


18,17,18,19,19 22,25,24,26,24,21 25,26,24,28,29
N1= 5 N2= 6 N3= 5
Mean=18.2 Mean= 23.6 Mean=26.4
ANOVA

One way ANOVA            Two way ANOVA                Three way ANOVA
Effect of SES on BMI     Effect of age & SES on BMI   Effect of age, SES & diet on BMI

ANOVA with repeated measures: comparing ≥3 group means where the participants are the same in each group, e.g. a group of subjects is measured more than twice, generally over time, such as patients weighed at baseline and every month after a weight loss program.
Data required
One way ANOVA or single factor ANOVA:
• Determines whether the means of ≥3 independent groups are significantly different from one another.
• Only 1 independent variable (factor/grouping variable) with ≥3 levels
• Grouping variable: nominal
• Outcome variable: interval or ratio
Post hoc tests help determine where the differences exist.
Assumptions

1) Normality: The values in each group are normally distributed.
2) Homogeneity of variances: The variance within each group should be equal for all groups.
3) Independence of error: The error (variation of each value around its own group mean) should be independent for each value.
Steps
1. State null & alternative hypotheses
2. State alpha
3. Calculate degrees of freedom
4. State decision rule
5. Calculate test statistic
   - Calculate variance between samples
   - Calculate variance within the samples
   - Calculate F statistic
1. State null & alternative hypotheses
   H0: μ1 = μ2 = … = μk (all sample means are equal)
   Ha: not all of the μi are equal (at least one sample has a different mean)

2. State alpha, i.e. 0.05

3. Calculate degrees of freedom: k − 1 and n − k
   k = no. of samples, n = total no. of observations

4. State decision rule
   If the calculated value of F > table value of F, reject Ho
5. Calculate test statistic


Calculating variance between samples
1. Calculate the mean of each sample.
2. Calculate the Grand average
3. Take the difference between means of various samples &
grand average.
4. Square these deviations & obtain total which will give sum of
squares between samples (SSC)
5. Divide the total obtained in step 4 by the degrees of freedom
to calculate the mean sum of square between samples (MSC).
Calculating Variance within Samples
1. Calculate mean value of each sample
2. Take the deviations of the various items in a sample from the
mean values of the respective samples.
3. Square these deviations & obtain total which gives the sum
of square within the samples (SSE)
4. Divide the total obtained in 3rd step by the degrees of freedom
to calculate the mean sum of squares within samples (MSE).
The mean sum of squares

Mean sum of squares between samples:   MSC = SSC / (k − 1)
Mean sum of squares within samples:    MSE = SSE / (n − k)

k = no. of samples, n = total no. of observations
Calculation of F statistic

F = Variability between groups / Variability within groups

F-statistic = MSC / MSE

Compare the F-statistic with the F(critical) value, which is obtained by looking it up in F distribution tables against the degrees of freedom. If the calculated value of F > table value, H0 is rejected.

When the between-group variance is large relative to the within-group variance, the F statistic will be large and exceed the critical value, and is therefore statistically significant. Conclusion: at least one of the group means is significantly different from the other group means.

When the within-group variance is larger and the between-group variance smaller, F will be smaller (reflecting the likelihood of no significant differences between these sample means).
One way ANOVA: Table

Source of Variation | SS (Sum of Squares) | Degrees of Freedom | MS (Mean Square)    | Variance Ratio F
Between Samples     | SSC                 | k − 1              | MSC = SSC/(k − 1)   | MSC/MSE
Within Samples      | SSE                 | n − k              | MSE = SSE/(n − k)   |
Total               | SS(Total)           | n − 1              |                     |
Example: one way ANOVA
3 samples obtained from normal populations with equal variances. Test the hypothesis that the sample means are equal:
8 7 12
10 5 9
7 10 13
14 9 12
11 9 14
1. Null hypothesis:
   no significant difference in the means of the 3 samples

2. State alpha, i.e. 0.05

3. Calculate degrees of freedom:
   k − 1 & n − k = 2 & 12

4. State decision rule:
   The table value of F at the 5% level of significance for d.f. 2 & 12 is 3.88.
   If the calculated value of F > 3.88, H0 will be rejected.
5. Calculate test statistic
X1 X2 X3
8 7 12
10 5 9
7 10 13
14 9 12
11 9 14
Total 50 40 60
M1= 10 M2 = 8 M3 = 12
• Grand average = (10 + 8 + 12) / 3 = 10
Variance BETWEEN samples (M1 = 10, M2 = 8, M3 = 12)
Sum of squares between samples (SSC) =
n1(M1 − Grand avg)² + n2(M2 − Grand avg)² + n3(M3 − Grand avg)²
= 5(10 − 10)² + 5(8 − 10)² + 5(12 − 10)² = 40

Calculation of mean sum of squares between samples (MSC):
MSC = SSC / (k − 1) = 40 / 2 = 20
k = no. of samples, n = total no. of observations
Variance WITH IN samples (M1=10, M2=8,M3=12)
X1 (X1 – M1)2 X2 (X2– M2)2 X3 (X3– M3)2

8 4 7 1 12 0
10 0 5 9 9 9
7 9 10 4 13 1
14 16 9 1 12 0
11 1 9 1 14 4
30 16 14
Sum of squares within samples (SSE) = 30 + 16 + 14 = 60
Calculation of mean sum of squares within samples (MSE):
MSE = SSE / (n − k) = 60 / 12 = 5
Calculation of ratio F

F = Variability between groups / Variability within groups

F-statistic = MSC / MSE = 20 / 5 = 4

The table value of F at the 5% level of significance for d.f. 2 & 12 is 3.88. The calculated value of F > table value, so H0 is rejected. Hence there is a significant difference in the sample means.
Short cut method -
X1 (X1) 2 X2 (X2 )2 X3 (X3 )2

8 64 7 49 12 144
10 100 5 25 9 81
7 49 10 100 13 169
14 196 9 81 12 144
11 121 9 81 14 196
Total 50 530 40 336 60 734
Total sum of all observations = 50 + 40 + 60 = 150
Correction factor = T²/N = (150)²/15 = 22500/15 = 1500
Total sum of squares = 530 + 336 + 734 − 1500 = 100
Sum of squares between samples = (50)²/5 + (40)²/5 + (60)²/5 − 1500 = 40
Sum of squares within samples = 100 − 40 = 60
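The whole worked example can be verified with `scipy.stats.f_oneway` (a sketch using the three samples above):

```python
# One-way ANOVA for the worked example: three samples of five observations each
from scipy.stats import f_oneway

x1 = [8, 10, 7, 14, 11]   # mean 10
x2 = [7, 5, 10, 9, 9]     # mean 8
x3 = [12, 9, 13, 12, 14]  # mean 12

f_stat, p_value = f_oneway(x1, x2, x3)
print(round(f_stat, 2))   # 4.0, matching MSC/MSE = 20/5
print(p_value < 0.05)     # True -> reject H0: not all means are equal
```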
Pearson’s R

• https://www.youtube.com/watch?v=2B_UW-RweSE
• https://www.youtube.com/watch?v=rR-jptLvhFw
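A minimal sketch of computing Pearson's r in Python (illustrative made-up data, e.g. advertising spend vs. sales, not from the slides):

```python
# Pearson correlation between two hypothetical variables
from scipy.stats import pearsonr

ad_spend = [10, 20, 30, 40, 50]
sales    = [25, 40, 58, 70, 92]

r, p_value = pearsonr(ad_spend, sales)
print(round(r, 3))   # close to +1: a strong positive linear relationship
```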
Which test to use?

• What type of data


• How many different groups / samples will your data have?
• What is the hypothesis test supposed to do?

https://www.youtube.com/watch?v=ulk_JWckJ78
https://www.youtube.com/watch?v=I10q6fjPxJ0&t=350s
Non-parametric tests

• https://www.youtube.com/watch?v=IcLSKko2tsg
Important terms

• A critical region, also known as the rejection region, is a set of values for the test statistic for
which the null hypothesis is rejected. i.e. if the observed test statistic is in the critical region then
we reject the null hypothesis and accept the alternative hypothesis.

• Degrees of freedom equal your sample size minus the number of parameters you need
to calculate during an analysis. It is usually a positive whole number. Degrees of freedom is a
combination of how much data you have and how many parameters you need to estimate.

• The standard error (SE) of a statistic (usually an estimate of a parameter) is the standard
deviation of its sampling distribution or an estimate of that standard deviation. If the statistic is
the sample mean, it is called the standard error of the mean (SEM)
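The standard error of the mean can be sketched in Python; `scipy.stats.sem` divides the sample standard deviation by √n (illustrative data):

```python
# Standard error of the mean (SEM) = sample std dev / sqrt(n)
import math
import statistics
from scipy import stats

data = [12, 15, 11, 14, 13, 16, 12, 15]

manual_sem = statistics.stdev(data) / math.sqrt(len(data))
scipy_sem = stats.sem(data)          # same computation (ddof=1 by default)

print(abs(manual_sem - scipy_sem) < 1e-12)  # True: the two agree
```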
Using SPSS

• https://www.youtube.com/watch?v=Bku1p481z80&list=RDCMUCwM4EI8mqvsSU
R7Ou1D0qrA&start_radio=1&t=0
SPSS
1 Apr
Basics

What does SPSS stand for?


- SPSS means “Statistical Package for the Social Sciences” and was first launched in 1968.
Since SPSS was acquired by IBM in 2009, it's officially known as IBM SPSS Statistics but
most users still just refer to it as “SPSS”.
Which are the 2 views of data available in SPSS?
- Data and variable
What is the extension for saving SPSS files?
- .sav
What are the guidelines for naming variables in SPSS?
- No Spaces, Cannot begin with a special character, cannot begin with a number
Differences between excel and SPSS

• In Excel, you can perform some Statistical analysis but SPSS is more powerful. SPSS has built-in
data manipulation tools such as recoding, transforming variables, and in Excel, you have a lot of
work if you want to do that job.
• SPSS allows you to perform complex analytics such as factor analysis, logistic regression, cluster
analysis etc. etc.
• In SPSS every column is one variable; Excel does not treat columns and rows in that way (in treating columns and rows, SPSS is more similar to Access than to Excel).
• Excel does not give you a paper trail where you can easily replicate the exact steps that you took.
It also starts becoming unwieldy to use when the number of variables and observations starts
getting really large.
Intro

• https://www.youtube.com/watch?v=_zFBUfZEBWQ
Activity 1: Understanding the SPSS Environment

- Demonstrate the 2 views in SPSS and create a data table in SPSS with the following three
variables: Name, Age, Gender.
- Explain each of the variable properties
- Assign 3 sample values
Getting more sample data for exploration

• IBM provides SPSS users with multiple practice datasets right within the SPSS software.
• Click Open in the SPSS window.
• Click on My Computer>Program Files>IBM>SPSS.
• Click on SPSS>Samples>English.
• This will open a list of various datasets (filenames ending with .sav)
• Click on the dataset you wish to use.
Activity 2: Exploring the Data (data analysis)

Running speed and ability is known to be correlated with both the gender and with
a person's general level of athleticism.
In the sample dataset (provided in Google classroom), there are several variables
relating to this question:
● Gender - The person's physical sex (Male or Female)
● Athlete - Are you an athlete? (Yes/No)
● MileMinDur - Time to run a mile (as a duration variable, hh:mm:ss)

Use the Compare Means procedure to summarize the relationship between


running ability, athletics, and gender.
Stepwise analysis

• https://libguides.library.kent.edu/SPSS/CompareMeans
Conclusions
● There were nearly the same number of male non-
athletes and athletes. Among females, there were
more non-athletes than athletes.
● Among the athletes, the difference in average mile
times between males and females was only 14
seconds. Among non-athletes, the difference in
average mile time between males and females was
more than two minutes.
● Within the athlete and non-athlete groups, the
standard deviations are relatively close.
● Among the athletes, the slowest male mile time and
the slowest female mile time were very close
(within fifteen seconds). Among the non-athletes,
the difference between the slowest male mile time
and the slowest female mile time was much greater
(about 1 minute, 40 seconds).
Discussion
What is the difference between dependent and independent variable?
The values of dependent variables depend on the values of independent variables. The
dependent variables represent the output or outcome whose variation is being studied
Where can this analysis be used in a business scenario?
- Understanding Price sensitivity for different groups basis geography, age, gender
- Making customer preference cohorts
Thank you!
Extra Slides
What is Hypothesis?
• A hypothesis is a predictive statement, capable of being tested by scientific methods, that relates an independent variable to some dependent variable.
• A hypothesis states what we are looking for, and is a proposition which can be put to a test to determine its validity.
e.g. Students who receive counseling will show a greater increase in creativity than students not receiving counseling.

www.shakehandwithlife.in
Characteristics of Hypothesis
• Clear and precise.
• Capable of being tested.
• States a relationship between variables.
• Limited in scope and specific.
• Stated as far as possible in the most simple terms so that it is easily understood by all concerned. But one must remember that simplicity of a hypothesis has nothing to do with its significance.
• Consistent with most known facts.
• Responsive to testing within a reasonable time. One can’t spend a lifetime collecting data to test it.
• Explains what it claims to explain; it should have empirical reference.
Null Hypothesis
• It is an assertion that we hold as true unless we have sufficient statistical evidence to conclude otherwise.
• The null hypothesis is denoted by H0.
• If a population mean is equal to a hypothesised mean, the null hypothesis can be written as
  H0: μ = μ0
Alternative Hypothesis
• The alternative hypothesis is the negation of the null hypothesis and is denoted by Ha.
• If the null is given as H0: μ = μ0, then the alternative hypothesis can be written as
  Ha: μ ≠ μ0
  Ha: μ > μ0
  Ha: μ < μ0
Level of significance and confidence
• Significance means the percentage risk of rejecting a null hypothesis when it is true, and is denoted by α. It is generally taken as 1%, 5% or 10%.
• (1 − α) is the confidence level at which the null hypothesis will be accepted when it is true.
Risk of rejecting a Null Hypothesis when it is true

Designation   | Risk α         | Confidence 1 − α | Description
Supercritical | 0.001 (0.1%)   | 0.999 (99.9%)    | More than $100 million (large loss of life, e.g. nuclear disaster)
Critical      | 0.01 (1%)      | 0.99 (99%)       | Less than $100 million (a few lives lost)
Important     | 0.05 (5%)      | 0.95 (95%)       | Less than $100 thousand (no lives lost, injuries occur)
Moderate      | 0.10 (10%)     | 0.90 (90%)       | Less than $500 (no injuries occur)
Type I and Type II Error

Situation \ Decision | Accept Null             | Reject Null
Null is true         | Correct                 | Type I error (α error)
Null is false        | Type II error (β error) | Correct
Two tailed test @ 5% Significance level

Acceptance and rejection regions in case of a two tailed test. Suitable when H0: μ = μ0 and Ha: μ ≠ μ0.
The rejection region (significance level) lies in both tails, α/2 = 0.025 or 2.5% in each tail; the total acceptance region (confidence level) is (1 − α) = 95%.
Left tailed test @ 5% Significance level

Acceptance and rejection regions in case of a left tailed test. Suitable when H0: μ = μ0 and Ha: μ < μ0.
The rejection region (significance level, α = 0.05 or 5%) lies entirely in the left tail; the total acceptance region (confidence level) is (1 − α) = 95%.
Right tailed test @ 5% Significance level

Acceptance and rejection regions in case of a right tailed test. Suitable when H0: μ = μ0 and Ha: μ > μ0.
The rejection region (significance level, α = 0.05 or 5%) lies entirely in the right tail; the total acceptance region (confidence level) is (1 − α) = 95%.
Procedure for Hypothesis Testing

1. State the null (Ho) and alternate (Ha) hypotheses.
2. State a significance level: 1%, 5%, 10%, etc.
3. Decide a test statistic: z-test, t-test, F-test.
4. Calculate the value of the test statistic.
5. Obtain the critical (table) value at the given significance level.
6. Compare the calculated value with the critical value:
   calculated value ≤ critical value → accept Ho
   calculated value > critical value → reject Ho
Hypothesis Testing of Means: Z-TEST AND T-TEST
Z-Test for testing means

Test condition:
• Population normal and infinite
• Sample size large or small
• Population variance is known
• Ha may be one-sided or two-sided

Test statistic:
z = (X̄ − μH0) / (σp / √n)
Z-Test for testing means

Test condition:
• Population normal and finite
• Sample size large or small
• Population variance is known
• Ha may be one-sided or two-sided

Test statistic:
z = (X̄ − μH0) / [ (σp / √n) × √( (N − n)/(N − 1) ) ]
Z-Test for testing means

Test condition:
• Population is infinite and may not be normal
• Sample size is large
• Population variance is unknown
• Ha may be one-sided or two-sided

Test statistic:
z = (X̄ − μH0) / (σs / √n)
Z-Test for testing means

Test condition:
• Population is finite and may not be normal
• Sample size is large
• Population variance is unknown
• Ha may be one-sided or two-sided

Test statistic:
z = (X̄ − μH0) / [ (σs / √n) × √( (N − n)/(N − 1) ) ]
T-Test for testing means

Test condition:
• Population is infinite and normal
• Sample size is small
• Population variance is unknown
• Ha may be one-sided or two-sided

Test statistic:
t = (X̄ − μH0) / (σs / √n), with d.f. = n − 1
where σs = √( Σ(Xi − X̄)² / (n − 1) )
T-Test for testing means

Test condition:
• Population is finite and normal
• Sample size is small
• Population variance is unknown
• Ha may be one-sided or two-sided

Test statistic:
t = (X̄ − μH0) / [ (σs / √n) × √( (N − n)/(N − 1) ) ], with d.f. = n − 1
where σs = √( Σ(Xi − X̄)² / (n − 1) )
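The one-sample t formula above can be checked against scipy (a sketch with made-up data; the hypothesised mean of 50 is illustrative):

```python
# One-sample t-test computed from the formula t = (X̄ - μ_H0) / (σ_s / √n),
# then compared with scipy's built-in version (hypothetical data)
import math
import statistics
from scipy import stats

sample = [52, 48, 55, 51, 47, 53, 50, 49]
mu_h0 = 50

n = len(sample)
x_bar = statistics.mean(sample)
sigma_s = statistics.stdev(sample)          # uses n - 1 in the denominator
t_manual = (x_bar - mu_h0) / (sigma_s / math.sqrt(n))

t_scipy, p_value = stats.ttest_1samp(sample, popmean=mu_h0)
print(abs(t_manual - t_scipy) < 1e-12)      # True: the two agree
```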
Hypothesis testing for difference between means: Z-TEST, T-TEST
Z-Test for testing difference between means

Test condition:
• Populations are normal
• Samples happen to be large
• Population variances are known
• Ha may be one-sided or two-sided

Test statistic:
z = (X̄1 − X̄2) / √( σ²p1/n1 + σ²p2/n2 )
Z-Test for testing difference between means

Test condition:
• Populations are normal
• Samples happen to be large
• Presumed to have been drawn from the same population
• Population variances are known
• Ha may be one-sided or two-sided

Test statistic:
z = (X̄1 − X̄2) / √( σ²p (1/n1 + 1/n2) )
T-Test for testing difference between means

Test condition:
• Samples happen to be small
• Presumed to have been drawn from the same population
• Population variances are unknown but assumed to be equal
• Ha may be one-sided or two-sided

Test statistic:
t = (X̄1 − X̄2) / [ √( ((n1 − 1)σ²s1 + (n2 − 1)σ²s2) / (n1 + n2 − 2) ) × √( 1/n1 + 1/n2 ) ]
with d.f. = (n1 + n2 − 2)
Hypothesis testing for comparing two related samples: PAIRED T-TEST
Paired T-Test for comparing two related samples

Test condition:
• Samples happen to be small
• Variances of the two populations need not be equal
• Populations are normal
• Ha may be one-sided or two-sided

Test statistic:
t = (D̄ − 0) / (σdiff / √n), with (n − 1) d.f.
where D̄ = mean of differences, σdiff = standard deviation of differences, n = number of matched pairs
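A sketch of the paired t-test in Python (hypothetical before/after scores for the same subjects, not from the slides):

```python
# Paired t-test: t = D̄ / (σ_diff / √n), where D = before - after,
# compared against scipy's built-in paired test (made-up data)
import math
import statistics
from scipy import stats

before = [72, 68, 75, 71, 69, 74]
after  = [70, 65, 74, 68, 66, 73]

diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
t_manual = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

t_scipy, p_value = stats.ttest_rel(before, after)
print(abs(t_manual - t_scipy) < 1e-12)   # True: the two agree
```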
Hypothesis testing of proportions: Z-TEST
Z-test for testing of proportions

Test condition:
• Use in case of qualitative data
• Sampling distribution may take the form of a binomial probability distribution
• Ha may be one-sided or two-sided
• Mean = n·p, standard deviation = √(n·p·q)

Test statistic:
z = (p̂ − p) / √( p·q / n )
where p̂ = sample proportion of successes, p = hypothesised proportion of success, q = 1 − p
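A sketch of the one-sample proportion z-test (hypothetical numbers: 60 successes in 100 trials, tested against a hypothesised p = 0.5):

```python
# One-sample z-test for a proportion: z = (p̂ - p) / sqrt(p*q/n)
import math
from scipy.stats import norm

successes, n = 60, 100
p0 = 0.5                 # hypothesised proportion under H0
q0 = 1 - p0

p_hat = successes / n
z = (p_hat - p0) / math.sqrt(p0 * q0 / n)
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-sided p-value

print(round(z, 2))        # 2.0
print(p_value < 0.05)     # True -> reject H0: p = 0.5
```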
Hypothesis testing for difference between proportions: Z-TEST
Z-test for testing difference between proportions

Test condition:
• Samples drawn from two different populations
• Test confirms whether the difference between the proportions of success is significant
• Ha may be one-sided or two-sided

Test statistic:
z = (p̂1 − p̂2) / √( p̂1q̂1/n1 + p̂2q̂2/n2 )
where p̂1 = proportion of success in sample one, p̂2 = proportion of success in sample two
Hypothesis testing of equality of variances of two normal populations: F-TEST
F-Test for testing equality of variances of two normal populations

Test conditions:
• The populations are normal
• Samples have been drawn randomly
• Observations are independent
• There is no measurement error
• Ha may be one-sided or two-sided

Test statistic:
F = σ²s1 / σ²s2, with (n1 − 1) and (n2 − 1) d.f.
where σ²s1 is the sample estimate for σ²p1 and σ²s2 is the sample estimate for σ²p2
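A sketch of the variance-ratio F-test (hypothetical samples; the p-value comes from the F distribution with n1 − 1 and n2 − 1 d.f., using the common convention of putting the larger variance in the numerator and doubling the upper-tail probability for a two-sided test):

```python
# F-test for equality of two variances: F = s1² / s2²
import statistics
from scipy.stats import f as f_dist

sample1 = [23, 25, 28, 30, 22, 27, 29, 26]
sample2 = [24, 25, 26, 25, 24, 26, 25, 25]

var1 = statistics.variance(sample1)   # sample variance (n - 1 denominator)
var2 = statistics.variance(sample2)

f_stat = var1 / var2                  # larger variance on top by construction here
df1, df2 = len(sample1) - 1, len(sample2) - 1
p_value = 2 * (1 - f_dist.cdf(f_stat, df1, df2))   # two-sided approximation

print(f_stat > 1)   # True: sample1 is more variable
```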
Limitations of the test of Hypothesis
• Testing of hypothesis is not decision making itself, but an aid to decision making.
• A test does not explain the reasons why the difference exists; it only indicates whether the difference is due to fluctuations of sampling or to other causes, without telling us what those causes are.
• Tests are based on probabilities and as such cannot be expressed with full certainty.
• Statistical inferences based on significance tests cannot be said to be entirely correct evidence concerning the truth of the hypothesis.