Submitted To: Mrs. Geetika Vashisht College of Vocational Studies University of Delhi
ON
REVISION OF STATISTICS
SUBMITTED TO:
MRS. GEETIKA VASHISHT
COLLEGE OF VOCATIONAL STUDIES
UNIVERSITY OF DELHI
SUBMITTED BY:
SANCHIT NAGPAL
BSC(HONS) COMPUTER SCIENCE
STATISTICS
Variable
• Qualitative: Nominal, Ordinal
• Quantitative: Discrete, Continuous
• Nominal - Categorical variables with no inherent order or ranking, such as
names or classes (e.g., gender). Values may be written as numerals but carry no
numerical meaning (e.g., I, II, III). The only operation that can be applied to nominal
variables is enumeration (counting).
• Ordinal - Variables with an inherent rank or order, e.g. mild, moderate, severe. Values
can be compared for equality, or for greater or less, but not for how much greater or less.
• Ratio - Variables with all the properties of interval variables plus an absolute,
non-arbitrary zero point, e.g. age, weight, temperature (Kelvin). Addition, subtraction,
multiplication, and division are all meaningful operations.
DISTRIBUTION - (OF A VARIABLE) TELLS US WHAT VALUES THE VARIABLE TAKES
AND HOW OFTEN IT TAKES THESE VALUES
FREQUENCY DISTRIBUTION
Consider a data set of 26 children aged 1-6 years. The frequency
distribution of the variable 'age' can be tabulated as follows:
Age         1   2   3   4   5   6
Frequency   5   3   7   5   4   2
Grouped into two-year age classes:
Age group   1-2   3-4   5-6
Frequency     8    12     6
CUMULATIVE FREQUENCY
Cumulative frequency of the age data above:
Age                    1   2   3   4   5   6
Frequency              5   3   7   5   4   2
Cumulative Frequency   5   8  15  20  24  26
Grouped into two-year age classes:
Age group              1-2   3-4   5-6
Frequency                8    12     6
Cumulative Frequency     8    20    26
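The two tables can be reproduced with a few lines of Python. This is only a sketch: the two-year age classes (1-2, 3-4, 5-6) are an assumption inferred from the grouped frequencies 8, 12 and 6, and the variable names are illustrative.

```python
from itertools import accumulate

# Frequency table from the text: age -> number of children.
ages = [1, 2, 3, 4, 5, 6]
freq = [5, 3, 7, 5, 4, 2]

# A running total of the frequencies gives the cumulative frequency.
cum_freq = list(accumulate(freq))

# Assumed grouping: merge adjacent ages into the classes 1-2, 3-4, 5-6.
grouped_freq = [freq[i] + freq[i + 1] for i in range(0, len(freq), 2)]
grouped_cum = list(accumulate(grouped_freq))

for age, f, cf in zip(ages, freq, cum_freq):
    print(age, f, cf)
print(grouped_freq, grouped_cum)
```

The last cumulative frequency always equals the total number of observations, here 26.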
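MEASURES OF CENTRAL TENDENCY (LOCATION)
Measures of location indicate where on the number line the data are to be found.
Common measures of location are the mean, the median and the mode. The mean of n
observations x_1, x_2, ..., x_n is:
\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]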
Example 1: The systolic blood pressures (mm Hg) of seven middle-aged men were as follows:
151, 124, 132, 170, 146, 124, 113
The mean is
\[ \bar{x} = \frac{151 + 124 + 132 + 170 + 146 + 124 + 113}{7} = \frac{960}{7} \approx 137.14 \]
THE MEDIAN AND MODE
The Median is the middle value of the ordered data (113, 124, 124, 132, 146, 151, 170), i.e. 132.
Two individuals have systolic blood pressure = 124 mm Hg, so the Mode is 124.
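As a quick check, the three measures for the blood pressure data can be computed with Python's standard `statistics` module. The variable names are illustrative only.

```python
import statistics

# Systolic blood pressure (mm Hg) of the seven men in the example.
bp = [151, 124, 132, 170, 146, 124, 113]

mean_bp = statistics.mean(bp)      # sum of the values divided by n
median_bp = statistics.median(bp)  # middle value of the sorted data
mode_bp = statistics.mode(bp)      # most frequent value

print(round(mean_bp, 2), median_bp, mode_bp)
```

For an even number of observations, `statistics.median` averages the two middle values, which is the rule used in the next example.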
EXAMPLE 2 – N IS EVEN
Six men with high cholesterol participated in a study to investigate the
effects of diet on cholesterol level. At the beginning of the study, their
cholesterol levels (mg/dL) were as follows:
The Median is halfway between the middle two readings, i.e.
(274 + 292) / 2 = 283.
Two men have the same cholesterol level, so the Mode is 274.
GEOMETRIC PROBABILITY DISTRIBUTION
The geometric distribution is a special case of the negative binomial distribution. It deals
with the number of trials required for a single success. Thus, the geometric distribution is
a negative binomial distribution where the number of successes (r) is equal to 1.
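\[ P(X = x) = p \, q^{x-1} \]
Where p is the probability of success on a single trial, q = 1 − p is the probability of
failure, and x is the number of the trial on which the first success occurs.
For example, with p = 0.3, the probability that the first success occurs on the fifth trial is
\[ P(X = 5) = 0.3 \times (1 - 0.3)^{5-1} = 0.3 \times (0.7)^{4} \approx 0.072 = 7.2\% \]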
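The calculation can be sketched in Python as follows. The function name `geometric_pmf` is hypothetical, and the input checks are an added assumption about what counts as a valid argument.

```python
def geometric_pmf(x: int, p: float) -> float:
    """Probability that the first success occurs on trial x (x = 1, 2, ...)."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    q = 1 - p  # probability of failure on a single trial
    return p * q ** (x - 1)

prob = geometric_pmf(5, 0.3)
print(round(prob, 3))
```

Summing `geometric_pmf(x, p)` over all x = 1, 2, 3, ... gives 1, as required of a probability distribution.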
INTERVAL ESTIMATION
Interval estimation is the use of sample data to calculate an interval of possible (or probable)
values of an unknown population parameter, in contrast to point estimation, which is a single
number.
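For a population mean with known standard deviation, the interval is
\[ \mu = \bar{x} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \]
Where
• \( \bar{x} \) = sample mean
• \( Z_{\alpha/2} \) = the confidence coefficient (critical value)
• \( \alpha \) = significance level (1 − confidence level)
• \( \sigma \) = population standard deviation
• \( n \) = sample size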
EXAMPLE
Suppose a student measuring the boiling temperature of a certain liquid observes the readings
(in degrees Celsius) 102.5, 101.7, 103.1, 100.9, 100.5, and 102.2 on 6 different samples of the
liquid. He calculates the sample mean to be 101.82. If he knows that the standard deviation for
this procedure is 1.2 degrees, what is the interval estimation for the population mean at a 95%
confidence level?
Solution:
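The student calculated the sample mean of the boiling temperatures to be 101.82. With a known
standard deviation of 1.2 and n = 6, the standard error is
\[ \frac{\sigma}{\sqrt{n}} = \frac{1.2}{\sqrt{6}} \approx 0.49 \]
The critical value for a 95% confidence interval is 1.96, where
\[ \frac{1 - 0.95}{2} = 0.025 \]
The interval is therefore
\[ (101.82 - 1.96 \times 0.49,\; 101.82 + 1.96 \times 0.49) = (101.82 - 0.96,\; 101.82 + 0.96) = (100.86,\; 102.78) \]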
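The same interval can be computed in Python. This is a minimal sketch assuming a known population standard deviation; the helper name `z_interval` is hypothetical, and the critical value is taken from the standard normal distribution via `statistics.NormalDist` (Python 3.8+) rather than from a table.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_interval(sample, sigma, confidence):
    """Confidence interval for the mean when sigma is known."""
    n = len(sample)
    x_bar = mean(sample)
    alpha = 1 - confidence                   # significance level
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, about 1.96 for 95%
    margin = z * sigma / sqrt(n)             # critical value times standard error
    return x_bar - margin, x_bar + margin

readings = [102.5, 101.7, 103.1, 100.9, 100.5, 102.2]
low, high = z_interval(readings, sigma=1.2, confidence=0.95)
print(round(low, 2), round(high, 2))
```

Calling `z_interval(readings, sigma=1.2, confidence=0.90)` gives a narrower interval, since a lower confidence level uses a smaller critical value.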
HYPOTHESIS TESTING
Null hypothesis (H0)
• Represents the hypothesis that the observations are the result of chance alone.
Alternative hypothesis (Ha)
• Represents the hypothesis that the observations are influenced by some non-random
cause.
EXAMPLE
Suppose we wanted to check whether a coin was fair and balanced. The
null hypothesis might say that half of the flips will come up heads and half
tails, whereas the alternative hypothesis might say that the numbers of heads
and tails will be very different.
H0: P = 0.5
Ha: P ≠ 0.5
For example, suppose we flipped the coin 50 times and obtained 40 heads and 10
tails. Given this result, we would reject the null hypothesis and
conclude, based on the evidence, that the coin was probably
not fair and balanced.
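One way to make the coin decision concrete is an exact binomial test, sketched below. The text does not say which test or significance level was used, so the two-sided exact test, the 0.05 threshold and the function name `binomial_two_sided_p` are all assumptions for illustration.

```python
from math import comb

def binomial_two_sided_p(heads: int, flips: int) -> float:
    """Exact two-sided p-value for H0: P = 0.5."""
    # Under H0 the distribution of heads is symmetric about flips / 2, so we
    # add up every outcome at least as far from the centre as the observed one.
    distance = abs(heads - flips / 2)
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= distance
    )
    return extreme / 2 ** flips

ALPHA = 0.05  # assumed significance level
p_value = binomial_two_sided_p(40, 50)
reject_h0 = p_value < ALPHA
print(p_value, reject_h0)
```

With 40 heads in 50 flips the p-value is far below 0.05, which agrees with the conclusion that the coin is probably not fair.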
As the level of confidence decreases, the width of the corresponding interval
also decreases. Suppose the student in the boiling-temperature example was
interested in a 90% confidence interval. In this case the confidence level is 0.90 and
\[ \frac{1 - 0.90}{2} = 0.05 \]
The critical value for this level is equal to 1.645, so the 90% confidence
interval is
\[ (101.82 - 1.645 \times 0.49,\; 101.82 + 1.645 \times 0.49) = (101.82 - 0.81,\; 101.82 + 0.81) = (101.01,\; 102.63) \]
Example
• Hypothesis - Fluoride added to toothpaste protects teeth against cavities.
• Null Hypothesis - Fluoride added to toothpaste has no effect against cavities.
Here the null hypothesis is tested against experimental data, with the aim of rejecting
(nullifying) the claim that fluoride has no effect on cavities.
Consider the example above. Suppose the null hypothesis is false, i.e. fluoride added to toothpaste
does have an effect against cavities. If the experimental data do not detect an effect of fluoride on
cavities, then we are accepting a false null hypothesis. This is a Type II error. It is also called a
False Negative (a result which indicates that a given condition is not present when it
actually is present).
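The goal of a statistical test is to determine whether the null hypothesis can be rejected. A
statistical test either rejects or fails to reject the null hypothesis. The following table illustrates
the relationship between the truth or falseness of the null hypothesis and the outcome of the test in
terms of Type I and Type II errors.
                      H0 is true                       H0 is false
Reject H0             Type I error (false positive)    Correct decision
Fail to reject H0     Correct decision                 Type II error (false negative)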
GOODNESS OF FIT
• A Goodness of Fit test checks whether sample data fit a given population distribution,
for example a normal distribution or a Weibull distribution. In simple words, it indicates
whether the sample data represent what we would expect to find in the actual population.
The following tests are generally used by statisticians:
• Chi-square
• Kolmogorov-Smirnov
• Anderson-Darling
• Shapiro-Wilk
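To illustrate the first test in the list, the chi-square statistic compares observed category counts with the counts expected under the assumed distribution. The sketch below uses hypothetical data (60 rolls of a die tested against a uniform distribution) that do not come from the text; the function name is also illustrative.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical data: 60 die rolls; H0 says all six faces are equally likely.
observed = [8, 12, 9, 11, 10, 10]
expected = [sum(observed) / len(observed)] * len(observed)

statistic = chi_square_statistic(observed, expected)

# Critical value of the chi-square distribution with 6 - 1 = 5 degrees of
# freedom at an assumed significance level of 0.05.
CRITICAL_5DF_05 = 11.070
reject_h0 = statistic > CRITICAL_5DF_05
print(round(statistic, 3), reject_h0)
```

A statistic below the critical value means the data give no reason to doubt the assumed distribution at that significance level.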
ANOVA (ANALYSIS OF VARIANCE)
• Analysis of Variance, also termed ANOVA, is a procedure used by statisticians to check for
differences in a scale-level dependent variable across the categories of a nominal-level variable
having two or more categories. It was developed by Ronald Fisher in 1918 and it extends the t-test and
z-test, which can only compare a nominal-level variable with just two categories.
• TYPES OF ANOVA
1. One-way ANOVA - One-way ANOVA has only one independent variable, which may have two or more
levels. For example, to assess differences in IQ by country, you can compare data from two or more
countries.
2. Two-way ANOVA - Two-way ANOVA uses two independent variables, for example to assess
differences in IQ by country (variable 1) and gender (variable 2). Here you can also examine the
interaction between the two independent variables. Such an interaction indicates that differences in
IQ are not uniform across an independent variable. For example, females may have higher IQ scores
than males overall, with a much larger difference in Europe than in America.
ANOVA TEST PROCEDURE
• Set up the null and alternative hypotheses: the null hypothesis states that there is no
significant difference among the group means, and the alternative hypothesis states that there
is a significant difference among the group means.
• Calculate the F-ratio and the probability of F (the p-value).
• Compare the p-value of the F-ratio with the established alpha, or significance level.
• If the p-value of F is less than alpha (commonly 0.05), reject the null hypothesis.
• If the null hypothesis is rejected, conclude that the group means are not all equal.
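The F-ratio in the procedure above is the between-group mean square divided by the within-group mean square. The sketch below computes it for one-way ANOVA on three small hypothetical groups; the data and the function name `one_way_f` are not from the text.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F-ratio, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical scores for three groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio, df_b, df_w = one_way_f(groups)
print(round(f_ratio, 2), df_b, df_w)
```

Turning the F-ratio into a p-value requires the F distribution with (df_between, df_within) degrees of freedom, from a table or a statistics library, which this sketch leaves out.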
LINEAR REGRESSION
Once the degree of relationship between variables has been established using correlation
analysis, it is natural to delve into the nature of the relationship. Regression analysis helps in
studying the cause-and-effect relationship between variables. The value of one variable
(called the dependent variable) can be predicted from the values of the independent
variables, using either a graphical method or the algebraic method.
The algebraic method develops two regression equations, of Y on X and of X on Y.
Regression equation of Y on X:
\[ Y = a + bX \]
where
Y = Dependent variable
X = Independent variable
a = Constant showing Y-intercept
b = Constant showing slope of line
The values of a and b are obtained from the normal equations:
\[ \sum Y = Na + b \sum X \]
\[ \sum XY = a \sum X + b \sum X^2 \]
Regression equation of X on Y:
\[ X = a + bY \]
where
X = Dependent variable
Y = Independent variable
a = Constant showing X-intercept
b = Constant showing slope of line
The values of a and b are obtained from the normal equations:
\[ \sum X = Na + b \sum Y \]
\[ \sum XY = a \sum Y + b \sum Y^2 \]
PROBLEM STATEMENT:
A researcher has found that there is a correlation between the weights of fathers
and sons. He is now interested in developing a regression equation on the two variables from
the given data:
Weight of father, X (in Kg)   69 63 66 64 67 64 70 66 68 67 65 71
Weight of son, Y (in Kg)      70 65 68 65 69 66 68 65 71 67 64 72
Develop the regression equation of Y on X.
SOLUTION
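• Y = a + bX
• Where a and b are obtained from the normal equations
\[ \sum Y = Na + b \sum X \]
\[ \sum XY = a \sum X + b \sum X^2 \]
From the data, ∑Y = 810, ∑X = 800, ∑X² = 53,402, ∑XY = 54,059 and N = 12, so
\[ 810 = 12a + 800b \quad \text{(i)} \]
\[ 54{,}059 = 800a + 53{,}402b \quad \text{(ii)} \]
• Multiplying equation (i) by 800 and equation (ii) by 12, and subtracting them:
\[ -824b = -708 \Rightarrow b \approx 0.859 \]
• Putting it in eq. (i):
\[ a = \frac{810 - 800 \times 0.8592}{12} \approx 10.22 \]
\[ Y = 10.22 + 0.859X \]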
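The normal equations can also be solved numerically. In the sketch below the sums are recomputed directly from the weight table rather than copied from the worked solution, so it acts as an independent check on the hand arithmetic; the variable names are illustrative.

```python
# Weights in kg from the problem statement: X = father, Y = son.
father = [69, 63, 66, 64, 67, 64, 70, 66, 68, 67, 65, 71]
son = [70, 65, 68, 65, 69, 66, 68, 65, 71, 67, 64, 72]

n = len(father)
sum_x = sum(father)
sum_y = sum(son)
sum_xx = sum(x * x for x in father)
sum_xy = sum(x * y for x, y in zip(father, son))

# Closed-form solution of the two normal equations
#   sum_y  = n * a + b * sum_x
#   sum_xy = a * sum_x + b * sum_xx
b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(sum_x, sum_y, sum_xx, sum_xy)
print(f"Y = {a:.2f} + {b:.3f} X")
```

The expression for `b` is what results from multiplying the first normal equation by ∑X and the second by N and subtracting, the same elimination step used in the solution.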
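LOGISTIC REGRESSION
Logistic regression is a statistical method for analyzing a dataset in which one
or more independent variables determine an outcome. The outcome is measured
with a dichotomous variable (one with only two possible outcomes).
\[ \pi(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} \]
With α = −2.490 and β = 0.165, the probabilities at doses x = 0, 2.5, 5 and 10 are:
\[ \pi(0) = \frac{e^{\alpha + \beta \times 0}}{1 + e^{\alpha + \beta \times 0}} = \frac{e^{-2.490}}{1 + e^{-2.490}} \approx 0.08 \]
\[ \pi(2.5) = \frac{e^{\alpha + \beta \times 2.5}}{1 + e^{\alpha + \beta \times 2.5}} = \frac{e^{-2.490 + 0.165 \times 2.5}}{1 + e^{-2.490 + 0.165 \times 2.5}} \approx 0.11 \]
\[ \pi(5) = \frac{e^{\alpha + \beta \times 5}}{1 + e^{\alpha + \beta \times 5}} = \frac{e^{-2.490 + 0.165 \times 5}}{1 + e^{-2.490 + 0.165 \times 5}} \approx 0.16 \]
\[ \pi(10) = \frac{e^{\alpha + \beta \times 10}}{1 + e^{\alpha + \beta \times 10}} = \frac{e^{-2.490 + 0.165 \times 10}}{1 + e^{-2.490 + 0.165 \times 10}} \approx 0.30 \]
DOSE    π(x)
0       0.08
2.5     0.11
5       0.16
10      0.30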
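The logistic formula is easy to evaluate directly, which gives a check on the hand calculation. The sketch below plugs in the coefficients α = −2.490 and β = 0.165 from the worked expressions; the function name `logistic_probability` is hypothetical.

```python
from math import exp

ALPHA = -2.490  # intercept used in the worked expressions
BETA = 0.165    # slope per unit of dose

def logistic_probability(x: float, alpha: float = ALPHA, beta: float = BETA) -> float:
    """pi(x) = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))."""
    z = alpha + beta * x
    return exp(z) / (1 + exp(z))

for dose in (0, 2.5, 5, 10):
    print(dose, round(logistic_probability(dose), 2))
```

Because β is positive, π(x) increases with the dose and always stays strictly between 0 and 1.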
THANK YOU