
Submitted To: Mrs. Geetika Vashisht College of Vocational Studies University of Delhi

This document provides a 3-page summary of key concepts in statistics, including: 1) Definitions of basic statistical terms like population, sample, variable, and parameter. 2) Descriptions of different variable types and distributions. 3) Explanations of measures of central tendency like mean, median, and mode. 4) Examples of statistical concepts like interval estimation and hypothesis testing. The document serves as an introduction to fundamental statistical topics for students.

Uploaded by

sanchit nagpal

PRESENTATION ON REVISION OF STATISTICS

SUBMITTED TO:
MRS. GEETIKA VASHISHT
COLLEGE OF VOCATIONAL STUDIES
UNIVERSITY OF DELHI

SUBMITTED BY:
SANCHIT NAGPAL
BSC(HONS) COMPUTER SCIENCE

STATISTICS

• The science of collecting, organizing, presenting, analyzing, and interpreting data to
assist in making more effective decisions.
• Statistical analysis – used to manipulate, summarize, and investigate data, so that
useful decision-making information results.
WHY STUDY STATISTICS?

1. Data are everywhere.
2. Statistical techniques are used to make many decisions that affect our
lives.
3. No matter what your career, you will make professional decisions
that involve data. An understanding of statistical methods will help
you make these decisions effectively.
1.2 INTRODUCTION TO BASIC TERMS

Population: A collection, or set, of individuals or objects or events whose properties are
to be analyzed.
Two kinds of populations: finite or infinite.
Sample: A subset of the population.
Variable: A characteristic about each individual element of a
population or sample.
Data (singular): The value of the variable associated with one
element of a population or sample. This value may be a number,
a word, or a symbol.
Data (plural): The set of values collected for the variable from
each of the elements belonging to the sample.
Experiment: A planned activity whose results yield a set of data.
Parameter: A numerical value summarizing all the data of an
entire population.
Statistic: A numerical value summarizing the sample data.
Example: A college dean is interested in learning about the average age of faculty. Identify the basic
terms in this situation.

The population is the age of all faculty members at the college.
A sample is any subset of that population. For example, we might select 10 faculty members and
determine their ages.
The variable is the “age” of each faculty member.
One data value would be the age of a specific faculty member.
The data would be the set of values in the sample.
The experiment would be the method used to select the ages forming the sample and determining the
actual age of each faculty member in the sample.
The parameter of interest is the “average” age of all faculty at the college.
The statistic is the “average” age for all faculty in the sample.
Variables may be further subdivided:

Variable
  Qualitative: Nominal, Ordinal
  Quantitative: Discrete, Continuous
• Nominal - Categorical variables with no inherent order or ranking sequence, such as
names or classes (e.g., gender). Values may be written as numerals, but they carry no
numerical meaning (e.g., I, II, III). The only operation that can be applied to nominal
variables is enumeration (counting).

• Ordinal - Variables with an inherent rank or order, e.g. mild, moderate, severe. Values can
be compared for equality, or for greater or less, but not for how much greater or less.

• Interval - Values of the variable are ordered as in Ordinal and, additionally,
differences between values are meaningful; however, the scale has no absolute zero.
Calendar dates and temperatures on the Fahrenheit scale are examples.
Addition and subtraction, but not multiplication and division, are meaningful
operations.

• Ratio - Variables with all properties of Interval plus an absolute, non-arbitrary zero
point, e.g. age, weight, temperature (Kelvin). Addition, subtraction, multiplication, and
division are all meaningful operations.
DISTRIBUTION - (OF A VARIABLE) TELLS US WHAT VALUES THE VARIABLE TAKES
AND HOW OFTEN IT TAKES THESE VALUES

FREQUENCY DISTRIBUTION

Consider a data set of 26 children of ages 1-6 years. Then the frequency
distribution of variable ‘age’ can be tabulated as follows:

Frequency Distribution of Age

Age         1   2   3   4   5   6
Frequency   5   3   7   5   4   2

Grouped Frequency Distribution of Age:

Age Group   1-2   3-4   5-6
Frequency   8     12    6
CUMULATIVE FREQUENCY
Cumulative frequency of the data in the preceding frequency distribution:

Age                    1   2   3    4    5    6
Frequency              5   3   7    5    4    2
Cumulative Frequency   5   8   15   20   24   26

Age Group              1-2   3-4   5-6
Frequency              8     12    6
Cumulative Frequency   8     20    26
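
The tabulation can be checked with a few lines of Python. This is only a minimal sketch: the list of individual ages is rebuilt from the frequency table (the raw observations are not given), and the variable names are illustrative.

```python
from collections import Counter
from itertools import accumulate

# Rebuild the 26 individual ages from the frequency table (assumed raw data).
frequency = {1: 5, 2: 3, 3: 7, 4: 5, 5: 4, 6: 2}
ages = [age for age, count in frequency.items() for _ in range(count)]

# Frequency distribution: how often each value occurs.
counts = Counter(ages)
values = sorted(counts)
freqs = [counts[v] for v in values]

# Cumulative frequency: running total of the frequencies.
cumulative = list(accumulate(freqs))

# Grouped distribution with class intervals 1-2, 3-4 and 5-6.
groups = {"1-2": (1, 2), "3-4": (3, 4), "5-6": (5, 6)}
grouped = {label: sum(counts[a] for a in range(lo, hi + 1))
           for label, (lo, hi) in groups.items()}
grouped_cumulative = list(accumulate(grouped.values()))

print(values, freqs, cumulative)
print(grouped, grouped_cumulative)
```

The last cumulative frequency always equals the number of observations, which is a quick sanity check on any such table.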
MEASURES OF CENTRAL TENDENCY
(LOCATION)
Measures of location indicate where on the number line the data are to be found.
Common measures of location are:

(i) the Arithmetic Mean,
(ii) the Median, and
(iii) the Mode
MEAN
Mean: The sum of all the observations divided by the number of
observations.
Mean of 20, 30, 40 is (20+30+40)/3 = 30.

Notation: Let $x_1, x_2, \ldots, x_n$ be $n$ observations of a variable
$x$. Then the mean of this variable is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Example 2: The systolic blood pressures of seven middle-aged men were as follows:

151, 124, 132, 170, 146, 124 and 113.

The mean is

$$\bar{x} = \frac{151 + 124 + 132 + 170 + 146 + 124 + 113}{7} = \frac{960}{7} \approx 137.14$$
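
As a quick check of the arithmetic, the mean of the seven readings can be computed directly. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import mean

# Systolic blood pressure readings (mm Hg) from the example.
pressures = [151, 124, 132, 170, 146, 124, 113]

total = sum(pressures)            # sum of the observations
x_bar = total / len(pressures)    # divide by the number of observations

print(total, round(x_bar, 2))     # 960 137.14
print(round(mean(pressures), 2))  # same value from the standard library
```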
THE MEDIAN AND MODE

• If the sample data are arranged in increasing order, the median is
(i) the middle value if n is an odd number, or
(ii) midway between the two middle values if n is an even number.
• The mode is the most commonly occurring value.
EXAMPLE 1 – N IS ODD

The reordered systolic blood pressure data seen earlier are:

113, 124, 124, 132, 146, 151, and 170.

The Median is the middle value of the ordered data, i.e. 132.

Two individuals have systolic blood pressure = 124 mm Hg, so the Mode is 124.
EXAMPLE 2 – N IS EVEN
Six men with high cholesterol participated in a study to investigate the
effects of diet on cholesterol level. At the beginning of the study, their
cholesterol levels (mg/dL) were as follows:

366, 327, 274, 292, 274 and 230.

Rearrange the data in numerical order as follows:

230, 274, 274, 292, 327 and 366.

The Median is halfway between the middle two readings, i.e.
(274+292) ÷ 2 = 283.

Two men have the same cholesterol level- the Mode is 274.
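
Both worked examples can be reproduced with Python's standard `statistics` module, which sorts the data internally and handles the odd and even cases of the median. A minimal sketch, with illustrative variable names:

```python
from statistics import median, mode

blood_pressure = [151, 124, 132, 170, 146, 124, 113]  # n = 7 (odd)
cholesterol = [366, 327, 274, 292, 274, 230]          # n = 6 (even)

# Odd n: the median is the middle value of the ordered data.
print(sorted(blood_pressure), median(blood_pressure), mode(blood_pressure))

# Even n: the median is midway between the two middle values.
print(sorted(cholesterol), median(cholesterol), mode(cholesterol))
```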
GEOMETRIC PROBABILITY DISTRIBUTION

The geometric distribution is a special case of the negative binomial distribution. It deals
with the number of trials required for a single success. Thus, the geometric distribution is
a negative binomial distribution where the number of successes (r) is equal to 1.

$$P(X = x) = p\,q^{x-1}$$

Where

• p = probability of success on a single trial.
• q = probability of failure on a single trial (1 − p).
• x = the number of the trial on which the first success occurs (that is, x − 1 failures followed by one success).
• P(X = x) = probability that the first success occurs on trial x.
EXAMPLE
• At an amusement fair, a competitor is entitled to a prize if he throws a ring onto a peg from a certain distance.
It is observed that only 30% of the competitors are able to do this. If someone is given 5 chances, what is the
probability that he misses the first 4 chances and wins the prize on the fifth?
• Solution:
• If someone misses the first four chances and wins on the fifth, then this is a probability
experiment of getting the first success on the fifth trial. Assuming the throws are independent, the number
of the trial on which the first success occurs follows a geometric distribution, so the probability is given by
the geometric distribution formula:

$$P(X = x) = p\,q^{x-1}$$

$$P(X = 5) = 0.3 \times (1 - 0.3)^{5-1} = 0.3 \times (0.7)^{4} \approx 0.072 = 7.2\%$$
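
The calculation is easy to check numerically. The sketch below assumes independent throws with the same success probability; the function name `geometric_pmf` is illustrative, not from any library.

```python
def geometric_pmf(x: int, p: float) -> float:
    """Probability that the first success occurs on trial x."""
    if x < 1:
        raise ValueError("x counts trials and must be at least 1")
    q = 1 - p                      # probability of failure on a single trial
    return p * q ** (x - 1)        # x - 1 failures, then one success

prob = geometric_pmf(5, 0.3)
print(round(prob, 3))              # 0.072

# The probabilities over all possible trial numbers add up to (almost exactly) 1.
print(sum(geometric_pmf(k, 0.3) for k in range(1, 200)))
```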
INTERVAL ESTIMATION

Interval estimation is the use of sample data to calculate an interval of possible (or probable)
values of an unknown population parameter, in contrast to point estimation, which is a single
number.

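$$\mu = \bar{x} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$

Where

• $\bar{x}$ = sample mean
• $Z_{\alpha/2}$ = the critical value of the standard normal distribution (the confidence coefficient)
• $\alpha$ = significance level, so that $1 - \alpha$ is the confidence level
• $\sigma$ = population standard deviation
• $n$ = sample size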
EXAMPLE
Suppose a student measuring the boiling temperature of a certain liquid observes the readings
(in degrees Celsius) 102.5, 101.7, 103.1, 100.9, 100.5, and 102.2 on 6 different samples of the
liquid. He calculates the sample mean to be 101.82. If he knows that the standard deviation for
this procedure is 1.2 degrees, what is the interval estimate for the population mean at a 95%
confidence level?

Solution:
The student calculated the sample mean of the boiling temperatures to be 101.82, with standard
error $\sigma/\sqrt{n} = 1.2/\sqrt{6} \approx 0.49$. The critical value for a 95% confidence interval is 1.96, where
$(1 - 0.95)/2 = 0.025$ is the area in each tail. A 95% confidence interval for the unknown mean is

$$(101.82 - 1.96 \times 0.49,\; 101.82 + 1.96 \times 0.49) = (101.82 - 0.96,\; 101.82 + 0.96) = (100.86,\; 102.78)$$
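
The interval can be reproduced with the standard library. This is a sketch under the example's assumption that the population standard deviation (1.2) is known; `z_interval` is an illustrative helper name, and the critical value is taken from `statistics.NormalDist` rather than from a table.

```python
from math import sqrt
from statistics import NormalDist, mean

readings = [102.5, 101.7, 103.1, 100.9, 100.5, 102.2]
sigma = 1.2  # known standard deviation of the procedure (given in the example)

def z_interval(data, sigma, confidence):
    """Confidence interval for the mean when sigma is known."""
    n = len(data)
    x_bar = mean(data)
    # Critical value: leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)   # z times the standard error
    return x_bar - margin, x_bar + margin

low, high = z_interval(readings, sigma, 0.95)
print(round(low, 2), round(high, 2))  # 100.86 102.78
```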
HYPOTHESIS TESTING

A statistical hypothesis is an assumption about a population which may or may
not be true. Hypothesis testing is a set of formal procedures used by
statisticians to either accept or reject statistical hypotheses. Statistical
hypotheses are of two types:

Null hypothesis (H0)
- states that the sample observations result purely from chance.

Alternative hypothesis (Ha)
- states that the sample observations are influenced by some non-random cause.
EXAMPLE
Suppose we wanted to check whether a coin was fair and balanced. A
null hypothesis might say that half of the flips will be heads and half will be
tails, whereas an alternative hypothesis might say that the numbers of heads and
tails may be very different.

H0: P=0.5
Ha: P≠0.5
For example, suppose we flipped the coin 50 times and got 40 heads and 10
tails. Given this result, we would reject the null hypothesis and
conclude, based on the evidence, that the coin was probably
not fair and balanced.
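
One way to make the coin decision concrete is an exact two-sided binomial test. The sketch below is an illustration, not the method prescribed by the slides: the 0.05 significance level and the helper name `binomial_two_sided_p` are assumptions.

```python
from math import comb

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided p-value: total probability of all outcomes
    that are no more likely than the observed count k under H0."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small tolerance so that ties are not lost to floating-point rounding.
    return sum(q for q in pmf if q <= observed * (1 + 1e-9))

alpha = 0.05                        # assumed significance level
p_value = binomial_two_sided_p(40, 50)

print(f"p-value = {p_value:.2e}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With 40 heads in 50 flips the p-value is far below 0.05, which agrees with the conclusion that the coin is probably not fair.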
Returning to the interval estimation example: as the level of confidence decreases, the size of the
corresponding interval will decrease. Suppose the student was interested in a 90% confidence
interval for the boiling temperature. In this case, the confidence level is
$1 - \alpha = 0.90$, and $(1 - 0.90)/2 = 0.05$ is the area in each tail.

The critical value for this level is equal to 1.645, so the 90% confidence
interval is

$$(101.82 - 1.645 \times 0.49,\; 101.82 + 1.645 \times 0.49) = (101.82 - 0.81,\; 101.82 + 0.81) = (101.01,\; 102.63)$$

An increase in sample size will decrease the length of the confidence
interval without reducing the level of confidence. This is because the
standard error $\sigma/\sqrt{n}$ decreases as n increases.
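
Both effects can be seen by tabulating the margin of error $Z_{\alpha/2}\,\sigma/\sqrt{n}$. In this sketch the sample size of 24 is a hypothetical value chosen for comparison; only n = 6 and σ = 1.2 come from the example.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sigma, n, confidence):
    """Half-width of the confidence interval for the mean (sigma known)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / sqrt(n)

sigma = 1.2
for confidence in (0.90, 0.95):
    for n in (6, 24):               # 24 is a hypothetical larger sample
        m = margin_of_error(sigma, n, confidence)
        print(f"confidence={confidence:.2f}  n={n:2d}  margin={m:.2f}")
```

Quadrupling the sample size halves the margin, because the standard error shrinks with the square root of n.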
TYPE II ERROR

Example
• Hypothesis - Fluoride added to a toothpaste protects teeth against cavities.
• Null Hypothesis - Fluoride added to a toothpaste has no effect against cavities.
Here the null hypothesis is to be tested against experimental data, with the aim of rejecting
(nullifying) the claim that fluoride has no effect on cavities.
Consider the example. Suppose the null hypothesis is false, i.e. fluoride added to a toothpaste does have an effect
against cavities. If, using experimental data, we do not detect an effect of the added fluoride on
cavities, then we are accepting a false null hypothesis. This is a Type II error. It is also called a
False Negative condition (a situation which indicates that a given condition is not present when it
actually is present).

The probability of a Type II error is denoted by β and is also called the beta level.

The goal of a statistical test is to determine whether a null hypothesis can be rejected or not. A statistical
test can either reject or fail to reject a null hypothesis. The relationship
between the truth or falseness of the null hypothesis and the outcomes of the test is as follows:
rejecting a true null hypothesis is a Type I error, failing to reject a false null hypothesis is a
Type II error, and the other two outcomes are correct decisions.
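
To put a number on β, the sketch below reuses the coin-flipping example from the hypothesis testing section (50 flips, H0: P = 0.5), since the fluoride example gives no data. The 0.05 significance level and the true head probability of 0.7 are hypothetical choices, and the helper names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips when P(head) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def critical_distance(n, alpha):
    """Smallest c with P(|X - n/2| >= c) <= alpha under H0: p = 0.5."""
    for c in range(n // 2 + 2):
        tail = sum(binom_pmf(k, n, 0.5) for k in range(n + 1)
                   if abs(k - n / 2) >= c)
        if tail <= alpha:
            return c, tail
    raise ValueError("no rejection region found")

def type_ii_error(n, alpha, true_p):
    """beta = P(fail to reject H0) when the true head probability is true_p."""
    c, _ = critical_distance(n, alpha)
    return sum(binom_pmf(k, n, true_p) for k in range(n + 1)
               if abs(k - n / 2) < c)

n, alpha = 50, 0.05                          # alpha is an assumed level
c, actual_alpha = critical_distance(n, alpha)
beta = type_ii_error(n, alpha, true_p=0.7)   # 0.7 is a hypothetical truth

print(f"reject H0 when |heads - {n // 2}| >= {c}")
print(f"Type I error rate = {actual_alpha:.4f}, Type II error rate = {beta:.4f}")
```

The further the true probability is from 0.5, the smaller β becomes, because a strongly biased coin is harder to mistake for a fair one.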
GOODNESS OF FIT
• The Goodness of Fit test is used to check whether sample data fit a hypothesized distribution
of a population. The population may have, for example, a normal distribution or a Weibull distribution. In simple
words, it indicates whether the sample data represent the data that we would expect to
find in the actual population. The following tests are generally used by statisticians:
• Chi-square
• Kolmogorov-Smirnov
• Anderson-Darling
• Shapiro-Wilk
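
As an illustration of the first test in the list, the chi-square statistic compares observed counts with the counts expected under the hypothesized distribution. The die-roll counts below are hypothetical data invented for this sketch; 11.070 is the standard table value of the chi-square critical point for 5 degrees of freedom at the 0.05 level.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical data: a die rolled 60 times; a fair die expects 10 per face.
observed = [8, 9, 12, 11, 6, 14]
expected = [10] * 6

stat = chi_square_statistic(observed, expected)
critical = 11.070   # chi-square table value, df = 6 - 1 = 5, alpha = 0.05

print(round(stat, 2))
print("reject fit" if stat > critical else "no evidence against the fit")
```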
ANOVA (ANALYSIS OF VARIANCE)
• Analysis of Variance is also termed ANOVA. It is a procedure followed by statisticians to check for
potential differences in a scale-level dependent variable across a nominal-level variable having two or
more categories. It was developed by Ronald Fisher in 1918, and it extends the t-test and z-test, which
can compare only two categories of a nominal-level variable.
• TYPES OF ANOVA
1. One-way ANOVA - One-way ANOVA has only one independent variable and compares the groups (levels) of
this variable. For example, to assess differences in IQ by country, you can compare data from two or more
countries.

2. Two-way ANOVA - Two-way ANOVA uses two independent variables. For example, to assess
differences in IQ by country (variable 1) and gender (variable 2). Here you can examine the
interaction between the two independent variables. Such interactions may indicate that differences in
IQ are not uniform across an independent variable. For example, females may have higher IQ scores
than males overall, with a much larger difference in Europe than in America.
ANOVA TEST PROCEDURE

• Set up the null and alternative hypotheses, where the null hypothesis states that there is no
significant difference among the groups, and the alternative hypothesis states that there
is a significant difference among the groups.
• Calculate the F-ratio and the probability of F.
• Compare the p-value of the F-ratio with the established alpha (significance level).
• If the p-value of F is less than the significance level (commonly 0.05), then reject the null hypothesis.
• If the null hypothesis is rejected, conclude that the means of the groups are not all equal.
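
The F-ratio step of this procedure can be sketched in plain Python for the one-way case. The three groups of scores are hypothetical data invented for illustration, and the decision is made against the standard table value F(2, 12) ≈ 3.89 at the 0.05 level instead of an exact p-value, which would need an F-distribution routine such as the one in SciPy.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F-ratio = (between-group mean square) / (within-group mean square)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical scores for three groups of five subjects each.
groups = [
    [4, 5, 6, 5, 5],
    [7, 8, 6, 7, 7],
    [5, 6, 7, 6, 6],
]

f_ratio = one_way_anova_f(groups)
critical = 3.89    # F table value for (2, 12) degrees of freedom, alpha = 0.05

print(round(f_ratio, 2))
print("reject H0" if f_ratio > critical else "fail to reject H0")
```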
LINEAR REGRESSION

Once the degree of relationship between variables has been established using correlation
analysis, it is natural to delve into the nature of the relationship. Regression analysis helps in
determining the cause and effect relationship between variables. It is possible to predict
the value of one variable (called the dependent variable) from the values of the independent
variables, using either a graphical method or the algebraic method.
The algebraic method develops two regression equations: X on Y, and Y on X.

Regression equation of Y on X

EQN: Y = a + bX

where
Y = Dependent variable
X = Independent variable
a = Constant showing the Y-intercept
b = Constant showing the slope of the line

Values of a and b are obtained from the following normal equations:

$$\sum Y = Na + b\sum X$$
$$\sum XY = a\sum X + b\sum X^2$$

Regression equation of X on Y

EQN: X = a + bY

where
X = Dependent variable
Y = Independent variable
a = Constant showing the X-intercept
b = Constant showing the slope of the line

Values of a and b are obtained from the following normal equations:

$$\sum X = Na + b\sum Y$$
$$\sum XY = a\sum Y + b\sum Y^2$$
PROBLEM STATEMENT:

A researcher has found that there is a correlation between the weights of fathers
and sons. He is now interested in developing a regression equation on the two variables from
the given data:

Weight of father (in Kg), X:  69 63 66 64 67 64 70 66 68 67 65 71
Weight of son (in Kg), Y:     70 65 68 65 69 66 68 65 71 67 64 72

Develop the regression equation of Y on X.
SOLUTION
• Y = a + bX
• where a and b are obtained from the normal equations

$$\sum Y = Na + b\sum X$$
$$\sum XY = a\sum X + b\sum X^2$$

From the data: $\sum Y = 810$, $\sum X = 800$, $\sum X^2 = 53{,}402$, $\sum XY = 54{,}059$, $N = 12$.

810 = 12a + 800b ... (i)
54059 = 800a + 53402b ... (ii)

• Multiplying equation (i) by 800 and equation (ii) by 12, and subtracting (ii) from (i):
-824b = -708
b ≈ 0.8592
• Putting this in eq (i): 12a = 810 − 800 × 0.8592, so a ≈ 10.22, and
Y = 10.22 + 0.859X
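
The hand calculation can be cross-checked by computing the sums and solving the two normal equations directly from the twelve data pairs. This is a minimal sketch; `fit_line` is an illustrative helper name, and the closed-form expressions are simply the solution of the two normal equations for a and b.

```python
father = [69, 63, 66, 64, 67, 64, 70, 66, 68, 67, 65, 71]  # X
son = [70, 65, 68, 65, 69, 66, 68, 65, 71, 67, 64, 72]     # Y

def fit_line(x, y):
    """Solve the normal equations of Y on X for intercept a and slope b."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Eliminating a from the two normal equations gives the slope.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Back-substitute into sum(Y) = N*a + b*sum(X) for the intercept.
    a = (sum_y - b * sum_x) / n
    return a, b

a, b = fit_line(father, son)
print(sum(father), sum(son))
print(sum(x * x for x in father), sum(x * y for x, y in zip(father, son)))
print(f"Y = {a:.2f} + {b:.3f} X")
```

Printing the intermediate sums alongside the fitted line makes it easy to spot an arithmetic slip in any one of them.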
LOGISTIC REGRESSION
Logistic regression is a statistical method for analyzing a dataset in which there are one
or more independent variables that determine an outcome. The outcome is measured
with a dichotomous variable (in which there are only two possible outcomes).

$$\pi(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$

• Response - Presence/Absence of characteristic.
• Predictor - Numeric variable observed for each case.
• β = 0 ⇒ P(Presence) is the same at each level of x.
• β > 0 ⇒ P(Presence) increases as x increases.
• β < 0 ⇒ P(Presence) decreases as x increases.
PROBLEM STATEMENT: SOLVE THE LOGISTIC REGRESSION OF THE FOLLOWING
PROBLEM RIZATRIPTAN FOR MIGRAINE
RESPONSE - COMPLETE PAIN RELIEF AT 2 HOURS (YES/NO).
PREDICTOR - DOSE (MG): PLACEBO (0), 2.5,5,10

SOLUTION: Taking α = −2.490 and β = 0.165, we have the following:

$$\pi(0) = \frac{e^{-2.490 + 0.165 \times 0}}{1 + e^{-2.490 + 0.165 \times 0}} \approx 0.08$$

$$\pi(2.5) = \frac{e^{-2.490 + 0.165 \times 2.5}}{1 + e^{-2.490 + 0.165 \times 2.5}} \approx 0.11$$

$$\pi(5) = \frac{e^{-2.490 + 0.165 \times 5}}{1 + e^{-2.490 + 0.165 \times 5}} \approx 0.16$$

$$\pi(10) = \frac{e^{-2.490 + 0.165 \times 10}}{1 + e^{-2.490 + 0.165 \times 10}} \approx 0.30$$

DOSE   π(x)
0      0.08
2.5    0.11
5      0.16
10     0.30
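
The fitted probabilities can be evaluated numerically from the formula for π(x). This sketch only plugs the stated α and β into the logistic function; it does not estimate them from data, and the function name `logistic` is illustrative.

```python
from math import exp

def logistic(x, alpha, beta):
    """pi(x) = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))."""
    z = alpha + beta * x
    return exp(z) / (1 + exp(z))

alpha, beta = -2.490, 0.165          # coefficients stated in the solution
doses = [0, 2.5, 5, 10]              # placebo and the three dose levels (mg)

for dose in doses:
    print(f"dose={dose:>4}  pi={logistic(dose, alpha, beta):.2f}")
```

Because β is positive here, the computed probability of pain relief rises with the dose, as the interpretation of β > 0 predicts.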
THANK YOU
