Basic Statistics
Name:
Control no.:
COURSE OUTLINE
I. INTRODUCTION
   A. Definition of Statistics
   B. Uses of Statistics
   C. Division of Statistics
   D. Population Versus Sample
   E. Variable
      1. Types
      2. Levels of Measurement
II. DATA COLLECTION AND PRESENTATION
   A. Kinds of Data
   B. Methods of Data Collection
   C. Sampling Technique
   D. Methods of Data Presentation
      1. Textual
      2. Tabular
      3. Graphical
III. SUMMARIZING DATA
   A. Parameter versus Statistic
   B. Measures of Central Tendency
   C. Measures of Variability
   D. Coefficient of Variation
IV. THE NORMAL DISTRIBUTION
   A. Importance
   B. Properties
   C. Areas Under the Normal Curve
V. INFERENTIAL STATISTICS
   A. Test of Hypothesis
      1. Types of Statistical Hypothesis
      2. Types of Test of Hypothesis
      3. Types of Error
      4. Level of Significance
      5. Measures in Decision Making
   B. Application
      1. Parameter
         a. Test of Significance
            a.1 one-sample z or t-test
            a.2 two-sample
      2. ANOVA
         a. Functions
         b. Rationale
         c. ANOVA Calculations
      3. CHI-SQUARE TEST
         a. Types
         b. Two-by-two Contingency Table
         c. Limitations of Chi-Square
      4. CORRELATION AND LINEAR REGRESSION
         a. Difference between Correlation and Regression
         b. The Scatter Diagram
         c. Correlation Coefficient
         d. Regression Analysis
      5. NONPARAMETRIC METHOD
         a. Advantages and Disadvantages
         b. Spearman Rank Order Correlation
I. INTRODUCTION
A. STATISTICS – deals with the collection, organization, presentation, analysis, and
interpretation of data obtained by conducting a survey or an experiment.
- Also refers to numerical facts.
B. Uses of Statistics
- An essential tool in education, government, business, economics, medicine, psychology,
sociology, sports, and other fields.
- To describe and draw inferences about the numerical properties of a population.
- To make sound decisions based on the data (how and where).
- To help instructors build and analyze tests and prepare grades.
C. Division of Statistics
1. Descriptive Statistics – consists of methods for organizing, displaying, and
describing data by using tables, graphs, and summary measures.
- Aims to give information about a large group of data without dealing with each and
every element of the group.
Tools: measures of central tendency, measures of variability, skewness, and kurtosis.
2. Inferential Statistics – consists of methods that use sample results to help
make decisions or predictions about a population.
- Aims to give information about a large group of data without dealing with each and
every element of the group.
- Uses only a small portion of the total set of data to draw conclusions.
Tools: testing of hypotheses using the t-test, z-test, simple linear correlation, analysis of
variance, chi-square, regression analysis, and time series analysis.
Kinds of variables
1. Independent – the presumed cause of the change that happens in the dependent variable.
2. Dependent – the presumed effect of a change in the independent variable.
Types of variables
1. Qualitative – expressed in kinds or categories.
- Nonmeasurable characteristics that cannot assume a numerical value.
Ex. Gender, civil status
2. Quantitative – expressed in amounts or numbers.
a. Continuous – values are obtained by measurement.
Ex. Height, weight, and time in minutes
b. Discrete – a variable whose values are countable. It can assume only certain
values, with no intermediate values.
Ex. Number of students in a Statistics class, number of subjects a student can enroll in
during a semester.
2. Ordinal Scale – tells whether one person, object, or aspect is greater than or less
than another, but does not tell how large the difference is.
- It orders the categories of a variable by ranking them.
Example: performance rating, prize won, I.Q. level
3. Interval Scale – provides numbers that reflect differences among items or classes.
Example: test score, I.Q. score, performance score, Fahrenheit, Celsius, time,
etc.
4. Ratio Scale – the only scale that has a true zero, the point of origin being fixed.
Example: Kelvin, height, weight, length, width, loudness, etc.
II. Data Collection and Presentation
A. Sources of Data
1. Documentary Sources – data contained in published or unpublished reports,
statistics, documents, manuscripts, letters, etc.
a. Primary source – data gathered originally; first-hand data
b. Secondary source – data taken from sources that were originally gathered by others
2. Field Sources – include living persons who have sufficient knowledge about social
conditions or who have been in close contact with the subjects over a considerable
period of time.
C. Sampling Technique
1. Non-Probability sampling/Non-Random sampling – there is no way of
estimating the probability that each individual or element will be included in the sample.
Types
a. Accidental or incidental sampling
b. Quota sampling – the proportions of the various subgroups in the population are
determined, and the sample is drawn to have the same percentages in it.
c. Purposive sampling – based on certain criteria laid down by the researcher.
d. Convenience sampling
2. Probability sampling/Random sampling – every individual has an equal chance of
being selected in the sample before the selection is done.
Types
a. Simple random sampling – the items are picked for the sample at random.
b. Systematic random sampling – the items are chosen from the population at
uniform intervals.
c. Cluster sampling – used when the population is spread out geographically.
d. Stratified random sampling – this procedure divides the population into
subgroups called strata.
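As an illustration of stratified random sampling, the sketch below splits a chosen total sample size across strata in proportion to their sizes. The strata names, their sizes, and the total sample size of 100 are hypothetical values picked for the illustration, and the rounding rule (largest remainders get the leftover units) is one common convention, not something the notes prescribe.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Split sample_size across strata in proportion to each stratum's size."""
    total = sum(strata_sizes.values())
    # Exact (possibly fractional) share of the sample owed to each stratum.
    exact = {name: sample_size * size / total
             for name, size in strata_sizes.items()}
    # Start from the whole-number part of each share ...
    alloc = {name: int(share) for name, share in exact.items()}
    # ... then give leftover units to the largest fractional remainders.
    leftover = sample_size - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - alloc[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


# Hypothetical strata (not taken from the exercises in these notes).
strata = {"Grade 7": 400, "Grade 8": 300, "Grade 9": 200, "Grade 10": 100}
allocation = proportional_allocation(strata, 100)
print(allocation)
```

Within each stratum, the allotted number of members would then be drawn by simple random sampling.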
Example
Suppose a Statistics class with 60 students was given a 100-item examination and the results are:
Table 1: Test Scores Obtained by the Sixty Students in a Statistics Class
45 73 57 69 68 35 70 62 47 60
46 70 49 45 53 60 39 65 38 59
50 69 62 35 58 69 45 28 58 65
38 49 28 36 41 58 37 35 61 48
36 51 59 55 60 37 55 59 57 36
70 36 50 63 68 30 56 70 53 57
Exercise 1.1
1. The data below show the frequency distribution of the religion of all the students of
Pala High School. Show how many samples will be taken from each of the following
categories using stratified random sampling. Use the 5% level of significance.
Religion Frequency N
Roman Catholic 355
Iglesia ni Cristo 250
Jehovah’s Witnesses 245
Others 150
Total
2. The data below show the number of clients/patients per day at a certain hospital in
Tuguegarao City.
30 34 36 48 44 43
44 43 33 30 33 46
33 51 34 35 35 49
49 41 45 34 30 34
32 24 33 33 29 48
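Ungrouped Data:
$\bar{x} = \frac{\sum X}{n}$
∑X – sum of scores or measurements
n – number of cases
$\bar{x} = \frac{\sum fX}{\sum f}$ (weighted mean)
Grouped Data:
$\bar{x} = \frac{\sum f X_m}{\sum f}$
f – frequency
X_m – class mark or midpoint
$Md = \tilde{x} = LCB + \left[ \frac{\frac{n}{2} - cf}{f} \right] i$
Grouped Data:
$Mo = \hat{x} = LCB + \left[ \frac{d_1}{d_1 + d_2} \right] i$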
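The grouped-data formulas above can be sketched in code. The symbols are not defined in the notes, so this sketch assumes their usual textbook meanings: LCB is the lower class boundary of the median (or modal) class, cf is the cumulative frequency before that class, f is its frequency, i is the class width, and d1 and d2 are the differences between the modal class frequency and the frequencies of the classes just before and just after it. The class boundaries and frequencies are hypothetical.

```python
def grouped_mean(classes):
    """Mean of grouped data: sum(f * Xm) / sum(f)."""
    total_f = sum(f for _, _, f in classes)
    return sum(f * (lo + hi) / 2 for lo, hi, f in classes) / total_f


def grouped_median(classes):
    """Median of grouped data: LCB + ((n/2 - cf) / f) * i."""
    n = sum(f for _, _, f in classes)
    cf = 0  # cumulative frequency before the current class
    for lo, hi, f in classes:
        if cf + f >= n / 2:  # first class whose cumulative frequency reaches n/2
            return lo + (n / 2 - cf) / f * (hi - lo)
        cf += f


def grouped_mode(classes):
    """Mode of grouped data: LCB + (d1 / (d1 + d2)) * i."""
    freqs = [f for _, _, f in classes]
    k = freqs.index(max(freqs))  # modal class (first one if tied)
    lo, hi, f = classes[k]
    d1 = f - (freqs[k - 1] if k > 0 else 0)
    d2 = f - (freqs[k + 1] if k < len(freqs) - 1 else 0)
    return lo + d1 / (d1 + d2) * (hi - lo)


# Hypothetical classes: (lower boundary, upper boundary, frequency).
classes = [(9.5, 19.5, 2), (19.5, 29.5, 5), (29.5, 39.5, 8),
           (39.5, 49.5, 4), (49.5, 59.5, 1)]
print(grouped_mean(classes))    # 33.0
print(grouped_median(classes))  # 33.25
print(grouped_mode(classes))
```

For these classes the median class is 29.5 to 39.5 (cf = 7, f = 8, i = 10), so Md = 29.5 + (10 − 7)/8 × 10 = 33.25, and the mode works out to 29.5 + 3/7 × 10, about 33.79.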
1. Range (R) – the difference between the highest and the lowest values in
the distribution.
- The value is easily determined and is useful if the objective is to emphasize the extreme variation.
- Disadvantages: it is always affected by extreme values, and not all values are considered.
2. Mean Deviation (MD) – takes into account all the values in the distribution.
Ungrouped Data:
$MD = \frac{\sum |x - \bar{X}|}{n}$
x – individual values
$\bar{X}$ – mean of the distribution
Grouped Data:
$MD = \frac{\sum f\,|X_m - \bar{X}|}{\sum f}$
X_m – midpoint of each class
$\bar{X}$ – mean of the distribution
f – frequency of each class
3. Variance ($s^2$) – Area
- Most reliable because it can be treated mathematically and be used for deeper
analysis.
- The only measure of spread which can be used for statistical inferences.
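A minimal sketch of these measures for ungrouped data follows. The mean deviation uses absolute deviations, since signed deviations from the mean always sum to zero. The variance formula does not appear in the notes, so the sketch assumes the usual sample form, s² = ∑(x − x̄)² / (n − 1); the data values are hypothetical.

```python
import math


def value_range(xs):
    """Highest value minus lowest value."""
    return max(xs) - min(xs)


def mean_deviation(xs):
    """Average absolute distance of the values from their mean."""
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)


def sample_variance(xs):
    """Assumed sample form: squared deviations divided by n - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


data = [2, 4, 6, 8, 10]  # hypothetical scores
print(value_range(data))                 # 8
print(mean_deviation(data))              # 2.4
print(sample_variance(data))             # 10.0
print(math.sqrt(sample_variance(data)))  # standard deviation, about 3.16
```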
A. Importance
- The normal distribution is central in the study of Statistics, as it is the basis for solving
various types of statistical problems.
- The distributions of variables such as grades of students, weights or heights of
persons, incomes of families, and IQs of children may be said to be approximately
normal.
[Figure: bell-shaped curve labeled "Few" at each tail and "Many" in the middle]
There are relatively few short adults, relatively few tall adults, and the height of most adults will
tend towards middle value between the shortest and the tallest.
[Figure: frequency polygon over intervals A to G]
The shape of the "smoothed-out" frequency polygon is called the normal curve, which represents a
normal distribution.
1. Positively Skewed/Skewed to the right – a distribution that has a longer tail on the
right end.
2. Negatively Skewed/Skewed to the left – a distribution that has a longer tail on the left
end.
Ex. Most students in a class have high grades; families have high incomes.
2. The mean is equal to the median, which is also equal to the mode.
5. The area under the normal curve may be subdivided into at least three standard scores each to
the left and to the right of the vertical axis.
6. Along the horizontal line, the distance from one integral standard score to the next integral
standard score is measured by the standard deviation.
7. Of the area under the normal curve and above the horizontal axis, 68.27% is within 1 standard
deviation from the mean, 95.45% is within 2 standard deviations from the mean, and 99.73% is
within 3 standard deviations from the mean.
$z = \frac{x - \bar{X}}{s}$
Where: z = standard score
$\bar{X}$ = mean
s = standard deviation
x = a given value of a particular variable
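The standard-score formula and its inverse (recovering a raw value from a z-score) can be sketched as below. The function names are made up for this illustration, and the mean of 80 and standard deviation of 16 are used only as illustrative inputs.

```python
def z_score(x, mean, sd):
    """Standard score: how many standard deviations x lies from the mean."""
    return (x - mean) / sd


def raw_score(z, mean, sd):
    """Inverse of z_score: the raw value that corresponds to a given z."""
    return mean + z * sd


print(z_score(110, 80, 16))    # 1.875
print(raw_score(0.5, 80, 16))  # 88.0
```

A positive z means the value lies above the mean; a negative z means it lies below.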
Ex. 1. Suppose that we have a distribution of test scores for which the following
statistics have been computed: the mean grade is 80 and the standard deviation is 16.
Find the standard score of each of the following grades:
A. X = 110
B. X = 77
C. X = 64
D. X = 120
E. X = 59
F. Find the grades of two students whose z-scores are −0.6 and 1.2, respectively.
6. Find the area under the normal curve from z = 0.81 to z = 1.94
7. Find the area under the normal curve from z = 0.01 to z = 1.2
8. Find the area under the normal curve from z = 2.85 to z = 1.98
9. Find the area under the normal curve from z = 1.27 to z = 0.02
10. Find the area under the normal curve from z = 1.27 to z = 1.5
11. Find the area under the normal curve from z = -2.4 to z = 2.5
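Areas like these are normally read from a table of areas under the normal curve. As a cross-check, the sketch below computes them from the standard normal cumulative distribution, using the error function in Python's `math` module. The helper names `phi` and `area_between` are made up for this illustration.

```python
import math


def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def area_between(z1, z2):
    """Area under the standard normal curve between two z values."""
    return abs(phi(z2) - phi(z1))


print(round(area_between(-1, 1), 4))       # 0.6827, the 1-standard-deviation rule
print(round(area_between(0.81, 1.94), 4))  # 0.1828
```

Table lookups are rounded to four decimals, so a hand answer may differ from the computed one in the last digit.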
Applications:
1. The hourly wages of 500 skilled workers are found to approximate a normal
distribution. The average hourly wage is computed to be P100, with a standard
deviation of P10.
Find:
A. The percentage of workers whose hourly wages are:
a. from P100 to P110
b. from P100 to P110
c. from P80 to P90
B. The number of workers whose hourly wages are:
a. greater than or equal to P115
b. from P75 to P118
c. greater than or equal to P85
d. less than or equal to P113
C. 1. The minimum hourly wage of the upper 10% of workers (that is, those with the highest
wages)
2. The maximum hourly wage of the lowest 20% of workers (that is, those with the lowest
wages)
2. One thousand skilled workers were given an examination to determine how much
they know about the job. If the scores are normally distributed and the score of one
worker, measured as a z-score, is 0.8, how many of the workers who took the examination
scored higher than or equal to this particular worker?
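Application problems of this kind ask for a proportion, a count, or a cutoff value. The sketch below shows all three with `statistics.NormalDist` from the Python standard library. The mean of 50, standard deviation of 5, and group size of 200 are hypothetical and are not taken from the problems above.

```python
from statistics import NormalDist

# Hypothetical normally distributed scores (assumed values).
scores = NormalDist(mu=50, sigma=5)
group_size = 200

# Proportion and expected count of scores at or above 55.
share_above_55 = 1 - scores.cdf(55)
count_above_55 = round(group_size * share_above_55)

# Cutoffs: lowest score of the top 10%, highest score of the bottom 20%.
top_10_cutoff = scores.inv_cdf(0.90)
bottom_20_cutoff = scores.inv_cdf(0.20)

print(round(share_above_55, 4), count_above_55)
print(round(top_10_cutoff, 2), round(bottom_20_cutoff, 2))
```

The count is the proportion multiplied by the group size, and a cutoff is found by going from an area back to a z value, which is the reverse of a table lookup.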
Name: Score: __________
Exercise 1.3
a. P(z ≤ 2.1)
b. P(z ≤ -3.0)
c. P(z ≥ 1.8)
d. P(z ≥ 2.8)
a. P(x ≥ 28)
b. P(x ≥ 42)
c. P(x ≤ 38)
d. P(x ≥ 39)
e. P(x ≤ 45)
f. P(32 ≤ x ≤ 40)
g. P(26 ≤ x ≤ 35)
h. P(39 ≤ x ≤ 42)
i. P(37 ≤ x ≤ 44)
j. P(47 ≤ x ≤ 48)
V. INFERENTIAL STATISTICS
A. TEST OF HYPOTHESIS
Statistical Hypothesis
- A conjecture about a population parameter.
- A statement or tentative theory which aims to explain facts about the real world.
2 KINDS OF HYPOTHESIS
Types of Tests
One-Tailed – when the rejection region is located at only one extreme of the range of values
for the test statistic.
- > and < indicate the use of a one-tailed test in Ha.
Two-Tailed – when the rejection region is located at both extremes of the range of values
for the test statistic.
- ≠ indicates the use of a two-tailed test in Ha.
Types of Error
              Ho is True          Ho is False
Reject Ho     Type I error        Correct decision
Accept Ho     Correct decision    Type II error
Level of Significance
- The probability of making a Type I or alpha error in a test.
- The maximum value of the probability of rejecting the null hypothesis Ho
when in fact it is true.
- Commonly set at 0.05 (5%) or 0.01 (1%).
TESTING THE DIFFERENCE BETWEEN TWO MEANS
STEPS:
Test of Hypothesis:
1. State the hypotheses.
Ho: There is no significant difference between the items being compared.
Ha: There is a significant difference between the items being compared.
2. Set the level of significance.
3. Determine the test to be used.
Use the z-test if the population S.D. is given.
Use the t-test if the sample S.D. is given.
4. Determine the tabular value for the test.
For the z-test, use Table C.
For the t-test, first compute the degrees of freedom, then look up Table D.
For a single-sample t-test,
df = number of items − 1 = n − 1
For a two-sample t-test,
df = n1 + n2 − 2
Where:
n1 = number of items in the 1st sample
n2 = number of items in the 2nd sample
Test type          Tabular (critical) z values
One-tailed test    ±1.28     ±1.645    ±1.96     ±2.33
Two-tailed test    ±1.645    ±1.96     ±2.33     ±2.58
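The steps above can be sketched for a one-sample z-test. The test statistic itself does not appear in this part of the notes, so the sketch assumes the standard form z = (x̄ − μ) / (σ / √n). All the numbers are hypothetical, and the function names are made up for this illustration.

```python
import math


def one_sample_z(sample_mean, pop_mean, pop_sd, n):
    """Assumed standard statistic: z = (x-bar - mu) / (sigma / sqrt(n))."""
    return (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))


def reject_null(z, critical, tail):
    """Compare the computed z with a positive tabular value.

    tail is "right", "left", or "two", following the sign used in Ha.
    """
    if tail == "right":
        return z >= critical
    if tail == "left":
        return z <= -critical
    return abs(z) >= critical


def t_degrees_of_freedom(n1, n2=None):
    """df = n - 1 for one sample, n1 + n2 - 2 for two samples."""
    return n1 - 1 if n2 is None else n1 + n2 - 2


# Hypothetical problem: mu = 50, sigma = 10, n = 100, sample mean = 52.
z = one_sample_z(52, 50, 10, 100)
print(z)                               # 2.0
print(reject_null(z, 1.645, "right"))  # True: 2.0 exceeds 1.645
print(reject_null(z, 2.58, "two"))     # False: 2.0 is inside ±2.58
print(t_degrees_of_freedom(26), t_degrees_of_freedom(20, 21))  # 25 39
```

The same computed z can lead to different decisions depending on the tabular value, which is why the level of significance and the type of test are fixed before the test statistic is computed.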
Problems/Applications:
1. Data from a school census show that the mean weight of college students was 45 kilos,
with a standard deviation of 3 kilos. A sample of 200 college students was found to have
a mean weight of 47 kilos. Are the 200 college students really heavier than the rest,
using the .05 significance level?
2. A researcher knows that the average height of Filipino women is 1.525 meters. A
random sample of 26 women was taken and was found to have a mean height of 1.56
meters, with a standard deviation of .10 meters. Is there reason to believe that the 26
women in the sample are significantly smaller than the others at the .05 significance level?
A researcher wishes to find out whether or not there is a significant difference between the
monthly allowances of morning and afternoon students in his school. By random sampling, he
took a sample of 239 students in the morning session. These students were found to have a
mean monthly allowance of P142.00. The researcher also took a sample of 209 students in the
afternoon session. They were found to have a mean allowance of P148.00. The total
population of students in that school has a standard deviation of P140. Is there a significant
difference between the two samples at the .01 level of significance?
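When two sample means are compared and a single population standard deviation is given, one common form of the test statistic is z = (x̄1 − x̄2) / (σ √(1/n1 + 1/n2)). That formula is an assumption here, since it is not shown in this part of the notes, and the numbers below are hypothetical.

```python
import math


def two_sample_z(mean1, mean2, pop_sd, n1, n2):
    """Assumed form: z = (x1-bar - x2-bar) / (sigma * sqrt(1/n1 + 1/n2))."""
    standard_error = pop_sd * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / standard_error


# Hypothetical samples: means 52 and 50, sigma = 10, 50 items in each sample.
z = two_sample_z(52, 50, 10, 50, 50)
print(round(z, 2))     # 1.0
print(abs(z) >= 2.58)  # False: Ho is not rejected against a tabular value of 2.58
```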
EXERCISE 1.5
1. A researcher found out that the average age of Filipinos who get married is 21 years old. A
random sample of 50 married couples was taken and found to have an average age of 22.8,
with a standard deviation of 32. Is there reason to believe that the sample is significantly
older than the others, using the 1% level of significance?
2. The language cluster of MCNP-ISAP claims that the standard deviation of the latest English
Proficiency Exam of MCNP-ISAP is 4.0 and the mean score is 40. A random sample of 400
students from the school was taken and found to have a mean score of 44. At the 5% level of
significance, is there enough evidence to reject the claim?
3. JM Carpetech Inc. claims that the average cost of carpet installation and repairs is P8,550. A
sample of 60 repairs has an average of P8,600. The standard deviation of the sample is P8,380.
At the 5% level of significance, is there enough evidence to reject the company's claim?
4. A researcher wishes to test whether or not the case method of teaching is more effective than
the traditional method. She picks two classes of approximately equal intelligence. She gathers a
sample of 20 students, with whom she uses the case method, and another sample of 21 students,
with whom she uses the traditional method. After the experiment, an objective test revealed that the
first sample got a mean score of 29, while the second group got a mean score of 28.5. The
standard deviation of the population is 5.2. Based on the result of the administered test, can we
say that the case method is as effective as the traditional method?
Assumptions:
1. The various groups are assumed to be with normal populations.
2. The variances of the different groups are assumed to be equal.
3. The random samples in the groups should be independent.
Steps:
1. State the hypotheses.
Ho: There is no significant difference among the samples/groups.
Ha: There is a significant difference among the samples/groups.
2. Set the level of significance.
3. 3.a Compute the sums of squares, the degrees of freedom, and the mean sums of squares.
$TSS = \sum x^2 - \frac{(\sum x)^2}{N}$
$SS_w = TSS - SS_b$
$df_t = N - 1$
$df_b = k - 1$
$df_w = df_t - df_b$
$MSS_b = \frac{SS_b}{df_b} \qquad MSS_w = \frac{SS_w}{df_w}$
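The calculations above can be sketched for a one-way layout. Two pieces are assumptions because they are not shown in this part of the notes: the between-groups sum of squares is taken in its usual computational form, SSb = ∑(T_j² / n_j) − (∑x)² / N, where T_j is the total of group j, and the test statistic is taken as F = MSSb / MSSw. The three groups of data are hypothetical.

```python
def one_way_anova(groups):
    """Return the sums of squares, degrees of freedom, and F ratio."""
    all_values = [x for group in groups for x in group]
    N = len(all_values)  # total number of observations
    k = len(groups)      # number of groups
    correction = sum(all_values) ** 2 / N  # (sum of x)^2 / N

    tss = sum(x ** 2 for x in all_values) - correction
    # Assumed standard form of the between-groups sum of squares.
    ssb = sum(sum(group) ** 2 / len(group) for group in groups) - correction
    ssw = tss - ssb

    dft, dfb = N - 1, k - 1
    dfw = dft - dfb
    mssb, mssw = ssb / dfb, ssw / dfw
    return {"TSS": tss, "SSb": ssb, "SSw": ssw,
            "dfb": dfb, "dfw": dfw, "F": mssb / mssw}


# Hypothetical data: 3 groups of 3 observations each.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

For these groups, TSS = 153 − 121 = 32, SSb = 147 − 121 = 26, and SSw = 6, so F = (26/2) / (6/6) = 13. The computed F is then compared with the tabular F value at dfb and dfw degrees of freedom.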
Problems:
1. The weights in kilograms of 3 groups of 5 each are shown in the table below. Is there unusual
variation among the groups, using the 5% level of significance?
2. Below are the bowling scores of 5 groups of 4 players each. At the 2.5% level of significance, find out
if there is unusual variation among the groups.
Exercise 1.6
1. Aiza Grey, a manager of Elizza Manufacturing Company, wants to see whether the average time
(in minutes) it takes her employees to commute to work is different for three groups.
The data are shown here. At α = 0.05, can she conclude that there is a significant difference
among the means?
2. Karl James of MND Research Incorporated tests the lifetime (in hours) of four DVD disks.
The data are shown here. At α = 0.01, is there a difference in the means?
Enumeration Data
Expressed in the form of frequencies, which represent the number of items within specified
qualitative descriptions or categories.
2 Classifications
1. One-way classification – has one variable described by at least two categories.
Table 1
Gender Frequency
Male 40
Female 55
total 95
Table 2
Table 3
Exercise 1.7
1. Test the hypothesis that IQ level is independent of educational attainment among 150
female respondents. Use the data below at the 0.05 level.
2. A study was conducted to determine the opinion of college students regarding the increase
in tuition fees. The data below show the results of the survey. Use the 5% level of significance.
Agree Disagree
Freshmen 35 68
Sophomore 40 75
Juniors 28 80
Seniors 24 86
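For a two-way table of frequencies like the one above, a test of independence compares the observed counts with the counts expected if the two variables were unrelated. The chi-square formula is not shown in this part of the notes, so the sketch assumes the usual form χ² = ∑ (O − E)² / E, with E = (row total × column total) / grand total and df = (rows − 1)(columns − 1). The 2-by-2 counts are hypothetical.

```python
def chi_square_independence(table):
    """Return (chi-square, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables are independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi_square += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square, df


# Hypothetical 2-by-2 contingency table of observed frequencies.
chi_square, df = chi_square_independence([[10, 20], [30, 40]])
print(round(chi_square, 4), df)  # 0.7937 1
```

The computed value is compared with the tabular chi-square value at the chosen level of significance and df degrees of freedom.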
SIMPLE CORRELATION ANALYSIS
- Used to measure the degree of linear relationship or association between two variables.
Assumptions:
1. Perfect Correlation (+ & -) = using the scatter diagram, the points lie in a straight line. E.g.,
as the pressure ↑, the temperature also ↑ when the volume is constant.
2. Some Degree of Correlation (+ & -) = as one variable (the independent variable) ↑, the
other variable may rise or fall, although not in straight-line fashion. E.g., weight vs. height;
average grade vs. IQ; GNP vs. aggregate investment.
3. No Correlation = the points in the scatter diagram show no trend or direction at all.
$\rho = r_s = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$
where: D = difference between the ranks of x and y
n = number of ordered pairs in the sample
rho = degree of relationship between x and y
Interpretation: the degree of linear relationship can be interpreted through the use of a range of
values.
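The Spearman formula can be sketched in code. The sketch gives tied values the average of the ranks they occupy, which is a common convention and an assumption here; with many ties the formula is only approximate. The paired scores are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared_rank
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """rho = 1 - 6 * sum(D^2) / (n * (n^2 - 1)), D = rank difference."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Hypothetical paired scores for five students.
x = [70, 80, 85, 75, 90]
y = [75, 70, 88, 79, 92]
print(round(spearman_rho(x, y), 2))  # 0.7
```

Here the rank differences are −1, 2, 0, −1, 0, so ∑D² = 6 and rho = 1 − 36/120 = 0.7.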
Student no. 1 2 3 4 5 6 7 8 9 10
Basic Statistics 70 80 85 75 80 80 91 85 92 85
Computer Literacy 75 85 76 79 90 80 89 89 90 88
2. The following are the scores on the NSAT examination and the achievement grades of
15 students of a certain college. Determine the degree of relationship existing
between the two variables and the significance of the obtained r using the 0.05 level.
Use Spearman rho.
Exercise 1.8
1. Determine if there is a relationship between the income (x) and the expenditure
(y) of a random sample of 10 families in a particular urban area. Test the significance of
the r value at the 0.01 level given the data below. Use Spearman rank correlation.
2. Find the relationship between the length and weight of 12 newborn babies in a certain
private hospital. Using the data below, test if the relationship is significant at 0.025
level.
3. If a Pearson r value of -0.74 was computed on the data between the width of the road
and the number of accidents, what interpretation could be deduced? If the size of the
sample considered in this study is 30, can we say that there exists a real correlation
between the two variables at 0.01 level?
4. Which is more significant, a value of r = 0.84 from a sample of 15, or a value of r=0.36
from a sample of 80?
REGRESSION ANALYSIS
- Concerned with the problems of estimation, forecasting, and prediction.
- Literally predicts the value of one variable by going back to (or regressing to) the values of
another related variable.
- Makes it possible to estimate the value of the dependent variable corresponding to a given value of
the independent variable.
- Ex. The weight of persons corresponding to specific heights, the job performance of an applicant using
information available at the time of application, academic performance in school given
knowledge of the scores on an intelligence test.
Two ways to solve
1. By graphing
2. By regression formula
Trend line – the line that represents the series of plotted points, drawn in such a way that it
approximates the general direction of the points and passes through or near the points.
2nd method: Use the equation of the Least Squares Regression Line or LSRL
(equation: Y = a + bX)
The method using the LSRL is reduced to finding the equation of the trend line, which in
turn is found by solving for a and b in the equation.
Least Squares: means that the most accurate trend line that may be drawn is the one where
the sum of the squares of the vertical distances of the points from the line is least or
minimum.
FORMULAS:
$a = \frac{(\sum Y)(\sum X^2) - (\sum X)(\sum XY)}{n(\sum X^2) - (\sum X)^2}$ or $a = \bar{Y} - b\bar{X}$
$b = \frac{n(\sum XY) - (\sum X)(\sum Y)}{n(\sum X^2) - (\sum X)^2}$
Where:
∑Y = the sum of the values of Y, the dependent variable
n = the number of pairs of X and Y
∑X = the sum of the values of X, the independent variable
∑XY = the sum of the column XY
∑X² = the sum of the column X²
Example:
X      Y      XY     X²
1      1      1      1
3      2      6      9
4      4      16     16
6      4      24     36
8      5      40     64
9      7      63     81
11     8      88     121
14     9      126    196
∑X = 56   ∑Y = 40   ∑XY = 364   ∑X² = 524
n = 8
$a = \frac{(40)(524) - (56)(364)}{8(524) - (56)^2} = \frac{576}{1056} \approx 0.55$
$b = \frac{8(364) - (56)(40)}{8(524) - (56)^2} = \frac{672}{1056} \approx 0.64$
LSRL: Y = a + bX
Y = 0.55 + 0.64X
Find Y when X = 16:
Y = 0.55 + 0.64(16)
Y = 0.55 + 10.24
Y = 10.79
Y ≈ 10.8
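The same example can be checked in code. The sketch below applies the formulas for a and b to the X and Y values from the example table; the function name `least_squares_line` is made up for this illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) of the least squares line Y = a + bX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return a, b


# X and Y values from the example table.
X = [1, 3, 4, 6, 8, 9, 11, 14]
Y = [1, 2, 4, 4, 5, 7, 8, 9]

a, b = least_squares_line(X, Y)
print(round(a, 4), round(b, 4))  # 0.5455 0.6364
print(round(a + b * 16, 2))      # 10.73, the predicted Y at X = 16
```

With unrounded coefficients the prediction at X = 16 is about 10.73. A hand computation that rounds a and b to two decimals before substituting lands slightly higher, so small differences in the last digit are expected.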
Exercise 1.8
1. A researcher wants to know if there is a relationship between the hours spent studying a particular
subject at home and the achievement of the students in that subject. If a significant relationship
can be established, what prediction equation can be used to estimate achievement in the
subject, knowing the number of hours spent studying the subject at home? Let α = 0.05. The
following are the results of the observation.
4. The data below represent the mid-term and final grades of 10 students during a particular
semester: