Stat Reviewer Notes
The word statistics means different things to different people. To a college student,
statistics are the scores on all quizzes, seatwork, assignments and recitations in a
subject. To a biological researcher investigating the effects of pollution on the environment,
statistics are evidence of the success of research efforts. To a school president, statistics are
information on faculty and employee salaries, tardiness and absenteeism, and increases or
decreases in enrollment. To the manager of a food chain, statistics may be the kinds of food
most frequently served to customers, and to the president of a country, statistics are
information on jobs created, housing projects, growth or decline in the economy, and so on.
All of them use statistics correctly, yet in different ways and for different purposes.
Presently, Statistics is defined as the branch of scientific methodology which deals with the
collection, classification, description and interpretation of data obtained through survey or
experiment.
STATISTICS is a scientific body of knowledge that deals with the collection, organization or
presentation, analysis and interpretation of data.
Functions of Statistics
1. To provide investigators means of measuring scientifically the conditions that may be
involved in a given problem and assessing the way in which they are related.
2. To show the laws underlying facts and events that cannot be determined by
individual observations.
3. To show relations of cause and effect that otherwise may remain unknown.
4. To find the trends and behavior in related conditions which otherwise may remain
ambiguous.
For example, we may describe a collection of persons by stating how many are poor and how
many are rich, how many are literate and how many are illiterate, how many fall into various
categories of age, height, civil status, IQ, and many more. We may also describe a particular
barangay in terms of the number of families it has, the number of grade-schoolers, the
number of professionals, the number of households with certain kinds of appliances, the
number of siblings in each household, or the rate of unemployment.
Suppose we want to know the favorite brand of toothpaste in a certain barangay, but
we do not have enough time and money to interview all the residents of that barangay. We
may just ask selected residents. With the data obtained from the interviews, we shall draw
conclusions about the barangay’s favorite brand of toothpaste. This example involves
the use of inferential statistics.
TERMINOLOGIES IN STATISTICS
Some important terms are commonly used in the study of Statistics. These terms should be
understood fully in order to facilitate the study of statistics.
1. Population refers to a large collection of persons, objects, places or things. To illustrate this,
suppose a researcher wants to determine the average income of the residents of a
certain barangay and there are 1500 residents in the barangay. Then all of these
residents comprise the population. The size of a population is usually denoted
by N. Hence, in this case, N = 1500.
2. Sample is a small portion or part of a population. It could also be defined as a
subgroup, subset, or representative of a population. For instance, suppose the
above-mentioned researcher does not have enough time and money to conduct the study
using the whole population and he wants to use only 200 residents. These 200
residents comprise the sample. The size of a sample is usually denoted by n, thus n = 200.
3. Parameter is any numerical or nominal characteristic of a population. It is a value
or measurement obtained from a population. It is usually referred to as the true or
actual value. If, in the preceding illustration, the researcher uses the whole population
(N = 1500), then the average income obtained is called a parameter.
4. Statistic is an estimate of a parameter. It is a value or measurement obtained from
the sample. If the researcher in the preceding illustration makes use of the sample
(n = 200), then the average income obtained is called a statistic.
5. Data (singular form is datum) are facts, or a set of information or observations
under study. More specifically, data are gathered by the researcher from a population
or from a sample. Data may be classified into two categories, qualitative or
quantitative.
a. Qualitative data are data which can assume values that manifest the concepts of
attributes. These are sometimes called categorical data. Data falling in this
category cannot be subjected to meaningful arithmetic. They cannot be added,
subtracted or divided. Gender and nationality are qualitative data.
b. Quantitative Data are data which are numerical in nature. These are data
obtained from counting or measuring. In addition, meaningful arithmetic
operations can be done with this type of data. Test scores and height are
quantitative data.
6. A Variable is a characteristic or property of a population or sample which makes the
members different from each other. If a class consists of boys and girls, then gender
is a variable in this class. Height is also a variable because different people have
different heights. Variables may be classified on the basis of whether they are discrete
or continuous and whether they are dependent or independent.
a. Discrete Variable
A discrete variable is one that can assume only a finite or countable number of values. In other
words, it can assume specific values only. The values of a discrete variable are
obtained through the process of counting. The number of students in a class
is a discrete variable. If there are 40 students in a class, it cannot be reported that
there are 40.2 students or 40.5 students, because it is impossible for a
fractional part of a student to be in the class.
b. Continuous Variable
A continuous variable is one that can assume infinitely many values within a specified
interval. The values of a continuous variable are obtained through measuring.
Height is a continuous variable. If one reports that the height of a building is 15
m, it is also possible that another person reports that the height of the same
building is 15.1 m or 15.12 m, depending on the precision of the measuring device
used. In other words, the height of the building can assume several values.
c. Dependent Variable
A dependent variable is a variable which is affected or influenced by another
variable.
d. Independent Variable
An independent variable is one which affects or influences the dependent
variable. To illustrate independent and dependent variables, consider the
problem entitled The Effect of Computer-Assisted Instruction on the
Students’ Achievement in Mathematics. Here the independent variable is the
computer-assisted instruction while the dependent variable is the achievement of
students in mathematics.
7. Constant refers to a fundamental quantity that does not change in value. Fixed costs
and acceleration due to gravity are examples of constants.
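The difference between a parameter and a statistic in the barangay illustration above can be sketched in a few lines of Python. This is only an illustration: the income figures are made-up random numbers, and the seed and income range are assumptions chosen for reproducibility, not data from these notes.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical monthly incomes for the N = 1500 residents (made-up data).
population = [random.randint(5_000, 50_000) for _ in range(1500)]

# Parameter: the mean computed from the whole population.
parameter = statistics.mean(population)

# Statistic: the mean computed from a random sample of n = 200 residents.
sample = random.sample(population, 200)
statistic = statistics.mean(sample)

print(f"N = {len(population)}, parameter (population mean) = {parameter:.2f}")
print(f"n = {len(sample)}, statistic (sample mean) = {statistic:.2f}")
print(f"difference between them = {abs(statistic - parameter):.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean, which is why a statistic is described as an estimate of a parameter.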
SCALES OF MEASUREMENT
1. Nominal Scale- This is the most primitive level of measurement. The nominal
level of measurement is used when we want to distinguish one object from another
for identification purposes. In this level, we can only say that one object is
different from another, but the amount of difference between them cannot be
determined. We cannot tell that one is better or worse than the other. Gender,
nationality and civil status are of nominal scale.
2. Ordinal Scale- In the ordinal level of measurement, data are arranged in some
specified order or rank. When objects are measured in this level, we can say that
one is better or greater than the other. But we cannot tell how much more or how
much less of the characteristic one object has than the other. The ranking of
contestants in a beauty contest, of siblings in the family, or of honor students in
the class are of ordinal scale.
3. Interval Scale- If data are measured in the interval level, we can say not only that
one object is greater or less than another, but we can also specify the amount of
difference. The scores in an examination are of interval scale of measurement. To
illustrate, suppose Kensly Kyle got 50 in a Math examination while Kwenn Anne
got 40. We can say that Kensly Kyle got a higher score than Kwenn Anne by 10 points.
4. Ratio Scale- The ratio level of measurement is like the interval level. The only
difference is that the ratio level always starts from an absolute or true zero point.
In addition, in the ratio level, there is always the presence of units of measure. If
data are measured in this level, we can say that one object is so many times as
large or as small as the other. For example, suppose Mrs. Reyes weighs 50 kg,
while her daughter weighs 25 kg. We can say that Mrs. Reyes is twice as heavy as
her daughter. Thus, weight is an example of data measured in the ratio scale.
SOURCES OF DATA
There are two sources of data. One is the primary source, from which
first-hand information is obtained, usually by means of personal interview and actual
observation. The other is the secondary source, where information is taken from
others’ works, news reports, readings, journals, magazines, and records kept
by the National Statistics Office, Securities and Exchange Commission, Social Security
System and other government and private agencies.
Data are said to be an asset of a company if they are accurate, updated and available
when needed. Hence, any institution or business organization must have a database
called a Management Information System where all information about its business
is made available in order to facilitate verification of claims and to come up with
wise management decisions.
METHODS OF COLLECTING DATA: Its Advantages and Disadvantages
1. Direct or Interview Method – is a person-to-person interaction between an
interviewer and an interviewee. A tape-recorded or written interview will help the
researcher obtain exact information from the interviewee.
Advantages: Precise and consistent answers can be obtained by modifying or
rephrasing the questions, especially for illiterate respondents or children under study.
Disadvantages: It consumes time, money and effort, and it is applicable
only to small populations, except when conducting a census.
2. Indirect or Questionnaire Method – is an alternative to the interview
method. Written responses are obtained by distributing questionnaires (lists of
questions intended to elicit answers to a given problem, which must be given in a logical
order and not be too personal) to the respondents by mail or by hand.
Advantages: Less time, money, and effort are consumed.
Disadvantages: Many responses may not be consistent due to poor construction
of the questionnaire. The meaning of the questions may differ for each
respondent. Inconsistent responses can no longer be modified; thus, the
valid number of respondents is reduced.
3. Registration Method – is enforced by private organizations or government
agencies for recording purposes.
Advantages: Organized data from an institution can serve as ready references for
future study or for personal claims on people’s records.
Disadvantages: Problems arise when an agency does not have a Management
Information System or when the system or process of registration is not
implemented well.
4. Observation Method – is a scientific method of investigation that makes possible
use of all senses to measure or obtain outcomes/responses from the object of
study.
Advantages: The observation method is usually applied to respondents that cannot be
asked or need not speak, especially when behaviors of persons, culture of
organizations, or performance outcomes of employees/students are to be considered.
Disadvantages: Subjectivity of the information sought cannot be avoided.
5. Experimentation – is used when the objective is to determine the cause-and-effect
relationship of a certain phenomenon under some controlled conditions.
Advantages: There is objectivity of information since a scientific method of
inquiry is used. An equal number of respondents with relatively similar
characteristics are examined to obtain the different effects of something
applied to the experimental group.
Disadvantages: It is difficult to find respondents with almost similar
characteristics. The whole method must be repeated if the desired outcome is not
reached.
Data that are collected by these methods are usually referred to as raw data.
Responses from taped interviews, answered questionnaires, filled-out
registration forms, recorded observations, and results from an experiment are
considered raw data since they are not yet organized and presented in a form
ready for interpretation.
CLASSIFICATION OF VARIABLES AND DATA
VARIABLE
  • Dependent
  • Independent
  QUALITATIVE
    • Dichotomous
    • Trichotomous
    • Multinomous
  QUANTITATIVE
    • Discrete
    • Continuous
DATA
  SOURCES
    • Primary
    • Secondary
  METHODS
    • Interview
    • Questionnaire
    • Registration
    • Observation
    • Experimentation
  SCALES OF MEASUREMENT
    • Nominal
    • Ordinal
    • Interval
    • Ratio
  PRESENTATION
    • Textual
    • Tabular
    • Graphical/Chart
      - Line Graph
      - Bar Graph
      - Pie Graph
      - Pictograph
      - Map/Cartogram
      - Scatter Point Diagram
In research, we seldom use the entire population because of the cost and time involved.
In fact, most researchers do not use the population in their study. Instead, a sample,
which is a small representative part of the population, is used. The characteristics of the
entire population are described using the characteristics observed from the sample.
Observe that there is a margin of error. When we use a sample, we do not get the actual
value but just an estimate of the parameter. Hence, there is an error associated with
using the sample.
To illustrate, suppose we want to find out the average age of the students in Manila.
However, due to insufficient time, only the students in three particular schools were
used to estimate the average age. Obviously, the result is not the actual average age but
just an estimate and thus, there is really an error when we use the sample instead of the
population.
Study the examples below in finding the sample size.
Example 1. A group of researchers will conduct a survey to find out the opinion of
residents of a particular community regarding the oil price hike. If there are 10,000
residents in the community and the researchers plan to use a sample using a 10%
margin of error, what should the sample size be?
Solution: N = 10,000, e = 10% or .10
$$n = \frac{N}{1 + Ne^2} = \frac{10{,}000}{1 + 10{,}000(.10)^2} = \frac{10{,}000}{1 + 100} = \frac{10{,}000}{101} = 99.01 \text{ or } 99$$
Hence, the researchers will conduct the survey using just 99 residents. A 10% margin of
error means that the researchers are willing to tolerate a difference of up to 10% between the
result obtained using the sample and the result they would have obtained had they used the population.
Example 2. Suppose that in example 1, the researcher would like to use a 5% margin of
error. What should be the size of the sample?
Solution: N=10,000 e = 5% or .05
$$n = \frac{10{,}000}{1 + 10{,}000(.05)^2} = \frac{10{,}000}{1 + 10{,}000(.0025)} = \frac{10{,}000}{1 + 25} = \frac{10{,}000}{26} = 384.62 \text{ or } 385$$
Observe from Examples 1 and 2 that as we reduce the margin of error, the sample size gets
larger. Hence, if we want to have a more accurate result, we have to use a larger sample.
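The computation used in Example 2, n = N / (1 + N e^2), which is often called Slovin's formula, can be checked with a short Python sketch. The function name `sample_size` is an illustrative choice, not something defined in these notes, and rounding to the nearest whole number follows the two worked examples.

```python
def sample_size(population_size: int, margin_of_error: float) -> float:
    """Return n = N / (1 + N * e**2), unrounded."""
    return population_size / (1 + population_size * margin_of_error ** 2)

# Reproduce Examples 1 and 2: N = 10,000 with e = 10% and e = 5%.
for e in (0.10, 0.05):
    n = sample_size(10_000, e)
    print(f"e = {e:.2f}: n = {n:.2f}, rounded to {round(n)}")
```

Running it gives 99 for a 10% margin of error and 385 for a 5% margin of error, matching the two examples and showing how the sample size grows as the margin of error shrinks.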
SAMPLING TECHNIQUES
Sampling Technique- a procedure used to determine the individuals or
members of a sample.
A – PROBABILITY OR RANDOM SAMPLING TECHNIQUE is a sampling technique
wherein each member or element of the population has an equal chance of being selected
as a member of the sample.
Let us illustrate how these random numbers are used to select the members of the
sample. Let us consider the preceding example wherein Mrs. Cruz wants to
select 5 students from her 40 students. Again, we will assign a number to each
student, say from 1 to 40.
Since there are 40 students, we will use two-digit numbers from the table of
random numbers when selecting the members of the sample. This is because the
students have been assigned the numbers 01, 02, 03, . . . up to 40. Looking at the
first column of the table of random numbers above, we see that the number
formed by the first two digits is 31; hence, the student assigned to number 31 is
chosen as a member of the sample. If we proceed down the column, we see that
the next number formed is 87, which cannot be used because we have only 40
members. In a similar manner, the third number is 06, so the student assigned
to number 6 is chosen. Notice that the next two numbers from the table are 95 and
44, numbers we cannot use for the same reason as before. When we get to the
bottom of the column, we move back up the column and merely shift one digit to the
right for the next random number. Thus, we will have 18 as our next number. This
is one of many alternatives. We can use other ways of selecting the members
of the sample until we complete the 5 students.
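Where a printed table of random numbers is not at hand, a pseudorandom number generator can play the same role. The sketch below uses Python's standard `random.sample` to draw 5 of the 40 numbered students without repetition; the seed value is an arbitrary assumption used only to make the draw repeatable.

```python
import random

random.seed(7)  # fixed seed so the draw is reproducible; any seed works

students = list(range(1, 41))        # students numbered 1 to 40
sample = random.sample(students, 5)  # 5 distinct students, each equally likely

print(sorted(sample))
```

As with the table, numbers outside 1 to 40 can never be selected, and no student can be picked twice.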
2. Systematic Sampling
Let us use the example wherein Mrs. Cruz wants to select 5 students from her 40
students. First, we compute the sampling interval i by dividing the number
of members in the population by the number of members in the sample. Hence, in
our case we shall have i = 40/5 = 8. The next step is to select a random starting point:
write the numbers 1, 2, 3, 4, 5, 6, 7, and 8 on pieces of paper and draw one number by
lottery. If we draw 5, we select the 5th student and every 8th student after that.
Therefore, the 5th, 13th, 21st, 29th, and 37th students shall be the members of
the sample. If, for instance, we draw the number 6, then the members
of the sample will be the 6th, 14th, 22nd, 30th and 38th students.
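The following is a minimal Python sketch of the usual interval-based form of systematic sampling, in which the number drawn by lottery is the starting position and the interval i = N/n is the step between selected members. The function name `systematic_sample` and the seed are illustrative assumptions.

```python
import random

def systematic_sample(population_size: int, sample_size: int, start: int) -> list[int]:
    """Return 1-based positions: the start, then every i-th member after it."""
    interval = population_size // sample_size  # i = N / n
    if not 1 <= start <= interval:
        raise ValueError("start must be between 1 and the interval")
    return [start + k * interval for k in range(sample_size)]

random.seed(1)                      # fixed seed for a reproducible "lottery"
start = random.randint(1, 40 // 5)  # draw one number from 1 to 8
print(start, systematic_sample(40, 5, start))

print(systematic_sample(40, 5, 5))  # starting at the 5th student
```

With a start of 5 and an interval of 8, the selected positions are 5, 13, 21, 29 and 37, so the sample is spread evenly over the whole class list.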
To do this, we will use stratified random sampling. The word stratified comes
from the root word strata, which means groups or categories (singular form is
stratum). When we use this method, we are actually dividing the elements of the
population into different categories or subpopulations and then the members of the
sample are drawn or selected proportionally from each subpopulation.
Solution: The first step is to find the percentage of each stratum. This is done by
dividing the number of families in each stratum by the total number of families. Then, we
multiply each percentage by the desired number of families in the sample.
Strata     Number of     Percentage                  Number of Families
           Families                                  in the Sample
High       1000          1000/5000 = 0.2 or 20%      0.2 x 200 = 40
Average    2500          2500/5000 = 0.5 or 50%      0.5 x 200 = 100
Low        1500          1500/5000 = 0.3 or 30%      0.3 x 200 = 60
Total      N = 5000                                  n = 200
From the above table, we see that if we are going to draw 200 members from the
population of 5000, we should draw 40 families belonging to the high-income, 100
from the average, and 60 from the low-income groups. Observe that the number of
families drawn as sample in each stratum is proportional to the number of families
from the population.
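The proportional allocation in the table can be sketched in Python as follows. The function name `proportional_allocation` is an illustrative assumption; the stratum sizes and the sample size of 200 come from the table above.

```python
def proportional_allocation(strata: dict[str, int], sample_size: int) -> dict[str, int]:
    """Allocate the sample to each stratum in proportion to its size."""
    total = sum(strata.values())
    # Rounding each share can make the total drift from sample_size when the
    # proportions are not exact; here they divide evenly.
    return {name: round(count * sample_size / total) for name, count in strata.items()}

families = {"High": 1000, "Average": 2500, "Low": 1500}  # N = 5000
print(proportional_allocation(families, 200))
```

This gives 40, 100 and 60 families for the high, average and low strata, the same figures as in the table. Within each stratum, the families would then be drawn at random.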
4. Cluster Sampling
Cluster sampling is sampling wherein groups or clusters instead of individuals are
randomly chosen. Recall that in simple random sampling we select members of
the sample individually. In cluster sampling, we first draw the members of
the sample by group and then select a sample of elements from each chosen cluster or
group randomly. Cluster sampling is sometimes called area sampling because it is
usually applied when the population is large.
To illustrate the use of this sampling method, let us suppose that we want to determine
the average income of the families in Manila. Let us assume there are 250 barangays
in Manila. We can draw a random sample of 20 barangays using simple random
sampling, and then a certain number of families from each of the 20 barangays may
be chosen.
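The two steps of the Manila illustration can be sketched in Python as below. Everything about the data is hypothetical: the barangay names, the family identifiers, the number of families per barangay, and the choice of 10 families per selected barangay (the illustration says only "a certain number").

```python
import random

random.seed(3)  # fixed seed for a reproducible sketch

# Hypothetical frame: 250 barangays, each with a made-up list of family IDs.
barangays = {
    f"Barangay {b}": [f"B{b}-F{f}" for f in range(1, random.randint(50, 150) + 1)]
    for b in range(1, 251)
}

# Step 1: draw 20 barangays (clusters) by simple random sampling.
chosen = random.sample(sorted(barangays), 20)

# Step 2: draw a fixed number of families from each chosen barangay.
families_per_cluster = 10  # assumed value, not given in the notes
sample = {name: random.sample(barangays[name], families_per_cluster) for name in chosen}

print(len(sample), sum(len(families) for families in sample.values()))
```

Only the 20 chosen barangays need to be visited, which is the practical appeal of cluster sampling when the population is spread over a large area.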
5. Multi-Stage Sampling
Multi-stage sampling is a combination of several sampling techniques. This
method is usually used by the researchers who are interested in studying a very large
population, say the whole island of Luzon or even the Philippines. This is done by
starting the selection of the members of the sample using cluster sampling and then
dividing each cluster or group into strata. Then, from each stratum, individuals are
drawn using simple random sampling.
B – NON-PROBABILITY SAMPLING TECHNIQUE
1. Convenience Sampling
As the name implies, convenience sampling is used because of the convenience it
offers to the researcher. For example, a researcher who wishes to investigate the most
popular noontime show may just interview the respondents through the telephone.
The result of this interview will be biased because the opinions of those without
telephones will not be included. Although convenience sampling may be used
occasionally, we cannot depend on it in making inferences about a population.
2. Quota Sampling
In this type of sampling, the proportions of the various subgroups in the population
are determined and the sample is drawn to have the same percentages in it. This is
very similar to stratified random sampling; the only difference is that the selection
of the members of the sample using quota sampling is not done randomly. To
illustrate this, let us suppose that we want to determine teenagers’ favorite
brand of T-shirt. If there are 1000 female and 1000 male teenagers in the population
and we want to draw 150 members for our sample, we can select 75 female and 75
male teenagers from the population without using randomization. This is quota
sampling.
3. Judgment or Purposive Sampling
Another non-probability method of drawing the members of the sample is
purposive sampling, in which the researcher chooses members judged to fit the
purpose of the study. Let us suppose that the target is to find out the
effectiveness of a certain kind of shampoo. Of course, bald persons will not be
included in the sample.
4. Incidental Sampling
This design is applied to those samples which are taken because they are the most
available. The investigator simply takes the nearest individuals as subjects of the
study until the sample reaches the desired size. In an interview, for instance, an interviewer
can simply choose to ask those people around him or in a coffee shop where he is
taking a break.
Ungrouped data are data that are not organized or, if arranged, are only ordered
from highest to lowest or lowest to highest.
Grouped data are data that are organized and arranged into different classes or
categories.
Arranging the scores from the lowest to highest will facilitate the enumeration of important
characteristics of the data. The test scores of the 50 students in Calculus arranged from
lowest to highest are shown below:
3 13 17 20 27 30 32 35 40 43
9 13 18 21 28 30 33 36 40 46
10 14 18 25 28 31 34 37 40 48
10 15 19 26 28 31 35 38 41 50
12 16 20 26 29 32 35 39 42 50
The highest score obtained is 50 and the lowest is 3. Ten students got a score of 40 and
above, while only 4 got 10 and below. Generally, the students performed well in the test, with
33 students or 66% getting a score of 25 and above.
B. Stem-and-leaf plot- this form of presentation sorts data according to a certain pattern. It involves
separating a number into two parts. In a two-digit number, the stem consists of the
first digit, and the leaf consists of the second digit. In a three-digit number,
the stem consists of the first two digits, and the leaf consists of the last digit. In a
one-digit number, the stem is zero.
Table 1.1
Stem-and-Leaf Plot of the Arranged Test Scores in Calculus of 50 Students
Stem Leaves
0 3,9
1 0,0,2,3,3,4,5,6,7,8,8,9
2 0,0,1,5,6,6,7,8,8,8,9
3 0,0,1,1,2,2,3,4,5,5,5,6,7,8,9
4 0,0,0,1,2,3,6,8
5 0,0
By looking at the stem-and-leaf plot, we can easily rank the data or put them in
order. Thus, the ten lowest scores are 3, 9, 10, 10, 12, 13, 13, 14, 15 and 16, while the
ten highest scores are 40, 40, 40, 41, 42, 43, 46, 48, 50 and 50.
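The construction of Table 1.1 can be sketched in Python. The 50 scores are the Calculus scores listed earlier; the function name `stem_and_leaf` is an illustrative choice, and the sketch assumes non-negative whole-number scores so that the last digit is the leaf and the remaining digits are the stem.

```python
from collections import defaultdict

scores = [
    3, 9, 10, 10, 12, 13, 13, 14, 15, 16, 17, 18, 18, 19, 20, 20, 21,
    25, 26, 26, 27, 28, 28, 28, 29, 30, 30, 31, 31, 32, 32, 33, 34,
    35, 35, 35, 36, 37, 38, 39, 40, 40, 40, 41, 42, 43, 46, 48, 50, 50,
]

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 43 -> stem 4, leaf 3; 3 -> stem 0, leaf 3
        plot[stem].append(leaf)
    return dict(plot)

for stem, leaves in stem_and_leaf(scores).items():
    print(stem, ",".join(map(str, leaves)))
```

Each printed row corresponds to one row of Table 1.1, and reading the rows from top to bottom gives the scores in ascending order.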
C. Tabular- this form of presentation is better than textual form because it provides
numerical facts in a more concise and systematic manner. Statistical tables
are constructed to facilitate the analysis of relationships. Each
class/subclass is assigned to a particular row or column and figures for
various classifications are noted in appropriate cells.
Advantages of Tabular Presentation
1. It is brief; it reduces the matter to the minimum.
2. It provides the reader a good grasp of the meaning of the quantitative
relationship indicated in the report.
3. It tells the whole story without the necessity of mixing textual matter with
figures.
4. The systematic arrangement of columns and rows makes them easily read and
readily understood.
5. The column and rows make comparison easier.
Parametric Tests are used when the data are in the interval and ratio scales. It is assumed that the data
are normally or nearly normally distributed.
1. t-test – used to determine the significant difference between the means of two groups with 30
or fewer cases.
2. z-test – used to determine the significant difference between the means of two groups or
conditions with more than 30 cases or observations.
3. F-test or ANOVA (Analysis of Variance) – used to determine the significant difference among
the means of three or more independent groups.
4. Pearson Product Moment Correlation (Pearson r) – used to determine if there is a correlation
between two variables that are linearly related and measured in the interval or ratio scale. If the
relationship is curvilinear, eta correlation is recommended.
5. Eta Correlation – used when the relationship between two sets of variables is not linear.
6. Scheffe’s Test/ A Posteriori t-test/ Tukey/ Post Hoc/ Duncan Multiple Range Test – used to
determine which pairs among three or more group means differ significantly when
the data are in the interval or ratio scale.
7. Point Biserial Coefficient of Correlation – used to find out whether there is a correlation
between interval (quantitative) data and nominal data split into two categories, where
this dichotomy is considered real and not arbitrary.
8. Linear Regression – simple linear regression analysis is used when there is a significant
relationship between the x and y variables. It is used in predicting the value of y given the value of
x.
9. Analysis of Covariance (ANCOVA) – used to control or reduce the effect on the dependent
variable of one or more uncontrolled variables, which are known as covariates.
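As an illustration of items 4 and 8, the sketch below computes Pearson r and a simple least-squares regression line using only the Python standard library. The data (hours studied and test scores for six students) are hypothetical, and the function names are illustrative choices rather than part of these notes.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def linear_regression(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x

# Hypothetical data: hours studied (x) and test score (y) for six students.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 64, 70, 73]

r = pearson_r(hours, score)
slope, intercept = linear_regression(hours, score)
print(f"r = {r:.3f}")
print(f"predicted score for 7 hours = {intercept + slope * 7:.1f}")
```

A value of r close to 1 indicates a strong positive linear relationship, which is the situation in which the fitted line is useful for predicting y from x. Testing whether r is statistically significant is a separate step not shown here.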
NON-PARAMETRIC TESTS – are used when the data are in nominal or ordinal scales.
1. Chi-Square tests – used to determine the difference or association between two or more sets of
data in the nominal or ordinal scale.
2. Spearman Rank Order Correlation Coefficient (Spearman rho) – used to measure the
relationship of paired ranks assigned to individual scores on two variables.
3. Gamma or Goodman and Kruskal’s Gamma (G) – an alternative to Spearman rho, used to
determine whether or not there is a correlation between two ordinal variables.
4. Mann-Whitney U Test – used to test the significant difference between independent random
samples from two groups with unequal numbers of cases in ordinal form.
5. H-Test or Kruskal-Wallis Test – used to test the significant difference among independent
random samples from three or more groups with unequal numbers of cases in ordinal form.
6. Phi Coefficient – used to measure the degree of association between two binary variables, that is,
two nominal dichotomous variables.
7. Friedman’s Two-Way Analysis of Variance by Ranks – used when the data come from related
samples, are measured in at least an ordinal scale, and have been taken from similar populations.
8. Kendall’s Coefficient of Concordance (W) – used to determine the relationship among
three or more sets of ranks.
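As an illustration of item 2, Spearman rho for paired ranks without ties can be computed from the rank differences d as 1 - 6Σd² / (n(n² - 1)). The sketch below applies this formula to hypothetical ranks given by two judges; the function name and the data are assumptions for illustration, and tied ranks would need a different treatment.

```python
def spearman_rho(ranks_x, ranks_y):
    """Spearman rho for paired ranks with no ties: 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(ranks_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical ranks given to five contestants by two judges.
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 3, 5, 4]
print(spearman_rho(judge_a, judge_b))
```

Identical rankings give a rho of 1, completely reversed rankings give -1, and the two judges here, who disagree only on neighboring positions, come out at 0.8.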
The parametric tests. To use the parametric tests, there are some conditions that should be met.
The data must be normally distributed and the level of measurement must be either interval or ratio.
The data are said to be normal when the value of skewness equals zero and the value of (moment-based) kurtosis is 3.
Interval data provide numbers that reflect differences among items. With interval scales the
measurement units are equal. Examples are scores of intelligence tests, and time as reckoned from the
calendar. They have no true zero value.
The ratio scale is the highest type of scale. The basic difference between the interval and the ratio
scales is that the interval scale has no true zero value while the ratio scale has an absolute zero value.
Common ratio scales are measures of length, width, weight, capacity, loudness and others.
The nonparametric tests. The nonparametric tests do not require normality of the distribution.
Skewness is one measure that tells whether the data are normal or not. If the value of
skewness is either positive or negative, the distribution is said to be not normal. Aside from skewness,
kurtosis also tells whether the data are normal or not. If the value of kurtosis is greater than
or less than 3, then the distribution is said to be not normal.
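Skewness and kurtosis can be computed directly from the data. The sketch below uses the moment-based definitions (skewness = m3 / m2^1.5, kurtosis = m4 / m2^2), which is an assumption since these notes do not say which formulas they use; under these definitions a normal distribution has skewness 0 and kurtosis 3. Both data sets are hypothetical.

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness (m3 / m2**1.5) and kurtosis (m4 / m2**2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# Hypothetical, perfectly symmetric scores: skewness comes out as 0.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
# Hypothetical scores with a long right tail: skewness is positive.
right_tailed = [1, 1, 1, 2, 2, 3, 5, 9]

for name, data in (("symmetric", symmetric), ("right-tailed", right_tailed)):
    skew, kurt = skewness_and_kurtosis(data)
    print(f"{name}: skewness = {skew:.3f}, kurtosis = {kurt:.3f}")
```

In practice, sample values are rarely exactly 0 and 3 even for data drawn from a normal distribution, so these measures are usually read as rough indicators rather than strict pass-or-fail rules.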
Under these tests, the levels of measurement are the nominal and ordinal data.
Nominal data are data such as male and female, yes or no responses, political affiliations like LP,
LDP, Lakas, religious groupings like Christian and non-Christian, and other organizations, occupation,
course and major, color, nationality, civil status, etc.
Ordinal data are data such as Strongly Agree, Agree, No Opinion, Disagree and Strongly
Disagree, academic performance of a student (excellent, very satisfactory, satisfactory, fairly satisfactory
and did not meet expectations) and also other data which employ rankings.
Types of parametric tests are the t-test, z-test, F-test, analysis of variance for the test of difference
and the r, Pearson Product Moment Coefficient of Correlation for the tests of relationship / association,
and the tests for prediction and forecasting are the Simple Linear Regression Analysis and the Multiple
Regression Analysis.