GEC4 Mathematics in The Modern World CHAPTER 4
Data Management
Learning Outcomes:
Introduction
Statistics is a branch of science that deals with the collection, presentation, analysis, and
interpretation of numerical data. It provides procedures for collecting, presenting,
analyzing, and interpreting data that are useful to business decision-makers (Sirug,
2015).
4.1 Preliminaries
AREAS OF STATISTICS
There are two main areas of statistics: descriptive and inferential statistics.
Descriptive statistics deals with collecting, presenting, and analyzing data. Its
sole purpose is to describe the characteristics of the population or sample under
investigation.
TYPES OF DATA
There are two basic types of data: qualitative and quantitative data.
GEC 4 Mathematics in the Modern World , First Semester,AY 2021-2022
Quantitative data (also termed numerical data) are data that involve quantities
obtained by counting or measurement. Their values differ in degree.
Examples are height, weight, number of employees, salary, etc.
Quantitative data can be further classified as either discrete or continuous. Discrete data are
data that can be counted, such as the number of students or the number of likes and shares of a
Facebook post. Continuous data are data obtained through measurement, such as weight, the
length of your hair, or the thickness of your eyeglasses.
Qualitative data (also termed as categorical data) is the data that involves qualities which
cannot be measured. Examples are sex, nationality, color of skin, religion, etc.
Levels of Measurement
Data can be classified according to the levels of measurement. This classification includes
nominal, ordinal, interval and ratio data.
1. Nominal level is the lowest level of data and is used purely for classification and
identification purposes only. Examples of this level are gender, house number, home
ownership, etc.
3. Interval level – specifies the precise difference between or among the values or
ranks, but has no true zero point.
4. Ratio level – has the same characteristics as the interval level, but it also has a true
zero point, so ratios between values are meaningful. Its values carry units of measure.
In data gathering, it is usually less expensive to consider only a segment of
the population, called a sample. Apart from economic reasons (saving money,
time, and effort), gathering data from a sample is easier and, at times, more practical. The following
are sampling techniques that may be used.
A. Probability Sampling - each unit in the population has a known probability of selection, and a
random number table or other randomization mechanism is used to choose the specific units
to be included in the sample
- relatively small sample can be used to make inferences about an arbitrarily
large population
1. Simple Random Sampling (SRS). This is the simplest form of probability sampling wherein
all the elements of the population have equal chances of being selected as
sample. This usually serves as the foundation of more complex sampling design.
a. Without replacement - every possible subset of n distinct units in the population has
the same probability of being selected as sample
- there are \(\binom{N}{n} = \dfrac{N!}{n!\,(N-n)!}\) possible samples
- the probability of selecting any individual sample S of n units is
\[ P(S) = \frac{1}{\binom{N}{n}} = \frac{n!\,(N-n)!}{N!} \]
- the probability that the ith unit appears in the sample is \(\pi_i = n/N\)
b. With replacement - the probability of each element to be chosen as sample is 1/N
- may include duplicates from the population
2. Systematic Sampling - a starting point is chosen from a list of population members using a
random number, and every kth member after it is then selected
3. Stratified Random Sampling. In this sampling method, the elements are divided into
subgroups called strata. Then a random sample of units is taken from each
stratum. Elements in the same stratum often tend to be more similar than
randomly selected elements from the whole population, so stratification often
increases precision.
4. Cluster Sampling. Here, observation units in the population are aggregated into larger
sampling units called clusters. Sampling is done on the clusters, and all
members of each selected cluster are included in the sample.
Illustration:
Suppose you want to estimate the average amount of time that professors at CSU
say they spent grading homework in a specific week.
To take an SRS, construct a list of all professors and randomly select n of them
to be your sample. Now ask each professor in your sample how much time he or she spent
grading homework that week—you would of course have to define the words homework
and grading carefully in your questionnaire.
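The probability sampling schemes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling frame of 200 professors, the two departments used as strata, and all function names are hypothetical and do not come from the module.

```python
import math
import random

def simple_random_sample(population, n, replace=False, seed=None):
    """Draw a simple random sample of n units."""
    rng = random.Random(seed)
    if replace:
        # With replacement: every draw picks any unit with probability 1/N,
        # so duplicates are possible.
        return [rng.choice(population) for _ in range(n)]
    # Without replacement: every subset of n distinct units is equally likely.
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Random start, then every k-th unit (assumes n <= len(population))."""
    rng = random.Random(seed)
    k = len(population) // n      # sampling interval
    start = rng.randrange(k)      # random starting point in the first interval
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw an SRS of n_per_stratum units from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, n_per_stratum)
            for name, units in strata.items()}

# Hypothetical frame: 200 professors in two made-up departments.
professors = [f"prof_{i:03d}" for i in range(200)]
strata = {"science": professors[:120], "arts": professors[120:]}

srs = simple_random_sample(professors, 10, seed=1)
sys_sample = systematic_sample(professors, 10, seed=1)
strat = stratified_sample(strata, 5, seed=1)

N, n = len(professors), 10
num_samples = math.comb(N, n)   # number of possible samples, N! / (n! (N-n)!)
inclusion_prob = n / N          # probability a given unit is in the SRS
print(num_samples, inclusion_prob)
```

Fixing the `seed` makes a draw reproducible, which is convenient when a sample has to be documented or audited.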
B. Non-Probability Sampling - not every unit in the population has a chance of being selected for the
sample
1. Convenience Sampling - Sampling done based on the convenience of the researcher.
Frequency Distribution. This organizes raw data in table form, using classes and frequencies
or counts. Each raw data value is placed into a quantitative or qualitative category called
a class. The frequency of a class is the number of data values contained in that specific
class.
1. Categorical Frequency Distributions - used for data that can be placed in specific categories,
such as nominal or ordinal level data.
Example 1:
A B B AB O
O O B AB B
B B O A O
A O O O AB
AB A O B A
2. Ungrouped Frequency Distribution - used for data whose range of values is relatively small.
The single data values are used as classes.
Example 2:
12 17 12 14 16 18
16 18 12 16 17 15
15 16 12 15 16 16
12 14 15 12 15 15
19 13 16 18 16 14
3. Grouped Frequency Distributions - used for data that has a very large range. Data are
grouped into classes that are more than 1 unit in width.
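As a rough sketch, the categorical and ungrouped frequency distributions for the data in Examples 1 and 2 can be tallied with Python's `collections.Counter`. The variable names are illustrative only.

```python
from collections import Counter

# Example 1: categorical data, typed row by row from the listing above
blood_types = (
    "A B B AB O "
    "O O B AB B "
    "B B O A O "
    "A O O O AB "
    "AB A O B A"
).split()

# Example 2: numeric data with a small range, so each value is its own class
scores = [
    12, 17, 12, 14, 16, 18,
    16, 18, 12, 16, 17, 15,
    15, 16, 12, 15, 16, 16,
    12, 14, 15, 12, 15, 15,
    19, 13, 16, 18, 16, 14,
]

blood_freq = Counter(blood_types)
score_freq = Counter(scores)

print("Class  Frequency")
for cls in ("A", "B", "O", "AB"):
    print(f"{cls:<6} {blood_freq[cls]}")

print("Value  Frequency")
for value in sorted(score_freq):
    print(f"{value:<6} {score_freq[value]}")
```

The frequencies in each table should add up to the number of observations (25 and 30 here), which is a quick check against tallying mistakes.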
Tabular Form - effective devices of presenting both qualitative and quantitative data.
- make comparisons and draw relationships between and among variables
Since the class boundaries are used in the graph, the bars in a histogram are
contiguous, unlike those in a bar or column chart.
The graph below shows the frequency polygon corresponding to the distribution above.
Class boundary       Cumulative frequency
less than 99.5         0
less than 104.5        2
less than 109.5       10
less than 114.5       28
less than 119.5       41
less than 124.5       48
less than 129.5       49
less than 134.5       50
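A cumulative frequency column is a running total of the class frequencies. The sketch below assumes class frequencies of 2, 8, 18, 13, 7, 1 and 1; these are inferred from the successive differences of the cumulative values in the table, since the original frequency table is not shown here.

```python
from itertools import accumulate

# Upper class boundaries taken from the cumulative frequency table
upper_boundaries = [104.5, 109.5, 114.5, 119.5, 124.5, 129.5, 134.5]

# Assumed class frequencies (differences of consecutive cumulative values)
class_frequencies = [2, 8, 18, 13, 7, 1, 1]

# Running total of the frequencies
cumulative = list(accumulate(class_frequencies))

for boundary, cf in zip(upper_boundaries, cumulative):
    print(f"less than {boundary:<6} {cf}")
```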
Exercise 4.1
The measures of central tendency are used to determine the cluster of the data about the
center. The most common measures of central tendency are the mean, median and mode.
4.2.1 MEAN
The mean or arithmetic mean (\(\bar{X}\)) is the average of all the values in the data set. It is
obtained by dividing the sum of all the observations by the total number of observations:
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \]
where: \(X_i\) is an individual observation
       \(n\) is the total number of observations
Find the mean height of the 12 basketball players whose heights (in cm) are 150,
160, 163, 159, 174, 178, 165, 156, 187, 176, 175, 180.
Solution: Let X be the height of the players and n for the total number of players
\[ \bar{X} = \frac{150 + 160 + 163 + 159 + 174 + 178 + 165 + 156 + 187 + 176 + 175 + 180}{12} = \frac{2{,}023}{12} = 168.6 \]
Note:
If the observations are whole numbers, round the final answer to one decimal place; if
the raw data have one decimal place, round the final answer to two decimal places, and so on.
What is the mean of the set of values 6.7, 4.6, 5.5, 3.4, 8.2, and 5.8?
Solution:
\[ \bar{X} = \frac{6.7 + 4.6 + 5.5 + 3.4 + 8.2 + 5.8}{6} = \frac{34.2}{6} = 5.70 \]
Twelve students were given an arithmetic test, and the times (in minutes) they took to
complete it were as follows:
10, 9, 12, 11, 8, 15, 9, 7, 8, 6, 12, 10
Find the mean completion time.
Solution:
\[ \bar{X} = \frac{117}{12} = 9.75 \approx 9.8 \]
Therefore, the average time to complete the arithmetic test is 9.8 minutes.
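The mean computations above can be checked with a short Python sketch. The helper name `mean` is illustrative; it simply applies the formula (sum of the observations divided by their number).

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)

# Heights (in cm) of the 12 basketball players
heights = [150, 160, 163, 159, 174, 178, 165, 156, 187, 176, 175, 180]

# Times (in minutes) of the 12 students on the arithmetic test
times = [10, 9, 12, 11, 8, 15, 9, 7, 8, 6, 12, 10]

mean_height = mean(heights)   # 2023 / 12
mean_time = mean(times)       # 117 / 12

# Whole-number data, so report one decimal place
print(f"{mean_height:.1f} cm")
print(f"{mean_time:.2f} minutes before rounding to one decimal place")
```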
When individual values do not have equal importance, a weighted mean is appropriate.
The weighted arithmetic mean is computed as:
\[ \bar{X}_w = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i} = \frac{W_1 X_1 + W_2 X_2 + \cdots + W_n X_n}{W_1 + W_2 + \cdots + W_n} \]
where \(W_i\) is the weight attached to the value \(X_i\).
Suppose Mark wants to determine his General Weighted Average of the subjects
for the last semester he was enrolled as follows:
Solution:
\[ \bar{X}_w = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i} = \frac{W_1 X_1 + W_2 X_2 + \cdots + W_n X_n}{W_1 + W_2 + \cdots + W_n} \]
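A weighted mean such as a General Weighted Average can be sketched as follows. The grades and unit loads below are hypothetical values chosen only to illustrate the formula; they are not Mark's actual grades, which are not shown here.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(W_i * X_i) / sum(W_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Hypothetical data: grade earned (X) and number of units (W) per subject
grades = [1.50, 2.00, 1.75, 1.25]
units = [3, 3, 5, 2]

gwa = weighted_mean(grades, units)
print(f"General Weighted Average: {gwa:.2f}")
```

Subjects with more units pull the average more strongly, which is why the result generally differs from the plain mean of the grades.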
4.2.3 MEDIAN
The median is the middlemost value in the data set. It divides the distribution into two
equal parts.
If the number of observations is even, the median is the average of the two middle values;
if the number of observations is odd, the median is the middlemost value in the data set.
The position (rank) of the median in the ordered data is
\[ \text{Median (rank value)} = \frac{n+1}{2} \]
Example 1:
Find the median height of the 12 basketball players whose heights (in cm) were as
follows:
150, 160, 163, 159, 174, 178, 165, 156, 187, 176, 175, 180.
Solution:
The first step is to arrange the data in an increasing order (from lowest to highest).
Thus, 150, 156, 159, 160, 163, 165, 174, 175, 176, 178, 180, 187
Since the number of observations is even, the median is the average of the two middle
values. Using the formula for the median (rank value),
\[ \text{Median (rank value)} = \frac{12+1}{2} = 6.5 \]
Since the rank falls at 6.5, we take the 6th and 7th values, which are 165 and 174.
Therefore,
\[ \text{Median} = \frac{165 + 174}{2} = 169.5 \]
Example 2:
The daily rates of a sample of 9 employees at GMS Inc. are ₱550, ₱420, ₱650,
₱500, ₱700, ₱480, ₱520, ₱860, and ₱670. Find the median rate.
Solution:
The first step is to arrange the data set in an increasing order. Thus,
₱420, ₱480, ₱500, ₱520, ₱550, ₱650, ₱670, ₱700, ₱860
Since the number of observations is odd, the median is the middlemost value. Therefore,
the median is the 5th value, which is ₱550.
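The two median examples can be reproduced with a small Python sketch that follows the rank rule above. The function name `median` is illustrative.

```python
def median(values):
    """Median via the rank (n + 1) / 2 of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the rank (n + 1) / 2 is a whole number; take that value.
        return ordered[(n + 1) // 2 - 1]
    # Even n: the rank falls halfway between two positions; average them.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

heights = [150, 160, 163, 159, 174, 178, 165, 156, 187, 176, 175, 180]
daily_rates = [550, 420, 650, 500, 700, 480, 520, 860, 670]

median_height = median(heights)     # average of the 6th and 7th values
median_rate = median(daily_rates)   # the 5th value

print(median_height, median_rate)
```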
4.2.4 MODE
The mode is the most frequent observation. It is the observation that occurs most often
in the data set.
If only one observation has the highest frequency, the data set is said to
be unimodal. If two observations share the highest frequency, it is bimodal. If three observations
share the highest frequency, it is trimodal. If no value is repeated in the
data set, no mode exists.
Example 1:
The following are the scores of the students in Mathematics quiz. Determine the
mode of the data set.
40, 27, 20, 40, 26, 24, 25, 29, 30, 31, 27, 33, 39, 36, 22, 36, 28, 27, 27,
26, 20, 21, 30, and 19.
Solution:
The most frequent number that appears in the data set is 27. Since there is only one
observation having the highest frequency, then it is unimodal.
Example 2:
Determine the mode of the grades of 19 engineering students in Mathematics
subject as follows:
2.2, 1.7, 2.1, 2.0, 1.9, 2.3, 2.0, 2.4, 1.9, 2.1, 2.2, 2.4, 2.0, 1.9, 2.1,
2.1, 2.1, 2.0, 2.0
Solution:
Since the grade of 2.0 and 2.1 appeared the most and with the same number of times,
then they are considered as the mode. The type of mode is bimodal.
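The mode examples can be checked with a short Python sketch. The helper `modes` is an illustrative name; it returns every value tied for the highest frequency and an empty list when no value repeats, mirroring the "no mode exists" convention above.

```python
from collections import Counter

def modes(values):
    """Return all values sharing the highest frequency (empty if none repeat)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []   # no value is repeated, so no mode exists
    return sorted(v for v, c in counts.items() if c == highest)

quiz_scores = [40, 27, 20, 40, 26, 24, 25, 29, 30, 31, 27, 33,
               39, 36, 22, 36, 28, 27, 27, 26, 20, 21, 30, 19]
grades = [2.2, 1.7, 2.1, 2.0, 1.9, 2.3, 2.0, 2.4, 1.9, 2.1,
          2.2, 2.4, 2.0, 1.9, 2.1, 2.1, 2.1, 2.0, 2.0]

score_modes = modes(quiz_scores)   # one mode: unimodal
grade_modes = modes(grades)        # two modes: bimodal

print(score_modes, grade_modes)
```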
The measures of dispersion or variability tell about the spread of the data or how the
individual values are dispersed from the mean. The common measures of dispersion are the
range, variance and standard deviation.
The range is the simplest and easiest to compute measure of dispersion. It is obtained by
subtracting the lowest value from the highest value in the data set.
The variance is defined as the average of the squared deviations from the mean. The
square root of the variance is known as the standard deviation. The variance of sample data
is denoted by s², while the population variance is denoted by σ².
To determine the variance of ungrouped data, let us follow the steps below:
1. Arrange the values in order (i.e. increasing or decreasing) vertically.
2. Calculate the mean of the data set.
3. Subtract the mean from the individual values. Place this on another column.
4. Add another column for the square of the difference of individual values and the mean.
5. Get the sum of the squared deviations.
6. Divide the sum in step 5 by n − 1 for sample data or by N for population data. The result is the variance.
Example 1:
Determine the range, variance and standard deviation of the following data on a sample
of weights of pre-school children: 25.2, 19.5, 20.4, 18.2, 16.0, 17.8, 17.6
Solution:
a. Range = highest value – lowest value
= 25.2 – 16.0
= 9.2
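b. Variance
   X        X − X̄      (X − X̄)²
  16.0      −3.24       10.50
  17.6      −1.64        2.69
  17.8      −1.44        2.07
  18.2      −1.04        1.08
  19.5       0.26        0.07
  20.4       1.16        1.35
  25.2       5.96       35.52
  X̄ = 19.24          Σ(X − X̄)² = 53.28
Dividing the sum of the squared deviations by n − 1 for the seven tabulated values gives
\[ s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{53.28}{7 - 1} = 8.88 \]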
c. Since the standard deviation is the square root of the variance,
\[ s = \sqrt{8.88} = 2.98 \]
This standard deviation indicates that the observations typically lie about
2.98 units above or below the mean.
Example 2:
The marks of 10 students of a class is given to be 0, 4, 9, 12, 25, 2, 21, 7, 11 and
12. What is the variance of the data set?
Solution:
Step 1: Organize the marks of the students in a table.
Marks (x)    x − x̄      (x − x̄)²
   25         14.7      216.09
   21         10.7      114.49
   12          1.7        2.89
   12          1.7        2.89
   11          0.7        0.49
    9         −1.3        1.69
    7         −3.3       10.89
    4         −6.3       39.69
    2         −8.3       68.89
    0        −10.3      106.09
x̄ = 10.3            Σ(x − x̄)² = 564.10
Step 2: Divide the sum of the squared deviations by n − 1 to obtain the sample variance.
\[ s^2 = \frac{564.10}{10 - 1} = 62.68 \]
Step 3: Get the square root of the variance to obtain the standard deviation.
\[ s = \sqrt{62.68} = 7.92 \]
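The steps for the variance and standard deviation can be sketched in Python as follows. The function names are illustrative, and the weights list uses the seven values tabulated in the solution of Example 1.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

def sample_std(values):
    """Standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

# Example 1: the seven weights tabulated in the solution
weights = [16.0, 17.6, 17.8, 18.2, 19.5, 20.4, 25.2]
# Example 2: marks of 10 students
marks = [0, 4, 9, 12, 25, 2, 21, 7, 11, 12]

weight_range = max(weights) - min(weights)
var_w, sd_w = sample_variance(weights), sample_std(weights)
var_m, sd_m = sample_variance(marks), sample_std(marks)

print(f"range={weight_range:.1f}  s^2={var_w:.2f}  s={sd_w:.2f}")
print(f"s^2={var_m:.2f}  s={sd_m:.2f}")
```

For population data, the divisor `n - 1` would be replaced by `N`, as stated in step 6.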
Exercise 4.3
1. Determine the range, variance and standard deviation of each of the following data sets.
a. 7, 8, 4, 3, 2, 3, 6, 5 and 7
b. 2, 8, 11, 17, 12, 6 and 4
2. The results of the college entrance examination of 10 students in a certain university were as
follows:
2.5, 3.4, 5.6, 3.8, 4.2, 2.8, 3.0, 3.0, 3.4, 4.2
Compute the variance and standard deviation.
3. A newspaper company reported that a sample of its weekly sales (in hundred thousand
pesos) is: 345, 452, 254, 137, 483, 515 and 218. Calculate and interpret the variance and
standard deviation.
____________________________________________________________________________
Measures of location describe where a value stands relative to the rest of the data, so it is
useful to know how to interpret them. Quartiles, percentiles and standard scores are the
most commonly used measures of location.
Quartiles divide the distribution into four equal parts (segments of 25% each). Three
quartiles are defined: Q1, Q2 and Q3.
Percentiles divide the distribution into 100 equal parts, denoted by Pk, where k is the
percentile rank. For example, P50 means the 50th percentile, P75 means the 75th percentile, and so on.
P50 and Q2 are both equal to the median of the distribution. Likewise, P25 is equal to Q1.
\[ Q_k = \frac{nk}{4}, \quad \text{where } k = 1, 2, 3 \]
\[ P_k = \frac{nk}{100}, \quad \text{where } k \text{ is an integer from 1 to 99} \]
Here Qk and Pk give the position (rank) of the quartile or percentile in the ordered data set.
If this position is an integer, the kth quartile/percentile is the average of the value at that
position and the value that follows it. If the position is not an integer, round it up and take
the value at the rounded position.
Percentile rank refers to the percentile ranking of a certain value. This can be obtained by
following the equation below:
\[ \text{Percentile rank} = \frac{(\text{number of values below } x) + 0.5}{n} \cdot 100\% \]
Example 1:
What is the third quartile (Q3) of the following data set?
20, 40, 50, 65, 70, 75, 80, 100
Solution:
If the data set is not arranged in numerical order, you must first arrange it in either
increasing or decreasing order.
Since the given data set is already arranged, we can compute Q3 directly.
Q3 = nk/4 = (8)(3)/4 = 24/4 = 6
Since the value of Q3 is an integer, we must get the average of the 6th and 7th values
in the data set. That is, Q3 = (75 + 80)/2 = 155/2 = 77.5
Example 2:
For the data set below, which value is in the 75th percentile?
1, 3, 3, 4, 6, 7, 7, 7, 8, 9, 9, 10, 12, 15, 16, 17
Solution:
Since we want to find P75 and there are 16 values in the data set, we compute
P75 = nk/100 = (16)(75)/100 = 1200/100 = 12
Since the obtained value is again an integer, we must get the average of the 12th and
13th values in the data set. That is, P75 = (10 + 12)/2 = 22/2 = 11
Therefore, the 75th percentile is 11. This implies that 75% of the values in the data set are less
than 11 and only 25% are greater than 11.
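The quartile and percentile rule used in these two examples can be sketched in Python. The function names are illustrative, and the code follows the convention of this module (average the value at an integer position with the next one, otherwise round the position up). Library routines such as `statistics.quantiles` use different interpolation conventions and may give slightly different answers.

```python
def percentile(values, k):
    """k-th percentile (k = 1..99) using the position n*k/100."""
    ordered = sorted(values)
    n = len(ordered)
    if (n * k) % 100 == 0:
        # Integer position: average that value and the one after it.
        pos = n * k // 100
        return (ordered[pos - 1] + ordered[pos]) / 2
    # Non-integer position: round up and take that value.
    pos = -((-n * k) // 100)   # ceiling of n*k/100 in integer arithmetic
    return ordered[pos - 1]

def quartile(values, k):
    """k-th quartile (k = 1, 2, 3); position n*k/4 equals n*(25k)/100."""
    return percentile(values, 25 * k)

data_1 = [20, 40, 50, 65, 70, 75, 80, 100]
data_2 = [1, 3, 3, 4, 6, 7, 7, 7, 8, 9, 9, 10, 12, 15, 16, 17]

q3 = quartile(data_1, 3)       # position 6 -> average of 6th and 7th values
p75 = percentile(data_2, 75)   # position 12 -> average of 12th and 13th values

print(q3, p75)
```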
Solution:
Using the equation for the percentile rank and substituting the given information,
\[ \text{Percentile rank} = \frac{4 + 0.5}{16} \cdot 100\% = 28.125\% \approx 28\% \]
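The substitution above can be reproduced with a small Python sketch. It assumes, as an illustration only, that the value in question is 6 in the 16-value data set of Example 2, since exactly four values in that set lie below 6; the original problem statement is not shown here, so this pairing is an assumption.

```python
def percentile_rank(values, x):
    """(number of values below x + 0.5) / n * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100

# Assumed data set (taken from Example 2) and assumed value of interest
data = [1, 3, 3, 4, 6, 7, 7, 7, 8, 9, 9, 10, 12, 15, 16, 17]
rank_of_6 = percentile_rank(data, 6)

print(f"{rank_of_6:.0f}th percentile")
```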
========================================================================
Exercise 4.4
2. Listed are 29 ages for Academy Award-winning best actors in order from smallest to largest:
18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47, 52, 55,
57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77
a. Find the 70th percentile.
b. Find the 83rd percentile.
3. At a high school, it was found that the 30th percentile of the number of hours that students spend
studying per week is seven hours. Interpret the 30th percentile in the context of this situation.
______________________________________________________________________
The standard score (z-score) is obtained by dividing the difference between a value and the
mean by the standard deviation. In symbols,
\[ Z = \frac{X - \bar{x}}{s} \]
Note: A positive (+) z-score means that the observed value is above the mean.
A negative (−) z-score means that the observed value is below the mean.
A zero (0) z-score means that the observed value is equal to the mean.
Example 1:
In a given distribution, the mean is 85 and the standard deviation is 10. Find the
corresponding standard score of the ff. values:
a. 95 b. 87 c. 68 d. 55
Solution:
a. The standard score of 95 is:
\[ z = \frac{95 - 85}{10} = 1.0 \]
Since the standard score is positive, this implies that the score of 95 is 1 standard
deviation above the mean.
b. The standard score of 87 is:
\[ z = \frac{87 - 85}{10} = 0.2 \]
This implies that the score of 87 is 0.2 standard deviations above the mean.
c. The standard score of 68 is:
\[ z = \frac{68 - 85}{10} = -1.7 \]
Since the standard score is negative, this implies that the score of 68 is 1.7 standard
deviations below the mean.
d. The standard score of 55 is:
\[ z = \frac{55 - 85}{10} = -3.0 \]
This implies that the score of 55 is 3 standard deviations below the mean.
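The four standard scores in this example can be checked with a minimal Python sketch. The function name `z_score` is illustrative.

```python
def z_score(x, mean, sd):
    """Standard score: (value - mean) / standard deviation."""
    return (x - mean) / sd

mean, sd = 85, 10
z_values = {x: z_score(x, mean, sd) for x in (95, 87, 68, 55)}

for x, z in z_values.items():
    if z > 0:
        side = "above"
    elif z < 0:
        side = "below"
    else:
        side = "equal to"
    print(f"score {x}: z = {z:+.1f} ({side} the mean)")
```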
The graphical presentation of a normal distribution is the normal curve. A normal curve is
symmetrical, with its highest point at the center, so the left half of the curve mirrors the
right half. The total area under the normal curve represents the entire population of a
particular distribution.
Two parameters are used to describe the normal curve: the mean and the standard
deviation. Values below the mean (negative z-scores) lie on the left side of the curve, while
values above the mean (positive z-scores) lie on the right side.
Area under the normal curve represents probability. Thus, the larger the area, the
greater the probability.
[Figure: the normal curve. Source: Kanbanize]
Example: 1.
Solution:
a. The area for z = 1.99 is 0.4767 or 47.67%. Since we want the area to the left
of z = 1.99, we must add the 50% that lies to the left of the mean. Therefore, the
total area is 97.67%.
b. The area for z = 2.04 is 0.4793 or 47.93%. By symmetry, this is also the area between
z = −2.04 and the mean. Since we are looking for the area to the right of z = −2.04, we must
add the 50% that lies to the right of the mean. Therefore, the total area to the right of
−2.04 is 97.93%.
c. The area for z = 2.00 is 0.4772 or 47.72%, and the area for z = 2.47 is
0.4932 or 49.32%. Since we are looking for the area between these two z-scores, which lie on
the same side of the mean, we must subtract the values. Therefore, 49.32% − 47.72% = 1.60%.
d. The area for z = 1.02 is 0.3461 or 34.61%, and the area for z = 2.35 is
0.4906 or 49.06%. Since we are looking for the area between these two z-scores and they
lie on opposite sides of the mean, we must add the areas to get the total area.
Therefore, 34.61% + 49.06% = 83.67%.
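The table lookups above can be cross-checked with Python's `statistics.NormalDist`, whose `cdf` method returns the area to the left of a z-score. This is a sketch under one assumption: in part (d) the z-score 1.02 is taken to be the negative one (z = −1.02), since the text does not say which of the two lies to the left of the mean.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

# a. area to the left of z = 1.99
area_a = std_normal.cdf(1.99)
# b. area to the right of z = -2.04
area_b = 1 - std_normal.cdf(-2.04)
# c. area between z = 2.00 and z = 2.47 (same side of the mean)
area_c = std_normal.cdf(2.47) - std_normal.cdf(2.00)
# d. area between z = -1.02 and z = 2.35 (opposite sides; sign is an assumption)
area_d = std_normal.cdf(2.35) - std_normal.cdf(-1.02)

for label, area in zip("abcd", (area_a, area_b, area_c, area_d)):
    print(f"{label}. {area:.4f}")
```

Because `cdf` already measures area from the far left tail, there is no separate step of adding or subtracting 50% as there is with a table of areas measured from the mean.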
Several problems in different fields can be solved with the application of the normal curve.
The only requirement is that the variable be normally or approximately normally distributed.
To solve problems by using standard normal distribution, transform the original variable to
a standard normal distribution variable using the standard score or z-score.
Example 1:
A survey found that women spend on average ₱146.21 on beauty products during
the summer months. Assume that the standard deviation is ₱29.44 and the variable is
normally distributed. Find the percentage of women who spend less than ₱160.00.
Solution:
Step 1: Draw the normal curve and represent the area.
[Figure: normal curve marking the mean, 146.21, and the value 160.00]
Step 2: Find the z-score corresponding to ₱160.00.
\[ z = \frac{160.00 - 146.21}{29.44} = 0.47 \]
Step 3: Find the area using the Table for Areas Under the Normal Curve. Look for z =
0.47. From the table, the area for z = 0.47 is 0.1808. Since the question asks for
the percentage of women who spend less than ₱160.00, we need to add the area
below the mean, which is 50%:
0.1808 + 0.5000 = 0.6808
Therefore, about 68.08% of the women spend less than ₱160.00.
Example 2:
To qualify for a police academy, candidates must score in the top 10% on a general
abilities test. The test has a mean of 200 and a standard deviation of 20. Find the lowest
possible score to qualify. Assume that the test scores are normally distributed.
Solution:
Since the test scores are normally distributed, we want the test value x that cuts off the
upper 10% of the area under the normal distribution. This upper 10% region represents
the candidates who qualify.
[Figure: normal curve marking the mean and z = 1.28]
Step 1: Subtract 0.1000 from 1.000 to get the area under the normal distribution to the left of x:
1.000 – 0.1000 = 0.9000
Step 2: Find the z value that corresponds to an area of 0.9000. If the specific value cannot be
found, use the closest value. In this case, 0.8997. The corresponding value is 1.28.
Step 3: Convert the z value to a test score by solving the z-score formula for x:
x = x̄ + z·s = 200 + (1.28)(20) = 225.6 ≈ 226
Therefore, a score of 226 should be used as a cut off. Anybody scoring 226 and above qualifies.
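Both application examples can be sketched with `statistics.NormalDist`, which accepts the mean and standard deviation directly so that no manual conversion to z is needed. The results differ slightly from the table-based answers because the table rounds z to two decimal places.

```python
from statistics import NormalDist

# Example 1: spending on beauty products, mean 146.21 and sd 29.44
spending = NormalDist(mu=146.21, sigma=29.44)
share_below_160 = spending.cdf(160.00)   # proportion spending less than 160

# Example 2: test scores, mean 200 and sd 20; find the score with 90% below it
test_scores = NormalDist(mu=200, sigma=20)
cutoff = test_scores.inv_cdf(0.90)       # lowest score in the top 10%

print(f"{share_below_160:.2%} spend less than 160.00")
print(f"cut-off score is about {cutoff:.1f}")
```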
========================================================================
Exercise 4.5
When conducting research studies, researchers wish to determine whether two variables
are related. If these variables are found to be related, they may then find an equation that can be
used to model the relationship. A correlation is a relationship between two variables. The data
can be represented by the ordered pairs (𝑥, 𝑦) where x is the independent (or explanatory)
variable, and y is the dependent (or response) variable.
The correlation coefficient is a measure of the strength and the direction of a linear
relationship between two variables. The symbol r represents the sample correlation coefficient.
The formula for r is
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} \]
where:
r = Pearson r correlation coefficient
n = number of observations
\(\sum xy\) = sum of the products of paired scores
\(\sum x\) = sum of x scores
\(\sum y\) = sum of y scores
\(\sum x^2\) = sum of squared x scores
\(\sum y^2\) = sum of squared y scores
A value of +1 indicates a perfect positive correlation: as one
variable increases, the other variable also increases. A value of −1 indicates a perfect
negative correlation: as one variable increases, the other variable decreases. A
value of 0 indicates that there is no linear correlation between the variables (Tolentino, et al., 2018). The
complete list of values is presented below to further interpret the computed value of r.
0.0 - no correlation
±1.00 - perfect correlation
±0.01 − ±0.25 - very low correlation
±0.26 − ±0.50 - moderately low correlation
±0.51 − ±0.75 - high correlation
±0.76 − ±0.99 - very high correlation
Determine the correlation between the age and the weight of 10 preschool children at barangay
Marilima as shown in the table below:
Solution:
To determine the relationship between age and weight of 10 children, Pearson product
moment correlation must be used.
Step 1: To obtain the values, we may construct another table adding columns for specific values
necessary in the computation of the Pearson r.
The result of 0.93 indicates a very high positive correlation between the age of the
pre-school children and their weight. This implies that as the age of the children increases,
their weight also tends to increase.
Example 2:
For the following data set, find the Pearson r and the r2.
x 12 2 5 9 11 10 4 1
y 10 3 7 5 9 8 6 3
Solution:
The following steps are helpful for the computation of the correlation coefficient.
  X      Y      XY      X²      Y²
 12     10     120     144     100
  2      3       6       4       9
  5      7      35      25      49
  9      5      45      81      25
 11      9      99     121      81
 10      8      80     100      64
  4      6      24      16      36
  1      3       3       1       9
ΣX = 54   ΣY = 51   ΣXY = 412   ΣX² = 492   ΣY² = 373
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} \]
\[ = \frac{8(412) - (54)(51)}{\sqrt{(8)(492) - 54^2}\,\sqrt{(8)(373) - 51^2}} \]
\[ = \frac{3296 - 2754}{\sqrt{3936 - 2916}\,\sqrt{2984 - 2601}} \]
\[ = \frac{542}{\sqrt{390660}} = \frac{542}{625.03} = 0.87 \]
\[ r^2 = \frac{542^2}{390660} = \frac{293764}{390660} = 0.75 \]
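The Pearson r formula can be applied to the data of this example with a short Python sketch. The function name `pearson_r` is illustrative; it evaluates the computational formula term by term. Evaluating it gives r of about 0.87 and r² of about 0.75 for this data set.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient via the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

x = [12, 2, 5, 9, 11, 10, 4, 1]
y = [10, 3, 7, 5, 9, 8, 6, 3]

r = pearson_r(x, y)
r_squared = r ** 2

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")
```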
Simple linear regression is a statistical method that allows us to summarize and study
relationships between continuous (quantitative) variables. One variable, denoted by x represents
the predictor variable or the independent variable. The other variable, denoted by y represents
the response or the dependent variable.
Simple linear regression is appropriate when the following conditions are satisfied.
The dependent variable Y has a linear relationship to the independent variable X. To check
this, make sure that the XY scatterplot is linear and that the residual plot shows a random
pattern.
For each value of X, the probability distribution of Y has the same standard deviation σ.
When this condition is satisfied, the variability of the residuals will be relatively constant
across all values of X, which is easily checked in a residual plot.
The least-squares regression equation can be formed from a set of sample data using the
formula:
\[ \hat{y} = a + bx \]
The constants a and b in the regression equation are called the regression coefficients. The
values of a and b can be found using the following equations:
\[ a = \bar{y} - b\bar{x} \quad \text{and} \quad b = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2} \]
The regression equation can be used to predict the value of one variable when the value
of the other variable is known.
Example 1:
The values of x and their corresponding values of y are shown in the table below. Find the least-squares regression equation.
X 0 1 2 3 4
Y 2 3 5 4 6
Solution:
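Step 1: Organize the values of x and y in a table. Include the necessary values such as x², y², xy
and the means of x and y.
  x      y      xy      x²      y²
  0      2       0       0       4
  1      3       3       1       9
  2      5      10       4      25
  3      4      12       9      16
  4      6      24      16      36
Σx = 10   Σy = 20   Σxy = 49   Σx² = 30   Σy² = 90
x̄ = 2.0   ȳ = 4.0
Step 2: Compute the slope b.
\[ b = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2} = \frac{49 - 5(2.0)(4.0)}{30 - \frac{1}{5}(10)^2} = \frac{9}{10} = 0.9 \]
Step 3: Compute the intercept a.
\[ a = \bar{y} - b\bar{x} = 4.0 - (0.9)(2.0) = 2.2 \]
Therefore, the regression equation is \(\hat{y} = 2.2 + 0.9x\).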
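The regression coefficients for this example can be computed with a small Python sketch that applies the two formulas for b and a. The function name `least_squares` and the prediction at x = 5 are illustrative additions, not part of the original example.

```python
def least_squares(xs, ys):
    """Return (a, b) for the fitted line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: (sum xy - n * x_bar * y_bar) / (sum x^2 - (sum x)^2 / n)
    b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - sum(xs) ** 2 / n)
    # Intercept: y_bar - b * x_bar
    a = y_bar - b * x_bar
    return a, b

x = [0, 1, 2, 3, 4]
y = [2, 3, 5, 4, 6]

a, b = least_squares(x, y)
print(f"y_hat = {a:.1f} + {b:.1f}x")

# Illustrative use of the fitted line: predicted y at a hypothetical x = 5
prediction = a + b * 5
print(f"predicted y at x = 5: {prediction:.1f}")
```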
Example 2:
The table below shows the height, 𝑥, in inches and the pulse rate, 𝑦, per minute,
for 9 people. Find the correlation coefficient and interpret your result.
x 68 72 65 70 62 75 78 64 68
y 90 85 88 100 105 98 70 65 72
Solution:
Step 1:
Height (x)   Pulse rate (y)      xy        x²        y²
    68             90           6120      4624      8100
    72             85           6120      5184      7225
    65             88           5720      4225      7744
    70            100           7000      4900     10000
    62            105           6510      3844     11025
    75             98           7350      5625      9604
    78             70           5460      6084      4900
    64             65           4160      4096      4225
    68             72           4896      4624      5184
Σx = 622   Σy = 773   Σxy = 53336   Σx² = 43206   Σy² = 68007
x̄ = 69.1   ȳ = 85.9
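Starting from the column totals in the table, the correlation coefficient follows by direct substitution into the Pearson r formula. The sketch below is one way to carry out that substitution; the variable names are illustrative.

```python
import math

# Column totals taken from the table above
n = 9
sum_x, sum_y = 622, 773
sum_xy, sum_x2, sum_y2 = 53336, 43206, 68007

numerator = n * sum_xy - sum_x * sum_y
denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
               * math.sqrt(n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(f"numerator = {numerator}")
print(f"r = {r:.2f}")
```

The result is roughly −0.15, which falls in the ±0.01 to ±0.25 band of the interpretation scale given earlier, that is, a very low (negative) correlation between height and pulse rate.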
Exercise 4.6
1. A researcher carefully computes the correlation coefficient between two variables and
gets r = 1.23. What does this value mean?
a. Sketch a scatterplot.
b. Compute the correlation coefficient, r.
c. Compute the coefficients of the linear regression line, y = b₁x + b₀.
d. What is the estimated value for X = 7?
Reflective Journal
Write your reflections on learning the topics. Describe your strengths and
weaknesses in learning the concepts.
________________________________________________________________
________________________________________________________________
________________________________________________________________
________________________________________________________________
________________________________________________________________
________________________________________________________________
________________________________________________________________
________________________________________________________________
________________________________________________________________
________________________________________________________________
________________________________________________________________