
Principles of Statistics

Compiled by
Mohamed El-Mahdy M. Ali
Prof. of
Mathematics and Actuarial Statistics
Faculty of Commerce
Port-Said University
2022/2023
About the Author
 Deputy, Faculty of Commerce, Port-Said University, 2012.
 Head of the Statistics, Mathematics, and Insurance Department, 1999-2019.
 Prof. of Mathematics and Actuarial Statistics, from 1999.
 Member of the Egyptian Universities Promotion Committees (EUPC), Insurance Committee, 2012-2015 and 2016-2019.
 Member of the Arbitration Committee of Insurance at the EUPC, 2004-2020.
 External reviewer at the National Authority for Quality Assurance and Accreditation of Education (NAQAAE), 2018 till now.
 Consultant expert at the EFSA, 1995 till present.
 Consultant, Insurance Institute in Egypt (IIE), affiliated with the Chartered Insurance Institute in London (CII), 2012-2015.
 Consultant, Commercial International Life Insurance Co. (CIL), 2013.
 Fellowship, Wharton School, Pennsylvania, USA, 1989-1990.
 Supervised more than 50 MSc and PhD theses and dissertations in statistics and insurance.
 Arbitrated more than 200 MSc and PhD theses, dissertations, research papers, and books in statistics and insurance.
 Published more than 30 scientific research papers in Arabic and English.
 Published more than 50 mathematics, statistics, and insurance books in Arabic and English.
 Teaches quantitative methods, risk management, and statistical analysis curricula for MBA, DBA, and postgraduate programs at different universities and academies.
Contents
Preface
Chapter 1: Basic Terms
Chapter 2: Tabular and Graphical Presentations
Chapter 3: Measures of Central Tendency
Chapter 4: Measures of Variability
Chapter 5: Measures of Distribution Shape and Relative Location
Chapter 6: Correlation and Regression
References

Preface
The basic aim of this textbook is to give students a conceptual introduction to
the field of descriptive statistics and its applications.
This book contains six chapters; Chapter one deals with basic terms. Chapter
two includes tabular and graphical presentations. Chapter three involves
measures of central tendency. Chapter four contains measures of variability or
spread. Chapter five includes measures of distribution shape and relative
location. And finally, chapter six discusses measures of association between
two variables, using correlation and regression.

Chapter (1)
Basic Terms
As we know, every field of knowledge has its own terms or terminologies. This chapter
discusses the basic terms related to statistics. These terms include statistics and its types, data and its types and sources, population and sample, and variables and their types.

(1) Statistics:
In daily usage, the term statistics refers to numerical facts. But in a broader sense, statistics is a method or tool used to extract information from data. So, statistics can be defined as the art and science of collecting or gathering, summarizing or tabulating, presenting, analyzing, and interpreting data to make decisions.

(2) Types of Statistics:


Statistics may be classified into two major types: theoretical and applied.
A- Theoretical Statistics:
Theoretical or mathematical statistics refers to the derivation and proof of statistical
formulas, rules, laws, and theorems.
B- Applied Statistics:
Applied statistics contains the applications of those theorems and formulas to solve real-world problems. Applied statistics can be classified into two major types: descriptive and inferential statistics.
1- Descriptive Statistics:
Descriptive statistics involves collecting, arranging or organizing, summarizing,
displaying, and describing a set of data by using tables, graphs, and summary measures.
2- Inferential Statistics:
Inferential statistics includes all methods that use sample statistics (or measures) to help make decisions or predictions about population parameters (or measures). So, inferential statistics, or statistical inference, is the process of making an estimate, prediction, or decision about a population based on sample data.

(3) Population and Sample:
A- Population:
A population contains all elements or members (individuals, items or objects) whose
characteristics are being studied. So, the population is the entire set of all items or
observations or measurements of interest in any statistical study. A survey that involves all
elements or members of the population is called a census. A descriptive measure of a
population is called a parameter.
B- Sample:
A sample is a part or portion of the population. So, a sample is a set of data or observations drawn or selected from the population. It contains only a few elements. An element or member of a sample or population is a specific subject or object about which the information is collected. The method of collecting information from a part of the population is called a sample survey. A sample that represents the characteristics of the population as closely as possible is called a representative sample. A sample drawn in such a way that every element of the population has the same chance of being selected is called a random sample. If all samples of the same size have the same chance of being selected from a population, this process is called simple random sampling, and such a sample is called a simple random sample. A descriptive measure computed from a sample is called a statistic.
One way to select a random sample is by lottery or draws or may be by computer program.
For example, if we want to select 5 students from a class of 80, we write each of the 80
names on a separate piece of paper. Then we place all 80 slips in a box and mix them
thoroughly. Finally, we randomly draw 5 slips from the box. The 5 names drawn give a
random sample.
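As a sketch, the lottery method above can be mimicked with Python's standard library (the class size of 80 and sample size of 5 follow the example; the student labels are placeholders):

```python
import random

# The 80 "slips of paper": placeholder labels for the students' names.
names = [f"Student {i}" for i in range(1, 81)]

# Drawing 5 slips at random mimics mixing the box and drawing by hand;
# random.sample draws without repetition, so the 5 names are distinct.
sample = random.sample(names, k=5)
print(sample)
```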
A sample may be selected with or without replacement. In sampling with replacement, each time we select an element from the population, we put it back in the population before we select the next element. Thus, in sampling with replacement, the population size remains the same and contains the same number of elements each time a selection is made. As a result, we may select the same element more than once in this sample.

Sampling without replacement occurs when the selected item is not returned to the population. In this case, each time we select an element, the size of the population is reduced by one element. Thus, we cannot select the same element more than once in this sample.
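The distinction can be illustrated with a small sketch in Python (the population values are hypothetical):

```python
import random

population = [1, 2, 3, 4, 5]

# With replacement: each element is returned before the next draw,
# so the same element may be selected more than once.
with_replacement = random.choices(population, k=8)

# Without replacement: each draw removes the element from consideration,
# so no element can appear twice (k cannot exceed the population size).
without_replacement = random.sample(population, k=4)

print(with_replacement)
print(without_replacement)
```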

(4) Data:
The term data, like the word observations, is a plural noun and refers to the actual measurements that result from an investigation or survey. Data are defined as the facts and figures collected or gathered, tabulated or summarized, and analyzed for presentation and interpretation. All the data collected in a specific study are referred to as the data set for the study. Data can be classified into two major types: quantitative and qualitative.
A- Quantitative Data:
Quantitative data are numerical observations. They are numeric values that indicate how many, how much, how long, or how deep; for example, height, weight, age, distance, wages, and prices. The data in these examples are real numbers. Quantitative data are
obtained using either the interval or ratio scale of measurement as we will see later. If
numbers are used to label or describe the categories, such as marital status, with possible
responses being 1 for single, 2 for married, 3 for divorced, and 4 for widowed, these data
are qualitative because the numbers represent the category of the response and they have
no real numerical meaning.
B- Qualitative Data:
Qualitative data are categorical observations. These are observations that can be sorted or classified into categories such as sex, occupation, marital status, and race. Qualitative data
use either the nominal or ordinal scale of measurement, as we will explain later, and may
be nonnumeric or numeric.

(5) Data Sources:


The availability of accurate and appropriate data is very important for deriving reliable
results. Data can be obtained from existing sources (internal or external) or from surveys
and experimental studies designed to collect new data.
A- Existing Sources:
Existing sources can be divided into internal sources and external sources.
1- Internal sources:
In most cases, data needed for a particular study already exist. Many times data come from
internal sources such as firm’s own personnel files and records. A police department might
use data that exist in its own records to analyze changes in the nature of crimes over a
period of time. Companies and organizations maintain a lot of databases concerning their
employees, customers, and business operations. These databases contain employee
salaries, ages, years of experience, data on production quantities and sales, advertising
expenses, distribution costs, inventory levels, and data about their customers. All these
data can be obtained from internal records of these firms and organizations, educational
associations and ministries. The internet is an important source of data and statistical information. Government associations and agencies are another source of existing data.
For example, a firm that wants to predict the future sales of its product may use the data of
past periods from its own records.
2- External Sources:
For most studies, however, all the data that are needed or required are not usually available
from internal sources. In these cases, the researcher may have to depend on outside
sources to obtain data. Such sources are called external sources. For example, the
statistical year book issued by the Central Agency for Public Mobilization and Statistics in
Egypt, which contains various kinds of data on Egypt, is an external source of data.
A large number of government and private publications can be used as external sources of
data. Most of the data contained in such books can be accessed on internet sites. Data
obtained from external sources may be primary or secondary data. Data obtained from the
organization or association that originally collected them are called primary data. If we
obtain data from the Bureau of Labor Statistics that was collected by this Bureau, then
these are primary data. Data obtained from a source that did not originally collect them are
called secondary data. For example, data originally collected by the Bureau of Labor
Statistics and published in the Statistical Year Book issued by the Central Agency for
Public Mobilization and Statistics are secondary data.
B- Statistical Studies:
If the data required for a specific study are not available through existing sources (internal
or external), they can be obtained by conducting a statistical study. Statistical studies can be divided into experimental and observational studies.
1- Experimental study:
In an experimental study, the variables of interest must be identified and controlled by the
investigator or the researcher in such a way so that the data can be obtained about how
they affect the variables of interest. In any experiment, data are collected from members of
a population or sample with control over the factors that affect the characteristic of interest
or the results of the experiment. For example, a pharmaceutical company might be
interested in conducting an experiment to know how a new drug affects blood pressure.
So, blood pressure is the variable of interest in this study. The dosage level of the new
drug is another variable. To obtain data about the effect of the new drug, the researcher selects a sample of patients. The dosage level of the new drug is controlled because different patients are given different dosage levels. The data on blood pressure must be collected before and after administering the new drug for the same group of patients. Statistical analysis of the experimental data can help the company decide how the new drug affects blood pressure.
2- Observational Study:
Observational or non-experimental statistical studies make no attempt to control the
variables of interest. A survey is probably the most common method of observational
study. In a survey, data are collected from members of a population or sample with no
particular control over the factors that may affect the characteristic of interest or the result
of the survey. In a personal interview survey, research questions must be identified first,
and then a questionnaire is designed and submitted to the selected sample of persons. For
example, some restaurants use observational studies to obtain data about their customers’
opinions of the quality of food, service, atmosphere, and so on. A survey may be a census
or a sample survey.

(6) Variable:
A variable is the characteristic under study that assumes different values for different
elements or items. In contrast to a variable, the value of a constant is fixed. The value of a
variable for an element or item is called an observation or measurement. A variable may
be classified as quantitative or qualitative.
A- Quantitative Variables:
A variable that can be measured numerically is called a quantitative variable, and the data collected on a quantitative variable are called quantitative data. Quantitative variables are divided into two types: discrete and continuous.
1- Discrete Variables:
A variable whose values are countable is called a discrete variable. A discrete variable can assume only integers or certain values with no intermediate values, such as the number of students in a group or the number of cars owned by students in that group.
2- Continuous Variables:
A variable that can assume any numerical value over a certain interval is called a
continuous variable, such as time, distance, height, age and weight.
B- Qualitative Variables:
A variable that cannot assume a numerical value but can be classified into categories is called a qualitative or categorical variable. The data collected on such a variable are called qualitative data. Examples of qualitative variables include the status of an undergraduate college student, gender, hair color, and the make of a car.

(7) Scales of Measurement:


There are four scales of measurement: nominal, ordinal, interval, and ratio. The scale of measurement specifies the amount of information contained in the data set and indicates the suitable methods of data organization, summarization, and statistical analysis.
A- Nominal Scale:
When the data for a variable consist of labels or names used to identify the elements or observations, the scale of measurement is considered a nominal scale. In cases where the scale of measurement is nominal, a numeric code as well as nonnumeric labels may be used; the numeric codes merely denote the labels or categories, even though the data appear as numeric values.
B- Ordinal Scale:
The scale of measurement for a variable is called an ordinal scale if the data have the properties of nominal data and can be ordered or ranked, and the order or rank of the data is meaningful. For example, the result of an exam may be excellent, very good, good, or poor. In this case, the data can be ranked or ordered from the highest grade to the lowest grade: excellent represents the best result, followed by very good, then good, and finally poor. Thus, the scale of measurement is ordinal. Note that ordinal data can also be recorded using a numeric code, such as 1 for excellent, 2 for very good, 3 for good, and so on.
C- Interval Scale:
The scale of measurement for a variable becomes an interval scale if the data have the properties of ordinal data and the interval between any two values is expressed in terms of a fixed unit of measure. Interval data are always numeric; an example is temperature, where zero degrees does not indicate that no heat exists. So, the zero point is arbitrary rather than absolute.
D- Ratio Scale:
The scale of measurement for a variable is a ratio scale if the data have all the properties of
interval data and the ratio of two values is meaningful. Variables such as distance, height,
weight, and time use the ratio scale measurement. This scale requires that a zero value be
included to indicate that nothing exists for the variable at the zero point.

(8) Sampling Techniques:


A sample may be random or nonrandom based on how a sample is drawn from a
population.
As we explained before, a random sample is a sample drawn in such a way that every item
of the population has the same chance of being selected in the sample. In a nonrandom
sample, some items of the population may not have any chance of being selected in the
sample.
-9-
Assume that the population of second-year students in the English Section of the Faculty of Commerce at Port-Said University contains 400 students, and we want to select 10 of them to represent this population. If we write the names of all students on pieces of paper, put them in a box or hat, mix them, and then draw 10 names, the sample selected in this way is called a random sample. If we arrange the names of all students alphabetically and select the first 10 names, the sample is said to be a nonrandom sample because the students who are not among the first 10 have no chance of being selected in the sample.
There are many methods to select a random sample. Four of these methods are explained
briefly here.
A- Simple Random Sampling:
As we discussed before, the simple random sample is a sampling technique under which
each sample of the same size has the same chance or probability of being selected from a
population. This sample can be selected by a lottery or drawing method as discussed
before, or by using random numbers table, and by computer programs.
B- Systematic Random Sampling:
If the population size is large, such as 50,000 students at Port-Said University, and we need to select 200 students from this population, the selection of a simple random sample becomes very tedious and time-consuming. In such cases, it is more convenient to use a systematic random sample.
To select a systematic random sample in this case, we would arrange the population alphabetically (or based on some other characteristic). Because the sample size is 200 students, the ratio of population size to sample size is 250 (50000/200 = 250). Using this ratio, we randomly select one student from the first 250 students in the arranged list using one of the methods mentioned before. Assume that the first selected student is number 150. We then select every 250th student after that. Thus, our sample involves the students with numbers 150, 400, 650, 900, 1150, 1400, and so on. So, we can say that, in a systematic random sample, we first randomly select one element from the first k elements. Then every kth element, beginning with the first selected element, is included in the sample.
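The selection rule above can be sketched numerically; the figures (50,000 students, a sample of 200, and a random start of 150) follow the example:

```python
N, n = 50_000, 200   # population size and sample size from the example
k = N // n           # sampling interval: 250

first = 150          # assume the start drawn at random from the first k students
sample = list(range(first, N + 1, k))  # every kth student after the start

print(sample[:6])    # [150, 400, 650, 900, 1150, 1400]
print(len(sample))   # 200
```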
C- Stratified Random Sampling:
In a stratified random sample, we first classify the population into subpopulations according to some characteristic (such as income, expenditure, sex or gender, education, race, or employment position); these subpopulations are called strata. Then, we select one random sample from each stratum. The collection of all random samples selected from all strata represents the stratified random sample. Usually, the sizes of the samples selected from the strata are proportionate to the sizes of the subpopulations in these strata. Note that the elements of each stratum are similar with regard to the chosen characteristic.
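A minimal sketch of proportional allocation, assuming three hypothetical income strata and a total sample of 100:

```python
# Hypothetical stratum sizes (subpopulation counts).
strata = {"low income": 2000, "middle income": 5000, "high income": 3000}
total_sample = 100

N = sum(strata.values())
# Each stratum's sample size is proportionate to its share of the population.
allocation = {name: round(total_sample * size / N) for name, size in strata.items()}

print(allocation)  # {'low income': 20, 'middle income': 50, 'high income': 30}
```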
D- Cluster Sampling:
If the population is scattered over a very wide geographical area, the simple random
sample may be very costly. In such a case, the cluster sampling is employed. In cluster
sampling, the entire population is first classified into geographical groups called clusters.
Each cluster is representative of the population. Then a random sample of clusters is selected. Finally, a random sample of elements from each of the selected clusters is
selected.
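The two stages can be sketched as follows (the ten districts of 100 residents each are hypothetical):

```python
import random

# Hypothetical clusters: geographical districts, each listing resident IDs.
clusters = {f"district {d}": list(range(d * 100, d * 100 + 100)) for d in range(10)}

# Stage 1: randomly select a few whole clusters.
chosen = random.sample(list(clusters), k=3)

# Stage 2: randomly select elements within each chosen cluster.
sample = [x for c in chosen for x in random.sample(clusters[c], k=10)]

print(len(chosen), len(sample))
```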

(9) Statistics and Computer:


Sometimes, we need to deal with large datasets, for example, 100 or more observations. In such cases, researchers or analysts would have to perform various calculations on the data using computer programs instead of working by hand, which may be time-consuming and tedious.

Exercises
(1) Briefly explain the meaning of statistics and its types.
(2) Briefly explain each of the following terms: Population, sample, random sample,
sampling with replacement, sampling without replacement.
(3) Discuss whether each of the following constitutes a population or a sample.
a- Credit card debts of 1200 persons selected from Port-Said city.
b- Monthly salaries of all employees of Port-Said University.
c- Number of cars owned by all students in faculty of commerce at Port-Said.
d- Number of computers sold during the last month at all computer malls at Port-Said.
(4) Briefly explain the meaning of an element, a variable, a data set, discrete
variable, continuous variable, and qualitative variable.
(5) Indicate which of the following variables are quantitative and which are
qualitative.
a- Number of your family members.
b- Color of your hair.
c- Marital status of employees.
d- Number of cars owned by your family.
e- The months in which you take your vacation.
(6) Briefly describe the types of data sources.
(7) Explain the four types of the sampling techniques.
(8) For each of the following questions, determine whether the possible responses
are nominal, ordinal, interval, or ratio scale.
a- How old are you?
b- What is your sex?
c- What is your weight?
d- Are you engaged?
e- What is the kind of music you prefer?
f- What size of soft drink do you prefer (small, medium, large)?
g- What is the degree of heat today?

Chapter (2)
Tabular and Graphical Presentations
This chapter describes tabular and graphical methods usually used to organize and
summarize qualitative and quantitative data. We start with data concerning one variable.
Then, we introduce methods for summarizing data when dealing with the relationship
between two variables.
(1) Summarizing Qualitative data:
When data are collected, the information obtained from every element of the population or sample is recorded in the sequence in which it becomes available. This sequence is random and unordered or unranked; such data are called raw data.
So, raw data are data recorded in the sequence in which they are collected, before they are ordered or ranked.
A- Frequency Distribution:
A frequency distribution is a tabular summary of data showing the number of elements
(frequency) that belong to each category (class).
B- Relative Frequency and Percent Frequency Distributions:
The frequency distribution shows the number of elements (frequency) in each of several
categories (classes). However, we are usually interested in the proportion (the relative
frequency), or the percentage of elements in each category (class). The relative frequency
of a class (RF) equals the fraction or proportion of elements belonging to that class. Assume that the number of observations is denoted by (n) and the frequency of a class is (Fc). The relative frequency (RF) and the percent frequency (PF) of each class can be computed as follows:
RF = Fc / n,  and  PF = RF × 100        (1)

A relative frequency distribution gives a tabular summary of data showing the relative
frequency for each class. A percent frequency distribution includes the frequency percent
(or percent frequency) for each class of the frequency distribution (or frequency table).
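Formula (1) can be sketched in Python; the soft-drink preference data below are hypothetical:

```python
from collections import Counter

# Hypothetical qualitative raw data: preferred soft-drink brands.
raw = ["Coke", "Pepsi", "Coke", "Sprite", "Coke", "Pepsi", "Coke", "Sprite"]

n = len(raw)
freq = Counter(raw)  # Fc: frequency of each class

rel_freq = {c: fc / n for c, fc in freq.items()}        # RF = Fc / n
pct_freq = {c: rf * 100 for c, rf in rel_freq.items()}  # PF = RF x 100

print(freq["Coke"], rel_freq["Coke"], pct_freq["Coke"])  # 4 0.5 50.0
```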

C- Bar Graphs:
A bar graph or bar chart is a graphical device for depicting qualitative data tabulated or
summarized in a frequency, relative frequency, or percent frequency distribution.
So, a bar chart is a graph constructed by bars or rectangles whose heights represent the
frequencies or relative frequencies or percent frequencies.
To construct a bar chart, we mark or specify the labels that are usually used for the classes
or categories on the horizontal axis. Note that all classes are represented by fixed or equal
width intervals. A frequency, relative frequency, or percent frequency scale is usually used
for the vertical axis. Then, using a bar of equal width drawn above each class label, we
extend the height or the length of the bar or rectangle until we reach the frequency, relative
frequency, or percent frequency of the class. For qualitative data, the bars must be
separated to emphasize the fact that each class is separate.
We can also construct a bar graph by making the classes on the vertical axis and the
frequencies on the horizontal axis.
D- Pie Chart:
The pie chart provides another graphical device for presenting the relative frequency and
percent frequency distributions for qualitative data. A pie chart is a circle divided into
segments or sectors that represent the relative frequencies or percentages of a population
or a sample data belonging to different classes or categories. To construct a pie chart, we
first draw a circle and divide it into sectors or portions that correspond to the relative
frequency or percent frequency of each category or class. Because the circle contains 360
degrees, the angle of each sector of the circle is computed by multiplying the relative
frequency by 360.
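The sector angles can be computed directly from the relative frequencies; the class frequencies below are hypothetical:

```python
# Hypothetical class frequencies.
freq = {"A": 30, "B": 15, "C": 15}
n = sum(freq.values())

# Angle of each sector = relative frequency x 360 degrees.
angles = {c: f / n * 360 for c, f in freq.items()}

print(angles)  # {'A': 180.0, 'B': 90.0, 'C': 90.0}
```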

(2) Summarizing Quantitative data:


In this section, we will introduce the tabular and graphical techniques used for
summarizing quantitative data. These techniques include frequency distribution, relative
and percent frequency distributions, cumulative distributions, histogram, ogive, and line
chart.

A- Frequency Distribution:
As we discussed before, a frequency distribution is a tabular summary of data showing the number of elements (frequency) in each of several categories (classes). This definition holds true for
qualitative and quantitative data. However, when dealing with quantitative data, we must
be careful in defining and determining the non-overlapping classes to be used in the
frequency distribution. There are three steps necessary to define and determine the classes
of the quantitative data for constructing a frequency distribution. They are:
- Determine the number of non-overlapping classes.
- Determine the width of each class.
- Determine the class limits, and the class midpoint.
1- Number of Classes:
As a general rule, we recommend using between 5 and 10 classes depending on the size of
the dataset. For small size of data (up to 50), 5 classes may be used to tabulate and
organize the data. For a large size of data, a larger number of classes is required to
illustrate the variation in the data. The decision concerning the suitable number of classes is arbitrarily made by the researcher or analyst. One rule which may help you to determine the suitable number of classes is Sturges' Rule, given by the following formula:

C = 1 + 3.3 log (n) (2)


Where C is the number of classes, n is the data size.

The approximate number of classes given by this formula must be rounded to the nearest
integer based on the preference of the analyst.
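Formula (2) uses the base-10 logarithm; a small sketch, taking round-to-nearest-integer as the rounding choice:

```python
import math

def sturges_classes(n):
    """Sturges' rule: C = 1 + 3.3 * log10(n), rounded to the nearest integer."""
    return round(1 + 3.3 * math.log10(n))

print(sturges_classes(50))    # about 6.6, rounds to 7
print(sturges_classes(1000))  # 10.9, rounds to 11
```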
2- The Class Width:
The second step in constructing a frequency distribution for quantitative data is to select a
width or length for the classes. It is better to use an equal class width for all classes.
Usually, the class width depends upon the number of classes. A larger number of classes
means a smaller class width, and vice versa. The approximate class width can be
determined using the following formula:

I = (LV − SV) / C        (3)

Where I is the approximate class width or class interval, LV is the largest data value, SV is
the smallest data value, and C is the number of classes.
The approximate class width given by this formula must be rounded to the nearest unit
based on the preference of the analyst.
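A sketch of formula (3) with hypothetical data values; rounding up is one common way to make sure the classes cover the whole range:

```python
import math

largest, smallest, classes = 97, 12, 5  # hypothetical LV, SV, and C

width = (largest - smallest) / classes  # I = (LV - SV) / C
print(width)             # 17.0
print(math.ceil(width))  # 17, a convenient whole-number class width
```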
3- Class Limits:
Class limits must be determined in a way where each element or measurement belongs to only one class. Each class has two limits: the lower class limit and the upper class limit. The lower class limit identifies the smallest measurement that belongs to the class, and the upper class limit identifies the largest measurement that belongs to that class. For qualitative data, we do not need to determine class limits because each observation belongs to a separate category or class.
After specifying the number of classes, the class width, and the class limits, a frequency
distribution can be constructed by counting the number of elements or measurements
belonging to each class.
4- Class Midpoint or Mark:
In some cases, we need to know the midpoints of the classes in a frequency distribution for quantitative data. The class midpoint is the value located halfway between the lower and upper class limits. In other words, the class midpoint is obtained by dividing the sum of the class limits, or of the class boundaries (in the case of overlapping classes), by 2. So, the class midpoint can be determined by this formula:

m = (LL + UL) / 2   or   m = (LB + UB) / 2        (4)

Where m is the class midpoint or mark, LL is the lower limit, UL is the upper limit of the
class, LB is the lower boundary, and UB is the upper boundary.
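Putting the pieces together, class limits and midpoints can be generated as a sketch (a smallest value of 10, width of 15, and 5 classes are hypothetical choices):

```python
start, width, n_classes = 10, 15, 5  # hypothetical choices

classes = []
for i in range(n_classes):
    lower = start + i * width       # lower class limit (LL)
    upper = lower + width           # upper boundary of the class
    midpoint = (lower + upper) / 2  # m = (LL + UL) / 2
    classes.append((lower, upper, midpoint))

print(classes[0])   # (10, 25, 17.5)
print(classes[-1])  # (70, 85, 77.5)
```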

B- Relative Frequency and Percent Frequency Distributions:
The construction of the relative frequency and the percent frequency distributions for the
quantitative data is the same as for qualitative data. The only difference is the way of
determining the number of classes and the width of each class as explained before. The
relative frequency is the proportion of elements belonging to each class, and the percent
frequency of a class is the relative frequency multiplied by 100.
Less than method for writing classes
We can write the classes in a frequency table using the less than method. This method is
more suitable when the dataset contains fractional values.
Single – valued classes
When the dataset includes a few distinct integer values, it is better to construct a frequency table using single-valued classes. This method is used for discrete data with a few possible values.
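Single-valued classes amount to counting each distinct value; the car-ownership data below are hypothetical:

```python
from collections import Counter

# Hypothetical discrete data: number of cars owned per family.
cars = [0, 1, 1, 2, 0, 1, 3, 2, 1, 0]

# Each distinct value is its own class; sort the classes for display.
table = dict(sorted(Counter(cars).items()))

print(table)  # {0: 3, 1: 4, 2: 2, 3: 1}
```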
C- Histogram:
A common graphical presentation of a quantitative dataset is a histogram. A histogram is a graph in which classes are marked on the horizontal axis and the frequencies, relative frequencies, or percent frequencies are represented by the heights of the bars or rectangles. In a histogram, the bars are drawn adjacent or close to each other with no gap between them.
A histogram is constructed by placing the variable or marking the classes of interest on the
horizontal axis, and the frequency, relative frequency or percent frequency on the vertical
axis. The frequency, relative frequency, or percent frequency of each class is represented
by drawing a bar or rectangle whose base is specified by class limits on the horizontal
axis, and whose height is the corresponding frequency, relative frequency, or percent
frequency.
D- Frequency Polygon and Frequency Curve:
Another common way of presenting a frequency distribution graphically is the frequency
polygon. A frequency polygon is drawn by plotting the frequency of each class above the
midpoint of that class and joining the midpoints of the top of successive bars by straight
lines. The polygon is usually closed by adding one class with zero frequency before the
first class and also after the last class, and then extending a straight line to the midpoint of

each of these additional classes. The frequency polygon is very useful for obtaining a good idea about the shape of the distribution of the data.
As with the histogram, we can plot a polygon for relative frequency and percent frequency
distributions. Relative and percent frequency polygons allow visual comparison of two
distributions by drawing them on one graph or chart.
For a large dataset, as the number of classes is increased and the width of the classes is
decreased, the frequency polygon approaches a smooth curve. Such a curve is called a
frequency distribution curve, or simply a frequency curve.
E- Cumulative Distributions:
Rather than illustrating the frequency of each class, the cumulative frequency distribution
contains the number of data elements (frequencies) with values less than or equal to the
upper class limit of each class. So, a cumulative frequency distribution is constructed for
quantitative data only, and it gives the total frequency (values) that falls below the upper
boundary or upper limit of each class.
In a cumulative frequency distribution, each class has the same lower limit but a different
upper limit.
The cumulative relative distribution is obtained by dividing the cumulative frequency of
each class by the total frequencies. And the cumulative percent distribution is constructed
by multiplying each cumulative relative frequency by 100. So:

CRF = CF / n , and CPF = CRF × 100 (5)

Where CRF is the cumulative relative frequency, CF is the cumulative frequency of a
class, n is the total of the frequencies, and CPF is the cumulative percent frequency.
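The bookkeeping in formula (5) is easy to verify with a short script. Below is a minimal sketch in Python, using the class frequencies from Example (10) later in the book:

```python
freqs = [6, 12, 20, 8, 4]             # class frequencies (Example 10 data)
n = sum(freqs)                         # total of the frequencies

cf = []                                # cumulative frequencies
running = 0
for f in freqs:
    running += f
    cf.append(running)

crf = [c / n for c in cf]              # CRF = CF / n        (formula 5)
cpf = [r * 100 for r in crf]           # CPF = CRF x 100

print(cf)                              # [6, 18, 38, 46, 50]
```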
F- Ogive:
An ogive is a curve drawn for a cumulative frequency, cumulative relative frequency, or
cumulative percent distribution by joining with straight lines the dots marked above the
upper boundaries or upper limits of the classes at heights equal to the cumulative
frequencies, cumulative relative frequencies, or cumulative percents of the corresponding
classes. When drawing an ogive, we add a class with zero frequency before the first class
to start the curve from the horizontal axis.
G- Line Chart:
The last graphical chart to be illustrated in this section is the line chart. A line chart is
usually used to represent time-series data. A line chart is obtained by plotting the
frequencies of the categories or classes above the points that represent these categories or
classes on the horizontal axis, and then joining the successive points with straight lines.
Bar, pie, and line charts are often used in reports prepared by organizations,
governments, and the media. The objective of all such charts is to present a clear data
summary in a form that allows the reader to get a quick idea about the trends or relevant
comparisons in the data.

(3) Scatter Diagram and Trendline:


A scatter diagram or scatter plot is a graphical presentation of the relationship between two
quantitative variables, and a trend line is a line that illustrates an approximation of this
relationship.

Chapter (3)
Measures of Central Tendency
In this chapter, we provide several numerical measures that represent additional
alternatives for summarizing datasets consisting of a single variable. When a dataset contains
more than one variable, the same numerical measures can be calculated separately for each
variable.
As we said in the first chapter, if the measures are calculated for a sample data, they are
called sample statistics. If the measures are computed for a population data, they are called
population parameters.
A measure of central tendency, also called measure of location, gives the center of a
histogram or a frequency distribution curve.
The most common measures of location are; the mean, the median, and the mode.
However, other measures of central tendency, such as the trimmed mean, the weighted
mean, the combined mean, the geometric mean, and the harmonic mean, are also discussed
in this chapter.
A measure of position determines the position of a single value in relation to the other
values in a sample or a population dataset. There are many measures of position, like
quartiles, deciles, quintiles, and percentiles.

Measures of Central Tendency:


(1) Mean:
The mean, also called the average or arithmetic mean, is the most important and most
frequently used measure of central location or tendency. For ungrouped data, the mean is
computed by dividing the sum of all values or measurements by the number of values in
the dataset. If the data are for a sample, the mean is denoted by (x̄); if the data are for a
population, the mean is denoted by (µ). Thus:

Mean for population data: µ = Σxi / N (1)

Mean for sample data: x̄ = Σxi / n (2)

In the above formulas, the numerator is the sum of the values of the measurements. That is:

Σxi = x1 + x2 + ……. + xn

Where the Greek letter Σ is the summation sign.

Example (1):
The following data represent the net profit of 6 firms by millions of LE during year 2021.
[30 - 20 – 18 – 13 – 10 – 29]
Find the population mean.
Solution:

µ = Σxi / N , N = 6

= (1/6) (29 + 10 + 13 + 18 + 20 + 30)

= (1/6) (120) = 20

Example (2):
Assume that the monthly salaries of a sample of five employees in a firm during the year
2021 are:

[ 1550 , 1310 , 1370 , 1320 , 1450 ]


Compute the sample mean of the salaries.
Solution:

x̄ = Σxi / n , n = 5

= (1/5) (1550 + 1310 + 1370 + 1320 + 1450)

= (1/5) (7000) = 1400
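The same calculation can be sketched in a couple of lines of Python, using the salary data of Example (2):

```python
# Sample mean: x-bar = (sum of x) / n   (formula 2)
salaries = [1550, 1310, 1370, 1320, 1450]
x_bar = sum(salaries) / len(salaries)
print(x_bar)   # 1400.0
```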

(2) The Weighted Mean:


In some cases, certain values in a data set may be considered more important than other
values. For example, to determine students’ grades or marks in a specific course, an

instructor may assign a weight to the final exam twice as much as to each of the other
exams. In such cases, it is more appropriate to use the weighted mean.
For a sequence of (n) values x1, x2, ……, xn that are assigned weights w1, w2, ……, wn
respectively, the weighted mean is computed as follows:

x̄w = Σwi xi / Σwi (3)

Example (3):
Assume that the weight of each monthly exam is 10%, the weight of the mid-term exam is
20%, and the weight of the final exam is 50%. Find the weighted mean for a student who
has 78, 82, 85 for the three months exams, 80 for the mid-term, and 70 for the final exam.
Solution:

x̄w = Σwi xi / Σwi

= [20(80) + 10(82) + 10(85) + 10(78) + 50(70)] / (20 + 10 + 10 + 10 + 50)

= 7550 / 100

= 75.5

(3) The Combined Mean:


When dealing with more than one sample or data set, we can compute the combined
mean of all samples or data sets. The combined mean for two or more data sets is
computed by the following formula:

x̄c = Σni x̄i / Σni (4)

Where ni is the sample size of dataset (i), and (x̄i) is the mean of the (ith) sample.

Example (4):
Suppose a sample of 5 statistics books gave a mean price 50 LE, a sample of 8 financial
mathematics books gave a mean price of 45 LE, and a sample of 7 accounting books gave
a mean price of 60 LE. Find the combined mean.
Solution:

x̄c = Σni x̄i / Σni = (5 × 50 + 8 × 45 + 7 × 60) / (5 + 8 + 7) = 1030 / 20 = 51.5
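A short Python sketch of formula (4), with the three book samples of Example (4):

```python
# Combined mean: (sum of n_i * mean_i) / (sum of n_i)   (formula 4)
sizes = [5, 8, 7]                      # sample sizes
means = [50, 45, 60]                   # sample means
combined = sum(n * m for n, m in zip(sizes, means)) / sum(sizes)
print(combined)   # 51.5
```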

(4) Trimmed Mean:


The trimmed mean is very useful as a measure of location or central tendency when the
ranked dataset contains a few outliers at each end. The trimmed mean is computed by
eliminating or dropping a specific percentage of the ordered or ranked observations from
each end of the dataset.

Example (5):
The following data give the expenses of 10 students in one day by LE.
No 1 2 3 4 5 6 7 8 9 10
LE 50 24 30 75 15 41 25 40 55 47

Find the 10% trimmed mean.


Solution:
- The ranked data from small to large are:
No 1 2 3 4 5 6 7 8 9 10
L.E 15 24 25 30 40 41 47 50 55 75
- After dropping 10% from each end, the remaining ranked data will be:
[24, 25, 30, 40, 41, 47, 50, 55]
- The mean of the remaining 80% gives the 10% trimmed mean:
- Tm = (24 + 25 + 30 + 40 + 41 + 47 + 50 + 55) / 8 = 39
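The trimming procedure can be sketched in Python with the expense data of Example (5); the fraction dropped from each end is a parameter:

```python
# 10% trimmed mean: drop 10% of the ranked values from each end
expenses = [50, 24, 30, 75, 15, 41, 25, 40, 55, 47]
ranked = sorted(expenses)
k = int(len(ranked) * 0.10)            # values to drop from each end (1 here)
trimmed = ranked[k:len(ranked) - k]
t_mean = sum(trimmed) / len(trimmed)
print(t_mean)   # 39.0
```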

(5) Geometric Mean:


When the data relate to phenomena such as inflation or population change over a period
of time, involving periodic increases or decreases, the geometric mean is employed to get
the average change over the entire period under study.
To compute the geometric mean of a sequence of (n) values x1, x2, …., xn, we multiply
them together and then find the nth root of this product. Thus:

GM = (x1 × x2 × …. × xn)^(1/n) (5)

Where GM is the geometric mean.

Example (6):
Assume that the inflation rates for the last five years are 5%, 7%, 8%, 9%, and 10%
respectively. Find the suitable mean of price indexes and inflation rate over the five-year
period.
Solution:
The geometric mean of the price indexes is:
GM = (1.05 × 1.07 × 1.08 × 1.09 × 1.10)^(1/5) = 1.0779
The mean inflation rate is:
Inflation mean = 1.0779 − 1 = 0.0779 = 7.79%
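The product-and-root computation of Example (6) can be sketched in Python with the standard library:

```python
import math

# Geometric mean of the growth factors (1 + inflation rate) of Example (6)
factors = [1.05, 1.07, 1.08, 1.09, 1.10]
gm = math.prod(factors) ** (1 / len(factors))   # nth root of the product
mean_inflation = gm - 1
print(round(gm, 4))               # 1.0779
print(round(mean_inflation, 4))   # 0.0779
```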

(6) Harmonic Mean:


The harmonic mean is the reciprocal of the mean of the reciprocals of the values. It is
computed as:

HM = n / Σ(1/xi) , i = 1, 2, …., n (6)

Where HM is the harmonic mean and n is the number of values.

Example (7):
Find the harmonic mean of the values, 5, 8, 6 , 10, 12.
Solution:
HM = 5 / (1/5 + 1/8 + 1/6 + 1/10 + 1/12) = 7.4074
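Formula (6) as a one-line Python sketch, using the values of Example (7):

```python
# Harmonic mean: n divided by the sum of the reciprocals   (formula 6)
values = [5, 8, 6, 10, 12]
hm = len(values) / sum(1 / x for x in values)
print(round(hm, 4))   # 7.4074
```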

(7) The Median:


The median is the value that falls in the middle when the data are arranged or ranked in
increasing or ascending (from smallest to largest) or decreasing or descending (from
largest to smallest) order. For the odd number of elements, the median is the middle value.
For an even number of observations, the median is the average of the two middle values.
So, the rank of the median will be:

Mr = (n + 1) / 2 when n is an odd number (7)

When n is an even number, the median is the average of the two middle values, whose
ranks are M1 = n/2 and M2 = n/2 + 1; the median is then half the sum of the values at
these two ranks.

When using the median as a measure of central location, at most one half of the elements
fall below the median, and at most the other half fall above it.
The median is often the preferred measure of central location when dealing with ordinal
data, or with cardinal (quantitative) data that contain outliers or extreme values on one
side after ranking. In such cases, the mean is not a suitable measure of central tendency
because of the distorting effect of the outliers. Because the mean is heavily influenced by
extremely small and large data values, the trimmed mean and the median may be more
suitable in such cases.

Example (8):
Find the median of the following marks of five students:

[15, 12, 8, 16, 18]
Solution:
The ranked data are:

8 , 12 , 15 , 16 , 18

The median rank is:

Mr = (n + 1) / 2 = (5 + 1) / 2 = 3

So, the median is the value number three in the ranked data, which equals 15.

Example (9):
The following data represent the sales of 6 branches of a company by thousand LE:
54 , 48 , 33 , 66 , 72 , 99
Find the median of sales.
Solution:
- We first arrange the data in ascending order as follow:
33 , 48 , [ 54, 66] ,72 , 99
Since the number of data values is even, we get the ranks of the two middle values:

M1 = n / 2 = 6 / 2 = 3 , M2 = n / 2 + 1 = 4

- So, the median is the average of the values at these two ranks:

Mr = (1/2) (54 + 66) = (1/2) (120) = 60
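The odd/even rank rule above can be sketched as a small Python function, checked against Examples (8) and (9):

```python
def median(data):
    """Median by the rank rule: middle value for odd n,
    average of the two middle values for even n."""
    s = sorted(data)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                      # rank (n + 1) / 2
    return (s[mid - 1] + s[mid]) / 2       # ranks n/2 and n/2 + 1

print(median([15, 12, 8, 16, 18]))         # 15    (Example 8)
print(median([54, 48, 33, 66, 72, 99]))    # 60.0  (Example 9)
```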

Computing the median from an ogive:
As we explained before, the ogive is a curve drawn for the cumulative frequency or
cumulative percentage (cumulative relative frequency) distribution, by joining with
straight lines the dots marked above the upper boundaries of the classes at heights equal
to the cumulative frequencies or percentages of the respective classes.

Example (10):
The following table gives the frequency distribution of the pocket money by LE for 50
students:

C 0-10 10-20 20-30 30-40 40-50


F 6 12 20 8 4

a- Construct a cumulative frequency distribution.

b- Draw an ogive for the cumulative frequency distribution.

c- Find the median of pocket money of students.

d- Prepare the ascending and descending cumulative frequency distributions and draw
a graph representing them, and determine the median.
Solution:

- Constructing the ascending and descending cumulative frequencies.

C F AC CF1 DC CF2
Less than 0: 0
0 – 10 6 0 – 10 6 0 – 50 50
10 – 20 12 0 – 20 18 10 – 50 44
20 – 30 20 0 – 30 38 20 – 50 32
30 – 40 8 0 – 40 46 30 – 50 12
40 – 50 4 0 – 50 50 40 – 50 4
More than 50: 0
Total 50

Where:
C : is classes
F : is frequencies
AC : is ascending cumulative classes
DC : is descending cumulative classes
CF1 : is ascending cumulative frequencies
CF2: is descending cumulative frequencies

- Drawing an ogive and finding the median

Ogive and median

- Drawing ascending and descending cumulative frequencies together and finding


median:

Ascending and descending cumulative frequencies curves and median
From the figures, you can see that the median value is about 24 L.E.

(8) The Mode:


The mode is the element that is most popular or common. In statistics, the mode
represents the most frequent value in the dataset. In other words, the mode is the value
that occurs with the highest frequency in the dataset.
The mode doesn't necessarily lie in the middle of the dataset, because it indicates the
location of the greatest clustering or concentration of values.
Sometimes no single value occurs more than once, so there is no mode in the raw data. In
such a case, it is more useful to group the data into classes and refer to the class with the
largest frequency as the modal class. A distribution is then said to be unimodal if there is
only one such class, and bimodal if there are two such classes, or if the dataset contains
two equal modes.
While the midpoint of the modal class is sometimes referred to as the mode, it identifies
not the element that occurs most frequently, as the true mode does, but the element about
which there is the greatest clustering of values; so, it corresponds graphically to the
highest point on the frequency polygon.
When the dataset is nominal, the only measure of central tendency is the mode.

Example (11):
The manager of a men’s store observes that 10 T-shirts sold one day had the following
neck sizes:

[42 , 36 , 45 , 34 , 38 , 44 , 39 , 42 , 44 , 44]
Find the mode of these neck sizes.
Solution:
The mode of the neck sizes is:
Mode = 44
As you can see, the value 44 is repeated three times.

Example (12):
Assume that the number of T-shirts in the above example is 9 only, and the neck size 44 is
repeated only twice. Find the mode.
Solution:
In this case the dataset is bimodal, and the two modes are 42 and 44.
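A small Python sketch covering both the unimodal and bimodal cases of Examples (11) and (12):

```python
from collections import Counter

def modes(data):
    """All values sharing the highest frequency; empty list if nothing repeats."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []                          # no mode in the raw data
    return sorted(v for v, c in counts.items() if c == top)

print(modes([42, 36, 45, 34, 38, 44, 39, 42, 44, 44]))   # [44]     (Example 11)
print(modes([42, 36, 45, 34, 38, 44, 39, 42, 44]))       # [42, 44] (Example 12)
```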

The Relationships Among Mean, Median, and Mode:


If the distribution of a dataset is symmetric and unimodal, the three measures coincide.
So, the mean, median, and mode are equal. If the distribution of a dataset is not
symmetric, it is said to be skewed.
If the distribution is skewed to the right, or positively skewed, it has a long tail extending
to the right, indicating the presence of a small proportion of relatively large extreme
values, but only a short tail extending to the left. These extreme values or the outliers pull
the mean to the right more than the median, when the mode is located under the top or the
peak of the distribution with value less than both the median and mean. A mean value that
is larger than the median provides some evidence of positively skewed distribution.
When the distribution is skewed to the left, or negatively skewed, it has a long tail to the
left but short tail to the right. In this case, the extreme values or the outliers pulled the
mean value to the direction of the skewness. A mean value will be less than the median
indicating a negative skewness, and the mode is located under the top of the distribution,
which is larger than both the median and mean.
The following figures indicate these relationships:

Symmetric: mean = median = mode
Skewed to the left: mode > median > mean
Skewed to the right: mean > median > mode

Computing the Mean for Grouped Data:
As we explained before, the mean or arithmetic mean is computed by dividing the sum of
all values by the number of these values in case of raw data or ungrouped data. But when
the data are given in the form of a frequency table it is grouped data.
In such cases, we cannot compute the sum of individual values. Instead, we find an
approximation for the sum of these data. The formula used to compute the mean for
grouped data is:

µ = Σmf / N for population data

x̄ = Σmf / n for sample data

Where m is the midpoint or center of each class, f is the frequency of the class, N is the
size of the population, and n is the size of the sample.
So, when calculating the mean for grouped data, we must get the center of each class,
then multiply the midpoints by the frequencies of the related classes. The sum of these
products, denoted by Σmf, gives an approximation for the sum of all the values.

Example (13):
The following frequency table illustrates the daily commuting times in minutes from home
to work for all 50 workers of the faculty of commerce at Port Said using the less than
classes.

Time in minutes No of workers


0 – 10 8
10 – 20 18
20 – 30 12
30 – 40 8
40 – 50 4
Total 50

Find the mean of the daily commuting times of those workers.


Solution:
Because the data includes all the workers of the faculty, it represents the population of
workers. The following table illustrates the solution.

C f m mf
0- 8 5 40
10 - 18 15 270
20 - 12 25 300
30 - 8 35 280
40 – 50 4 45 180
Total N = 50 , Σmf = 1070

µ = Σmf / N = 1070 / 50 = 21.4

Example (14):
A car rental company randomly selected 100 new identical cars after the customary
1000-mile break-in period, and obtained the following gasoline mileage (miles per
gallon) data on them.

Miles per gallon No of cars


10 – 14 15
14 – 18 10
18 – 22 20
22 – 26 35
26 – 30 20
Total 100

Find the average gasoline consumption for these cars.


Solution:

C f m mf
10 - 15 12 180
14 - 10 16 160
18 - 20 20 400
22 - 35 24 840
26 – 30 20 28 560
Total n = 100 , Σmf = 2140

x̄ = Σmf / n = 2140 / 100 = 21.4
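The grouped-data formula Σmf / n can be sketched in Python with the midpoints and frequencies of Example (14):

```python
# Mean for grouped data: sum(m*f) / n, with m the class midpoints
midpoints = [12, 16, 20, 24, 28]       # centers of the classes 10-14, ..., 26-30
freqs     = [15, 10, 20, 35, 20]
n = sum(freqs)
mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
print(mean)   # 21.4
```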

Measures of Position:
As we discussed before, a measure of position determines the position of a single value in
relation to other values in a dataset.
There are many measures of position, such as deciles, quintiles, quartiles, and percentiles.
Each measure divides the ranked dataset into equal parts, and the number of measures of
each kind equals the number of equal parts minus one. For example, the number of
quartiles is 3, where the number of parts is 4. The following figure indicates some of
these measures and their positions.

Percentile P10 P20 P30 P40 P50 P60 P70 P80 P90
Decile D1 D2 D3 D4 D5 D6 D7 D8 D9
Quintile q1 q2 q3 q4
Quartile Q1 Q2 Q3
Median Mr

(1) Quartiles and Interquartile Range:


Quartiles are summary measures that divide a ranked dataset into 4 equal parts. The
number of quartiles is 3. These measures are the first quartile, denoted by Q1; the second
quartile, denoted by Q2, which is equal to and also called the median; and the third
quartile, denoted by Q3. The dataset must be ranked in increasing order (from small to
large) before computing or determining the quartiles. So, we can define quartiles as three summary
measures that divide a ranked dataset into four equal parts. The first quartile is the value of
the middle term among the elements that are less than the median or the second quartile
which in turn divides the ranked dataset into two equal parts. The third quartile is the value
of the middle term among the elements that are larger than the median or second quartile.
The following figure illustrates the position of the three quartiles:
25% 25% 25% 25%
Q1 Q2 Q3
So, approximately one-fourth or 25% of the elements in a ranked dataset are smaller than
Q1 and about 75% are larger than Q1. The second quartile Q2 or the median divides the
ranked dataset into two equal parts where 50% of the elements are less than the median
and 50% are greater than the median. Also, about 75% of the ranked dataset are smaller
than Q3 and 25% are larger than Q3.
The difference between Q3 and Q1 is called the interquartile range and denoted by (IQR),
where:

IQR = Q3 – Q1 (8)

(2) Percentile and Percentile Rank:


Percentiles are summary measures that divide a ranked dataset into 100 equal parts. So, the
number of percentiles is 99 and denoted by Pk, where k is an integer in the range 1 to 99.
The 25th percentile is equal to Q1 and the 50th percentile is equal to Q2 or the median, and
so on. The following figure describes the positions of percentiles.
1% 1% 1% ………. 1% 1% 1%
P1 P2 P3 …… P97 P98 P99

Thus, the kth percentile is the value that divides the ranked dataset into two parts such that
about k% of the ranked dataset are less than the value of Pk and about (100-k)% of the
dataset are greater than this value.

The approximate value of the kth percentile, denoted by Pk, is computed as follows:

Ri = n(i)/k + 0.5 (9)

Where Ri is the percentile rank, (i) is the order of the percentile, and k is the number of
parts of the dataset related to the required measure. The value of Pk is the value
corresponding to the percentile rank; when the rank falls between two positions, the two
neighbouring values are averaged.
We can also compute the percentile rank for any value (xi) included in the dataset by
calculating the percentage of the values that are less than (xi) in the dataset, using the
following formula:

Rxi = (ni / n) × 100 (10)

Where Rxi is the percentile rank of the value (xi), (ni) is the number of values less than
(xi) in the ranked dataset, and (n) is the number of observations in the dataset.

(3) Deciles and Quintiles:


Deciles are summary measures that divide any ranked dataset into 10 equal parts; each
part contains 10% of the dataset. The number of deciles is 9, and they are computed in
the same way as percentiles. The first decile is the same as the tenth percentile, the
second decile is the same as the 20th percentile, the fifth decile is the same as the median,
and so on.
Quintiles are summary measures that divide any ranked dataset into 5 equal parts; each
part includes 20% of the dataset. The number of quintiles is 4, and they are computed in
the same way as percentiles. The first quintile is the same as the second decile and also
the 20th percentile.
To compute the measures of position, we use the following three steps:
- Arrange the dataset in ascending order (from small to large)
- Compute the rank or the position of the measure
- Calculate or find the value of the measure

Example (15):
The following data lists the number of car thefts during year 2020 in 12 cities in Egypt.
[40,34,21,30,42,12,13,41,18,19,14,26]
- Find the three quartiles
- Find the percentile rank of the number 40
- Compute the inter quartile range
Solution:
- First, we rank or arrange the data in ascending order as follow:
12 – 13 – 14 – 18 – 19 – 21 – 26 – 30 – 34 – 40 – 41- 42
- Finding the ranks of the quartiles:

RQ1 = 12(1)/4 + 0.5 = 3.5

RQ2 = 12(2)/4 + 0.5 = 6.5

RQ3 = 12(3)/4 + 0.5 = 9.5
- Computing the value of measures:
Q1 = (14 + 18) / 2 = 16
Q2 = (21 + 26) / 2 = 23.5
Q3 = (34 + 40) / 2 = 37
IQR = Q3 – Q1 = 37 – 16 =21
- The percentile rank of 40:

Rxi = (ni / n) × 100 = (9 / 12) × 100 = 75

Example (16):
The following data lists the marks of 40 students in statistics course:
16 15 17 12 8 7 19 14 16 9
7 11 12 18 9 10 17 16 17 5
8 10 19 13 14 16 15 17 6 11
13 15 17 14 11 12 7 13 14 16
Find:
- The third decile
- The first quintile
- The third quartile
- The 70th percentile
Solution:
- Arranging the data in ascending order:
1 2 3 4 5 6 7 8 9 10
5 6 7 7 7 8 8 9 9 10

11 12 13 14 15 16 17 18 19 20
10 11 11 11 12 12 12 13 13 13

21 22 23 24 25 26 27 28 29 30
14 14 14 14 15 15 15 16 16 16

31 32 33 34 35 36 37 38 39 40
16 16 17 17 17 17 17 18 19 19
- Finding the ranks of the required measures:
For D3: (n = 40 , i = 3 , k = 10)
RD3 = 40(3)/10 + 0.5 = 12.5
For q1: (n = 40 , i = 1 , k = 5)
Rq1 = 40(1)/5 + 0.5 = 8.5
For Q3: (n = 40 , i = 3 , k = 4)
RQ3 = 40(3)/4 + 0.5 = 30.5
For P70: (n = 40 , i = 70 , k = 100)
RP70 = 40(70)/100 + 0.5 = 28.5
- Computing the values of the measures:

D3 = (11 + 11) / 2 = 11

q1 = (9 + 9) / 2 = 9

Q3 = (16 + 16) / 2 = 16

P70 = (16 + 16) / 2 = 16
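The rank rule Ri = n(i)/k + 0.5 works for quartiles, quintiles, deciles, and percentiles alike, so it can be sketched as one Python function; here it is checked against the car-theft quartiles computed above:

```python
import math

def position_measure(data, i, k):
    """Value of the i-th measure that splits the ranked data into k equal parts,
    using the rank rule R = n*i/k + 0.5 (formula 9); when R falls between two
    ranks, the two neighbouring values are averaged."""
    s = sorted(data)
    r = len(s) * i / k + 0.5
    lo, hi = math.floor(r), math.ceil(r)
    return (s[lo - 1] + s[hi - 1]) / 2     # ranks are 1-based

thefts = [40, 34, 21, 30, 42, 12, 13, 41, 18, 19, 14, 26]
print(position_measure(thefts, 1, 4))      # 16.0  (Q1)
print(position_measure(thefts, 2, 4))      # 23.5  (Q2, the median)
print(position_measure(thefts, 3, 4))      # 37.0  (Q3)
```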

Exercises
Solve the following exercises.
1- Explain the meaning of an outlier or extreme value. Which measure of central
tendency is preferred for a dataset that contains outliers? Give an example.
2- Which of the three measures of central tendency (mean, median and mode) can
assume more than one value for a cardinal dataset? Give an example.
3- Which of the three measures of central tendency (mean, median and mode) can be
computed for cardinal data only? And which of them can be calculated for both
cardinal and ordinal dataset? And which of them can be found for nominal data
only? Give an example for each.
4- Is it possible for a cardinal data to have no mean, no median, or no mode? Give an
example of data for which this summary measure does not exist.
5- Explain the relationships among the mean, median, and mode for symmetric and
skewed distributions. Illustrate with graphs.
6- Consider a sample with dataset values of marks of 5 students in statistics course as
follow:
10 – 19 – 14 – 17 – 15
Find the mean, the harmonic mean and the median.
7- Consider a sample of 6 values as follow:
12 – 13 – 17 -11 – 18 -16
Compute the mean, the harmonic mean and the median.
8- Consider a sample with 20 values as follow:
43 – 35 – 56 – 47 – 58 – 54 – 56 – 55 – 45 – 56 – 54 – 38 - 56 – 74 - 98 -87 - 67 –
98 – 65 - 78
Calculate the mode, the 10% trimmed mean, seventh decile, second quintile, third quartile,
the eighty fifth percentile, and the percentile rank of the value 55.
9- Compute the combined mean for the following datasets:
a- 13-15-25-21-34
b- 23-54-38-48-65-45-54
c- 45-56-76-88-53-65-49-58
10- Suppose that a professor gives two exams and a final, assigning the final
exam a weight of 65%, the midterm exam a weight of 20%, and the test exam a
weight of 15%. Find the weighted mean for a student whose scores are: 76 for the
midterm, 64 for the test exam, and 80 for the final exam.
11- Assume that the inflation rates for the last four months are: 13%, 14%, 12%,
and 11% respectively. Find the price indexes at the end of each month and compute
the best mean rate of inflation over the four-month period.

Chapter (4)
Measures of Variability
So far, we have been able to compute measures of location or central tendency and
measures of position, but these measures fail to tell us the whole picture about the
distribution of a dataset. Once we know the average value of a set of quantitative data,
our next question must be: how typical is the average value of all measurements in the
dataset? In other words, how spread out are the measurements about their average value?
Are the measurements highly variable and widely dispersed about the average value, as
depicted by the smoothed relative frequency polygon, or do they exhibit low variability
and cluster about the average value?
The importance of looking beyond the average value is borne out by the fact that many
people make use of the concept of variability in everyday decision making, whether or
not they compute a measure of dispersion.
Because the measures of central tendency do not reveal the whole picture of the
distribution of data, two data sets with the same mean may have completely different
spreads. The variation among the values of observations for one data set may be much
larger or smaller than for the other dataset. So, we need a measure that can provide some
information about the variation among the data values. The measures that help us learn
about the spread or dispersion or variation of a dataset are called the measures of
dispersion.
This part explains some measures of dispersion: the range, mean absolute deviation,
variance, standard deviation, interquartile range, semi-interquartile range, semi-standard
deviation, semi-variance, and coefficient of variation.

(1) Range:
The simplest measure of dispersion or variability is the range. It is obtained by calculating
the difference between the largest value and the smallest value in the dataset.
Although the range is the simplest and easiest measure of dispersion, it is rarely used as
the only measure of spread. The reason is that the range is based on only two values of the
observations, and it is highly influenced by extreme values.
G = Lv – Sv (1)

Where Lv is the largest value, Sv is the smallest value, and G is the range.

Example (1)
The following data represent the scores of eight students in statistics and investment
mathematics:
Stat (A): 12 6 10 8 14 19 7 18
IM (B): 6 14 15 18 16 17 19 18
Find the range for the two datasets and comment.
Solution:
For A:
G = Lv – Sv

= 19 – 6 = 13
For B:
G = 19 – 6 = 13
The range is the same for the two courses.
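A two-line Python sketch of formula (1), using the score lists of Example (1):

```python
# Range G = Lv - Sv for the two score lists
stat_a = [12, 6, 10, 8, 14, 19, 7, 18]
im_b   = [6, 14, 15, 18, 16, 17, 19, 18]
print(max(stat_a) - min(stat_a))   # 13
print(max(im_b) - min(im_b))       # 13
```
Note that the two datasets share the same range even though their values are spread quite differently, which is exactly why the range alone is a weak measure of dispersion.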

(2) Interquartile and Semi-interquartile Range:


As we discussed before, the interquartile range, denoted by (IQR), is the difference
between the third quartile and the first quartile. The semi-interquartile range is defined as
the interquartile range divided by 2. These are two measures of variability that overcome
the dependency on outliers or extreme values.

SIQR = IQR / 2 = (Q3 – Q1) / 2 (2)

Example (2):
Find the semi-interquartile range for the following data.
[16, 19, 15, 14, 17, 18, 8, 15, 12, 10]
Solution:
- First, we rank the data in ascending order:
8 10 12 14 15 15 16 17 18 19
- Finding Q3:

RQ3 = 10(3)/4 + 0.5 = 8 , so Q3 = 17

- Finding Q1:

RQ1 = 10(1)/4 + 0.5 = 3 , so Q1 = 12

- Finding SIQR:

SIQR = (Q3 – Q1) / 2 = (17 – 12) / 2 = 2.5
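The steps above can be sketched in Python; both quartile ranks happen to be whole numbers for this dataset, so no averaging is needed:

```python
# Semi-interquartile range using the rank rule R = n*i/4 + 0.5
data = sorted([16, 19, 15, 14, 17, 18, 8, 15, 12, 10])
n = len(data)
q1 = data[int(n * 1 / 4 + 0.5) - 1]    # rank 3 -> 12
q3 = data[int(n * 3 / 4 + 0.5) - 1]    # rank 8 -> 17
siqr = (q3 - q1) / 2
print(siqr)   # 2.5
```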

(3) Mean Absolute Deviations:


As we have seen, the range, the interquartile range, and the semi-interquartile range each
depend on only two values. The range uses the largest and smallest values, while the
interquartile and semi-interquartile ranges use the first and third quartiles. So, these
measures depend on only some of the values, not all the values in the dataset, when
describing the variability of the data.
The mean absolute deviation is a measure of dispersion that uses the absolute deviations
between each value and the mean of the data, and computes the average of these absolute
deviations as follows:

A.D.P = Σ|xi – µ| / N for a population (3)

A.D.S = Σ|xi – x̄| / n for a sample (4)

Example (3):
Assume that the following data list the scores of 6 students:
14, 14, 16, 20, 13, 19
Find the mean absolute deviations.
Solution:
- First, we compute the sample mean:

x̄ = Σxi / n = 96 / 6 = 16

- Second, we calculate the mean absolute deviation:

x : 14 14 16 20 13 19 , Σx = 96
|x – x̄| : 2 2 0 4 3 3 , Σ|x – x̄| = 14

A.D.S = Σ|xi – x̄| / n = 14 / 6 = 2.333
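Formula (4) can be sketched in Python with the scores of Example (3):

```python
# Mean absolute deviation: average of |x - mean|   (formula 4)
scores = [14, 14, 16, 20, 13, 19]
mean = sum(scores) / len(scores)                       # 16.0
mad = sum(abs(x - mean) for x in scores) / len(scores)
print(round(mad, 3))   # 2.333
```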

(4) Median Absolute Deviations:


This measure uses the absolute deviations between each value and the median of the
data, and then computes the median of these absolute deviations:

A.D.PM = median of |x – M| for a population (5)

A.D.SM = median of |x – m| for a sample (6)

Where M is the median of the population and m is the median of the sample; there is no
difference in the calculations.

Example (4):
Find the median absolute deviations for the data in example (3).
Solution:
- First, we rank the data in ascending order as follow:
13 – 14 – 14 – 16 – 19 – 20
- Second, we compute the median as follow:
M = (14 + 16) / 2 = 15
- Find the absolute deviations between each ranked value and the median:
|x – M| = 2 , 1 , 1 , 1 , 4 , 5
- Rank the absolute deviations:
1 , 1 , 1 , 2 , 4 , 5
- Median A.D. = median of |x – M| = (1 + 2) / 2 = 1.5
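The same two-step procedure (median, then median of absolute deviations) as a Python sketch, checked against Example (4):

```python
# Median absolute deviation: median of |x - median|
def median(data):
    s = sorted(data)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

scores = [14, 14, 16, 20, 13, 19]
m = median(scores)                                  # 15.0
mad_m = median([abs(x - m) for x in scores])
print(mad_m)   # 1.5
```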

(5) Variance and Standard Deviation:


Variance and standard deviation are the two most widely used and accepted measures of
the variability of a cardinal dataset. They are closely related to each other. They take into
account all the values in the dataset. They are based on the difference between the value
of each observation and the mean. The difference between each value and the mean is
called a deviation about the mean. These deviations are squared when computing the
variance. The variance of population data is the average of these squared deviations.
The value of variance and standard deviation tells us how closely the values of a dataset
are clustered around the average. In general, a smaller value of variance or standard
deviation indicates that the values of the dataset are spread over a relatively smaller range
around the mean. In contrast, a larger value of variance or standard deviation indicates that
the values of that dataset are spread over a relatively larger range around the mean.
The standard deviation is the positive square root of the variance. The variance for
population data is denoted by (σ2), and the variance for sample data is denoted by (s 2).
Consequently, the standard deviation computed for population data is denoted by (σ), and
the standard deviation calculated for sample data is denoted by (s).
The sample standard deviation (s) is an estimator of the population standard deviation (σ),
and the standard deviation is easier to interpret than the variance because the standard
deviation is measured in the same units as the data.
The following formulas are used to compute the variance and standard deviation for
population and sample data:

σ² = Σ(xᵢ − µ)² / N   and   σ = √[ Σ(xᵢ − µ)² / N ]   (7)

σ² = [ N Σxᵢ² − (Σxᵢ)² ] / N²   and   σ = √{ [ N Σxᵢ² − (Σxᵢ)² ] / N² }   (8)

S² = Σ(xᵢ − x̄)² / (n − 1)   and   S = √[ Σ(xᵢ − x̄)² / (n − 1) ]   (9)

S² = [ n Σxᵢ² − (Σxᵢ)² ] / [ n(n − 1) ]   and   S = √{ [ n Σxᵢ² − (Σxᵢ)² ] / [ n(n − 1) ] }   (10)

Example (5):
The following data list the ages of 8 managers in a firm by years:

55 54 51 55 53 53 54 52
Find the mean, variance, and standard deviation for population and sample.
Solution:
- Finding the mean;

xi (xi)2 xi (xi)2
55 3025 53 2809
54 2916 53 2809
51 2601 54 2916
55 3025 52 2704
Σx = 427 ,  Σx² = 22805

µ = Σxᵢ / N ,   x̄ = Σxᵢ / n

x̄ = 427 / 8 = 53.375
- Computing variance and standard deviation for population:

σ² = [ N Σxᵢ² − (Σxᵢ)² ] / N² = [ 8 (22805) − (427)² ] / 8²
   = (182440 − 182329) / 64 = 111 / 64 = 1.734

σ = √σ² = √1.734 = 1.317
- Computing variance and standard deviation for sample:

S² = [ n Σxᵢ² − (Σxᵢ)² ] / [ n(n − 1) ] = [ 8 (22805) − (427)² ] / [ 8 (8 − 1) ]
   = 111 / 56 = 1.982

S = √S² = √1.982 = 1.408
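These results can be checked against Python's standard-library `statistics` module, which implements both the population and sample forms:

```python
import statistics as st

ages = [55, 54, 51, 55, 53, 53, 54, 52]

mean = st.mean(ages)           # 53.375
pop_var = st.pvariance(ages)   # divides by N:     111/64 = 1.734375
pop_sd = st.pstdev(ages)       # sqrt of the above, about 1.317
samp_var = st.variance(ages)   # divides by n - 1: 111/56, about 1.982
samp_sd = st.stdev(ages)       # sqrt of the above, about 1.408
```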

Example (6)
Suppose that the following table lists the positive profits (+) and negative profits (−) for a
fire insurance line during the past 10 years, in millions of L.E.

-9 11 -8 14 -7
6 -5 15 -6 14

Compute the variance and the standard deviation, and compare the population parameters
with the sample statistics.
Solution
Mean for population and sample = 25/10=2.5 million L.E profit, and sample or population
size =10
The following table and figure show the calculations.
Then, compute the required measures.
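The required measures can be reproduced with a short sketch that applies the shortcut formulas (8) and (10) to the profit data:

```python
from math import sqrt

profits = [-9, 11, -8, 14, -7, 6, -5, 15, -6, 14]
n = len(profits)                    # 10
sx = sum(profits)                   # 25, so the mean is 2.5
sx2 = sum(x * x for x in profits)   # 1029

pop_var = (n * sx2 - sx ** 2) / n ** 2           # 9665/100 = 96.65
samp_var = (n * sx2 - sx ** 2) / (n * (n - 1))   # 9665/90, about 107.39
pop_sd, samp_sd = sqrt(pop_var), sqrt(samp_var)  # about 9.83 and 10.36
```

As always, the sample statistics exceed the population parameters because of the smaller (n − 1) divisor.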
(6) Semi-Variance and Semi-Standard Deviation:
The semi-standard deviation measures only the standard deviation of the downside
deviations and ignores the upside fluctuations. The calculation of the semi-standard
deviation involves only the values below the threshold in the case of profits (negative
profits or negative cash flows), which are called the downside deviations or risk, or the
values above the threshold in the case of losses (positive losses), which may be known as
the upside deviations or risk.

Example (7)
Find the semi-variance and semi-standard deviation from the data involved in example (6),
and compare the population with the sample computations.
Solution
Here, we will deal with the negative profits only.
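A minimal sketch of this computation, assuming a threshold of zero so that only the negative profits enter the sum (the choice of threshold and divisor varies between authors; here the full N is used for the population form and n − 1 for the sample form):

```python
from math import sqrt

profits = [-9, 11, -8, 14, -7, 6, -5, 15, -6, 14]
threshold = 0  # assumed threshold: only profits below zero count as downside

# squared downside deviations; values above the threshold contribute zero
downside_sq = [min(x - threshold, 0) ** 2 for x in profits]

n = len(profits)
pop_semivar = sum(downside_sq) / n          # 255/10 = 25.5
samp_semivar = sum(downside_sq) / (n - 1)   # 255/9, about 28.33
pop_semisd = sqrt(pop_semivar)              # about 5.05
samp_semisd = sqrt(samp_semivar)            # about 5.32
```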

(7) Coefficient of Variation:
One disadvantage of the standard deviation as a measure of variability is that it is a
measure of absolute variability and not of relative variability. In some cases, we may need
a descriptive statistic that indicates how large the standard deviation is relative to the
mean, or how large the semi-interquartile range is relative to the median. Also, sometimes
we may need to compare the variability of two different datasets that have different units

of measurements. These measures are called the coefficient of variation, or the quartile
coefficient of variation. So, the coefficient of variation is a relative measure of variability.
It measures the standard deviation relative to the mean, and denoted by (CV) or the semi-
interquartile range relative to the median, and denoted by (QCV) and they are usually
expressed as a percentage as follows:

CV = σ (100) / µ %     for population   (11)

CV = S (100) / x̄ %     for sample   (12)

QCV = (SIQR / Mr) × 100 = [ (Q3 − Q1) / (2 Mr) ] × 100   (13)

Example (8):
For the data included in example (5), compute the coefficient of variation for population
and sample and compare.
Solution:
For population:

CV_pop = σ (100) / µ = 1.317 (100) / 53.375 = 2.47%

For sample:

CV_sample = S (100) / x̄ = 1.408 (100) / 53.375 = 2.64%
Variance and Standard Deviation for Grouped Data
The basic formulas used to compute the population and sample variance and standard
deviation for grouped data, where m is the class midpoint and f is the class frequency, are
as follows:

σ² = Σf(m − µ)² / N ,   σ² = [ N Σm²f − (Σmf)² ] / N²     for population

s² = Σf(m − x̄)² / (n − 1) ,   s² = [ n Σm²f − (Σmf)² ] / [ n(n − 1) ]     for sample

σ = √σ² ,   s = √s²

Example (9):
The following table gives the frequency distribution of the amount of telephone calls for
one week for a sample of 100 families.

Amount by LE No of families
40 - 18
60 - 22
80 - 32
100 - 20
120 – 140 8
Total 100
Find the variance, and standard deviation.
Solution:
C f m mf m2f
40 - 18 50 900 45000
60 - 22 70 1540 107800
80 - 32 90 2880 259200
100 - 20 110 2200 242000
120 – 140 8 130 1040 135200
Total 100 8560 789200
s² = [ n Σm²f − (Σmf)² ] / [ n(n − 1) ] = [ 100 (789200) − (8560)² ] / [ 100 (100 − 1) ]
   = (78920000 − 73273600) / 9900 = 5646400 / 9900 = 570.34

S = √570.34 = 23.88
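The grouped-data computation above can be sketched as:

```python
from math import sqrt

midpoints = [50, 70, 90, 110, 130]   # class midpoints m
freqs = [18, 22, 32, 20, 8]          # class frequencies f

n = sum(freqs)                                           # 100 families
smf = sum(m * f for m, f in zip(midpoints, freqs))       # Σmf  = 8560
sm2f = sum(m * m * f for m, f in zip(midpoints, freqs))  # Σm²f = 789200

samp_var = (n * sm2f - smf ** 2) / (n * (n - 1))  # 5646400/9900, about 570.34
samp_sd = sqrt(samp_var)                          # about 23.88
```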

(8) Five-Number Summary and Box Plot:
Five-number summary is a technique of exploratory data analysis. In this technique, the
following five measures are used to summarize the dataset.
1- Smallest value (Sv)
2- First quartile (Q1)
3- Median or second quartile (Q2)
4- Third quartile (Q3)
5- Largest value (Lv)
The simplest and easiest way to find a five-number summary is to first rank the data in
ascending order. Then identify the smallest and largest values and compute the three
quartiles.
The box plot is a graphical summary of data that is based on the five-number summary
measures. To draw a box plot, we compute the three quartiles and the interquartile range.
So, to construct or draw the box plot, the following steps are employed:
1- A box is drawn with ends located at the first and third quartiles. This box includes
the middle 50% of the dataset.
2- A vertical line is drawn in the box above the location of the median or second
quartile.
3- Compute and locate the limits or the lower inner fence (LIF) and the upper inner
fence (UIF) for the box plot by using the interquartile range (IQR).
Where:
LIF = Q1 – 1.5 (IQR)
UIF = Q3 + 1.5 (IQR)
4- Two whiskers are drawn by solid or dashed lines from the ends of the box to the
smallest and largest values inside the lower and upper inner fences or limits as
sometimes called.
5- If some observations exceed the upper inner fence or fall below the lower inner
fence, they are called mild outliers.
6- To identify the extreme values, we compute the two outer fences, the lower outer
fence (LOF) and the upper outer fence (UOF), as follows:
LOF = Q1 – 3 (IQR)
UOF = Q3 + 3 (IQR)
If any value exceeds the upper outer fence or falls below the lower outer fence, it is called
an extreme value. So, outliers may be divided into two types: mild outliers and extreme
outliers. The location of each outlier is shown with the symbol (*).
A box-and-whisker plot can help us visualize the center, the dispersion or variability,
and the skewness of a dataset. It also helps in detecting outliers. We can compare
different distributions by making box-and-whisker plots for each of them.
So, we can define a box-and-whisker plot as a plot that shows the center, variability, and
skewness or shape of a dataset. It is constructed by drawing a box and two whiskers that
use the three quartiles and the smallest and largest values in the dataset between the two
inner fences.
For a symmetric dataset, the line representing the median or second quartile must be
located in the middle of the box, and the dispersion of values should be over almost the
same range of the two sides of the box.

Example (10):
The following data represent the scores of 15 students in statistic course:
8, 10, 7, 0, 11, 12, 9, 13, 11, 14, 15, 20, 13, 10, 7
Construct a box plot for this data.
Solution:
- First, we rank the data in ascending order as follow:

0, 7, 7, 8, 9, 10, 10, 11, 11 , 12, 13, 13, 14, 15, 20

- Computing the quartiles:

Rank of Mr = [ n(i) / k ] + 1/2 = [ 15(1) / 2 ] + 1/2 = 8

Or  Rank of Mr = (n + 1) / 2 = (15 + 1) / 2 = 8

So, Q2 = Mr = 11, which is the value number 8

RQ1 = (n + 1) / 4 = (15 + 1) / 4 = 4

Q1 = 8, which is the value number 4

RQ3 = 3(n + 1) / 4 = 48 / 4 = 12

Q3 = 13, which is the value number 12

- Computing the interquartile range:

IQR = Q3 – Q1 = 13 - 8 = 5
- Calculating the two inner fences:
LIF = Q1 – 1.5 (IQR)
= 8 – 1.5 (5)
= 0. 5
UIF = Q3 + 1.5 (IQR)
= 13 + 1.5 (5)
= 20.5
- Determining the smaller and larger value within the two inner fences:
Lv = 20 , Sv = 7
- Finding the two outer fences:
LOF = Q1 – 3 (IQR)
= 8 – 3 (5)
= -7
UOF = Q3 + 3 (IQR)
= 13 + 3 (5)
= 28
- Drawing the box plot:

[Box plot: box from Q1 = 8 to Q3 = 13 with a line at the median 11; whiskers to
Sv = 7 and Lv = 20; LIF = 0.5 and UIF = 20.5; the value 0 is marked with (*) on a
scale from 0 to 21.]

The value zero is a mild outlier, but there is no extreme value.
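The quartiles and fences in this example can be checked with a short sketch that follows the same rank rule as the text, R = i(n + 1)/4, interpolating linearly when the rank is fractional:

```python
def quartile(sorted_data, i):
    """i-th quartile by the rank rule R = i(n + 1)/4 with linear interpolation."""
    r = i * (len(sorted_data) + 1) / 4
    lo, frac = int(r), r - int(r)
    q = sorted_data[lo - 1]                       # value at the integer rank
    if frac:                                      # interpolate toward the next value
        q += frac * (sorted_data[lo] - sorted_data[lo - 1])
    return q

scores = sorted([8, 10, 7, 0, 11, 12, 9, 13, 11, 14, 15, 20, 13, 10, 7])
q1, q2, q3 = (quartile(scores, i) for i in (1, 2, 3))      # 8, 11, 13
iqr = q3 - q1                                              # 5
lif, uif = q1 - 1.5 * iqr, q3 + 1.5 * iqr                  # 0.5 and 20.5
mild_outliers = [x for x in scores if x < lif or x > uif]  # [0]
```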

Example (11):
The following dataset list the yearly incomes by 1000 LE for a sample of 12 employees in
a firm.
37, 30, 44, 70, 34, 65, 40, 50, 55, 105, 39, 59
Draw a box- and-whisker plot for these data.
Solution:
- Rank the data in ascending order, and compute the three quartiles and
interquartile range. The ranked data are:
[30, 34, 37, 39, 40, 44, 50, 55, 59, 65, 70, 105]
Q1 Q2 Q3

Q3 = (59 + 65) / 2 = 62
Q2 = (44 + 50) / 2 = 47
Q1 = (37 + 39) / 2 = 38
IQR = Q3 – Q1 = 62 – 38 = 24
- Find the two inner fences:
LIF = Q1 – 1.5(IQR) = 38 – 1.5(24) = 2
UIF = Q3 + 1.5(IQR) = 62 + 1.5(24) = 98
- The smallest and largest values within the two inner fences are:
Sv = 30 , Lv = 70

- Draw a horizontal line that covers all values in the data and draw the box with its
sides located above the positions of the first and third quartiles. Inside the box,
draw a vertical line above the position of the median or second quartile.
- Draw two lines from the box to the smallest and largest values within the two
inner fences. These two lines are called whiskers.
- A value that falls outside the two inner fences is shown by marking an asterisk
(*), and is called an outlier. The box-and-whisker plot is shown as:
[Box-and-whisker plot: box from Q1 = 38 to Q3 = 62 with a line at the median 47;
whiskers to Sv = 30 and Lv = 70; LIF = 2 and UIF = 98; the value 105 is marked with
(*) on a scale from 0 to 110.]

- The value 105, which falls outside the two inner fences, is called an outlier. This
outlier may be a mild outlier or an extreme outlier. To determine which, we
compute the two outer fences as follows:
- LOF = Q1 – 3(IQR) = 38 – 3(24) = –34
- UOF = Q3 + 3(IQR) = 62 + 3(24) = 134
So, this outlier is mild and not an extreme value because it falls within the two outer fences.

Exercises
Answer the following questions:
1- Is it possible for a standard deviation to be negative? Why?
2- Is it possible for the standard deviation to be larger than the variance of the same
data? Why?
3- Calculate the range, the interquartile range, the semi-interquartile range for the
following sample data.
20,18,12,6,8,17,15,18, 16, 14
4- Compute the variance, the standard deviation, the mean absolute deviation, and the
median absolute deviation for the data involved in the above exercise.
5- Calculate the coefficient of variation and the quartile coefficient of variation for the
data included in exercise 3.
6- All the 15 stocks in your portfolio had the following rates of change in value over
last month.
4, 2, 6, -5, -3, 7, -10, 15, 14, -8, 7, 9, -4, 0, 11
Find the semi-variance and semi-standard deviation.
7- The following data represent the total value of assets of 15 banks by 100 millions
L.E.
65, 122, 210, 46, 58, 105, 86, 76, 58, 315, 64, 95, 134, 79, 65
Display the five summary measures and construct the box plot for banks. Are the data
symmetric or skewed? Is there any outlier? Which type?

Chapter (5)
Measures of Distribution Shape and Relative location
We have discussed several measures of central tendency, position and dispersion. In
addition, it is often very important to explain some measures of the shape and relative
location of the distribution of dataset.
We noted before that a histogram is a graphical display showing the shape of the
distribution of data. A box plot is also a display that indicates the skewness of a
distribution and detects outliers. In addition, it shows where the data are centered and how
spread out the data are.
The major numerical measures of the shape of a distribution are skewness, kurtosis, and
moments. In addition, we will discuss the standardized value (z-score), Chebyshev's
theorem, and the empirical rule.

(1) Z-Scores:
The standardized value or z-score is a measure of the relative location of values within the
dataset. Measures of relative location help us to know how far a specific value is from the
mean. We can compute the relative location of any value by using the mean and standard
deviation of the dataset as follow:
z = (x − µ) / σ ,   or   z = (x − x̄) / S   (1)

Where (µ) is the population mean, (σ) is the population standard deviation, (x) is any
observation, (x̄) is the sample mean, and (S) is the sample standard deviation.
So, the z-score can be defined as the number of standard deviations an observation x lies
from the mean, or the difference between the observation x and the mean, measured in
standard deviation units. If the value of an observation is greater than the mean, its z-score
will be larger than zero (positive); when the value of an observation is less than the mean,
its z-score will be smaller than zero (negative).
The mean of z-scores for all observations equal zero, and the standard deviation equal one.

Example (1):
A population contains 5 observations:

12 - 10 - 11 - 8 - 9

Find the mean and standard deviation of this data and compute the standardized values and
its mean and standard deviation.

Solution:

xi      xi²     (x – µ)     z          z²
12      144       2          1.4142    2.0
10      100       0          0.0       0.0
11      121       1          0.7071    0.5
8        64      –2         –1.4142    2.0
9        81      –1         –0.7071    0.5
Σ = 50  510       0          0.0       5.0

- The mean and standard deviation for the data:

- The mean of the population is:

µ = Σxᵢ / N = 50 / 5 = 10

- The standard deviation of the population is:

σ = √{ [ N Σxᵢ² − (Σxᵢ)² ] / N² } = √{ [ 5 (510) − (50)² ] / 5² }
  = √[ (2550 − 2500) / 25 ] = √2.0 = 1.4142

- The mean and standard deviation of the z-scores:

µ_z = Σz / N = 0 / 5 = 0

σ_z = √{ [ N Σz² − (Σz)² ] / N² } = √{ [ 5 (5) − (0)² ] / 5² } = √1.0 = 1

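A sketch of the standardization, confirming that the z-scores of any population have mean 0 and standard deviation 1:

```python
from math import sqrt

def z_scores(data):
    """Standardize a population: subtract the mean, divide by the population SD."""
    n = len(data)
    mu = sum(data) / n
    sigma = sqrt(sum((x - mu) ** 2 for x in data) / n)
    return [(x - mu) / sigma for x in data]

z = z_scores([12, 10, 11, 8, 9])
mean_z = sum(z) / len(z)                      # 0, up to rounding error
sd_z = sqrt(sum(v * v for v in z) / len(z))   # 1, up to rounding error
```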
(2) Chebyshev’s Theorem(or Inequality):

Chebyshev’s theorem or Inequality enables us to make statements about the proportion of


data values that located within a specified number of standard deviations of the mean. So,
this theorem gives a lower bound for the area under a curve between two points that are at
the same distance around the mean.

This theorem states that for any number (z) larger than 1, at least (1 − 1/z²) of the data
values lie within (z) standard deviations of the mean.

The importance of Chebyshev’s theorem stems from the fact that it applies to both sample
and population dataset of elements, regardless of the shape of their distribution.

Consequently, the value (1 − 1/z²) is a conservative lower bound on the percentage or
fraction of items in the interval (µ ± zσ) where z > 1. This is so because when z = 1, the
value of (1 − 1/z²) equals zero, and when z < 1, the value of (1 − 1/z²) is negative.

The illustration of this theorem for some specific values of (z) standard deviations as
follow:

- At least 50% of the data must be within 1.41 standard deviations of the mean (µ ± 1.41σ).
- At least 75% of the data must be within 2 standard deviations of the mean (µ ± 2σ).
- At least 89% of the data must be within 3 standard deviations of the mean (µ ± 3σ).
- At least 94% of the data must be within 4 standard deviations of the mean (µ ± 4σ).

The following figure illustrates Chebyshev’s theorem or inequality.

Example (2):

Assume that the final exam marks for 200 students in a statistics course had a mean of 70
marks and a standard deviation of 2 marks. How many students had marks between 66 and
74? And how many had marks between 65 and 75, using Chebyshev's theorem?

Solution:
z₁ = (x₁ − x̄) / s = (66 − 70) / 2 = −2

z₂ = (x₂ − x̄) / s = (74 − 70) / 2 = 2

Using Chebyshev's theorem, we can say that at least 75% of students, or 150 students,
must have marks between 66 and 74, as follows:

n₁ = (1 − 1/z²) × 200 = (1 − 1/2²) × 200 = 150

z₃ = (x₃ − x̄) / s = (65 − 70) / 2 = −2.5

z₄ = (x₄ − x̄) / s = (75 − 70) / 2 = 2.5

Using Chebyshev's theorem, we can say that at least 84% of students, or 168 students,
must have marks between 65 and 75, as follows:

n₂ = (1 − 1/z²) × 200 = (1 − 1/2.5²) × 200 = 168

Example (3):

Assume that the average scale of blood pressure for a sample of 500 persons was 187, and
the standard deviation was 22. Using Chebyshev’s inequality compute at least how many
persons in this group have a blood pressure scale between 121 and 253.

Solution:

µ = 187 ,  σ = 22
z₁ = (121 − 187) / 22 = −66 / 22 = −3
z₂ = (253 − 187) / 22 = 66 / 22 = 3
1 − 1/z² = 1 − 1/3² = 1 − 1/9 = 1 − 0.11 = 0.89
N = 500 × 0.89 = 445

Hence, according to Chebyshev’s inequality, at least about 89%, or approximately 445
persons of the group have blood pressure scale between 121 and 253.
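The bound itself is a one-line computation; a sketch applied to this example:

```python
def chebyshev_bound(mean, sd, lo, hi):
    """Lower bound 1 - 1/z**2 on the fraction within a symmetric interval [lo, hi]."""
    z = (hi - mean) / sd  # equals (mean - lo) / sd for a symmetric interval
    assert z > 1, "the bound is informative only for z > 1"
    return 1 - 1 / z ** 2

frac = chebyshev_bound(187, 22, 121, 253)  # 1 - 1/9, about 0.889
persons = 500 * frac                       # about 444.4, i.e. roughly 445 persons
```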

(3) The Empirical Rule:

As we explained before, Chebyshev’s theorem is applicable to any dataset regardless of


the shape of the distribution of the dataset. In many applications, however, datasets follow
a symmetric or bell shaped distribution, also called normal distribution.

In such cases, the empirical rule can be used to determine the fraction or percentage of
data elements that must be located within a specified number of standard deviations about
the mean. The empirical rule applies to both population and sample dataset.

According to the empirical rule; for data having a bell-shaped distribution, approximately:

- 68.3% of the data lie within one standard deviation of the mean, or within (µ ± σ).

- 95.4% of the data lie within two standard deviations of the mean, or within (µ ± 2σ).

- 99.7%, or almost all, of the data lie within three standard deviations of the mean, or
within (µ ± 3σ).

The following figure illustrates the empirical rule.

Example (4):

Suppose that the age distribution of a sample of 4000 persons is normal with a mean of 40
years and a standard deviation of 6 years. Find the approximate number of persons who
are 34 to 46 years old.
Solution:
x̄ = 40 ,  s = 6
z₁ = (34 − 40) / 6 = −1
z₂ = (46 − 40) / 6 = 1
So, the area from 34 to 46 equals the area from (x̄ − s) to (x̄ + s). Because the area
within one standard deviation of the mean is approximately 68.3% for the normal curve,
the number of persons in the sample who are 34 to 46 years old is approximately:
N = 0.683 × 4000 = 2732

Example (5):
The yearly incomes of all families in a city may follow a bell-shaped distribution or a
skewed distribution, with a mean of 6000 L.E and a standard deviation of 100 L.E. Using
the empirical rule and Chebyshev's theorem, find the percentage of all families with
incomes between 5800 and 6200 L.E.
Solution:
If the distribution is skewed, we will use Chebyshev's theorem as follows:

µ + zσ = 6000 + z (100) = 6200
100 z = 6200 − 6000 = 200 ,  so z = 2

µ − zσ = 6000 − z (100) = 5800
100 z = 5800 − 6000 = −200 ,  so z = −2

So, the percentage of families with incomes between 5800 and 6200 L.E is the same as
(µ ± 2σ), or at least 75%.

If the distribution is bell-shaped, we will use the empirical rule as follows:

(6000 ± 200) = (µ ± 2σ) = 95%

(4) Moments:
In mathematical vernacular, a moment means a quantity raised to some power. In
statistics, moments are used to describe the various characteristics of a frequency
distribution, such as central tendency, variability, skewness, and kurtosis.
The moments may be taken about the origin point or zero, about an assumed mean (for
example the value A, where A ˃ 0), or about the mean. In the last case, the moments
taken about the mean are called central moments, and are denoted by µ1, µ2, µ3, and µ4.
Finally, the moments may be calculated in standard units, and are then called the standard
moments.

1- Moments Taken About the Origin:

The moments taken about zero are denoted by (mr′) and are calculated as:

mr′ = Σ(x − 0)ʳ / n = Σxʳ / n ,   where r = 1, 2, 3, 4   (2)

So, the first moment about zero equals the mean (x̄) as follows:

m1′ = Σx / n = x̄   (3)

And the other three moments about zero are:

m2′ = Σx² / n ,   m3′ = Σx³ / n ,   m4′ = Σx⁴ / n

Example (6):
The following data represent the number of bedrooms (x) for five families:

x = 1,3,2,5,4

Find the first four moments about zero.

Solution:

X x2 x3 x4

1 1 1 1

3 9 27 81

2 4 8 16

5 25 125 625

4 16 64 256

 15 55 225 979

m1′ = Σx / n = 15 / 5 = 3

m2′ = Σx² / n = 55 / 5 = 11

m3′ = Σx³ / n = 225 / 5 = 45

m4′ = Σx⁴ / n = 979 / 5 = 195.8

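The calculation generalizes to any order r; a sketch:

```python
def raw_moment(data, r):
    """r-th moment about zero: the sum of x**r divided by n."""
    return sum(x ** r for x in data) / len(data)

rooms = [1, 3, 2, 5, 4]
moments = [raw_moment(rooms, r) for r in (1, 2, 3, 4)]
# [3.0, 11.0, 45.0, 195.8], matching m1', m2', m3', m4' above
```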
2- Moments Taken About the Mean:

Moments taken about the mean, or the central moments, are computed as follows:

µr = Σ(x − x̄)ʳ / n   (4)

So, the first moment about the mean equals zero as follows:

µ1 = Σ(x − x̄) / n = 0   (5)

The second central moment equals the population variance, or approximately the sample
variance, as follows:

µ2 = Σ(x − x̄)² / n = [ n Σx² − (Σx)² ] / n² = σ² ≈ S²   (6)

The third and fourth central moments are calculated as:

µ3 = Σ(x − x̄)³ / n ,   µ4 = Σ(x − x̄)⁴ / n   (7)

3- Moments Taken About an Assumed Mean (A):

These moments are calculated as follows:

mr′ = Σ(x − A)ʳ / n = Σdʳ / n ,   where d = x − A   (8)

4- The Standardized Moments:

The standardized moments use the standard values of the data, not the data itself. So, the
standard moments are unitless because they depend on the standard values (z) of the
dataset. The standard moments are denoted by (αr) and are calculated as:

αr = Σzʳ / n = [ Σ(x − x̄)ʳ / n ] / Sʳ = µr / (µ2)^(r/2)   (9)

Where:

µr = Σ(x − x̄)ʳ / n ,   √µ2 = S ,   µ2 ≈ S²

and  z = (x − x̄) / S   or   z = (x − µ) / σ

The first standard moment equals zero because the mean of the standard distribution
equals zero, as follows:

α1 = Σz / n = 0 / n = 0   (10)

The second standard moment equals one because both the variance and the standard
deviation of the standard distribution equal one, as follows:

α2 = Σz² / n = s² / s² = 1   (11)

The third and the fourth standard moments are used as measures of skewness and kurtosis:

α3 = Σz³ / n = µ3 / σ³ = µ3 / (µ2)^(3/2)   (12)

α4 = Σz⁴ / n = µ4 / (µ2)²

The Relationship Between the Central Moments and the Moments Taken About Any
Value:
Because the calculations of central moments are more difficult than those taken about zero
or any other value, we can compute the central moments from the moments taken about
any value by the following equations:

µ2 = m2′ − (m1′)²
µ3 = m3′ − 3 m1′ m2′ + 2 (m1′)³
µ4 = m4′ − 4 m1′ m3′ + 6 (m1′)² (m2′) − 3 (m1′)⁴
Example (7):
Find the central moments and first four moments about the value 4 for the following data.
x=1 , 4 , 5 , 2
Solution:
Finding the mean:

x̄ = Σx / n = 12 / 4 = 3
Finding the central moments:

x dxx d2 d3 d4

1 1 – 3 = –2 4 –8 16

4 4–3=1 1 1 1

5 5–3=2 4 8 16

2 2 – 3 = –1 1 –1 1

∑ 0 10 0 34

µ1 = Σ(x − x̄) / n = Σd / n = 0 / 4 = 0

µ2 = Σ(x − x̄)² / n = Σd² / n = 10 / 4 = 2.5

µ3 = Σ(x − x̄)³ / n = Σd³ / n = 0 / 4 = 0

µ4 = Σ(x − x̄)⁴ / n = Σd⁴ / n = 34 / 4 = 8.5
Finding the first four moments about the value 4:

X d=x–A d2 d3 d4

1 1 – 4 = –3 9 –27 81

4 4–4=0 0 0 0

5 5–4=1 1 1 1

2 2 – 4 = –2 4 –8 16

∑ –4 14 –34 98

m1′ = Σd / n = −4 / 4 = −1

m2′ = Σd² / n = 14 / 4 = 3.5

m3′ = Σd³ / n = −34 / 4 = −8.5

m4′ = Σd⁴ / n = 98 / 4 = 24.5

Example (8):
Prove the relationship between the central moments and the moments taken about the
value 4 using the results of example (7).
Solution:
From example (7), we have:
m1’ = –1 , µ1 = 0
m2’ = 3.5 , µ2 = 2.5
m3’ = –8.5 , µ3 = 0
m4’ = 24.5 , µ4 = 8.5
Because the first central moment equals zero, the other central moments are:

µ2 = m2′ − (m1′)² = 3.5 − (−1)² = 3.5 − 1 = 2.5

µ3 = m3′ − 3 m1′ m2′ + 2 (m1′)³
   = −8.5 − 3 (−1)(3.5) + 2 (−1)³
   = −8.5 + 10.5 − 2 = 0

µ4 = m4′ − 4 m1′ m3′ + 6 (m1′)² m2′ − 3 (m1′)⁴
   = 24.5 − 4 (−1)(−8.5) + 6 (−1)² (3.5) − 3 (−1)⁴
   = 24.5 − 34 + 21 − 3 = 8.5

These are the same results as in example (7).
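The conversion equations can be wrapped in a small helper and checked against these numbers:

```python
def central_from_raw(m1, m2, m3, m4):
    """Central moments mu2, mu3, mu4 from moments m1..m4 taken about any value."""
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4
    return mu2, mu3, mu4

# moments about A = 4 from example (7)
print(central_from_raw(-1, 3.5, -8.5, 24.5))  # (2.5, 0.0, 8.5)
```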

Example (9):
Find the standard moments for the data included in the above example.
Solution:

α1 = 0

α2 = 1

α3 = µ3 / (µ2)^(3/2) = 0 / (2.5)^(3/2) = 0

α4 = µ4 / (µ2)² = 8.5 / (2.5)² = 8.5 / 6.25 = 1.36

(5) Measures of skewness:


Measures of central tendency obtain a representative summary value for the dataset. From
the measures of dispersion or variability, we can know that whether most of elements of
the dataset are close to or far away from these central tendency measures. But all these
measures are not enough to draw sufficient inferences about the shape of the distribution
of the dataset. Another aspect of the distribution of the dataset is to know its symmetry.
Using the graphic display of the distribution of data, we can know whether the shape of
the distribution is symmetric about the mean or not. We know that for symmetric
distribution, the mean, median, and mode are equals. This symmetry is well studied by the
knowledge of the measures of skewness.
The measures of skewness are statistical techniques that indicate the direction and extent
of the skewness of the distribution of a quantitative dataset. A frequency distribution of
quantitative data that is not symmetrical (normal) is called asymmetrical or skewed. As we
discussed before, there are three types of distributions:
- Symmetrical distribution, where:
Mean = median = mode
- Positively skewed distribution (right skewed), where:
Mean ˃ median ˃ mode
- Negatively skewed distribution (left skewed), where:
Mean < median < mode
The following figures display these types:

For an asymmetrical distribution, the distance between the mean and the mode may be
used to measure the absolute degree of skewness because the mean is equal to the mode in
symmetrical distribution. Thus, the absolute measure of skewness is:
Skewness = mean – mode
Or, sk = Q3 + Q1 – 2 median
For skewed distribution, if mean is larger than mode, the skewness becomes positive (+),
otherwise, it will be negative (-).
The common relative measures of skewness are called Karl Pearson’s Coefficient of
Skewness, and computed as:

(x  M e )
sk1  (13) where, Me is the mode.
S
3( x  M r )
sk 2  (14) where, Mr is the median.
S
The value of (sk) range is between ∓ 3, for the moderately skewed distributions, this value
lies between ∓ 1.

- 72 -
There are two other relative measures of skewness; they are:
- Bowley's Coefficient of Skewness:

sk3 = [ (Q3 − Q2) − (Q2 − Q1) ] / (Q3 − Q1) = (Q3 − 2Q2 + Q1) / (Q3 − Q1)   (15)

- Kelley's Coefficient of Skewness:

sk4 = [ (P90 − P50) − (P50 − P10) ] / (P90 − P10) = (P90 − 2P50 + P10) / (P90 − P10)   (16)

Skewness may also be defined as the standardized third central moment, and computed as:

sk5 = α3 = µ3 / S³ = µ3 / (µ2)^(3/2)   (17)

Example (10):
Assume that the pocket money of five students are:
x= 10, 15, 20, 10, 55
Find the five relative measures of skewness for this population data.
Solution:
The ranked data are:
10 – 10 – 15 – 20 - 55
x̄ = Σx / n = 110 / 5 = 22 ,   Me = 10

RQ1 = [5 (1/4)] + 0.5 = 1.75 ,  Q1 = 10 + 0.75 (10 − 10) = 10

Q2 = P50 = Mr = 15

RQ3 = [5 (3/4)] + 0.5 = 4.25 ,  Q3 = 20 + 0.25 (55 − 20) = 28.75

RP90 = (5 × 90 / 100) + 0.5 = 5 ,  P90 = 55

RP10 = (5 × 10 / 100) + 0.5 = 1 ,  P10 = 10

s = √[ Σ(x − x̄)² / n ]
  = √{ [ (10 − 22)² + (10 − 22)² + (15 − 22)² + (20 − 22)² + (55 − 22)² ] / 5 }
  = √[ (144 + 144 + 49 + 4 + 1089) / 5 ] = √(1430 / 5) = √286 = 16.9

sk1 = (x̄ − Me) / S = (22 − 10) / 16.9 = 0.71

sk2 = 3 (x̄ − Mr) / S = 3 (22 − 15) / 16.9 = 21 / 16.9 = 1.24

sk3 = (Q3 − 2Q2 + Q1) / (Q3 − Q1) = (28.75 − 2 × 15 + 10) / (28.75 − 10) = 0.47

sk4 = (P90 − 2P50 + P10) / (P90 − P10) = (55 − 2 × 15 + 10) / (55 − 10) = 35 / 45 = 0.78

µ3 = Σ(x − x̄)³ / n = [ (−12)³ + (−12)³ + (−7)³ + (−2)³ + (33)³ ] / 5
   = (−1728 − 1728 − 343 − 8 + 35937) / 5 = 6426

µ2 = Σ(x − x̄)² / n = s² = 286

sk5 = µ3 / S³ = µ3 / (µ2)^(3/2) = 6426 / (286)^(3/2) = 6426 / 4836.7 = 1.33
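The moment-based coefficient sk5 can be computed directly from its definition; a sketch checked against this example:

```python
def moment_skewness(data):
    """sk5 = mu3 / mu2**1.5: the standardized third central moment (population form)."""
    n = len(data)
    mean = sum(data) / n
    mu2 = sum((x - mean) ** 2 for x in data) / n   # population variance
    mu3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    return mu3 / mu2 ** 1.5

money = [10, 15, 20, 10, 55]
sk5 = moment_skewness(money)   # about 1.33; positive, so skewed to the right
```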

(6) Measures of Kurtosis:


Sometimes we need to know the peakedness or the flatness of the distribution of a dataset.
This is understood by what is known as kurtosis. Kurtosis is the degree of flatness or
peakedness in the region of the mode of a frequency curve. It is measured relative to the
peakedness of the normal curve. It tells us the extent to which a distribution is more
peaked or more flat-topped than the normal curve.
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a
normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or
outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers.
If the curve is more peaked than the normal curve, it is called leptokurtic. In this case,
elements are more clustered about the mode. If the curve is more flat-topped than the
normal curve, it is called platykurtic. The normal curve itself is called mesokurtic.
The kurtosis of the normal distribution is 3, or (k − 3 = 0). Positive excess kurtosis
(k − 3 > 0) indicates peakedness, and negative excess kurtosis (k − 3 < 0) indicates
flatness. Excess kurtosis is defined so that the standard normal distribution has a value of
zero. With this definition, positive excess kurtosis indicates a "heavy-tailed" distribution
and negative excess kurtosis indicates a "light-tailed" distribution.
You may remember that the mean and standard deviation have the same units as the
original data, and the variance has the square of those units. However, the kurtosis, like
skewness, has no units: it’s a pure number, like a z-score.
Traditionally, kurtosis has been explained in terms of the central peak. You’ll see
statements like this one: Higher values indicate a higher, sharper peak; lower values
indicate a lower, less distinct peak.
So, we can briefly say that:
- A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0). Any
distribution with kurtosis ≈ 3 (excess ≈ 0) is called mesokurtic.
- A distribution with kurtosis < 3 (excess kurtosis < 0) is called platykurtic.
Compared to a normal distribution, its tails are shorter and thinner, and often its
central peak is lower and broader.
- A distribution with kurtosis > 3 (excess kurtosis > 0) is called leptokurtic.
Compared to a normal distribution, its tails are longer and fatter, and often its
central peak is higher and sharper.
The following figure illustrates the types of kurtosis:
Kurtosis can be measured by the fourth standardized moment, which depends upon the
second and fourth central moments as follows:

k = α4 = µ4 / (µ2)²   (18)

Also, kurtosis may be measured using quartiles and percentiles as follows:

k = SIQR / (P90 − P10)   or   k = (1/2) (Q3 − Q1) / (P90 − P10)   (19)

Where:

k = 0.263 for mesokurtic

k > 0.263 for leptokurtic

k < 0.263 for platykurtic

Example (11):
Assume that the pocket money of 50 students prove that:
Q1 = 70 L.E, Q3 = 90 L.E, P10 = 60 L.E, P90 = 100L.E
Find the coefficient of kurtosis and describe the shape of the distribution.
Solution:
k = SIQR / (P90 − P10) = (1/2) (Q3 − Q1) / (P90 − P10)
  = (1/2) (90 − 70) / (100 − 60) = 10 / 40 = 0.25
So, the distribution is approximately normal or mesokurtic.
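The percentile coefficient of kurtosis is a one-liner; a sketch for this example:

```python
def percentile_kurtosis(q1, q3, p10, p90):
    """k = SIQR / (P90 - P10); about 0.263 for a mesokurtic (normal) curve."""
    return ((q3 - q1) / 2) / (p90 - p10)

k = percentile_kurtosis(70, 90, 60, 100)  # 0.25 -> approximately mesokurtic
```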
Exercises
1- Consider a sample of data contains values of 20, 12, 18, 15, and 25. Compute the z-
score for each the five values.
2- Consider a sample with a mean of 400 and a standard deviation of 50. What are the
z-score for the data values: 350, 500, 300, 450, and 550?
3- Consider a sample with a mean of 40 and a standard deviation of 5. Use
Chebyshev’s theorem to find the percentage of the data within each of the following
ranges:
a- 30 – 50
b- 25 - 55
c- 35 - 45
d- 28 – 52
4- Assume that a dataset have a bell-shaped distribution with a mean of 40 and a
standard deviation of 5. Use the empirical rule to compute the percentage of data
within every range of the following:
a- 30 – 50
b- 25 - 55
c- 35 - 45
5- A survey showed that on average, the college students sleep 7 hours per night with a
standard deviation of 1.5 hours:
a- Use Chebyshev’s theorem to find the percentage of students who sleep between
4 and 10 hours.
b- Use Chebyshev’s theorem to determine the percentage of students who sleep
between 2.5 and 11.5 hours.
c- Use the empirical rule to compute the percentage of students who sleep between
5.5 and 10 hours.
6- Find the first four moments about zero, about 4, and about the mean, and compute the standardized moments for the following data:
4–2–5–3–1
7- Calculate the five measures of skewness for the following data:
26 – 23 – 32 – 25 – 19 – 18 – 21 – 20 – 26 – 28
8- Find the kurtosis by two methods for the above data.
9- Compare the skewness and kurtosis for population and sample for the data in
example 7.
Chapter (6)
Correlation and Regression
In descriptive statistics, we have explained the numerical measures that are used to summarize a dataset for only one variable. Sometimes the researcher or decision maker is interested in knowing the relationship between two variables.
In this part, we will discuss covariance and correlation as descriptive measures of the
relationship between two quantitative variables. Then, we will explain simple linear
regression analysis.
(1) Covariance:
The formulas for computing the covariance of a population of size (N) or a sample of size (n), with observations on two variables (x) and (y), are defined as follows:

    sxy = Σ(x − x̄)(y − ȳ) / (n − 1)    (1)

    σxy = Σ(x − μx)(y − μy) / N    (2)
Where:
𝑠𝑥𝑦 is the sample covariance
𝜎𝑥𝑦 is the population covariance
𝑥 is the sample mean of variable x
𝑦 is the sample mean of variable y
𝜇𝑥 is the population mean of variable x
𝜇𝑦 is the population mean of variable y
According to the above formulas, we must get the summation of the products obtained by
multiplying the deviation of each x from its mean 𝑥 or 𝜇𝑥 by the deviation of the
corresponding y from its mean 𝑦 or 𝜇𝑦 , this summation is then divided by (n-1) or (N).
The positive value for 𝑠𝑥𝑦 or 𝜎𝑥𝑦 indicates a positive linear association between x and y;
that is, as the value of x increases, the value of y increases. A negative value for 𝑠𝑥𝑦 or 𝜎𝑥𝑦
indicates a negative linear association between x and y; that is, as the value of x increases,
the value of y decreases. If the value of 𝑠𝑥𝑦 or 𝜎𝑥𝑦 is close to zero, there is no linear
association between x and y. A large positive value for the covariance indicates a strong
positive linear relationship between the two variables, and a large negative value for the
covariance indicates a strong negative linear relationship between them.
However, we have a problem when using the covariance as a measure of the strength of
the linear relationship, that is the value of the covariance depends on the units of
measurement of x and y. For example, assume that we are interested in measuring the
relationship between heights and weights for some students. Clearly the strength of the
relationship must be the same whether we measure heights in meters or centimeters. Measuring the heights in centimeters, however, gives a larger value of covariance than measuring them in meters, when in fact the relationship does not change. A measure of the relationship
between variables that is not affected by the units of measurement is the coefficient of
correlation.

Example (1):
On 10 occasions during the past four months, a store used TV advertising to promote its
sales. The following table shows the number of advertisements (commercials) and the value of sales in hundreds of L.E at the store during the following week.
Table (1)
No of Advertising and Sales Value

Week    No of Adv.    Sales (100s L.E)
1       2             50
2       5             57
3       1             41
4       3             53
5       4             54
6       1             38
7       5             62
8       3             45
9       4             55
10      2             45

Find the population and sample covariance as a measure for the strength of the linear
relationship between the number of commercials x and the sales value y.
Solution:
(1) x   (2) y   (3) x−x̄   (4) y−ȳ   (5) = (3)×(4)
2       50      −1         0          0
5       57       2         7         14
1       41      −2        −9         18
3       53       0         3          0
4       54       1         4          4
1       38      −2       −12         24
5       62       2        12         24
3       45       0        −5          0
4       55       1         5          5
2       45      −1        −5          5
Σ:      30     500         0          0         94
𝑥 = 30/10 = 3
𝑦 = 500/10 = 50
    sxy = Σ(x − x̄)(y − ȳ) / (n − 1) = 94/(10 − 1) = 10.444
    σxy = Σ(x − μx)(y − μy) / N = 94/10 = 9.4

So, for the same data, the population covariance is smaller than the sample covariance, just as with the variance and the standard deviation.
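The whole computation above can be sketched in a few lines of code (a minimal sketch; the function name is ours):

```python
# Sample and population covariance, formulas (1) and (2).
def covariances(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / (n - 1), cross / n   # (s_xy, sigma_xy)

# Data from Table (1): number of commercials and sales (100s L.E)
ads   = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 53, 54, 38, 62, 45, 55, 45]
s_xy, sigma_xy = covariances(ads, sales)
print(round(s_xy, 3), sigma_xy)  # 10.444 9.4
```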
(2) Correlation Coefficient:
Another measure of the association or the relationship between two variables is the
correlation coefficient; it is also called Pearson Coefficient of Correlation or Pearson
Product-Moment Coefficient of Correlation.
In this part, we will discuss the simple linear correlation, which measures the strength of
the linear association between two variables.
The correlation coefficient computed for the population data is denoted by the Greek letter
rho 𝜌 and the one calculated for sample data is denoted by (r) or sometimes (rxy), and its
value lies in the range (-1 to 1). There are many formulas used to compute the correlation
coefficient, the following are examples of those formulas for a sample:
    rxy = sxy / (sx sy)    (2)

    r = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² · Σ(yi − ȳ)²]    (3)

    r = [n Σxiyi − (Σxi)(Σyi)] / [√(n Σxi² − (Σxi)²) · √(n Σyi² − (Σyi)²)]    (4)
If r = 1, it is said to be a perfect positive linear correlation. If r = −1, the correlation is said to be a perfect negative linear correlation. When r is very close to zero, there is no linear correlation between the two variables.
In real-world problems, there is no perfect positive or perfect negative correlation; the correlation coefficient is usually greater than (−1) and less than (+1). If the correlation
coefficient is close to 1, we say there is a strong positive linear correlation between the
two variables. If it is positive but close to zero, then the variables have a weak positive
linear correlation. In contrast, if the correlation coefficient is negative and close to -1, this
means that there is a strong negative linear correlation. If it is negative but close to zero,
there exists a weak negative linear correlation between the variables.
Example (2):
Compute the correlation coefficient for the sample and population data involved in
example (1).
Solution:
From the first example, the sample covariance sxy is 10.444 and the population covariance σxy is 9.4; the standard deviations of the two variables are calculated as follows:
(x−x̄)   (y−ȳ)   (x−x̄)²   (y−ȳ)²
−1        0       1          0
 2        7       4         49
−2       −9       4         81
 0        3       0          9
 1        4       1         16
−2      −12       4        144
 2       12       4        144
 0       −5       0         25
 1        5       1         25
−1       −5       1         25
Σ:       0        0        20        518
    sx = √[Σ(x − x̄)² / (n − 1)] = √(20/9) = 1.49 ,   σx = √(20/10) = 1.414

    sy = √[Σ(y − ȳ)² / (n − 1)] = √(518/9) = 7.58 ,   σy = √(518/10) = 7.197

    rxy = sxy / (sx sy) = 10.444 / (1.49 × 7.58) = 0.924

    ρxy = σxy / (σx σy) = 9.4 / (1.414 × 7.197) = 0.924
So, there is a strong positive linear correlation between the two variables, and the
correlation coefficient is the same for population and sample.
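A short sketch tying formula (2) to the numbers above (helper function written by us):

```python
import math

# Pearson correlation r = s_xy / (s_x * s_y), as in Example (2).
def pearson_r(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

ads   = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 53, 54, 38, 62, 45, 55, 45]
print(round(pearson_r(ads, sales), 3))  # 0.924
```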
Example (3):
Calculate the coefficient of correlation for x and y, where:
yi 5 6 6 8 7
xi 4 5 3 6 5
Solution:
N yi xi xiyi (xi)2 (yi)2
1 5 4 20 16 25
2 6 5 30 25 36
3 6 3 18 9 36
4 8 6 48 36 64
5 7 5 35 25 49
 32 23 151 111 210
    r = [n Σxiyi − (Σxi)(Σyi)] / [√(n Σxi² − (Σxi)²) · √(n Σyi² − (Σyi)²)]

    r = [5(151) − (23)(32)] / [√(5(111) − (23)²) · √(5(210) − (32)²)]

    r = 19 / (√26 × √26) = 19/26 = 0.73
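The same raw-sums arithmetic as in the solution, sketched in code (function name is ours):

```python
import math

# Pearson r via the computational formula (4), as used in Example (3).
def pearson_from_sums(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy                   # 5(151) - (23)(32) = 19
    den = math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    return num / den

x = [4, 5, 3, 6, 5]
y = [5, 6, 6, 8, 7]
print(round(pearson_from_sums(x, y), 2))  # 0.73
```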
(3) Spearman Rank Correlation:
In the previous two parts of this chapter, we have dealt only with quantitative variables. In
many situations, however, one or both variables may be represented by ordered or ranked
data. In such cases, we cannot use the covariance or Pearson correlation coefficient to
describe the association between the two variables. We can, however, use the Spearman
rank correlation coefficient, which utilizes the ranks of the data rather than the original
data themselves. The Spearman rank correlation is denoted by (rs) for a sample and by (ρs) for a population, and is defined as:
n ab  ( a) ( b)
rs  (5)
n a 2  ( a ) 2 n  b 2  (  b) 2

- 84 -
6  d i2
rs  1  (6)
n (n 2  1)

Where:
a and b are the ranks of the data on the two variables, and di = (ai − bi).
Example (4):
The following table gives the price and quality scale of a sample of five brands of fans:

Fan Brand A B C D E
Quality Scale 8 9 4 4 6
Price 550 600 420 480 620

Find the Spearman rank correlation coefficient between the quality scales of the fan brands
and the fan price by two methods.
Solution:

Brand   Quality   Price   ai     bi    di = ai − bi   di²
A       8         550     4      3       1            1
B       9         600     5      4       1            1
C       4         420     1.5    1       0.5          0.25
D       4         480     1.5    2      −0.5          0.25
E       6         620     3      5      −2            4
Σ                                                     6.5
First method:
    rs = 1 − [6 Σdi² / (n(n² − 1))]
    rs = 1 − [6(6.5) / (5(25 − 1))] = 1 − 39/120
       = 1 − 0.325 = 0.675
Second method:

Brand   (Q) rank (a)   (P) rank (b)   ai²     bi²    ai bi
A       4              3              16      9      12
B       5              4              25      16     20
C       1.5            1              2.25    1      1.5
D       1.5            2              2.25    4      3
E       3              5              9       25     15
Σ       15             15             54.5    55     51.5

    rs = [n Σab − (Σa)(Σb)] / [√(n Σa² − (Σa)²) · √(n Σb² − (Σb)²)]
       = [5(51.5) − (15)(15)] / [√(5(54.5) − (15)²) · √(5(55) − (15)²)]
       = (257.5 − 225) / √[(272.5 − 225)(275 − 225)]
       = 32.5 / √(47.5 × 50)
       = 32.5 / √2375 = 32.5 / 48.734 = 0.667
So, there is a moderate positive association between the quality scales of the fan brands and their prices. (The two methods differ slightly, 0.675 versus 0.667, because formula (6) is exact only when there are no tied ranks; here brands C and D are tied in quality.)
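Both methods rest on the same ranking step; here is a compact sketch that assigns average ranks to ties and then applies formula (6) (function names are ours):

```python
# Spearman rank correlation via formula (6), with average ranks for ties.
def avg_ranks(values):
    srt = sorted(values)
    # rank of v = first 1-based position of v plus half the number of extra ties
    return [srt.index(v) + 1 + (srt.count(v) - 1) / 2 for v in values]

def spearman(x, y):
    a, b = avg_ranks(x), avg_ranks(y)
    n = len(x)
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return 1 - 6 * d2 / (n * (n * n - 1))

quality = [8, 9, 4, 4, 6]
price   = [550, 600, 420, 480, 620]
print(round(spearman(quality, price), 3))  # 0.675
```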
(4) Regression Analysis:
Assume that an economist wants to investigate the relationship between food expenditure and income. What factors or variables does a household consider when deciding how much money it should spend on food every week or every month? Certainly, income of
the household is one factor. However, many other variables also affect food expenditure.
For instance, the assets owned by the household, the family members or dependents, the
preferences and tastes of the family members, and any special dietary needs of household
members are some of the variables that influence a household’s decision about food
expenditure. These variables are called independent or explanatory variables because they
all vary independently, and they explain the variation in food expenditures among
different households or families. In other words, these variables explain why different
families spend different amounts of money on food. Food expenditure is called the
dependent variable because it depends on the independent variables. Studying the effect of
two or more independent variables on a dependent variable using regression analysis is
called multiple regression. However, if we select only one independent variable and study
the effect of that single variable on a dependent variable, it is called a simple regression.
Thus, a simple regression includes only two variables: one independent and one
dependent. Note that whether it is a simple or a multiple regression analysis, it always
includes one and only one dependent variable. It is the number of independent variables
that changes in simple and multiple regressions.
A regression model is a mathematical equation that describes the relationship between two
or more variables. A simple regression model includes only two variables: one
independent and one dependent. The dependent variable is the one being explained, and
the independent variable is the one used to explain the variation in the dependent variable.
- Linear Regression Model:
A simple regression model that gives a straight-line relationship between two variables is
called a linear regression model.
The relationship between two variables in a regression analysis is expressed by a
mathematical equation called a regression equation or model. A regression equation, when
plotted, may assume one of many possible shapes, including a straight line. A regression
equation that gives a straight-line relationship between two variables is called a linear
regression model; otherwise, the model is called a nonlinear regression model. In this
chapter, only linear regression models are explained.
The equation of a linear relationship between two variables x and y is written as follow:
y = a + bx
Each set of values of a and b gives a different straight line. For example, when a = 10 and b = 5, this equation becomes:
y = 10 + 5x
To plot a straight line, we need to know at least two points that lie on that line. We can find two points on a line by assigning any two values to x and then computing the corresponding values of y.
1. When x = 0, then y =10 + 5(0) =10.
2. When x =10, then y =10 +5(10) =60.
By plotting these two points and joining them, we obtain the line representing the above equation, as shown in the following figure.
[Figure: the straight line y = 10 + 5x, passing through A(0, 10) and B(10, 60), with x on the horizontal axis and y on the vertical axis]
Note that in the above figure, the line intersects the y (vertical) axis at 10. Consequently, 10 is called the y-intercept. The y-intercept is given by the constant term in the equation; it is the value of y when x equals zero.
In the equation y =10 + 5x, 5 is called the coefficient of x or the slope of the line. It gives
the amount of change in y due to a change of one unit in x. For example:
If x = 10, then y = 10 + 5(10) = 60
If x = 11, then y = 10 + 5(11) = 65
So, as x increases by 1 unit (from 10 to 11), y increases by 5 units (from 60 to 65). This is
true for any value of x.
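The slope interpretation above can be checked mechanically (a trivial sketch, with our own function name):

```python
# y = a + b*x with a = 10 (y-intercept) and b = 5 (slope).
def line(x, a=10, b=5):
    return a + b * x

print(line(0))              # 10 -> the y-intercept
print(line(10))             # 60
print(line(11) - line(10))  # 5  -> a one-unit step in x changes y by b
```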
- Simple Linear Regression Analysis:

In a regression model, the independent variable is usually denoted by x, and the dependent
variable is usually denoted by y. The x variable, with its coefficient, is written on the right
side of the = sign, whereas the y variable is written on the left side of the = sign. The y-
intercept and the slope, which we earlier denoted by a and b, respectively, can be
represented by any of the many commonly used symbols. Let us denote the y-intercept
(which is also called the constant term) by A, and the slope (or the coefficient of the x
variable) by B. Then, our simple linear regression model is written as follow:
    y = A + Bx    (7)

Here y is the dependent variable, x is the independent variable, A is the constant term (y-intercept), and B is the slope.
In equation (7), A gives the value of y for x= 0, and B gives the change in y due to a
change of one unit in x. Equation (7) is called a deterministic model. It gives an exact
relationship between x and y.
This model simply states that y is determined exactly by x, and for a given value of x there
is one and only one (unique) value of y.
However, in many cases the relationship between variables is not exact. For example, if y is food expenditure and x is income, then model (7) would state that food expenditure
is determined by income only and that all families with the same income spend the same
amount on food. As mentioned earlier, however, food expenditure is determined by many
variables, only one of which is included in model (7). In fact, different families with the
same income spend different amounts of money on food because of the differences in the
sizes of the families, the assets they own, and their preferences and tastes. So, to take these variables into account and to make our model complete, we add another term to the right side of model (7).
This term is called the random error term. It is denoted by ∈ (the Greek letter epsilon). The complete regression model is written as follows:

    y = A + Bx + ∈    (8)

where ∈ is the random error term. The regression model (8) is called a probabilistic model or a statistical relationship.
The random error term ∈ is included in the model to represent the following two
phenomena:
1. Missing or omitted variables. As mentioned earlier, food expenditure is affected by
many variables other than income. The random error term ∈ is included to capture the
effect of all those missing or omitted variables that have not been included in the model.
2. Random variation. Human behavior is unpredictable. For example, a family may have
many parties during one month and spend more than usual on food during that month.
The same family may spend less than usual during another month because it spent quite a
bit of money to buy furniture. The variation in food expenditure for such reasons may be
called random variation.
In model (8), A and B are the population parameters. The regression line obtained for
model (8) by using the population data is called the population regression line. The values
of A and B in the population regression line are called the true values of the y-intercept
and the slope, respectively.
However, population data are difficult to obtain. As a result, we almost always use sample
data to estimate model (8). The values of the y-intercept and slope calculated from sample
data on x and y are called the estimated values of A and B and are denoted by a and b,
respectively.
Using a and b, we write the estimated regression model as follows:

    ŷ = a + bx    (9)

Where ŷ (pronounced “y hat”) is the estimated or predicted value of y for a given value of x.
Equation (9) is called the estimated regression model; it gives the regression of y on x.
- Scatter Diagram:
A scatter diagram is a plot of paired observations. The following example displays how to
draw a scatter diagram.

Example (5):
Assume we take a sample of seven employees from the faculty of commerce and collect
information on their incomes x and food expenditures y for the last month. The
information obtained (in hundreds of Egyptian Pounds) is given in the following table:

X 55 83 38 61 33 49 67
Y 14 24 13 16 9 15 17

Draw a scatter diagram for these data.


Solution:
In the above table we have a pair of observations for each of the seven employees. Each pair consists of one observation on income and a second on food expenditure. For example, the first employee’s income for the last month was 5500 L.E and his food expenditure was 1400 L.E. By plotting all seven pairs of values, we obtain a scatter diagram or scatter plot as follows:
[Scatter diagram: food expenditure (y) plotted against income (x) for the seven employees]
The above figure gives the scatter diagram for the data of the above table. Each dot in this
diagram represents one employee.
A scatter diagram is helpful in detecting a relationship between two variables. For
example, by looking at the scatter diagram of the above figure, we can observe that there
exists a strong linear relationship between food expenditure and income. If a straight line
is drawn through the points, the points will be scattered closely around the line.
As shown in the following figure, a large number of straight lines can be drawn for the
scatter plot. Each of these lines will give different values for a and b in model (8).
[Figure: several possible straight lines drawn through the points of the scatter diagram]
In regression analysis, we try to find a line that best fits the points in the scatter diagram.
Such a line provides the best possible description of the relationship between the
dependent and independent variables. The least squares method, explained in the next
section, gives such a line. The line obtained by least squares method is called the least
squares regression line.

- Least Squares Line:


The value of y obtained for an observation from a survey is called the observed or actual value of y. As mentioned earlier in this section, the value of y denoted by ŷ, obtained for a given x by using the regression line, is called the predicted or estimated value of y. The
random error ∈ denotes the difference between the actual value of y and the predicted
value of y for population data. For example, for a given employee, ∈ is the difference
between what this employee actually spent on food during the last month and what is
estimated or predicted using the population regression line. ∈ is also called the residual
because it measures the difference (positive or negative) between the actual value of food
expenditure and the predicted value by using the regression model. If we estimate model
(8) by using sample data, the difference between the actual y and the predicted y based on
this estimation cannot be denoted by ∈. The random error for the sample regression model
is denoted by e. Thus, e is an estimator of ∈. If we estimate model (8) using sample data,
then the value of e is given as:
e = Actual expenditure − Predicted expenditure,
    e = y − ŷ

[Figure: Regression Line and Random Errors]
In the above figure, e is the vertical distance between the actual position of an employee
and the corresponding estimated point on the regression line as displayed by the arrows.
Note that in such a diagram, we always measure the dependent variable on the vertical axis
and the independent variable on the horizontal axis.
The value of an error is positive if the point that gives the actual food expenditure is above
the regression line and negative if it is below the regression line. The sum of these errors is
always zero. In other words, the sum of the actual food expenditures for the seven
employees included in the sample will be the same as the sum of the food expenditures
predicted or estimated from the regression model. Thus:
    Σe = Σ(y − ŷ) = 0
So, to find the line that best fits the scatter of points, we cannot minimize the sum of
errors. Instead, we minimize the error sum of squares, denoted by SSE, which is obtained
by adding the squares of errors. Thus:
    SSE = Σe² = Σ(y − ŷ)²    (9)
The least squares method gives the values of a and b for model (8) such that the sum of
squared errors (SSE) is minimum.
The values of a and b that give the minimum SSE are called the least squares estimates of
A and B, and the regression line obtained with these estimates is called the least squares
line.
For the least squares regression line:
    ŷ = a + bx

    b = SSxy / SSxx ,  a = ȳ − b x̄

    SSxy = Σxy − (Σx)(Σy)/n    (10)

    SSxx = Σx² − (Σx)²/n

Where SS means “sum of squares.” The least squares regression line is also called the regression of y on x.

Example (6):
Find the least squares regression line for the data on incomes and food expenditures on the
seven employees given in example (5). Use income as an independent variable and food
expenditure as a dependent variable.
Solution:
The following table shows the calculations required for the computation of a and b.
x y xy x2
55 14 770 3025
83 24 1992 6889
38 13 494 1444
61 16 976 3721
33 9 297 1089
49 15 735 2401
67 17 1139 4489
Σx = 386 Σy = 108 Σxy=6403 Σx2= 23,058
The following steps are used to compute a and b.
1. Compute Σx, Σy, x̄ and ȳ:
    Σx = 386, Σy = 108
    x̄ = Σx/n = 386/7 = 55.14
    ȳ = Σy/n = 108/7 = 15.43
2. Compute Σxy and Σx²:
To calculate Σxy, we multiply the corresponding values of x and y and then sum all the products. The products of x and y are recorded in the third column of the above table. To compute Σx², we square each of the x values and then add them. The squared values of x are listed in the fourth column of the table. From these calculations:
    Σxy = 6403, Σx² = 23,058
3. Compute SSxy and SSxx:
    SSxy = Σxy − (Σx)(Σy)/n = 6403 − (386)(108)/7 = 447.57
    SSxx = Σx² − (Σx)²/n = 23058 − (386)²/7 = 1772.86
4. Compute a and b:
    b = SSxy / SSxx = 447.57 / 1772.86 = 0.2525
    a = ȳ − b x̄ = 15.43 − (0.2525)(55.14) = 1.505

Thus, our estimated regression model is:
    ŷ = a + bx = 1.505 + 0.2525x

This regression line is called the least squares regression line. It gives the regression of
food expenditure on income.
Note that we have rounded some calculations performed by hand to four decimal places.
We can round the values of a and b in the regression equation to two decimal places, but
we do not do this here because we will use this regression equation for prediction and
estimation purposes later.
Using this estimated regression model, we can find the predicted value of y for any
specific value of x. For example, assume we randomly select an employee whose monthly
income is 6100 L.E, so that x = 61 (because x denotes income in hundreds of pounds). The
estimated value of food expenditure for this employee is computed as:
    ŷ = 1.505 + 0.2525(61) = 16.9075 hundred pounds, or 1690.75 L.E.
In other words, based on our regression line, we predict that an employee with a monthly
income of 6100 L.E is expected to spend 1690.75 L.E per month on food. This value can
also be interpreted as a point estimator of the mean value of y for x equal 61. Thus, we can
state that, on average, all employees with a monthly income of 6100 L.E spend about
1690.75 L.E per month on food.
In our example, there is one employee whose income is 6100 L.E. The actual food
expenditure for this employee is 1600 (from the data). The difference between the actual
and predicted values gives the error of prediction. Thus, the error of prediction for this
employee is:
    e = y − ŷ = 16 − 16.9075 = −0.9075 hundred, or −90.75 L.E.
Therefore, the error of prediction is −90.75 L.E. The negative error indicates that the predicted value of y is greater than the actual value of y. Thus, if we use the regression model, this employee’s food expenditure is overestimated by 90.75 L.E, as shown in the following figure.
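The entire Example (6) computation can be sketched as follows (our function name; the values match the hand calculation up to rounding of the intermediates):

```python
# Least squares estimates, formula (10): b = SS_xy / SS_xx, a = y_bar - b * x_bar.
def least_squares(x, y):
    n = len(x)
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    b = ss_xy / ss_xx
    a = sum(y) / n - b * sum(x) / n
    return a, b

income = [55, 83, 38, 61, 33, 49, 67]   # hundreds of L.E
food   = [14, 24, 13, 16, 9, 15, 17]
a, b = least_squares(income, food)
print(round(b, 4))           # 0.2525
print(round(a + b * 61, 2))  # 16.91 -> predicted food expenditure at x = 61
```

Note that the unrounded intercept is about 1.507; the text’s 1.505 comes from using the rounded values 15.43, 0.2525, and 55.14.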
- Interpretation of a and b:
How do we interpret a = 1.505 and b = 0.2525 obtained from the regression of food
expenditure on income? A brief explanation of the y-intercept and the slope of a regression
line were given before. Now we will explain the meaning of a and b in more detail.
1. Interpretation of a:
Consider an employee with zero income. Using the estimated regression line, we get the
predicted value of y for x = 0 as follow:
    ŷ = 1.505 + 0.2525(0) = 1.505 hundred = 150.5 L.E.
Thus, we can say that an employee with no income is expected to spend 150.5L.E per
month on food. Alternatively, we can also say that the point estimate of the average
monthly food expenditure for all employees with zero income is 150.5 L.E. Thus, a = 1.505 (i.e., 150.5 L.E) gives the predicted or mean value of y for x = 0 based on the regression model estimated for the sample data.
However, we should be very careful when making this interpretation of a. In our sample of
seven employees, the incomes vary from a minimum of 3300 L.E to a maximum of 8300
L.E (using hundreds, the minimum value of x is 33 and the maximum value is 83). So, our
regression line is valid only for the values of x between 33 and 83. If we predict y for a
value of x outside this range, the prediction usually will not hold true. Thus, since x = 0 is
outside the range of incomes that we have in the sample data, the prediction that an
employee with zero income spends 150.5 L.E per month on food does not carry much
credibility. The same is true if we try to predict y for an income greater than 8300 L.E,
which is the maximum value of x.
2. Interpretation of b:
The value of b in a regression model gives the change in y (dependent variable) due to a
change of one unit in x (independent variable). For example, by using the regression
equation obtained from the example, we see:
When x = 50, ŷ = 1.505 + 0.2525(50) = 14.1300
When x = 51, ŷ = 1.505 + 0.2525(51) = 14.3825
Thus, when x increased by one unit, from 50 to 51, ŷ increased by 14.3825 - 14.1300 =
0.2525, which is the value of b. Because our unit of measurement is hundreds of pounds,
we can state that, on average, a 100 L.E increase in income will result in a 25.25 L.E
increase in food expenditure.
We can also state that, on average, a 1 L.E increase in income of an employee will
increase the food expenditure by 0.2525 L.E. Note the phrase “on average” in these
statements. The regression line is seen as a measure of the mean value of y for a given
value of x. If one employee’s income is increased by 100 L.E, that employee’s food
expenditure may or may not increase by 25.25 L.E. However, if the incomes of all
employees are increased by 100 L.E each, the average increase in their food expenditures
will be very close to 25.25 L.E.
Note that when b is positive, an increase in x will lead to an increase in y, and a decrease in
x will lead to a decrease in y. In other words, when b is positive, the movements in x and y
are in the same direction. Such a relationship between x and y is called a positive linear
relationship.
The regression line in this case slopes upward from left to right. On the other hand, if the
value of b is negative, an increase in x will lead to a decrease in y, and a decrease in x will
cause an increase in y. The changes in x and y in this case are in opposite directions. Such
a relationship between x and y is called a negative linear relationship. The regression line
in this case slopes downward from left to right.
For a regression model, b is computed as b =SSxy / SSxx. The value of SSxx is always
positive, and that of SSxy can be positive or negative. Hence, the sign of b depends on the
sign of SSxy.
If SSxy is positive (as in our example), then b will be positive, and if SSxy is negative, then
b will be negative.
- Coefficient of Determination:
After performing the regression model, we may ask a question: How good is the
regression model? In other words: How well does the independent variable explain the
dependent variable in the regression model? The coefficient of determination is the
concept that answers this question.
Assume that we have information only on the food expenditures of employees and not on
their incomes. In this case, we cannot use the regression line to predict the food
expenditure for any employee. In the absence of a regression model, we may use the
average value (ȳ) to estimate or predict each employee’s food expenditure. Consequently,
the error of prediction for each employee is given by the difference between the actual
food expenditure of an employee and the mean food expenditure; that is:
    e = y − ȳ
If we compute such errors for all employees in the sample, and then square and add them,
the resulting sum is called the total sum of squares and is denoted by SST. Actually SST is
the same as SSyy and is computed as:
    SST = SSyy = Σ(y − ȳ)² = Σy² − (Σy)²/n    (11)
As we explained earlier, the error sum of squares (SSE), which is given by formula (9), will be less than the total sum of squares (SST). The difference between them is called the regression sum of squares, denoted by (SSR). So:
SSR = SST – SSE
Where:
    SSR = Σ(ŷ − ȳ)²    (12)

And SST and SSE are as defined before:
    SST = SSyy = Σ(y − ȳ)²
    SSE = Σe² = Σ(y − ŷ)²
So, the sum of squared errors will decrease from SST to SSE when we use (ŷ) instead of
(ȳ) to predict the dependent variable, which is food expenditure in our example. Hence,
SSE is the portion of SST that is not explained by the regression model. The sum of SSR
and SSE is always equal to SST. Thus:
SST = SSR + SSE
The ratio of SSR to SST is called the coefficient of determination. The coefficient of determination calculated for population data is denoted by ρ² (the Greek letter rho, squared), and the one calculated for sample data is denoted by (r²). The coefficient of determination gives the portion of SST that is explained by the use of the regression model, and its value always lies in the range (0 to 1). It can be computed by using the following formula:

    r² = SSR/SST = (SST − SSE)/SST ,  0 ≤ r² ≤ 1    (13)

To facilitate the computation of (r²), the following formula is more suitable and simpler to use:

    r² = b SSxy / SSyy    (14)
Example (7):
Find and explain the meaning of SST, SSE and r2 for the data involved in example (5) and
the least square line resulted from example (6).
Solution:
To compute SST (or SSyy) and r², we first calculate Σy² as shown in the following table:
x     y     y²
55    14    196
83    24    576
38    13    169
61    16    256
33    9     81
49    15    225
67    17    289
Σx = 386, Σy = 108, Σy² = 1792
The value of SST or SSyy is computed as:
    SSyy = Σy² − (Σy)²/n = 1792 − (108)²/7 = 125.714
From example (6), we have:
    b = 0.2525, SSxy = 447.57
So,
    r² = b SSxy / SSyy = (0.2525)(447.57) / 125.714 = 0.90
To compute SSE, the following table is employed:

x     y     ŷ = 1.5050 + 0.2525x   e = (y − ŷ)   e² = (y − ŷ)²
55    14    15.3925                −1.3925        1.9391
83    24    22.4625                 1.5375        2.3639
38    13    11.1000                 1.9000        3.6100
61    16    16.9075                −0.9075        0.8236
33    9     9.8375                 −0.8375        0.7014
49    15    13.8775                 1.1225        1.2600
67    17    18.4225                −1.4225        2.0235
Σe² = Σ(y − ŷ)² = 12.7215
Thus, we can state that SST is reduced by approximately 90% (from 125.71 to 12.72) when we use (ŷ) instead of (ȳ) to predict the food expenditures of the employees. Note that (r²) is usually rounded to two decimal places.
The total sum of squares (SST) is a measure of the total variation in food expenditures, the
regression sum of squares (SSR) is the proportion of total variation explained by the
regression model (or by income), and the error sum of squares (SSE) is the proportion of
total variation not explained by the regression model. Hence, for this example, we can say
that 90% of the total variation in food expenditures of employees occurs because of the

- 102 -
variation in their incomes, and the remaining10% is due to randomness and other
variables.
Usually, the higher the value of (r2), the better is the regression model. This is so because
if (r2) is larger, a greater portion of the total errors is explained by the included
independent variable, and a smaller portion of errors is attributed to other variables and
randomness.
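The residual table above can be reproduced programmatically. The following is a minimal Python sketch, using the rounded least squares estimates a = 1.5050 and b = 0.2525 from example (6):

```python
# Data from example (5).
x = [55, 83, 38, 61, 33, 49, 67]
y = [14, 24, 13, 16, 9, 15, 17]

a, b = 1.5050, 0.2525              # intercept and slope from example (6)

# Predicted values, residuals e = y - yhat, and SSE = sum of squared residuals.
y_hat = [a + b * xi for xi in x]
errors = [yi - yh for yi, yh in zip(y, y_hat)]
sse = sum(e * e for e in errors)

# SST, and r^2 in its SSE-based form: r^2 = (SST - SSE) / SST.
sst = sum(yi * yi for yi in y) - sum(y) ** 2 / len(y)
r_squared = (sst - sse) / sst
```

This gives SSE ≈ 12.72 and r² ≈ 0.90, agreeing with the table and with the shortcut formula.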

- Standard Deviation of Random Errors:

Recall examples (5) and (6): when we consider incomes and food expenditures, all
employees with the same income are expected to spend different amounts on food.
Consequently, the random error ∈ will assume different values for those employees. The
standard deviation of random errors (σ∈) measures the spread of these errors around the
population regression line. The standard deviation of errors tells us how widely the errors,
and hence the values of y, are spread for a given x.

Note that (σ∈) denotes the standard deviation of errors for the population. However, it is
usually unknown. In such cases, it is estimated by (se), which is the standard deviation
of errors for the sample data. The following is the basic formula to calculate (se):

se = √(SSE / (n - 2)) ,  where SSE = Σ(y - ŷ)²        (15)

In this formula, (n - 2) is called the degrees of freedom for the regression model. The
reason that df = n - 2 is that we lose one degree of freedom to calculate x̄ and one for ȳ.
To simplify the computation, it is more convenient to use the following formula to
calculate the standard deviation of errors (se):

se = √((SSyy - b·SSxy) / (n - 2))        (16)

where SSyy = Σy² - (Σy)²/n
The calculation of SSxy was discussed earlier.
Like the value of SSxx, the value of SSyy is always positive.

Example (8):
Compute the standard deviation of errors (se) for the data on monthly incomes and food
expenditures of the seven employees given in example (5).
Solution:
To compute (se), we need to know the values of SSyy, SSxy, and b.
From examples (6) and (7) we computed SSxy, SSyy and b. These values are:
SSxy = 447.57, b = 0.2525 and SSyy = 125.7143

se = √((SSyy - b·SSxy) / (n - 2))
   = √((125.71 - (0.2525)(447.57)) / (7 - 2)) = 1.59
Hence, the standard deviation of errors is approximately 1.6.
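The computation in example (8) can be verified in a few lines of Python; a minimal sketch of the computational formula for se:

```python
import math

# Values carried over from examples (6) and (7).
ss_yy = 125.7143
ss_xy = 447.57
b = 0.2525
n = 7

# Standard deviation of errors: se = sqrt((SSyy - b*SSxy) / (n - 2)).
se = math.sqrt((ss_yy - b * ss_xy) / (n - 2))
```

Rounded to two decimal places, se = 1.59, as obtained by hand.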

Exercises
(1) What does the linear correlation coefficient tell us about the association between two
variables? Within what range can a correlation coefficient assume a value?
(2) What are the differences among r, rs, rxy, ρxy, and σxy?
(3) Explain each of the following concepts.
a- Perfect negative linear correlation
b- Strong positive linear correlation
c- Weak negative linear correlation
d- No linear correlation
(4) Five observations were taken on two variables, as follows:
x 10 11 18 5 16
y 45 50 60 30 35
a- Compute the sample and population covariance.
b- Calculate the sample Pearson correlation coefficient.
c- Find the Spearman correlation coefficient.
(5) What are the degrees of freedom for a simple linear regression model?
(6) Explain the meaning of coefficient of determination.
(7) For the data included in exercise (4), find and explain the meaning of SST and SSR.
You may use graphs for illustration purposes.
(8) A population data set produced the following information.
x 12 8 10 7
y 14 11 13 10

Find the values of ŷ, se and r².


(9) The following information is obtained from a sample data set.
x 2 6 11 8
y 8 6 5 7

Find the values of ŷ, se and r2.

