lec4-EDA2025
- A collection of data.
- The totality of methods used in the collection, processing, analysis or interpretation of
any kind of data.
Example:
To determine public sentiment about the government’s program against drug-related
cases, an interviewer asks a respondent the question: “Do you feel that this wasteful program of
the government regarding inhuman killings related to drug cases should be stopped or not?”
- This is called “begging the question” and may well yield misleading results, because the
interviewer suggests that the program is in fact wasteful.
- The question asked should be clear, and the answer should be of interest mainly to the
persons it concerns.
Statistical model – methods which apply regardless of whether the data are IQs, tax
payments, reaction times, test scores and so on.
Origin
a) Government
b) Games of chance
Descriptive Statistics – methods which originally consisted mainly of presenting data in the
forms of tables and charts. This includes anything done to data which is designed to summarize
or describe them without attempting to infer anything that goes beyond the data themselves.
Statistical Inference – methods in which analysis will require generalizations which go beyond
the data.
Probability Theory – was applied to many problems in the behavioural, natural and social
sciences and provides an important tool for the analysis of any situation which in some way
involves an element of uncertainty or chance.
Statistical Data – are the raw material of statistical investigation and they arise whenever
measurements are made or observations are classified.
Nominal Data – numbers which merely represent a coding of various categories. In this artificial,
or nominal, way, categorical data can be made into numerical data.
Interval Data – data for which we can form differences but cannot multiply or divide.
Ratio Data – data for which we can also form quotients; such data are not difficult to find.
FREQUENCY DISTRIBUTION
The most common method of summarizing data is to present them in condensed form in
tables or charts. When we deal with large sets of data, a good over-all picture and sufficient
information can often be conveyed by grouping the data into a number of classes. Tables like
these are called frequency distributions.
Frequency distributions present data in relatively compact form, give a good over-all
picture and contain information that is adequate for many purposes, but some things which can
be determined from the original data cannot be determined from a distribution.
Frequency distributions present raw, or unprocessed, data in a more readily usable form, and
the price we pay for this – the loss of certain information – is usually a fair exchange.
Since the last two steps are purely mechanical, we shall concentrate on the first, namely, the problem
of choosing a suitable classification. The two things we must consider in choosing a
classification scheme for a numerical distribution are how many classes we should use and the
range of values each class should cover, that is, from where to where each class should go.
1) We seldom use fewer than six or more than fifteen classes ( 7 – 14 ); the exact number
we use in a given situation will depend mainly on the number of measurements or
observations we have to group.
2) We always make sure that each item ( measurement/observation ) will go into one and
only one class.
3) Whenever possible, we make the classes the same length; that is, we make them cover equal
ranges of values.
Open Class – any class of the “less than”, “or less”, “more than” or “or more” type. If a set of data
contains a few values which are much greater than or much smaller than the rest, open classes
are quite useful in reducing the number of classes required to accommodate the data.
However, we usually avoid open classes because they make it impossible to calculate certain
values of interest, such as an average or a total.
Example:
15.8 26.4 17.3 11.2 23.9 24.8 18.7 13.9 22.7 9.8
6.2 14.7 17.5 26.1 12.8 26.8 22.7 18.0 20.5 11.0
20.9 15.5 19.4 19.1 15.2 22.9 26.6 20.4 21.4 19.2
21.6 18.5 23.0 24.6 20.1 16.2 18.4 7.8 13.5 14.6
29.6 19.4 17.2 20.9 24.6 22.5 24.6 8.3 21.9 12.3
22.3 13.3 11.8 19.2 20.4 25.9 10.3 15.1 27.5 18.1
17.9 9.8 24.1 13.2 10.8 14.5 31.9 9.0 16.7 23.5
25.7 23.7 19.1 18.4 28.6 17.7 16.8 18.4 20.1 6.8
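The grouping of the 80 values above can be sketched in Python as follows. The notes do not state the classes used, so the seven classes of length 4.0 starting at 5.0 (5.0-8.9, 9.0-12.9, and so on) are an assumption made only for illustration; the names `class_index` and `frequencies` are likewise hypothetical.

```python
from collections import Counter

# The 80 measurements from the example above.
data = [
    15.8, 26.4, 17.3, 11.2, 23.9, 24.8, 18.7, 13.9, 22.7, 9.8,
    6.2, 14.7, 17.5, 26.1, 12.8, 26.8, 22.7, 18.0, 20.5, 11.0,
    20.9, 15.5, 19.4, 19.1, 15.2, 22.9, 26.6, 20.4, 21.4, 19.2,
    21.6, 18.5, 23.0, 24.6, 20.1, 16.2, 18.4, 7.8, 13.5, 14.6,
    29.6, 19.4, 17.2, 20.9, 24.6, 22.5, 24.6, 8.3, 21.9, 12.3,
    22.3, 13.3, 11.8, 19.2, 20.4, 25.9, 10.3, 15.1, 27.5, 18.1,
    17.9, 9.8, 24.1, 13.2, 10.8, 14.5, 31.9, 9.0, 16.7, 23.5,
    25.7, 23.7, 19.1, 18.4, 28.6, 17.7, 16.8, 18.4, 20.1, 6.8,
]

# Assumed classification (not given in the notes): seven classes of
# equal length 4.0, the first one starting at 5.0.
LOWEST_LIMIT_TENTHS = 50   # 5.0 expressed in tenths
CLASS_LENGTH_TENTHS = 40   # 4.0 expressed in tenths
NUM_CLASSES = 7


def class_index(value):
    """Return the index (0-based) of the class that contains `value`."""
    tenths = round(value * 10)  # integer tenths avoid floating-point edge cases
    return (tenths - LOWEST_LIMIT_TENTHS) // CLASS_LENGTH_TENTHS


# Tally: every item falls into one and only one class.
tally = Counter(class_index(x) for x in data)
frequencies = [tally.get(i, 0) for i in range(NUM_CLASSES)]

# Print the class limits next to the class frequencies.
for i, f in enumerate(frequencies):
    lower = (LOWEST_LIMIT_TENTHS + i * CLASS_LENGTH_TENTHS) / 10
    upper = lower + (CLASS_LENGTH_TENTHS - 1) / 10
    print(f"{lower:4.1f} - {upper:4.1f}   {f}")
```

A different choice of class limits would give a different, equally valid table, as long as each value still goes into exactly one class.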
Class Marks – are simply the midpoints of the classes. They are found by adding the upper and
lower limits of a class and dividing by two.
Class Interval – is merely the length of a class, or the range of values it contains, and it is given by
the difference between its boundaries. If the classes of a distribution are all equal in length,
their common class interval is also given by the difference between any two successive class
marks.
PERCENTAGE DISTRIBUTION
To convert a distribution into a percentage distribution; divide each class frequency by the
total number of items grouped and then multiply by 100.
CUMULATIVE DISTRIBUTION
The other way of modifying a frequency distribution is to convert it into a “less than”,
“or less”, “more than” or “or more” cumulative distribution. To this end, we simply add the class
frequencies, starting either at the top or at the bottom of the distribution.
Note : In the same way; we can also convert a percentage distribution into a cumulative
percentage distribution. We simply add the percentages starting either at the top or at the
bottom of the distribution.
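Both conversions can be sketched in a few lines of Python. The class frequencies below are hypothetical values chosen for illustration, listed from the lowest class to the highest; the variable names are also assumptions.

```python
from itertools import accumulate

# Hypothetical class frequencies, lowest class first.
frequencies = [4, 10, 14, 25, 17, 8, 2]
total = sum(frequencies)

# Percentage distribution: each class frequency divided by the total
# number of items grouped, then multiplied by 100.
percentages = [100 * f / total for f in frequencies]

# "Or less" cumulative distribution: add the frequencies starting at the
# lowest class.
cumulative_or_less = list(accumulate(frequencies))

# "Or more" cumulative distribution: add the frequencies starting at the
# highest class, then restore the original class order.
cumulative_or_more = list(accumulate(reversed(frequencies)))[::-1]

# Cumulative percentage distribution: the same running sum applied to
# the percentages.
cumulative_percent = list(accumulate(percentages))

print(percentages)
print(cumulative_or_less)
print(cumulative_or_more)
print(cumulative_percent)
```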
GRAPHICAL PRESENTATION
When frequency distributions are constructed primarily to condense large sets of data and
display them in an easy to digest form, it is usually advisable to present them graphically.
A) HISTOGRAM
The most common form of graphical presentation of statistical data is the histogram. A
histogram is constructed by representing the measurements or observations that are grouped
on a horizontal scale, the class frequencies on a vertical scale, and drawing rectangles whose
bases equal the class interval and whose heights are determined by the corresponding class
frequencies. The markings on the horizontal scale can be the class limits, the class boundaries,
the class marks or arbitrary key values. For easy readability, it is usually better to indicate the
class limits, although the rectangles actually go from one class boundary to the next. Histograms
cannot be used in connection with frequency distributions having open classes, and they must
be used with extreme care if the class intervals are not all equal.
B) BAR GRAPHS
The heights of the rectangles, or bars, again represent the class frequencies, but there is no
pretense of having a continuous horizontal scale.
C) FREQUENCY POLYGON
The class frequencies are plotted at the class marks and the successive points are
connected by means of straight lines. Note that we added classes with zero frequencies
at both ends of distribution to “ tie down “ the graph to the horizontal scale.
D) OGIVE
E) PICTOGRAMS
Distributions are presented more dramatically and often effectively ( they are often
seen in newspapers, magazines and reports of various sorts).
F) PIE CHARTS
MEASURES OF LOCATION
a) Arithmetic mean ( $\bar{x}$ ) – the mean of a set of values is the sum of the values divided by
their number. ( In everyday language, the mean is often called the “average”. )
Sample mean:
$$ \bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum x}{n} $$
Similarly, population mean:
$$ \mu = \frac{\sum x}{N} $$
Properties:
b) Median ( $\tilde{x}$ ) – the median of a set of data is the value of the middle item, or the mean
of the values of the two middle items, when the data are arranged according to size.
c) Mode ( $\hat{x}$ ) – it is defined as the value which occurs with the highest frequency. Its two
main advantages are that it requires no calculations, only counting, and that it can be
determined even for qualitative or nominal data.
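The three measures of location can be illustrated with Python's standard `statistics` module. The sample values and the list of colors below are hypothetical, chosen only so that the results are easy to check by hand.

```python
import statistics

# Hypothetical sample, used only for illustration.
sample = [3, 5, 5, 6, 8, 9]

# Mean: the sum of the values divided by their number.
mean = sum(sample) / len(sample)

# Median: with an even number of items, the mean of the two middle
# items once the data are arranged according to size.
median = statistics.median(sample)

# Mode: the value that occurs with the highest frequency.
mode = statistics.mode(sample)

# The mode can be found even for nominal (qualitative) data.
colors = ["red", "blue", "red", "green"]
modal_color = statistics.mode(colors)

print(mean, median, mode, modal_color)
```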
WEIGHTED MEAN
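In general, the weighted mean $\bar{x}_w$ of a set of numbers x1, x2, x3, …, xn,
whose relative importance is expressed numerically by a corresponding set of numbers
w1, w2, w3, …, wn, is given by:
$$ \bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + \cdots + w_n x_n}{w_1 + w_2 + w_3 + \cdots + w_n} = \frac{\sum w x}{\sum w} $$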
If all the weights are equal, the formula reduces to that of the ordinary arithmetic
mean. A special application of the formula for the weighted mean arises when we must
find the over-all mean, or grand mean, of k sets of data having the means
$\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots, \bar{x}_k$ and consisting of n1, n2, n3, …, nk
measurements or observations. The result is given by:
$$ \bar{\bar{x}} = \frac{n_1 \bar{x}_1}{n} + \frac{n_2 \bar{x}_2}{n} + \frac{n_3 \bar{x}_3}{n} + \cdots + \frac{n_k \bar{x}_k}{n} = \frac{\sum n_i \bar{x}_i}{n} $$
where n = n1 + n2 + n3 + … + nk is the total number of measurements or observations.
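A minimal sketch of the weighted mean and the grand mean follows, using the standard definition (the sum of weight times value, divided by the sum of the weights). The function name `weighted_mean` and all the numbers are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# With equal weights the result is the ordinary arithmetic mean.
equal_weight_result = weighted_mean([2, 4, 9], [1, 1, 1])

# Grand mean of k = 3 hypothetical sets of data: the weights are the
# numbers of observations in each set.
set_means = [70.0, 80.0, 90.0]
set_sizes = [20, 10, 10]
grand_mean = weighted_mean(set_means, set_sizes)

print(equal_weight_result, grand_mean)
```

Note that the grand mean is pulled toward the mean of the largest set, so it generally differs from the simple average of the k means.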
MEASURES OF VARIATION
Measures of variation are statistical measures of the extent to which data are
dispersed, or spread out.
Example:
Suppose that in a hospital each patient’s pulse rate is taken in the morning, at noon and
in the evening, and that on a certain day the pulse rate of patient A is 72, 76 and 74
while that of patient B is 72, 91 and 59. The mean pulse rate of both patients is the
same, 74. But observe the difference in variability: patient A’s pulse rate is
stable, whereas that of patient B fluctuates widely.
RANGE – the difference between the largest and the smallest values of a set of data. The range is
easy to calculate and easy to understand, but despite these advantages, it is generally
not a very useful measure of variation. Its main shortcoming is that it tells us nothing
about the dispersion of the values which fall between the two extremes.
Example:
Sample 1 : 6 18 18 18 18 18 18 18 18 18
Sample 2: 6 6 6 6 6 18 18 18 18 18
Sample 3: 6 7 9 11 12 14 15 16 17 18
All three samples have the same range, R = 18 – 6 = 12, but the dispersion is quite different in each case.
In some cases, when the sample size is quite small, the range can be an adequate measure
of variation. For instance, it is used widely in industrial quality control to keep a close check on
the consistency of raw materials or products, or on the uniformity of a process, on the basis of
small samples taken at regular intervals of time.
STANDARD DEVIATION – the square root of the mean of the squared deviations from the mean.
Expressing literally what is done mathematically, it is also called the Root-Mean-Square
Deviation. Nowadays, it is customary to modify this formula by dividing the sum of the
squared deviations from the mean by ( n – 1 ) instead of n. Therefore:
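$$ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \qquad \text{( sample standard deviation )} $$
and its square, the Sample Variance:
$$ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \qquad \text{( sample variance )} $$
Equivalently, in a form suited to computation:
$$ s = \sqrt{\frac{n\left(\sum x^2\right) - \left(\sum x\right)^2}{n(n - 1)}} $$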
Rule:
1) Find $\bar{x}$.
2) Determine the n deviations from the mean, $x - \bar{x}$.
3) Square each deviation.
4) Add all the squared deviations.
5) Divide by ( n – 1 ).
6) Take the square root of the result obtained in step 5.
$$ \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \qquad \text{( population standard deviation )} $$
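The six steps of the rule can be traced in Python as a sketch. The data are Sample 3 from the range example; the variable names are hypothetical, and the `statistics` module is used only as a cross-check.

```python
import math
import statistics

# Sample 3 from the range example.
sample = [6, 7, 9, 11, 12, 14, 15, 16, 17, 18]
n = len(sample)

mean = sum(sample) / n                        # step 1: find the mean
deviations = [x - mean for x in sample]       # step 2: deviations from the mean
squared = [d ** 2 for d in deviations]        # step 3: square each deviation
total = sum(squared)                          # step 4: add the squared deviations
variance = total / (n - 1)                    # step 5: divide by (n - 1)
s = math.sqrt(variance)                       # step 6: take the square root

# Computing formula: the same s from the sums of x and of x squared.
sum_x = sum(sample)
sum_x2 = sum(x * x for x in sample)
s_shortcut = math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

# If the ten values were instead a whole population, we would divide by N.
sigma = math.sqrt(total / n)

print(round(s, 4), round(s_shortcut, 4), round(sigma, 4))
print(math.isclose(s, statistics.stdev(sample)))  # cross-check with the library
```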
APPLICATION
CHEBYSHEV’s THEOREM
For any set of data ( population or sample ) and any constant k greater than 1, at least
$$ 1 - \frac{1}{k^2} $$
of the data must lie within k standard deviations on either side of the mean, where
$$ k \cdot s = |\bar{x} - x| $$
In general, if x is a measurement belonging to a set of data having the mean 𝑥̅ and the
standard deviation s , then its value in standard units, denoted by z is:
$$ z = \frac{x - \bar{x}}{s} $$
COEFFICIENT OF VARIATION
Expresses the standard deviation as a percentage of what is being measured, at least
on the average.
$$ v = \frac{s}{\bar{x}} \times 100 $$
Example:
1) If all the 1-lb cans of coffee filled by a food processor have a mean weight of 16 ounces
with a standard deviation of 0.02 ounce, at least what percentage of the cans must
contain between 15.8 and 16.2 ounces of coffee?
2) Suppose that the final examination in a French course consists of two parts, vocabulary
and grammar, and that a certain student got 66 points in the vocabulary part and 80
points in the grammar part. In which part did the student do better relative to the
rest of the class, if all the students in the class averaged 51 in the
vocabulary part with a standard deviation of 12 and averaged 72 in the grammar part
with a standard deviation of 16?
3) In recent months, the price of sirloin steak averaged $ 2.87 with a standard deviation of
$ 0.13 and the price of T-bone steak averaged $ 3.90 with a standard deviation of $ 0.16.
For which of these two cuts of beef is the price relatively more variable?
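One possible way to work the three examples is sketched below, applying Chebyshev’s theorem, standard units and the coefficient of variation in turn. The variable names are hypothetical; the figures are those given in the examples.

```python
# Example 1: Chebyshev's theorem.
mean_weight, sd_weight = 16.0, 0.02
k = (16.2 - mean_weight) / sd_weight     # 15.8 and 16.2 lie k = 10 s.d. from the mean
least_fraction = 1 - 1 / k ** 2          # at least this fraction lies within k s.d.

# Example 2: standard units (z-scores) make the two parts comparable.
z_vocabulary = (66 - 51) / 12
z_grammar = (80 - 72) / 16

# Example 3: coefficient of variation, the s.d. as a percentage of the mean.
v_sirloin = 0.13 / 2.87 * 100
v_tbone = 0.16 / 3.90 * 100

print(f"Example 1: at least {least_fraction:.2%} of the cans")
print(f"Example 2: z = {z_vocabulary} (vocabulary), z = {z_grammar} (grammar)")
print(f"Example 3: v = {v_sirloin:.2f}% (sirloin), v = {v_tbone:.2f}% (T-bone)")
```

Read this way, the student stands higher relative to the class in vocabulary, which has the larger z, and the sirloin price is relatively more variable, since it has the larger v.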
As we have already seen, the grouping of data entails some loss of information. Each item
loses its identity, so to speak; we only know how many items there are in each class. In the
case of the mean and the standard deviation, we can usually get good approximations by
assigning to each item falling into a class, the value of the class mark.
To write general formulas for the mean and the standard deviation of a distribution with k
classes, let us denote the successive class marks by x1, x2, x3, …, xk and the
corresponding class frequencies by f1, f2, f3, …, fk. Then the sum of all the
measurements or observations is given by $\sum x f$, the sum of their squares is given by
$\sum x^2 f$, and, with $n = \sum f$, the formula for $\bar{x}$ and the corresponding formula for s can be written as:
$$ \bar{x} = \frac{\sum x f}{n} $$
$$ s = \sqrt{\frac{n\left(\sum x^2 f\right) - \left(\sum x f\right)^2}{n(n - 1)}} $$
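As a sketch, the grouped-data mean and standard deviation can be computed by treating every item in a class as if it equalled the class mark, using the computing form of the sample standard deviation with each term weighted by its class frequency. The function name and the small distribution below are hypothetical.

```python
import math
import statistics


def grouped_mean_sd(class_marks, frequencies):
    """Mean and sample standard deviation of a frequency distribution."""
    n = sum(frequencies)
    sum_xf = sum(x * f for x, f in zip(class_marks, frequencies))
    sum_x2f = sum(x * x * f for x, f in zip(class_marks, frequencies))
    mean = sum_xf / n
    sd = math.sqrt((n * sum_x2f - sum_xf ** 2) / (n * (n - 1)))
    return mean, sd


# Hypothetical distribution with three classes.
class_marks = [10, 20, 30]
frequencies = [1, 2, 1]
mean, sd = grouped_mean_sd(class_marks, frequencies)

# Cross-check: expand the distribution into individual items.
expanded = [x for x, f in zip(class_marks, frequencies) for _ in range(f)]
print(mean, sd, statistics.stdev(expanded))
```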
CODING
When the calculations are tedious, we can simplify them by coding the class marks so that we have
smaller numbers to work with. When the class intervals are all equal, this coding consists of
assigning the value zero to one of the class marks ( preferably at or near the center of the
distribution ) and representing all the class marks by means of successive integers. ( For
instance, if a distribution has nine classes and the class mark of the middle class is assigned the
value zero, the successive class marks of the distribution are assigned the values -4, -3, -2, -1, 0,
1, 2, 3 and 4. )
Of course, when we code the class marks in this way, we must account for it in the formulas
for the mean and the standard deviation. Referring to the new(coded) class marks as u’s, we
write:
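$$ \bar{x} = x_0 + \frac{\sum u f}{n} \cdot c $$
$$ s = c \sqrt{\frac{n\left(\sum u^2 f\right) - \left(\sum u f\right)^2}{n(n - 1)}} $$
where:
$x_0$ - class mark in the original scale to which we assign zero in the new scale.
c - the common class interval.
n - the total number of items grouped.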
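The coding method can be checked against the uncoded computation with a short sketch. The distribution below (five classes with class interval 10) is hypothetical, as are the variable names.

```python
import math

# Hypothetical distribution: equal class intervals of length c = 10.
class_marks = [10, 20, 30, 40, 50]
frequencies = [2, 5, 8, 4, 1]
c = 10
x0 = 30                                   # class mark assigned the value zero
u = [(x - x0) // c for x in class_marks]  # coded class marks: -2, -1, 0, 1, 2

n = sum(frequencies)
sum_uf = sum(ui * f for ui, f in zip(u, frequencies))
sum_u2f = sum(ui * ui * f for ui, f in zip(u, frequencies))

# Mean and standard deviation from the coded class marks.
mean_coded = x0 + (sum_uf / n) * c
sd_coded = c * math.sqrt((n * sum_u2f - sum_uf ** 2) / (n * (n - 1)))

# The same quantities from the original class marks, for comparison.
sum_xf = sum(x * f for x, f in zip(class_marks, frequencies))
sum_x2f = sum(x * x * f for x, f in zip(class_marks, frequencies))
mean_direct = sum_xf / n
sd_direct = math.sqrt((n * sum_x2f - sum_xf ** 2) / (n * (n - 1)))

print(mean_coded, mean_direct)
print(sd_coded, sd_direct)
```

The coded sums involve much smaller numbers, which is the whole point of the method when the work is done by hand.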
MEDIAN OF A DISTRIBUTION
Once a set of data has been grouped, we cannot find the exact value of the median
because of the loss of information which results from the act of grouping. So, we define the
median as follows:
“The median of a distribution is the number which is such that half the total area of the
rectangles of the histogram of the distribution lies to its left and the other half lies to its right.”
In general, if L is the lower boundary of the class into which the median must fall, f is its
frequency, c is the class interval and j is the number of items we still lack when we reach L, then
the median of the distribution is given by:
$$ \tilde{x} = L + \frac{j}{f} \cdot c $$
Also, we can find the median of a distribution by starting to count at the other end
( beginning with the largest values ) and subtracting an appropriate fraction of the class interval
from the upper boundary U of the class into which the median must fall. The corresponding formula is
given as:
$$ \tilde{x} = U - \frac{j}{f} \cdot c $$
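Both ways of locating the median of a distribution can be sketched as follows. The function names, the class boundaries and the frequencies are hypothetical; j is taken to be the number of items still lacking to reach half the total when the boundary of the median class is reached.

```python
import math


def median_from_below(lower_boundaries, frequencies, c):
    """Median = L + (j / f) * c, counting up from the smallest values."""
    half = sum(frequencies) / 2
    below = 0  # items counted before reaching the current class
    for L, f in zip(lower_boundaries, frequencies):
        if below + f >= half:      # the median falls into this class
            j = half - below       # items still lacking when we reach L
            return L + (j / f) * c
        below += f


def median_from_above(upper_boundaries, frequencies, c):
    """Median = U - (j / f) * c, counting down from the largest values."""
    half = sum(frequencies) / 2
    above = 0  # items counted before reaching the current class
    for U, f in zip(reversed(upper_boundaries), reversed(frequencies)):
        if above + f >= half:
            j = half - above       # items still lacking when we reach U
            return U - (j / f) * c
        above += f


# Hypothetical distribution with class boundaries 0-10, 10-20, 20-30, 30-40.
lower = [0, 10, 20, 30]
upper = [10, 20, 30, 40]
freq = [2, 5, 9, 4]

m_below = median_from_below(lower, freq, 10)
m_above = median_from_above(upper, freq, 10)
print(m_below, m_above)
```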
a) Quartiles ( Q1 up to Q4 )
b) Deciles ( D1 up to D10 )
c) Percentiles ( P1 up to P100 )
PROBABILITY DISTRIBUTION
Random Variables – are usually classified according to the number of values which they may
assume. Strictly speaking, random variables are functions and not variables.
Example:
RULE:
1) Since the values of a probability distribution are probabilities, they must be numbers on
the interval from zero to one.
2) Since a random variable has to take on one of its values, the sum of all the values of a
probability distribution must be equal to one.
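The two rules translate directly into a small check, sketched here with a hypothetical helper name. The first list is the distribution of the number of heads in three tosses of a balanced coin; the other two are deliberately invalid.

```python
import math


def is_probability_distribution(probabilities):
    """Check rule 1 (each value in [0, 1]) and rule 2 (values sum to one)."""
    in_range = all(0 <= p <= 1 for p in probabilities)
    sums_to_one = math.isclose(sum(probabilities), 1.0)
    return in_range and sums_to_one


heads_in_three_tosses = [1 / 8, 3 / 8, 3 / 8, 1 / 8]
print(is_probability_distribution(heads_in_three_tosses))  # satisfies both rules
print(is_probability_distribution([0.5, 0.6]))             # violates rule 2
print(is_probability_distribution([1.2, -0.2]))            # violates rule 1
```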
BINOMIAL DISTRIBUTION
Assumption:
The number of trials is fixed; the probability of a success is the same for each trial; and
the trials are all independent( that is, what happens in any one trial does not affect the
probability of a success in any other trial ).
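The probability of getting x successes in n independent trials is:
$$ f(x) = \binom{n}{x} p^x (1 - p)^{n - x} \quad \text{for } x = 0, 1, 2, 3, \ldots, n $$
where p is the probability of a success on each trial.
HYPERGEOMETRIC DISTRIBUTION
When n items are chosen at random, without replacement, from a set containing a items of one
kind ( successes ) and b items of another kind ( failures ), the probability of getting x successes is:
$$ f(x) = \frac{\binom{a}{x} \binom{b}{n - x}}{\binom{a + b}{n}} \quad \text{for } x = 0, 1, 2, 3, \ldots, n $$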
POISSON DISTRIBUTION
If n is large and p is small, binomial probabilities are often approximated by means of the
formula:
$$ f(x) = \frac{(np)^x \, e^{-np}}{x!} \quad \text{for } x = 0, 1, 2, 3, \ldots, n $$
Example:
1) If the probability is 0.8 that a cleaning fluid will remove any one spot, what is the
probability that it will remove exactly six out of eight spots?
2) A mailroom clerk is supposed to send six of fifteen packages to Europe by airmail, but he
gets them all mixed up and randomly puts airmail postage on six of the packages. What
is the probability that only three of the packages which are supposed to go by airmail
will go by airmail?
3) Out of 2500 cars that pass along EDSA, 2% cause traffic due to flat tires. What
is the probability that at most 5 of these cars cause traffic on EDSA?
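One possible reading of the three examples is sketched below: the first as a binomial probability, the second as a hypergeometric probability (sampling without replacement, taking "only three" to mean exactly three), and the third through the Poisson approximation with np = 50. The function names are hypothetical, and the formulas used are the standard ones for these distributions.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """Probability of x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def hypergeometric_pmf(x, n, a, b):
    """Probability of x successes when n items are drawn without
    replacement from a successes and b failures."""
    return comb(a, x) * comb(b, n - x) / comb(a + b, n)


def poisson_pmf(x, mean):
    """Poisson probability with the given mean (np in the approximation)."""
    return mean ** x * exp(-mean) / factorial(x)


# Example 1: exactly six of eight spots removed, p = 0.8.
p_spots = binomial_pmf(6, 8, 0.8)

# Example 2: a = 6 airmail packages, b = 9 others, n = 6 stamped, x = 3.
p_airmail = hypergeometric_pmf(3, 6, 6, 9)

# Example 3: at most 5 cars, n = 2500 and p = 0.02, so np = 50.
p_flat_tires = sum(poisson_pmf(x, 2500 * 0.02) for x in range(6))

print(p_spots, p_airmail, p_flat_tires)
```

With np as large as 50, the probability of at most 5 occurrences in the third example comes out extremely small.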
If a random variable takes on the values x1, x2, x3, …, xk with the probabilities
f(x1), f(x2), f(x3), …, f(xk), its expected value is given by:
$$ x_1 f(x_1) + x_2 f(x_2) + x_3 f(x_3) + \cdots + x_k f(x_k) $$
and it is customary to refer to this quantity as the Mean of the Random Variable or the Mean of
its Distribution. Using the ∑ notation, we write:
$$ \mu = \sum x \, f(x) $$
where μ is the mean of the distribution, and
$$ \mu = \frac{na}{a + b} \qquad \text{( mean of hypergeometric distribution )} $$
For probability distribution, we measure variability in almost the same way, but instead of
averaging the squared deviations from the mean, we calculate their expected value. If x is a
value of some random variable whose probability distribution has the mean 𝜇, the deviation
from the mean is x – 𝜇 and we define the variance of the probability distribution as the
expected value of the squared deviation from the mean, namely as:
$$ \sigma^2 = \sum (x - \mu)^2 f(x) $$
The square root of the variance defines the Standard Deviation of a Probability Distribution and
we write:
$$ \sigma = \sqrt{\sum (x - \mu)^2 f(x)} $$
We can also write:
$$ \sigma^2 = \sum x^2 f(x) - \left( \sum x \, f(x) \right)^2 $$
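The mean, variance and standard deviation of a probability distribution can be computed as in the following sketch. The distribution used (number of heads in three tosses of a balanced coin) and the variable names are illustrative assumptions; exact fractions are used so that the two forms of the variance can be compared without rounding.

```python
import math
from fractions import Fraction

# Number of heads in three tosses of a balanced coin.
values = [0, 1, 2, 3]
probs = [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]

# Mean: the expected value, sum of x * f(x).
mu = sum(x * f for x, f in zip(values, probs))

# Variance: the expected value of the squared deviation from the mean.
variance = sum((x - mu) ** 2 * f for x, f in zip(values, probs))

# Alternative form: sum of x^2 * f(x) minus the square of the mean.
variance_alt = sum(x * x * f for x, f in zip(values, probs)) - mu ** 2

# Standard deviation: the square root of the variance.
sigma = math.sqrt(variance)

print(mu, variance, variance_alt, sigma)
```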
CONTINUOUS DISTRIBUTION
Continuous curves are the graphs of functions called probability densities or, informally,
continuous distributions. A probability density is characterized by the fact that:
“The area under the curve between any two values a and b gives the probability that a
random variable having the continuous distribution will take on a value on the interval from a
to b.”
NORMAL DISTRIBUTION
The graph of a normal distribution is a bell-shaped curve that extends indefinitely in both
directions; the curve comes closer and closer to the horizontal axis without ever reaching it, no
matter how far we go in either direction away from the mean.
An important feature of the normal distribution is that its mathematical equation is such
that we can determine the area under the curve between any two points on the horizontal
scale if we know its mean and its standard deviation; in other words, there is one and only
one normal distribution with a given mean 𝜇 and a given standard deviation 𝜎.
In practice, we find areas under the graphs of normal distributions, or simply areas under
normal curves, in special tables. Since it is physically impossible, and also unnecessary, to construct
separate tables of normal-curve areas for all conceivable pairs of values of 𝜇 and 𝜎, we
tabulate these areas only for the normal distribution with 𝜇 = 0 and 𝜎 = 1, called the Standard
Normal Distribution. Then, we obtain areas under any normal curve by performing the change
of scale which converts the units of measurement from the original scale, or x-scale, into
standard units, standard scores or z-scores by means of the formula:
$$ z = \frac{x - \mu}{\sigma} $$
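The change of scale can be sketched in Python, with the standard library's `statistics.NormalDist` (Python 3.8 or later) standing in for the printed table of standard normal areas. The function name `normal_area` and the numbers (mean 100, standard deviation 15) are hypothetical.

```python
from statistics import NormalDist

# The standard normal distribution (mean 0, standard deviation 1) plays
# the role of the table of normal-curve areas.
standard_normal = NormalDist(mu=0, sigma=1)


def normal_area(a, b, mu, sigma):
    """Area under the normal curve with the given mean and standard
    deviation between a and b, found by converting to z-scores."""
    z_a = (a - mu) / sigma
    z_b = (b - mu) / sigma
    return standard_normal.cdf(z_b) - standard_normal.cdf(z_a)


# Hypothetical example: the area within one standard deviation of the mean.
area = normal_area(85, 115, 100, 15)
print(round(area, 4))
```

Because only the z-scores enter the lookup, the same function serves for any pair of values of the mean and the standard deviation.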