EDA Lecture 8
Learning Objectives
At the end of the lecture, the student is expected to be able to understand and do the following:
A probability density function f(x) can be used to describe the probability distribution of a continuous
random variable X. For a complete characterization of a continuous random variable, it is necessary and
sufficient to know the probability density function of the random variable. The probability that X is between a
and b is determined as the integral of f(x) from a to b.
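For a continuous random variable X, a probability density function is a function f(x) such that
(1) $f(x) \ge 0$
(2) $\int_{-\infty}^{\infty} f(x)\,dx = 1$
(3) $P(a \le X \le b) = \int_a^b f(x)\,dx$, the area under f(x) from a to b, for any a and b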
A probability density function provides a simple description of the probabilities associated with a random
variable. As long as f(x) is non-negative and $\int_{-\infty}^{\infty} f(x)\,dx = 1$, it follows that
0 ≤ P(a < X < b) ≤ 1, so that the probabilities are properly restricted. A probability density function is zero
for x values that cannot occur, and it is assumed to be zero wherever it is not specifically defined.
A histogram is an approximation to a probability density function. For each interval of the histogram, the area
of the bar equals the relative frequency (proportion) of the measurements in the interval. The relative frequency
is an estimate of the probability that a measurement falls in the interval. Similarly, the area under f(x) over any
interval equals the true probability that a measurement falls in the interval.
The important point is that f(x) is used to calculate an area that represents the probability that X assumes a
value in [a, b].
For a continuous random variable, any single exact value has probability zero, because the area under f(x) above
a single point is zero. In practice, however, when a particular current measurement is observed, such as 14.47
milliamperes, this result can be interpreted as the rounded value of a current measurement that is actually in a
range such as 14.465 ≤ x ≤ 14.475. Therefore, the probability that the rounded value 14.47 is observed as the
value for X is the probability that X assumes a value in the interval [14.465, 14.475], which is not zero.
Because each point has zero probability, one need not distinguish between inequalities such as < or ≤ for
continuous random variables.
Example 8.1
Let the continuous random variable X denote the current measured in a thin copper wire in milliamperes.
Assume that the range of X is [0, 20 mA], and assume that the probability density function of X is f(x) = 0.05 for
0 ≤ x ≤ 20. What is the probability that a current measurement is less than 10 milliamperes?
Solution
$P(X < 10) = \int_0^{10} f(x)\,dx = \int_0^{10} 0.05\,dx = 0.5$
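Suppose X is a continuous random variable with probability density function f(x). The mean or expected value of
X, denoted as μ or E(X), is
$\mu = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$
The variance of X, denoted as V(X) or σ², is
$\sigma^2 = V(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2$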
Example 8.2
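For the copper current measurement in Example 8.1, the mean of X is
$E(X) = \int_0^{20} x f(x)\,dx = \left. \frac{0.05x^2}{2} \right|_0^{20} = 10$
and the variance of X is
$V(X) = \int_0^{20} (x - 10)^2 f(x)\,dx = \left. \frac{0.05(x - 10)^3}{3} \right|_0^{20} = 33.33$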
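As a numerical check on Examples 8.1 and 8.2, the integrals can be approximated with a simple midpoint rule. This is only a sketch: the helper name `integrate` and the number of subintervals are illustrative choices, not part of the lecture.

```python
def integrate(g, a, b, n=10_000):
    """Approximate the integral of g over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))


def density(x):
    """Density from Example 8.1: 0.05 on [0, 20] and zero elsewhere."""
    return 0.05 if 0 <= x <= 20 else 0.0


total_area = integrate(density, 0, 20)        # a valid density integrates to 1
p_less_than_10 = integrate(density, 0, 10)    # P(X < 10)
mean = integrate(lambda x: x * density(x), 0, 20)                    # E(X)
variance = integrate(lambda x: (x - mean) ** 2 * density(x), 0, 20)  # V(X)

print(f"total area = {total_area:.4f}")
print(f"P(X < 10)  = {p_less_than_10:.4f}")
print(f"E(X)       = {mean:.4f}")
print(f"V(X)       = {variance:.4f}")
```

The same helper works for any density that can be written as a Python function, as long as the integration limits cover the range where the density is non-zero.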
The most widely used model for the distribution of a random variable is a normal distribution. Whenever a
random experiment is replicated, the random variable that equals the average (or total) result over the replicates
tends to have a normal distribution as the number of replicates becomes large.
Many continuous random variables have distributions that are bell-shaped; these are called approximately normally
distributed variables. Such a distribution is also known as the bell curve or the Gaussian distribution.
When the data values are evenly distributed about the mean, the distribution is said to be symmetrical. When the
majority of the values fall to the left or right of the mean, the distribution is said to be skewed. Figures 8.1 a, b,
and c show the different forms of distribution.
The tail of the curve indicates the direction of skewness (a tail to the right is positive skew, a tail to the left
is negative skew).
Figure 8.1 Skewness of the distribution curve
Random variables with different means and variances can be modeled by normal probability density functions
with appropriate choices of the center and width of the curve. The value of E(X) = µ determines the center of
the probability density function and the value of V(X) = σ² determines the width. Figure 8.2 illustrates several
normal probability density functions with selected values of µ and σ². Each has the characteristic symmetric
bell-shaped curve, but the centers and dispersions differ.
Figure 8.2 Normal probability density functions for selected values of the parameters µ and σ².
The following definition provides the formula for normal probability density functions.
A random variable X with probability density function
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$, $-\infty < x < \infty$
is a normal random variable with parameters µ and σ, where −∞ < µ < ∞ and σ > 0.
Also, E(X) = µ and V(X) = σ², and the notation N(µ, σ²) is used to denote the distribution.
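The normal density can be evaluated directly. The sketch below assumes the usual form f(x) = exp(-(x - µ)² / (2σ²)) / (σ√(2π)); the function name `normal_pdf` is illustrative, and the result is compared with `NormalDist.pdf` from Python's standard `statistics` module.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma (sigma > 0)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coefficient * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


# The curve is centered at mu and is symmetric about it.
peak = normal_pdf(0.0, mu=0.0, sigma=1.0)
left = normal_pdf(8.0, mu=10.0, sigma=2.0)
right = normal_pdf(12.0, mu=10.0, sigma=2.0)

# Cross-check against the standard library implementation.
library_value = NormalDist(mu=10.0, sigma=2.0).pdf(12.0)

print(f"peak of N(0, 1)      = {peak:.4f}")
print(f"f(8) and f(12)       = {left:.4f}, {right:.4f}")
print(f"NormalDist.pdf value = {library_value:.4f}")
```

A larger σ lowers the peak and widens the curve, which is the behavior Figure 8.2 describes.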
To check whether a distribution is normal or approximately normal, the following steps are used:
a. Draw a histogram for the data and check its shape. If the histogram is not approximately bell-shaped, then
the data are not normally distributed.
b. Check the skewness of the data by using the Pearson coefficient of skewness (PC), also called Pearson’s index
of skewness:
$PC = \frac{3(\bar{x} - \text{median})}{s}$
If PC > +1 (positively skewed) or PC < -1 (negatively skewed), it can be concluded that the data are
significantly skewed.
c. Check for outliers. One or more outliers can affect the normality.
Example 8.5
A survey of 18 high-technology firms showed the number of days’ inventory they had on hand. Determine whether
the data are approximately normally distributed.
5 29 34 44 45 63 68 74 74
Solution
The histogram is approximately bell-shaped, so we can conclude that the distribution is approximately normal.
-Check for skewness: For this data set, x̄ = 79.5, median = 77.5, and s = 40.5. Therefore,
$PC = \frac{3(79.5 - 77.5)}{40.5} = 0.148$
Since PC lies between -1 and +1, the distribution is not significantly skewed.
-Check for outliers: Q1 = 45, Q3 = 98, and IQR = Q3 - Q1 = 53. An outlier will be a data value less than
45 - 1.5(53) = -34.5 or a data value larger than 98 + 1.5(53) = 177.5.
In this case, there are no outliers. Overall, the distribution is approximately normal.
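The skewness and outlier checks in this example can be reproduced from the summary statistics quoted in the solution. The function names below are illustrative; only the summary values (mean, median, s, Q1, Q3) come from the example.

```python
def pearson_skewness(mean, median, std_dev):
    """Pearson coefficient of skewness: PC = 3(mean - median) / s."""
    return 3 * (mean - median) / std_dev


def outlier_fences(q1, q3):
    """Return (lower, upper) fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


pc = pearson_skewness(mean=79.5, median=77.5, std_dev=40.5)
lower_fence, upper_fence = outlier_fences(q1=45, q3=98)

# Significant skew is flagged when PC is above +1 or below -1.
significantly_skewed = pc > 1 or pc < -1

print(f"PC = {pc:.3f}, significantly skewed: {significantly_skewed}")
print(f"outliers lie below {lower_fence} or above {upper_fence}")
```

Any data value outside the two fences would be flagged as an outlier; values inside them are not.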
Since there can be thousands of normal distribution curves (due to the different mean and standard deviation of
variables), one would have to have a table of areas for each variable for practical applications. To simplify this
situation, the standard normal distribution is used.
The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1.
All normally distributed variables can be transformed into the standard normally distributed variable by using
the formula for the standard score:
$z = \frac{x - \mu}{\sigma}$
that is, the value minus the mean, divided by the standard deviation.
Once a normal distribution is transformed into the standard normal distribution, a single table of areas can be
used to solve practical application problems.
Example 8.6
Find the area to the left of z = 2.06
Solution
We are looking for the area under the standard normal distribution to the left of z = 2.06. Look up the area
between 0 and 2.06. From the standard normal distribution table (see the handout to be given in class), this area
is 0.4803. For the entire area to the left, add 0.5000. Therefore, the required area is 0.9803. Hence, 98.03% of
the area lies to the left of z = 2.06.
Example 8.7
Find the area to the right of z = -1.19
Solution
We are looking for the area to the right of z = -1.19. By symmetry, the area between -1.19 and 0 equals the area
between 0 and 1.19, which is 0.3830 (handout given in class). Adding the 0.5000 that lies to the right of 0, the
required area is 0.8830. Hence, 88.30% of the area under the standard normal distribution curve is to the right
of z = -1.19.
Example 8.8
Find the area between z = +1.68 and z = -1.37.
Solution
Look up the areas between 0 and 1.68 and between 0 and 1.37, and add them. The area between 0 and 1.68 is
0.4535 and that between -1.37 and 0 is 0.4147. Therefore, the required area is 0.4535 + 0.4147 = 0.8682.
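The table lookups in Examples 8.6 to 8.8 can be cross-checked with `NormalDist` from Python's standard `statistics` module. In this sketch, `table_area` is an illustrative helper that mimics a table giving the area between 0 and z; it is an assumption about the handout's layout, based on how the solutions use it.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def table_area(z):
    """Area between 0 and |z| under the standard normal curve."""
    return standard_normal.cdf(abs(z)) - 0.5


# Example 8.6: area to the left of z = 2.06
left_of_2_06 = 0.5 + table_area(2.06)

# Example 8.7: area to the right of z = -1.19
right_of_minus_1_19 = 0.5 + table_area(-1.19)

# Example 8.8: area between z = -1.37 and z = +1.68
between = table_area(1.68) + table_area(-1.37)

print(f"left of 2.06   = {left_of_2_06:.4f}")
print(f"right of -1.19 = {right_of_minus_1_19:.4f}")
print(f"between        = {between:.4f}")
```

The same answers follow from the cumulative form directly, for example `standard_normal.cdf(1.68) - standard_normal.cdf(-1.37)` for Example 8.8.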
8.8 Normal Distribution Curve as a Probability Distribution Curve
The normal distribution curve can be used as a probability distribution curve for normally distributed variables.
The area under the normal distribution curve corresponds to a probability. That is, if it were possible to select
a z value at random, the probability of choosing one between 0 and 2.00 would be the same as the area under
the curve between 0 and 2.00. In this case, the area is 0.4772. Therefore, the probability of selecting a z value
between 0 and 2.00 is 0.4772.
For probabilities, a special notation is used. For example, if one wants to find the probability of any z value
between 0 and 2.00, the probability is written as P(0 < z < 2.00).
Example 8.9
Find the probability for each.
a. P(0 < z < 2.32)
b. P(z < 1.65)
c. P(z > 1.91)
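Solution
a. The area between z = 0 and z = 2.32 is 0.4898. Therefore, P(0 < z < 2.32) = 0.4898.
b. The area between z = 0 and z = 1.65 is 0.4505. Adding 0.5000 for the area to the left of 0 gives
P(z < 1.65) = 0.9505.
c. The area between z = 0 and z = 1.91 is 0.4719. Therefore, P(z > 1.91) = 0.5000 - 0.4719 = 0.0281.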
Example 8.10
Each month, an American household generates an average of 28 pounds of newspaper for garbage or recycling.
Assume the standard deviation is 2 pounds and the amount generated is approximately normally distributed. If a
household is selected at random, find the probability of its generating
a. Between 27 and 31 pounds per month
b. More than 30.2 pounds per month
Solution
(a) The two z values are
$z_1 = \frac{27 - 28}{2} = -0.5$ and $z_2 = \frac{31 - 28}{2} = 1.5$
The area between z = 0 and z = -0.5 is 0.1915 and the area between z = 0 and z = 1.5 is 0.4332. Therefore, the
total area = 0.1915 + 0.4332 = 0.6247. Hence, the probability that a randomly selected household generates
between 27 and 31 pounds of newspaper per month is 62.47%.
(b) The z value for x = 30.2 is (30.2 - 28)/2 = 1.1. The area between z = 0 and z = 1.1 is 0.3643. Therefore,
the required area = 0.5000 - 0.3643 = 0.1357. Hence, the probability that a randomly selected household will
accumulate more than 30.2 pounds of newspaper is 13.57%.
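Both parts of Example 8.10 can be checked without converting to z values by hand, since `NormalDist` accepts the mean and standard deviation directly. This is a sketch using the example's figures (mean 28, standard deviation 2); the variable names are illustrative.

```python
from statistics import NormalDist

newspaper = NormalDist(mu=28, sigma=2)  # pounds per household per month

# (a) P(27 < X < 31): difference of two cumulative probabilities
p_between_27_and_31 = newspaper.cdf(31) - newspaper.cdf(27)

# (b) P(X > 30.2): complement of the cumulative probability
p_more_than_30_2 = 1 - newspaper.cdf(30.2)

# The standard scores used in the table-based solution
z1 = (27 - newspaper.mean) / newspaper.stdev
z2 = (31 - newspaper.mean) / newspaper.stdev

print(f"z1 = {z1}, z2 = {z2}")
print(f"P(27 < X < 31) = {p_between_27_and_31:.4f}")
print(f"P(X > 30.2)    = {p_more_than_30_2:.4f}")
```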
Example 8.11
The American Automobile Association reports that the average time it takes to respond to an emergency
call is 25 minutes. Assume the variable is approximately normally distributed and the standard deviation is 4.5
minutes. If 80 calls are randomly selected, approximately how many will be responded to in less than 15
minutes?
Solution
The z value for x = 15 is (15 - 25)/4.5 = -2.22. The area between z = 0 and z = -2.22 is 0.4868. Therefore, the
required area = 0.5000 - 0.4868 = 0.0132. The number of calls that will be responded to in less than 15 minutes
will be (80 calls)(0.0132) = 1.056. Hence, approximately 1 call will be responded to in less than 15 minutes.
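The expected number of calls in Example 8.11 is the sample size multiplied by the probability for a single call. A short sketch, assuming the example's mean of 25 minutes and standard deviation of 4.5 minutes; because the code does not round z to two decimals, its probability differs slightly from the table value.

```python
from statistics import NormalDist

response_time = NormalDist(mu=25, sigma=4.5)  # minutes

p_under_15 = response_time.cdf(15)   # P(X < 15) for one call
expected_calls = 80 * p_under_15     # expected count among 80 calls

print(f"P(X < 15) = {p_under_15:.4f}")
print(f"expected calls out of 80 = {expected_calls:.2f}")
print(f"rounded to whole calls   = {round(expected_calls)}")
```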
Example 8.12
In order to qualify for a police academy, candidates must score in the top 10% on a general abilities test. The
test has a mean of 200 and standard deviation of 20. Find the lowest possible score to qualify. Assume the test
scores are normally distributed.
Solution
The top 10%, or 0.1000, is the area under the normal distribution to the right of the cutoff test score x. The
area between z = 0 and the z value of the cutoff is therefore 0.5000 - 0.1000 = 0.4000. From the standard normal
distribution table, z = 1.28 gives a corresponding area of 0.3997 (≈ 0.4000). Substituting into the standard
score formula,
$1.28 = \frac{x - 200}{20}$
$x = 200 + 1.28(20) = 225.6 \approx 226$
A score of 226 will be used as the cutoff. Anyone who scores 226 or higher qualifies for the academy.
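Finding a cutoff score is the reverse of a table lookup: the area is known and the value is wanted. In Python this can be sketched with `NormalDist.inv_cdf`, which returns the value below which a given proportion of the distribution lies. Rounding up to a whole score is an assumption made here so that no more than 10% of candidates qualify.

```python
import math
from statistics import NormalDist

test_scores = NormalDist(mu=200, sigma=20)

# Top 10% means 90% of the scores lie below the cutoff.
exact_cutoff = test_scores.inv_cdf(0.90)
z_cutoff = (exact_cutoff - test_scores.mean) / test_scores.stdev

# Round up to the next whole score.
cutoff_score = math.ceil(exact_cutoff)

print(f"z at the cutoff = {z_cutoff:.2f}")
print(f"exact cutoff    = {exact_cutoff:.2f}")
print(f"cutoff score    = {cutoff_score}")
```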
Example 8.13
For a medical study, a researcher wishes to select people in the middle 60% of the population based on blood
pressure. If the mean systolic blood pressure is 120 and the standard deviation is 8, find the upper and lower
readings that would qualify people to participate in the study.
Solution
Since a middle area of 0.6000 is required, the cutoff readings will have an area of 0.3000 on each side of the
mean. The closest z value for an area of 0.3000 is 0.84. Therefore, z = ±0.84. Calculating the blood pressure
readings, we have
$0.84 = \frac{x_1 - 120}{8}$ and $-0.84 = \frac{x_2 - 120}{8}$
so x1 = 120 + 0.84(8) = 126.72 and x2 = 120 - 0.84(8) = 113.28.
Therefore, the middle 60% will have blood pressure readings of 113.28 < x < 126.72.
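The middle 60% leaves 20% in each tail, so the two limits are the 20th and 80th percentiles. A sketch with `NormalDist.inv_cdf` follows; it uses the unrounded z value of about 0.8416 rather than the table value 0.84, so the limits differ from the hand calculation in the second decimal place.

```python
from statistics import NormalDist

blood_pressure = NormalDist(mu=120, sigma=8)

lower_reading = blood_pressure.inv_cdf(0.20)  # 20% of readings lie below
upper_reading = blood_pressure.inv_cdf(0.80)  # 80% of readings lie below

# Confirm that the interval really captures the middle 60%.
middle_area = blood_pressure.cdf(upper_reading) - blood_pressure.cdf(lower_reading)

print(f"lower reading = {lower_reading:.2f}")
print(f"upper reading = {upper_reading:.2f}")
print(f"area between  = {middle_area:.4f}")
```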
The normal distribution is often used to solve problems involving the binomial distribution, since when n is
large (say, 100), the calculations are too difficult to do by hand using the binomial distribution.
Statisticians agree that the normal approximation should be used only when n.p and n.q are both greater than or
equal to 5. In addition, a correction for continuity may be used in the normal approximation. A correction for
continuity is a correction employed when a continuous distribution is used to approximate a discrete
distribution: each whole-number value x of the binomial variable is replaced by the interval from x - 0.5 to
x + 0.5. Table 8.1 summarizes the normal approximation to the binomial distribution.
The formulas for the mean and standard deviation for the binomial distribution are
$\mu = n \times p$ and $\sigma = \sqrt{n \times p \times q}$
Example 8.14
A magazine reported that 6% of American drivers read the newspaper while driving. If 300 drivers are selected
at random, find the probability that exactly 25 say they read the newspaper while driving.
Solution
Here, p = 0.06, q = 0.94, and n = 300
n.p = (300)(0.06) = 18 and n.q = (300)(0.94) = 282
Since both n.p and n.q are greater than 5, the normal distribution can be used.
Next, we write the problem in probability notation: P(X = 25). With the correction for continuity, the normal
approximation is P(24.5 < X < 25.5). The mean and standard deviation are
$\mu = (300)(0.06) = 18$ and $\sigma = \sqrt{(300)(0.06)(0.94)} = 4.11$
The z values are
$z_1 = \frac{24.5 - 18}{4.11} = 1.58$ and $z_2 = \frac{25.5 - 18}{4.11} = 1.82$
The area for z1 = 1.58 is 0.4429 and that for z2 = 1.82 is 0.4656. Therefore, the required area will be
0.4656 - 0.4429 = 0.0227. Hence the probability that exactly 25 drivers read the newspaper while driving is 2.27%.
Example 8.15
Of the members of a bowling league, 10% are widowed. If 200 bowling league members are selected at
random, find the probability that 10 or more will be widowed.
Solution
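Here, p = 0.10, q = 0.90, and n = 200, so n.p = 20 and n.q = 180. Both are greater than 5, so the normal
distribution can be used, with µ = 20 and σ = √((200)(0.10)(0.90)) = 4.24.
Problem in binomial probability notation: P(X ≥ 10). Convert to the normal distribution with the correction for
continuity: P(X > 9.5). The z value is
$z = \frac{9.5 - 20}{4.24} = -2.48$
The area between z = 0 and z = -2.48 is 0.4934. Therefore, required area = 0.4934 + 0.5000 = 0.9934.
The probability of 10 or more widowed people in a random sample of 200 bowling league members is 99.34%.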
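The normal approximation in Examples 8.14 and 8.15 can be compared with the exact binomial probabilities, which are tedious by hand but easy to compute. The helper names below are illustrative, and the n.p ≥ 5 and n.q ≥ 5 check follows the rule stated in this lecture. The code keeps full precision in σ and z, so its normal results can differ from the table-based answers in the fourth decimal place.

```python
import math
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """Exact binomial probability P(X = k) for n trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def normal_for_binomial(n, p):
    """Normal distribution with mu = n*p and sigma = sqrt(n*p*q) for a binomial count."""
    q = 1 - p
    if n * p < 5 or n * q < 5:
        raise ValueError("approximation not recommended: need n*p >= 5 and n*q >= 5")
    return NormalDist(mu=n * p, sigma=math.sqrt(n * p * q))


# Example 8.14: P(X = 25) becomes P(24.5 < X < 25.5)
drivers = normal_for_binomial(300, 0.06)
approx_exactly_25 = drivers.cdf(25.5) - drivers.cdf(24.5)
exact_exactly_25 = binomial_pmf(25, 300, 0.06)

# Example 8.15: P(X >= 10) becomes P(X > 9.5)
bowlers = normal_for_binomial(200, 0.10)
approx_10_or_more = 1 - bowlers.cdf(9.5)
exact_10_or_more = 1 - sum(binomial_pmf(k, 200, 0.10) for k in range(10))

print(f"8.14 normal approx = {approx_exactly_25:.4f}, exact = {exact_exactly_25:.4f}")
print(f"8.15 normal approx = {approx_10_or_more:.4f}, exact = {exact_10_or_more:.4f}")
```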
As you might expect, the distribution of X can be obtained from knowledge of the distribution of the number of
flaws. The key to the relationship is the following concept. The distance to the first flaw exceeds 3 millimeters
if and only if there are no flaws within a length of 3 millimeters—simple, but sufficient for an analysis of the
distribution of X.
In general, let the random variable N denote the number of flaws in x millimeters of wire. If the mean number of
flaws is λ per millimeter, N has a Poisson distribution with mean λx. We assume that the wire is longer than the
value of x.
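The concept above implies that P(X > x) = P(N = 0) = e^(-λx), where X is the distance to the first flaw, so the cumulative distribution of X is F(x) = 1 - e^(-λx). The sketch below checks this link numerically; the flaw rate of 0.5 per millimeter is a hypothetical value chosen only for illustration, and the function names are not from the lecture.

```python
import math


def poisson_pmf(k, mean):
    """Poisson probability of exactly k counts when the mean count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def prob_first_flaw_beyond(x, flaws_per_mm):
    """P(X > x): no flaws in x millimeters, i.e. Poisson P(N = 0) with mean lambda*x."""
    return poisson_pmf(0, flaws_per_mm * x)


def exponential_cdf(x, flaws_per_mm):
    """P(X <= x) = 1 - exp(-lambda * x) for the distance to the first flaw."""
    return 1 - math.exp(-flaws_per_mm * x)


rate = 0.5  # hypothetical flaw rate per millimeter (not from the lecture)
p_beyond_3mm = prob_first_flaw_beyond(3, rate)
p_within_3mm = exponential_cdf(3, rate)

print(f"P(X > 3)  = {p_beyond_3mm:.4f}")
print(f"P(X <= 3) = {p_within_3mm:.4f}")
```

The two probabilities add to 1, which is the statement that the first flaw lies either within or beyond 3 millimeters.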
8.12 Erlang Distribution
An exponential random variable describes the length until the first count is obtained in a Poisson process. A
generalization of the exponential distribution is the length until r counts occur in a Poisson process. The random
variable that equals the interval length until r counts occur in a Poisson process is an Erlang random
variable.
The Erlang distribution is a special case of the gamma distribution. If the parameter r of an Erlang random
variable is not an integer, but r > 0, the random variable has a gamma distribution. However, in the Erlang
density function, the parameter r appears in a factorial term, (r - 1)!, which is defined only for integer r.
Therefore, to define a gamma random variable, we require a generalization of the factorial function.
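The generalization in question is the gamma function Γ(r), which satisfies Γ(r) = (r - 1)! for positive integers r. The sketch below assumes the standard Erlang density f(x) = λ^r x^(r-1) e^(-λx) / (r - 1)! for x > 0, which is not stated explicitly in this section, and uses Python's `math.gamma` for Γ(r). The function names and the parameter values are illustrative.

```python
import math


def erlang_pdf(x, r, lam):
    """Erlang density for integer r >= 1 and rate lam > 0; taken as zero for x <= 0."""
    if x <= 0:
        return 0.0
    return lam ** r * x ** (r - 1) * math.exp(-lam * x) / math.factorial(r - 1)


def gamma_pdf(x, r, lam):
    """Same form with (r - 1)! replaced by the gamma function, valid for any r > 0."""
    if x <= 0:
        return 0.0
    return lam ** r * x ** (r - 1) * math.exp(-lam * x) / math.gamma(r)


# For integer r the gamma function reproduces the factorial.
gamma_of_5 = math.gamma(5)                     # equals 4! = 24
erlang_value = erlang_pdf(2.0, r=3, lam=1.5)
gamma_value = gamma_pdf(2.0, r=3, lam=1.5)     # same as the Erlang value

# A non-integer r is only possible with the gamma form.
gamma_noninteger = gamma_pdf(2.0, r=2.5, lam=1.5)

print(f"Gamma(5) = {gamma_of_5}")
print(f"Erlang f(2) = {erlang_value:.4f}, gamma f(2) = {gamma_value:.4f}")
print(f"gamma f(2) with r = 2.5: {gamma_noninteger:.4f}")
```

With r = 1 both functions reduce to the exponential density λe^(-λx), consistent with the exponential distribution being the case of one count.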