Unit 2
Since the number of possible values is relatively small—only 10—it’s appropriate to construct
a frequency distribution for ungrouped data.
Not all observations can be assigned to one and only one class (because of the gap between
20–22 and 25–30 and the overlap between 25–30 and 30–34). Not all classes are equal in
width (25–30 versus 30–34). Not all classes have both boundaries (35–above).
Outliers (Very Extreme Scores)
An outlier is a data point that is extremely high or extremely low relative to its nearest
neighbor and to the rest of the values in a graph or dataset.
Example
The value in the month of January is significantly less than in the other months.
4. Identify any outliers in each of the following sets of data collected from nine college students.
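1. Summer Income:
Mean = $6,330.00
Standard Deviation (sample, n − 1) = $7,787.32
Z-scores:
$6,450: 0.015
$4,820: -0.194
$5,650: -0.087
$1,720: -0.592
$600: -0.736
$0: -0.813
$3,482: -0.366
$25,700: 2.487
$8,548: 0.285
Outlier: $25,700 (z-score of about 2.49; every other value lies within one standard deviation
of the mean). With only nine observations, no z-score based on the sample standard deviation
can exceed 8/√9 ≈ 2.67, so a cut-off of 3 could never flag a value here.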
2. Family Size:
Mean = 5.33
Standard Deviation (sample, n − 1) = 4.97
Z-scores:
2: -0.670
4: -0.268
3: -0.469
6: 0.134
18: 2.546
2: -0.670
6: 0.134
3: -0.469
4: -0.268
Outlier: 18 (z-score of about 2.55; every other value lies within one standard deviation of
the mean)
4. GPA:
Mean = 3.08
Standard Deviation (sample, n − 1) = 0.59
Z-scores:
2.30: -1.319
4.00: 1.564
3.56: 0.818
2.89: -0.318
2.15: -1.573
3.01: -0.115
3.09: 0.021
3.50: 0.716
3.20: 0.207
No outliers: the largest absolute z-score is about 1.57, and no value stands apart from the rest.
Therefore, the outliers in the data are:
Summer Income: $25,700
Family Size: 18
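The z-score check for outliers can be sketched in Python. This is a minimal sketch, not a definitive method: it assumes the sample standard deviation (n − 1 in the denominator) and a cut-off of |z| > 2, neither of which is fixed by these notes, and the function names are made up for illustration. With a different standard deviation convention or cut-off, the figures will differ.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return (mean, sample SD, list of z-scores) for a list of numbers."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return m, s, [(x - m) / s for x in values]

def flag_outliers(values, cutoff=2.0):
    """Return the values whose |z| exceeds the cutoff (an assumed choice)."""
    _, _, zs = z_scores(values)
    return [x for x, z in zip(values, zs) if abs(z) > cutoff]

# The three data sets from the problem above
summer_income = [6450, 4820, 5650, 1720, 600, 0, 3482, 25700, 8548]
family_size = [2, 4, 3, 6, 18, 2, 6, 3, 4]
gpa = [2.30, 4.00, 3.56, 2.89, 2.15, 3.01, 3.09, 3.50, 3.20]

for name, data in [("Summer Income", summer_income),
                   ("Family Size", family_size),
                   ("GPA", gpa)]:
    print(name, "outliers:", flag_outliers(data))
```

A z-score cut-off is only one way to flag outliers; in small samples, inspecting the ordered values directly is often just as informative.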
INTERPRETING DISTRIBUTIONS
In data science, interpreting distributions involves analyzing the patterns and characteristics of
data sets to extract insights and make informed decisions.
GRAPHS
Data can be described clearly and concisely with the aid of a well-constructed frequency
distribution.
Graphs for quantitative data
For visualizing quantitative data, histograms and box plots are commonly used.
Histogram:
A histogram is a bar-type graph for quantitative data. The common boundaries between adjacent
bars emphasize the continuity of the data, as with continuous variables. It is a graphical
representation of the distribution of numerical data. It consists of a series of bars, where each bar
represents a range of values (bin) and the height of the bar indicates the frequency of data points
falling within that range. Histograms are useful for visualizing the shape, center, and spread of
the data distribution.
Features of histograms
Equal units along the horizontal axis (the X axis, or abscissa) reflect the various class
intervals of the frequency distribution.
Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in frequency.
(The units along the vertical axis do not have to be the same width as those along the
horizontal axis.)
The intersection of the two axes defines the origin at which both numerical scales equal 0.
Numerical scales always increase from left to right along the horizontal axis and from bottom
to top along the vertical axis.
The body of the histogram consists of a series of bars whose heights reflect the frequencies
for the various classes. Notice that adjacent bars in histograms have common boundaries that
emphasize the continuity of quantitative data for continuous variables. The introduction of
gaps between adjacent bars would suggest an artificial disruption in the data more
appropriate for discrete quantitative variables or for qualitative variables.
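The counting behind a histogram can be sketched in a few lines of Python. This is only an illustrative sketch: the weights are made-up values, the function names are hypothetical, and the bars are printed sideways as rows of `#` characters instead of being drawn vertically.

```python
def class_frequencies(data, low, width, n_classes):
    """Count how many observations fall in each equal-width class."""
    counts = [0] * n_classes
    for x in data:
        i = int((x - low) // width)   # index of the class containing x
        if 0 <= i < n_classes:
            counts[i] += 1
    return counts

def text_histogram(data, low, width, n_classes):
    """Return a sideways histogram: one row per class, one '#' per observation."""
    rows = []
    for i, f in enumerate(class_frequencies(data, low, width, n_classes)):
        lo = low + i * width
        hi = lo + width - 1           # upper limit for whole-number data
        rows.append(f"{lo}-{hi} | {'#' * f} ({f})")
    return "\n".join(rows)

# Made-up weights (in pounds), used only for illustration
weights = [135, 142, 147, 151, 153, 156, 158, 160,
           162, 165, 168, 171, 175, 183, 190]

print(text_histogram(weights, low=130, width=10, n_classes=7))
```

Because every class has the same width, the length of each row is directly comparable, which is the same reason the bars of a histogram can be compared by height.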
Frequency Polygon
An important variation on a histogram is the frequency polygon, or line graph: a line graph for
quantitative data that also emphasizes the continuity of continuous variables. Frequency
polygons may be constructed directly from frequency distributions.
Transformation of a histogram into a frequency polygon
1. Construct a Histogram: Start by creating a histogram to represent the frequency distribution
of the data. Divide the range of the data into intervals (bins) and count the number of data
points falling into each interval.
2. Identify Midpoints and Heights: For each bar in the histogram, identify the midpoint of the
interval and the height of the bar (representing the frequency or relative frequency of data
points in that interval).
3. Plot Points: Plot each midpoint on the horizontal axis, with its corresponding height on the
vertical axis. These points represent the tops of the bars in the histogram.
4. Connect the Points: Connect the points on the graph using straight line segments, from the
leftmost point to the rightmost point. To anchor the polygon to the horizontal axis, extend
each end to the midpoint of the first empty class beyond it, where the frequency is 0.
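The points of a frequency polygon can be computed directly from the class frequencies. The sketch below is a hedged illustration: the function name, the lower boundary of 129.5, and the frequencies are all made up, and it assumes the common textbook convention of anchoring the polygon to the horizontal axis with a zero-frequency point one class width beyond each end.

```python
def frequency_polygon_points(lower_boundary, width, frequencies):
    """Return (midpoint, frequency) pairs for a frequency polygon.

    lower_boundary is the lower boundary of the first class and width is
    the common class width. A zero-frequency point is added at the midpoint
    of the empty class on each side (an assumed anchoring convention).
    """
    points = [(lower_boundary - width / 2, 0)]            # left anchor
    for i, f in enumerate(frequencies):
        midpoint = lower_boundary + i * width + width / 2
        points.append((midpoint, f))
    right_mid = lower_boundary + len(frequencies) * width + width / 2
    points.append((right_mid, 0))                          # right anchor
    return points

# Made-up class frequencies for classes 130-139, 140-149, ..., 190-199
frequencies = [1, 2, 4, 4, 2, 1, 1]
for midpoint, f in frequency_polygon_points(129.5, 10, frequencies):
    print(midpoint, f)
```

Joining consecutive points with straight line segments, for example in a plotting library, gives the polygon.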
Example Problem
5. The following frequency distribution shows the annual incomes in dollars for a group of
college graduates.
a) Construct a histogram.
b) Construct a frequency polygon.
c) Is this distribution balanced or lopsided?
To determine if the distribution is balanced or lopsided, we typically look at the shape of the
histogram or frequency polygon. In this case, both the histogram and frequency polygon show
that the distribution is lopsided, with more data points concentrated on the left side (lower
income ranges) and fewer data points on the right side (higher income ranges). This suggests that
the distribution is positively skewed, meaning it has a longer tail on the right side. Thus, the
distribution is lopsided or skewed to the right.
6. The number of friends reported by Facebook users is summarized in the following
frequency distribution
a) Convert to a histogram.
b) Why would it not be possible to convert to a stem and leaf display?
It would not be possible to convert this distribution to a stem and leaf display because a stem
and leaf display is built from the individual observations, whereas a frequency distribution
reports only how many of the 200 users fall in each class. In addition, stem and leaf displays
are typically used for smaller datasets; with 200 data points, the display would be impractical
and hard to interpret.
Stem and Leaf Displays
Still another technique for summarizing quantitative data is a stem and leaf display. Stem and leaf
displays are ideal for summarizing distributions, such as that for weight data, without destroying
the identities of individual observations.
Selection of Stems
Stem values are not limited to units of 10. Depending on the data, identify the stem with one or
more leading digits that culminates in some variation on a stem value of 10, such as 1, 100, 1000,
or even .1, .01, .001, and so on.
7. Construct a stem and leaf display from the statistics:
The stem represents the tens digit of the weight.
The leaves represent the units digit of the weight.
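A stem and leaf display of this kind can be built with a short Python sketch. It assumes whole-number data, takes everything except the units digit as the stem (so three-digit weights get two-digit stems), and uses made-up weights; the function name is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem and leaf display for whole numbers.

    The stem is the value without its units digit; the leaf is the units digit.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    rows = []
    # Include stems with no leaves so gaps in the data stay visible
    for s in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in stems.get(s, []))
        rows.append(f"{s} | {leaves}")
    return rows

# Made-up weights (in pounds), used only for illustration
weights = [135, 142, 147, 151, 153, 156, 158, 160,
           162, 165, 168, 171, 175, 183, 190]

print("\n".join(stem_and_leaf(weights)))
```

Each original observation can be read back from the display (stem 15 with leaf 3 is 153), which is what distinguishes it from a frequency distribution.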
8. Construct a stem and leaf display for the following IQ scores obtained from a group of
four-year-old children
A GRAPH FOR QUALITATIVE (NOMINAL) DATA
For qualitative (nominal) data, a bar graph is often used to represent the frequency or count of
each category.
Bar Graph
Gaps between adjacent bars emphasize the discontinuous nature of the data. A bar graph, also
known as a bar chart, is a graphical representation of data where the length or height of bars
corresponds to the frequency, count, or other numerical measures of different categories or
groups.
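Counting categories and drawing a simple bar graph can be sketched as follows. This is an illustrative sketch only: the list of majors is made up, the function name is hypothetical, and the bars are drawn sideways as text, with a blank line between them to stand for the gaps between adjacent bars.

```python
from collections import Counter

def text_bar_graph(observations):
    """Return (counts, graph) where graph is a sideways text bar graph."""
    counts = Counter(observations)
    label_width = max(len(str(c)) for c in counts)
    rows = []
    for category, f in counts.items():
        rows.append(f"{str(category):<{label_width}} | {'#' * f} ({f})")
        rows.append("")   # blank row: the gap between adjacent bars
    return counts, "\n".join(rows).rstrip()

# Made-up nominal data, used only for illustration
majors = ["Business", "Biology", "Business", "Psychology",
          "Biology", "Business", "Art"]

counts, graph = text_bar_graph(majors)
print(graph)
```

Because the categories are nominal, the order of the bars carries no meaning; they could be listed alphabetically or by frequency without changing the information shown.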
9. Construct a bar graph for the data shown in the following table:
AVERAGES
Averages consist of numbers (or words) about which the data are, in some sense, centered. They
are often referred to as measures of central tendency. The several types of average yield numbers
or words that attempt to describe, most generally, the middle or typical value for a distribution.
This section focuses on three different measures of central tendency: the mode, median, and mean.
Each is a measure used in statistics to summarize a set of data points, and each has its special
uses, but the mean is the most important average in both descriptive and inferential statistics.
MODE
The mode reflects the value of the most frequently occurring score.
More Than One Mode
Distributions can have more than one mode (or no mode at all). Distributions with two obvious
peaks, even though they are not exactly the same height, are referred to as bimodal. Distributions
with more than two peaks are referred to as multimodal. The presence of more than one mode
might reflect important differences among subsets of data. For instance, the distribution of
weights for both male and female statistics students would most likely be bimodal, reflecting the
combination of two separate weight distributions—a heavier one for males and a lighter one for
females.
10. Determine the mode for the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60,
65, 63.
The retirement age 63 appears most frequently, occurring 4 times. So, the mode for this set of
retirement ages is 63.
11. The owner of a new car conducts six gas mileage tests and obtains the following results,
expressed in miles per gallon: 26.3, 28.7, 27.4, 26.6, 27.4, 26.9. Find the mode for these
data.
Here, the mileage 27.4 appears twice, which is more than any other value. So, the mode for this
set of gas mileage tests is 27.4 miles per gallon.
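Both mode problems can be checked with Python's standard library. The sketch below uses `statistics.multimode`, which returns every value tied for the highest frequency, so it also covers distributions with more than one mode; the variable names and the small bimodal list are illustrative.

```python
from statistics import multimode

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

# multimode returns a list of the most frequently occurring value(s)
print(multimode(retirement_ages))
print(multimode(gas_mileage))

# A made-up set with two values tied for most frequent
two_modes = [1, 1, 2, 2, 3]
print(multimode(two_modes))
```

Note that `multimode` reports only exact ties; the looser sense of bimodal described above, with two obvious peaks of unequal height, has to be judged from a graph.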
MEDIAN
The median reflects the middle value when observations are ordered from least to most. The
median splits a set of ordered observations into two equal parts, the upper and lower halves.
FINDING THE MEDIAN
12. Find the median for the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65,
63.
Arrange the retirement ages in ascending order:
45, 55, 60, 60, 63, 63, 63, 63, 65, 65, 70.
Since there are 11 data points, the median will be the middle value. In this case, the middle value
is the sixth value, which is 63.
So, the median retirement age for this set of data is 63.
13. Find the median for the following gas mileage tests: 26.3, 28.7, 27.4, 26.6, 27.4, 26.9.
Arrange the values in ascending order:
26.3, 26.6, 26.9, 27.4, 27.4, 28.7
Since there are 6 data points (an even number), the median is the average of the two middle
values. Here, the two middle values are 26.9 and 27.4.
Calculating the average:
Median = (26.9 + 27.4) / 2
Median = 54.3 / 2
Median = 27.15
So, the median for this set of gas mileage tests is 27.15 miles per gallon.
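The two cases worked above (an odd and an even number of observations) can be sketched as one small Python function. The name `median_by_hand` is made up for illustration; the standard library's `statistics.median` follows the same rule and is used here as a cross-check.

```python
from statistics import median

def median_by_hand(values):
    """Order the values, then take the middle one (odd n) or the
    average of the two middle ones (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

print(median_by_hand(retirement_ages), median(retirement_ages))
print(median_by_hand(gas_mileage), median(gas_mileage))
```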
MEAN
The mean is the most common average, one you have doubtless calculated many times. The
mean is found by adding all scores and then dividing by the number of scores.
Statisticians distinguish between two types of means—the population mean and the sample
mean—depending on whether the data are viewed as a population (a complete set of scores) or as
a sample (a subset of scores).
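The calculation of the mean can be sketched in Python using the retirement ages and gas mileage data from the earlier problems. The function name is made up for illustration. The arithmetic is the same for a population mean and a sample mean; the distinction lies in how the data are viewed, not in how the sum is divided.

```python
def mean_of(scores):
    """Add all scores, then divide by the number of scores."""
    return sum(scores) / len(scores)

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

# Rounded for display; the ages sum to 672 across 11 scores
print(round(mean_of(retirement_ages), 2))
print(round(mean_of(gas_mileage), 2))
```

For the retirement ages, the mean (about 61.09) falls below both the median and the mode of 63, because the single low age of 45 pulls the mean downward.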