Unit - II Describing Data I
Since the number of possible values is relatively small—only 10—it’s appropriate to construct a
frequency distribution for ungrouped data.
6. The real limits are located at the midpoint of the gap between adjacent tabled boundaries;
that is, one-half of one unit of measurement below the lower tabled boundary and one-half of one
unit of measurement above the upper tabled boundary.
65 - 0.5 = 64.5 and 69 + 0.5 = 69.5
The real limits for the lowest class interval are 64.5-69.5.
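The rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes; the function name `real_limits` and the `unit` parameter are assumptions introduced here.

```python
def real_limits(lower, upper, unit=1):
    """Return the real limits of a class interval.

    lower, upper: the tabled boundaries of the class (e.g. 65 and 69).
    unit: the unit of measurement (assumed to be 1 unless stated otherwise).
    """
    half = unit / 2
    # Half a unit below the lower boundary, half a unit above the upper one.
    return lower - half, upper + half


print(real_limits(65, 69))  # (64.5, 69.5)
```

For data recorded to the nearest tenth, the same function would be called with `unit=0.1`.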
Outliers (Very extreme score)
An outlier is an extremely high or extremely low data point relative to the nearest data point and
the rest of the neighboring co-existing values in a data graph or dataset.
Example
The value in the month of January is significantly less than in the other months.
3. Identify any outliers in each of the following sets of data collected from nine college students.
Therefore, the outliers in the data are: Summer Income: $25,700; Family Size: 18.
RELATIVE FREQUENCY DISTRIBUTIONS
Relative frequency distributions show the frequency of each class as a part or fraction of the total
frequency for the entire distribution. This type of distribution allows us to focus on the relative
concentration of observations among different classes within the same distribution.
Constructing Relative Frequency Distributions
To convert a frequency distribution into a relative frequency distribution, divide the frequency
for each class by the total frequency for the entire distribution.
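As a rough sketch of this conversion, the Python snippet below divides each class frequency by the total. The class labels and frequencies are hypothetical values made up for illustration (they are not the weight table from these notes), and `relative_frequencies` is a name chosen here.

```python
def relative_frequencies(freqs, ndigits=2):
    """Convert {class: frequency} into {class: proportion of the total}."""
    total = sum(freqs.values())
    # Divide each class frequency by the total frequency and round.
    return {cls: round(f / total, ndigits) for cls, f in freqs.items()}


# Hypothetical frequency distribution (illustrative values only).
freqs = {"130-139": 3, "140-149": 5, "150-159": 8, "160-169": 4}
rel = relative_frequencies(freqs)
for cls, p in rel.items():
    print(cls, p)
```

Because of rounding, the proportions of a real distribution may sum to slightly more or less than 1.00.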
PROBLEMS
4. Calculate a relative frequency distribution based on the weight distribution table below.
5. GRE scores for a group of graduate school applicants are distributed as follows:
Convert to a relative frequency distribution. When calculating proportions, round numbers to two
digits to the right of the decimal point.
CUMULATIVE FREQUENCY DISTRIBUTIONS
Cumulative frequency distributions show the total number of observations in each class and in all
lower-ranked classes. This type of distribution can be used effectively with sets of scores, such
as test scores for intellectual or academic aptitude, when relative standing within the distribution
assumes primary importance. Under these circumstances, cumulative frequencies are usually
converted, in turn, to cumulative percentages. Cumulative percentages are often referred to as
percentile ranks.
Constructing Cumulative Frequency Distributions
To convert a frequency distribution into a cumulative frequency distribution, add to the
frequency of each class the sum of the frequencies of all classes ranked below it. This gives the
cumulative frequency for that class. Begin with the lowest-ranked class in the frequency
distribution and work upward, finding the cumulative frequencies in ascending order.
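The running-total procedure described above might be sketched as follows. The frequencies are hypothetical (not the weight table from these notes), the classes are assumed to be listed from lowest to highest, and `cumulative_distribution` is a name introduced for this example.

```python
from itertools import accumulate


def cumulative_distribution(freqs):
    """freqs: list of (class_label, frequency), ordered lowest class first.

    Returns rows of (label, frequency, cumulative frequency, cumulative percent).
    """
    labels = [label for label, _ in freqs]
    counts = [f for _, f in freqs]
    cum = list(accumulate(counts))          # running total, lowest class upward
    total = cum[-1]                         # last running total = total frequency
    cum_pct = [round(100 * c / total) for c in cum]
    return list(zip(labels, counts, cum, cum_pct))


# Hypothetical frequency distribution (illustrative values only).
table = cumulative_distribution(
    [("130-139", 3), ("140-149", 5), ("150-159", 8), ("160-169", 4)]
)
for row in table:
    print(row)
```

The cumulative percent of the highest class is always 100, which is a quick check on the arithmetic.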
6. Calculate a cumulative frequency distribution and percentile ranks based on the weight
distribution table below.
Percentile Ranks
The percentile rank of a score indicates the percentage of scores in the entire distribution with
similar or smaller values than that score. Thus a weight has a percentile rank of 80 if equal or
lighter weights constitute 80 percent of the entire distribution.
7. Find the approximate percentile rank of any weight in the class 200–209.
The approximate percentile rank for weights between 200 and 209 lbs is 92 (because 92 is the
cumulative percent for this interval).
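For ungrouped scores, the definition of a percentile rank can be applied directly, as in the sketch below. The list of weights is hypothetical and `percentile_rank` is a helper name assumed here; for grouped data, the cumulative percent of the class gives only an approximate rank, as in Problem 7.

```python
def percentile_rank(scores, value):
    """Percentage of scores that are equal to or smaller than `value`."""
    at_or_below = sum(1 for s in scores if s <= value)
    return 100 * at_or_below / len(scores)


# Hypothetical weights in pounds (illustrative values only).
weights = [150, 155, 160, 165, 170, 175, 180, 185, 190, 200]
print(percentile_rank(weights, 185))  # 80.0
```

Here 8 of the 10 weights are 185 lbs or lighter, so 185 has a percentile rank of 80.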
Frequency distribution for nominal data
When, among a set of observations, any single observation is a word, letter, or numerical code,
the data are qualitative. Frequency distributions for qualitative data are easy to construct. Simply
determine the frequency with which observations occupy each class, and report these
frequencies.
Relative and Cumulative Distributions for Qualitative Data
Frequency distributions for qualitative variables can always be converted into relative frequency
distributions. Furthermore, if measurement is ordinal because observations can be ordered from
least to most, cumulative frequencies (and cumulative percentages) can be used.
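The snippet below is a minimal sketch of these conversions for ordinal qualitative data. The list of ratings is hypothetical (it is not the San Francisco data used in the problem that follows), and the ordering in `ORDER`, from least to most restrictive, is stated explicitly because cumulative frequencies only make sense once the classes are ordered.

```python
from collections import Counter
from itertools import accumulate

# Classes ordered from least to most restrictive.
ORDER = ["G", "PG", "PG-13", "R", "NC-17"]

# Hypothetical observations (illustrative values only).
ratings = ["PG", "R", "G", "PG-13", "R", "PG", "R", "NC-17", "PG-13", "G"]

counts = Counter(ratings)
freq = [counts[r] for r in ORDER]                   # frequency per class
rel_pct = [100 * f / len(ratings) for f in freq]    # relative frequency (%)
cum = list(accumulate(freq))                        # cumulative frequency

for r, f, p, c in zip(ORDER, freq, rel_pct, cum):
    print(f"{r:6} f={f} rel={p:.0f}% cum={c}")
```

For purely nominal data, only `freq` and `rel_pct` would be meaningful, since there is no natural order in which to accumulate.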
Example
8. Movie ratings reflect ordinal measurement because they can be ordered from most to least
restrictive: NC-17, R, PG-13, PG, and G. The ratings of some films shown recently in San
Francisco are as follows:
(a) Construct a frequency distribution.
(b) Convert to relative frequencies, expressed as percentages.
(c) Construct a cumulative frequency distribution.
(d) Find the approximate percentile rank for those films with a PG rating.
Answer to (d): The percentile rank for films with a PG rating is 55 (11/20 multiplied by 100).
INTERPRETING DISTRIBUTIONS
In data science, interpreting distributions involves analyzing the patterns and characteristics of
data sets to extract insights and make informed decisions.
GRAPHS
Data can be described clearly and concisely with the aid of a well-constructed frequency
distribution, and often even more vividly by converting that distribution into a graph.
Graphs for quantitative data
For visualizing quantitative data, histograms and box plots are commonly used.
Histogram:
A histogram is a bar-type graph for quantitative data. It is a graphical representation of the
distribution of numerical data and consists of a series of bars, where each bar represents a
range of values (bin) and the height of the bar indicates the frequency of data points falling
within that range. The common boundaries between adjacent bars emphasize the continuity of the
data, as with continuous variables. Histograms are useful for visualizing the shape, center, and
spread of the data distribution.
Features of histograms
● Equal units along the horizontal axis (the X axis, or abscissa) reflect the various class
intervals of the frequency distribution.
● Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in frequency.
(The units along the vertical axis do not have to be the same width as those along the
horizontal axis.)
● The intersection of the two axes defines the origin at which both numerical scales equal 0.
● Numerical scales always increase from left to right along the horizontal axis and from bottom
to top along the vertical axis.
● The body of the histogram consists of a series of bars whose heights reflect the frequencies
for the various classes. Notice that adjacent bars in histograms have common boundaries that
emphasize the continuity of quantitative data for continuous variables. The introduction of
gaps between adjacent bars would suggest an artificial disruption in the data more
appropriate for discrete quantitative variables or for qualitative variables.
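To make the link between class intervals and bar heights concrete, the sketch below counts observations per class and prints a crude text histogram. The weights, the starting value of 130, and the class width of 10 are hypothetical choices, and `class_frequencies` is a helper name assumed here; a plotting library would normally draw the actual bars.

```python
def class_frequencies(data, start, width, n_classes):
    """Count how many values fall in each of n_classes equal-width classes."""
    counts = [0] * n_classes
    for x in data:
        index = (x - start) // width     # which class the value falls in
        if 0 <= index < n_classes:       # ignore values outside the classes
            counts[index] += 1
    return counts


# Hypothetical weights in pounds (illustrative values only).
weights = [132, 145, 148, 151, 153, 157, 158, 162, 165, 171]
counts = class_frequencies(weights, start=130, width=10, n_classes=5)

# One row per class; the number of '#' marks plays the role of bar height.
for i, f in enumerate(counts):
    lower = 130 + 10 * i
    print(f"{lower}-{lower + 9}: {'#' * f}")
```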
Frequency Polygon
An important variation on a histogram is the frequency polygon, or line graph: a line graph for
quantitative data that also emphasizes the continuity of continuous variables. Frequency
polygons may be constructed directly from frequency distributions.
Transformation of a histogram into a frequency polygon
1. Construct a Histogram: Start by creating a histogram to represent the frequency distribution
of the data. Divide the range of the data into intervals (bins) and count the number of data
points falling into each interval.
2. Identify Midpoints and Heights: For each bar in the histogram, identify the midpoint of the
interval and the height of the bar (representing the frequency or relative frequency of data
points in that interval).
3. Plot Points: Plot each midpoint on the horizontal axis, with its corresponding height on the
vertical axis. These points represent the tops of the bars in the histogram.
4. Connect the Points: Connect the points on the graph using straight line segments. Start from
the leftmost point and end at the rightmost point. To anchor the polygon to the horizontal
axis, extend each end of the line to the midpoint of the adjacent empty class, where the
frequency is 0.
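The steps above can be sketched as a small function that returns the points to be joined by straight lines. The classes and frequencies are hypothetical, the name `frequency_polygon_points` is an assumption, and the sketch follows the common convention of anchoring the polygon to the horizontal axis with a zero-frequency midpoint one class width beyond each end.

```python
def frequency_polygon_points(classes):
    """classes: list of (lower, upper, frequency), lowest class first,
    with equal class widths (an assumption of this sketch).

    Returns (midpoint, frequency) pairs, including a zero-frequency
    anchor point on each side.
    """
    width = classes[1][0] - classes[0][0]
    # Step 2-3: one point per bar, at the class midpoint and bar height.
    points = [((lo + hi) / 2, f) for lo, hi, f in classes]
    first_mid, last_mid = points[0][0], points[-1][0]
    # Step 4: anchor both ends at frequency 0 in the adjacent empty classes.
    return [(first_mid - width, 0)] + points + [(last_mid + width, 0)]


# Hypothetical frequency distribution (illustrative values only).
classes = [(130, 139, 1), (140, 149, 2), (150, 159, 4), (160, 169, 2), (170, 179, 1)]
points = frequency_polygon_points(classes)
print(points)
```

Passing the resulting x and y values to any line-plotting routine would draw the polygon.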
Example Problem
9. The following frequency distribution shows the annual incomes in dollars for a group of
college graduates.
a) Construct a histogram.
b) Construct a frequency polygon.
c) Is this distribution balanced or lopsided?
To determine if the distribution is balanced or lopsided, we typically look at the shape of the
histogram or frequency polygon. In this case, both the histogram and frequency polygon show
that the distribution is lopsided, with more data points concentrated on the left side (lower
income ranges) and fewer data points on the right side (higher income ranges). This suggests that
the distribution is positively skewed, meaning it has a longer tail on the right side. Thus, the
distribution is lopsided or skewed to the right.
10. The number of friends reported by Facebook users is summarized in the following
frequency distribution:
a) Convert to a histogram.
b) Why would it not be possible to convert to a stem and leaf display?
It would not be possible to convert this distribution to a stem and leaf display because a stem
and leaf display is built from the individual observations, and a grouped frequency distribution
reports only how many users fall in each class, not their exact values. In addition, stem and leaf
plots are typically used for smaller datasets; with the 200 data points in this distribution, the
display would be impractical and hard to read.
Stem and Leaf Displays
Still another technique for summarizing quantitative data is a stem and leaf display. Stem and
leaf displays are ideal for summarizing distributions, such as that for weight data, without
destroying the identities of individual observations.
Selection of Stems
Stem values are not limited to units of 10. Depending on the data, identify the stem with one or
more leading digits that culminates in some variation on a stem value of 10, such as 1, 100, 1000,
or even .1, .01, .001, and so on.
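A stem and leaf display can be sketched in Python as shown below. This minimal version assumes non-negative whole numbers with stems in units of 10 (the tens digits) and leaves as the units digits; the function name `stem_and_leaf` is introduced here, and the sample values are the retirement ages used later in these notes.

```python
def stem_and_leaf(values):
    """Return the lines of a stem and leaf display (stem unit = 10).

    Assumes non-negative integers; other stem units would need a
    different split between stem and leaf.
    """
    stems = {}
    for v in sorted(values):
        # Tens digit(s) form the stem, the units digit is the leaf.
        stems.setdefault(v // 10, []).append(v % 10)
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines


ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
display = stem_and_leaf(ages)
for line in display:
    print(line)
```

Each original value can be read back from the display (for example, stem 6 with leaf 3 is 63), which is what distinguishes it from a grouped frequency distribution.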
11. Construct a stem and leaf display from the following data:
AVERAGES
Averages consist of numbers (or words) about which the data are, in some sense, centered. They
are often referred to as measures of central tendency. The several types of average yield numbers
or words that attempt to describe, most generally, the middle or typical value for a distribution.
This section focuses on three different measures of central tendency: the mode, median, and mean.
Each of these has its special uses, but the mean is the most important average in both descriptive
and inferential statistics. Like the other averages, it summarizes a set of data points with a
single value.
MODE
The mode reflects the value of the most frequently occurring score.
More Than One Mode
Distributions can have more than one mode (or no mode at all). Distributions with two obvious
peaks, even though they are not exactly the same height, are referred to as bimodal. Distributions
with more than two peaks are referred to as multimodal. The presence of more than one mode
might reflect important differences among subsets of data. For instance, the distribution of
weights for both male and female statistics students would most likely be bimodal, reflecting the
combination of two separate weight distributions—a heavier one for males and a lighter one for
females.
14. Determine the mode for the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60,
65, 63.
The retirement age 63 appears most frequently, occurring 4 times. So, the mode for this set of
retirement ages is 63.
15. The owner of a new car conducts six gas mileage tests and obtains the following results,
expressed in miles per gallon: 26.3, 28.7, 27.4, 26.6, 27.4, 26.9. Find the mode for these
data.
Here, the mileage 27.4 appears twice, which is more than any other value. So, the mode for this
set of gas mileage tests is 27.4 miles per gallon.
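The two answers above can be checked with Python's standard `statistics` module. This is only a verification sketch; `multimode` (available in Python 3.8 and later) returns a list, so it also handles distributions with more than one mode.

```python
from statistics import multimode

ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

# multimode returns every value tied for the highest frequency.
age_modes = multimode(ages)
mileage_modes = multimode(mileage)

print(age_modes)      # [63]
print(mileage_modes)  # [27.4]
```

For a bimodal set such as 1, 1, 2, 2, 3, the same call would return both modes, 1 and 2.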
MEDIAN
The median reflects the middle value when observations are ordered from least to most. The
median splits a set of ordered observations into two equal parts, the upper and lower halves.
FINDING THE MEDIAN
16. Find the median for the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65,
63.
Arrange the retirement ages in ascending order:
45, 55, 60, 60, 63, 63, 63, 63, 65, 65, 70.
Since there are 11 data points, the median will be the middle value. In this case, the middle value
is the sixth value, which is 63.
So, the median retirement age for this set of data is 63.
17. Find the median for the following gas mileage tests: 26.3, 28.7, 27.4, 26.6, 27.4, 26.9.
Arrange the values in ascending order:
26.3, 26.6, 26.9, 27.4, 27.4, 28.7
Since there are 6 data points (an even number), the median is the average of the two middle
values. Here, the two middle values, the third and fourth, are 26.9 and 27.4.
Calculating the average:
Median = (26.9 + 27.4) / 2
Median = 54.3 / 2
Median = 27.15
So, the median for this set of gas mileage tests is 27.15 miles per gallon.
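Both medians can be verified with the standard `statistics.median` function, which sorts the data and, for an even number of values, averages the two middle ones. The sketch below is a check on the worked answers, not a replacement for the hand calculation.

```python
from statistics import median

ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

age_median = median(ages)          # odd count: the single middle value
mileage_median = median(mileage)   # even count: mean of the two middle values

print(age_median)
print(round(mileage_median, 2))    # rounded to hide floating-point noise
```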
MEAN
The mean is the most common average, one you have doubtless calculated many times. The
mean is found by adding all scores and then dividing by the number of scores.
Statisticians distinguish between two types of means—the population mean and the sample
mean—depending on whether the data are viewed as a population (a complete set of scores) or as
a sample (a subset of scores).
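As a quick illustration of "add all scores, then divide by the number of scores", the sketch below computes the mean of the two data sets used earlier in these notes, both by hand and with `statistics.mean`. The arithmetic is the same whether the scores are treated as a population or as a sample; only the notation differs.

```python
from statistics import mean

ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

# By hand: sum of all scores divided by the number of scores.
age_mean = sum(ages) / len(ages)

# Library equivalent.
mileage_mean = mean(mileage)

print(round(age_mean, 2))
print(round(mileage_mean, 2))
```

The retirement ages sum to 672 over 11 scores, so their mean is 672/11, a little above 61.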
Formula for Sample Mean