Fundamentals of Data Science unit 2
DESCRIBING DATA
Syllabus: UNIT II
Frequency distributions–Outliers–relative frequency distributions–
cumulative frequency distributions–frequency distributions for nominal
data–interpreting distributions–graphs–averages–mode–median–mean–
averages for qualitative and ranked data–describing variability–range–
variance–standard deviation–degrees of freedom–interquartile range–
variability for qualitative and ranked data.
Frequency Distributions
A frequency distribution is a collection of observations produced by
sorting observations into classes and showing their frequency (f) of
occurrence in each class.
A frequency distribution helps us to detect any pattern in the data
(assuming a pattern exists) by superimposing some order on the
inevitable variability among observations.
The advantage of using frequency distributions is that they present raw
data in an organized, easy-to-read format. The most frequently occurring
scores are easily identified, as are score ranges, lower and upper limits,
cases that are not common, outliers, and total number of observations
between any given scores.
A frequency distribution shows whether the observations are high or low
and also whether they are concentrated in one area or spread out across
the entire scale.
Different Types of Frequency Distributions:
Ungrouped frequency distribution.
Grouped frequency distribution.
Relative frequency distribution.
Cumulative frequency distribution.
Frequency Distribution for Ungrouped Data
A frequency distribution produced whenever observations are sorted into
classes of single values is referred to as a frequency distribution for
ungrouped data.
Frequency distributions for ungrouped data are much more informative
when the number of possible values is less than about 20.
Example:
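As a minimal sketch of the idea, the Python snippet below builds a frequency distribution for ungrouped data, together with the relative and cumulative frequencies listed among the types above. The scores and the function name `frequency_table` are illustrative assumptions, not taken from these notes.

```python
from collections import Counter

def frequency_table(observations):
    """Return rows of (value, f, relative f, cumulative f), lowest value first."""
    counts = Counter(observations)
    total = len(observations)
    rows = []
    cumulative = 0
    # Accumulate from the lowest value upward so that each cumulative
    # frequency counts the observations at or below that value.
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append((value, counts[value], counts[value] / total, cumulative))
    return rows

# Hypothetical scores, chosen only to illustrate the table layout.
scores = [2, 4, 4, 5, 5, 5, 7, 7, 8, 10]
table = frequency_table(scores)

print("value   f   rel   cum")
for value, f, rel, cum in table:
    print(f"{value:>5} {f:>3} {rel:>5.2f} {cum:>5}")
```

Each class here is a single value, which is what makes the distribution "ungrouped". The relative frequencies sum to 1, and the last cumulative frequency equals the total number of observations.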
Interpreting Distributions
When inspecting a distribution for the first time, train yourself to look at the entire table,
not just the distribution. Read the title, column headings, and any footnotes.
After these preliminaries, inspect the content of the frequency distribution.
When interpreting distributions, including distributions constructed by someone else,
keep an open mind.
Outliers
A very extreme score that requires special attention because of its potential impact
on a summary of the data is called an outlier.
Example: A GPA of 0.06, an IQ of 170, summer wages of $62,000
Dealing with Outliers
Check for Accuracy:
Whenever you encounter an outlier, attempt to verify its accuracy.
Example: Was a GPA of 3.06 recorded erroneously as 0.06?
If the outlier survives an accuracy check, it should be treated as a legitimate score.
Might Exclude from Summaries:
We might choose to segregate an outlier from any summary of the data.
For example, we might relegate it to a footnote instead of using excessively wide
class intervals in order to include it in a frequency distribution. Or we might use
numerical summaries, such as the median and interquartile range, that are not
distorted by extreme scores.
Might Enhance Understanding:
A valid outlier can be viewed as the product of special circumstances; it can help us
understand the data.
For example, we might understand better why crime rates differ among communities
by studying the special circumstances that produce a community with an extremely
low (or high) crime rate, or why learning rates differ among third graders by
studying a third grader who learns very rapidly (or very slowly).
Graphs
(Describing Data using Graphs)
Data can be described clearly and concisely with the aid of a well-constructed frequency
distribution.
Data can often be described even more vividly by converting frequency distributions
into graphs.
Frequency distributions can also be converted into relative frequency
distributions and, if the data can be ordered because of ordinal measurement,
into percentile ranks.
Most common types of graphs:
Graphs for Quantitative Data
Histograms
Frequency Polygon
Stem and Leaf Displays
Graphs for Qualitative Data
Bar graph
Histogram
A histogram is a bar-type graph for quantitative data. The common boundaries between
adjacent bars emphasize the continuity of the data, as with continuous variables.
Important features of histograms:
Equal units along the horizontal axis (the X axis, or abscissa) reflect the
various class intervals of the frequency distribution.
Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in
frequency.
The intersection of the two axes defines the origin at which both numerical
scales equal 0.
Numerical scales always increase from left to right along the horizontal axis
and from bottom to top along the vertical axis.
The body of the histogram consists of a series of bars whose heights reflect
the frequencies for the various classes.
Example:
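A rough sketch of these features in Python: the snippet sorts whole-number observations into equal-width class intervals and prints one bar per class, so the display is a histogram turned on its side. The weights, the starting limit of 130, the class width of 10, and the function name `grouped_frequencies` are all assumptions made for illustration.

```python
def grouped_frequencies(data, lowest, width):
    """Count whole-number observations in equal-width classes.

    Classes start at `lowest` and cover `width` values each,
    for example 130-139, 140-149, ... when lowest=130 and width=10.
    Assumes every observation is >= lowest.
    """
    counts = {}
    lower = lowest
    # Create every class up to the one containing the largest observation,
    # so that empty classes still appear in the graph.
    while lower <= max(data):
        counts[(lower, lower + width - 1)] = 0
        lower += width
    for x in data:
        lower = lowest + ((x - lowest) // width) * width
        counts[(lower, lower + width - 1)] += 1
    return counts

# Hypothetical weights, not the data from the missing figure.
weights = [132, 145, 151, 158, 160, 163, 165, 168, 171, 174, 179, 185, 192, 204]
classes = grouped_frequencies(weights, lowest=130, width=10)

for (low, high), f in classes.items():
    # Bar length reflects the class frequency.
    print(f"{low}-{high} | {'#' * f} ({f})")
```

Because the classes have equal width and no gaps, adjacent bars share common boundaries, which is the property that distinguishes a histogram from a bar graph.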
Frequency Polygon
An important variation on a histogram is the frequency polygon, or line
graph.
Frequency polygons are particularly useful when two or more frequency
distributions or relative frequency distributions are to be included in the
same graph.
Frequency polygons can be constructed directly from frequency distributions.
They can also be constructed from a histogram.
The step-by-step transformation of a histogram into a frequency polygon:
A: This panel shows the histogram for the weight distribution.
B: Place dots at the midpoints of each bar top or, in the absence of
bar tops, at midpoints for classes on the horizontal axis, and
connect them with straight lines.
C: Anchor the frequency polygon to the horizontal axis. First,
extend the upper tail to the midpoint of the first unoccupied class
on the upper flank of the histogram. Then extend the lower tail to
the midpoint of the first unoccupied class on the lower flank of the
histogram. Now all of the area under the frequency polygon is
enclosed completely.
D: Finally, erase all of the histogram bars, leaving only the
frequency polygon.
Example:
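Steps B and C above can be sketched in Python as follows. The snippet computes the points of a frequency polygon from a grouped frequency distribution: one point at the midpoint of each class, plus an anchor point with zero frequency on each flank. The class limits and frequencies are hypothetical, `polygon_points` is an illustrative name, and the sketch assumes equal-width classes with occupied first and last classes.

```python
def polygon_points(classes, width):
    """Return (midpoint, frequency) points for a frequency polygon.

    classes: list of (lower_limit, upper_limit, frequency) in ascending order.
    width:   distance between the midpoints of adjacent classes.
    """
    # Step B: one dot at the midpoint of each class, at the height of its frequency.
    points = [((low + high) / 2, f) for low, high, f in classes]
    # Step C: anchor both tails to the horizontal axis at the midpoint
    # of the first unoccupied class on each flank.
    lower_anchor = (points[0][0] - width, 0)
    upper_anchor = (points[-1][0] + width, 0)
    return [lower_anchor] + points + [upper_anchor]

# Hypothetical grouped distribution.
classes = [(130, 139, 1), (140, 149, 1), (150, 159, 2), (160, 169, 4), (170, 179, 3)]
points = polygon_points(classes, width=10)

for midpoint, f in points:
    print(midpoint, f)
```

Connecting consecutive points with straight lines gives the polygon. Step D (erasing the bars) has no counterpart here because no bars are drawn.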
Stem and Leaf Displays
Stem and leaf displays are ideal for summarizing distributions, such as that for
weight data, without destroying the identities of individual observations.
A stem and leaf display is a device for sorting quantitative data on the basis of leading and
trailing digits.
Stem and leaf displays represent statistical bargains. Just a few minutes of work produces
a description of data that is both clear and complete.
Even though rarely appearing in published reports, stem and leaf displays often serve as
the first step toward organizing data.
A good stem and leaf display:
shows the first digits of the number (thousands, hundreds or tens) as the stem and
shows the last digit (ones) as the leaf.
usually uses whole numbers. Anything that has a decimal point is rounded to the
nearest whole number. For example, test results, speeds, heights, weights, etc.
looks like a bar graph when it is turned on its side.
shows how the data are spread, that is, the highest number, lowest number,
most common number and outliers.
To construct the stem and leaf display:
On the left-hand side of the page, write down the thousands, hundreds or
tens (all digits but the last one). These will be your stems.
Draw a line to the right of these stems.
On the other side of the line, write down the ones (the last digit of a number).
These will be your leaves.
Example 1: A teacher asked 10 of her students how many books they had read in the last
12 months. Their answers were as follows: 12, 23, 19, 6, 10, 7, 15, 25, 21, 12. Prepare a
stem and leaf display for these data.
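One way to work this example is sketched below in Python, following the construction steps above: all digits but the last form the stem, and the last digit is the leaf. The function name `stem_and_leaf` is an illustrative choice, and the sketch assumes non-negative whole numbers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines with tens (and above) as stems and ones as leaves."""
    leaves = defaultdict(list)
    # Sorting first puts the leaves of each stem in ascending order.
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Include every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves[stem])
        lines.append(f"{stem} | {row}".rstrip())
    return lines

# Number of books read by the 10 students in Example 1.
books = [12, 23, 19, 6, 10, 7, 15, 25, 21, 12]
display = stem_and_leaf(books)
print("\n".join(display))
# 0 | 6 7
# 1 | 0 2 2 5 9
# 2 | 1 3 5
```

The display shows that most students read between 10 and 19 books, and every original answer can still be read back from its stem and leaf.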
Bimodal
It reflects the coexistence of two different types of observations in the same
distribution.
For instance, the distribution of the ages of residents in a neighborhood consisting
largely of either new parents or their infants has a bimodal shape.
Positively Skewed
A lopsided distribution caused by a few extreme observations in the positive
direction (to the right of the majority of observations) is a positively skewed
distribution.
The distribution of incomes among U.S. families has a pronounced positive skew,
with most family incomes under $200,000 and relatively few family incomes
spanning a wide range of values above $200,000.
Negatively Skewed
A lopsided distribution caused by a few extreme observations in the negative
direction (to the left of the majority of observations) is a negatively skewed
distribution.
The distribution of ages at retirement among U.S. job holders has a pronounced
negative skew, with most retirement ages at 60 years or older and relatively few
retirement ages spanning the wide range of ages younger than 60.
Bar graphs: A Graph for Qualitative (Nominal) Data
Bar graphs are often used with qualitative data and sometimes with discrete
quantitative data.
They resemble histograms except that gaps separate adjacent bars in bar graphs.
Example 1:
Interpreting graphs
When interpreting graphs, beware of various unscrupulous techniques, such as
using bizarre combinations of axes to either exaggerate or suppress a particular data
pattern.
Describing Data with Averages
Averages consist of numbers (or words) about which the data are, in some sense,
centered. They are often referred to as measures of central tendency.
A measure of center is a single number used to describe a set of numeric data. It
describes a typical value from the data set.
Several types of average yield numbers or words that attempt to describe, most
generally, the middle or typical value for a distribution.
Three different measures of central tendency are:
Mode
Median
Mean.
Each of these has its special uses, but the mean is the most important
average in both descriptive and inferential statistics.
Mode
The mode equals the value of the most frequently occurring or typical
score.
It is easy to assign a value to the mode if the data are organized.
However, if the data are not organized, some counting may be required.
The mode is readily understood as the most prevalent or typical value.
Distributions can have more than one mode (or no mode at all).
Distributions with two obvious peaks, even though they are not exactly
the same height, are referred to as bimodal.
Distributions with more than two peaks are referred to as multimodal.
The presence of more than one mode might reflect important differences
among subsets of data. For instance, the distribution of weights for both
male and female statistics students would most likely be bimodal,
reflecting the combination of two separate weight distributions—a
heavier one for males and a lighter one for females.
Example 1: Determine the mode for the following retirement ages: 60, 63,
45, 63, 65, 70, 55, 63, 60, 65, 63.
Answer: mode = 63
Example 2: The owner of a new car conducts six gas mileage tests and
obtains the following results, expressed in miles per gallon: 26.3, 28.7,
27.4, 26.6, 27.4, 26.9. Find the mode for these data.
Answer: mode = 27.4
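The two answers above can be checked with a short Python sketch that counts how often each value occurs. The function name `modes` is illustrative, and treating a set in which no value repeats as having no mode is a convention assumed here rather than one stated in these notes.

```python
from collections import Counter

def modes(scores):
    """Return all values tied for the highest frequency, or [] if no value repeats."""
    counts = Counter(scores)
    top = max(counts.values())
    if top == 1:
        # Assumed convention: every value occurs once, so there is no mode.
        return []
    return sorted(value for value, f in counts.items() if f == top)

ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

print(modes(ages))     # [63]
print(modes(mileage))  # [27.4]
```

Returning a list lets the same function report more than one mode when several values tie for the highest frequency.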
Median
The median reflects the middle value when observations are ordered from
least to most.
The median splits a set of ordered observations into two equal parts, the
upper and lower halves.
In other words, the median has a percentile rank of 50, since observations
with equal or smaller values constitute 50 percent of the entire distribution.
To find the median, scores always must be ordered from least to most (or
vice versa). This task is straightforward with small sets of data but becomes
increasingly cumbersome with larger sets of data that must be ordered
manually.
When the total number of scores is odd, there is a single middle-ranked
score, and the value of the median equals the value of this score. When the
total number of scores is even, the value of the median equals a value
midway between the values of the two middlemost scores.
In either case, the value of the median always reflects the value of middle-
ranked scores, not the position of these scores among the set of ordered
scores.
Example 1: Find the median for the following retirement ages: 60, 63, 45,
63, 65, 70, 55, 63, 60, 65, 63.
Solution: median = 63
Example 2: Find the median for the following gas mileage tests: 26.3,
28.7, 27.4, 26.6, 27.4, 26.9.
Solution: median = 27.15 (halfway between 26.9 and 27.4)
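The odd and even cases described above can be sketched in Python as follows. The function name `median` is chosen for illustration; the data are the two examples from these notes.

```python
def median(scores):
    """Return the middle value of the ordered scores."""
    ordered = sorted(scores)           # scores must be ordered first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of scores: a single middle-ranked score.
        return ordered[middle]
    # Even number of scores: midway between the two middlemost scores.
    return (ordered[middle - 1] + ordered[middle]) / 2

ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

print(median(ages))               # 63
print(round(median(mileage), 2))  # 27.15
```

With 11 retirement ages the median is the 6th ordered score. With 6 mileage tests it lies halfway between the 3rd and 4th ordered scores, 26.9 and 27.4.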
Mean
The mean is the most common average.
The mean is found by adding all scores and then dividing by the number
of scores.
That is, mean = (sum of all scores) / (number of scores), or in symbols
$\bar{X} = \frac{\sum X}{n}$.
There is no requirement that scores be ranked before calculating
the mean.
Even when large sets of unorganized data are involved, the calculation of
the mean is usually straightforward, particularly with the aid of a
calculator or computer.
The mean serves as the balance point for its frequency distribution.
The mean cannot be used with qualitative data.
Example 1: Find the mean for the following retirement ages: 60, 63, 45,
63, 65, 70, 55, 63, 60, 65, 63.
Solution: mean = 672 / 11 = 61.09 (rounded to two decimal places)
Example 2: Find the mean for the following gas mileage tests: 26.3, 28.7,
27.4, 26.6, 27.4, 26.9.
Solution: mean = 163.3 / 6 = 27.22 (rounded to two decimal places)
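Both means can be computed with a few lines of Python. This is a minimal sketch: the function name `mean` is illustrative, and the last line checks the balance-point property by confirming that the deviations from the mean sum to zero, apart from floating-point rounding.

```python
def mean(scores):
    """Add all scores, then divide by the number of scores."""
    return sum(scores) / len(scores)

ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

print(round(mean(ages), 2))     # 61.09
print(round(mean(mileage), 2))  # 27.22

# Balance point: deviations above and below the mean cancel out.
deviation_sum = sum(x - mean(ages) for x in ages)
print(abs(deviation_sum) < 1e-9)  # True
```

Note that the scores are used in their original order; no ranking is needed.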
Which Average?
When a distribution of scores is not too skewed, the values of the mode,
median, and mean are similar, and any of them can be used to describe
the central tendency of the distribution.
When extreme scores cause a distribution to be skewed, the values of the three averages
can differ appreciably.
Unlike the mode and median, the mean is very sensitive to extreme scores, or
outliers.
Ideally, when a distribution is skewed, report both the mean and the median.
Appreciable differences between the values of the mean and median signal the
presence of a skewed distribution.
If the mean exceeds the median, the underlying distribution is positively skewed
because of one or more scores with relatively large values.
On the other hand, if the median exceeds the mean, the underlying distribution is
negatively skewed because of one or more scores with relatively small values.
In the long run, however, the mean is the single most preferred average for quantitative
data.
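The comparison between the mean and the median can be sketched in Python using the standard `statistics` module. The function name `skew_direction` and the wage figures are assumptions for illustration. As a simplification, the function reports any difference between the two averages, whereas the text above speaks of appreciable differences.

```python
from statistics import mean, median

def skew_direction(scores):
    """Compare mean and median to suggest the direction of skew."""
    m, md = mean(scores), median(scores)
    if m > md:
        return "positive"   # pulled up by relatively large scores
    if md > m:
        return "negative"   # pulled down by relatively small scores
    return "none"

# Retirement ages from the examples: the low age of 45 pulls the mean below the median.
ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
print(skew_direction(ages))   # negative

# Hypothetical summer wages with one very large value.
wages = [3000, 3500, 4000, 4500, 62000]
print(skew_direction(wages))  # positive
print(mean(wages), median(wages))
```

In the wage data the single extreme value raises the mean far above the median, while the median stays at the middle-ranked score. This is the sensitivity of the mean to outliers described above.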
The following figure summarizes the relationship between the various averages and the
two types of skewed distributions (shown as smoothed curves).
Averages for Qualitative and Ranked Data
Mode Always Appropriate for Qualitative Data
For quantitative data, in principle, all three averages can be used.
The mode always can be used with qualitative data.
Median Sometimes Appropriate for Qualitative Data
The median can be used whenever it is possible to order qualitative data from
least to most because the level of measurement is ordinal.
It is easiest to determine the median class for ordered qualitative data by using
relative frequencies.
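A hedged sketch of this procedure in Python: accumulate relative frequencies from the lowest class upward and report the first class at which the cumulative relative frequency reaches 50 percent. The survey categories, their frequencies, and the function name `median_class` are hypothetical.

```python
def median_class(ordered_classes):
    """Return the label of the class containing the median.

    ordered_classes: list of (label, frequency), ordered from least to most.
    """
    total = sum(f for _, f in ordered_classes)
    cumulative = 0
    for label, f in ordered_classes:
        cumulative += f
        # Cumulative relative frequency of this class and all lower classes.
        if cumulative / total >= 0.5:
            return label
    return None  # only reached when there are no observations

# Hypothetical ordinal data: responses to a survey item.
responses = [
    ("strongly disagree", 5),
    ("disagree", 10),
    ("neutral", 20),
    ("agree", 40),
    ("strongly agree", 25),
]
print(median_class(responses))  # agree
```

The cumulative relative frequencies here are 0.05, 0.15, 0.35 and 0.75, so "agree" is the first class to reach 0.50. The approach works only because the classes can be ordered.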
Mean cannot be used with qualitative data