Lecture Notes 2 Data Organization and Presentation
Graphical Summary
Data can be summarized graphically in the form of tables, tree diagrams, stem-and-leaf diagrams,
bar charts, pie charts, pictographs, line graphs, histograms, frequency distribution curves, ogives, etc.
We shall describe and give examples of qualitative data (unordered and ordered) and
quantitative data (discrete and continuous); how these types of data can be represented
figuratively; the two important features of a quantitative dataset (location and
variability); the measures of location (mean, median and mode); and the measures of
variability (range, interquartile range, standard deviation and variance).
Table 1
2. Stem-and-Leaf Diagrams
A stem-and-leaf diagram has the advantage of retaining the data in its original form while
providing a visual representation. Illustrated below is the age distribution of some adults
aspiring to be presidential candidates. In this case, the stem, the tens digit of the candidate's
age, is given on the left, and the leaf, the units digit of the candidate's age, is given on the
right.
Example 2
Data collected for the age distribution of 43 presidential candidates is as follows:
42, 43, 46, 46, 47, 48, 49, 49, 50, 51, 51, 51, 51, 51, 52, 52, 54, 54, 54, 54, 54, 55, 55, 55,
55, 56, 56, 56, 57, 57, 57, 57, 58, 60, 61, 61, 61, 62, 64, 64, 65, 68, 69
Stem Leaf
4|23667899
5|0111112244444555566677778
6|0111244589
Or
Reformatting the above with more rows (called splitting the stem in some books) emphasizes
even more its roughly bell-shaped (approximately normal) distribution. Notice how the
stem-and-leaf diagram is also somewhat like a histogram, but turned on its side.
Stem leaf
4|23
4|667899
5|0111112244444
5|555566677778
6|0111244
6|589
Please note that the separation line should be continuous. The following rules should be
observed when constructing stem-and-leaf diagrams.
1. The leaves on the right should be in increasing (or decreasing) order, left to right.
2. No commas should appear on the right.
3. No horizontal lines should appear.
4. If the stem/leaf break occurs at a decimal point, put the decimal point to the left with
the stem.
5. If the leaf is double or triple digit, etc., leave a [half] space between each entry.
6. There should be at least five but no more than twenty rows.
7. If a range is used for the stem, an asterisk (*) may be used to separate the
corresponding leaves.
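The stem-and-leaf diagram of Example 2 can be rebuilt with a few lines of Python. This is only a sketch: the function name `stem_and_leaf` is a hypothetical helper, and it assumes two-digit data so that the tens digit is the stem and the units digit is the leaf.

```python
from collections import defaultdict

# Ages of the 43 presidential candidates from Example 2.
ages = [42, 43, 46, 46, 47, 48, 49, 49, 50, 51, 51, 51, 51, 51, 52, 52,
        54, 54, 54, 54, 54, 55, 55, 55, 55, 56, 56, 56, 57, 57, 57, 57,
        58, 60, 61, 61, 61, 62, 64, 64, 65, 68, 69]

def stem_and_leaf(values):
    """Return {stem: leaf string}; hypothetical helper for two-digit data."""
    rows = defaultdict(list)
    for v in sorted(values):          # sorting keeps the leaves in increasing order
        rows[v // 10].append(v % 10)  # stem = tens digit, leaf = units digit
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(rows.items())}

diagram = stem_and_leaf(ages)
for stem, leaves in diagram.items():
    print(f"{stem}|{leaves}")         # no commas and no horizontal lines
```

The printed rows should match the unsplit diagram above, one row per stem.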
Example 3
The number of rooms in each of 40 houses in a particular street is given by the
following set of data:
5 6 4 3 3 6 6 4 5 4 7 8 3 5 4 4 4 8 8 3 5 5 6 5 7
4 6 5 4 3 3 4 5 5 4 7 6 10 9 8
-For the information to be manageable, we divide it into groups and form a
frequency table.
-Recording one mark for each observation is called tallying.
-Normally, if we have little data, we array (re-arrange) it in order of size.
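As a rough illustration of tallying, the sketch below counts how often each number of rooms occurs in the Example 3 data. The variable names are illustrative only, and the tally is printed with plain `|` marks rather than bundles of five.

```python
from collections import Counter

# Number of rooms in each of the 40 houses (Example 3).
rooms = [5, 6, 4, 3, 3, 6, 6, 4, 5, 4, 7, 8, 3, 5, 4, 4, 4, 8, 8, 3,
         5, 5, 6, 5, 7, 4, 6, 5, 4, 3, 3, 4, 5, 5, 4, 7, 6, 10, 9, 8]

frequency = Counter(rooms)            # maps each value to its count

print("Rooms  Tally       Frequency")
for value in sorted(frequency):
    count = frequency[value]
    print(f"{value:>5}  {'|' * count:<10}  {count:>2}")
```

The frequencies should add up to 40, which is a quick check that no observation was missed.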
3. Frequency Tables or Distributions
A frequency table lists in one column the data categories or classes and
in another column the corresponding frequencies.
Score limits (class limits) are the largest or smallest numbers which can actually belong to each class.
Class interval (class width) is the difference between two exact limits (class boundaries) (or
corresponding score/class limits).
Guidelines for constructing frequency tables.
1. The classes must be "mutually exclusive"—no element can belong to more than one class.
2. Even if the frequency is zero, include each and every class.
3. Make all classes the same width. (However, open ended classes may be inevitable.)
4. Target between 5 and 20 classes, depending on the range and number of data points.
5. Keep the limits as simple and as convenient as possible (multiple of width?).
6. If practical, make the width odd so that the interval midpoint is a whole number.
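One possible way to apply these guidelines is sketched below for the candidate ages of Example 2. The class width of 5 and the choice to start the first class at a multiple of the width are assumptions made for illustration, not rules taken from the notes.

```python
# Ages of the 43 presidential candidates from Example 2.
ages = [42, 43, 46, 46, 47, 48, 49, 49, 50, 51, 51, 51, 51, 51, 52, 52,
        54, 54, 54, 54, 54, 55, 55, 55, 55, 56, 56, 56, 57, 57, 57, 57,
        58, 60, 61, 61, 61, 62, 64, 64, 65, 68, 69]

width = 5                                   # assumed class width
lower = (min(ages) // width) * width        # first lower limit, a multiple of the width

classes = {}
while lower <= max(ages):
    upper = lower + width - 1               # class limits, e.g. 40-44
    # Each age satisfies exactly one condition, so classes are mutually exclusive.
    classes[(lower, upper)] = sum(lower <= a <= upper for a in ages)
    lower += width

for (lo, hi), f in classes.items():
    print(f"{lo}-{hi}: {f}")
```

With these choices the table has six equal-width classes, and every class is listed even if its frequency were zero.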
4. Bar Chart
Data represented as a series of bars, height of bar proportional to frequency
[Figure: Bar graph for the number of rooms in each of 40 houses; axes: number of rooms and frequency]
5. Line graph for the number of rooms in each of 40 houses
[Figure: Line graph for the number of rooms in each of 40 houses; axes: rooms and frequency]
6. Pie chart
-A pie chart is a circle divided by radial lines into sectors, so that the area of each
sector is proportional to the frequency or size of the figure represented.
[Figure: Pie chart with sectors labelled 3, 4, 5, 6, 7, 8, 9, 10]
7. Histogram
-A bar chart for a continuous distribution is referred to as a histogram.
-It is similar to a bar chart, but for a continuous rather than a categorical variable.
-The area of each bar is proportional to the probability of an observation being in that bar.
-The vertical axis can be:
Frequency (heights add up to n)
Percentage (heights add up to 100%)
Density (areas add up to 1)
Example 4
From the frequency table below, which shows the number of days technologists
spend to complete a certain project, construct a histogram.
Number of days   Tally mark                  Frequency
0-4              II                          2
5-9              IIIII IIIII IIIII           15
10-14            IIIII IIIII IIIII IIIII I   21
15-19            IIIII IIIII IIIII III       18
20-24            IIIII IIIII IIII            14
25-29            IIIII IIIII III             13
30-34            IIIII IIII                  9
35-39            IIIII                       5
40-44            II                          2
45-49            I                           1
-When class intervals are equal, a histogram can be constructed straight away from
the given data (drawn manually).
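The three possible vertical scales of a histogram can be compared on the Example 4 table. The sketch below assumes the class boundaries lie half a day on either side of the class limits (for example 4.5 to 9.5), so that every class has width 5. That assumption is not stated in the notes.

```python
# Frequency table from Example 4 (class limits in days).
classes = [(0, 4), (5, 9), (10, 14), (15, 19), (20, 24),
           (25, 29), (30, 34), (35, 39), (40, 44), (45, 49)]
frequencies = [2, 15, 21, 18, 14, 13, 9, 5, 2, 1]

n = sum(frequencies)      # total number of observations
width = 5                 # assumed width between class boundaries

percentages = [100 * f / n for f in frequencies]      # heights add up to 100%
densities = [f / (n * width) for f in frequencies]    # areas add up to 1

for (lo, hi), f, p, d in zip(classes, frequencies, percentages, densities):
    print(f"{lo:>2}-{hi:<2}  f={f:>2}  {p:>5.1f}%  density={d:.3f}")
```

Because the classes have equal width, the three scales differ only by a constant factor and the histogram has the same shape in each case.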
8. Frequency curve
Procedure
-Mark the midpoints of the tops of each bar on a histogram
-join the points with straight lines then smoothen to form a curve
9. Ogive
-A graph drawn from a cumulative frequency distribution [ALWAYS USE GRAPH
PAPER]
Procedure
-Compute the cumulative frequencies of the distribution.
-Prepare a graph with the class boundaries on the horizontal axis and the cumulative
frequency on the vertical axis.
-The starting point should be zero (cumulative frequency 0 at the lower boundary of the first class).
-Plot each cumulative frequency at the upper class boundary.
Example 5
Using the data for the example of the number of rooms in each of 40 houses, construct a
cumulative frequency graph (a "less than" ogive).
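The points to plot for this ogive can be worked out as in the sketch below. The frequencies were obtained by tallying the Example 3 data. Treating each whole number of rooms v as a class with boundaries v - 0.5 and v + 0.5 is an assumption made here so that the "upper class boundary" rule can be applied to discrete data.

```python
from itertools import accumulate

# Frequency distribution of the number of rooms (tallied from Example 3).
values = [3, 4, 5, 6, 7, 8, 9, 10]
frequencies = [6, 10, 9, 6, 3, 4, 1, 1]

# Running totals give the cumulative frequencies.
cumulative = list(accumulate(frequencies))

# Assumed boundaries: the ogive starts at zero below the first class, and each
# cumulative frequency is plotted at the upper boundary of its class.
points = [(values[0] - 0.5, 0)] + [(v + 0.5, c) for v, c in zip(values, cumulative)]

for x, c in points:
    print(f"{x:>5}  {c:>2}")
```

The last cumulative frequency should equal the total number of houses, 40.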
Draw a histogram to represent this information stating any assumptions you make
2. The table below shows the distribution of skills offered by a construction company.
Skill % available
Survey 12
Billing 20
Building 26
Plumbing 32
Civil works 10
Represent this information in a pie chart
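To draw the pie chart, each percentage has to be converted to a sector angle out of 360 degrees. A minimal sketch of that arithmetic, with illustrative variable names, is:

```python
# Percentage of each skill available, from the table above.
skills = {"Survey": 12, "Billing": 20, "Building": 26,
          "Plumbing": 32, "Civil works": 10}

total = sum(skills.values())          # should be 100 for percentages

# Sector angle is proportional to the share of the total.
angles = {skill: 360 * share / total for skill, share in skills.items()}

for skill, angle in angles.items():
    print(f"{skill:<12} {angle:6.1f} degrees")
```

The angles should add up to 360 degrees, which is a useful check before drawing.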
NB: When given raw data you have to make a choice of classes:
-The number of classes should be below ten if possible
-Wherever practical, class intervals should be equal
-Class intervals of 5 to 10 are more convenient
-Classes should be chosen in such a way that occurrences within the classes tend to
balance around the midpoints of the classes
Numerical Statistics
-These include the mean, mode, median, standard deviation, interquartile range, percentiles,
quartiles and variance.
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position
within that set of data. As such, measures of central tendency are sometimes called measures
of central location. They are also classed as summary statistics. The mean (often called the
average) is most likely the measure of central tendency that you are most familiar with, but
there are others, such as the median and the mode.
The mean, median and mode are all valid measures of central tendency, but under different
conditions, some measures of central tendency become more appropriate to use than others.
In the following sections, we will look at the mean, mode and median, and learn how to
calculate them and under what conditions they are most appropriate to be used.
Mean (Arithmetic)
The mean (or average) is the most popular and well known measure of central tendency. It
can be used with both discrete and continuous data, although its use is most often with
continuous data. The mean is equal to the sum of all the values in the data set divided by the
number of values in the data set. So, if we have n values in a data set and they have values x_1,
x_2, ..., x_n, the sample mean, usually denoted by $\bar{x}$ (pronounced x bar), is:
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
This formula is usually written in a slightly different manner using the Greek capital letter
$\Sigma$, pronounced "sigma", which means "sum of...":
\[ \bar{x} = \frac{\sum x}{n} \]
The above formula refers to the sample mean. This is because, in statistics, samples and
populations have very different meanings and these differences are very important, even if, in
the case of the mean, they are calculated in the same way. To acknowledge that we are
calculating the population mean and not the sample mean, we use the Greek lower case letter
"mu", denoted as µ:
\[ \mu = \frac{\sum x}{n} \]
The mean is essentially a model of your data set: a single value that stands in for all the
observations. You will notice, however, that the mean is not often one of the actual values
that you have observed in your data set.
However, one of its important properties is that it minimises error in the prediction of any
one value in your data set. That is, it is the value that produces the smallest sum of squared
deviations from the values in the data set.
An important property of the mean is that it includes every value in your data set as part of
the calculation. In addition, the mean is the only measure of central tendency where the sum
of the deviations of each value from the mean is always zero.
For example, consider the wages of staff at a factory below (mean for ungrouped data):
Staff        1   2   3   4   5   6   7   8   9  10
Salary ($)  15  18  16  14  15  15  12  17  90  95
The mean salary for these ten staff is $30.7. However, inspecting the raw data suggests that
this mean value might not be the best way to accurately reflect the typical salary of a worker,
as most workers have salaries in the $12 to $18 range. The mean is being skewed by the two
large salaries. Therefore, in this situation, we would like to have a better measure of central
tendency. As we will find out later, taking the median would be a better measure of central
tendency in this situation.
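The effect of the two large salaries can be checked numerically. The sketch below uses Python's standard `statistics` module on the ten salaries above, and also checks the property that deviations from the mean sum to zero.

```python
from statistics import mean, median

# Salaries ($) of the ten staff from the table above.
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = mean(salaries)          # sum of values divided by their number
median_salary = median(salaries)      # average of the 5th and 6th sorted values

# Deviations from the mean cancel out (up to floating-point rounding).
sum_of_deviations = sum(x - mean_salary for x in salaries)

print(mean_salary)      # 30.7
print(median_salary)    # 15.5
```

The median falls inside the $12 to $18 range where most salaries lie, while the mean does not.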
Exercise
Example (mean for grouped data)
\[ \bar{x} = \frac{\sum fx}{\sum f} \]
The heights of boys in class are measured to the nearest cm and the results are tabulated as
follows
\[ \bar{x} = \frac{8560}{50} = 171.2 \]
The data below shows the age distribution of a small village. Find the mean for the data.
Age (yrs)   Frequency (f)   Midpoint (x)   fx
0-14        18              7
15-19       21              17
20-24       38              22
25-34       41              29.5
35-44       38              39.5
45-59       15              52
60-69       20              64.5
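A sketch of the grouped-mean calculation for the village data follows. It assumes each midpoint is taken as (lower limit + upper limit) / 2, and the variable names are illustrative only.

```python
# Age classes (years) and frequencies from the table above.
age_classes = [(0, 14), (15, 19), (20, 24), (25, 34), (35, 44), (45, 59), (60, 69)]
frequencies = [18, 21, 38, 41, 38, 15, 20]

# Assumed midpoint rule: halfway between the two class limits.
midpoints = [(lo + hi) / 2 for lo, hi in age_classes]

# The fx column of the table.
fx = [f * x for f, x in zip(frequencies, midpoints)]

# Grouped mean = sum(fx) / sum(f).
grouped_mean = sum(fx) / sum(frequencies)

for (lo, hi), f, x, product in zip(age_classes, frequencies, midpoints, fx):
    print(f"{lo:>2}-{hi:<2}  f={f:>2}  x={x:>4}  fx={product}")
print(round(grouped_mean, 2))
```

Under that midpoint rule the mean age comes out at roughly 31.9 years.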
Median
The median is the central value when all observations are sorted in order.
-If there is an odd number of observations, then it is simply the middle value; if there is an
even number of observations then it is the average of the middle two.
-The median does not have the beneficial mathematical properties of the mean.
-The median is the middle score for a set of data that has been arranged in order of
magnitude.
-The median is less affected by outliers and skewed data. In order to calculate the median,
suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
Our median mark is the middle mark - in this case, 56, the 6th value. It is the middle
mark because there are 5 scores before it and 5 scores after it. This works fine when you have
an odd number of scores, but what happens when you have an even number of scores? What
if you had only 10 scores? Well, you simply have to take the middle two scores and average
the result. So, if we look at the example below:
65 55 89 56 35 14 56 55 87 45
14 35 45 55 55 56 56 65 87 89
Only now we have to take the 5th and 6th score in our data set and average them to get a
median of 55.5.
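The odd and even cases can be written as one small function. This is a sketch of the rule described above, not a library routine, and it is applied to the two sets of marks from the text.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd count: single middle value
        return ordered[middle]
    # even count: average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

eleven_marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
ten_marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

print(median(eleven_marks))   # 56
print(median(ten_marks))      # 55.5
```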
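Example (grouped data)
\[ \text{Median for grouped data} = l_m + \frac{c_m\left(\tfrac{1}{2}n - f_{m-1}\right)}{f_m} \]
Where:
l_m = lower class boundary of the median class
c_m = class width of the median class
n = total frequency
f_{m-1} = cumulative frequency of the classes below the median class
f_m = frequency of the median class
Calculate the median for the grouped data on heights of boys in class.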
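Since the table of heights is not reproduced in these notes, the sketch below applies the grouped-median formula to the Example 4 table (days to complete a project) instead. It assumes class boundaries half a day below each lower limit (for example 14.5 for the class 15-19) and takes f_{m-1} to be the cumulative frequency below the median class.

```python
# Lower class limits and frequencies from the Example 4 table.
lower_limits = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]
frequencies = [2, 15, 21, 18, 14, 13, 9, 5, 2, 1]
width = 5                              # c_m, assumed equal for every class

n = sum(frequencies)
cumulative_below = 0                   # cumulative frequency before the current class
grouped_median = None

for lower, f in zip(lower_limits, frequencies):
    if cumulative_below + f >= n / 2:  # first class that reaches n/2 is the median class
        l_m = lower - 0.5              # assumed lower class boundary
        grouped_median = l_m + width * (n / 2 - cumulative_below) / f
        break
    cumulative_below += f

print(round(grouped_median, 2))
```

Here the median class is 15-19, with 38 observations below it and 18 inside it, so the estimate is 14.5 + 5(50 - 38)/18, about 17.8 days.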
Mode
The mode is simply the most commonly occurring value in the data. It is not generally used
because it is often not representative of the data, particularly when the dataset is small.
The mode is the most frequent score in our data set. On a bar chart or histogram it
corresponds to the highest bar. You can, therefore, sometimes consider the mode as being the
most popular option.
An example of a mode is presented below: what is the modal value in the data set below?
Normally, the mode is used for categorical data where we wish to know which is the most
common category, as illustrated below on forms of transport used by students to come to
college:
We can see above that the most common form of transport, in this particular data set, is the
bus. However, one of the problems with the mode is that it is not unique, so it leaves us with
problems when we have two or more values that share the highest frequency, such as below:
We are now stuck as to which mode best describes the central tendency of the data. This is
particularly problematic when we have continuous data because we are more likely not to
have any one value that is more frequent than the other. For example, consider measuring 30
people's weight (to the nearest 0.1 kg). How likely is it that we will find two or more people
with exactly the same weight (e.g., 67.4 kg)? The answer is: very unlikely - many
people might be close, but with such a small sample (30 people) and a large range of possible
weights, you are unlikely to find two people with exactly the same weight; that is, to the
nearest 0.1 kg. This is why the mode is very rarely used with continuous data.
Another problem with the mode is that it will not provide us with a very good measure of
central tendency when the most common mark is far away from the rest of the data in the data
set, as depicted in the diagram below:
In the above diagram the mode has a value of 2. We can clearly see, however, that the mode
is not representative of the data, which is mostly concentrated around the 20 to 30 value
range. To use the mode to describe the central tendency of this data set would be misleading.
Geometric Mean
It is defined as the antilogarithm of the arithmetic mean of the values taken on a log scale.
Equivalently, it is the nth root of the product of the n observations.
Harmonic mean
It is the reciprocal of the arithmetic mean of the reciprocals of the observations.
HM is appropriate in situations where the reciprocals of values are more useful. HM is used
when we want to determine the average sample size of a number of groups, each of which has
a different sample size.
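Both definitions can be illustrated with a short sketch. The group sizes below are hypothetical values chosen for illustration, and the two functions are written out by hand so that each step of the definition is visible.

```python
import math

def geometric_mean(values):
    """Antilog of the arithmetic mean of the logs (positive values only)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / v for v in values)

# Hypothetical sample sizes of three groups.
group_sizes = [10, 20, 40]

gm = geometric_mean(group_sizes)
hm = harmonic_mean(group_sizes)
# The nth root of the product gives the same geometric mean.
nth_root = math.prod(group_sizes) ** (1 / len(group_sizes))

print(round(gm, 3), round(nth_root, 3), round(hm, 3))
```

For these values the harmonic mean is below the geometric mean, which in turn is below the arithmetic mean of about 23.3.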
Skewness
Skewed and askew are widely used terms that refer to something that is out of
line or distorted to one side. Similarly, when referring to the shape of frequency
distributions or probability distributions, the term skewness refers to the asymmetry of that
distribution. A distribution with an asymmetric tail extending out to the right is referred to as
“positively skewed” or “skewed to the right”, while a distribution with an asymmetric tail
extending out to the left is referred to as “negatively skewed” or “skewed to the left”. The
range of skewness is from minus infinity (−∞) to positive infinity (+∞). In simple words,
skewness is a measure of asymmetry, that is, of the lack of symmetry.
Karl Pearson (1857-1936) first suggested measuring skewness by standardizing the difference
between the mean and the mode, such that
\[ \text{skewness} = \frac{\mu - \text{mode}}{\text{standard deviation}} \]
Since population modes are not well estimated from sample modes, it was
suggested that one can estimate the difference between the mean and the mode as being three
times the difference between the mean and the median. Therefore, the estimate of skewness
will be:
\[ \text{skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} \]
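As an illustration, the median-based estimate can be computed for the ten factory salaries used in the section on the mean. The sketch assumes the population form of the standard deviation (dividing by n), which `statistics.pstdev` provides.

```python
from statistics import mean, median, pstdev

# Salaries ($) from the mean example earlier in these notes.
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

# Estimate of skewness: 3 (mean - median) / standard deviation.
skewness = 3 * (mean(salaries) - median(salaries)) / pstdev(salaries)

print(round(skewness, 2))
```

The result is positive, consistent with a long right tail produced by the two large salaries.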
It is important for researchers from the behavioral and business sciences to measure skewness
when it appears in their data. A large amount of skewness may motivate the researcher to
investigate the existence of outliers. When making decisions about which measure of location
to report and which inferential statistic to employ, one should take into consideration the
estimated skewness of the population. Normal distributions have zero skewness.
It is important to get a sense of the symmetry or skewness of the data to see whether
the distribution is fairly normal or balanced, or whether it is skewed to either the left or
the right. The skewness (depending on whether it is skewed to the left or right) will give us
some idea of whether there are a few extremely large values or a few extremely small values in
our data.
That will also help us decide whether to use just the mean as a summary measure
or whether it might be better to report the median as well. We will learn how to identify symmetry
and skewness from simply looking at the general shape of the distribution and from
numerical summary measures such as the mean and median.
Below are histograms of particular data. From the earlier sections, you should have
learned that histograms are great for showing the shape of the distribution.
SKEWED TO THE LEFT (MEAN < MEDIAN)(-VE SKEW)
In a distribution which is skewed to the left, the value of the mean is less than the
median. Note the skewness is in the direction of the long tail (which is on the left side
in this case, thus it is skewed to the left). The small values tend to pull the mean to
the left so it is a little lower than the median.
SKEWED TO THE RIGHT (MEAN > MEDIAN)(+VE SKEW)
In a distribution which is skewed to the right, the value of the mean is greater than the
median. Again, the skewness is in the direction of the long tail (which is on the right
side in this case, thus it is skewed to the right). The large values tend to pull the mean
to the right so it is a little larger than the median.
2. Measures of variability
The measures of central tendency are not adequate to describe data. Two data sets can have
the same mean but they can be entirely different. Thus to describe data, one needs to know
the extent of variability. This is given by the measures of dispersion. Range, interquartile
range, and standard deviation are the three commonly used measures of dispersion.
Range
Range is the difference between the largest and smallest observation in the dataset. The
disadvantage of this measure is that it is based on only two of the observations and may not
be representative of the whole dataset, particularly if there are outliers. In addition, it gives no
information regarding how the data are distributed between the two extremes.
The prime advantage of this measure of dispersion is that it is easy to calculate. On the other
hand, it has a lot of disadvantages. It is very sensitive to outliers and does not use all the
observations in a data set. It is more informative to provide the minimum and the maximum
values rather than providing the range.
Interquartile range
Interquartile range is defined as the difference between the 75th and 25th percentiles (also called
the third and first quartiles), i.e. Q3 − Q1. Hence the interquartile range describes the middle 50%
of observations. If the interquartile range is large, it means that the middle 50% of
observations are spaced wide apart.
Like the median, the interquartile range is not influenced by unusually high or low values and
may be particularly useful when data are not symmetrically distributed. Ranges based on
alternative subdivisions of the data can also be calculated; for example, if the data are split
into deciles, 80% of the data will lie between the bottom and top deciles and so on.
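Quartile deviation (semi-interquartile range):
\[ QD = \frac{Q_3 - Q_1}{2} \]
Mean deviation for ungrouped data:
\[ MD = \frac{\sum |x - \bar{x}|}{n} \]
Mean deviation for grouped data:
\[ MD = \frac{\sum f|x - \bar{x}|}{\sum f} \]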
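These measures can be tried on the eleven marks from the median example. Conventions for locating quartiles differ between textbooks and software. The sketch below assumes the simple rule that Q1 and Q3 are the medians of the lower and upper halves of the sorted data, leaving out the middle value when n is odd.

```python
from statistics import mean, median

# The eleven marks used in the median example, sorted.
marks = sorted([65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92])
n = len(marks)

# Assumed quartile rule: medians of the lower and upper halves.
lower_half = marks[: n // 2]
upper_half = marks[(n + 1) // 2:]
q1 = median(lower_half)
q3 = median(upper_half)

iqr = q3 - q1                         # interquartile range
qd = iqr / 2                          # quartile deviation

# Mean deviation: average absolute distance from the mean.
x_bar = mean(marks)
md = sum(abs(x - x_bar) for x in marks) / n

print(q1, q3, iqr, qd, round(md, 2))
```

Other quartile rules may give slightly different values of Q1 and Q3 for the same data.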
Standard deviation
The standard deviation summarizes a great deal of information in one number and, like the
mean, has useful mathematical properties.
-it uses information from every observation
Algebraically, the standard deviation for a set of n values {x_1, x_2, ..., x_n} is written as follows:
\[ SD = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}, \quad \text{for ungrouped data} \]
where $\bar{x}$ is the mean of the n values.
Example
60, 72, 61, 66, 63, 66, 59, 64, 71, 68.
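A worked sketch for these ten values, using the divide-by-n formula for ungrouped data given above, is shown below. The variable names are illustrative only.

```python
import math

# The ten observations from the example.
data = [60, 72, 61, 66, 63, 66, 59, 64, 71, 68]
n = len(data)

x_bar = sum(data) / n                                   # mean
squared_deviations = [(x - x_bar) ** 2 for x in data]   # (x_i - mean)^2
variance = sum(squared_deviations) / n                  # divide by n, as in the notes
sd = math.sqrt(variance)

print(x_bar, variance, round(sd, 3))
```

The mean is 65, the squared deviations sum to 178, so the variance is 17.8 and the standard deviation is about 4.22.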
\[ SD = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2}, \quad \text{for grouped data} \]
Example
The heights of boys in class are measured to the nearest cm and the results are tabulated as
follows, calculate the standard deviation for the data
Another measure of variability that may be encountered is the variance. This is simply the
square of the standard deviation:
Variance = S^2
\[ var = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n} \]
\[ var = \frac{\sum fx^2}{\sum f} - \bar{x}^2 \]
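The two variance formulas above agree with each other. A short derivation, assuming the divide-by-n definition used in these notes and $\bar{x} = \frac{1}{n}\sum x_i$, is:

```latex
\begin{align*}
\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2
  &= \frac{1}{n}\sum_{i=1}^{n} x_i^2
     - 2\bar{x}\cdot\frac{1}{n}\sum_{i=1}^{n} x_i
     + \frac{1}{n}\cdot n\bar{x}^2 \\
  &= \frac{1}{n}\sum_{i=1}^{n} x_i^2 - 2\bar{x}^2 + \bar{x}^2 \\
  &= \frac{\sum_{i=1}^{n} x_i^2}{n} - \bar{x}^2
\end{align*}
```

For grouped data each value x occurs f times, so $\sum x_i^2$ becomes $\sum fx^2$ and n becomes $\sum f$, which gives the second formula.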
The variance is not generally used in data description but is central to analysis of variance.
Normal distribution
Symmetrical “Bell-shaped” distribution
Easiest to use mathematically
Many variables are normally distributed
Can be described by two numbers
Mean (measure of location)
Standard Deviation (measure of variation)