
Lecture Notes 2 Data Organization and Presentation

This document discusses various methods for organizing and presenting quantitative data, including graphical and numerical summaries. It provides examples of different data visualization techniques like histograms, bar charts, pie charts, and stem-and-leaf diagrams. Guidelines are given for constructing frequency tables and distributions from raw data, as well as rules for designing clear and informative tables and graphs.

NATIONAL DIPLOMA IN QUANTITY SURVEYING-575/15/TN/0

SUBJECT TITLE: STATISTICS


SUBJECT CODE: 575/15/SO6

DATA ORGANIZATION AND PRESENTATION


When data are first collected they are raw: ungrouped, muddled, cumbersome and
uninteresting, so they need to be summarised.
Appropriate ways to summarise these data:
- Graphical Summary
- Numerical Summary

Graphical Summary
Data are presented in the form of tables, tree diagrams, stem-and-leaf diagrams, bar charts,
pie charts, pictographs, line graphs, histograms, frequency distribution curves, ogives, etc.

Why do we summarise data?


- to describe and summarise the data, which is the first step in any analysis
- to reduce data
- to conserve storage space
- to order data
- to see the salient features of the data
- to become familiar with the data
- to look for unusually high or low values (outliers)
- to check the assumptions required for statistical tests
- to decide the best way to categorise the data, if this is necessary

In addition to tables and graphs, summary values are a convenient way to summarise
large amounts of information.

We shall describe and give examples of qualitative data (unordered and ordered) and
quantitative data (discrete and continuous); how these types of data can be represented
figuratively; the two important features of a quantitative dataset (location and
variability); the measures of location (mean, median and mode); the measures of
variability (range, interquartile range, standard deviation and variance)

How to design a table after collecting data?


a) layout rows and columns
b) content of cells is created by rows and columns
c) Annotation (footnotes should be used to qualify or clarify the table)
d) It should be simple
e) Source of data must be stated
f) Units of measurement must be clearly stated
General rules for presentation of graphs
a) Short and informative title (clear and comprehensive title)
b) Correct impression must be given
c) Units of measurement must be shown
Example 1
1. Cross table
Table 1: Mean number of students per class in the Civil Engineering department, by course level

Course                        NC    ND    HND
Quantity Surveying            24    22    10
Water Resources Engineering   25    18     9
Civil Engineering             30    25    10

2. Stem-and-Leaf Diagrams
A stem-and-leaf diagram has the advantage of retaining the data in its original form while
providing a visual representation. Illustrated below is the age distribution of some adults
aspiring to be presidential candidates. In this case the stem, the tens digit of the candidate's
age, is given on the left, and the leaf, the units digit of the candidate's age, is given on the
right.
Example 2
The ages of 43 presidential candidates are as follows:
42, 43, 46, 46, 47, 48, 49, 49, 50, 51, 51, 51, 51, 51, 52, 52, 54, 54, 54, 54, 54, 55, 55,
55, 55, 56, 56, 56, 57, 57, 57, 57, 58, 60, 61, 61, 61, 62, 64, 64, 65, 68, 69
Stem Leaf
4|23667899
5|0111112244444555566677778
6|0111244589
Or
Reformatting the above with more rows (some books call this splitting the stem) shows even
more clearly its roughly symmetric, bell-shaped (approximately normal) shape. Notice how the
stem-and-leaf diagram is also somewhat like a histogram, but turned on its side.
Stem Leaf
4|23
4|667899
5|0111112244444
5|555566677778
6|0111244
6|589

Please note that the separation line should be continuous. The following rules should be
observed when constructing stem-and-leaf diagrams.
1. The leaves on the right should be in increasing (or decreasing) order, left to right.
2. No commas should appear on the right.
3. No horizontal lines should appear.
4. If the stem/leaf break occurs at a decimal point, put the decimal point to the left with
the stem.
5. If the leaf is double or triple digit, etc., leave a [half] space between each entry.
6. There should be at least five but no more than twenty rows.
7. If a range is used for the stem, an asterisk (*) may be used to separate the
corresponding leaves.
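
The construction described above can be sketched in Python. This is a minimal illustration, not part of the original notes: the function name stem_and_leaf is hypothetical, it assumes two-digit whole-number data (stem = tens digit, leaf = units digit), and it skips any stem that has no leaves.

```python
from collections import defaultdict

# Ages of the 43 candidates from Example 2.
ages = [42, 43, 46, 46, 47, 48, 49, 49, 50, 51, 51, 51, 51, 51, 52, 52,
        54, 54, 54, 54, 54, 55, 55, 55, 55, 56, 56, 56, 57, 57, 57, 57,
        58, 60, 61, 61, 61, 62, 64, 64, 65, 68, 69]


def stem_and_leaf(values):
    """Return one display row per stem, with leaves in increasing order."""
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting keeps the leaves ordered (rule 1)
        leaves[v // 10].append(v % 10)
    # No commas between leaves (rule 2).
    return [f"{stem}|{''.join(str(d) for d in leaves[stem])}"
            for stem in sorted(leaves)]


rows = stem_and_leaf(ages)
print("\n".join(rows))
```

The printed rows should match the three-row diagram in Example 2.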
Example 3
The number of rooms in each of 40 houses in a particular street is given by the
following set of data:

5 6 4 3 3 6 6 4 5 4 7 8 3 5 4 4 4 8 8 3 5 5 6 5 7
4 6 5 4 3 3 4 5 5 4 7 6 10 9 8
- For the information to be manageable, we divide it into groups and form a
frequency table.
- Recording each observation with a stroke against its group is called tallying.
- If there are only a few data values, we normally array (rearrange) them in order of size.
3. Frequency Tables or Distributions

A frequency table lists in one column the data categories or classes and
in another column the corresponding frequencies.

Score limits (class limits) are the largest or smallest numbers which can actually belong to each class.
Class interval (class width) is the difference between two exact limits (class boundaries) (or
corresponding score/class limits).
Guidelines for constructing frequency tables.
1. The classes must be "mutually exclusive"—no element can belong to more than one class.
2. Even if the frequency is zero, include each and every class.
3. Make all classes the same width. (However, open ended classes may be inevitable.)
4. Target between 5 and 20 classes, depending on the range and number of data points.
5. Keep the limits as simple and as convenient as possible (multiple of width?).
6. If practical, make the width odd so that the interval midpoint is a whole number.

3. Frequency distribution table for the number of rooms in each of 40 houses

Number of rooms   Tally          Frequency (fi)
3                 IIIII I         6
4                 IIIII IIIII    10
5                 IIIII IIII      9
6                 IIIII I         6
7                 III             3
8                 IIII            4
9                 I               1
10                I               1
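
The tallying step can also be done by machine. The sketch below is an illustration rather than part of the notes; it counts the Example 3 data with Python's collections.Counter and prints a simple tally using "|" strokes.

```python
from collections import Counter

# Number of rooms in each of the 40 houses (Example 3).
rooms = [5, 6, 4, 3, 3, 6, 6, 4, 5, 4, 7, 8, 3, 5, 4, 4, 4, 8, 8, 3, 5, 5, 6, 5, 7,
         4, 6, 5, 4, 3, 3, 4, 5, 5, 4, 7, 6, 10, 9, 8]

freq = Counter(rooms)  # maps each value to the number of times it occurs

for value in sorted(freq):
    # one stroke per observation, followed by the frequency
    print(f"{value:>2}  {'|' * freq[value]:<10}  {freq[value]}")
```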

4. Bar Chart
Data are represented as a series of bars, with the height of each bar proportional to the frequency.
[Figure: bar graph for the number of rooms in each of 40 houses; axes: number of rooms (3 to 10) and frequency (0 to 12)]

5. Line graph for the number of rooms in each of 40 houses
[Figure: line graph; axes: number of rooms (3 to 10) and frequency (0 to 12)]

6. Pie chart
- A pie chart is a circle divided by radial lines into sectors so that the area of each
sector is proportional to the frequency (the size of the figure) it represents.

[Figure: pie chart for the number of rooms in each of 40 houses; sectors for 3, 4, 5, 6, 7, 8, 9 and 10 rooms]

7. Histogram
- A bar chart for a continuous distribution is referred to as a histogram.
- It is similar to a bar chart, but the variable is continuous, not categorical.
- The area of each bar is proportional to the frequency (the probability of an observation
falling) in that class.
- The vertical axis can show:
  - Frequency (heights add up to n)
  - Percentage (heights add up to 100%)
  - Density (areas add up to 1)
Example 4
From the frequency table below, which shows the number of days technologists
spend to complete a certain project, construct a histogram.

Number of days   Tally mark                   Frequency
0-4              II                            2
5-9              IIIII IIIII IIIII            15
10-14            IIIII IIIII IIIII IIIII I    21
15-19            IIIII IIIII IIIII III        18
20-24            IIIII IIIII IIII             14
25-29            IIIII IIIII III              13
30-34            IIIII IIII                    9
35-39            IIIII                         5
40-44            II                            2
45-49            I                             1

- When class intervals are equal, a histogram can be constructed straight away from
the given data (drawn manually).
8. Frequency curve
Procedure
- Mark the midpoint of the top of each bar on a histogram.
- Join the points with straight lines, then smooth them to form a curve.

9. Ogive
- An ogive is a graph drawn from a cumulative frequency distribution [ALWAYS USE GRAPH
PAPER]
Procedure
- Compute the cumulative frequencies of the distribution.
- Prepare a graph with the class boundaries on the horizontal axis and the cumulative
frequency on the vertical axis.
- The starting point should be zero.
- Plot each cumulative frequency against the upper class boundary of its class.
Example 5
Using the data for the number of rooms in each of 40 houses, construct a
cumulative frequency graph (a "less than" ogive).

Cumulative frequency table for the number of rooms in each of 40 houses

Number of rooms   Cumulative frequency
≤ 3                6
≤ 4               16
≤ 5               25
≤ 6               31
≤ 7               34
≤ 8               38
≤ 9               39
≤ 10              40

Draw a Cumulative frequency curve
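
As a check on the cumulative frequency table, the running totals can be computed directly from the frequency table for the 40 houses. This is a small sketch using Python's itertools.accumulate; the variable names are illustrative only.

```python
from itertools import accumulate

values = [3, 4, 5, 6, 7, 8, 9, 10]        # number of rooms
frequencies = [6, 10, 9, 6, 3, 4, 1, 1]   # from the frequency table

# Each entry is the total of all frequencies up to and including that value.
cumulative = list(accumulate(frequencies))

for v, cf in zip(values, cumulative):
    print(f"<= {v:>2}: {cf}")
```

The last cumulative frequency must equal the total number of observations (40), which is a quick way to catch arithmetic slips before plotting the ogive.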


Exercise
1. The data below show the age distribution of a small village.

$$\text{Frequency density} = \frac{\text{frequency}}{\text{class width}}$$

Age (yrs)   Frequency   Frequency density
0-14        18          1.2
15-19       21          4.2
20-24       38          7.6
25-34       41          4.1
35-44       38          3.8
45-59       15          1
60+         20          2

Draw a histogram to represent this information, stating any assumptions you make.
2. The table below shows the distribution of skills offered by a construction company.

Skill         % available
Survey        12
Billing       20
Building      26
Plumbing      32
Civil works   10

Represent this information in a pie chart.

NB: When given raw data you have to make a choice of classes:
- Classes should be below ten if possible.
- Wherever practical, class intervals should be equal.
- Class intervals of 5 to 10 are more convenient.
- Classes should be chosen in such a way that occurrences within the classes tend to
balance around the midpoints of the classes.
Numerical Statistics
- These are the mean, mode, median, standard deviation, interquartile range, percentiles,
quartiles and variance.

1. Measures of Central Tendency


A measure of central tendency is a single value that describes the way in which a group of
data cluster around a central value. To put in other words, it is a way to describe the center of
a data set. There are three measures of central tendency: the mean, the median, and the mode.

It is a single value that attempts to describe a set of data by identifying the central position
within that set of data. As such, measures of central tendency are sometimes called measures
of central location. They are also classed as summary statistics. The mean (often called the
average) is most likely the measure of central tendency that you are most familiar with, but
there are others, such as the median and the mode.

The mean, median and mode are all valid measures of central tendency, but under different
conditions, some measures of central tendency become more appropriate to use than others.
In the following sections, we will look at the mean, mode and median, and learn how to
calculate them and under what conditions they are most appropriate to be used.

Mean (Arithmetic)

The mean (or average) is the most popular and well-known measure of central tendency. It
can be used with both discrete and continuous data, although its use is most often with
continuous data. The mean is equal to the sum of all the values in the data set divided by the
number of values in the data set. So, if we have n values in a data set and they have values x1,
x2, ..., xn, the sample mean, usually denoted by $\bar{x}$ (pronounced "x bar"), is:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

This formula is usually written in a slightly different manner using the Greek capital letter
$\Sigma$, pronounced "sigma", which means "sum of...":

$$\bar{x} = \frac{\sum x}{n}$$

The above formula refers to the sample mean. This is because, in statistics, samples and
populations have very different meanings and these differences are very important, even if, in
the case of the mean, they are calculated in the same way. To acknowledge that we are
calculating the population mean and not the sample mean, we use the Greek lower case letter
"mu", denoted as µ:

$$\mu = \frac{\sum x}{n}$$

The mean is essentially a model of your data set: a single value used to represent all of
the values. You will notice, however, that the mean is often not one of the actual values
that you have observed in your data set.

However, one of its important properties is that it minimises error in the prediction of any
one value in your data set. That is, it is the value that produces the lowest total squared
error (sum of squared deviations) from all the values in the data set.

An important property of the mean is that it includes every value in your data set as part of
the calculation. In addition, the mean is the only measure of central tendency where the sum
of the deviations of each value from the mean is always zero.

One main disadvantage of mean: it is particularly susceptible to the influence of outliers.


These are values that are unusual compared to the rest of the data set by being especially
small or large in numerical value.

For example, consider the wages of staff at a factory below (mean for ungrouped data):

Staff        1    2    3    4    5    6    7    8    9    10
Salary ($)   15   18   16   14   15   15   12   17   90   95

The mean salary for these ten staff is $30.7. However, inspecting the raw data suggests that
this mean value might not be the best way to accurately reflect the typical salary of a worker,
as most workers have salaries in the $12 to $18 range. The mean is being skewed by the two
large salaries. Therefore, in this situation, we would like to have a better measure of central
tendency. As we will find out later, taking the median would be a better measure of central
tendency in this situation.
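
The effect of the two large salaries can be checked numerically. The sketch below uses Python's standard statistics module on the ten salaries from the table; it is an illustration added for clarity, not part of the original notes.

```python
import statistics

# Salaries ($) of the ten staff members from the table above.
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = statistics.mean(salaries)      # sum of values / number of values
median_salary = statistics.median(salaries)  # average of the two middle values

print(mean_salary)    # 30.7
print(median_salary)  # 15.5
```

The median (15.5) sits inside the $12 to $18 range where most salaries lie, while the mean (30.7) does not.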

Exercise

Calculate the mean for the data below


60, 72, 61, 66, 63, 66, 59, 64, 71, 68.

Example

Mean for grouped data

$$\bar{x} = \frac{\sum fx}{\sum f}$$

The heights of boys in a class are measured to the nearest cm and the results are tabulated as
follows:

Height (cm)   Frequency (f)   Midpoint (x)   fx
145-154.9      3              150             450
155-164.9      9              160            1440
165-174.9     21              170            3570
175-184.9     13              180            2340
185-194.9      4              190             760
Total         ∑f = 50                        ∑fx = 8560

$$\bar{x} = \frac{8560}{50} = 171.2$$
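
The grouped-mean calculation can be reproduced in a few lines of Python. This is a minimal sketch that takes the class midpoints and frequencies exactly as tabulated for the heights example; the variable names are illustrative.

```python
# Midpoints (x) and frequencies (f) from the heights table.
midpoints = [150, 160, 170, 180, 190]
frequencies = [3, 9, 21, 13, 4]

sum_f = sum(frequencies)                                     # total frequency
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))  # sum of f times x

grouped_mean = sum_fx / sum_f
print(sum_f, sum_fx, grouped_mean)  # 50 8560 171.2
```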

The data below show the age distribution of a small village. Find the mean for the data.

Age (yrs)   Frequency   Midpoint (x)   fx
0-14        18           7
15-19       21          17
20-24       38          22
25-34       41          29.5
35-44       38          39.5
45-59       15          52
60-69       20          64.5
Median
The median is the central value when all observations are sorted in order of magnitude.

- If there is an odd number of observations, then it is simply the middle value; if there is an
even number of observations then it is the average of the middle two.

- The median does not have the beneficial mathematical properties of the mean.

- However, it is not generally influenced by extreme values (outliers) or skewed data, and as a
result it is particularly useful in situations where there are unusually low or high values that
would render the mean unrepresentative of the data.

To see how the median is calculated, suppose we have the data below:

Example (ungrouped data)

65 55 89 56 35 14 56 55 87 45 92

We first need to rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89 92

Our median mark is the middle mark, in this case 56 (the sixth value in the ordered list). It
is the middle mark because there are 5 scores before it and 5 scores after it. This works fine
when you have an odd number of scores, but what happens when you have an even number of
scores? What if you had only 10 scores? Well, you simply have to take the middle two scores
and average the result. So, if we look at the example below:

65 55 89 56 35 14 56 55 87 45

We again rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89

Only now we have to take the 5th and 6th score in our data set and average them to get a
median of 55.5.
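
The two cases (odd and even number of scores) can be written as one short function. This is a sketch for illustration; the function name median and the list names are not from the notes, and Python's built-in statistics.median does the same job.

```python
def median(values):
    """Middle value of the sorted data; average of the middle two if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of middle pair


marks_11 = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
marks_10 = marks_11[:-1]  # the same data without the final score of 92

print(median(marks_11))  # 56
print(median(marks_10))  # 55.5
```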
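Example (grouped data)

$$\text{Median for grouped data} = l_m + \frac{c_m\left(\tfrac{1}{2}n - f_{m-1}\right)}{f_m}$$

Where:

$l_m$ = lower class boundary of the median class

$c_m$ = width of the median class

$n$ = total number of observations ($\sum f$)

$f_{m-1}$ = cumulative frequency up to and including the class immediately below the median class

$f_m$ = frequency of the median class

Calculate the median for the grouped data on heights of boys in class.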
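
One way to carry out this calculation is sketched below. It interpolates inside the median class, using the cumulative frequency of all classes below it. The function name grouped_median is hypothetical, and the lower class boundaries 145, 155, ..., 185 with a common width of 10 are an assumption chosen to agree with the midpoints 150, 160, ..., 190 used in the notes.

```python
def grouped_median(lower_boundaries, width, frequencies):
    """Median by linear interpolation within the median class (equal widths)."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("no observations")
    cumulative_below = 0  # cumulative frequency of the classes below the current one
    for lower, f in zip(lower_boundaries, frequencies):
        if cumulative_below + f >= n / 2:  # this class contains the (n/2)th value
            return lower + width * (n / 2 - cumulative_below) / f
        cumulative_below += f


# Heights of boys: assumed boundaries, frequencies from the table.
median_height = grouped_median([145, 155, 165, 175, 185], 10, [3, 9, 21, 13, 4])
print(round(median_height, 2))
```

Here n/2 = 25, the cumulative frequencies are 3, 12, 33, so the median class is 165-174.9 and the median is 165 + 10(25 - 12)/21.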

Mode
The mode is simply the most commonly occurring value in the data. It is not generally used
because it is often not representative of the data, particularly when the dataset is small.

The mode is the most frequent score in our data set. On a bar chart or histogram it is
represented by the highest bar. You can, therefore, sometimes consider the mode as being the
most popular option.

Examples of the mode are presented below. What is the modal value in each data set?

i) 1, 2, 3, 4, 100: the mode does not exist (no value is repeated)
ii) 12, 16, 8, 11, 12, 8, 2, 8, 1, 14: the mode is 8
iii) 12, 16, 8, 11, 12, 8, 2, 8, 1, 14, 12: the modes are 8 and 12 (a bimodal set)
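
The three cases above can be checked with a short function. This is a sketch, with the hypothetical name modes; it follows the convention used in these notes that a data set in which no value is repeated has no mode, and returns every value that shares the highest frequency otherwise.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or [] if no value occurs more than once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: no mode under this convention
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 2, 3, 4, 100]))                           # []
print(modes([12, 16, 8, 11, 12, 8, 2, 8, 1, 14]))         # [8]
print(modes([12, 16, 8, 11, 12, 8, 2, 8, 1, 14, 12]))     # [8, 12]
```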

Normally, the mode is used for categorical data where we wish to know which is the most
common category, as illustrated below on forms of transport used by students to come to
college:
We can see above that the most common form of transport, in this particular data set, is the
bus. However, one of the problems with the mode is that it is not unique, so it leaves us with
problems when we have two or more values that share the highest frequency, such as below:
We are now stuck as to which mode best describes the central tendency of the data. This is
particularly problematic when we have continuous data because we are more likely not to
have any one value that is more frequent than the others. For example, consider measuring 30
people's weight (to the nearest 0.1 kg). How likely is it that we will find two or more people
with exactly the same weight (e.g., 67.4 kg)? The answer is: probably very unlikely. Many
people might be close, but with such a small sample (30 people) and a large range of possible
weights, you are unlikely to find two people with exactly the same weight; that is, to the
nearest 0.1 kg. This is why the mode is very rarely used with continuous data.

Another problem with the mode is that it will not provide us with a very good measure of
central tendency when the most common mark is far away from the rest of the data in the data
set, as depicted in the diagram below:
In the above diagram the mode has a value of 2. We can clearly see, however, that the mode
is not representative of the data, which is mostly concentrated around the 20 to 30 value
range. To use the mode to describe the central tendency of this data set would be misleading.

Summary of when to use the mean, median and mode


Please use the following summary table to know what the best measure of central tendency is
with respect to the different types of variable.

Type of Variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median
Advantages and Disadvantages of Measures of Central Tendency
NOT TO BE EXAMINED

Geometric Mean
It is defined as the arithmetic mean of the values taken on a log scale, transformed back to
the original scale. It is also expressed as the nth root of the product of the n observations.

GM is an appropriate measure when values change exponentially and in the case of a skewed
distribution that can be made symmetrical by a log transformation. GM is more commonly
used in microbiological and serological research. One important disadvantage of GM is that it
cannot be used if any of the values are zero or negative.

Harmonic mean
It is the reciprocal of the arithmetic mean of the reciprocals of the observations.

Equivalently, the reciprocal of HM is the mean of the reciprocals of the individual observations.

HM is appropriate in situations where the reciprocals of values are more useful. HM is used
when we want to determine the average sample size of a number of groups, each of which has
a different sample size.
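
Both means can be computed directly from their definitions. The sketch below is an illustration only: the function names and the two-value sample [2, 8] are made up for the example and do not come from the notes. The geometric mean is taken as the exponential of the mean of the logs, and the harmonic mean as n divided by the sum of the reciprocals.

```python
import math


def geometric_mean(values):
    """exp of the arithmetic mean of the logs; values must all be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / v for v in values)


sample = [2, 8]  # hypothetical data, chosen so the results are easy to check by hand
gm = geometric_mean(sample)  # square root of 2 * 8 = 4
hm = harmonic_mean(sample)   # 2 / (1/2 + 1/8) = 3.2
am = sum(sample) / len(sample)
print(hm, gm, am)
```

For positive data the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean.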

Skewness: Measure of Asymmetry

The terms skewed and askew are widely used to refer to something that is out of
order or distorted on one side. Similarly, when referring to the shape of frequency
distributions or probability distributions, the term skewness refers to the asymmetry of that
distribution. A distribution with an asymmetric tail extending out to the right is referred to as
"positively skewed" or "skewed to the right", while a distribution with an asymmetric tail
extending out to the left is referred to as "negatively skewed" or "skewed to the left". The
range of skewness is from minus infinity (−∞) to positive infinity (+∞). In simple words,
skewness is a measure of asymmetry, that is, of the lack of symmetry.
Karl Pearson (1857-1936) first suggested measuring skewness by standardizing the difference
between the mean and the mode, such that

$$\text{skewness} = \frac{\mu - \text{mode}}{\text{standard deviation}}$$

Since population modes are not well estimated from sample modes, it was suggested that one
can estimate the difference between the mean and the mode as being three times the
difference between the mean and the median. Therefore, the estimate of skewness will be:

$$\text{skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$$
It is important for researchers from the behavioral and business sciences to measure skewness
when it appears in their data. Great amount of skewness may motivate the researcher to
investigate the existence of outliers. When making decisions about which measure of location
to report and which inferential statistic to employ, one should take into consideration the
estimated skewness of the population. Normal distributions have zero skewness.
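
As an illustration, the median-based estimate, 3(mean - median) / standard deviation, can be applied to the ten factory salaries used earlier in these notes. This sketch assumes the standard deviation is computed by dividing by n (statistics.pstdev), which matches the formula given later for ungrouped data.

```python
import statistics

# Salaries ($) of the ten factory staff from the mean example.
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean = statistics.mean(salaries)
median = statistics.median(salaries)
sd = statistics.pstdev(salaries)  # divides by n, not n - 1

skewness = 3 * (mean - median) / sd
print(round(skewness, 2))
```

The result is positive, which agrees with the long right-hand tail created by the two large salaries.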

Shape of the Distribution: Symmetry and Skewness
Skewness is the degree of asymmetry, or departure from symmetry, of the distribution of a
real-valued random variable.

It is important to get a sense of the symmetry or skewness of the data to see whether
the distribution is fairly normal or balanced, or whether it is skewed to the left or the
right. The direction of the skewness will give us some idea of whether there are a few
extremely large values or a few extremely small values in our data.

That will also help us decide whether to just use the mean as a summary measure or
whether it might be better to report the median as well. We will learn how to identify
symmetry and skewness from simply looking at the general shape of the distribution and
from numerical summary measures such as the mean and median.

Below are histograms of particular data. Histograms are great for showing the shape of
the distribution.

SYMMETRIC DATA (MEAN = MEDIAN)

In a symmetric distribution, the value of the mean is equal to the median.


SKEWED TO THE LEFT (MEAN < MEDIAN) (-VE SKEW)

In a distribution which is skewed to the left, the value of the mean is less than the
median. Note that the skewness is in the direction of the long tail (which is on the left
side in this case, thus it's skewed to the left). The small values tend to pull the mean to
the left so it's a little lower than the median.

SKEWED TO THE RIGHT (MEAN > MEDIAN) (+VE SKEW)

In a distribution which is skewed to the right, the value of the mean is greater than the
median. Again, the skewness is in the direction of the long tail (which is on the right
side in this case, thus it's skewed to the right). The large values tend to pull the mean
to the right so it's a little larger than the median.

2. Measures of Variability
The measures of central tendency are not adequate to describe data. Two data sets can have
the same mean but they can be entirely different. Thus to describe data, one needs to know
the extent of variability. This is given by the measures of dispersion. Range, interquartile
range, and standard deviation are the three commonly used measures of dispersion.

Range

Range is the difference between the largest and smallest observation in the dataset. The
disadvantage of this measure is that it is based on only two of the observations and may not
be representative of the whole dataset, particularly if there are outliers. In addition, it gives no
information regarding how the data are distributed between the two extremes.

Range = (Largest measurement) - (smallest measurement)

It depends on only two measurements

The prime advantage of this measure of dispersion is that it is easy to calculate. On the other
hand, it has lot of disadvantages. It is very sensitive to outliers and does not use all the
observations in a data set. It is more informative to provide the minimum and the maximum
values rather than providing the range.

Interquartile range

Interquartile range is defined as the difference between the 75th and 25th percentiles (also
called the third and first quartiles), i.e. Q3 - Q1. Hence the interquartile range describes the
middle 50% of observations. If the interquartile range is large it means that the middle 50% of
observations are spaced wide apart.

The important advantage of interquartile range is that it can be used as a measure of
variability if the extreme values are not being recorded exactly. It is also not affected by
extreme values. The main disadvantage in using interquartile range as a measure of
dispersion is that it is not amenable to mathematical manipulation.

Like the median, the interquartile range is not influenced by unusually high or low values and
may be particularly useful when data are not symmetrically distributed. Ranges based on
alternative subdivisions of the data can also be calculated; for example, if the data are split
into deciles, 80% of the data will lie between the bottom and top deciles, and so on.

Less sensitive to extreme values

Need fairly large numbers of observations

Quartile deviation (semi-interquartile range)

$$QD = \frac{Q_3 - Q_1}{2}$$

1st quartile (Q1) or 25th percentile

2nd quartile (Q2) or 50th percentile

3rd quartile (Q3) or 75th percentile

The mean deviation

-for ungrouped data

$$MD = \frac{\sum |x - \bar{x}|}{n}$$

-for grouped data

$$MD = \frac{\sum f|x - \bar{x}|}{\sum f}$$
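
The notes give no worked example for the mean deviation, so the sketch below applies both formulas to data that appear elsewhere in these notes: the ten values from the mean exercise (ungrouped) and the heights table (grouped, using the tabulated midpoints). It is an illustration only.

```python
# Ungrouped data: the ten values from the mean exercise.
data = [60, 72, 61, 66, 63, 66, 59, 64, 71, 68]
mean = sum(data) / len(data)
# Average absolute distance of each value from the mean.
md_ungrouped = sum(abs(x - mean) for x in data) / len(data)

# Grouped data: midpoints and frequencies from the heights table.
midpoints = [150, 160, 170, 180, 190]
frequencies = [3, 9, 21, 13, 4]
sum_f = sum(frequencies)
grouped_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / sum_f
# Each absolute deviation is weighted by its class frequency.
md_grouped = sum(f * abs(x - grouped_mean)
                 for f, x in zip(frequencies, midpoints)) / sum_f

print(md_ungrouped)          # 3.6
print(round(md_grouped, 3))  # 7.584
```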

Standard deviation

The standard deviation is a measure of the degree to which individual observations in a
dataset deviate from the mean value. Broadly, it is the average deviation from the mean
across all observations. It is calculated by squaring the difference of each individual
observation from the mean (squared to remove any negative differences), adding them
together, dividing by the total number of observations, and taking the square root of the
result.

The standard deviation summarizes a great deal of information in one number and, like the
mean, has useful mathematical properties.
- It uses information from every observation.

- It is not robust to outliers.

Algebraically, the standard deviation for a set of n values {x1, x2, ..., xn} is written as follows:

$$SD = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}$$ for ungrouped data,

where $\bar{x}$ is the mean described above.

Example

Calculate the standard deviation for the data below

60, 72, 61, 66, 63, 66, 59, 64, 71, 68.
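
A sketch of the calculation for these ten values is given below. It follows the steps described in the notes (square each deviation from the mean, add, divide by n, take the square root), so it divides by n rather than n - 1.

```python
import math

data = [60, 72, 61, 66, 63, 66, 59, 64, 71, 68]

n = len(data)
mean = sum(data) / n                               # 650 / 10 = 65
squared_deviations = [(x - mean) ** 2 for x in data]
variance = sum(squared_deviations) / n             # 178 / 10 = 17.8
sd = math.sqrt(variance)

print(mean, variance, round(sd, 3))
```

The same value is returned by statistics.pstdev(data) in the Python standard library.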

-for grouped data

$$SD = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2}$$

Example

The heights of boys in a class are measured to the nearest cm and the results are tabulated as
follows. Calculate the standard deviation for the data.

Height (cm)   Frequency (f)   Midpoint (x)   fx           x²f
145-154.9      3              150             450          67500
155-164.9      9              160            1440         230400
165-174.9     21              170            3570         606900
175-184.9     13              180            2340         421200
185-194.9      4              190             760         144400
Total         ∑f = 50                        ∑fx = 8560   ∑x²f = 1470400
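
The remaining arithmetic can be sketched as follows, using the grouped-data formula (sum of fx² over sum of f, minus the square of the mean) with the midpoints and frequencies from the table. The variable names are illustrative.

```python
import math

midpoints = [150, 160, 170, 180, 190]
frequencies = [3, 9, 21, 13, 4]

sum_f = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, midpoints))

mean = sum_fx / sum_f                    # 8560 / 50 = 171.2
variance = sum_fx2 / sum_f - mean ** 2   # 29408 - 29309.44 = 98.56
sd = math.sqrt(variance)

print(sum_fx2, round(variance, 2), round(sd, 2))
```

The standard deviation of the heights is therefore a little under 10 cm.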
Variance

Another measure of variability that may be encountered is the variance. This is simply the
square of the standard deviation:

Variance = S²

-for ungrouped data

$$\text{var} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$

-for grouped data

$$\text{var} = \frac{\sum fx^2}{\sum f} - \bar{x}^2$$

Variance is easy to use mathematically

The variance is not generally used in data description but is central to analysis of variance .

Normal distribution
- Symmetrical "bell-shaped" distribution
- Easiest to use mathematically
- Many variables are normally distributed
- Can be described by two numbers:
  - Mean (measure of location)
  - Standard deviation (measure of variation)
