HSO 4104 Basic Social Statistics
INTRODUCTION
Purpose
To introduce the student to the world of statistics and to acquaint them with the role of
statistics in Business.
Objectives
This definition clearly points out four stages in a statistical investigation, namely:
Definition:
Social statistics is the science of good decision making in the face of uncertainty and
is used in many disciplines such as sociology, psychology, financial analysis,
econometrics, auditing, production and operations (including services improvement), and
marketing research.
1. To present the data in a concise and definite form: Statistics helps in classifying
and tabulating raw data for processing and further tabulation for end users.
2. To make it easy to understand complex and large data: This is done by presenting
the data in the form of tables, graphs, diagrams etc., or by condensing the data
with the help of means, dispersion etc.
3. For comparison: Tables, measures of means and dispersion can help in
comparing different sets of data.
4. In forming policies: It helps in forming policies like a production schedule, based
on the relevant sales figures. It is used in forecasting future demands.
5. Enlarging individual experiences: Complex problems can be well understood by
statistics, as the conclusions drawn by an individual are more definite and precise
than mere statements on facts.
6. In measuring the magnitude of a phenomenon: Statistics has made it possible to
count the population of a country, the industrial growth, the agricultural growth,
the educational level (of course in numbers).
1. Statistics does not deal with individual measurements. Since statistics deals with
aggregates of facts, it cannot be used to study the changes that have taken place
in individual cases. For example, the wages earned by a single industry worker at
any time, taken by itself is not a statistical datum. But the wages of workers of
that industry can be used statistically.
2. Statistics cannot be used to study qualitative phenomena like morality,
intelligence, beauty etc., as these cannot be quantified.
3. Statistical results are true only on an average: The conclusions obtained
statistically are not universal truths. They are true only under certain conditions.
This is because statistics as a science is less exact as compared to the natural
sciences.
4. Statistical data, being approximations, are mathematically incorrect. Therefore,
they can be used only if mathematical accuracy is not needed.
5. Statistics, being dependent on figures, can be manipulated and therefore can be
used only when the authenticity of the figures has been proved beyond doubt.
A Paris banker said, "Statistics is like a miniskirt, it covers up essentials but gives you
the ideas."
The term "distrust of statistics" means lack of confidence in statistical statements and
methods.
In most research conducted on groups of people, you will use both descriptive and
inferential statistics to analyze your results and draw conclusions.
Descriptive statistics is the term given to the analysis of data that helps describe, show
or summarize data in a meaningful way such that, for example, patterns might emerge
from the data. Descriptive statistics do not, however, allow us to make conclusions
beyond the data we have analyzed or reach conclusions regarding any hypotheses we
might have made. They are simply a way to describe our data.
Descriptive statistics allow us to present data in a more meaningful way which allows
simpler interpretation of the data. For example, if we had the results of 100 pieces of
students' coursework, we may be interested in the overall performance of those
students. We would also be interested in the distribution or spread of the marks.
Descriptive statistics allow us to do this. There are two general types of statistic that are
used to describe data:
• Measures of central tendency: these are ways of describing the central position of
a frequency distribution for a group of data.
• A frequency distribution is a table used to describe a data set. It lists intervals or
ranges of data values called data classes together with the number of data values
from the set that are in each class.
• The three common measures of central tendency are the:
• Mean
• Median
• Mode
• Measures of spread or variation: these are ways of summarizing a group of data
by describing how spread out the scores are.
• Spread or variation in a data set is the amount of difference between data values.
• The common measures of spread are:
• Range
• Quartiles
• Absolute deviation
• Variance
• Standard deviation
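The measures listed above can be computed with Python's standard `statistics` module. The sketch below is only an illustration: the ten marks are hypothetical and are not taken from these notes, and the variance and standard deviation are computed in their population form.

```python
import statistics

# Hypothetical coursework marks for ten students (illustration only).
marks = [45, 52, 58, 58, 63, 67, 70, 74, 81, 92]

# Measures of central tendency
mean_mark = statistics.mean(marks)
median_mark = statistics.median(marks)
mode_mark = statistics.mode(marks)

# Measures of spread
mark_range = max(marks) - min(marks)
variance = statistics.pvariance(marks)  # population variance
std_dev = statistics.pstdev(marks)      # population standard deviation

print("mean:", mean_mark, "median:", median_mark, "mode:", mode_mark)
print("range:", mark_range, "variance:", round(variance, 2),
      "standard deviation:", round(std_dev, 2))
```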
Inferential statistics aim to make inferences from data in order to reach conclusions that
go beyond the data.
Inferential statistics are used to make inferences about a population from a sample in
order to make assumptions about the wider population and/or make predictions about
the future.
For example, a Board of Examiners may want to compare the performance of 1000
students that completed an examination. Of these, 500 students are girls and 500
students are boys. The 1000 students represent our "population". Whilst we are
interested in the performance of all 1000 students, girls and boys, it may be impractical
to examine the marks of all of these students because of the time and cost required to
collate all of their marks. Instead, we can choose to examine a "sample" of these
students and then use the results to make generalizations about the performance of all
1000 students. For the purpose of our example, we may choose a sample size of 200
students. Since we are looking to compare boys and girls, we may randomly select 100
girls and 100 boys in our sample. We could then use this, for example, to see if there are
any statistically significant differences in the mean mark between boys and girls, even
though we have not measured all 1000 students.
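The sampling step in this example can be sketched in Python. The marks below are randomly generated stand-ins (an assumption, since the notes give no actual marks), and the sketch only draws the two samples and compares their means; it does not carry out a significance test.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: marks for 500 girls and 500 boys.
girls = [random.randint(30, 100) for _ in range(500)]
boys = [random.randint(30, 100) for _ in range(500)]

# Randomly select 100 students from each group, without replacement.
girls_sample = random.sample(girls, 100)
boys_sample = random.sample(boys, 100)

# The sample means serve as estimates of the two population means.
difference = statistics.mean(girls_sample) - statistics.mean(boys_sample)
print("sample size:", len(girls_sample) + len(boys_sample))
print("difference in sample means:", round(difference, 2))
```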
Core text
S.P. Gupta (2004): Introduction to Statistical Methods, 23rd ed., Vikas Publishing House,
New Delhi.
2. Further reading
• Saleemi N.A. (1997), Statistics Simplified, reprinted January 2011: Nairobi, Saleemi
Publication Limited.
LESSON TWO
Population
Refers to the complete set of observations of a given characteristic of interest
(the universe).
Sample
list Census
This is a study where all the elements in the sampling frame are included in the
survey
Parameter
Statistic
Statistics
1.6 variable
e.g Height
Measurements with ordinal scales are ordered in the sense that higher numbers
represent higher values. However, the intervals between the numbers are not
necessarily equal. For example, on a five-point rating scale measuring attitudes
toward gun control, the difference between a rating of 2 and a rating of 3 may not
represent the same difference as the difference between a rating of 4 and a rating
of 5. There is no "true" zero point for ordinal scales since the zero point is chosen
arbitrarily. The lowest point on the rating scale in the example was arbitrarily
chosen to be 1. It could just as well have been 0 or -5.
1.7.3 Interval Scale
On interval measurement scales, one unit on the scale represents the same
magnitude on the trait or characteristic being measured across the whole range of
the scale. For example, if anxiety were measured on an interval scale, then a
difference between a score of 10 and a score of 11 would represent the same
difference in anxiety as would a difference between a score of 50 and a score of
51. Interval scales do not have a "true" zero point, however, and therefore it is not
possible to make statements about how many times higher one score is than
another. A good example of an interval scale is the Fahrenheit scale for
temperature. Equal differences on this scale represent equal differences in
temperature, but a temperature of 30 degrees is not twice as warm as one of 15
degrees.
CHAPTER 3
COLLECTION OF DATA
For any statistical enquiry, the basic objective is to collect facts and figures
relating to a particular phenomenon for further statistical analysis. The process of
counting, enumeration or measurement, together with systematic recording of
results, is called collection of statistical data.
Data types
Primary data is data that you collect yourself using such methods as:
• questionnaires
• interviews
• observation
• case-studies
• diaries
• critical incidents
The primary data, which is generated by the above methods, may be qualitative in
nature (usually in the form of words) or quantitative (usually in the form of
numbers, or where you can make counts of words used).
2.2.1 Questionnaires
Questionnaires are a popular means of collecting data, but are difficult to design
and often require many rewrites before an acceptable questionnaire is produced.
Advantages:
• Relatively cheap.
• No interviewer bias.
Disadvantages:
• Design problems.
• Questions have to be relatively simple.
2.2.2 Interviews
Personal interview
Advantages:
Disadvantages:
• Time consuming.
• Geographic limitations.
• Can be expensive.
2.2.3 Case-studies
It is historical.
It can enable the researcher to explore, unravel and understand problems, issues
and relationships.
It does not allow the researcher to argue that from one case-study the results,
findings or theory developed apply to other similar case-studies. The case looked
at may be unique and, therefore, not representative of other instances.
2.2.4 Diaries
A diary is a way of gathering information about the way individuals spend their
time on professional activities.
Advantages:
Disadvantages:
• magazines, newspapers
• reviews
• research articles
Primary data is expensive and difficult to acquire, but it's trustworthy. Secondary
data is cheap and easy to collect, but must be treated with caution.
Data Classification
The process of grouping raw data into different classes or sub classes according to
some characteristics.
The collected data, also known as raw data or ungrouped data, are always in an
unorganized form and need to be organized and presented in a meaningful and
readily comprehensible form in order to facilitate further statistical analysis. It is,
therefore, essential for an investigator to condense a mass of data into a more and
more comprehensible form.
Classification is the first step in tabulation.
Objectives of Classification
How to prepare
Count the number of times a particular value is repeated; this is the frequency of that
class.
In order to facilitate counting, prepare a column of tallies.
In another column, place all possible values of the variable from the lowest to the
highest.
Then put a bar (vertical line) opposite the particular value to which it relates.
To facilitate counting, blocks of five bars are prepared and some space is left
between each block.
Count the number of bars to get the frequency.
Example 1:
In a survey of 40 families in a village, the number of children per family was recorded
and the following data obtained.
1 ,0, 3, 2, 1, 5, 6, 2,
2,1,0,3,4,2,1,6
3,2,1,5,3,3,2,4
2,2,3,0,2,1,4,5
3,3,4,4,1,2,4,5
Represent the data in the form of a discrete frequency distribution.
Solution:
Frequency distribution of the number of children
Number of children    Tally marks      Frequency
0                     |||              3
1                     ||||| ||         7
2                     ||||| |||||      10
3                     ||||| |||        8
4                     ||||| |          6
5                     ||||             4
6                     ||               2
Total                                  40
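The tallying in Example 1 can be checked with a short Python sketch. It uses `collections.Counter` from the standard library to count how often each value occurs; the variable names are illustrative only.

```python
from collections import Counter

# Number of children per family for the 40 families in Example 1.
children = [
    1, 0, 3, 2, 1, 5, 6, 2,
    2, 1, 0, 3, 4, 2, 1, 6,
    3, 2, 1, 5, 3, 3, 2, 4,
    2, 2, 3, 0, 2, 1, 4, 5,
    3, 3, 4, 4, 1, 2, 4, 5,
]

# Count how many times each value is repeated (its frequency).
frequency = Counter(children)

# Print one row per value: the value, its tally bars and its frequency.
for value in sorted(frequency):
    print(value, "|" * frequency[value], frequency[value])
print("total", sum(frequency.values()))
```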
Basic Terms
a) Class limits:
The class limits are the lowest and the highest values that can be included in the class.
For example, take the class 30-40. The lowest value of the class is 30 and the highest
value is 40. The two boundaries of a class are known as the lower limit and the upper
limit of the class.
The lower limit of a class is the value below which there can be no item in the class.
The upper limit of a class is the value above which there can be no item in that class. Of
the class 60-79, 60 is the lower limit and 79 is the upper limit, i.e. in this case there can
be no value which is less than 60 or more than 79. The way in which class limits are stated
depends upon the nature of the data. In statistical calculations, the lower class limit is
denoted by L and the upper class limit by U.
b) Class Interval:
The class interval may be defined as the size of each
grouping of data. For example, 50-75, 75-100, 100-125… are class
intervals. Each grouping begins with the lower limit of a class interval and ends at the
lower limit of the next succeeding class interval
c) Width or size of the class interval:
The difference between the lower and upper class limits is called the width or size of the
class interval and is denoted by 'C'.
d) Range:
The difference between the largest and smallest values of the observations is called the
range and is denoted by 'R', i.e.
R = Largest value - Smallest value
R = L - S
e) Mid-value or mid-point:
The central point of a class interval is called the mid-value or mid-point. It is found
by adding the upper and lower limits of a class and dividing the sum by 2, i.e.
Mid-value = (L + U)/2
For example, if the class interval is 20-30 then the mid-value
is (20 + 30)/2 = 25.
f) Frequency:
The number of observations falling within a particular class interval is called the
frequency of that class.
The total frequency indicates the total number of observations considered in a frequency
distribution.
g) Number of class intervals:
The number of class intervals should not be too many.
To decide the number of class intervals for the frequency distribution, find the lowest
and the highest of the values in the whole data. The difference between them (the range)
will enable us to decide the class intervals (use intuition). The size of the class
interval can be estimated as
C = Range / (1 + 3.322 log N)
where Range = Largest value - Smallest value in the distribution, N = total number of
observations and log is the logarithm to base 10.
a) Exclusive method:
In the exclusive method, the class intervals are so fixed that the upper limit of one class
is the lower limit of the next class.
This method ensures continuity of data.
It is widely used in practice.
Example
The first step is to divide the observed range of the variable into a suitable number of
class intervals and to record the number of observations in each class.
42 62 46 54 41 37 54 44 32 45
47 50 58 49 51 42 46 37 42 39
54 39 51 58 47 64 43 48 49 48
49 61 41 40 58 49 59 57 57 34
56 38 45 52 46 40 63 41 51 41
Size of class interval = Range / (1 + 3.322 log N) = 32/6.64 = 4.8, approximately five.
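The class-width calculation can be reproduced with a minimal Python sketch, assuming the logarithm is taken to base 10 (this agrees with the figure 6.64 above). The function name `class_width` is illustrative only.

```python
import math

def class_width(values):
    """Estimate the class width as Range / (1 + 3.322 * log10(N))."""
    data_range = max(values) - min(values)
    return data_range / (1 + 3.322 * math.log10(len(values)))

# The 50 observations from the example above.
observations = [
    42, 62, 46, 54, 41, 37, 54, 44, 32, 45,
    47, 50, 58, 49, 51, 42, 46, 37, 42, 39,
    54, 39, 51, 58, 47, 64, 43, 48, 49, 48,
    49, 61, 41, 40, 58, 49, 59, 57, 57, 34,
    56, 38, 45, 52, 46, 40, 63, 41, 51, 41,
]

width = class_width(observations)
print(round(width, 1), "rounded up to", math.ceil(width))
```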
The required frequency distribution is prepared using tally marks as given below:
Class Interval Tally marks Frequency
30-35 2
35-40 6
40-45 12
45-50 14
50-55 6
55-60 6
60-65 4
Total 50
2. Percentage frequency table
It is also called a relative frequency table.
The percentage frequency distribution facilitates easy comparison, especially when the
total number of items is large and differs greatly from one distribution to another. In a
percentage frequency table, actual frequencies are converted into percentages. The
percentages are calculated by using the formula given below:
Frequency percentage = (Actual frequency / Total frequency) × 100
An example is given below to construct a percentage frequency table.
Marks No. of students Frequency percentage
0-10 3 6
10-20 8 16
20-30 12 24
30-40 17 34
40-50 6 12
50-60 4 8
Total 50 100
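As a further illustration of the percentage formula, the sketch below applies it to the frequencies from Example 1 (number of children per family, total 40). The function name is an assumption made for this sketch.

```python
def percentage_frequencies(frequencies):
    """Convert actual frequencies into percentages of the total frequency."""
    total = sum(frequencies)
    return [100 * f / total for f in frequencies]

# Frequencies for 0, 1, ..., 6 children per family (Example 1).
children_freq = [3, 7, 10, 8, 6, 4, 2]
percentages = percentage_frequencies(children_freq)

for value, pct in enumerate(percentages):
    print(value, pct)
print("total", sum(percentages))
```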
3. Cumulative frequency table
Cumulative frequency distribution has a running total of the values. It is constructed by
adding the frequency of the first class interval to the frequency of the second class
interval, then adding that total to the frequency in the third class interval, and
continuing until the final total, appearing opposite the last class interval, is the total
of all frequencies. The cumulative frequency may be downward or upward.
Example
Age (yrs) No. men Less than c.f More than c.f
15-20 3 3 64
20-25 7 10 61
25-30 15 25 54
30-35 21 46 39
35-40 12 58 18
40-45 6 64 6
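Both cumulative columns in the table above are running totals, one taken from the top and one from the bottom. A minimal sketch using `itertools.accumulate` from the Python standard library:

```python
from itertools import accumulate

# Number of men in each age class, 15-20 up to 40-45.
frequencies = [3, 7, 15, 21, 12, 6]

# "Less than" c.f.: running total from the first class downwards.
less_than_cf = list(accumulate(frequencies))

# "More than" c.f.: running total from the last class upwards.
more_than_cf = list(accumulate(reversed(frequencies)))[::-1]

print(less_than_cf)
print(more_than_cf)
```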
4. Histogram:
A histogram is a bar chart or graph showing the frequency of occurrence of each value of
the variable being analysed. In a histogram, data are plotted as a series of rectangles.
Class intervals are shown on the 'X-axis' and the frequencies on the 'Y-axis'. The height
of each rectangle represents the frequency of the class interval. Each rectangle adjoins
the next so as to give a continuous picture.
Example
For the following data, draw a histogram.
Marks Number of Students
21-30 6
31-40 15
41-50 22
51-60 31
61-70 17
71-80 9
Solution:
For drawing a histogram, the frequency distribution should be continuous. If it is not
continuous, then first make it continuous as follows.
Marks Number of Students
20.5-30.5 6
30.5-40.5 15
40.5-50.5 22
50.5-60.5 31
60.5-70.5 17
70.5-80.5 9
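The conversion from inclusive class limits to continuous class boundaries can be sketched as follows. The sketch assumes that the gap between consecutive classes is the same throughout (here 1 mark), so that half of the gap is subtracted from each lower limit and added to each upper limit. The function name is illustrative.

```python
def to_class_boundaries(classes):
    """Turn inclusive classes such as 21-30, 31-40 into 20.5-30.5, 30.5-40.5."""
    # Gap between the upper limit of one class and the lower limit of the next.
    gap = classes[1][0] - classes[0][1]
    half = gap / 2
    return [(lower - half, upper + half) for lower, upper in classes]

marks_classes = [(21, 30), (31, 40), (41, 50), (51, 60), (61, 70), (71, 80)]
boundaries = to_class_boundaries(marks_classes)

for lower, upper in boundaries:
    print(lower, "-", upper)
```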
4. Frequency Polygon
If we mark the midpoints of the top horizontal sides of the rectangles in a histogram and
join them by straight lines, the figure so formed is called a frequency polygon. This is
done under the assumption that the frequencies in a class interval are evenly distributed
throughout the class.
5. Frequency Curve
If the middle points of the upper boundaries of the rectangles of a histogram are connected
by a smooth freehand curve, then that diagram is called a frequency curve. The curve
should begin and end at the base line.
Example
Draw a frequency curve for the following data.
Monthly wages (sh.) No. of families
0-1000 21
1000-2000 35
2000-3000 56
3000-4000 74
4000-5000 63
5000-6000 40
6000-7000 29
7000-8000 14
(see paper)
6. Ogive
This curve is obtained by plotting cumulative frequencies.
There are two methods of constructing an ogive, namely:
1. The 'less than ogive' method
2. The 'more than ogive' method
In less than ogive method we start with the upper limits of the classes and go adding the
frequencies. When these frequencies are plotted, we get a rising curve. In more than
ogive method, we start with the lower limits of the classes and from the total frequencies
we subtract the frequency of each class. When these frequencies are plotted we get a
declining curve.
Example 15:
Draw the Ogives for the following data.
Class interval Frequency cf
20-30 4 4
30-40 6 10
40-50 13 23
50-60 25 48
60-70 32 80
70-80 19 99
80-90 8 107
90-100 3 110
Solution:
Class limit Less than ogive More than ogive
20 0 110
30 4 106
40 10 100
50 23 87
60 48 62
70 80 30
80 99 11
90 107 3
100 110 0
E.g. mean, median and mode (simple averages)
x̄ = Σfx / N
where x = the mid-point of individual class
f = the frequency of individual class
N = the sum of the frequencies or total frequencies.
Example:
Following is the distribution of persons according to different income groups. Calculate
the arithmetic mean. (see HR paper)
Income (sh. '000)   0-10   10-20   20-30   30-40   40-50   50-60   60-70
No. of persons      6      8       10      12      7       4       3
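The arithmetic mean of this grouped distribution can be worked out by taking the mid-point of each class as x and applying Σfx / N. The sketch below is one way to do this in Python; the function name is illustrative.

```python
def grouped_mean(classes, frequencies):
    """Arithmetic mean of grouped data: sum of f * mid-point, divided by N."""
    midpoints = [(lower + upper) / 2 for lower, upper in classes]
    total = sum(frequencies)
    return sum(f * x for f, x in zip(frequencies, midpoints)) / total

# Income classes in thousands of shillings and the number of persons in each.
income_classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
persons = [6, 8, 10, 12, 7, 4, 3]

mean_income = grouped_mean(income_classes, persons)
print(mean_income)  # in thousands of shillings
```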
(N + 1)/2
Step4: Then the corresponding value of x is the median.
Example :
The following data pertain to the number of members in a family. Find the median size
of the family.
Number of members (x)   1   2   3   4   5    6    7   8   9   10   11   12
Frequency (f)           1   3   5   6   10   13   9   5   3   2    2    1
Solution:
x       f       cf
1       1       1
2       3       4
3       5       9
4       6       15
5       10      25
6       13      38
7       9       47
8       5       52
9       3       55
10      2       57
11      2       59
12      1       60
Total   60
(N + 1)/2 = (60 + 1)/2 = 30.5
The cumulative frequency just greater than 30.5 is 38, and the value of x corresponding
to 38 is 6. Hence the median size is 6 members per family.
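The lookup described in this solution can be sketched in Python: build the cumulative frequencies and return the first x whose cumulative frequency reaches the (N + 1)/2 th item. The function name is an assumption made for illustration.

```python
from itertools import accumulate

def discrete_median(values, frequencies):
    """Median of a discrete series: the x holding the (N + 1)/2 th item."""
    position = (sum(frequencies) + 1) / 2
    for x, cf in zip(values, accumulate(frequencies)):
        # The first cumulative frequency that reaches the position
        # identifies the value in which that item falls.
        if cf >= position:
            return x

members = list(range(1, 13))
families = [1, 3, 5, 6, 10, 13, 9, 5, 3, 2, 2, 1]

median_size = discrete_median(members, families)
print(median_size)
```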
Continuous Series:
The steps given below are followed for the calculation of median in continuous series.
Step1: Find cumulative frequencies.
Step2: Find N/2.
Step3: See in the cumulative frequency the value first greater than N/2. Then the
corresponding class interval is called the median class. Then apply the formula
Median = l + ((N/2 - m) / f) × c
where l = lower limit of the median class, m = cumulative frequency preceding the median
class, c = width of the median class, f = frequency in the median class and N = total
frequency. If the class intervals are given in inclusive type, convert them into exclusive
type (the true class intervals) and use the lower limit of the true median class.
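As an illustration of these steps, the sketch below applies the formula to the age distribution used earlier in the cumulative frequency table (classes 15-20 to 40-45, N = 64). The function name is illustrative, and the classes are assumed to be of the exclusive type.

```python
from itertools import accumulate

def grouped_median(classes, frequencies):
    """Median = l + ((N/2 - m) / f) * c for a continuous series."""
    half = sum(frequencies) / 2
    cumulative = list(accumulate(frequencies))
    for i, cf in enumerate(cumulative):
        if cf > half:  # first cumulative frequency greater than N/2
            lower, upper = classes[i]
            m = cumulative[i - 1] if i > 0 else 0  # c.f. preceding the median class
            return lower + (half - m) / frequencies[i] * (upper - lower)

age_classes = [(15, 20), (20, 25), (25, 30), (30, 35), (35, 40), (40, 45)]
men = [3, 7, 15, 21, 12, 6]

median_age = grouped_median(age_classes, men)
print(round(median_age, 2))
```

Here N/2 = 32, the median class is 30-35, m = 25, f = 21 and c = 5.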
Example 7:
Determine the median of the data in the table below using Formula method
IQ No of residents
0–20 6
20–40 18
40–60 32
60–80 48
80 – 100 27
100 – 120 13
120 – 140 2
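Solution:
N = 146, so N/2 = 73. The cumulative frequency first greater than 73 is 104, so the median
class is 60-80, with l = 60, m = 56, f = 48 and c = 20.
Median = 60 + ((73 - 56) / 48) × 20
= 60 + 7.08
= 67.08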
Merits of Median :
1. Median is not influenced by extreme values because it is a positional average.
2. Median can be calculated in case of distribution with open end intervals.
3. Median can be located even if the data are incomplete.
4. Median can be located even for qualitative factors such as ability, honesty etc.
Demerits of Median:
1. A slight change in the series may bring drastic change in median value.
2. In case of even number of items or continuous series, median is an estimated value
other than any value in the series.
3. It is not suitable for further mathematical treatment except its use in mean deviation.
4. It does not take into account all the observations.
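Quartiles:
The quartiles divide the distribution into four parts. There are three quartiles. The second
quartile divides the distribution into two halves and therefore is the same as the median.
The first (lower) quartile (Q1) marks off the first one-fourth; the third (upper) quartile
(Q3) marks off the first three-fourths. First arrange the given data in increasing order and
use the formulae
Q1 = value of the ((N + 1)/4)th item
Q3 = value of the (3(N + 1)/4)th item
In a discrete series, find the cumulative frequency just greater than (N + 1)/4; the
corresponding value of x is Q1. Likewise, find the cumulative frequency just greater than
3(N + 1)/4; the corresponding value of x is Q3.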
Example 23:
Compute quartiles for the data given below.
x f c.f
5 4 4
8 3 7
12 2 9
15 4 13
19 5 18
24 2 20
30 4 24
Total 24
Solution
Q1 = (24 + 1)/4 = 6.25th item = 8
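Both quartiles for this example can be located with a small Python sketch, assuming the positions (N + 1)/4 and 3(N + 1)/4 and taking the first value whose cumulative frequency reaches the position. The helper name `item_value` is hypothetical.

```python
from itertools import accumulate

def item_value(values, frequencies, position):
    """Return the x whose cumulative frequency first reaches the given position."""
    for x, cf in zip(values, accumulate(frequencies)):
        if cf >= position:
            return x

x = [5, 8, 12, 15, 19, 24, 30]
f = [4, 3, 2, 4, 5, 2, 4]
n = sum(f)

q1 = item_value(x, f, (n + 1) / 4)      # 6.25th item
q3 = item_value(x, f, 3 * (n + 1) / 4)  # 18.75th item
print("Q1 =", q1, "Q3 =", q3)
```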
Continuous series
Step1: Find cumulative frequencies
Step2: Find N/4
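Step3: See in the cumulative frequencies the value just greater than N/4; the
corresponding class interval is called the first quartile class. Find 3N/4. See in the
cumulative frequencies the value just greater than 3N/4; the corresponding class interval
is called the third quartile class. Then apply the respective formulae
Q1 = l1 + ((N/4 - m1) / f1) × c1
Q3 = l3 + ((3N/4 - m3) / f3) × c3
where, for the respective quartile class, l = lower limit, m = cumulative frequency
preceding the class, f = frequency of the class and c = width of the class.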
Example
C.I. f cf
0-10 11 11
10-20 18 29
20-30 25 54
30-40 28 82
40-50 30 112
50-60 33 145
60-70 22 167
70-80 15 182
80-90 12 194
90-100 10 204
204
N/4=204/4=51, 3(204/4) = 153
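The quartile classes and quartile values for this example can be worked through with the sketch below, which applies the continuous-series steps for k = 1 and k = 3. The function name and the parameter `k` are illustrative assumptions.

```python
from itertools import accumulate

def grouped_quartile(classes, frequencies, k):
    """k-th quartile of a continuous series: l + ((kN/4 - m) / f) * c."""
    target = k * sum(frequencies) / 4
    cumulative = list(accumulate(frequencies))
    for i, cf in enumerate(cumulative):
        if cf > target:  # first cumulative frequency greater than kN/4
            lower, upper = classes[i]
            m = cumulative[i - 1] if i > 0 else 0  # c.f. preceding the quartile class
            return lower + (target - m) / frequencies[i] * (upper - lower)

intervals = [(lo, lo + 10) for lo in range(0, 100, 10)]
freq = [11, 18, 25, 28, 30, 33, 22, 15, 12, 10]

q1 = grouped_quartile(intervals, freq, 1)
q3 = grouped_quartile(intervals, freq, 3)
print(round(q1, 2), round(q3, 2))
```

With N/4 = 51 the first quartile class is 20-30 (m = 29, f = 25), and with 3N/4 = 153 the third quartile class is 60-70 (m = 145, f = 22).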
MODE
The mode refers to that value in a distribution which occurs most frequently. It is an
actual value, which has the highest concentration of items in and around it. It shows the
centre of concentration of the frequency in and around a given value. Therefore, where the
purpose is to know the point of the highest concentration, it is preferred. It is, thus, a
positional measure.
Computation of the mode:
Ungrouped or Raw Data:
For ungrouped data or a series of individual observations, mode is often found by mere
inspection.
Example
2, 7, 10, 15, 10, 17, 8, 10, 2 ∴ Mode = M0 = 10
In some cases the mode may be absent while in some cases there may be more than one
mode.
Example
12, 10, 15, 24, 30 (no mode)
7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10 ∴ the modes are 7 and 10
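Finding the mode by inspection amounts to counting how often each value occurs. A minimal Python sketch follows; it adopts the convention used above that a series in which no value repeats has no mode. The function name `modes` is illustrative.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:  # every value occurs only once
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 7, 10, 15, 10, 17, 8, 10, 2]))
print(modes([12, 10, 15, 24, 30]))
print(modes([7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10]))
```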
Grouped Data
For Discrete distribution, see the highest frequency and corresponding value of X is
mode.
Continuous distribution
See the highest frequency; the corresponding class interval is called the modal class.
Then apply the formula
M0 = l + (∆1 / (∆1 + ∆2)) × c
where l = lower limit of the modal class, ∆1 = f1 - f0, ∆2 = f1 - f2, f1 = frequency of the
modal class, f0 = frequency of the class preceding the modal class, f2 = frequency of the
class succeeding the modal class and c = width of the modal class, or simply
M0 = l + ((f1 - f0) / (2f1 - f0 - f2)) × c
Example
Calculate mode for the following :
C- I f
0-50 5
50-100 14
100-150 40
150-200 91
200-250 150
250-300 87
300-350 60
350-400 38
400 and above 15
Solution
The highest frequency is 150 and corresponding class interval is
200 – 250, which is the modal class.
Here l=200,f1=150,f0=91, f2=87, C=50
Mode=224.18
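The arithmetic behind this result can be checked with a short Python sketch of the modal-class formula; the function and parameter names are illustrative.

```python
def grouped_mode(lower, width, f1, f0, f2):
    """Mode = l + (d1 / (d1 + d2)) * c, with d1 = f1 - f0 and d2 = f1 - f2."""
    delta1 = f1 - f0
    delta2 = f1 - f2
    return lower + delta1 / (delta1 + delta2) * width

# Modal class 200-250: l = 200, c = 50, f1 = 150, f0 = 91, f2 = 87.
mode_value = grouped_mode(lower=200, width=50, f1=150, f0=91, f2=87)
print(round(mode_value, 2))
```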
Merits of Mode:
1. It is easy to calculate and in some cases it can be located by mere inspection.
2. Mode is not at all affected by extreme values.
3. It can be calculated for open-end classes.
4. It is usually an actual value of an important part of the series.
5. In some circumstances it is the best representative of data.
Demerits of mode:
1. It is not based on all observations.
2. It is not capable of further mathematical treatment.
3. Mode is often ill-defined; it is not possible to find the mode in some cases.
4. As compared with mean, mode is affected to a great extent, by sampling fluctuations.
5. It is unsuitable in cases where relative importance of items has to be considered.
MEASURES OF DISPERSION
The measures of dispersion are very useful in statistical work because they indicate
whether the rest of the data are scattered around the mean or away from the mean.
If the data are closely dispersed around the mean, then the measure of dispersion obtained
will be small, indicating that the mean is a good representative of the sample data. On
the other hand, if the figures are not closely located to the mean, then the measures of
dispersion obtained will be relatively big, indicating that the mean does not represent the
data sufficiently.
vi. Variance
6.2 RANGE
The range is defined as the difference between the highest and the smallest values in a
frequency distribution. This measure is not very efficient because it utilizes only 2
values in a given frequency distribution. However, the smaller the value of the range, the
less dispersed the data.
Example 1:
The following are the prices of 4 kgs of beans in Mathare slums market
Required:
Solution
Range = L - S
= 250 - 160 = 90
Coefficient of range = (L - S)/(L + S)
= (250 - 160)/(250 + 160) = 90/410
= 0.22
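The two calculations in this solution can be sketched in Python. Only the largest and smallest prices (250 and 160) are known from the solution, so the list passed in below contains just those two values; the function name is illustrative.

```python
def range_and_coefficient(values):
    """Return the range L - S and the coefficient of range (L - S)/(L + S)."""
    largest, smallest = max(values), min(values)
    data_range = largest - smallest
    return data_range, data_range / (largest + smallest)

# Only the extreme prices are available from the worked solution.
price_range, coefficient = range_and_coefficient([250, 160])
print(price_range, round(coefficient, 2))
```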
INTERQUARTILE RANGE
This is a measure of dispersion which involves the use of quartiles. A quartile is a mark
or a value which lies at the boundary of a division when any given set of data is divided
into four equal divisions.
The semi interquartile range is a good measure of dispersion because it shows how the
rest of the
i. The lower quartile (first quartile, Q1): this usually bounds the lower 25% of the data.
SIR = (Q3 - Q1)/2
Example 2:
16.2, 17, 20, 25(Q1) 29, 32.2, 35.8, 36.8(Q2) 40, 41, 42, 44(Q3) 49, 52, 55 (in kgs)
Required
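The quartiles marked in Example 2 and the resulting semi interquartile range can be checked with Python's `statistics.quantiles`, whose default "exclusive" method places the quartiles at the (n + 1)/4, 2(n + 1)/4 and 3(n + 1)/4 positions. This is a sketch under that assumption.

```python
import statistics

# The 15 weights (in kgs) from Example 2, already in increasing order.
weights = [16.2, 17, 20, 25, 29, 32.2, 35.8, 36.8, 40, 41, 42, 44, 49, 52, 55]

# With n = 15 the quartiles fall exactly on the 4th, 8th and 12th items.
q1, q2, q3 = statistics.quantiles(weights, n=4)

interquartile_range = q3 - q1
semi_interquartile_range = interquartile_range / 2
print(q1, q2, q3, interquartile_range, semi_interquartile_range)
```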
Mean Deviation:
The range and quartile deviation are not based on all observations. They are positional
measures of dispersion. They do not show any scatter of the observations from an
average. The mean deviation is a measure of dispersion based on all items in a
distribution.
Definition:
Mean deviation is the arithmetic mean of the deviations of a series computed from any
measure of central tendency, i.e. the mean, median or mode. All the deviations are
taken as positive, i.e. signs are ignored. According to Clark and
Schekade, "Average deviation is the average amount of scatter of the items in a
distribution from either the mean or the median, ignoring the signs of the deviations".
We usually compute mean deviation about any one of the three averages: mean, median
or mode. Sometimes the mode may be ill-defined, and as such mean deviation is computed
from the mean or the median. The median is preferred as a choice between mean and median.
But in general practice, and due to the wide applications of the mean, the mean deviation
is generally computed from the mean. M.D. can be used to denote mean deviation.
2. Take the deviations of items from average ignoring signs and denote these deviations
by |D|.
Calculate mean deviation from mean and median for the following data:
of M.D.
Solution:
Mean = 369
M.D. = Σ|D|/n
= 1570/9 = 174.44
The method of calculating mean deviation in a continuous series is the same as in the
discrete series. In a continuous series we have to find the mid points of the various
classes and take the deviations of these points from the average selected. Thus
M.D. = Σf|D| / N
where D = m - average
m = mid point of the class
Find out the mean deviation from mean from the following series.
0-10 20
10-20 25
20-30 32
30-40 40
40-50 42
50-60 35
60-70 10
70-80 8
Solution:
Mean = Σfm/N = 7740/212 = 36.5
M.D. = Σf|D|/N = 3193/212 = 15.06
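The working for this example can be reproduced with a short Python sketch: take the class mid-points, compute the mean, then average the absolute deviations weighted by frequency. The function name is an assumption made for illustration.

```python
def grouped_mean_deviation(classes, frequencies):
    """Return (mean, mean deviation from the mean) for a continuous series."""
    midpoints = [(lower + upper) / 2 for lower, upper in classes]
    n = sum(frequencies)
    mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
    # Sum of f * |D|, where D = mid-point minus mean and signs are ignored.
    total_abs_dev = sum(f * abs(m - mean) for f, m in zip(frequencies, midpoints))
    return mean, total_abs_dev / n

series_classes = [(lo, lo + 10) for lo in range(0, 80, 10)]
series_freq = [20, 25, 32, 40, 42, 35, 10, 8]

mean_value, mean_dev = grouped_mean_deviation(series_classes, series_freq)
print(round(mean_value, 1), round(mean_dev, 2))
```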
Merits:
2. It is rigidly defined.
Demerits:
4. Algebraic positive and negative signs are ignored. It is mathematically unsound and illogical.
Standard Deviation:
Karl Pearson introduced the concept of standard deviation in 1893. It is the most important
measure of dispersion and is widely used in many statistical formulae. Standard deviation is also
called Root-Mean Square Deviation. The reason is that it is the square–root of the mean of the
squared deviation from the arithmetic mean. It provides accurate result. Square of standard
Definition:
It is defined as the positive square-root of the arithmetic mean of the Square of the deviations of
the given observation from their arithmetic mean. The standard deviation is denoted by the