Data Analysis and Presentation
Data Type
• Quantitative data is classified into categorical and
numerical data
• Categorical data refer to data whose values cannot be
measured numerically but can be classified into
sets (categories), such as sex (male and female),
religion, or department
• Numerical data, sometimes termed
‘quantifiable’, are data whose values are measured or
counted numerically as quantities
• The two types are analyzed with different techniques
Quantitative Data Analysis
Two common types of analysis
1. Descriptive statistics
– describe, summarize, or explain a given set of data
2. Inferential statistics
– use statistics computed from a sample to infer about
the population
– concerned with making inferences from the
samples about the populations from which they have
been drawn
Common data analysis technique
1. Frequency distribution
2. Measures of central tendency
3. Measures of dispersion
4. Correlation
5. Regression
6. And more
Frequency distribution
• Shows the frequency of occurrence of
different values of a single phenomenon.
• Main purpose
1. To facilitate the analysis of data.
2. To estimate frequencies of the unknown
population distribution from the distribution of
sample data and
3. To facilitate the computation of various statistical
measures
Example – Frequency Distribution
• In a survey of 30 organizations, the number of computers
registered in each organization is given in the following
table
• This data has no meaning unless it is summarized in
some form
Example
The following table shows the frequency distribution
[Table: number of computers per organization vs. frequency]
Example …
• The above table can tell us meaningful
information, such as
– How many computers do most organizations have?
– How many organizations do not have computers?
– How many organizations have more than five
computers?
– Why is the computer distribution not the same in
all organizations?
– And other questions
Continuous frequency distribution
• A continuous frequency distribution is constructed when
the variable does not take discrete values like the number of
computers
• Examples are age and salary, which are continuous variables
Constructing frequency table
• The number of classes should preferably be between 5 and
20, although there is no rigidity about it.
• As far as possible, avoid class intervals such as 3, 7, 11, 26,
etc.; preferably use class intervals of 5 or multiples of 5,
such as 10, 20, 25, 100.
• The starting point, i.e. the lower limit of the first class, should
be zero, 5, or a multiple of 5.
• To ensure continuity and to get the correct class interval, we
should adopt the “exclusive” method, in which each class
excludes its upper limit.
• Wherever possible, it is desirable to use class intervals of
equal size.
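The rules above can be sketched in Python. This is a minimal illustration: the data values, the interval width of 5, and the zero starting point are illustrative choices, not data from the survey.

```python
# Build a frequency table with equal class intervals of width 5,
# starting at zero, using the "exclusive" method: a class [L, L+5)
# excludes its upper limit. The data values are hypothetical.
data = [3, 7, 12, 4, 18, 9, 11, 2, 16, 7, 5, 13]

width = 5                       # class interval of 5, as recommended
lower = 0                       # starting point at zero
upper = max(data) + width       # make sure the last value is covered

table = {}
for low in range(lower, upper, width):
    label = f"{low}-{low + width}"
    table[label] = sum(1 for x in data if low <= x < low + width)

for label, freq in table.items():
    print(label, freq)
```

Because every value falls into exactly one class, the frequencies sum to the number of observations.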
Constructing …
You can create a frequency table with two variables
This is called a bivariate frequency table

Country of Origin | Computer import
China             | 62
Japan             | 47
Germany           | 35
India             | 16
USA               | 6
Bar graph
[Bar graph of computer import by country of origin: China, Japan, Germany, India, USA; y-axis 0–70]
Measures of central tendency
• Mode is the value that occurs most
frequently
• It is the only measure of central tendency that
can be interpreted sensibly for categorical data
• Median is used to identify the mid-point of
the data
Central Tendency ….
• Mean is a measure of central tendency
• It includes all data values in its calculation
• For grouped data, Mean = Σfx / N
• where x = the mid-point of the individual class
• f = the frequency of the individual class
• N = the sum of the frequencies, or total frequency.
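The grouped mean Σfx / N, with x the class mid-point and f the class frequency, can be computed directly. The class mid-points and frequencies below are hypothetical.

```python
# Mean of grouped data: x_bar = sum(f * x) / N, where x is the
# class mid-point and f its frequency. Values are hypothetical.
midpoints   = [2.5, 7.5, 12.5, 17.5]   # x: mid-point of each class
frequencies = [3, 4, 3, 2]             # f: frequency of each class

N = sum(frequencies)                   # total frequency
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / N
print(mean)
```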
Mean
• The mean can be used to compare two groups on a
variable.
• Assume an organization (X) that uses web-based sales
services and another organization (Y) using a traditional
rented shop sales office
• X's average monthly sales are 10,000 birr while Y's monthly sales
are 7,000 birr
• The difference in means suggests that the
use of web-based sales services increases X's sales
performance
• You can apply a t-test to check whether the difference is statistically significant
Advantages of Mean
• It should be rigidly defined.
• It should be easy to understand and compute.
• It should be based on all items in the data.
• Its definition shall be in the form of a mathematical
formula.
• It should be capable of further algebraic treatment.
• It should have sampling stability.
• It should be capable of being used in further statistical
computations or processing
• However, the mean is affected by extreme data values in skewed distributions
• For a skewed distribution, use the median rather than the mean
Exercise
• Do the following exercise for the following IT staff
data for 13 organizations, named O1 to O13
• 25, 18, 20, 10, 8, 30, 42, 20, 53, 25, 10, 20, 42
• What is the mode?
• What is the median?
• What is the mean?
• Convert the data into a frequency table.
• Plot the data on a bar graph and a pie chart.
• What do you interpret from the data?
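After working the exercise by hand, the answers for mode, median, and mean can be checked with Python's standard library:

```python
# Check the exercise answers with the statistics module.
import statistics

staff = [25, 18, 20, 10, 8, 30, 42, 20, 53, 25, 10, 20, 42]

print(statistics.mode(staff))        # most frequent value -> 20
print(statistics.median(staff))      # middle value of the sorted data -> 20
print(round(statistics.mean(staff), 2))
```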
Measures of Dispersion
• Measures of central tendency serve to locate
the center of the distribution; measures of dispersion
describe how the items scatter around that center.
• This characteristic of a frequency distribution is
commonly referred to as dispersion.
• Small dispersion indicates high uniformity of the
items,
• Large dispersion indicates less uniformity.
• Less variation, or more uniformity, is a desirable
characteristic
Type of measure of dispersion
• The sample standard deviation is
s = √( Σ(x − x̄)² / (n − 1) )
[Figure: negatively skewed distribution]
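The sample standard deviation s = √( Σ(x − x̄)² / (n − 1) ) can be computed directly; here it is applied to the IT staff counts from the earlier exercise.

```python
# Sample standard deviation s = sqrt(sum((x - x_bar)^2) / (n - 1)),
# applied to the IT staff data from the earlier exercise.
import math

data = [25, 18, 20, 10, 8, 30, 42, 20, 53, 25, 10, 20, 42]
n = len(data)
x_bar = sum(data) / n
s = math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))
print(round(s, 2))
```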
Measures of Skewness
1. Karl Pearson's coefficient of skewness
2. Bowley's coefficient of skewness
3. Measure of skewness based on moments
We see Karl Pearson's; read about the others in the textbook
• Karl Pearson's absolute measure of skewness = mean − mode.
• It is not suitable for comparing variables with different units of measure
• Instead, use the relative measure of skewness, Karl Pearson's coefficient of
skewness, i.e.
(Mean − Mode) / standard deviation
• If the mode is ill defined, we use
3(Mean − Median) / standard deviation
Kurtosis
• Frequency curves expose different degrees of
flatness or peakedness – called kurtosis
• Measures of kurtosis tell us the extent to which a
distribution is more peaked or more flat-topped than
the normal curve. The normal curve, which is symmetrical and bell-
shaped, is designated as mesokurtic.
• If a curve is relatively narrower and more peaked at the
top, it is designated as leptokurtic.
• If the frequency curve is flatter than the normal curve,
it is designated as platykurtic.
Interpretation
• Real-world quantities usually have an approximately normal
distribution pattern – the bell shape
Normal dist…
• This implies that
• about 68% of the population lies within 1 standard deviation of the mean
• about 95% lies within 2 standard deviations
• about 99.7% lies within 3 standard deviations
• So you need to select a confidence level to say whether
your sample result is statistically significant or not
• For example, at the 5% level, a difference between two groups
is judged statistically significant only when it would fall
outside roughly 2 standard deviations by chance
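The 68/95/99.7 rule can be verified exactly for the normal curve, since P(|Z| < k) = erf(k / √2) for a standard normal variable Z:

```python
# Exact probability of falling within k standard deviations of the
# mean for a normal distribution: P(|Z| < k) = erf(k / sqrt(2)).
import math

for k in (1, 2, 3):
    p = math.erf(k / math.sqrt(2))
    print(f"within {k} sd: {p:.4f}")   # 0.6827, 0.9545, 0.9973
```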
Correlation
• Correlation is used to measure the linear
association between two variables
• For example, assume X is IT skill and Y is IT
use. Is there an association between these two
variables?
r = Σ(x − x̄)(y − ȳ) / √( Σ(x − x̄)² · Σ(y − ȳ)² )
Correlation …
• Correlation expresses the inter-dependence of two
sets of variables upon each other.
• One variable may be called the independent variable
(IV) and the other the dependent variable (DV)
• A change in the IV has an influence in changing the
value of the dependent variable
• For example, IT use may increase organizational
productivity because employees have better information access
and improved skills and knowledge
Correlation Lines
[Scatter plots: perfect correlation vs. no correlation]
Type of Correlation
1. Simple correlation
2. Multiple correlation
3. Partial correlation
• In simple correlation, we study only two variables.
• For example, number of computers and organizational efficiency
• In multiple correlation, we study more than two variables
simultaneously.
• For example, usefulness, ease of use, and IT adoption
• Partial correlation refers to the study of two variables
excluding the effect of some other variables
Karl Pearson's coefficient of correlation
• Karl Pearson, a great biometrician and statistician, suggested
a mathematical method for measuring the magnitude of the
linear relationship between two variables
• Karl Pearson's coefficient of correlation is the most widely
used method of correlation
r = ΣXY / (n · σx · σy)
or equivalently
r = ΣXY / √( ΣX² · ΣY² )
where X = x − x̄, Y = y − ȳ
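The second formula can be computed directly from the deviations X = x − x̄ and Y = y − ȳ. The small data set below is hypothetical, chosen only to illustrate the calculation:

```python
# Pearson's r from deviations: r = sum(XY) / sqrt(sum(X^2) * sum(Y^2)),
# where X and Y are deviations from the means. Data are hypothetical.
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
X = [xi - x_bar for xi in x]
Y = [yi - y_bar for yi in y]

r = sum(a * b for a, b in zip(X, Y)) / math.sqrt(
    sum(a * a for a in X) * sum(b * b for b in Y))
print(round(r, 3))
```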
Exercise
Calculate the correlation for the following data

Rank of X | 7   4   6   2   1   9   3   8   5
X         | 56  78  65  89  93  24  87  44  74
Y         | 34  65  67  90  86  30  80  50  70
Rank of Y | 8   6   5   1   2   9   3   7   4
D         | -1  -2  1   1   -1  0   0   1   1
D²        | 1   4   1   1   1   0   0   1   1
Spearman Rank Correlation
• Developed by Charles Spearman in 1904
• It is studied when no assumption about the
parameters of the population is made.
• This method is based on ranks
• It is useful to study the qualitative measure of
attributes like honesty, colour, beauty, intelligence,
character, morality etc.
• The individuals in the group can be arranged in order
and there on, obtaining for each individual a number
showing his/her rank in the group
Formula
r = 1 − 6ΣD² / (n³ − n)
• Where ΣD² = sum of squares of differences between the pairs of
ranks.
• n = number of pairs of observations.
• The value of r lies between –1 and +1. If r = +1, there is complete
agreement in order of ranks and the direction of ranks is also
same. If r = -1, then there is complete disagreement in order of
ranks and they are in opposite directions.
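Applying the formula to the exercise data, where the D² row sums to 10 and there are n = 9 pairs:

```python
# Spearman's rank correlation for the exercise data:
# r = 1 - 6 * sum(D^2) / (n^3 - n).
d_squared = [1, 4, 1, 1, 1, 0, 0, 1, 1]   # D^2 row from the exercise
n = 9

r = 1 - (6 * sum(d_squared)) / (n ** 3 - n)
print(round(r, 4))   # 0.9167: strong agreement between the two rankings
```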
Advantage of Correlation
• It is a simple and attractive method of finding the nature of the
correlation between two variables.
• It is a non-mathematical method of studying correlation. It is
easy to understand.
• It is not affected by extreme items.
• It is the first step in finding out the relation between two
variables.
• We can get a rough idea at a glance of whether the correlation is
positive or negative.
• But we cannot get the exact degree of correlation between
the two variables
The Pearson Chi-square
• It is the most common coefficient of association,
calculated to assess the significance of
the relationship between categorical variables.
• It is used to test the null hypothesis that
observations are independent of each other.
• It is computed as the difference between
observed frequencies shown in the cells of cross-
tabulation and expected frequencies that would
be obtained if variables were truly independent.
Chi-square …
χ² = Σ (O − E)² / E
where O is the observed value and E is the expected value.

Day   | Observed | Expected | Difference
M     | 3        | 6        | -3
T     | 5        | 6        | -1
W     | 7        | 6        | 1
Th    | 6        | 6        | 0
F     | 9        | 6        | 3
Total | 30       | 30       | 0

The χ² value and its significance level depend on the total
number of observations and the number of cells in the table.
The degrees of freedom depend on the number of rows and columns:
DF = (r - 1) * (c - 1)
where r is the number of levels for one categorical
variable, and c is the number of levels for the other
categorical variable.
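The weekday table, with observed counts (3, 5, 7, 6, 9) against an expected count of 6 per cell, can be worked through directly:

```python
# Chi-square statistic chi2 = sum((O - E)^2 / E) for the weekday
# table: observed counts vs. an expected count of 6 per cell.
observed = [3, 5, 7, 6, 9]     # Mon..Fri
expected = [6, 6, 6, 6, 6]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 3))          # 20/6, about 3.333
```

Note that this particular table is a one-way (goodness-of-fit) case, so its degrees of freedom are k − 1 = 4; the (r − 1)(c − 1) rule above applies to cross-tabulations of two variables.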
Assumptions
• Ensure that every observation is independent of every other
observation; in other words, each individual should be counted
once and in only one category.
• Make sure that each observation is included in the appropriate
category; it is not permitted to omit some of the observations.
• The total sample should exceed 20; otherwise, the chi-squared
test as described here is not applicable. More precisely, the
minimum expected frequency should be at least 5 in every cell.
• Remember that showing that there is an association is not the
same as showing that there is a causal effect; for example, the
association between a healthy diet and low cholesterol does not
demonstrate that a healthy diet causes low cholesterol.
Parametric Tests
• The one-way ANOVA F statistic compares variation between group means with variation within groups:

F = [ (n1(x̄1 − x̄)² + n2(x̄2 − x̄)² + … + nI(x̄I − x̄)²) / (I − 1) ]
    / [ ((n1 − 1)s1² + (n2 − 1)s2² + … + (nI − 1)sI²) / (N − I) ]

Exercise: fit the regression of Y on X for the following data.
X | 6   2   10  4   8
Y | 9   11  5   8   7
Solution
• Regression equation of Y on X is Y = a + bX and
• the normal equations are
∑Y = na + b∑X
∑XY = a∑X + b∑X²
• Where n = 5
∑Y = 40
∑X = 30
∑XY = 214
∑X² = 220
Regression
• Substituting the values, we get
• 40 = 5a + 30b …… ( equation 1)
• 214 = 30a + 220b ……. ( equation 2)
• Multiplying (equation 1) by 6
• 240 = 30a + 180b……. ( equation 3)
• Subtract equation 3 from equation 2
• You get - 26 = 40b or b = - 26/40
b = - 0.65
• Now, substituting the value of ‘ b’ in equation (1)
• 40 = 5a – 19.5
• 5a = 59.5
• a = 59.5/5 or a = 11.9
Regression
• Hence, required regression line Y on X is
• Y = 11.9 – 0.65 X.
• This implies that
• 11.9 is the constant or intercept: when X is zero, the value of Y is
11.9
• −0.65 is the slope: a 1-unit increase in X changes Y by
−0.65
• Likewise, a 2-unit increase in X changes Y by −1.30
• This is generalized to the population by checking statistical
significance
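The worked solution can be checked directly from the closed-form solution of the normal equations, using the same data:

```python
# Least-squares line Y = a + bX, solved from the normal equations:
# b = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - sum(x)^2).
x = [6, 2, 10, 4, 8]
y = [9, 11, 5, 8, 7]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2 = sum(a * a for a in x)

b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
a = (sy - b * sx) / n
print(a, b)   # 11.9 and -0.65, matching the worked solution
```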
Statistical significance
• What happens after we have chosen a statistical test, and
analysed our data, and want to interpret our findings? We
use the results of the test to choose between the following:
1. Alternative (Experimental) hypothesis (e.g. loud noise disrupts
learning).
2. Null hypothesis, which asserts that there is no difference
between conditions (e.g. loud noise has no effect on learning).
• If the statistical test indicates that there is only a small
probability that the difference between the two conditions (e.g.
loud noise vs. no noise) arose by chance, then we reject the null hypothesis in
favor of the experimental hypothesis
Statistical ….
• Psychologists generally use the 5% (0.05) level of statistical significance.
What this means is that the null hypothesis is rejected (and the
experimental hypothesis is accepted) if the probability that the results were
due to chance alone is 5% or less. This is often expressed as p <= 0.05
• It is possible to use other significance levels: a stricter level such as 1%
means the null hypothesis is rejected with greater confidence,
while a looser level such as 10% gives less confidence
• This leads to Type I and Type II errors
– Type I error: we may reject the null hypothesis in favour of the experimental
hypothesis even though the findings are actually due to chance; the probability of
this happening is given by the level of statistical significance that is selected.
– Type II error: we may retain the null hypothesis even though the experimental
hypothesis is actually correct.
Statistical significance …
• Researchers are interested not only in the correlation between
two variables, but also in whether the value of r they obtain is
statistically significant.
• Statistical significance exists when a correlation coefficient
calculated on a sample has a very low probability of being zero in
the population.
• Assume we get a correlation between X and Y of 0.4 in our sample.
• How do we know that this r would not be zero (r = 0.0) if we took a census of
the entire population?
• When the probability that our correlation is truly zero in the population is
sufficiently low (usually less than .05),
• we refer to the correlation as statistically significant
Factors affecting statistical significance
• Sample size
• Assume that, unknown to each other, you and I independently
calculated the correlation between shyness and self-esteem and that
we both obtained a correlation of -.50.
• However, your calculation was based on data from 300 participants,
whereas my calculation was based on data from 30 participants.
• Which of us should feel more confident that the true correlation
between shyness and self-esteem in the population is not .00?
• You can probably guess that your sample of 300 should give you more
confidence in the value of r you obtained than my sample of 30.
• Thus, all other things being equal, we are more likely to conclude that a
particular correlation is statistically significant the larger our sample is.
Factors …
• Magnitude of the correlation. For a given
sample size, the larger the value of r we obtain,
the less likely it is to be .00 in the population.
• Imagine you and I both calculated a correlation
coefficient based on data from 300 participants;
• your calculated value of r was .75, whereas my
value of r was .20. You would be more confident
that your correlation was not truly .00 in the
population than I would be.
Factors ..
• Level of confidence
• It indicates how careful we want to be not to
draw an incorrect conclusion about whether the
correlation we obtain could be zero in the population.
• Typically, researchers decide that they will consider a
correlation to be significantly different from zero if
there is less than a 5% chance (that is, less than 5
chances out of 100) that a correlation as large as the
one they obtained could have come from a population
with a true correlation of zero.
Techniques
• There are two methods for making
generalizations about the population
1. Mean method – by confidence interval
2. Statistical significance method – using different
inferential statistics such as chi-square, ANOVA,
regression, etc.
Confidence interval Method
• When you compute a confidence interval on the mean, you compute the mean
of a sample in order to estimate the mean of the population.
• Clearly, if you already knew the population mean, there would be no need for a
confidence interval.
• Assume that the weights of 10-year-old children are normally distributed with a
mean of 90 and a standard deviation of 36.
• What is the sampling distribution of the mean for a sample size of 9?
• The formula for the standard error is σx̄ = σ / √n = 36 / √9 = 12,
so the sampling distribution of the mean is normal with mean 90
and standard error 12.
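The standard error for the example above (σ = 36, n = 9, population mean 90) can be computed directly, together with a 95% confidence interval sketch using the usual 1.96 multiplier:

```python
# Standard error of the mean: sigma / sqrt(n), for the example
# with sigma = 36, n = 9, and mean 90.
import math

sigma, n, mu = 36, 9, 90
se = sigma / math.sqrt(n)                   # -> 12.0
ci_95 = (mu - 1.96 * se, mu + 1.96 * se)    # 95% interval around the mean
print(se, ci_95)
```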
References
• http://onlinestatbook.com/2/normal_distribution/normal_distribution.html