Prof. James Analysis
INTRODUCTION
With descriptive statistical analyses we can observe trends and distribution patterns in data.
Univariate analysis deals with one variable at a time: most descriptive statistical
analyses fall under this category.
Bivariate analysis deals with two variables at a time: it is concerned with establishing
the relationship between two variables and is part of inferential statistical analysis
(examples: Pearson chi-square test, Pearson correlation analysis, Spearman’s rank
correlation, etc.).
Multivariate analysis deals with more than two variables at a time: it is concerned with
establishing the relationship between several (more than two) variables and is part of
inferential statistical analysis (examples: multiple regression analysis, factor analysis,
discriminant analysis, etc.)
These are summary measures, important for observing trends/patterns of distribution
in data/variables. They include:
i) measures of central tendency (e.g. the mean) and measures of dispersion
(variance, SD) for CONTINUOUS data/variables,
AND
ii) frequencies and percentages for CATEGORICAL data/variables.
Note:
The type of variable (level of measurement) is among the key determinants of the choice
of statistical method to use in data analysis. As you can see, for descriptive statistical
analysis we compute frequencies and percentages for categorical variables, AND mean,
mode, median, minimum value, maximum value, variance and SD for continuous
variables. Continuous variables can also be analyzed for skewness, kurtosis, interquartile
range, and Box-and-Whisker plots (box plots).
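The two kinds of summary described above can be sketched in a few lines of Python, using only the standard library (the data values below are hypothetical, purely for illustration; they are not from the workshop file):

```python
import statistics

# Continuous variable: hypothetical ages of respondents
ages = [22, 35, 44, 51, 62, 40, 47]

print("n       =", len(ages))
print("min/max =", min(ages), "/", max(ages))
print("mean    =", round(statistics.mean(ages), 1))
print("median  =", statistics.median(ages))
print("SD      =", round(statistics.stdev(ages), 1))  # sample standard deviation

# Categorical variable: frequencies and percentages
sexes = ["male", "female", "male", "male", "female"]
counts = {s: sexes.count(s) for s in set(sexes)}
percents = {s: 100 * c / len(sexes) for s, c in counts.items()}
print(counts)
print(percents)
```

This mirrors the rule above: mean/SD/min/max for the continuous variable, counts and percentages for the categorical one.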
Procedures in SPSS;
Analyze → Descriptive Statistics → Descriptives → select the variable(s) of interest and
put it (them) in the Variable(s) box → Options (select the statistics required) →
Continue → OK
Example;
Using “Workshop SPSS working file1”, compute descriptive statistics for the variables
age, hhsize, income1, income2, fsize, ncattle.
Solution;
Analyze → Descriptive Statistics → Descriptives → select the variables “age, hhsize,
income1, income2, fsize, ncattle” and put them in the Variable(s) box → OK
Output
Descriptive Statistics
Interpretation/reporting
Results from Table .. indicate that the age of respondents varied from 22 to 62 years with
an average of 44 years, and household size varied from 1 to 13 individuals per household
with an average of 7 individuals. Regarding annual household income, recorded in ‘000
Tsh, income varied from 150 to 3405 with an average of 1189 for the period before the
project, and from 130 to 3200 with an average of 1248 for the period after the project.
Furthermore, results reveal that farm size per household varied from 1 to 16 acres with an
average of 4.3 acres, and the number of cattle ranged from 0 to 120 with an average of 23
cattle per household.
Procedures in SPSS
Example;
Using “Workshop SPSS working file1”, compute descriptive statistics (frequencies and
percentages) for the variables age2, sex, marital, educ.
Solution;
Analyze → Descriptive Statistics → Frequencies → select the variables “age2, sex,
marital, educ” and put them in the Variable(s) box → OK.
age
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   < 30         11        7.3          7.3               7.3
        30 - 50      96       64.0         64.0              71.3
        > 50         43       28.7         28.7             100.0
        Total       150      100.0        100.0
sex
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   male         92       61.3         61.3              61.3
        female       58       38.7         38.7             100.0
        Total       150      100.0        100.0
marital status
                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid   single        22       14.7         14.7              14.7
        married      122       81.3         81.3              96.0
        divorced       3        2.0          2.0              98.0
        widow          3        2.0          2.0             100.0
        Total        150      100.0        100.0
education level
                            Frequency   Percent   Valid Percent   Cumulative Percent
Valid   none                    7         4.7          4.7               4.7
        primary                81        54.0         54.0              58.7
        secondary              40        26.7         26.7              85.3
        college and above      22        14.7         14.7             100.0
        Total                 150       100.0        100.0
Note1: The “Percent” and “Valid Percent” columns have identical values. This is because
we don’t have missing cases. However, if there were missing cases, the values in the two
columns would differ. Always use the VALID Percent column!
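The distinction can be seen in a short sketch of how the two columns are computed (the data below are hypothetical; `None` stands in for a missing case):

```python
# "Percent" divides by ALL cases; "Valid Percent" divides by non-missing cases only.
responses = ["male", "female", "male", None, "male"]

n_total = len(responses)                      # all cases, including missing
valid = [r for r in responses if r is not None]
n_valid = len(valid)

for category in ("male", "female"):
    count = valid.count(category)
    percent = 100 * count / n_total           # the "Percent" column
    valid_percent = 100 * count / n_valid     # the "Valid Percent" column
    print(category, count, round(percent, 1), round(valid_percent, 1))
```

With one missing case, "male" shows 60.0 under Percent but 75.0 under Valid Percent, which is why the Valid Percent column is the one to report.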
Results from Table .. indicate that the majority of respondents (64.0%) were aged
between 30 and 50 years, with very few (7.3%) aged below 30 years. Results from
Table .. also reveal that most respondents (61.3%) were male, and an overwhelming
majority (81.3%) were married. Findings from Table .. further indicate that about half
(54.0%) of the respondents had primary education, and more than one-third (41.4%) had
at least secondary education.
Example;
Results from Table .. indicate that the majority of respondents, 64.0%, were aged between
30 and 50 years, with very few, 7.3%, aged below 30 years. Results from Table .. also
reveal that most respondents, 61.3%, were male, and an overwhelming majority, 81.3%,
were married. Findings from Table .. further indicate that about half, 54.0%, of the
respondents had primary education, and more than one-third, 41.4%, had at least
secondary education.
In some fields you may be required to indicate both frequencies and percentages when
explaining your results; for example, the above text could read:
Results from Table .. indicate that, out of the 150 survey respondents, the majority, 96
(64.0%), were aged between 30 and 50 years, with very few, 11 (7.3%), aged below 30
years. Results from Table .. also reveal that, of the 150 surveyed respondents, most, 92
(61.3%), were male, and an overwhelming majority, 122 (81.3%), were married. Findings
from Table .. further indicate that about half, 81 (54.0%), of the respondents had primary
education, and more than one-third, 62 (41.4%), had secondary education or above.
Important!
Note1;
In fact, there are several formats and ways of interpreting/structuring sentences. The
above are just a few examples. Reading previous research reports, i.e. dissertations or
journal articles, will strengthen your skills in this aspect. This is the easiest way of
learning how to interpret and report the results of your analysis! (The problem is that
most students don’t read beyond what has been taught in class!)
Note2;
When explaining your results, try to explain the general trend/pattern (the message we
get from the results); i.e. avoid, as much as you can, copying/transferring ALL
information from the table into the text describing/explaining the results.
Sometimes there are variables for which a respondent can give more than one answer
(response);
Example:
- What types of crops are grown?
- What do you think should be done in order to improve agricultural productivity in
your village?
- What is a youth’s source of information/knowledge on issues related to sexual and
reproductive health?
These types of questions allow a respondent to give several answers, i.e. a combination of
responses. In such a situation, the variable can either be coded including combinations
of responses OR analyzed using the Multiple Response option in SPSS. Multiple response
is usually preferred as it makes the trend of responses easy to see (clarity); with the first
option (coding combinations of responses) you may end up with many combinations and
hence lose focus and clarity.
1 = maize
2 = Sorghum
3 = Millet
4 = Rice
5 = Beans
6 = Simsim
7= Sunflower
8 = Groundnuts
9 = Cassava
10 = Sweet potatoes
11 = Bambara nuts
12 = Maize and sorghum
13 = Maize and millet
14 = Maize, Rice
15 = Sorghum, Beans, simsim
16 = Rice, groundnuts, rice
17 = Cassava, maize, millet, Bambara nuts, sweet potatoes
18 = Millet, Maize, Sunflower, beans, Rice,
Two approaches:
a) When we have a column for every possible answer/option, and the columns are coded
in a 1 = Yes, 2 = No style.
b) When we have several columns for the question under study, the number of columns
depending on the maximum number of answers a respondent can give. Furthermore, the
coding is uniform across columns, with each column containing codes (value labels) for
all possible options/answers:
1. A variable for such analysis must have several columns in the SPSS data file, the
number of columns representing the maximum (possible) number of responses an
individual respondent can give, as observed in the questionnaires. (Note: this is not
the same as the number of possible options/responses for that variable in the data
file/questionnaire.)
2. Coding for all columns of that variable should be uniform; this can be achieved
by copying and pasting the value labels of the first column to the other columns.
Note: The first column does NOT mean it is for the response coded as one in the data file,
nor the second column for the response coded as two! Any response can be punched into
the first, second or third column, etc., depending on the sequence of responses given by a
respondent. That is, the first column is for a respondent’s first response, the second
column for the second response, and so on. The first response by respondent X can be any
response, e.g. the one coded as 7, and the second response can be any other, e.g. the one
coded as 20. In this situation 7 should be punched into the first column and 20 into the
second column.
Example;
Consider the variable “where” (i.e. where knowledge on reproductive health is obtained)
in the data file “Workshop SPSS working file0”. Based on responses from the
questionnaire, there were 8 responses coded as indicated below, and the maximum
number of responses by a respondent was four (4), hence four columns in the data file.
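What the multiple response procedure counts can be sketched as follows. The rows below are hypothetical, assuming four columns (where1..where4) with codes 1 to 8 and `None` for unused columns:

```python
from collections import Counter

# Hypothetical rows: four columns (where1..where4); None = no further response.
rows = [
    (1, 3, None, None),
    (2, None, None, None),
    (1, 2, 7, None),
    (3, 7, None, None),
]

# Pool all non-missing codes across the four columns and tally them.
codes = [c for row in rows for c in row if c is not None]
counts = Counter(codes)
n_responses = len(codes)   # total responses given
n_cases = len(rows)        # respondents (cases)

for code, count in sorted(counts.items()):
    pct_of_responses = 100 * count / n_responses
    pct_of_cases = 100 * count / n_cases   # can sum to more than 100%
    print(code, count, round(pct_of_responses, 1), round(pct_of_cases, 1))
```

Note that "pct of cases" can sum to more than 100% because each respondent may contribute several responses.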
After preparing the file with the multiple-response variable in the required format, as
explained above, the following are the procedures for carrying out a multiple response
analysis in SPSS.
Procedures in SPSS
Analyze → Multiple Response → Define Sets → select the variables for the set and put
them in the Variables in Set box → Categories (specify the range of categories/responses
for the variable under analysis) → write a variable name and label → Add → Close →
Analyze → Multiple Response → Frequencies.
Note: This variable is represented by four columns (i.e. entered into four columns),
viz. where1, where2, where3, where4.
Solution;
Select the variables “where1, where2, where3 and where4” and put them in the Variables
in Set box → Categories (specify the range of categories/responses for the variable under
analysis, i.e. 1 to 8) → write “where” as the variable name and “where get knowledge” as
the variable label.
Output columns:
Category label   Code   Count   Pct of Responses   Pct of Cases
We usually report the category label, the count (frequency) and either of the last two
columns (i.e. pct of responses or pct of cases).
Note: pct = percent; cases = respondents. I usually prefer the last column.
Note: the value for n here is the value for valid cases indicated at the bottom of the
multiple response analysis output.
Data can also be summarized for observing trends by presenting it in the form of graphs.
There are several graphical methods for presenting data. The most common ones are:
i) Pie chart
ii) Bar chart
iii) Histogram
iv) Scatter plot
Example:
Using “Workshop SPSS working file4”, generate a pie chart for the percent distribution of
the different options/responses of the variable living2 (i.e. living arrangement).
Solution;
[Pie chart output: Living arrangement; largest slice 50.99%]
Example;
Using “Workshop SPSS working file1”, generate a pie chart for total income1 (annual
household income before the project) by district; specify label and percent to appear in
the pie chart, and under Title write/type “Income before project”.
[Pie chart output: income before project by district; Kongwa 49.04%, Chamwino 26.91%]
Note: However, the above output would only make sense if the sample size were equal
among districts.
Put the variable on the X-axis (make sure the variable on the X-axis is specified as
categorical; right-click and change it if necessary) → OK.
Example;
Using “Workshop SPSS working file4”, generate a bar chart for a variable religion
(religion affiliation)
Output
60 %
Percent
40 %
20 %
0%
ca th ol ic protes ta nt mos le m
religion affiliation
Example;
Using “Workshop SPSS working file1”, generate a histogram for the variable income1
(income before project).
Output
[Histogram: income before project (‘000 Tsh) on the X-axis (1000–3000); percent
(0–20%) on the Y-axis]
Procedures in SPSS
Graphs → Scatter → specify the variables for the Y-axis and X-axis → OK.
Note: You can fit a regression line (equation) by using the Fit option before specifying a
title. While in the Fit option, in the Method dialogue box choose Linear regression and
specify that the equation be displayed.
Expected output
[Scatter plot of household annual income (‘000) vs farm size, with fitted line]
Linear Regression: household annual income ('000) = 411.40 + 194.65 * fsize,
R-Square = 0.48
Exercise
Generate a scatter plot for Number of cattle (X) vs Income (Y)
In testing a hypothesis we usually look at whether we can REJECT the null hypothesis
or ACCEPT it.
When the null hypothesis is rejected, we accept the alternative hypothesis and therefore,
depending on the nature of the problem, declare that there is a significant difference
between groups or a significant association/relationship between variables.
The basic question to ask ourselves is: how low should the P-value be for us to
reject the null hypothesis? The general rule is that it should be equal to or lower than 5%
(i.e. 0.05), because the 5% (0.05) level of significance is usually taken as the standard.
P > 0.05 = Non-significant at P > 0.05 (i.e. the null hypothesis is accepted; therefore there
is NO significant difference/relationship) (i.e. NS)
P ≤ 0.05 = Significant at P ≤ 0.05 (the null hypothesis is rejected at P ≤ 0.05; therefore
there is a significant difference/relationship at P ≤ 0.05) (i.e. *)
P < 0.01 = Significant at P < 0.01 (the null hypothesis is rejected at P < 0.01; therefore
there is a significant difference/relationship at P < 0.01) (i.e. **)
P < 0.001 = Significant at P < 0.001 (the null hypothesis is rejected at P < 0.001;
therefore there is a significant difference/relationship at P < 0.001) (i.e. ***)
From the above, it is important to note that for P > 0.05 (i.e. values above 0.05) we
agree with the claims of the null hypothesis, AND for P ≤ 0.05, P < 0.01 and P < 0.001
(i.e. values equal to or below 0.05) we disagree with the claims of the null hypothesis
and hence accept the claims of the alternative hypothesis.
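The significance-flag convention above (NS, *, **, ***) can be written as a small helper, sketched here for illustration:

```python
# Map a p-value to the reporting flag described above.
def significance_flag(p):
    if p < 0.001:
        return "***"   # significant at P < 0.001
    if p < 0.01:
        return "**"    # significant at P < 0.01
    if p <= 0.05:
        return "*"     # significant at P <= 0.05
    return "NS"        # non-significant

for p in (0.0004, 0.008, 0.031, 0.154):
    print(p, significance_flag(p))
```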
Ha: Male students have a significantly higher average performance in Statistics than
female students (i.e. males perform better than females in Statistics) (for the case of a
right-tailed test)
OR
Ha: Male students have a significantly lower average performance in Statistics than
female students (i.e. males perform worse than females in Statistics) (for the case of a
left-tailed test)
Computer output for most statistical tests, in most statistical software, gives the p-value
(sig.) for a two-tailed test. To get the corresponding P-value for a one-tailed test, simply
divide the two-tailed value by 2.
- Example: an economist wants to know if the per capita income of a particular region is
the same as the national average.
- This test is suitable for continuous data, i.e. data for which a mean can be computed
and makes sense!
- Test statistic = t
Or, in the case of a one-tailed test:
Ha: the population mean is higher than the hypothesized value (right-tailed test), or the
population mean is lower than the hypothesized value (left-tailed test).
However, in the case of a one-tailed test, the alternative hypothesis (Ha) would be:
Ha: the average weight of second-year students at IRDP is higher than 50 kg (for a
right-tailed test), or Ha: the average weight of second-year students at IRDP is lower
than 50 kg (for a left-tailed test).
Procedures in SPSS: Analyze → Compare Means → One-Sample T Test → select the
variable under study (variable of interest) and put it in the Test Variable(s) box → specify
the Test Value → OK.
T-Test
One-Sample Statistics
                                        N     Mean        Std. Deviation   Std. Error Mean
annual income before project ('000)    150   1189.2200    859.22329        70.15529
Therefore, from the above output it can be said that the average annual household
income of a rural household in Dodoma is significantly different from 1,000 (in ‘000 Tsh)
(t = 2.67, P < 0.01).
Note1: Depending on preference, d.f. can also be reported in the above sentence (i.e. t =
2.67, df = 149, P < 0.01), or only the P-value reported (i.e. P < 0.01).
Note2: Suppose the sig. in the last table were 0.031, 0.025 or 0.016 (i.e. values which are
also below 0.05); this would also imply the presence of a significant difference.
Note3: Suppose the sig. in the last table were 0.075, 0.082 or 0.154 (i.e. values above
0.05); this would imply that there was no significant difference.
Note4: Sig. in computer output is rounded to 3 decimal places. Therefore a sig. of 0.000
could actually be 0.000067 or 0.00028, etc.
Results for one-sample t-test for average annual income before project
(test value in ‘000 Tsh. = 1000)
N Mean Std dev. (SD)
150 1189.2 859.2
t-value = 2.697, Significance = 0.008 (or P<0.01).
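The t-value in the summary above can be recomputed directly from the one-sample t formula, t = (sample mean − test value) / (SD / √n), using the figures reported in the output:

```python
import math

# Summary statistics from the one-sample output above
n, mean, sd = 150, 1189.2200, 859.22329
test_value = 1000

se = sd / math.sqrt(n)            # standard error of the mean
t = (mean - test_value) / se
df = n - 1

print(round(t, 3), df)            # about 2.697 with df = 149, matching the output
```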
Exercise
Using the same file (Workshop SPSS working file1), test the hypothesis (null
hypothesis) that average annual household income before project (in’000 Tsh.) was 1200.
Note: Summarize your output and interpret your results.
This test is concerned with testing the equality of means for two groups (i.e. comparing
two groups). It is involved in answering questions like “are the values for the two groups
similar?”
In this test it is assumed that the two samples come from two independent populations.
Test statistic = t
Ha: The mean for group A is significantly higher than the mean for group B (for a
right-tailed test)
OR
Ha: The mean for group A is significantly lower than the mean for group B (for a
left-tailed test).
Example:
Ha: Male students have a significantly lower average performance in Statistics than
female students (i.e. males perform worse than females in Statistics) (for the case of a
left-tailed test).
Procedures for independent samples t-test in SPSS:
Analyze → Compare Means → Independent-Samples T Test → select the
response/criterion variable (the dependent variable) and put it in the Test Variable(s) box
→ put the grouping variable in the Grouping Variable box → Define Groups → Continue
→ OK.
T-Test
Group Statistics
                                     sex of household head    N     Mean        Std. Deviation   Std. Error Mean
annual income before project ('000)  male                     92    1618.9022   740.51248        77.20376
                                     female                   58     507.6552   532.65886        69.94153
From the above output it can be concluded that the average annual household income
before the project for male-headed households was significantly different from that of
female-headed households (t = 10.67, P < 0.001).
Note1: If the P-value (sig.) for Levene’s test is ≤ 0.05 (i.e. equal to or below 0.05), use the
last row of the table above for interpretation; if it is > 0.05 (i.e. above 0.05), use the first
row.
Note2: Suppose the sig. in the last table were any value above 0.05, such as 0.23; this
would imply that there was no significant difference.
Therefore, for the above output, if your interest was a one-tailed test you could state that
“the average annual household income before the project for male-headed households
was significantly higher than that of female-headed households (t = 10.67, P < 0.001)”.
Average annual household income before the project (in ‘000 Tsh) was compared for the
two types of household. Results from Table … indicate that the average annual household
income for male-headed households, 1618.9, was significantly higher than that for
female-headed households, 507.7 (t = 10.67, P < 0.001).
OR we could say;
Average annual household income before the project (in ‘000 Tsh) was compared for the
two types of household. Results from Table … indicate that the average annual household
income for male-headed households (1618.9) was higher than that for female-headed
households (507.7). Results for the t-test indicate the difference to be significant
(t = 10.67, P < 0.001).
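The reported t-value can be recomputed from the group statistics above. This sketch uses the unequal-variances (Welch) form of the two-sample t, since the standard errors of the two group means are given directly in the output:

```python
import math

# Group statistics from the independent-samples output above
n1, mean1, se1 = 92, 1618.9022, 77.20376    # male-headed households
n2, mean2, se2 = 58, 507.6552, 69.94153     # female-headed households

# Unequal-variances (Welch) t: difference in means over the combined standard error
t = (mean1 - mean2) / math.sqrt(se1 ** 2 + se2 ** 2)
print(round(t, 2))    # about 10.67, matching the reported value
```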
Example;
Apart from annual income before the project, suppose we compared the households on
other variables such as household size (number of individuals in a household), farm size
(acres) and number of cattle owned, and the following output was obtained.
T-Test
Group Statistics
                                     sex     N     Mean        Std. Deviation   Std. Error Mean
annual income before project ('000)  male    92    1618.9022   740.51248        77.20376
                                     female  58     507.6552   532.65886        69.94153
household size (number)              male    92       5.7283     2.41835          .25213
                                     female  58       8.3793     2.03330          .26699
farm size (acres)                    male    92       5.3478     3.23591          .33737
                                     female  58       2.6983     1.99987          .26260
number of cattle owned               male    92      33.1522    17.43492         1.81772
                                     female  58       7.2414     9.00158         1.18197
Mean values for various variables for male and female head households
Mean ± SD
Variable Male Female t-value Sign.
This study compared the two types of households in terms of average annual household
income before the project (in ‘000 Tsh), household size, farm size (acres) and number of
cattle owned. Results from Table … indicate significant differences between the two
types of households in these variables. Male-headed households had a significantly
higher average annual household income (t = 10.67, P < 0.001), average farm size
(t = 5.60, P < 0.001) and average number of cattle owned (t = 10.46, P < 0.001) than
female-headed households. In contrast, however, female-headed households had a
significantly larger average household size than male-headed households (t = -6.94,
P < 0.001).
Exercise;
Re-write the above text indicating the mean values (use an accepted format; to achieve
this, try to read already published papers/research reports that used that style).
Example:
Results from Table 3 indicate that exclusive breastfeeding had significantly improved
Height-for-Age z-scores (HAZ) and Weight-for-Height z-scores (WHZ). The exclusively
Exercise
Using the file “Workshop SPSS working file1”, compare male-headed and female-headed
households on mean age and mean annual income after the project.
This test is also concerned with comparing means for two groups/two situations/two
periods.
It applies to paired observations (i.e. the two samples come from two related
populations).
Example:
- Data taken from the same individual/household at two different periods (i.e.
before and after an intervention).
- An HR manager wants to know if a particular training program had any impact on
increasing the motivation level of the employees.
In SPSS you don’t need to specify which is the dependent variable and which is the
independent variable, as we did for the independent samples t-test.
(Note how data are entered for this analysis vs. those for the independent samples t-test:
the formats differ, i.e. data are entered differently.)
t = (d̄ × √n) / s_d,
where d is the difference between pairs, d̄ is the mean of the differences, and s_d is the
standard deviation of the differences.
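The paired-t formula above can be applied directly to a small set of before/after pairs. The scores below are hypothetical, purely to show the arithmetic:

```python
import math
import statistics

# Hypothetical before/after scores for 5 paired observations (illustrative only)
before = [10, 12, 9, 11, 13]
after = [12, 13, 10, 14, 13]

diffs = [a - b for a, b in zip(after, before)]
d_bar = statistics.mean(diffs)            # mean of the differences
s_d = statistics.stdev(diffs)             # SD of the differences

t = d_bar * math.sqrt(len(diffs)) / s_d   # t = d-bar * sqrt(n) / s_d
df = len(diffs) - 1
print(round(t, 3), df)
```

The resulting t is then compared against the t distribution with n − 1 degrees of freedom, exactly as SPSS does internally.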
Ho: Mean (average) for period1 is NOT significantly different from mean for period2
Ha: Mean (average) for period1 is significantly different from that of period2
Ha: Mean for period1 is significantly higher than mean for period2 (for right tailed test)
OR
Ha: Mean for period1 is significantly lower than mean for period2 (for left tailed test)
Ha: Average annual household income before project is significantly higher than that
after project (for the case of right tailed test)
Procedures in SPSS: Analyze → Compare Means → Paired-Samples T Test → select the
two variables for comparison (click the first variable and then the second variable) and
put them in the Paired Variables box → OK.
T-Test
Paired Samples Statistics
                                               Mean        N     Std. Deviation   Std. Error Mean
Pair 1   annual income before project ('000)   1189.2200   150   859.22329        70.15529
         annual income after project ('000)    1248.1800   150   868.43369        70.90731
Paired Samples Correlations
                                                                  N     Correlation   Sig.
Pair 1   annual income before project ('000) &
         annual income after project ('000)                       150   .998          .000
From the above output it can be said that the average annual household income before
the project was significantly different from that after the project (t = -11.94, P < 0.001).
While the average annual household income before the project (in ‘000 Tsh) was 1189.2,
the corresponding average after the project was 1248.2.
Or we could say;
Average annual household income before the project (1189.2) was significantly different
from that after the project (1248.2), both in ‘000 Tsh (t = -11.94, P < 0.001).
(Note: You can also use commas around the mean/average instead of brackets.)
Note1: The results are also significant for a one-tailed test; in this regard, if our interest
was a one-tailed test we could say that income after the project was significantly higher
than income before the project (i.e. income improved significantly after the project)
(t = -11.94, P < 0.001). While the average annual household income before the project (in
‘000 Tsh) was 1189.2, the corresponding average after the project was 1248.2.
Or we could say;
Average annual household income before the project (1189.2) was significantly lower
than that after the project (1248.2), both in ‘000 Tsh (t = -11.94, P < 0.001).
Note2: Suppose the sig. in the last table were any value above 0.05, such as 0.083; this
would imply that there was no significant difference.
OR
Average annual household income before project and after
project (in ‘000 Tsh.)
Period N Mean SD
Before Project 150 1189.2 859.2
After Project 150 1248.2 868.4
t- value = -11.942, Significance = 0.000 (or P< 0.001).
These tests are concerned with testing the equality of means for more than two groups
(i.e. comparing more than two groups).
Statistical tests under this category employ analysis of variance (ANOVA) to test the
equality of means.
The simplest test under this class is One-way Analysis of Variance (ANOVA I). This
test can be followed by mean separation tests such as Duncan’s multiple range test,
Tukey’s test, the LSD test, etc.
Other ANOVAs include Two-way ANOVA, the Latin Square Design, the Split Plot
Design, Factorial Designs, and ANOVA for repeated measures. These are a bit more
complex, and most of them are very popular in the agricultural sciences.
- Test statistic = F
- The larger the F-ratio, the greater is the difference between groups as compared to within
group differences.
The ANOVA procedure can be used correctly if the following conditions are satisfied:
1. The dependent variable should be interval or ratio data type.
2. The populations should be normally distributed (i.e. parametric) and the population variances
should be equal.
Example: Suppose someone wants to compare average annual household income for
three districts.
Procedures in SPSS: Analyze → Compare Means → One-Way ANOVA → select the
response variable, i.e. the dependent (criterion) variable, and put it in the Dependent List
box → put the grouping variable in the Factor box → Continue → OK.
You can also generate results for mean separation tests by clicking the Post Hoc button
and selecting the test you want, e.g. Duncan.
In the tables below are computer outputs for the null hypothesis that average annual
household income (in ‘000) before the project (income1) is the same for all three districts
under study (i.e. Bahi, Chamwino, Kongwa), vs the alternative hypothesis that there are
differences (or that at least one pair of means differs significantly). The study involved
samples of 45, 56 and 49 rural households from Bahi, Chamwino and Kongwa districts in
Dodoma region, respectively.
ANOVA
From the above output it can be said that results of the ANOVA indicate a significant
overall difference between districts in average annual household income before the
project (F = 22.85, P < 0.001). Recorded in ‘000 Tsh, Kongwa district had the highest
mean value (1785.3), followed by Bahi (953.4), with Chamwino having the lowest mean
value (857.2).
Or
Results indicate that Kongwa district had the highest average annual household income,
followed by Bahi, with Chamwino district having the lowest average. Mean values for
the three districts in ‘000 Tsh were 1785.3, 953.4 and 857.2 for Kongwa, Bahi and
Chamwino, respectively. Results of the ANOVA indicate the overall difference to be
significant (F = 22.85, P < 0.001).
Note: Suppose the sig. in the last table were any value above 0.05, such as 0.146; this
would imply that there was no significant difference.
Note: Mean separation tests are performed when the ANOVA shows a significant overall
difference between samples/treatments.
Homogeneous Subsets
annual income before project ('000)
Duncan(a,b)
District of residence   N     Subset for alpha = .05
                              1           2
Chamwino                56    857.1607
Bahi                    45    953.4000
Kongwa                  49                1785.2857
Sig.                          .527        1.000
Means for groups in homogeneous subsets are displayed.
a. Uses Harmonic Mean Sample Size = 49.597.
b. The group sizes are unequal. The harmonic mean of the group sizes is used. Type I
error levels are not guaranteed.
Reporting is done the same way as for the independent samples t-test; additionally,
similar means carry the same superscript letters, as revealed by mean separation tests
such as Duncan’s:
Example 1:
Table .. Mean annual household income before project by district
District     N     Mean ± S.D
Bahi         45     953.4 ± 927.9 b
Chamwino     56     857.2 ± 509.4 b
Kongwa       49    1785.3 ± 813.4 a
a,b Means with different superscript letters are significantly different (P < 0.05)
S.D = standard deviation
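The one-way ANOVA F in the output can be recomputed from just the per-district summary statistics (N, mean, SD) in the table above, since F is the between-groups mean square over the within-groups mean square:

```python
# One-way ANOVA F recomputed from the district summary statistics above
groups = [
    ("Bahi", 45, 953.4, 927.9),
    ("Chamwino", 56, 857.2, 509.4),
    ("Kongwa", 49, 1785.3, 813.4),
]

n_total = sum(n for _, n, _, _ in groups)
grand_mean = sum(n * mean for _, n, mean, _ in groups) / n_total

# Between-groups and within-groups sums of squares
ss_between = sum(n * (mean - grand_mean) ** 2 for _, n, mean, _ in groups)
ss_within = sum((n - 1) * sd ** 2 for _, n, _, sd in groups)

df_between = len(groups) - 1
df_within = n_total - len(groups)

f = (ss_between / df_between) / (ss_within / df_within)
print(round(f, 2))    # close to the reported F = 22.85
```

The tiny discrepancy from the printed output, if any, comes only from the rounding of the means and SDs in the table.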
Example 2: A case of profit from various sources among dairy farmers in Kayanga ward,
Karagwe district.
Procedures in SPSS:
Performed through the Univariate Analysis of Variance option.
Examples: the Pearson chi-square test for independence, McNemar’s test (for paired
dichotomous data, i.e. two related samples), Cochran’s Q test (for three or more related
samples), the Mantel-Haenszel comparison (for 2x2 contingency tables while controlling
for a third variable), Fisher’s exact test, and the LR test.
In this class we will concentrate on the Pearson chi-square test for independence between
two variables.
Note: There is also the chi-square test for goodness of fit, including testing of
homogeneity, sometimes called the one-sample chi-square test. We won’t discuss this.
χ² = Σ (O − E)² / E,
where O = observed frequency and E = expected frequency.
Note:
No one-sided tests here!
Notice that the alternative hypotheses above do not assume any “direction.” Thus, there
are no one- and two-sided versions of these tests. Chi-square tests are inherently non-
directional (“sort of two-sided”) in the sense that the chi-square test simply tests whether
the observed and expected frequencies agree, without regard to whether particular
observed frequencies are above or below the corresponding expected frequencies.
Procedures in SPSS: Analyze → Descriptive Statistics → Crosstabs → select the variable
of interest and put it in the Row(s) box → select the other variable of interest and put it in
the Column(s) box → Cells (specify how percentages should be computed, i.e. within
column or within row? Hint: compute % within the groups you want to compare) →
Statistics → Chi-square → Continue → OK.
Example;
In the tables below are computer outputs for the null hypothesis that there is no
association between sex of household head and access to credit, vs the alternative
hypothesis that there is an association. The study involved 150 rural households of
Dodoma.
Crosstabs
sex of household head * Access to credit Crosstabulation
                                         received credit   not received credit   Total
sex      male     Count                         79                13                92
                  % within Access to credit   78.2%             26.5%             61.3%
         female   Count                         22                36                58
                  % within Access to credit   21.8%             73.5%             38.7%
Total             Count                        101                49               150
                  % within Access to credit  100.0%            100.0%            100.0%
Note1: In the contingency table above we computed % within credit status (received vs
not received). However, if we had computed % within sex we would have the output
below (results for the statistical test, i.e. chi-square, would be the same as in the previous
analysis and are therefore not presented), and we could interpret the results as follows.
Crosstabs
sex of household head * Access to credit Crosstabulation
                                                   received credit   not received credit   Total
sex of household head   male     Count                    79                13                92
                                 % within sex           85.9%             14.1%            100.0%
                        female   Count                    22                36                58
                                 % within sex           37.9%             62.1%            100.0%
Total                            Count                   101                49               150
                                 % within sex           67.3%             32.7%            100.0%
From the above output, based on the Pearson chi-square, it can be said that there was a
relationship between sex of household head and access to credit (χ² = 37.17, P < 0.001),
in which more male-headed households had access to credit than female-headed
households (85.9% vs 37.9%).
Note2: Suppose the sig. for the Pearson chi-square in the last table were any value above
0.05, such as 0.347; this would imply that there was no association.
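The chi-square value in the output can be recomputed from the 2x2 crosstab counts above, using the formula χ² = Σ (O − E)² / E with expected frequencies E = (row total × column total) / grand total:

```python
# Pearson chi-square recomputed from the 2x2 crosstab above
# (rows: male/female; columns: received credit / not received credit)
observed = [[79, 13],
            [22, 36]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total   # expected frequency
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2))    # about 37.17, matching the reported value
```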
The percentage of respondents who reported having received or given gifts or money in
exchange for sex was higher for females (21.6%) than for males (16.0%). However, this
difference was not statistically significant (χ² = 0.91, P = 0.341).
Table ..: Proportion of respondents discussing with their parents about sexuality and
early pregnancies
The majority of female respondents reported that their parents/guardians discuss with
them the impact of pregnancies and sexuality. The percentage of females who discuss
with their parents/guardians the impacts of early pregnancies and sexuality (73%) was
higher than that for males (32%). The difference between male and female respondents
about
Note3: The chi-square test cannot be applied when more than 20% of cells have an expected frequency of less than 5, or when any cell has an expected frequency of 1 (or less than 1).
Under this situation we may be required to collapse (merge) some categories.
Note4: Unlike correlation coefficients, chi-square does not convey information about the strength of a relationship. By strength is meant that a large chi-square value with a correspondingly strong significance level (e.g. P < 0.001) cannot be taken to mean a closer relationship between two variables than when chi-square is considerably smaller but moderately significant (e.g. P < 0.05). What chi-square tells us is how confident we can be that there is a relationship between the two variables.
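As a cross-check of the chi-square values reported above, the Pearson statistic can be computed by hand from the observed counts. The sketch below is plain Python, not SPSS, and the function name is illustrative:

```python
# Pure-Python check of the Pearson chi-square for the sex-by-credit crosstab above.
# This is an illustrative helper, not part of SPSS.

def pearson_chi_square(table):
    """Pearson chi-square for an r x c contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # E = row total * col total / N
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Observed counts: rows = male/female, cols = received / not received credit
observed = [[79, 13], [22, 36]]
print(round(pearson_chi_square(observed), 2))  # -> 37.17
```

The result agrees with the χ² = 37.17 interpreted above. The same function also exposes the expected frequencies needed to check the "20% of cells below 5" rule in Note3.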
Already shown in some tables above. However, the standard format for most reports is as below;
Crosstabs

Crosstabulation of background factors by "if she ever used modern contraceptives (MC) - is she a current user"; cell entries are Count (% within current-use status)

Factor / Category                               Yes              No               Total
Age of woman
  30 and below                                  32 (64.0%)       63 (57.3%)       95 (59.4%)
  > 30                                          18 (36.0%)       47 (42.7%)       65 (40.6%)
Education level - woman
  Primary or below                              30 (60.0%)       105 (95.5%)      135 (84.4%)
  Secondary and above                           20 (40.0%)       5 (4.5%)         25 (15.6%)
Religious affiliation
  Catholic                                      6 (12.0%)        19 (17.3%)       25 (15.6%)
  Protestant                                    40 (80.0%)       89 (80.9%)       129 (80.6%)
  Moslem                                        4 (8.0%)         2 (1.8%)         6 (3.8%)
Type of marriage
  Monogamy                                      44 (88.0%)       97 (88.2%)       141 (88.1%)
  Polygamy                                      6 (12.0%)        13 (11.8%)       19 (11.9%)
Current number of living children
  3 and below                                   15 (30.0%)       93 (84.5%)       108 (67.5%)
  4 and above                                   35 (70.0%)       17 (15.5%)       52 (32.5%)
Ethnicity
  Sukuma                                        30 (60.0%)       64 (58.2%)       94 (58.8%)
  Others                                        20 (40.0%)       46 (41.8%)       66 (41.3%)
If frequently discuss with husband on Family Planning (spousal communication)
  Yes                                           21 (42.0%)       5 (4.5%)         26 (16.3%)
  No                                            29 (58.0%)       105 (95.5%)      134 (83.8%)
If husband approves modern contraceptives (MC)
  Yes                                           45 (90.0%)       50 (45.5%)       95 (59.4%)
  No                                            5 (10.0%)        60 (54.5%)       65 (40.6%)
Total                                           50 (100.0%)      110 (100.0%)     160 (100.0%)

[The Chi-Square Tests tables accompanying each crosstab in the SPSS output are not reproduced here.]
Interpretation/reporting of results
Results from Table.. indicate that there was a significant association between several factors considered in this study and being a current user of modern contraceptives by a woman. Current use of modern contraceptives was significantly associated with education level (χ² = 32.78, P < 0.001), current number of living children (χ² = 46.62, P < 0.001) and husband's approval of modern contraceptives (χ² = 28.28, P < 0.01). The effects of the other variables on current use of modern contraceptives considered in this analysis were not significant (P > 0.05). These include age, religious affiliation, ethnicity and type of marriage.
Table..: Distribution of household heads by age, marital status and education level in the two types of households

Characteristic               Dairy farmers   Non-dairy farmers   Overall    χ²
Age (years)                                                                 0.20 NS
  < 35                       13.2%           13.3%               13.2%
  35-50                      55.3%           60.0%               57.4%
  51+                        31.6%           26.7%               29.4%
Marital status                                                              2.76 NS
  Married                    92.1%           83.3%               88.2%
  Single                     0.0%            6.7%                2.9%
  Widow                      7.9%            10.0%               8.8%
Education level                                                             2.21 NS
  No formal education        2.6%            0.0%                1.5%
  Primary education          65.8%           80.0%               72.1%
  Secondary education        18.4%           13.3%               16.2%
  College and above          13.2%           6.7%                10.3%
NS = Non-significant (P > 0.05)
Results indicate that the distributions of household heads by sex, age, marital status and education level in the two types of households (i.e. dairy farmers and non-dairy farmers) were not significantly different (P > 0.05). The majority of respondents in both groups (i.e. more than 50%) were male, aged between 35 and 50 years, married, and had primary education.
Exercise
Using the file "Workshop SPSS working file1", test if there is a significant association between engagement in off-farm activities (a dependent variable) and sex (gender), marital status and district of residence of a respondent (independent variables).
Since chi-square does not by itself provide an estimate of the magnitude of association between two attributes, an obtained chi-square value may be converted into the phi coefficient:

φ = √(χ² / N)

Phi is useful for 2 x 2 contingency tables (esp. nominal-by-nominal contingency tables).
A chi-square value may also be converted into the coefficient of contingency (C), especially in the case of a contingency table of higher order than a 2 x 2 table, to study the magnitude of the relation (degree of association) between two attributes:

C = √(χ² / (χ² + N))
As with the contingency coefficient (C), Cramér's V is also used for tables of higher order, i.e. where the number of both rows and columns is greater than 2.
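A minimal sketch of these conversions in Python, using the chi-square value (37.17) and sample size (N = 150) from the sex-by-credit example above; the function names are illustrative:

```python
import math

# Converting a chi-square value into measures of association.
# chi2 = 37.17 and N = 150 are taken from the sex-by-credit example above.

def phi_coefficient(chi2, n):
    return math.sqrt(chi2 / n)                   # phi = sqrt(chi2 / N), for 2 x 2 tables

def contingency_coefficient(chi2, n):
    return math.sqrt(chi2 / (chi2 + n))          # C = sqrt(chi2 / (chi2 + N))

def cramers_v(chi2, n, rows, cols):
    # Cramér's V generalises phi to tables larger than 2 x 2
    return math.sqrt(chi2 / (n * min(rows - 1, cols - 1)))

print(round(phi_coefficient(37.17, 150), 3))          # -> 0.498
print(round(contingency_coefficient(37.17, 150), 3))  # -> 0.446
```

For a 2 x 2 table Cramér's V reduces to phi, which is why phi is the usual choice there.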
Significance level: tells us whether an obtained coefficient (e.g. a low correlation) has arisen by chance (i.e. sampling error) or actually exists in the population from which the sample was selected.
H1: There is significant positive correlation between two variables under study i.e.
correlation coefficient is significantly above zero (for right tailed test);
Or
H1: There is significant negative correlation between two variables under study i.e.
correlation coefficient is significantly below zero (for left tailed test);
t = r√(n − 2) / √(1 − r²), with n − 2 degrees of freedom
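The t statistic above can be checked numerically; for instance, r = 0.203 with n = 150 (from the output that follows) gives t ≈ 2.52, consistent with its two-tailed P of 0.013. A small Python sketch (function name illustrative):

```python
import math

# t statistic for testing whether a Pearson correlation differs from zero:
# t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom.

def t_for_correlation(r, n):
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(round(t_for_correlation(0.203, 150), 2))  # -> 2.52
print(round(t_for_correlation(0.693, 150), 2))  # -> 11.69
```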
Example;
Below is the output for the null hypothesis that annual household income before project was not significantly correlated with age of respondent, farm size, and number of cattle owned vs the alternative hypothesis that they are correlated.
Correlations

Pearson Correlation (Sig. 2-tailed in brackets); N = 150 for all cells

                                        Age of          Annual income     Farm size       Number of
                                        household head  before            (acres)         cattle owned
                                                        project ('000)
Age of household head                   1               .203* (.013)      .175* (.032)    .265** (.001)
Annual income before project ('000)     .203* (.013)    1                 .693** (.000)   .739** (.000)
Farm size (acres)                       .175* (.032)    .693** (.000)     1               .617** (.000)
Number of cattle owned                  .265** (.001)   .739** (.000)     .617** (.000)   1
*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
Results from Table.. reveal (indicate/show) that annual household income before project
was significantly correlated with age of household head (r = 0.203, P = 0.013), farm size
( r = 0.693, P = 0.000), and number of cattle owned (r = 0.739, P = 0.000).
Or
Results from Table.. reveal that annual household income before project was significantly
correlated with age of household head (r = 0.203, P <0.05), farm size ( r = 0.693, P <
0.001), and number of cattle owned (r = 0.739, P <0.001).
Note1;
The above output was for a two-tailed test. However, if your interest was a one-tailed test, you could command the computer to produce results for a one-tailed test before clicking OK, or divide the P-values in the above output by 2.
The output if the computer is commanded to produce results for a one-tailed test would be as below.
Note2: The computer produces output for all possible comparisons; please stick to your pre-determined comparisons.
Note3: Results above the diagonal and those below the diagonal are the same. Therefore, use one side of the diagonal for interpretation to avoid confusion.
Pearson Correlation (Sig. 1-tailed in brackets); N = 150 for all cells

                                        Age             Annual income     Farm size       Number of
                                                        before            (acres)         cattle owned
                                                        project ('000)
Age                                     1               .203** (.006)     .175* (.016)    .265** (.001)
Annual income before project ('000)     .203** (.006)   1                 .693** (.000)   .739** (.000)
Farm size (acres)                       .175* (.016)    .693** (.000)     1               .617** (.000)
Number of cattle owned                  .265** (.001)   .739** (.000)     .617** (.000)   1
**. Correlation is significant at the 0.01 level (1-tailed).
*. Correlation is significant at the 0.05 level (1-tailed).
Interpretation/reporting of results
Results from Table.. indicate (reveal/show) that annual household income before project
was significantly positively correlated with age of household head (r = 0.203, P = 0.006),
farm size ( r = 0.693, P = 0.000), and number of cattle owned (r = 0.739, P = 0.000).
Or
Results from Table.. indicate that annual household income before project was
significantly positively correlated with age of household head (r = 0.203, P <0.01), farm
size ( r = 0.693, P < 0.001), and number of cattle owned (r = 0.739, P <0.001).
Note:
If we have a +ve Pearson correlation coefficient, it implies that the variables are positively correlated (i.e. an increase in X1 is associated with an increase in X2), AND if we have a −ve Pearson correlation coefficient, it implies that the variables are negatively correlated (i.e. an increase in X1 is associated with a decrease in X2).
- Correlation coefficient for ordinal-by-ordinal variables, and for when the conditions for using Pearson's correlation analysis (i.e. normal distribution, continuous variables) are not met.
- Rank correlation = Spearman's rho (ρ) (or R) and Kendall's tau (τ). The variables under study are categorical (ordinal). The commonly used coefficient is Spearman's rho (ρ). However, Kendall's tau (τ) is preferred when there is a large proportion of tied ranks.
- Kendall's tau (τ)-a is for ungrouped data; Kendall's tau (τ)-b is for grouped data.
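For untied ranks, Spearman's rho can be computed from the rank differences as ρ = 1 − 6Σd² / (n(n² − 1)). A small Python sketch with illustrative ranks (not data from the workshop file):

```python
# Spearman's rho for untied ranks: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
# where d is the difference between the two ranks given to each item.

def spearman_rho(rank_x, rank_y):
    n = len(rank_x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Illustrative rankings of five items by two methods (e.g. interview vs I.Q. test)
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # -> 0.8
```

A rho of 1 means the two rankings agree perfectly; values near 0 mean no agreement.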
Ha: There is significant positive correlation between two variables under study i.e.
correlation coefficient is significantly above zero (for right tailed test);
Or
Ha: There is significant negative correlation between two variables under study i.e.
correlation coefficient is significantly below zero (for left tailed test);
Procedures in SPSS
Exercise
Using the same file (Workshop SPSS working file2), Compare ranking between
interview method and Hopkin’s I.Q test method
Regression can also be linear (when there is a linear relationship between Y and the X(s)) or non-linear (when there is a non-linear relationship between Y and the X(s)).
Note: we have linear regression when the dependent variable (Y) is continuous, and we have non-linear regression when the dependent variable is categorical, i.e. binary.
Yi = α + βXi + εi
Where;
Y = the dependent variable
X = the independent (explanatory) variable
α = the intercept on the Y axis (a regression constant)
β = the regression coefficient (slope)
Note1:
α is the value of Y when the value of X is zero; and β is the amount of change in Y when X is increased by one unit.
Note2: εi is the effect of all other variables not included in the model (sometimes denoted as ui).
Note3: Sometimes we can denote the regression constant as β0 instead of α.
The estimated (sample) regression equation is written as
Yi = a + bXi + ei, or equivalently Yi = α̂ + β̂Xi + ei
a and b are estimated using the Least Squares Method or the Maximum Likelihood approach. However, the Least Squares Method is the common one.
Once you have values for a and b you can estimate a value of Y for a given value of X
Linear Regression
[Scatter plot of household annual income ('000) against farm size (acres), with fitted line:
household annual income ('000) = 411.40 + 194.65 × fsize; R-Square = 0.48]
OR
INCOME1i = β0 + β1FSIZEi + εi
Whereby;
INCOME1 = annual household income before project ('000 Tsh)
FSIZE = Farm size (acres)
β0 = Regression constant
β1 = Regression coefficient
ε = Error term
OR
Yi = α + βXi + εi
Whereby; Yi = Annual household income before project ('000 Tsh)
Xi = Farm size (acres)
α = Regression constant
β = Regression coefficient
ε = Error term
OR
Yi = β0 + β1Xi + εi
Whereby; Yi = Annual household income before project ('000 Tsh)
Xi = Farm size (acres)
β0 = Regression constant
β1 = Regression coefficient
ε = Error term
Divides total variation into its components, i.e. variation due to regression (due to the X included in the equation) and variation due to the residual. It tests the significance of the model.
Hypotheses;
Ho: Independent variable (X) has no significant influence on dependent variable (Y)
Ha: Independent variable (X) has significant influence on dependent variable (Y)
R² = ESS / TSS
(where ESS is the sum of squares due to regression and TSS the total sum of squares)
Note; in the literature there are several versions of the formula for R²; however, the above formula is the simplest one.
Hypotheses
Ho: β = 0 (The independent variable (X) has no significant influence on the dependent variable (Y))
H1: β ≠ 0 (The independent variable (X) has a significant influence on the dependent variable (Y))
Note:
For the case of one independent variable, you can choose either the F test (ANOVA) or the t-test to study the effect of X on Y (the above case), as they lead to the same conclusion.
OK.
Example;
The following is the output from SPSS regression analysis for the null hypothesis that income before project (INCOME1) of a household was not influenced by farm size vs the alternative hypothesis that it was.
File name; Workshop SPSS working file1
INCOME1i = α + βFSIZEi + εi
Whereby; INCOME1 = Annual household income before project ('000 Tsh)
FSIZE = Farm size (acres)
α = Regression constant
β = Regression coefficient
ε = Error term
SPSS output
Regression
Variables Entered/Removed(b)
Model 1: Variables Entered: farm size (acres)(a); Method: Enter
a. All requested variables entered.
b. Dependent Variable: annual income before project ('000)
Model Summary
[Model Summary table not reproduced in this extract; R Square = 0.48]
ANOVA(b)
Model 1            Sum of Squares   df    Mean Square    F         Sig.
  Regression       52780721         1     52780720.96    136.516   .000(a)
  Residual         57220713         148   386626.438
  Total            1.10E+08         149
a. Predictors: (Constant), farm size (acres)
b. Dependent Variable: annual income before project ('000)
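The R² and F value implied by this ANOVA table can be checked from the printed sums of squares; a small Python sketch:

```python
# Checking the ANOVA table above: R^2 = ESS / TSS and F = MS_regression / MS_residual,
# using the sums of squares and degrees of freedom printed by SPSS.

ess, rss = 52780721, 57220713      # regression and residual sums of squares
df_reg, df_res = 1, 148

tss = ess + rss                    # total sum of squares
r_squared = ess / tss
f_value = (ess / df_reg) / (rss / df_res)

print(round(r_squared, 2))   # -> 0.48
print(round(f_value, 3))     # -> 136.516
```

This reproduces both the R² of 0.48 quoted in the interpretation and the F of 136.516 in the table.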
Coefficients(a)
                      Unstandardized Coefficients   Standardized Coefficients
Model 1               B          Std. Error         Beta          t        Sig.
  (Constant)          358.792    87.344                           4.108    .000
  farm size (acres)   192.081    16.440             .693          11.684   .000
a. Dependent Variable: annual income before project ('000)
Interpretation/reporting of results
Results indicate farm size was a good predictor of annual household income before project. About 48% of variations in income were due to variations in farm size (R² = 0.48). Furthermore, results indicate farm size was significantly positively associated with annual household income before project (t = 11.684, P < 0.001).
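Using the coefficients above, a predicted income can be computed from the fitted line Ŷ = 358.792 + 192.081 × FSIZE (in '000 Tsh). The farm size of 5 acres below is purely illustrative:

```python
# Prediction from the fitted simple regression line in the Coefficients table above:
# income ('000 Tsh) = 358.792 + 192.081 * farm size (acres).

a, b = 358.792, 192.081  # intercept and slope from the SPSS output

def predict_income(fsize):
    return a + b * fsize

print(round(predict_income(5), 1))  # -> 1319.2
```

That is, a household with 5 acres would be predicted to earn about 1,319,200 Tsh per year before the project.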
Note:
If we have a +ve coefficient (i.e. +β), it implies that there is a positive relationship between X and Y (i.e. an increase in X is associated with an increase in Y), AND if we have a −ve coefficient (i.e. −β), it implies that there is a negative relationship between X and Y (i.e. an increase in X is associated with a decrease in Y).
It is a multivariate analysis.
Yi = α + β1X1i + β2X2i + ... + βpXpi + εi
Whereby;
Yi = the dependent variable
X1i ... Xpi = independent variables
α = a regression constant
β1 ... βp = regression coefficients
εi = random error (disturbance term)
Basically,
α = the value of Y when all Xs (all independent variables) are zero (0)
β1 = the amount of change in Y when X1 is increased by one unit while the other independent variables are held constant (i.e. held at their mean); β2 = the amount of change in Y when X2 is increased by one unit while the other independent variables are held constant (i.e. held at their mean); etc.
εi = the effect of all other variables NOT included in the model
Yi = a + b1X1i + b2X2i + ... + bpXpi + ei
Or
- You can study the effect of several independent variables collectively;
- You can study the effect of a specific independent variable while controlling for the effect of other variable(s) (i.e. confounders).
INCOME1i = α + β1AGEi + β2EDUC2i + β3NCATTLEi + β4FSIZEi + εi
Whereby;
INCOME1 = annual household income before project ('000 Tsh)
AGE = Age (years)
EDUC2 = Education level (years in school)
NCATTLE = Number of cattle owned
FSIZE = Farm size (acres)
α = Regression constant
β1 ... β4 = Regression coefficients
ε = Error term
OR
INCOME1i = β0 + β1AGEi + β2EDUC2i + β3NCATTLEi + β4FSIZEi + εi
Whereby;
INCOME1 = annual household income before project ('000 Tsh)
AGE = Age (years)
EDUC2 = Education level (years in school)
NCATTLE = Number of cattle owned
FSIZE = Farm size (acres)
β0 = Regression constant
β1 ... β4 = Regression coefficients
ε = Error term
OR
Yi = α + β1X1i + β2X2i + β3X3i + β4X4i + εi
Whereby;
Y = annual household income before project ('000 Tsh)
X1 = Age (years)
X2 = Education level (years in school)
X3 = Number of cattle owned
X4 = Farm size (acres)
α = Regression constant
β1 ... β4 = Regression coefficients
ε = Error term
OR
Yi = β0 + β1X1i + β2X2i + β3X3i + β4X4i + εi
Whereby;
Y = annual household income before project ('000 Tsh)
X1 = Age (years)
X2 = Education level (years in school)
X3 = Number of cattle owned
X4 = Farm size (acres)
β0 = Regression constant
β1 ... β4 = Regression coefficients
ε = Error term
Y = f(X1, X2, X3, X4, ε)
Whereby;
Y = annual household income before project ('000 Tsh)
X1 = Age (years)
X2 = Education level (years in school)
X3 = Number of cattle owned
X4 = Farm size (acres)
ε = Error term
Hypotheses;
Ho: The independent variables included in the model collectively do not significantly influence the dependent variable (Y)
Ha: The independent variables included in the model collectively have a significant influence on the dependent variable (Y)
R² = ESS / TSS
Hypotheses:
Ho: β1 = β2 = ... = βp = 0
Ha: at least one βj ≠ 0
Or
Ho: All regression coefficients are not significantly different from zero (i.e. no
relationship)
Ha: At least one regression coefficient is significantly different from zero (i.e. at least one
X has relationship with Y)
Or
Ho: There is no relationship between the dependent variable and the variables included in the model / or the dependent variable is not significantly influenced by the independent variables included in the model.
Ha: There is a relationship between the dependent variable and at least one independent variable included in the model / or the dependent variable is significantly influenced by at least one independent variable included in the model.
Note: The above hypotheses are two-tailed tests and rather general. What about the effect of specific independent variables? To be more specific, depending on the literature or existing information, you may indicate the expected change for each independent variable (after specifying the model or when defining the independent variables of the model) and conduct a one-tailed test for some independent variables and a two-tailed test for others.
Example; consider a study on factors influencing adoption of agriculture technology for
Irish potato farming;
Procedures in SPSS: Analyze → Regression → Linear → choose a dependent variable and put it to the Dependent box → choose independent variables and put them to the Independent(s) box → OK
Example;
The following is the output from SPSS regression analysis for the null hypothesis that income before project of a household (INCOME1) was not influenced by age of respondent (years) (AGE), education level (years in school) (EDUC2), number of cattle owned (NCATTLE) and farm size (acres) (FSIZE) vs the alternative hypothesis that it was influenced by these variables.
INCOME1i = α + β1AGEi + β2EDUC2i + β3NCATTLEi + β4FSIZEi + εi
Whereby;
INCOME1 = Annual household income before project ('000 Tsh)
AGE = Age (years)
EDUC2 = Education level (years in school)
NCATTLE = Number of cattle owned
FSIZE = Farm size (acres)
α = Regression constant
β1 ... β4 = Regression coefficients
ε = Error term
SPSS Output
Regression
Variables Entered/Removed(b)
Model 1: Variables Entered: farm size (acres), age (years), education level (years in school), number of cattle owned(a); Method: Enter
a. All requested variables entered.
b. Dependent Variable: annual income before project ('000 Tsh)
ANOVA(b)
Model 1            Sum of Squares   df    Mean Square    F        Sig.
  Regression       78682359         4     19670589.81    91.070   .000(a)
  Residual         31319074         145   215993.617
  Total            1.10E+08         149
a. Predictors: (Constant), farm size (acres), age (years), education level (years in school), number of cattle owned
b. Dependent Variable: annual income before project ('000 Tsh)
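The printed F value can be checked from the sums of squares. Note that the R² of 0.71 quoted in the report tables that follow matches the adjusted R², while the unadjusted R² is about 0.715; a small Python sketch:

```python
# Checking the multiple-regression ANOVA table above from its sums of squares.

ess, rss = 78682359, 31319074      # regression and residual sums of squares
df_reg, df_res = 4, 145
n, p = 150, 4                      # sample size and number of predictors

tss = ess + rss
r_squared = ess / tss
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
f_value = (ess / df_reg) / (rss / df_res)

print(round(r_squared, 3))      # -> 0.715
print(round(adj_r_squared, 2))  # -> 0.71
print(round(f_value, 2))        # -> 91.07
```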
Coefficients(a)
                                      Unstandardized Coefficients   Standardized Coefficients
Model 1                               B           Std. Error        Beta          t        Sig.
  (Constant)                          -478.897    219.182                         -2.185   .031
  age (years)                         -2.743      4.342             -.029         -.632    .529
  education level (years in school)   114.006     18.021            .422          6.326    .000
  number of cattle owned              9.673       3.229             .219          2.996    .003
  farm size (acres)                   96.609      15.687            .348          6.159    .000
a. Dependent Variable: annual income before project ('000 Tsh)
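Each t value in the Coefficients table is simply B divided by its standard error; recomputing them from the rounded B and SE values reproduces the printed t values to within rounding. A small Python sketch (labels shortened):

```python
# t = B / SE for each term in the Coefficients table above.
# B and SE are the (rounded) values printed by SPSS, so recomputed t values
# may differ from the printed ones in the last decimal place.

coefficients = {
    "(Constant)":        (-478.897, 219.182),
    "age (years)":       (-2.743, 4.342),
    "education (years)": (114.006, 18.021),
    "cattle owned":      (9.673, 3.229),
    "farm size (acres)": (96.609, 15.687),
}

for name, (b, se) in coefficients.items():
    print(name, round(b / se, 2))
```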
Table:… Multiple Linear Regression Analysis for factors influencing annual household income before project (in '000 Tsh), a dependent variable

Independent variable                  B          Standard Error (SE)   t-value   Sig.
Constant                              -478.90    219.18                -2.19     0.031
Age (years)                           -2.74      4.34                  -0.63     0.529
Education level (years in school)     114.01     18.02                 6.33      0.000
Number of cattle owned                9.67       3.23                  3.00      0.003
Farm size (acres)                     96.61      15.69                 6.16      0.000
R² = 0.71; F-value = 91.07, P < 0.001
Table:.. Multiple Linear Regression Analysis for factors influencing annual household income before project (in '000 Tsh), a dependent variable

Independent variable                  B          Standard Error (SE)   t-value
Constant                              -478.90    219.18                -2.19*
Age (years)                           -2.74      4.34                  -0.63 NS
Education level (years in school)     114.01     18.02                 6.33***
Number of cattle owned                9.67       3.23                  3.00**
Farm size (acres)                     96.61      15.69                 6.16***
R² = 0.71; F-value = 91.07, P < 0.001; NS = Non-significant, * = Significant at P < 0.05, ** = Significant at P < 0.01; *** = Significant at P < 0.001
OR
Table:… Multiple Linear Regression Analysis for factors influencing annual household income before project (in '000 Tsh) (INCOME1), a dependent variable

Independent variable                  B          Standard Error (SE)   t-value
Constant                              -478.90    219.18                -2.19*
AGE (years)                           -2.74      4.34                  -0.63 NS
EDUC2 (years in school)               114.01     18.02                 6.33***
NCATTLE                               9.67       3.23                  3.00**
FSIZE (acres)                         96.61      15.69                 6.16***
R² = 0.71; F-value = 91.07, P < 0.001; NS = Non-significant, * = Significant at P < 0.05, ** = Significant at P < 0.01; *** = Significant at P < 0.001
Results from Table.. indicate the independent variables included in the model were good predictors of annual household income before project. About 71% of variations in annual household income before project were due to variations in the independent variables included in the model. Results further indicate that the independent variables included in the model collectively had a significant influence on annual household income before project (F = 91.07, P < 0.001). Results for the t-test indicate annual household income before project had a significant relationship with education level (t = 6.33, P < 0.001), number of cattle owned (t = 3.00, P < 0.01) and farm size (t = 6.16, P < 0.001). The effect of age was not significant (t = −0.63, P > 0.05). Increases in education level, number of cattle owned and farm size were associated with increased income before project.
Or
Results from Table.. indicate the independent variables included in the model were good predictors of annual household income before project. About 71% of variations in annual household income before project were due to variations in the independent variables included in the model. Results further indicate that the independent variables included in the model collectively had a significant influence on annual household income before project (F = 91.07, P < 0.001). Results for the t-test indicate annual household income before project was significantly positively related to education level (t = 6.33, P < 0.001), number of cattle owned (t = 3.00, P < 0.01) and farm size (t = 6.16, P < 0.001). Results also reveal that income was negatively related to age of respondent; however, the relationship with age was not statistically significant (t = −0.63, P > 0.05).
Note:
If we have a +ve coefficient (i.e. +β), it implies that there is a positive relationship between X and Y (i.e. an increase in X is associated with an increase in Y), AND if we have a −ve coefficient (i.e. −β), it implies that there is a negative relationship between X and Y (i.e. an increase in X is associated with a decrease in Y).
Exercise
Using the same file (Workshop SPSS working file1), Re-run (re-analyse) the above
information by dropping (removing) a variable “number of cattle owned” from the
model. Interpret your results
In the above examples for regression analysis we had a continuous dependent variable and continuous explanatory variable(s).
When we have a continuous dependent variable and some of the independent variables are categorical, we have to transform those variables into dummy variables (how?) and then run least-squares linear regression analysis as usual. This facilitates interpretation.
Example 1;
A categorical variable "Sex of respondent (SEX)" with two categories (Male and Female) can be coded as a dummy variable as follows;
SEX: 1 = if Male; 0 = otherwise (or, labelling directly: 1 = Male; 0 = Female)
Example 2;
A categorical variable "If engaging in off-farm activities (OFFA)" with two categories (Yes and No) can be coded as a dummy variable as follows;
OFFA: 1 = if Yes; 0 = otherwise (or, labelling directly: 1 = Yes; 0 = No)
Note: strictly speaking, it is more sensible to use "otherwise" when we generate dummy variables for a categorical variable with more than two categories (see the next example). Therefore, for the above examples (when we generate a dummy for a categorical variable with two categories), the format in brackets is more appealing (label the categories directly!). Use that format!
2. Example of dummy variables for categorical variables with more than two categories
For a categorical variable "District of residence" with three categories (Chamwino, Bahi and Kongwa) we will have two dummy variables, with one district chosen as the base (omitted category). Suppose we omit Chamwino district; we will have two dummy variables:
Dummy 1 (Bahi): 1 = if Bahi; 0 = otherwise
Dummy 2 (Kongwa): 1 = if Kongwa; 0 = otherwise
(However, the example given in this handout considered categorical variables with two categories.)
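The district dummy coding described above can be sketched in Python; the dummy variable names BAHI and KONGWA are illustrative, with Chamwino as the omitted base category:

```python
# Dummy coding for "District of residence" (Chamwino, Bahi, Kongwa),
# with Chamwino as the omitted (base) category. Variable names are illustrative.

def district_dummies(district):
    return {
        "BAHI":   1 if district == "Bahi" else 0,    # 1 = if Bahi; 0 = otherwise
        "KONGWA": 1 if district == "Kongwa" else 0,  # 1 = if Kongwa; 0 = otherwise
    }

print(district_dummies("Bahi"))      # -> {'BAHI': 1, 'KONGWA': 0}
print(district_dummies("Chamwino"))  # -> {'BAHI': 0, 'KONGWA': 0}
```

A Chamwino household is represented by both dummies being 0, which is why one category must always be omitted.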
Note1: Depending on required information AND for simplicity, other authors may
collapse responses of a categorical variable with more than two categories into just two
categories coded in dummy form (i.e. one dummy variable)
If we have a +ve coefficient (i.e. +β), it implies that the category coded as 1 is associated with an increase in Y (the dependent variable); AND if the coefficient is −ve (i.e. −β), the category coded as 1 is associated with a decrease in Y (the dependent variable).
Example;
The following is a multiple linear regression analysis for the null hypothesis that annual household income (INCOME) (in '000 Tsh) is NOT influenced by age of respondent (AGE) (in years), sex of respondent (SEX), education level (EDUC2) (in years in school), engagement in off-farm activities (OFFA), number of cattle owned (NCATTLE), and farm size (FSIZE) (in acres) vs (against) the alternative hypothesis that it is influenced by those variables.
File name: Workshop SPSS working file3
In this analysis, the independent variables SEX and OFFA were categorical variables and were coded as dummies as follows: SEX: 1 = Male; 0 = Female. OFFA: 1 = Yes; 0 = No.
SPSS output
Regression
Variables Entered/Removed(b)
Model 1: Variables Entered: farm size (acres), age of household head, If engaged in off-farm activities for income, sex of household head, number of cattle owned, education level (years in school)(a); Method: Enter
a. All requested variables entered.
b. Dependent Variable: household annual income ('000)
ANOVA(b)
Model 1            Sum of Squares   df    Mean Square    F        Sig.
  Regression       82912499         6     13818749.82    64.680   .000(a)
  Residual         30551822         143   213649.106
  Total            1.13E+08         149
a. Predictors: (Constant), farm size (acres), age of household head, If engaged in off-farm activities for income, sex of household head, number of cattle owned, education level (years in school)
b. Dependent Variable: household annual income ('000)
Coefficients(a)
                                               Unstandardized Coefficients   Standardized Coefficients
Model 1                                        B           Std. Error        Beta      t        Sig.
  (Constant)                                   -340.252    221.940                     -1.533   .127
  age of household head                        -3.834      4.350             -.040     -.881    .380
  sex of household head                        320.241     102.803           .177      3.115    .002
  education level (years in school)            94.832      20.019            .346      4.737    .000
  If engaged in off-farm activities for income 3.702       97.099            .002      .038     .970
  number of cattle owned                       8.312       3.257             .185      2.552    .012
  farm size (acres)                            94.468      15.653            .335      6.035    .000
a. Dependent Variable: household annual income ('000)
Table:… Multiple Linear Regression Analysis for factors influencing annual household income (in '000 Tsh), a dependent variable

Independent variable                  B          Standard Error (SE)   t-value
Constant                              -340.25    221.94                -1.53 NS
Age of household head (years)         -3.83      4.35                  -0.88 NS
Sex of household head                 320.24     102.80                3.12**
Education level (years in school)     94.83      20.02                 4.74***
If engaged in off-farm activities     3.70       97.10                 0.04 NS
Number of cattle owned                8.31       3.26                  2.55*
Farm size (acres)                     94.47      15.65                 6.04***
R² = 0.72; F-value = 64.68 (P < 0.001); NS = Non-significant, * = Significant at P < 0.05, ** = Significant at P < 0.01; *** = Significant at P < 0.001
Results from Table…. indicate the independent variables included in the model were good predictors of annual household income. About 72% of variations in annual household income were explained by variations in the independent variables included in the model. Results further indicate that the independent variables included in the model collectively had a significant influence (effect) on annual household income (F = 64.68, P < 0.001). Results for the t-test indicate annual household income had a significant relationship with sex of household head (t = 3.12, P < 0.01), education level (t = 4.74, P < 0.001), number of cattle owned (t = 2.55, P < 0.05) and farm size (t = 6.04, P < 0.001). On the other hand, age of respondent and engagement in off-farm activities had no significant influence on annual household income (t = −0.88, P > 0.05 and t = 0.04, P > 0.05, respectively). Being male, and increases in education level, number of cattle owned and farm size, were associated with increased annual household income.
OR
Results from Table.. indicate the independent variables included in the model were good predictors of annual household income. About 72% of variations in annual household income were explained by variations in the independent variables included in the model. Results further indicate that the independent variables included in the model collectively had a significant influence on annual household income (F = 64.68, P < 0.001). Results for the t-test indicate annual household income was significantly positively related to education level (t = 4.74, P < 0.001), number of cattle owned (t = 2.55, P < 0.05), and farm size (t = 6.04, P < 0.001). Results further indicate being male was significantly associated with increased annual household income (i.e. significantly positively related to income) (t = 3.12, P < 0.01). Furthermore, age of respondent was negatively related to annual household income, while engagement in off-farm activities was positively related to annual household income; however, their effects were not significant (t = −0.88, P > 0.05 and t = 0.04, P > 0.05, respectively).
An overview
A linear regression model cannot be applied directly to this situation (a categorical response
variable) because some of its assumptions are violated.
Solution;
Latent variable approach (econometric approach): It assumes that there is an
underlying continuous variable (i.e. an unobserved/latent variable) associated with the
categorical response variable under study, and that we only observe a particular category of
the response variable when the underlying continuous variable is at a particular
level.
The logit link is derived from the logistic distribution (to be explained later) while the probit
link is derived from the cumulative normal distribution.
Basically, the transformation is done so that the functions have linear properties like
those of linear regression and hence are easily solved.
The logit link gives logistic regression. We are going to concentrate on logistic regression
(as it is easier to understand), specifically BINARY LOGISTIC REGRESSION.
In social sciences we frequently encounter response variables with binary responses, i.e.
two categories of response (a BINARY VARIABLE). In this situation, it is logical to
study (examine) the probability of observing a particular category of the response variable (Y)
given particular levels of an independent variable. The relationship between the probability of
observing a particular category of the response variable and particular levels of the
independent variable is NON-LINEAR.
We can model this non-linear relationship using the logistic equation (logit link) or the
cumulative normal distribution equation (probit link). In this handout we are going to use
the logit link.
Suppose we have a dependent variable (Y) with two response categories; 1 if a condition
is observed and 0 if otherwise (not observed), and one independent variable (X).
P(Y=1 | X=x) = e^(α + βx) / (1 + e^(α + βx))

OR

P(Y=1 | X=x) = 1 / (1 + e^−(α + βx))
Whereby;
e = a mathematical constant, with its value approximately equal to 2.718
α = regression constant
β = regression coefficient
(Note: you may encounter in literature different notations for the expected
P(Y=1 | X=x), such as π_x, π, P, P_i, π_i, π(x) and many others. Try to use a
simple one)
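The two displayed forms of the model are algebraically identical. A minimal Python sketch (using made-up values of α, β and x, not estimates from any data in this handout) shows both forms give the same probability:

```python
import math

def p_form1(alpha, beta, x):
    """P(Y=1 | X=x) written as e^(a+bx) / (1 + e^(a+bx))."""
    z = alpha + beta * x
    return math.exp(z) / (1.0 + math.exp(z))

def p_form2(alpha, beta, x):
    """The algebraically equivalent form 1 / (1 + e^-(a+bx))."""
    z = alpha + beta * x
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative (made-up) values: alpha = -10, beta = 0.6, x = 18 years
p1 = p_form1(-10, 0.6, 18)
p2 = p_form2(-10, 0.6, 18)
print(round(p1, 4), round(p2, 4))  # both forms give the same probability
```

Either form can be used; the second avoids computing e^z twice and is the one most software uses internally.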
The main task in the above equation is the estimation of α and β. These can easily be
estimated after transforming the above equation into linear form (i.e. the logit
transformation).
How??
The odds are given by;

P(Y=1 | X=x) / [1 − P(Y=1 | X=x)] = π / (1 − π) = e^(α + βx)

It can be shown that;

g = ln{ P(Y=1 | X=x) / [1 − P(Y=1 | X=x)] } = ln[ π / (1 − π) ] = α + βx
Note: we call g (which can also be written as g(x)) the logit(π), i.e. the log-odds.
α is the value of the logit when the value of X equals zero, and β is the amount of change in the
logit when X is increased by 1 unit (for a continuous independent variable), or when a
categorical X (a categorical independent variable) is changed from the base category to a
particular category.
The logit has some desirable properties like those of linear regression. It is linear in
its parameters, and it may be continuous and may range from −∞ to +∞ depending
on the range of x.
The estimated logit is;

ĝ = ln{ P̂(Y=1 | X=x) / [1 − P̂(Y=1 | X=x)] } = a + bx

OR

ĝ = ln{ P̂(Y=1 | X=x) / [1 − P̂(Y=1 | X=x)] } = α̂ + β̂x
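The logit transformation and its back-transformation can be sketched as below; the coefficient values used are illustrative only, not estimates from the handout's data:

```python
import math

def logit(p):
    """Log-odds: ln(p / (1 - p)) -- linear in alpha and beta."""
    return math.log(p / (1.0 - p))

def inv_logit(g):
    """Back-transform a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-g))

# If g = a + b*x with illustrative estimates a = -12, b = 0.75:
a, b = -12.0, 0.75
for x in (14, 16, 18):
    g = a + b * x  # linear on the logit scale
    print(x, round(g, 2), round(inv_logit(g), 3))

# The two functions are inverses of each other:
print(round(logit(inv_logit(0.4)), 6))
```

This is why the transformed model can be fitted like a linear equation: the right-hand side is linear in the parameters, while probabilities are recovered afterwards with the inverse transform.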
Note2;
During interpretation, b can sometimes be transformed into an ODDS RATIO (OR) to make
the interpretation easier to understand. This is done by computing exp(b).
1. Goodness of fit-test
There are several measures, e.g. Nagelkerke R2, Cox and Snell R2, McFadden R2, etc.
Does the model that includes the variable in question tell us more about the outcome
(or response) variable than a model that does not include the variable? This is the Likelihood
ratio test (LR test).
Note2: G is the test statistic. The larger the G, the more likely we are to reject the null
hypothesis and hence declare that a variable in the model had a significant effect (influence)
on Y.
Hypotheses; Stated in the similar way as that for simple linear regression
The calculation of the log likelihood and the likelihood ratio test are standard features of
all logistic regression software. In the simple case of a single independent variable, we first
fit a model containing only the constant term. We then fit a model containing the
independent variable along with the constant. This gives rise to a new log likelihood (LL).
The Likelihood Ratio (LR) test statistic is obtained by multiplying the difference between
these two values by -2. The result, along with the associated p-value for the chi-
square distribution, may be obtained from most software packages
(For more information refer to Scott and Long, 1998; Hosmer and Lemeshow, 2003;
Agresti, 2007)
Basically;
G = −2 ln(l0 / l1) = −2[ln(l0) − ln(l1)] = −2(L0 − L1)
Where L0 and L1 denote the maximized log-likelihood (LL) functions for a model with only the
intercept and a model with the intercept and a variable (i.e. a model with a variable added),
respectively. Under H0: β = 0, this test statistic (G) has a large-sample chi-square
distribution with df = 1 (i.e. an asymptotic chi-square distribution). Why df = 1? Because we
introduced a single (only one) independent variable.
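The computation of G from the two log-likelihoods can be sketched as below (the log-likelihood values are made up for illustration, not taken from the handout's output):

```python
# G = -2 * (L0 - L1), compared against chi-square with df = 1 (one added variable).
# Illustrative log-likelihoods (not from the handout's data):
L0 = -112.5   # model with the intercept only
L1 = -77.9    # model with the intercept plus one variable

G = -2.0 * (L0 - L1)
critical_5pct = 3.841  # chi-square critical value, df = 1, alpha = 0.05
print(round(G, 2), G > critical_5pct)
```

A larger gap between the two log-likelihoods gives a larger G and hence stronger evidence against the null hypothesis that the added variable has no effect.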
W = β̂ / S.E.(β̂) = Z

and hence;

W² (i.e. Z²) = β̂² / [S.E.(β̂)]²
Under the hypothesis that β is equal to zero (i.e. the null hypothesis), the statistic W²
follows a chi-square distribution (χ²) with 1 degree of freedom. The SPSS program produces
output for this test.
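A small sketch of the Wald statistic and its squared (chi-square) form, using made-up coefficient and standard-error values:

```python
def wald(b, se):
    """Wald Z = b / SE(b); its square follows chi-square with df = 1."""
    z = b / se
    return z, z * z

# Illustrative coefficient and standard error (made-up values):
z, w2 = wald(0.693, 0.21)
print(round(z, 2), round(w2, 2), w2 > 3.841)  # significant at 5% if W^2 > 3.841
```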
Hypotheses; Stated in the similar way as that for simple linear regression
Note: For the Z test, you can as well use the C.I. to decide whether to accept or
reject the null hypothesis (this approach can also be applied to some previously presented
statistical tests).
The concept: compute β̂ ± 1.96·S.E.(β̂); if the interval contains zero, the estimate is not
significant at the 5% level of significance (i.e. 95% C.I.). If dealing with an OR, to be
significant the interval of the OR should not contain 1.
Suppose we want to establish whether there is a relationship between a variable "If ever given
birth (BIRTH)" by a female adolescent as a dependent variable (with values 1 = Ever
given birth, meaning YES, and 0 = Not ever given birth, meaning NO) and age of an
adolescent (AGE) (in years) as an independent variable (from our file: Workshop SPSS
working file4). The simple binary logistic regression equation/model can be written as;

ln[ P(YES)_i / (1 − P(YES)_i) ] = α + β·AGE_i
Whereby;
P(YES)_i = Probability that an adolescent had ever given birth
AGE = Age of an adolescent (Years)
α = Regression constant
β = Regression coefficient
OR
ln[ P_i / (1 − P_i) ] = α + β·AGE_i
Whereby;
P_i = Probability that an adolescent had ever given birth
AGE = Age of an adolescent (Years)
α = Regression constant
β = Regression coefficient
Note1: in the above example we had a continuous independent variable (i.e. age). If you
have a categorical independent variable you must indicate it when defining terms in the
model by indicating its categories.
Procedures in SPSS: Analyze → Regression → Binary Logistic → Select the dependent
variable and put it to the Dependent box → Select the independent variable and put it
to the Covariates box → (for a categorical independent variable: Categorical →
Specify a reference category and click the Change button; Option: but I prefer first
category) → OK.
In the output below are the results of a logistic regression analysis for the null hypothesis
that the probability of an adolescent reporting birth has no relationship with her age vs
(against) the alternative hypothesis that it is influenced by age (i.e. has a relationship with age).
Logistic Regression
Case Processing Summary a
Unweighted Cases N Percent
Selected Cases Included in Analysis 202 100.0
Missing Cases 0 .0
Total 202 100.0
Unselected Cases 0 .0
Total 202 100.0
a. If weight is in effect, see classification table for the total number of cases.
Classification Table a,b
Predicted: if ever given birth (birth status)
Observed Not ever given birth Ever given birth Percentage Correct
Step 0 if ever given birth (birth status) Not ever given birth 153 0 100.0
Ever given birth 49 0 .0
Overall Percentage 75.7
a. Constant is included in the model.
b. The cut value is .500
Variables not in the Equation
Score df Sig.
Step 0 Variables AGE 61.608 1 .000
Overall Statistics 61.608 1 .000
Omnibus Tests of Model Coefficients
Chi-square df Sig.
Step 1 Step 69.301 1 .000
Block 69.301 1 .000
Model 69.301 1 .000
Model Summary
Classification Table a
Predicted: if ever given birth (birth status)
Observed Not ever given birth Ever given birth Percentage Correct
Step 1 if ever given birth (birth status) Not ever given birth 143 10 93.5
Ever given birth 29 20 40.8
Overall Percentage 80.7
a. The cut value is .500
Interpretation/reporting results
Based on the Tables for Model Summary and Variables in the Equation (i.e. the last table) it can
be said that age was a good predictor of the probability of reporting birth by an adolescent.
Example;
In the output below are the results of a logistic regression analysis for the null hypothesis
that the probability of an adolescent reporting birth has no relationship with alcohol use vs
(against) the alternative hypothesis that it is influenced by alcohol use (i.e. it has a relationship
with alcohol use by an adolescent).
Note: Alcohol use was categorical with two categories (i.e. “Not use”, and “Use”)
ln[ P_i / (1 − P_i) ] = α + β·ALCOHOL_i
Whereby;
P_i = Probability that an adolescent had ever given birth (i.e. reported birth)
ALCOHOL = If use alcohol (1 = Yes, 0 = No)
α = Regression constant
β = Regression coefficient
Note: The coding for "ALCOHOL" shown above was based on the reference category
chosen (i.e. "Not use"), for which in computer coding the design/dummy variable
produced was coded as 1 = if use, 0 = if not use (meaning "Not use" was the
reference category), even though in the data file it was coded as 1 = if not use and 2 = if use.
The reason for the coding produced by the computer is that the FIRST category was chosen as
the reference category.
You can as well use directly in the data file the coding style of 0 = reference
category and 1 = non-reference category. But the file used in this example coded 1 =
Not use and 2 = Use.
Logistic Regression
Case Processing Summary a
Unweighted Cases N Percent
Selected Cases Included in Analysis 202 100.0
Missing Cases 0 .0
Total 202 100.0
Unselected Cases 0 .0
Total 202 100.0
a. If weight is in effect, see classification table for the total number of cases.
Categorical Variables Codings
Frequency Parameter coding (1)
if use alcohol Not use 152 .000
Use 50 1.000
Classification Table a,b
Predicted: if ever given birth (birth status)
Observed Not ever given birth Ever given birth Percentage Correct
Step 0 if ever given birth (birth status) Not ever given birth 153 0 100.0
Ever given birth 49 0 .0
Overall Percentage 75.7
a. Constant is included in the model.
b. The cut value is .500
Variables not in the Equation
Score df Sig.
Step 0 Variables ALCOHOL(1) 36.440 1 .000
Overall Statistics 36.440 1 .000
Omnibus Tests of Model Coefficients
Chi-square df Sig.
Step 1 Step 33.147 1 .000
Block 33.147 1 .000
Model 33.147 1 .000
Model Summary
Classification Table a
Predicted: if ever given birth (birth status)
Observed Not ever given birth Ever given birth Percentage Correct
Step 1 if ever given birth (birth status) Not ever given birth 131 22 85.6
Ever given birth 21 28 57.1
Overall Percentage 78.7
a. The cut value is .500
Interpretation/reporting results
Based on the Tables for Model Summary and Variables in the Equation (i.e. the last table) it can
be said that alcohol use predicted 23% of the variation in the probability of reporting birth
among adolescents, and was therefore a moderately good predictor (Nagelkerke R2 = 23%). Based
on the Wald statistic, results further indicate alcohol use had a significant relationship with the
log-odds for reporting birth (P < 0.001). Results for the Odds Ratio (OR) indicate alcohol
users were eight times more likely to report birth compared to non-users (OR = 7.94; 95%
C.I. 3.85 – 16.38).
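As a sketch of how an OR and its 95% C.I. come from a coefficient and its standard error, the values below are illustrative back-calculations roughly consistent with the reported OR of 7.94 (they are not the actual SPSS estimates):

```python
import math

def odds_ratio_ci(b, se, z=1.96):
    """OR = exp(b) with 95% CI: exp(b - z*SE), exp(b + z*SE)."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

# Illustrative b and SE roughly consistent with the reported OR = 7.94:
or_, lo, hi = odds_ratio_ci(2.072, 0.369)
print(round(or_, 2), round(lo, 2), round(hi, 2))
```

Because the whole interval lies above 1, the effect is significant at the 5% level.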
The case of a categorical independent variable with more than two categories (e.g.
three categories)
Example;
In the output below are the results of a logistic regression analysis for the null hypothesis
that the probability of an adolescent reporting birth has no relationship with wealth status of
the family vs (against) the alternative hypothesis that it has a relationship with wealth status of
the family.
Note: Wealth status (WEALTH) was categorical with three categories (i.e. “LOW”,
“MODERATE”, and “HIGH”)
ln[ P_i / (1 − P_i) ] = α + β1·MODERATE_i + β2·HIGH_i
Whereby;
P_i = Probability that an adolescent had ever given birth (i.e. reported birth)
MODERATE = If an adolescent is from moderate income family (1 = if Yes, 0 =
otherwise)
HIGH = If an adolescent is from high income family (1 = if Yes, 0 = otherwise)
α = Regression constant
β1, β2 = Regression coefficients
OR
ln[ P_i / (1 − P_i) ] = α + β1·WEALTH1_i + β2·WEALTH2_i
Whereby;
Logistic Regression
Case Processing Summary a
Unweighted Cases N Percent
Selected Cases Included in Analysis 202 100.0
Missing Cases 0 .0
Total 202 100.0
Unselected Cases 0 .0
Total 202 100.0
a. If weight is in effect, see classification table for the total number of cases.
Categorical Variables Codings
Frequency Parameter coding (1) (2)
Wealth status of a family low 98 .000 .000
moderate 55 1.000 .000
high 49 .000 1.000
Classification Table a,b
Predicted: if ever given birth (birth status)
Observed Not ever given birth Ever given birth Percentage Correct
Step 0 if ever given birth (birth status) Not ever given birth 153 0 100.0
Ever given birth 49 0 .0
Overall Percentage 75.7
a. Constant is included in the model.
b. The cut value is .500
Variables not in the Equation
Score df Sig.
Step 0 Variables WEALTH 26.558 2 .000
WEALTH(1) 3.880 1 .049
WEALTH(2) 14.333 1 .000
Overall Statistics 26.558 2 .000
Omnibus Tests of Model Coefficients
Chi-square df Sig.
Step 1 Step 29.748 2 .000
Block 29.748 2 .000
Model 29.748 2 .000
Model Summary
Classification Table a
Predicted: if ever given birth (birth status)
Observed Not ever given birth Ever given birth Percentage Correct
Step 1 if ever given birth (birth status) Not ever given birth 153 0 100.0
Ever given birth 49 0 .0
Overall Percentage 75.7
a. The cut value is .500
Note2: the description of coding for dummy variables with more than two categories
shown above is also applicable in linear regression analysis/models.
Interpretation/reporting results
Based on the Tables for Model Summary and Variables in the Equation (i.e. the last table) it can
be said that wealth status of a family predicted 20% of the variation in the probability of
reporting birth among adolescents, and was therefore a moderately good predictor (Nagelkerke R2
= 20%). Based on the Wald statistic, results further indicate family wealth had a significant
relationship with the log-odds for reporting birth (P < 0.001). Results for the Odds Ratio (OR)
indicate that being from a moderate income family was associated with a
significant reduction in the odds (chances) of reporting birth relative to being from a low income
family (OR = 0.26; 95% C.I. 0.11 – 0.60). Similarly, being from a high income family was
also associated with a significant reduction in the odds of reporting birth
compared to an adolescent from a low income family (OR = 0.06; 95% C.I. 0.02 – 0.28).
Note 3: Highlight on Crude Odds Ratio (COR) vs Adjusted Odds Ratio (AOR)
π(x) = e^(α + β1x1 + β2x2 + … + βpxp) / (1 + e^(α + β1x1 + β2x2 + … + βpxp))

ln[ π / (1 − π) ] = α + β1x1 + β2x2 + … + βpxp

i.e.

g(x) = β0 + β1x1 + β2x2 + … + βpxp

or for simplicity

g = α + β1x1 + β2x2 + … + βpxp

A logit equation for a sample (for estimating the logit equation for a population) can be written
as;

ĝ = α̂ + β̂1x1 + β̂2x2 + … + β̂pxp
If some of the independent variables are categorical, i.e. discrete, nominal-scale variables
such as race, sex, etc., we need to generate design variables (D) (or dummy variables) (i.e.
using a particular coding scheme) as shown in simple logistic regression. There are
different styles of coding scheme (refer to Hosmer and Lemeshow, 2003); however, the
style shown previously in simple logistic regression is the simplest one.
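The first-category-as-reference coding scheme can be sketched as follows (`dummy_code` is a hypothetical helper for illustration, not an SPSS function):

```python
def dummy_code(values, categories):
    """First-category-as-reference coding: k categories -> k-1 dummies."""
    reference, others = categories[0], categories[1:]
    return [[1 if v == c else 0 for c in others] for v in values]

wealth = ["low", "moderate", "high", "low", "high"]
# "low" is the reference; the dummies are (moderate, high)
print(dummy_code(wealth, ["low", "moderate", "high"]))
```

Each observation in the reference category gets all-zero dummies, matching the Parameter coding columns in the SPSS "Categorical Variables Codings" tables shown in the examples.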
g(x) = α + β1x1 + … + Σ (l = 1 to k_j − 1) β_jl·D_jl + … + βp·xp

where k_j is the number of categories of the j-th (categorical) independent variable and D_jl
are its design (dummy) variables.
1. Goodness of fit-test
There are several measures, e.g. Nagelkerke R2, Cox and Snell R2, McFadden R2, etc.
OR
Compare the reduced model vs the full model when testing the significance of additional
variable(s) in the model.
Test statistic;
G = -2(log likelihood of reduced model, or model with only the intercept – log
likelihood of full model, or fitted model)
G will follow a chi-square distribution with p degrees of freedom, i.e. (p+1) − 1, under the null
hypothesis that the p "slope" coefficients for the covariates (independent variables) in
the model are equal to zero.
In carrying out this test you may possibly end up rejecting the null hypothesis and
hence concluding that at least one, and perhaps all, of the p coefficients are different from zero,
an interpretation analogous to that in multiple linear regression.
Before concluding that any or all of the coefficients are nonzero, we may wish to look at
the univariate Wald test statistics, i.e. we can use the Wald test to test the significance of
individual coefficients;
W_j = β̂_j / S.E.(β̂_j) = Z_j
Note: under the hypothesis that an individual coefficient is zero, the above Wald statistic will
follow the standard normal distribution; equivalently, its square will follow a chi-square
distribution with 1 degree of freedom. The SPSS program produces output for this test.
Hypotheses; Stated in the similar way as that for multiple linear regression
Note: For the Z test, you can as well use the C.I. to decide whether to accept or
reject the null hypothesis (this approach can also be applied to some previously presented
statistical tests).
The concept: compute β̂ ± 1.96·S.E.(β̂); if the interval contains zero, the estimate is not
significant at the 5% level of significance (i.e. 95% C.I.). If dealing with an OR, to be
significant the interval of the OR should not contain 1.
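The decision rule just described (a 95% C.I. for an OR that excludes 1 indicates significance at the 5% level) can be sketched as:

```python
def or_significant(ci_low, ci_high):
    """An OR is significant at the 5% level if its 95% CI does not contain 1."""
    return not (ci_low <= 1.0 <= ci_high)

print(or_significant(3.85, 16.38))   # interval entirely above 1 -> significant
print(or_significant(0.29, 41.72))   # interval straddles 1 -> not significant
```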
When OR > 1 (e.g. 1.5, 2.4, 3.1), it means that changing the independent variable from its
reference category to a particular category, or increasing a continuous independent variable
by one unit, would result in an increased likelihood (odds or probability) of observing a
particular category of the response variable (i.e. that coded as 1).
When OR < 1 (e.g. 0.36, 0.42, 0.67), it means that changing the independent variable from its
reference category to a particular category, or increasing a continuous independent variable
by one unit, would result in a decreased likelihood (odds or probability) of observing a
particular category of the response variable (i.e. that coded as 1).
The value OR = 2.4 implies that women living in urban areas were two times more
likely to be diagnosed positive for breast cancer relative to those living in rural areas.
The value OR = 4.1 implies that women living in urban areas were four times more
likely to be diagnosed positive for breast cancer relative to those living in rural areas.
The value OR = 1.7 implies that women living in urban areas were almost two times
more likely to be diagnosed positive for breast cancer relative to those living in rural
areas. Alternatively, the OR can be interpreted as: women living in urban areas were
70% more likely to be diagnosed positive for breast cancer relative to those living in rural
areas (i.e. relative to their counterparts). Another alternative is: living in urban areas
increased the odds of being diagnosed positive for breast cancer by 70% relative to rural areas.
The value OR = 3.0 implies that women living in urban areas were three times more
likely to be diagnosed positive for breast cancer relative to those living in rural areas.
The value OR = 0.36 implies that living in urban areas was associated with a 64%
reduction in the odds (or probability) of a woman being diagnosed positive for breast
cancer; or, living in an urban area decreased the likelihood (odds or probability) of being
diagnosed positive for breast cancer by 64%.
The value OR = 0.50 implies that living in urban areas was associated with a 50%
reduction in the odds (or probability) of a woman being diagnosed positive for breast
cancer; or, living in an urban area decreased the likelihood (odds or probability) of being
diagnosed positive for breast cancer by 50%.
The value OR = 0.25 implies that living in urban areas was associated with a 75% (or
three-quarters) reduction in the odds (or probability) of a woman being diagnosed positive
for breast cancer; or, living in an urban area decreased the likelihood (odds or probability)
of being diagnosed positive for breast cancer by 75% (or three quarters).
The value OR = 0.36 (for a continuous independent variable) implies that an increase in age
by one year would be associated with a 64% reduction in the odds (or probability) of being
diagnosed positive for breast cancer; or, an increase in age by one year would decrease the
likelihood (odds or probability) of being diagnosed positive for breast cancer by 64%.
The value OR = 1.4 implies that an increase in age by one year would increase the likelihood
(odds or probability) of being diagnosed positive for breast cancer by 40%.
The value OR = 3.0 implies that an increase in age by one year would result in a three-fold
increase in the likelihood (odds or probability) of being diagnosed positive for breast
cancer.
Etc.
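The percentage-change reading of an OR used in the examples above can be sketched as:

```python
def pct_change_in_odds(odds_ratio):
    """(OR - 1) * 100: percent change in odds per one-unit (or category) change."""
    return (odds_ratio - 1.0) * 100.0

print(round(pct_change_in_odds(1.7)))   # 70 -> a 70% increase in odds
print(round(pct_change_in_odds(0.36)))  # -64 -> a 64% reduction in odds
```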
It can be shown that the regression coefficient β1 is related to the OR in the following way;

OR = e^(β1)

Likewise, ln(OR) = β1
Suppose we want to establish whether there is a relationship between a variable "If ever given
birth (BIRTH)" by a female adolescent as a dependent variable and Age (AGE), Highest
education level attained (HIHESED), Religious affiliation (RELIGION), Ethnicity
(ETHNIC), Wealth status of a family (WEALTH), Living arrangement (LIVING), Type
of marriage by parents (TMARRIAG), Having close friends that are sexually active
(ACTIVE), and Use of alcohol (ALCOHOL) by an adolescent as independent variables;
(from our file: Workshop SPSS working file4)
Note:
- A dependent variable "BIRTH" was categorical with two categories (Ever given
birth, Not ever given birth).
- Independent variable “AGE” was continuous (in years).
- Independent variable “HIHESED” was categorical with two categories (Primary
and below, Secondary and above).
- Independent variable “RELIGION” was categorical with three categories
(Catholic, Protestant, Moslem).
- Independent variable “ETHNIC” was categorical with two categories (Gogo,
Others)
- Independent variable “WEALTH” was categorical with three categories (Low,
Moderate, High).
- Independent variable “LIVING” was categorical with three categories (Single
parent, Both parents, Others).
- Independent variable “TMARRIAG” was categorical with two categories
(Polygamy, Monogamy)
- Independent variable “ACTIVE” was categorical with two categories (Yes, No)
- Independent variable “ALCOHOL” was categorical with two categories (Not use,
Use)
Therefore, multiple binary logistic regression model/logit model for the above problem
can be written as;
ln[ P(Yi=1) / (1 − P(Yi=1)) ] = α + β1·AGE_i + β2·HIHESED_i + β3·RELIGION1_i +
β4·RELIGION2_i + β5·ETHNIC_i + β6·WEALTH1_i + β7·WEALTH2_i + β8·LIVING1_i +
β9·LIVING2_i + β10·TMARRIAG_i + β11·ACTIVE_i + β12·ALCOHOL_i
Whereby;
P(Yi 1) = Probability that an adolescent had ever given birth
AGE= Age of an adolescent (Years)
HIHESED = Highest education level attained (1 = if secondary and above, 0 = if primary
and below).
RELIGION1 = If religious affiliation is protestant (1 = Yes, 0 = Otherwise).
RELIGION2 = If religious affiliation is moslem (1 = Yes, 0 = Otherwise).
ETHNIC = Ethnicity (1 = if others, 0 = if Gogo)
OR
ln[ P(Yi=1) / (1 − P(Yi=1)) ] = α + β1·AGE_i + β2·HIHESED_i + β3·PROTESTANT_i +
β4·MOSLEM_i + β5·ETHNIC_i + β6·MODERATE_i + β7·HIGH_i + β8·BOTH_i +
β9·OTHERS_i + β10·TMARRIAG_i + β11·ACTIVE_i + β12·ALCOHOL_i
Note: You can use other format for expressing the information on left hand side of the
equation (i.e. logit) as we have seen in a simple logistic regression. This include;
ln[ P_i / (1 − P_i) ] = …
Whereby;
Pi = Probability that an adolescent had ever given birth
OR
Procedures in SPSS: Analyze → Regression → Binary Logistic → Select the dependent
variable and put it to the Dependent box → Select the independent variables and put them
to the Covariates box → Categorical → Select the independent variables
that are categorical and put them to the Categorical Covariates box → Specify a
reference category and click the Change button (Option: but I prefer first category!!!) →
Continue → OK.
ln[ P(Yi=1) / (1 − P(Yi=1)) ] = α + β1·AGE_i + β2·HIHESED_i + β3·RELIGION1_i +
β4·RELIGION2_i + β5·ETHNIC_i + β6·WEALTH1_i + β7·WEALTH2_i + β8·LIVING1_i +
β9·LIVING2_i + β10·TMARRIAG_i + β11·ACTIVE_i + β12·ALCOHOL_i
Whereby;
P(Yi 1) = Probability that an adolescent had ever given birth
AGE= Age of an adolescent (Years)
HIHESED = Highest education level attained (1 = if secondary and above, 0 = if primary
and below).
RELIGION1 = If religious affiliation is protestant (1 = Yes, 0 = Otherwise).
RELIGION2 = If religious affiliation is moslem (1 = Yes, 0 = Otherwise).
Example:
Ho: Independent variables included in the model collectively had no significant influence
on probability for reporting birth (i.e. had no influence on a dependent variable)
Ha: Independent variables included in the model collectively had significant influence on
probability for reporting birth (i.e. had influence on a dependent variable)
Ho: All independent variables included in the model had no significant influence on
probability for reporting birth (had no influence on a dependent variable)
Ha: At least one independent variable included in the model had significant influence on
probability for reporting birth (i.e. had influence on a dependent variable)
Results
Logistic Regression
Categorical Variables Codings
Frequency Parameter coding (1) (2)
Wealth status of a family low 98 .000 .000
moderate 55 1.000 .000
high 49 .000 1.000
religion affiliation catholic 52 .000 .000
protestant 134 1.000 .000
moslem 16 .000 1.000
Living arrangement single parent 79 .000 .000
both parents 103 1.000 .000
others (relatives) 20 .000 1.000
Ethnicity gogo 153 .000
others 49 1.000
if use alcohol Not use 152 .000
Use 50 1.000
Type of marriage by parents Polygamy 123 .000
Monogamy 79 1.000
if close friends are sexually active Yes 173 .000
No 29 1.000
highest education level primary or below 146 .000
Sec and above 56 1.000
Classification Table a,b
Predicted: if ever given birth (birth status)
Observed Not ever given birth Ever given birth Percentage Correct
Step 0 if ever given birth (birth status) Not ever given birth 153 0 100.0
Ever given birth 49 0 .0
Overall Percentage 75.7
a. Constant is included in the model.
b. The cut value is .500
Score df Sig.
Step Variables AGE 61.608 1 .000
0 HIHESED(1) 4.193 1 .041
RELIGION 8.003 2 .018
RELIGION(1) 6.796 1 .009
RELIGION(2) .005 1 .942
ETHNIC(1) 14.333 1 .000
WEALTH 26.558 2 .000
WEALTH(1) 3.880 1 .049
WEALTH(2) 14.333 1 .000
LIVI NG 33.230 2 .000
LIVI NG(1) 27.550 1 .000
LIVI NG(2) .219 1 .640
TMARRI AG(1) .003 1 .956
ACTIVE(1) 21.762 1 .000
ALCOHOL(1) 36.440 1 .000
Ov erall Statistics 109.874 12 .000
Omnibus Tests of Model Coefficients
Chi-square df Sig.
Step 1 Step 155.839 12 .000
Block 155.839 12 .000
Model 155.839 12 .000
Classification Table a
Predicted: if ever given birth (birth status)
Observed Not ever given birth Ever given birth Percentage Correct
Step 1 if ever given birth (birth status) Not ever given birth 146 7 95.4
Ever given birth 10 39 79.6
Overall Percentage 91.6
a. The cut value is .500
Table.. Multiple logistic regression for factors influencing fertility (i.e. reporting birth)
among female adolescents
Variable B S.E. Wald df Sig. Exp(B) 95.0% C.I.for EXP(B)
Lower Upper
AGE 1.097 .226 23.618 1 .000 2.996 1.925 4.664
HIHESED -2.703 .866 9.737 1 .002 .067 .012 .366
RELIGION1 -1.903 .870 4.781 1 .029 .149 .027 .821
RELIGION2 -3.614 1.593 5.150 1 .023 .027 .001 .611
ETHNIC -5.271 1.813 8.450 1 .004 .005 .000 .180
WEALTH1 -2.933 .907 10.460 1 .001 .053 .009 .315
WEALTH2 -2.101 1.163 3.262 1 .071 .122 .013 1.196
LIVING1 -2.347 .810 8.399 1 .004 .096 .020 .468
LIVING2 1.247 1.268 .968 1 .325 3.479 .290 41.724
TMARRIAG -1.399 .709 3.894 1 .048 .247 .061 .991
ACTIVE -.156 .891 .031 1 .861 .856 .149 4.910
ALCOHOL 3.129 .878 12.711 1 .000 22.860 4.092 127.711
Constant -17.659 3.897 20.538 1 .000 .000
Nagelkerke R2= 0.80
Based on the Tables for Model Summary and Variables in the Equation (i.e. the last table) it can
be said that the variables included in the model were good predictors of reporting birth by an
adolescent (Nagelkerke R2 = 0.80). The Wald chi-square test indicates age, education level,
religious affiliation, ethnicity, wealth status of a family, living arrangement, type of
marriage by parents and alcohol use had a significant influence on the probability of reporting
birth (i.e. having pre-marital fertility) by an adolescent. The effect of peer pressure was not
significant. An increase in age by one year was associated with a three-fold increase in the
likelihood of reporting birth (OR = 3.0, 95% C.I. 1.93 – 4.66). Having secondary education and
above was associated with a decreased likelihood of reporting birth (OR = 0.07, 95% C.I. 0.01 –
0.37). Results also indicate being Protestant relative to Catholic, and being Moslem
relative to Catholic, were associated with reduced odds of reporting birth by an adolescent
(OR = 0.15, 95% C.I. 0.03 – 0.81 and OR = 0.03, 95% C.I. 0.00 – 0.61, respectively). Adolescents
from other tribes were less likely to report birth compared to those from the Gogo tribe (OR =
0.01, 95% C.I. 0.00 – 0.18). Adolescents from moderate income families were less likely to
report birth compared to those from low income families (OR = 0.05, 95% C.I. 0.01 –
0.32). The effect of high income relative to low income was not significant. Regarding
living arrangement, living with both parents reduced the likelihood of
reporting birth relative to living with a single parent (OR = 0.10, 95% C.I. 0.02 – 0.47), while
living with others had no significant effect on the likelihood of reporting birth relative to living
with a single parent (OR = 3.5, 95% C.I. 0.29 – 41.72). Likewise, parents' marriage being
monogamous was associated with reduced odds of reporting birth relative to polygamous (OR =
0.25, 95% C.I. 0.06 – 0.99), whereas not having close friends that are sexually active had no
significant influence on the probability of reporting birth (OR = 0.86, 95% C.I. 0.15 – 4.91).
Alcohol use increased the chances of reporting birth relative to non-use (OR = 22.86, 95% C.I.
4.09 – 127.71).