
QUANTITATIVE DATA ANALYSIS AND INTERPRETATION

INSTRUCTOR: Prof. Dr. James Lwelamira (PhD)

INTRODUCTION

As described earlier, we conduct data analysis to understand the information/message contained in raw data. Through data analysis we can understand distributions of variables, relationships or associations among variables, differences among groups, etc.

There are two main types of quantitative data analysis, viz.
i) Descriptive statistics
ii) Inferential statistics

Descriptive statistics: concerned with analyzing frequency and percentage distributions for the various categories of a variable under consideration. Descriptive statistical analysis is also concerned with computing measures of central tendency (mean, mode, median) as well as measures of dispersion (standard deviation, variance, range) for the variables under study.

With descriptive statistical analyses we can observe trends and distribution patterns in data.

Inferential statistics: concerned with testing hypotheses about:

- whether there are significant differences between groups;
- whether there is a significant association/relationship among variables;
OR: whether an independent variable(s) has an effect (influence) on a dependent variable.

James Lwelamira (PhD) Page 1


What are hypotheses?
Hypotheses are claims/propositions/guesses concerning a population parameter; these claims can be true or not true (subject to the results of statistical tests).
OR:
Hypotheses are tentative predictions of the results of a study/experiment.
OR:
Hypotheses are predictive statements about the outcome of a study.

Based on the number of variables under consideration at a time, quantitative data analysis can be UNIVARIATE, BIVARIATE, or MULTIVARIATE.

Univariate analysis deals with one variable at a time; most descriptive statistical analyses fall under this category.

Bivariate analysis deals with two variables at a time. It is concerned with establishing the relationship between two variables and is part of inferential statistical analysis (examples: Pearson chi-square test, Pearson correlation analysis, Spearman's rank correlation, etc.).

Multivariate analysis deals with more than two variables at a time. It is concerned with establishing relationships among several (more than two) variables and is also part of inferential statistical analysis (examples: multiple regression analysis, factor analysis, discriminant analysis, etc.).



Furthermore, inferential statistics can be parametric or non-parametric (the choice of method is based on compliance with the normal distribution assumption). However, with the exception of a few cases (i.e. the chi-square test and Spearman's rank correlation analysis), this lecture concentrates on parametric tests.



DESCRIPTIVE STATISTICAL ANALYSIS

 They are summary measures, important for observing trends/patterns of distribution in data/variables.
 These include:
i) Measures of central tendency (i.e. mean) and measures of dispersion
(variance, SD) for CONTINUOUS data/variables,
AND
ii) Determination of frequencies and percentages for CATEGORICAL
data/variables.

 Descriptive statistics can be summarized/presented in tables or graphs, and
these tables/graphs can be simple or cross-classified (i.e. cross-tabulation
tables). In this class we will concentrate on simple tables and graphs for
descriptive statistics; a handout containing cross-classification analysis can be
provided to interested students upon request.

Note:
Type of variable (level of measurement) is among the key determinants of the choice of statistical method for data analysis. As you can see, for descriptive statistical analysis we compute frequencies and percentages for categorical variables,

AND mean, mode, median, minimum value, maximum value, variance and SD for continuous variables. Continuous variables can also be analyzed for skewness, kurtosis, interquartile range, and Box-and-Whisker plots (box-plots).



I) NUMERICAL (TABULAR) METHODS FOR DESCRIPTIVE STATISTICS

a) MEASURE OF CENTRAL TENDENCY AND DISPERSION;

 Applicable to continuous data/variables (i.e. scale/ratio)

Procedures in SPSS;

Analyze → Descriptive Statistics → Descriptives → Select the variable(s) of interest and put it (them) into the Variable(s) box → Options (select mean and other statistics you want) → OK

(Note: N, Mean, SD, Min, and Max are generated by default)

Example:

Using “Workshop SPSS working file1”, compute descriptive statistics for the variables
age, hhsize, income1, income2, fsize, ncattle.

Solution;

Analyze → Descriptive Statistics → Descriptives → Select the variables "age, hhsize, income1, income2, fsize, ncattle" and put them into the Variable(s) box → OK

Output
Descriptive Statistics

Variable                               N     Minimum   Maximum      Mean      Std. Deviation
age of household head                 150      22.00     62.00     44.3867       9.15180
household size (number)               150       1.00     13.00      6.7533       2.61369
annual income before project ('000)   150     150.00   3405.00   1189.2200     859.22329
annual income after project ('000)    150     130.00   3200.00   1248.1800     868.43369
farm size (acres)                     150       1.00     16.00      4.3233       3.09857
number of cattle owned                150        .00    120.00     23.1333      19.41482
Valid N (listwise)                    150

Interpretation/reporting

Results from Table .. indicate that the age of respondents varied from 22 to 62 years with an average of 44 years; household size varied from 1 to 13 individuals per household with an average of 7 individuals. Regarding annual household income, recorded in '000 Tsh, income varied from 150 to 3405 with an average of 1189 for the period before the project, and from 130 to 3200 with an average of 1248 for the period after the project. Furthermore, results reveal that farm size per household varied from 1 to 16 acres with an average of 4.3 acres, and the number of cattle ranged from 0 to 120 with an average of 23 cattle per household.
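The same default summary statistics (N, minimum, maximum, mean, SD) can also be computed outside SPSS. Below is a minimal sketch in Python/pandas; the column names mirror two of the working-file variables, but the values are small made-up illustrative data, not the actual file.

```python
import pandas as pd

# Hypothetical stand-in for a few cases of the working file;
# column names mirror the lecture's variables, values are illustrative.
df = pd.DataFrame({
    "age":    [22, 35, 44, 51, 62],
    "hhsize": [1, 5, 7, 9, 13],
})

# N, min, max, mean, and standard deviation: the same statistics
# SPSS's Descriptives procedure reports by default.
summary = df.agg(["count", "min", "max", "mean", "std"]).T
print(summary)
```

Each row of `summary` then corresponds to one row of the SPSS "Descriptive Statistics" table.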



b) FREQUENCY DISTRIBUTION AND PERCENTAGES (%)

 Applicable to categorical data/variables (i.e. ordinal/nominal), e.g. categorized or coded data.

Procedures in SPSS

Analyze → Descriptive Statistics → Frequencies → Select the variable(s) of interest and put it (them) into the Variable(s) box → OK



Example;

Using “Workshop SPSS working file1”, compute descriptive statistics (frequencies and
percentages) for the variables age2, sex, marital, educ.

Solution;

Analyze → Descriptive Statistics → Frequencies → Select the variables "age2, sex, marital, educ" and put them into the Variable(s) box → OK.



Output;

age

Category    Frequency   Percent   Valid Percent   Cumulative Percent
< 30            11         7.3          7.3               7.3
30 - 50         96        64.0         64.0              71.3
> 50            43        28.7         28.7             100.0
Total          150       100.0        100.0

sex

Category    Frequency   Percent   Valid Percent   Cumulative Percent
male            92        61.3         61.3              61.3
female          58        38.7         38.7             100.0
Total          150       100.0        100.0

marital status

Category    Frequency   Percent   Valid Percent   Cumulative Percent
single          22        14.7         14.7              14.7
married        122        81.3         81.3              96.0
divorced         3         2.0          2.0              98.0
widow            3         2.0          2.0             100.0
Total          150       100.0        100.0

education level

Category            Frequency   Percent   Valid Percent   Cumulative Percent
none                    7          4.7          4.7               4.7
primary                81         54.0         54.0              58.7
secondary              40         26.7         26.7              85.3
college and above      22         14.7         14.7             100.0
Total                 150        100.0        100.0

Note1: The "Percent" and "Valid Percent" columns have identical values. This is because there are no missing cases. However, if there were missing cases, the values in the two columns would differ. Always use the VALID percent column!
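The Percent vs. Valid Percent distinction can be illustrated with a short Python/pandas sketch; the sex values below are hypothetical, with one missing case inserted deliberately.

```python
import pandas as pd
import numpy as np

# Hypothetical variable with one missing case, to show why
# "Percent" and "Valid Percent" differ when data are missing.
sex = pd.Series(["male", "female", "male", np.nan, "male"])

counts = sex.value_counts()                 # missing cases excluded
percent = counts / len(sex) * 100           # base: all 5 cases
valid_percent = counts / sex.count() * 100  # base: 4 valid cases only

print(percent)
print(valid_percent)
```

With no missing cases, `len(sex)` and `sex.count()` coincide and the two columns agree, exactly as the note above describes.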



Note2: SPSS usually produces one table for each variable, so you end up with many tables in the output. It is good practice to summarize related information in one table using MS Word, to reduce the number of tables and for clarity. A table containing multiple variables should be properly labeled.

Example: the above tables can be summarized as follows;

Table ..: General characteristics of respondents (n = 150)


Variable Frequency Percent
Age (Years)
< 30 11 7.3
30-50 96 64.0
> 50 43 28.7
Sex
Male 92 61.3
Female 58 38.7
Marital status
Single 22 14.7
Married 122 81.3
Divorced 3 2.0
Widow 3 2.0
Education level
None 7 4.7
Primary 81 54.0
Secondary 40 26.7
College and above 22 14.7

How do we interpret/explain/describe these results? We could say:

Results from Table .. indicate that the majority of respondents (64.0%) were aged between 30 and 50 years, with very few (7.3%) aged below 30 years. Results from Table .. also reveal that most respondents (61.3%) were male, and an overwhelming majority (81.3%) were married. Findings from Table .. further indicate that about half (54.0%) of respondents had primary education, and more than one-third (41.4%) had at least secondary education.



Depending on preference, you may use commas instead of brackets when indicating percentages (however, the most common system uses brackets).

Example;
Results from Table .. indicate that the majority of respondents, 64.0%, were aged between 30 and 50 years, with very few, 7.3%, aged below 30 years. Results from Table .. also reveal that most respondents, 61.3%, were male, and an overwhelming majority, 81.3%, were married. Findings from Table .. further indicate that about half, 54.0%, of respondents had primary education, and more than one-third, 41.4%, had at least secondary education.

In some fields you may be required to indicate both frequencies and percentages when explaining your results; for example, in the above text we could say:

Results from Table .. indicate that, out of the 150 survey respondents, the majority, 96 (64.0%), were aged between 30 and 50 years, with very few, 11 (7.3%), aged below 30 years. Results from Table .. also reveal that, of the 150 surveyed respondents, most, 92 (61.3%), were male, and an overwhelming majority, 122 (81.3%), were married. Findings from Table .. further indicate that about half, 81 (54.0%), of respondents had primary education, and more than one-third, 62 (41.4%), had secondary education or above.

Important!
Note1;
In fact, there are several formats and ways of interpreting results and structuring sentences; the above are just a few examples. Reading previous research reports, i.e. dissertations or journal articles, will strengthen your skills in this respect. This is the easiest way of learning how to interpret and report the results of your analysis! (The problem is that most students don't read beyond what has been taught in class!)

Note2;
When explaining your results, try to explain the general trend/pattern in your results (the message we get from the results); i.e. avoid, as much as you can, copying/transferring ALL information from the table into the text describing the results.



Note3:
In explaining your results, try to use words/phrases commonly used in reports by other researchers. This adds flavor to your writing, and will be explained further in the topic concerning the writing of a research report.

Special case: Multiple responses

Sometimes there are variables for which a respondent can give more than one answer (response);

Example:
- What types of crops are grown?
- What do you think should be done to improve agricultural productivity in your village?
- What is a youth's source of information/knowledge on issues related to sexual and reproductive health?

These types of questions allow a respondent to give several answers, i.e. a combination of responses. In such a situation, the variable can either be coded including combinations of responses, OR you can use the Multiple Response option in SPSS. Multiple Response is usually preferred, as it makes the trend of responses easy to see (clarity); with the first option (use of combinations of responses in coding) you may end up with many combinations and hence lose focus and clarity.

Example: consider the question: What types of crops are grown?

1 = maize
2 = Sorghum
3 = Millet
4 = Rice
5 = Beans
6 = Simsim
7= Sunflower
8 = Groundnuts
9 = Cassava
10 = Sweet potatoes
11 = Bambara nuts
12 = Maize and sorghum
13 = Maize and millet
14 = Maize, Rice
15 = Sorghum, Beans, simsim
16 = Rice, groundnuts, rice
17 = Cassava, maize, millet, Bambara nuts, sweet potatoes
18 = Millet, Maize, Sunflower, beans, Rice,



19 = ……
20 = …….
21 = etc
So many combinations

To run a multiple response analysis:

Two approaches:
a) When we have a column for every possible answer/option, and the columns are coded in a 1 = Yes, 2 = No style.

b) When we have several columns for the question under study, with the number of columns depending on the maximum number of answers a respondent can give. Furthermore, coding is uniform across columns, with each column containing codes (value labels) for all possible options/answers.

Procedures below are for the second approach;

1. A variable for such an analysis must have several columns in the SPSS data file, with the number of columns representing the maximum (possible) number of responses an individual respondent gave, as observed in the questionnaires. (Note: this is not the same as the number of possible options/responses for that variable in the data file/questionnaires.)

2. Coding for all columns of that variable should be uniform; this can be achieved by copying the value labels of the first column and pasting them to the other columns.

Note: The first column is NOT reserved for the response coded as 1, nor the second column for the response coded as 2. Any response can be punched into the first, second, or third column etc., depending on the sequence of responses given by a respondent. That is, the first column holds a respondent's first response, the second column the second response, and so on. The first response by respondent X can be any response, e.g. the one coded as 7, and the second response can be any other, e.g. the one coded as 20; in this situation 7 is punched in the first column and 20 in the second.

Example;
Consider the variable "where", i.e. where a youth gets knowledge on reproductive health, in the data file "Workshop SPSS working file0". Based on responses from the questionnaire, there were 8 responses, coded as indicated below; the maximum number of responses by a respondent was four (4), hence four columns in the data file.

Where get knowledge on reproductive health


1 = home-parents/guardian



2 = friends/age mate
3 = seminars
4 = Radio
5 = Tv
6 = Newspaper
7 = school teacher
8 = matron

Appearance in SPSS data file


Respondent No.   where1   where2   where3   where4
1                2        3        1
2                2        4
3                3        2        1        4
4                1
5                2        5
6                1        5        6
7                1
8                3
9                5
10               1
11               3        5        8
12               1        3
13               1        6        8
14               7        2
15               7
16               8        1        3
17               2        4        8
18               7        1        2
etc.

OR

Respondent No.   where1                  where2                  where3                  where4
1                friends/age mate        seminars                home-parents/guardian
2                friends/age mate        Radio
3                seminars                friends/age mate        home-parents/guardian   Radio
4                home-parents/guardian
5                friends/age mate        Tv
6                home-parents/guardian   Tv                      Newspaper
7                home-parents/guardian
8                seminars
9                Tv
10               home-parents/guardian
11               seminars                Tv                      matron
12               home-parents/guardian   seminars
13               home-parents/guardian   Newspaper               matron
14               school teacher          friends/age mate
15               school teacher
16               matron                  home-parents/guardian   seminars
17               friends/age mate        Radio                   matron
18               school teacher          home-parents/guardian   friends/age mate
etc.

Executing Multiple Response analysis in SPSS

After preparing the file with the multiple-response variable in the required format, as explained above, the following are the procedures for carrying out a multiple response analysis in SPSS.

Procedures in SPSS

Analyze → Multiple Response → Select the variables for the set and put them into the Variables in Set box → Categories (specify the range of categories/responses for the variable under analysis) → Define the Name and Label of the new variable in the respective boxes → Add → Close → Analyze → Multiple Response → Frequencies → Select and put the "multiple response set" into the Table(s) for box → OK

Example;
From the file “Workshop SPSS working file0” conduct a multiple response analysis for
the variable “where get knowledge on reproductive health”.

Note: This variable has been represented by four columns (i.e. entered into four columns)
viz. where1, where2, where3, where4.

Solution;

Analyze → Multiple Response → Select the variables "where1, where2, where3 and where4" and put them into the Variables in Set box → Categories (specify the range of categories/responses for the variable under analysis, i.e. 1 to 8) → Write "where" as the variable name and "where get knowledge" as the variable label in the respective boxes → Add → Close → Analyze → Multiple Response → Frequencies → Select and put the "multiple response set" into the Table(s) for box → OK


Output;
Group $where  where get knowledge

Category label            Code    Count   Pct of Responses   Pct of Cases
home-parents/guardian       1       39          18.2              39.0
friends/age mate            2       30          14.0              30.0
seminars                    3       49          22.9              49.0
Radio                       4       12           5.6              12.0
Tv                          5       28          13.1              28.0
Newspaper                   6       12           5.6              12.0
school teacher              7       15           7.0              15.0
matron                      8       29          13.6              29.0
                                  -----        -----             -----
Total responses                     214         100.0             214.0

0 missing cases; 100 valid cases

How do we summarize the results?

We usually pick the category label column, the count (frequency) column, and either of the last two columns (i.e. pct of responses or pct of cases).
Note: pct = percent; cases = respondents. I usually prefer picking the last column.
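The two percentage columns are easy to recompute by hand: "pct of responses" divides each count by the total number of responses, while "pct of cases" divides by the number of valid respondents. The sketch below does this in Python/pandas on a tiny hypothetical wide-format file laid out like the where1..where4 example (blank answer slots are missing values); the values are illustrative, not the lecture's data.

```python
import pandas as pd

# Hypothetical wide layout: one column per answer slot,
# as in where1..where4; empty slots are NaN.
df = pd.DataFrame({
    "where1": [2, 2, 3, 1],
    "where2": [3, 4, 2, None],
    "where3": [1, None, 1, None],
    "where4": [None, None, 4, None],
})

counts = df.stack().value_counts().sort_index()  # responses per code (NaN dropped)
n_responses = counts.sum()                       # total responses given
n_cases = len(df)                                # valid respondents

pct_of_responses = counts / n_responses * 100    # sums to 100
pct_of_cases = counts / n_cases * 100            # can exceed 100 in total

print(counts, pct_of_responses, pct_of_cases, sep="\n")
```

Note the design point this makes concrete: "pct of cases" totals more than 100% precisely because each respondent can contribute several responses.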



Example: Consider the table below for side effects experienced by a sample of women from Mpwapwa District when using modern contraceptives.

Table 6: Distribution of respondents by side effects ever encountered (n = 28)


Side effect Frequency Percent (%)
Headache 4 14.3
Irregular bleeding 18 64.3
Weight gain 1 3.6
Weight loss 2 7.1
Fatigue 3 10.7
Backache 4 14.3
Nausea 3 10.7
Abdominal pain 9 32.1
Vertigo 1 3.6
Waist pain 1 3.6
Increase heart beats 2 7.1
Pains in the reproductive organ 1 3.6
Vomiting 1 3.6
Pains in the whole body 2 7.1
Cough dark sputum 1 3.6
Paralysis 2 7.1
*Data were based on multiple responses.

Note: the value of n here is the number of valid cases indicated at the bottom of the multiple response output.

II) GRAPHICAL METHODS FOR DESCRIPTIVE STATISTICS (DATA PRESENTATION BY GRAPHICAL METHODS)

Data can also be summarized for observing trends by presenting them in the form of graphs. There are several graphical methods for presenting data. The most common ones are:

i) Pie chart
ii) Bar chart
iii) Histogram
iv) Scatter plot

Note: There are several options/routes/procedures for generating graphs in the SPSS program.

PROCEDURES FOR PIE CHART IN SPSS

a) Presenting frequency distributions/percentages for the different categories of a categorical variable in a pie chart



Graphs → Interactive → Pie → Simple → Specify/indicate the slice summary to be in percentage (%) → Drag the variable of interest to the Slice By box (make sure it is specified as a categorical variable) → Pies (select category and percent) → Title (specify title) → OK.

Example:

Using “Workshop SPSS working file4”, generate a pie chart of the percent distribution for the different options/responses of the variable living2 (i.e. living arrangement).

Solution;

Graphs → Interactive → Pie → Simple → Specify/indicate the slice summary to be in percentage (%) → Drag the variable "living2" to the Slice By box (make sure it is specified as a categorical variable) → Pies (specify indication of categories and percent) → Title (write/type "Living arrangement") → OK.



Output

[Pie chart titled "Living arrangement" with slices for single parent, both parents, and others (relatives); pies show percents: 9.90%, 39.11%, and 50.99%.]
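A pie chart like this can be sketched outside SPSS with Python/matplotlib. The percentages below are the three shown in the output above, but the pairing of labels to slices here is illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Percentages from the output above; the label-to-slice
# pairing below is illustrative, not taken from the data file.
labels = ["both parents", "single parent", "others (relatives)"]
pct = [50.99, 39.11, 9.90]

plt.pie(pct, labels=labels, autopct="%.1f%%")  # print percent on each slice
plt.title("Living arrangement")
plt.savefig("living_pie.png")
```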

b) Presenting totals and percentages of a continuous variable by the different categories of a categorizing variable in a pie chart

Graphs → Interactive → Pie → Simple → Drag the continuous variable of interest to the Slice Summary box → Drag the categorizing variable of interest to the Slice By box (make sure it is specified as a categorical variable) → Pies (specify the slice/category label and percent to appear in the pie chart) → Title (write/type the title for the chart) → OK.

Example;

Using “Workshop SPSS working file1”, generate a pie chart for total income1 (annual

income before project) by district;



Solution;

Graphs → Interactive → Pie → Simple → Drag the variable "income1" to the Slice Summary box → Drag the variable "district" to the Slice By box (make sure it is specified as a categorical variable) → Pies (specify the slice/category label and percent to appear in the pie chart) → Title (write/type "Income before project by district") → OK.

Annual income before project by District

[Pie chart; pies show sums of income1 by district of residence: Kongwa 49.04%, Chamwino 26.91%, Bahi 24.05%.]

Note: However, the above output would only make sense if the sample size were equal across districts.



PROCEDURES FOR BAR CHART IN SPSS

- For categorical variables

Graphs → Interactive → Bar → Change the unit of measurement on the Y-axis to percent (%) by selecting Percent → Select the variable of interest and drag it to the X-axis (make sure the variable on the X-axis is specified as categorical; right-click and change it if not) → Title (specify the title of the graph) → OK.

Example;
Using “Workshop SPSS working file4”, generate a bar chart for the variable religion (religious affiliation)

Output



[Bar chart titled "Religious affiliation by a respondent"; x-axis: religion affiliation (catholic, protestant, moslem); y-axis: percent (0% to 60%); bars show percents.]
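The same kind of percent bar chart can be sketched in Python/matplotlib; the religion responses below are hypothetical counts, chosen only to illustrate converting frequencies to percent bar heights.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical responses; normalize=True converts counts to
# proportions, so the bar heights are percentages as in SPSS.
religion = pd.Series(["catholic"] * 6 + ["protestant"] * 3 + ["moslem"] * 1)
pct = religion.value_counts(normalize=True) * 100

plt.bar(pct.index, pct.values)
plt.ylabel("Percent")
plt.title("Religious affiliation by a respondent")
plt.savefig("religion_bar.png")
```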

PROCEDURES FOR HISTOGRAM IN SPSS

- For continuous variables

Graphs → Interactive → Histogram → Select the variable of interest and put it in the X-axis box → Specify percent on the Y-axis → Histogram (specify scale) → Title (specify title) → OK

Note: a variable for a histogram must be continuous (i.e. scale/interval)

Example:
Using “Workshop SPSS working file1”, generate a histogram for the variable income1 (income before project)



Output

[Histogram titled "Annual household income before project ('000 Tsh)"; x-axis: annual income before project ('000), roughly 0 to 3000; y-axis: percent (0% to 20%).]
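A percent-scale histogram like this can be sketched in Python/matplotlib by weighting each case by 100/N, so bar heights sum to 100%. The income values below are made-up illustrative figures, not the working-file data.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical income values ('000 Tsh); weighting each case by
# 100/N turns bar heights from counts into percentages.
income = np.array([150, 300, 450, 800, 900, 1100, 1200, 1500, 2000, 3405])
weights = np.full(len(income), 100 / len(income))

n, bins, patches = plt.hist(income, bins=5, weights=weights)
plt.xlabel("annual income before project ('000)")
plt.ylabel("Percent")
plt.savefig("income_hist.png")
```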

PROCEDURES FOR SCATTER PLOT IN SPSS

- Require two continuous variables

Procedures in SPSS

Graphs → Interactive → Scatterplot → Select the variables of interest and put them on the appropriate axes (X or Y) → Title (specify title) → OK.

Note: You can fit a regression line (equation) by using the Fit option before specifying a title. While in the Fit option, choose Regression in the Method dialogue box and specify that the constant be included in the equation.



Use Workshop SPSS working file3

Generate a scatter plot for farm size (X) vs Income (Y)

Expected output

[Scatter plot titled "Farm size vs income": farm size (acres) on the x-axis vs household annual income ('000) on the y-axis, with a fitted linear regression line: household annual income ('000) = 411.40 + 194.65 * fsize; R-Square = 0.48.]
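A comparable scatter plot with a fitted line can be sketched in Python, with numpy's least-squares `polyfit` standing in for SPSS's regression Fit option. The farm sizes and incomes below are made-up illustrative values, not the working-file data, so the fitted coefficients differ from the 411.40 and 194.65 shown above.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical farm sizes (acres) and incomes ('000), built around a
# rough linear trend plus small noise; values are illustrative only.
fsize = np.array([1.0, 2.0, 4.0, 6.0, 8.0, 12.0, 16.0])
income = 400 + 200 * fsize + np.array([-50.0, 30.0, -20.0, 60.0, -40.0, 10.0, 5.0])

slope, intercept = np.polyfit(fsize, income, 1)  # least-squares line (with constant)

plt.scatter(fsize, income)
plt.plot(fsize, intercept + slope * fsize)  # overlay the fitted line
plt.xlabel("farm size (acres)")
plt.ylabel("household annual income ('000)")
plt.title("Farm size vs income")
plt.savefig("scatter_fit.png")
```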

Exercise
Generate a scatter plot for Number of cattle (X) vs Income (Y)



INFERENTIAL STATISTICS

As stated earlier, inferential statistics is concerned with testing hypotheses. Hypotheses are predictive statements about the outcome of a study; they may concern differences between groups or relationships/associations among variables. We usually state hypotheses in two forms, viz. the Null Hypothesis (Ho) and the Alternative Hypothesis (Ha).

Null Hypothesis (Ho)

Literally speaking, the null hypothesis is the one declaring that "there is NO significant difference between groups (if comparing groups), or there is NO significant association/relationship between variables (if testing an association/relationship between variables)".
(Or: the observed difference/association/relationship IS DUE TO CHANCE!)

Alternative Hypothesis (Ha)
The alternative hypothesis is the one declaring that "there IS a significant difference between groups (if comparing groups), or there IS a significant association/relationship between variables (if testing an association/relationship between variables)".
(Or: the observed difference/association/relationship is REAL and not due to chance!)

In hypothesis testing we usually examine whether we can REJECT the null hypothesis or ACCEPT it.
When the null hypothesis is rejected, we accept the alternative hypothesis and therefore, depending on the nature of the problem, declare that there is a significant difference between groups or a significant association/relationship between variables.

How is the decision to reject or accept the null hypothesis reached?



Theoretically, we compare the standardized/transformed difference between the claimed (hypothesized) value and the true value (represented by the sample information) of the population parameter (i.e. a test statistic, e.g. Z, t, chi-square, F) with a particular benchmark (i.e. the tabulated value) set based on the chosen level of significance (α). When the value of the test statistic is higher than the tabulated value, the null hypothesis is REJECTED, and hence the alternative hypothesis is accepted.

What is a level of significance?

The level of significance is the probability (chance) of wrongly rejecting the null hypothesis (i.e. the probability of committing a Type I error). We usually prefer lower levels of significance. Acceptable levels of significance are 0.1% (0.001), 1% (0.01), 5% (0.05) and 10% (0.1), but the most common are 5% (i.e. 0.05) and lower.

Someone may wonder why a probability is attached to the decisions above: it is because decisions are based on sample information, and this information varies from one sample to another.

Important! (P-VALUE APPROACH for decisions)

In practice, however, computer output gives P-values (significance = sig.), and hence interpretation is based on the P-value. The P-value is directly connected to the value of the test statistic: it is the probability of observing the OBTAINED VALUE of the test statistic (computed from the available data), OR A VALUE MORE EXTREME THAN THAT (in absolute terms), if the null hypothesis is true (loosely speaking, a measure of how compatible the sample data are with the null hypothesis). The lower the P-value, the less likely it is that the null hypothesis holds. Therefore, low P-values lead us to reject the null hypothesis.

The basic question to ask ourselves is: how low should the P-value be for us to reject the null hypothesis? The general rule is that it should be equal to or lower than 5% (i.e. 0.05), because the 5% level of significance is usually taken as the standard starting point. This means that P-values larger than 5% (0.05) lead us to ACCEPT the null hypothesis, therefore declaring that there is NO significant difference between groups (if comparing groups) or NO significant association/relationship between variables (if testing an association/relationship), etc.

Hints for interpretation of significance (sig.) or P-values

P > 0.05 = non-significant (the null hypothesis is accepted; there is NO significant difference/relationship) (i.e. NS)

P ≤ 0.05 = significant at P ≤ 0.05 (the null hypothesis is rejected; there is a significant difference/relationship at P ≤ 0.05) (i.e. *)

P < 0.01 = significant at P < 0.01 (the null hypothesis is rejected; there is a significant difference/relationship at P < 0.01) (i.e. **)

P < 0.001 = significant at P < 0.001 (the null hypothesis is rejected; there is a significant difference/relationship at P < 0.001) (i.e. ***)

From the above, it is important to note that for P > 0.05 (values above 0.05) we agree with the claim of the null hypothesis,

AND for P ≤ 0.05, P < 0.01 and P < 0.001 (values equal to or below 0.05) we disagree with the claim of the null hypothesis and hence accept the alternative claim made by the alternative hypothesis.

One-tailed vs two-tailed test

- Hypothesis testing can be one-tailed (a directional test) or two-tailed (non-directional), depending on whether the decision to reject/accept the null hypothesis is made on one tail of the distribution or on both tails.
- A one-tailed test is applicable when you know (have prior information on) the direction of the effect/relationship, while a two-tailed test is appropriate when you do not have this prior information.



- In most cases the test can be classified as one-tailed or two-tailed depending on how the alternative hypothesis is stated.



Example: Consider the performance of male and female students in a Statistics module (we are comparing two groups on average performance in the Statistics module).

The null hypothesis would be;


Ho: There is no significant difference between male and female students on average
performance in Statistics.
The alternative hypothesis would be;
Ha: Average performance of male students in Statistics is significantly different from that
of female students (i.e. males perform differently from females in Statistics).
(For the case of two tailed test)

If it is a one-tailed test, the alternative hypothesis would be:

Ha: Male students have a significantly higher average performance in Statistics than female students (i.e. males perform better than females in Statistics) (for the case of a right-tailed test)

OR
Ha: Male students have a significantly lower average performance in Statistics than female students (i.e. males perform worse than females in Statistics) (for the case of a left-tailed test)

Computer output for most statistical tests, and in most statistical software, gives the p-value (sig.) for a two-tailed test. To get the corresponding p-value for a one-tailed test, just divide the two-tailed value by 2.
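The conversion is a one-line calculation; the p-value below is an illustrative figure, and the halving is only appropriate when the observed effect lies in the hypothesized direction.

```python
# Converting a two-tailed p-value to one-tailed, as described above.
# The value 0.008 is illustrative; halving is valid only when the
# observed effect is in the direction stated by Ha.
p_two_tailed = 0.008
p_one_tailed = p_two_tailed / 2
print(p_one_tailed)
```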

Dependent vs Independent variables


Dependent variable (Y) = response variable
Independent variable(s) (Xs) = predictor variable(s), i.e. variable(s) that influence Y



In some statistical tests you may be required to specify which variable is dependent and which is/are independent, while in other tests this specification is not required (not applicable).

COMMON INFERENTIAL STATISTICAL METHODS IN SOCIAL SCIENCES

A. TESTING OF HYPOTHESES ABOUT POPULATION MEAN(S) (COMPARING GROUP MEANS)

1. ONE-SAMPLE t-TEST

- This test concerns testing whether a population mean differs from a particular specified value (μ0).

- Example: An economist wants to know if the per capita income of a particular region is the same as the national average.

- This test is parametric. Non-parametric counterparts include the median test/binomial test, which tests whether a sample is drawn from a population with a median equal to a particular value.

- This test is suitable for continuous data, i.e. data for which computing a mean makes sense.

- Test statistic:

  t = (sample mean - hypothesized mean) / (standard error of the sample mean)

In this test (one-sample t-test), the hypotheses would be:

Ho: Population mean is equal to μ0
Ha: Population mean is not equal to μ0
(for the case of two-tailed test)

Or
Ha: Population mean is higher than μ0 (right tailed test) Or Population mean is lower
than μ0 (left tailed test)
(for the case of one tailed test)



Example: consider the following example on the average (mean) weight of second year
students at IRDP. Someone may wish to test the claim that the average weight of second year
students is 50 kg.

The hypotheses could be;


Ho: Average weight of second year students at IRDP is 50kg
Ha: Average weight of second year students at IRDP is not 50kg
(if he/she wants to perform a two-tailed test)

However, in case of one tailed test, alternative hypotheses (Ha) would be;

Ha: Average weight of second year student at IRDP is higher than 50kg (for right tailed
test). Or Ha: Average weight of second year student at IRDP is lower than 50kg (for left
tailed test).
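Although SPSS does this arithmetic for you, the formula above is easy to verify by hand. Below is a minimal Python sketch (the weights and the function name are illustrative, not taken from the IRDP data) that computes the one-sample t statistic and its degrees of freedom:

```python
import math
from statistics import mean, stdev

def one_sample_t(data, mu0):
    """t = (sample mean - hypothesized mean) / (standard error of the mean)."""
    n = len(data)
    se = stdev(data) / math.sqrt(n)  # standard error of the sample mean
    return (mean(data) - mu0) / se, n - 1  # df = n - 1

# Hypothetical weights (kg) of eight second-year students; H0: mean weight = 50 kg
weights = [52, 49, 55, 51, 48, 53, 50, 54]
t, df = one_sample_t(weights, 50)  # t is about 1.73 with 7 df
```

The resulting t is compared against the t distribution with n − 1 degrees of freedom to obtain the p-value (reported by SPSS as Sig.).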

Procedures for One-sample t-test in SPSS:

Analyze → Compare Means → One-Sample T Test → select the variable under study
(variable of interest) and put it in the Test Variable(s) box → specify the
test value in the Test Value box → OK



Example:
In the tables below are computer outputs for the null hypothesis that average annual
household income (in '000) before a particular project (Income1) for rural households of
Dodoma is 1,000 vs the alternative hypothesis that it is not 1,000. The study involved a
sample of 150 households;
File name: Workshop SPSS working file1

T-Test
One-Sample Statistics

Variable                               N    Mean       Std. Deviation  Std. Error Mean
annual income before project ('000)   150  1189.2200  859.22329       70.15529



One-Sample Test (Test Value = 1000)

Variable                               t      df   Sig. (2-tailed)  Mean Difference  95% CI Lower  95% CI Upper
annual income before project ('000)   2.697  149  .008             189.2200         50.5922       327.8478

Recall on hints for interpretation of significance (sig.) or P-values

P> 0.05 = Non- significant at P >0.05 (i.e. null hypothesis is accepted, therefore there is
NO significant difference/relationship) (i.e. NS)

P ≤ 0.05 = Significance at P < 0.05 (null hypothesis is rejected at P< 0.05 therefore, there
is significant difference/relationship at P< 0.05) (i.e. *)

P < 0.01 = Significance at P < 0.01 (null hypothesis is rejected at P< 0.01 therefore, there
is significant difference/relationship at P< 0.01) (i.e. **)

P < 0.001 = Significance at P < 0.001(null hypothesis is rejected at P< 0.001 therefore,
there is significant difference/relationship at P< 0.001) (i.e. ***)
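These reporting conventions can be summarized in a small helper function. The Python sketch below (the function name is illustrative) reproduces the NS/*/**/*** coding described above:

```python
def significance_code(p):
    """Map a p-value to the conventional significance code."""
    if p < 0.001:
        return "***"  # significant at P < 0.001
    if p < 0.01:
        return "**"   # significant at P < 0.01
    if p <= 0.05:
        return "*"    # significant at P <= 0.05
    return "NS"       # non-significant (P > 0.05)
```

For example, the sig. of .008 in the one-sample output above would be coded "**".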

From the above, it is important to note that: for P >0.05 (i.e. values above 0.05) we
agreed with claims by null hypothesis;

AND for P ≤ 0.05, P < 0.01 and P < 0.001 (i.e. values equal to or below 0.05) we
disagreed with claims by null hypothesis and hence took/agreed with alternative claims
by alternative hypothesis.

Therefore, for the above output it can be said that: Average annual household income
for a rural household of Dodoma is significantly different from 1,000 (in '000) (t = 2.70,
P < 0.01).

Note1: Depending on preference, d.f. can also be reported in the above sentence (i.e. t =
2.70, df = 149, P < 0.01), or only the P-value reported (i.e. P < 0.01)

Note 2: Suppose sig. in the last table is 0.031 or 0.025, or 0.016 (i.e. values which are
also below 0.05) it would also imply presence of significance difference.

Note3: suppose sig. in the last table is 0.075 or 0.082 or 0.154 (i.e. values above 0.05), it
would imply that there was no significant difference.



Note4: The computer generates results of sig. for a two-tailed test by default. To get results for
a one-tailed test, just divide the two-tailed sig. by two (2), i.e. for the above table, the one-
tailed sig. would be 0.008/2 = 0.004. (WHY? Consider the number of alternatives vs the chances
of success!! WARNING: don't confuse the level of significance with the P-value/sig.; they are not
the same!!)
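The divide-by-two rule can be sketched as a tiny helper. One caution beyond the note above: halving the two-tailed sig. is only appropriate when the observed effect lies in the direction stated by the one-tailed alternative; if it lies in the opposite direction, the one-tailed p-value is 1 − p/2. A minimal Python sketch (the function name is illustrative):

```python
def one_tailed_p(p_two_tailed, effect_in_hypothesized_direction=True):
    """Convert a two-tailed p-value (sig.) to its one-tailed counterpart."""
    half = p_two_tailed / 2
    return half if effect_in_hypothesized_direction else 1 - half

p_one = one_tailed_p(0.008)  # 0.008 / 2 = 0.004, as in the example above
```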

Note5: If a result is significant in a two-tailed test, it would also be significant in a one-tailed
test (in the hypothesized direction). But the reverse is not necessarily true (i.e. a significant
one-tailed result does not guarantee a significant two-tailed result).

Note 6: Sig. in computer output is rounded to 3 decimal places. Therefore sig. of 0.000
could possibly be 0.000067 or 0.00028 etc.

Summarizing the output


The above output from SPSS can be summarized in simple table for presentation in a
report as follows;

Results for one-sample t-test for average annual income before project
(test value in ‘000 Tsh. = 1000)
N Mean Std dev. (SD)
150 1189.2 859.2
t-value = 2.697, Significance = 0.008 (or P<0.01).

Exercise
Using the same file (Workshop SPSS working file1), test the hypothesis (null
hypothesis) that average annual household income before project (in’000 Tsh.) was 1200.
Note: Summarize your output and interpret your results.

WHEN WE WANT TO COMPARE TWO GROUPS (i.e. When we want to compare TWO MEANS)

2. TWO INDEPENDENT SAMPLES t –test

This test is concerned with testing equality of means for two groups (i.e. comparing two
groups). It is involved in answering questions like "are the mean values for the two groups
similar?"
In this test it is assumed that the two samples are from two independent populations.

This test is parametric. Non-parametric counterparts include: the Mann-Whitney U test



Dependent variable: Continuous-interval/ratio- can compute mean, s.d (This is a
response variable). Sometimes called criterion variable

Independent variable: Categorical (This specifies groups for comparison). Sometimes
called comparison-group variable. In this test, the independent variable has two
levels/groups for comparison.

Test statistic = t

Hypotheses in this test are as follows;

Ho: There is no significant difference in averages (means) for two groups


Ha: There is significant difference in averages (means) for two groups
(For the case for two tailed test)

If it is one tailed test, alternative hypothesis would be;

Ha: Mean for group A is significantly higher than mean for group B (for right tailed test)

OR
Ha: Mean for group A is significantly lower than mean for group B (for left
tailed test).

Example: consider our previous example on average (mean) performance of students in
Statistics;

The null hypothesis would be;


Ho: There is no significant difference between male and female students on average
performance in Statistics.
The alternative hypothesis would be;
Ha: Average performance of male students in Statistics is significantly different from that
of female students (i.e. males perform differently from females in Statistics).
(For the case of two tailed test)

If it is one tailed test, alternative hypothesis would be;



Ha: Male students have significantly higher average performance in Statistics compared
to female students (i.e. males perform better than females in Statistics). (for the case of
right tailed test)

OR
Ha: Male students have significantly lower average performance in Statistics compared
to female students (i.e. males perform worse than females in Statistics). (for the case of
left tailed test)
Procedures for independent samples t-test in SPSS:

Analyze → Compare Means → Independent-Samples T Test → select a
response/criterion variable (the dependent variable) and put it in the Test Variable(s)
box → select a comparison-group variable (the independent variable) and put it in the
Grouping Variable box → click the Define Groups button and specify the groups to be
compared → Continue → OK



Example;
In the tables below are computer outputs for the null hypothesis that average annual household
income (in '000) before a particular project (Income1) for male headed households is the
same as that of female headed households vs the alternative hypothesis that they are
different. The study involved a sample of 150 rural households of Dodoma.

File name: Workshop SPSS working file1

T-Test
Group Statistics

annual income before project ('000)
sex of household head   N    Mean       Std. Deviation  Std. Error Mean
male                    92   1618.9022  740.51248       77.20376
female                  58   507.6552   532.65886       69.94153

Independent Samples Test: annual income before project ('000)

                             Levene's Test for       t-test for Equality of Means
                             Equality of Variances
                             F        Sig.    t       df       Sig. (2-tailed)  Mean Difference  Std. Error Difference  95% CI Lower  95% CI Upper
Equal variances assumed      28.080   .000    9.920   148      .000             1111.2470        112.02601              889.86989     1332.624
Equal variances not assumed                   10.667  145.356  .000             1111.2470        104.17408              905.35539     1317.139

From the above output it can be concluded that average annual household income before
project for male headed households was significantly different from that of female
headed households (t= 10.67, P < 0.001).

Note1: if P value (sig.) for Levene’s test is P ≤ 0.05 (i.e. equal or below 0.05) use last row
for the above table in interpretation, and if it is P >0.05 (i.e. it is above 0.05) use the first
row.

Note2: suppose sig. in the last table is any value above 0.05 such as 0.23, it implies that
there was no significant difference.

Note3: As stated earlier, if a result is significant in a two-tailed test, it would also be
significant in a one-tailed test. But the reverse is not necessarily true (i.e. a significant
one-tailed result does not guarantee a significant two-tailed result).

Therefore, for the above output, if your interest was a one-tailed test we could state that
"average annual household income before project for male headed households was
significantly higher than that of female headed households (t = 10.67, P < 0.001)".



Note4: It is a common practice to present results for descriptive statistics and support
(back) them with results for inferential statistics, or vice versa, when describing your
results (if possible/if it improves clarity). For example, for the above output we could say;

Average annual household income before project (in ‘000 Tsh.) was compared for the
two types of household. Results from Table … indicate average annual household income
for male headed households, 1618.9, was significantly higher than that for female headed
households, 507.7, (t= 10.67, P < 0.001).

OR we could say;

Average annual household income before project (in ‘000 Tsh.) was compared for the
two types of household. Results from Table … indicate average annual household income
for male headed households (1618.9) was higher than that for female headed households
(507.7). Results for t-test indicate the difference to be significant (t= 10.67, P < 0.001).

(i.e. we used brackets instead of commas in specifying values for averages/means;
choice of style depends on preference)

Summarizing the output from SPSS


There are several styles. The following style can also be used.

A case of single test (response) variable


Average annual household income before project among
male and female headed households (in ‘000 Tsh.)
Household head N Mean ± SD
Male 92 1618.9 ± 740.5
Female 58 507.7± 532.7
t- value = 10.67, Significance = 0.000 (or P< 0.001).

OR

Average annual household income before project among
male and female headed households (in '000 Tsh.)
Household head N Mean SD
Male 92 1618.9 740.5
Female 58 507.7 532.7
t- value = 10.667, Significance = 0.000 (or P < 0.001).



A case of several test (response) variables

Example1;
Example, apart from annual income before project suppose we compared the households
on other variables such as household size (number of individuals in a household), farm
size (acres) and number of cattle owned and the following output was obtained.

T-Test
Group Statistics

Variable                               Sex     N    Mean       Std. Deviation  Std. Error Mean
annual income before project ('000)   male    92   1618.9022  740.51248       77.20376
                                       female  58   507.6552   532.65886       69.94153
household size (number)                male    92   5.7283     2.41835         .25213
                                       female  58   8.3793     2.03330         .26699
farm size (acres)                      male    92   5.3478     3.23591         .33737
                                       female  58   2.6983     1.99987         .26260
number of cattle owned                 male    92   33.1522    17.43492        1.81772
                                       female  58   7.2414     9.00158         1.18197

Independent Samples Test

                                                            Levene's Test           t-test for Equality of Means
Variable                               Variances            F        Sig.    t       df       Sig. (2-tailed)  Mean Difference  Std. Error Difference  95% CI Lower  95% CI Upper
annual income before project ('000)   Equal var. assumed   28.080   .000    9.920   148      .000             1111.2470        112.02601              889.86989     1332.624
                                       Not assumed                           10.667  145.356  .000             1111.2470        104.17408              905.35539     1317.139
household size (number)                Equal var. assumed   1.026    .313    -6.942  148      .000             -2.6510          .38190                 -3.40573      -1.89637
                                       Not assumed                           -7.219  136.166  .000             -2.6510          .36722                 -3.37724      -1.92486
farm size (acres)                      Equal var. assumed   11.015   .001    5.595   148      .000             2.6496           .47359                 1.71368       3.58542
                                       Not assumed                           6.198   147.961  .000             2.6496           .42752                 1.80472       3.49438
number of cattle owned                 Equal var. assumed   16.513   .000    10.464  148      .000             25.9108          2.47615                21.01762      30.80397
                                       Not assumed                           11.950  143.317  .000             25.9108          2.16821                21.62499      30.19660



The above output can be summarized as follows;

Mean values for various variables for male and female headed households

                                      Mean ± SD
Variable                              Male             Female          t-value   Sign.
Annual household income before
project ('000 Tsh.)                   1618.9 ± 740.5   507.7 ± 532.7   10.667    ***
Household size (number)               5.7 ± 2.4        8.4 ± 2.0       -6.942    ***
Farm size (acres)                     5.3 ± 3.2        2.7 ± 2.0       5.595     ***
Number of cattle                      33.2 ± 17.4      7.2 ± 9.0       10.464    ***
*** = Significant at P < 0.001.
Note: We can also have NS, * and ** to imply non-significant, significant at P<0.05, and
Significance at P<0.01, respectively

How do we report/interpret these results?

This study compared the two types of households in terms of average annual household
income before project (in 000 Tsh), household size, farm size (acres) and number of
cattle owned. Results from Table… indicate significant differences between the two types
of households in these variables. Male headed households had significantly higher
average annual household income (t=10.67, P < 0.001), average farm size (t = 5.60, P <
0.001), and average number of cattle owned (t = 10.46, P < 0.001) compared to female
headed households. In contrast, female headed households had a significantly
higher average household size compared to male headed households (t = -6.94, P <
0.001).
Exercise;
Re-write the above text by indicating values for the means (use an allowed format; to achieve
this, try to read already published papers/research reports that used that style)

Example 2: Nutritional status of exclusively breastfed children vs non-exclusively
breastfed children

Results from Table 3 indicate exclusive breastfeeding had significantly improved Height-
for-Age z-scores (HAZ) and Weight-for-Height z-scores (WHZ). The exclusively
breastfed children had significantly higher average HAZ (t = 2.35, p < 0.05) and WHZ (t =
7.5, p < 0.01) than children who were not exclusively breastfed. Mean values for HAZ and
WHZ for exclusively breastfed children were 3.7 and 4.7, respectively. The corresponding
values for non-exclusively breastfed children were 3.3 and 3.8.

Exercise
Using the file “Workshop SPSS working file1”, Compare the male headed and female
headed households on mean age and mean annual income after project.

Note: Summarize your output and interpret your results.

3. PAIRED SAMPLES t-test (or Dependent samples t-test)

This test is also concerned with comparing means for two groups/two situations/two
periods.

It applies to paired observations (i.e. the two samples are from two related populations).

Example:
- Data taken from the same individual/household at two different periods (i.e.
before and after intervention).

- The HR manager wants to know if a particular training program had any impact in
increasing the motivation level of the employees.

This test is a parametric test. Non-parametric counterparts include: ……

Dependent variable: Continuous (interval/ratio) - can compute mean, s.d. (This is a
response variable).

Independent variable: Period/situation

In the SPSS computer program you don't need to specify which variable is dependent and
which is independent, as we saw for the case of the independent samples t-test.
(Note how data are entered for this analysis vs those for the independent samples t-test:
the formats differ, i.e. data are entered differently)



Test statistic = t

t = (d̄ × √n) / sd,

where d is the difference between pairs, d̄ = mean of the differences, sd = standard
deviation of the differences, and n = number of pairs.

Hypotheses in this test are as follows;

Ho: Mean (average) for period1 is NOT significantly different from mean for period2
Ha: Mean (average) for period1 is significantly different from that of period2

(For the case for two tailed test)

If it is one tailed test, alternative hypothesis would be;

Ha: Mean for period1 is significantly higher than mean for period2 (for right tailed test)

OR
Ha: Mean for period1 is significantly lower than mean for period2 (for left tailed test)
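The formula above works on the within-pair differences. Below is a minimal Python sketch (the five before/after incomes are hypothetical, much smaller than the Dodoma data set):

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """Paired-samples t: t = (mean of differences) * sqrt(n) / (sd of differences)."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    return mean(diffs) * math.sqrt(n) / stdev(diffs), n - 1

# Hypothetical annual incomes ('000 Tsh) for five households before and after a project
before = [900, 1100, 1000, 1250, 980]
after = [950, 1180, 1020, 1330, 1040]
t, df = paired_t(before, after)  # negative t here: incomes rose after the project
```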

Example: Suppose someone is interested in comparing average annual household income
before and after implementation of a particular project for improving income of
smallholder farmers

The null hypothesis would be;


Ho: There is no significant difference on average household income before and after
project
The alternative hypothesis would be;
Ha: There is significant difference on average household income before and after project
(For the case of two tailed test)

If it is one tailed test, alternative hypothesis would be;

Ha: Average annual household income before project is significantly higher than that
after project (for the case of right tailed test)



OR
Ha: Average annual household income before project is significantly lower than that after
project (for the case of left tailed test)

Procedures for paired samples t- test in SPSS:

Analyze → Compare Means → Paired-Samples T Test → select the two variables for
comparison (click the first variable and then the second variable) and put them in the
Paired Variables box → OK



Example;
In tables below are computer outputs for null hypothesis that average annual household
income (in ‘000) before project (Income1) is the same as that after project (Income2) vs
alternative hypothesis that they are different. The study involved a sample of 150 rural
households of Dodoma.

File name: Workshop SPSS working file1

T-Test
Paired Samples Statistics

Pair 1                                 Mean       N    Std. Deviation  Std. Error Mean
annual income before project ('000)   1189.2200  150  859.22329       70.15529
annual income after project ('000)    1248.1800  150  868.43369       70.90731

Paired Samples Correlations

Pair 1: annual income before project ('000) & annual income after project ('000)
N = 150, Correlation = .998, Sig. = .000



Paired Samples Test

Pair 1: annual income before project ('000) - annual income after project ('000)

Paired Differences
Mean      Std. Deviation  Std. Error Mean  95% CI Lower  95% CI Upper    t        df   Sig. (2-tailed)
-58.9600  60.47040        4.93739          -68.7163      -49.2037        -11.942  149  .000

From the above output it can be said that; average annual household income before
project was significantly different from that after project (t = -11.94, P < 0.001). While
average annual household income before project (in ‘000 Tsh) was 1189.2, the
corresponding average annual household income after project was 1248.2.

Or we could say;

Average annual household income before project (1189.2) was significantly different
from that after project (1248.2), both in ‘000 Tsh (t = -11.94, P < 0.001).

(Note: You can as well use comma before mean/average instead of brackets)

Note1: Results are also significant at one-tailed test; In this regard if our interest was
one-tailed test we could say; income after project was significantly higher than income
before project (i.e. income improved significantly after project) (t = - 11.94, P < 0.001).
While average annual household income before project (in ‘000 Tsh) was 1189.2, the
corresponding average annual household income after project was 1248.2.

Or we could say;
Average annual household income before project (1189.2) was significantly lower than
that after project (1248.2), both in ‘000 Tsh (t = -11.94, P < 0.001).

(Note: You can as well use comma before mean/average instead of brackets)

Note2: suppose sig. in the last table is any value above 0.05 such as 0.083, it implies that
there was no significant difference.



Summarizing the SPSS output

A case of single test (response) variable


Average annual household income before project and after
project (in ‘000 Tsh.)
Period N Mean ± SD
Before Project 150 1189.2 ± 859.2
After Project 150 1248.2 ± 868.4
t- value = -11.942, Significance = 0.000 (or P< 0.001).

OR
Average annual household income before project and after
project (in ‘000 Tsh.)

Period N Mean SD
Before Project 150 1189.2 859.2
After Project 150 1248.2 868.4
t- value = -11.942, Significance = 0.000 (or P< 0.001).

A case of several test (response) variables


Tables will be summarized as in the case of the independent samples t-test, with the
exception that in this case our grouping variable is Period (Before project and After
project) cf. Sex of household head (Male and Female) in the example given for the
independent samples t-test.

WHEN WE WANT TO COMPARE MORE THAN TWO GROUPS (i.e. When we want to compare MORE THAN TWO MEANS)

These tests are concerned with testing equality of means for more than two groups (i.e
comparing more than two groups).

Statistical tests under this category employ analysis of variance (ANOVA) to test equality
of means

Dependent variable: Continuous (interval/ratio) - can compute mean, s.d. (i.e.
computation of a mean makes sense!!!). This is a response variable.



Independent variable: Categorical (This specifies groups for comparison). For tests in
this class, independent variable is called a FACTOR. The independent variable has more
than two levels/groups for comparison

The simplest test under this class is One-way Analysis of Variance (ANOVA I). This
test can be followed by mean separation tests such as Duncan Multiple range test,
Tukey’s test, LSD test etc.

Other ANOVAs include Two-way ANOVA, Latin Square Design, Split Plot Design,
Factorial Design, and ANOVA for repeated measures. These are a bit complex and most
of them are very popular in Agricultural Sciences.

These tests are parametric. Non-parametric counterparts include: the Kruskal-Wallis
Test for ANOVA I and Friedman's Test for ANOVA II.

In this class we will concentrate on ANOVA I

- Test statistic = F

- The larger the F-ratio, the greater is the difference between groups as compared to within
group differences.

The ANOVA procedure can be used correctly if the following conditions are satisfied:
1. The dependent variable should be interval or ratio data type.
2. The populations should be normally distributed (i.e. parametric) and the population variances
should be equal.

4. ONE- WAY ANOVA (ANOVA I)


- It assumes the response/criterion variable is influenced by ONE factor
- A factor has several levels/categories (more than two) that entail groups for
comparison. These levels or categories are sometimes called samples or
treatments.

Hypotheses in this test are as follows;

Ho: There is no significant difference in averages (means) for the different groups

Ha: At least one pair of means differs significantly

Example: Suppose someone wants to compare average annual household income for three
districts;



The null hypothesis would be;
Ho: There is no significant difference among the three districts on average (mean) annual
household income.
The alternative hypothesis would be;
Ha: At least one pair of means differs significantly

Note: here we don't have a 2-tailed test.

Procedures for One- Way ANOVA in SPSS: Route 1

Analyze → Compare Means → One-Way ANOVA → select a response variable, i.e. the
dependent (criterion) variable, and put it in the Dependent List box → select a
comparison-group variable (independent variable) and put it in the Factor box →
Options → select Descriptive (descriptive statistics) → Continue → OK

You can as well generate results for mean separation tests by clicking the Post Hoc button
and selecting a test you want, e.g. Duncan.

Procedures for ANOVA I in SPSS: Route 2

Analyze → General Linear Model → Univariate → etc.



Example: Using route 1

In the tables below are computer outputs for the null hypothesis that average annual household
income (in '000) before project (Income1) is the same for all three districts under study
(i.e. Bahi, Chamwino, Kongwa) vs the alternative hypothesis that there are differences (or at
least one pair of means differs significantly). The study involved samples of 45, 56 and
49 rural households of Bahi, Chamwino and Kongwa districts in Dodoma region,
respectively.

File name: Workshop SPSS working file1



Oneway
Descriptives: annual income before project ('000)

           N    Mean       Std. Deviation  Std. Error  95% CI Lower  95% CI Upper  Minimum  Maximum
Bahi       45   953.4000   927.91304       138.32511   674.6241      1232.1759     150.00   3025.00
Chamwino   56   857.1607   509.43112       68.07560    720.7342      993.5873      170.00   3405.00
Kongwa     49   1785.2857  813.37986       116.19712   1551.6557     2018.9157     180.00   3105.00
Total      150  1189.2200  859.22329       70.15529    1050.5922     1327.8478     150.00   3405.00

ANOVA: annual income before project ('000)

                Sum of Squares  df   Mean Square  F       Sig.
Between Groups  26086669        2    13043334.69  22.849  .000
Within Groups   83914764        147  570848.737
Total           1.10E+08        149

From the above output it can be said that: Results for ANOVA indicate a significant overall
difference between districts on average annual household income before project (F =
22.85, P < 0.001). Recorded in '000 Tsh, Kongwa district had the highest mean value
(1785.3), followed by Bahi (953.4), and Chamwino had the lowest mean value (857.2).

Or

Results indicate Kongwa district had the highest average annual household income,
followed by Bahi, with Chamwino district having the lowest average. Mean values for the
three districts in '000 Tsh were 1785.3, 953.4 and 857.2 for Kongwa, Bahi and
Chamwino, respectively. Results for ANOVA indicate the overall difference to be
significant (F = 22.85, P < 0.001)

Note: suppose sig. in the last table is any value above 0.05 such as 0.146, it implies that
there was no significant difference.



Results for Post- Hoc test (Mean separation test) using Duncan Multiple Range Test

Note: These tests are performed when ANOVA show significance overall difference
between samples/ treatments.

Post Hoc Tests

Homogeneous Subsets
annual income before project ('000)
Duncan(a,b)
                            Subset for alpha = .05
District of residence   N   1          2
Chamwino                56  857.1607
Bahi                    45  953.4000
Kongwa                  49             1785.2857
Sig.                        .527       1.000
Means for groups in homogeneous subsets are displayed.
a. Uses Harmonic Mean Sample Size = 49.597.
b. The group sizes are unequal. The harmonic mean of the group sizes is used.
Type I error levels are not guaranteed.

How do we incorporate results of mean separation test in our writing?


From the above output it can be said that: Results for ANOVA indicate a significant overall
difference between districts on average annual household income before project (F =
22.85, P < 0.001). Recorded in '000 Tsh, Kongwa district had the highest mean value
(1785.3), followed by Bahi (953.4), and Chamwino had the lowest mean value (857.2).
However, following a Mean Separation Test using the Duncan test, the difference between
Bahi and Chamwino was not significant (P > 0.05).



Summarizing the output from SPSS

The same way as in the case of the independent samples t-test; additionally, there would
be similar superscript letters for similar means as revealed by mean separation tests such
as Duncan;

A case of single test (response) variable

Example 1:
Table.. Mean annual household income before project by district
District    N    Means ± S.D
Bahi        45   953.4 ± 927.9 b
Chamwino    56   857.2 ± 509.4 b
Kongwa      49   1785.3 ± 813.4 a
a,b Means with different superscript letters are significantly different (P<0.05)
S.D = Standard deviation

Example 2; A case of profit from various sources among dairy farmers in Kayanga ward
Karagwe district.

Table…: Annual profit for different enterprises by dairy farming households


Enterprise No. of household Annual profit per household per
in the enterprise annum (Tsh.) (Means ± S.D)
Dairy 38 1,000,150 ± 388,770a
Crop farming 38 1,032,432 ± 489,277a
Small scale businesses 8 950,000 ± 542,481a
(i.e. off-farm activities)
Other type of livestock 12 131,833 ± 114,290b
a,b Means with different superscript letters are significantly different (P<0.05)
S.D = Standard deviation



Exercise
Using the same file (Workshop SPSS working file1), Compare the districts on average
annual household income after project, farm size, and number of cattle owned.

Note: Summarize your output and interpret your results.

5. TWO- WAY ANOVA (ANOVA II)


-It assumes two factors influence a response variable
-Would have one response (dependent) variable and two comparison-group
(independent) variables.
-Would have two sets of hypotheses (i.e. one set for each factor)
-Test statistic is also F, but computed for each factor. Therefore need to compute mean
squares for each factor as well as that for error.
-Tabulated values are determined for each factor and are obtained from statistical tables as
with ANOVA I
-Conclusion is made by comparing calculated Fs vs tabulated Fs

 Procedures in SPSS:
Performed through the Univariate Analysis of Variance option



B. TESTING OF HYPOTHESES ABOUT ASSOCIATION AMONG
VARIABLES

1. TESTING OF HYPOTHESES FOR INDEPENDENCE/ASSOCIATION AMONG CATEGORICAL VARIABLES
- Both dependent and independent variables are categorical
- Categorical variable?: Data are capable of being analysed/summarized into
frequencies and percentages. (i.e. computation of mean does not make sense!!!)
- Data is usually summarized in a contingency table
- The tests are non- parametric
- It is a typical example of bivariate analyses

Example: Pearson Chi- Square Test for independence, McNemar’s Test (for paired
dichotomous-i.e two related samples), Cochran Q Test for three or more related samples,
Mantel-Haenszel Comparison (for 2x2 contingency tables while controlled for a
third variable), Fisher’s Exact Test, LR test.

In this class we will concentrate on the Pearson Chi-square test for independence between
two variables

Note: There is also Chi-Square Test for Goodness of Fit including testing of
homogeneity: Sometimes called one- sample chi-square test. We won’t discuss this.

PEARSON CHI-SQUARE TEST FOR INDEPENDENCE/ASSOCIATION

Test statistic is Chi-square (χ²)

χ² = Σ [(O − E)² / E]

where O = Observed frequency, E = Expected frequency



Warning; It may be tempting to use percentages in the table and calculate the chi-square
test from these percentages instead of the raw observed frequencies. This is incorrect—
don’t do it!

Hypotheses for Pearson Chi-square test are as follows;

Ho: No association between a variable in row and a variable in column in a


contingency table
H1: There is association between variable in row and variable in column

Note:
No One-Sided Tests here!

Notice that the alternative hypothesis above does not assume any "direction." Thus,
there are no one- and two-sided versions of these tests. Chi-square tests are inherently
non-directional ("sort of two-sided") in the sense that the chi-square test simply tests
whether the observed and expected frequencies agree, without regard to whether
particular observed frequencies are above or below the corresponding expected
frequencies.

Validity of the test


The validity of the chi-square test depends on both the sample size and the number of
cells. Several rules of thumb have been suggested to indicate whether the chi-square
approximation is satisfactory. One such rule suggested by Cochran (1954) says that the
approximation is adequate if no expected cell frequencies are less than one and no more
than 20% are less than five.
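The rule of thumb above can be encoded as a small helper. This is my own sketch (the function name `cochran_rule_ok` is hypothetical, not from the notes), assuming Python with NumPy:

```python
import numpy as np

def cochran_rule_ok(observed):
    """Cochran (1954): the approximation is adequate if no expected frequency
    is below 1 and no more than 20% of cells have expected frequencies below 5."""
    observed = np.asarray(observed, dtype=float)
    expected = (observed.sum(axis=1, keepdims=True)
                * observed.sum(axis=0) / observed.sum())
    if (expected < 1).any():
        return False
    return bool((expected < 5).mean() <= 0.20)

print(cochran_rule_ok([[79, 13], [22, 36]]))  # True: minimum expected count is ~18.95
print(cochran_rule_ok([[2, 1], [1, 2]]))      # False: every expected count is 1.5
```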

Procedures for the Chi-square test in SPSS:

Analyze → Descriptive Statistics → Crosstabs → select a variable of interest and put it
in the Row(s) box → select a variable of interest and put it in the Column(s) box →
click the Cells button and specify how percentages should be computed (within column
or within row?; hint: compute % within the groups you want to compare) → Continue
→ click the Statistics button and choose Chi-square → Continue → OK



Note: There are two main routes for accomplishing the above analysis. You can also
accomplish it via Analyze → Nonparametric Tests → Chi-Square, etc. For this route you
need to weight your cases. Furthermore, in this route data are entered in a different
style compared to the first (above) route. However, the above route (i.e. via Crosstabs)
is the most common one. Therefore, we are going to see an example using the first
route (Crosstabs).

Example;

The tables below show computer output for the null hypothesis that there is no
association between sex of household head and access to credit, versus the alternative
hypothesis that there is an association. The study involved 150 rural households in
Dodoma.

File name: Workshop SPSS working file1

Crosstabs

sex * Access to credit Crosstabulation

                              Access to credit
                          received      not received
                          credit        credit          Total
sex     male    Count     79            13              92
                %         78.2%         26.5%           61.3%
        female  Count     22            36              58
                %         21.8%         73.5%           38.7%
Total           Count     101           49              150
                %         100.0%        100.0%          100.0%
(% within Access to credit)



Chi-Square Tests

                               Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                               (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             37.167(b)  1    .000
Continuity Correction(a)       35.020     1    .000
Likelihood Ratio               37.598     1    .000
Fisher's Exact Test                                          .000         .000
Linear-by-Linear Association   36.919     1    .000
N of Valid Cases               150

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 18.95.

Interpreting/reporting results; Example 1 (above output)

From the above output, based on the Pearson chi-square it can be said that there was a
significant association between sex of household head and access to credit (χ² = 37.17,
P < 0.001). Male-headed households were more likely to access credit than
female-headed households. The majority of credit receivers were male (78.2%),
whereas most non-receivers were female (73.5%).

Note1: In the contingency table above we computed % within credit status (received vs
not received). However, if we had computed % within sex we would have the output
below (results for the statistical test, i.e. chi-square, would be the same as in the
previous analysis and are therefore not presented), and we could interpret the results as
follows.

Crosstabs

sex of household head * Access to credit Crosstabulation

                                       Access to credit
                                   received      not received
                                   credit        credit          Total
sex of     male    Count           79            13              92
household          % within sex    85.9%         14.1%           100.0%
head       female  Count           22            36              58
                   % within sex    37.9%         62.1%           100.0%
Total              Count           101           49              150
                   % within sex    67.3%         32.7%           100.0%

From the above output, based on the Pearson chi-square it can be said that there was a
relationship between sex of household head and access to credit (χ² = 37.17, P < 0.001),
in which most of the male-headed households had access to credit compared to
female-headed households (85.9% vs 37.9%).



The important thing to note here is that the message we get from the two approaches
for computing % is the same: males had more access to credit than females (i.e. males
were more likely to access credit than females). The choice of approach for computing
% depends on preference. However, I prefer computing % within categories of the
independent variable (the last approach).

Note2: Suppose sig. for the Pearson chi-square in the last table were any value above
0.05, such as 0.347; this would imply that there was no significant association.

Further examples on how to Report/interpret results for Chi-square test

Example2; Transaction sex among youths

Table... Percentage of respondents who had exchanged sex for money

Have you exchanged sex
for money or gift?        Yes           No            Total
Male                      12 (16.0%)    63 (84.0%)    75 (100.0%)
Female                    24 (21.6%)    87 (78.4%)    111 (100.0%)
Total                     36            150           186
χ² = 0.91, df = 1, P = 0.341

The percentage of respondents who reported having received or given a gift or money
in exchange for sex was higher for females (21.6%) than for males (16.0%). However,
this difference was not statistically significant (χ² = 0.91, P = 0.341).

Example3: Parental communication on sexual and reproductive health

Table..: Proportion of respondents discussing with their parents about sexuality and
early pregnancies

Do your parents/guardians
discuss with you the impact
of early pregnancies and
sexuality?                     Yes           No
Male                           24 (32.0%)    51 (68.0%)
Female                         81 (73.0%)    30 (27.0%)
χ² = 30.56, df = 1, P < 0.01

The majority of female respondents reported that their parents/guardians discuss with
them the impact of pregnancies and sexuality. The percentage of females who discuss
with their parents/guardians the impacts of early pregnancies and sexuality (73%) was
higher than that for males (32%). The difference between male and female respondents
regarding discussion of the impact of early pregnancies and sexuality was significant
(χ² = 30.56, P < 0.01).

Example 4: Whether a woman was given post-partum care vs place of delivery

Note3: The chi-square test cannot be applied when more than 20% of cells have
expected frequencies of less than 5, or when any cell has an expected frequency of less
than 1. In such a situation we may need to collapse (merge) some categories.

Note4: Unlike correlation coefficients, chi-square does not convey information about
the strength of a relationship. That is, a large chi-square value with a correspondingly
strong significance level (e.g. P < 0.001) cannot be taken to mean a closer relationship
between two variables than when the chi-square is considerably smaller but moderately
significant (e.g. P < 0.05). What chi-square tells us is how confident we can be that
there is a relationship between the two variables.



Note 5: For a 2 x 2 contingency table, the chi-square value tends to be overestimated,
hence the Yates correction for continuity reported in the SPSS output. To study the
association between two dichotomous variables, the phi coefficient (φ) is often
preferred over chi-square. This statistic (φ) is similar to a correlation coefficient, i.e. it
gives the strength of the relationship and varies from 0 to 1.

Summarizing the output from SPSS

A case of single test (response) variable.

Already shown in some tables above. However, the standard format for most reports is
as below;

Table.. Distribution of respondents by access to credit for male and female headed
households

                    Received credit       Not received credit
Gender of hh        (n = 101)             (n = 49)
head                Frequency    %        Frequency    %
Male                79           78.2     13           26.5
Female              22           21.8     36           73.5
Total               101          100.0    49           100.0
Chi-square value = 37.167, df = 1, Significance = 0.000 (or P < 0.001)



A case of several test (response) variables

Example 1: A case of factors influencing current use of modern contraceptive (MC)


among married women in Sukumaland

Crosstabs

age of woman * if she ever used modern contraceptives (MC) - is she a current user

                              Current user?
age of woman          yes         no          Total
30 and below  Count   32          63          95
              %       64.0%       57.3%       59.4%
> 30          Count   18          47          65
              %       36.0%       42.7%       40.6%
Total         Count   50          110         160
              %       100.0%      100.0%      100.0%
(% within if she is a current user)

Chi-Square Tests

                               Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                              (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             .645(b)   1    .422
Continuity Correction(a)       .396      1    .529
Likelihood Ratio               .650      1    .420
Fisher's Exact Test                                         .489         .265
Linear-by-Linear Association   .641      1    .423
N of Valid Cases               160

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 20.31.



education level - woman * if she ever used modern contraceptives (MC) - is she a
current user

                                   Current user?
education level - woman    yes         no          Total
primary or below   Count   30          105         135
                   %       60.0%       95.5%       84.4%
sec and above      Count   20          5           25
                   %       40.0%       4.5%        15.6%
Total              Count   50          110         160
                   %       100.0%      100.0%      100.0%
(% within if she is a current user)

Chi-Square Tests

                               Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                               (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             32.776(b)  1    .000
Continuity Correction(a)       30.142     1    .000
Likelihood Ratio               30.707     1    .000
Fisher's Exact Test                                          .000         .000
Linear-by-Linear Association   32.571     1    .000
N of Valid Cases               160

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 7.81.



religious affiliation * if she ever used modern contraceptives (MC) - is she a current
user

                                  Current user?
religious affiliation     yes         no          Total
catholic      Count       6           19          25
              %           12.0%       17.3%       15.6%
protestant    Count       40          89          129
              %           80.0%       80.9%       80.6%
moslem        Count       4           2           6
              %           8.0%        1.8%        3.8%
Total         Count       50          110         160
              %           100.0%      100.0%      100.0%
(% within if she is a current user)

Chi-Square Tests

                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             4.118(a)  2    .128
Likelihood Ratio               3.812     2    .149
Linear-by-Linear Association   2.495     1    .114
N of Valid Cases               160

a. 2 cells (33.3%) have expected count less than 5. The minimum expected count is
1.88.



type of marriage * if she ever used modern contraceptives (MC) - is she a current user

                                Current user?
type of marriage        yes         no          Total
monogamy    Count       44          97          141
            %           88.0%       88.2%       88.1%
polygamy    Count       6           13          19
            %           12.0%       11.8%       11.9%
Total       Count       50          110         160
            %           100.0%      100.0%      100.0%
(% within if she is a current user)

Chi-Square Tests

                               Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                              (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             .001(b)   1    .974
Continuity Correction(a)       .000      1    1.000
Likelihood Ratio               .001      1    .974
Fisher's Exact Test                                         1.000        .581
Linear-by-Linear Association   .001      1    .974
N of Valid Cases               160

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 5.94.



Current number of living children * if she ever used modern contraceptives (MC) - is
she a current user

                                            Current user?
Current number of living children   yes         no          Total
3 and below    Count                15          93          108
               %                    30.0%       84.5%       67.5%
4 and above    Count                35          17          52
               %                    70.0%       15.5%       32.5%
Total          Count                50          110         160
               %                    100.0%      100.0%      100.0%
(% within if she is a current user)

Chi-Square Tests

                               Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                               (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             46.620(b)  1    .000
Continuity Correction(a)       44.167     1    .000
Likelihood Ratio               45.987     1    .000
Fisher's Exact Test                                          .000         .000
Linear-by-Linear Association   46.329     1    .000
N of Valid Cases               160

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 16.25.



ethnicity * if she ever used modern contraceptives (MC) - is she a current user

                           Current user?
ethnicity          yes         no          Total
Sukuma    Count    30          64          94
          %        60.0%       58.2%       58.8%
others    Count    20          46          66
          %        40.0%       41.8%       41.3%
Total     Count    50          110         160
          %        100.0%      100.0%      100.0%
(% within if she is a current user)

Chi-Square Tests

                               Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                              (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             .047(b)   1    .829
Continuity Correction(a)       .002      1    .965
Likelihood Ratio               .047      1    .828
Fisher's Exact Test                                         .864         .484
Linear-by-Linear Association   .047      1    .829
N of Valid Cases               160

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 20.63.



If frequently discuss with husband on Family Planning (Spousal communication) * if
she ever used modern contraceptives (MC) - is she a current user

                                  Current user?
Spousal communication     yes         no          Total
Yes       Count           21          5           26
          %               42.0%       4.5%        16.3%
No        Count           29          105         134
          %               58.0%       95.5%       83.8%
Total     Count           50          110         160
          %               100.0%      100.0%      100.0%
(% within if she is a current user)

Chi-Square Tests

                               Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                               (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             35.433(b)  1    .000
Continuity Correction(a)       32.735     1    .000
Likelihood Ratio               33.305     1    .000
Fisher's Exact Test                                          .000         .000
Linear-by-Linear Association   35.212     1    .000
N of Valid Cases               160

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 8.13.



if husband approve Modern contraceptives (MC) * if she ever used modern
contraceptives (MC) - is she a current user

                                   Current user?
if husband approve MC      yes         no          Total
yes       Count            45          50          95
          %                90.0%       45.5%       59.4%
no        Count            5           60          65
          %                10.0%       54.5%       40.6%
Total     Count            50          110         160
          %                100.0%      100.0%      100.0%
(% within if she is a current user)

Chi-Square Tests

                               Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                               (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             28.278(b)  1    .000
Continuity Correction(a)       26.462     1    .000
Likelihood Ratio               32.058     1    .000
Fisher's Exact Test                                          .000         .000
Linear-by-Linear Association   28.102     1    .000
N of Valid Cases               160

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 20.31.



The above output can be summarized in a single table as follows;

Table..: Factors influencing current use of modern contraceptive (MC) among
married women in Sukumaland

Variable                            Current user   Non-current
                                    (n = 50)       user (n = 110)   χ²-value
Age of woman (Years)
  30 and below                      32 (64.0%)     63 (57.3%)       0.65 NS
  Above 30                          18 (36.0%)     47 (42.7%)
Education level
  Primary or below                  30 (60.0%)     105 (95.5%)      32.76***
  Secondary and above               20 (40.0%)     5 (4.5%)
Religious affiliation
  Catholic                          6 (12.0%)      19 (17.3%)       4.12 NS
  Protestant                        40 (80.0%)     89 (80.9%)
  Moslem                            4 (8.0%)       2 (1.8%)
Type of marriage
  Monogamy                          44 (88.0%)     97 (88.2%)       0.01 NS
  Polygamy                          6 (12.0%)      13 (11.8%)
Ethnicity
  Sukuma                            30 (60.0%)     64 (58.2%)       0.047 NS
  Others                            20 (40.0%)     46 (41.8%)
Current number of living children
  3 and below                       15 (30.0%)     93 (84.5%)       46.62***
  4 and above                       35 (70.0%)     17 (15.5%)
Spousal communication on family planning
  Yes                               21 (42.0%)     5 (4.5%)         35.43***
  No                                29 (58.0%)     105 (95.5%)
If husband approves MC
  Yes                               45 (90.0%)     50 (45.5%)       28.28**
  No                                5 (10.0%)      60 (54.5%)

NS, **, *** = Non-significant, significant at P < 0.01, and significant at P < 0.001,
respectively

Interpretation/reporting of results

Results from Table.. indicate that there was significant association between several of
the factors considered in this study and being a current user of modern contraceptives.
Current use of modern contraceptives was significantly associated with education level
(χ² = 32.76, P < 0.001), current number of living children (χ² = 46.62, P < 0.001),
spousal communication on family planning (χ² = 35.43, P < 0.001) and husband
approval of modern contraceptives (χ² = 28.28, P < 0.01). The effects of the other
variables considered in this analysis on current use of modern contraceptives were not
significant (P > 0.05). These include age, religious affiliation, ethnicity and type of
marriage.

(Note: to clarify the association further, we could also say the following;)

Results indicate current users were more likely to have secondary education and above
compared to non-current users. The proportion of respondents with secondary
education and above was 40% for current users and 4.5% for non-current users. Results
also indicate respondents with a large number of living children were more likely to be
current users compared to those with a low number of living children. In this regard,
the majority of current users (70.0%) had at least four children, compared to only a
small proportion (15.5%) of non-current users. Results from Table.. further reveal that
spousal communication on family planning enhanced use of modern contraceptives
among women. A notable proportion of current users (42.0%) reported existence of
spousal communication on family planning in their family (i.e. they communicate with
their husband on family planning issues), compared to only 4.5% of non-current users.
Furthermore, women in families in which the husband approves modern contraceptives
were more likely to be current users of modern contraceptives compared to those in
which the husband does not approve. In this regard, it is evident from Table.. that an
overwhelming majority of current users (90.0%) indicated their husband approves
modern contraceptives, compared to 45.5% for non-current users.

Example 2: Factors influencing utilization of post-natal care.



Example 3: A case of distribution of Dairy and Non-dairy farmers on socio-
demographic variables (general information)

Table… General information of respondents

Variable                 Dairy farmers   Non-dairy         All         χ²-value
                         (n = 38)        farmers (n = 30)  (n = 68)
Sex
  Male                   68.4%           70.0%             69.1%       0.02 NS
  Female                 31.6%           30.0%             30.9%
Age (years)
  < 35                   13.2%           13.3%             13.2%       0.20 NS
  35-50                  55.3%           60.0%             57.4%
  51+                    31.6%           26.7%             29.4%
Marital status
  Married                92.1%           83.3%             88.2%       2.76 NS
  Single                 0.0%            6.7%              2.9%
  Widow                  7.9%            10.0%             8.8%
Education level
  No formal education    2.6%            0.0%              1.5%        2.21 NS
  Primary education      65.8%           80.0%             72.1%
  Secondary education    18.4%           13.3%             16.2%
  College and above      13.2%           6.7%              10.3%

NS = Non-significant (P > 0.05)

Results indicate that the distributions of household heads by sex, age, marital status and
education level in the two types of households (i.e. dairy farmers and non-dairy
farmers) were not significantly different (P > 0.05). The majority of respondents in
both groups (i.e. more than 50%) were male, aged between 35 and 50 years, married,
and had primary education.

Exercise
Using the file "Workshop SPSS working file1", test whether there is significant
association between engagement in off-farm activities (dependent variable) and sex
(gender), marital status and district of residence of a respondent (independent
variables)

Note: Summarize your output and interpret your results.

Other tests related to chi-square

Conversion of Chi-square into the Phi coefficient (φ)

Since chi-square does not by itself provide an estimate of the magnitude of association
between two attributes, any obtained chi-square value may be converted into a phi (φ)
coefficient:

    φ = √(χ² / N)

Useful for 2 x 2 contingency tables (esp. nominal-by-nominal contingency tables)

Conversion of Chi-square into the Coefficient of Contingency (C)

A chi-square value may also be converted into the coefficient of contingency (C),
especially in the case of a contingency table of higher order than 2 x 2, to study the
magnitude of the relation or degree of association between two attributes;

    C = √(χ² / (χ² + N))

Conversion of Chi-square into Cramer's V coefficient

As with the contingency coefficient (C), Cramer's V is also used for tables of higher
order, i.e. where the number of both rows and columns is greater than 2:

    Cramer's V = √(χ² / (N · min(r − 1, c − 1))) = √(φ² / min(r − 1, c − 1))

2. TESTING OF HYPOTHESES FOR ASSOCIATION AMONG


CONTINUOUS VARIABLES

Concerned with computation of correlation coefficients and testing of hypotheses


regarding correlation coefficients.

Unlike chi-square, it indicates both the strength/intensity and the direction of the
relationship between a pair of variables. Strength/intensity means the closeness of the
relationship between pairs of variables. This is done by computing a correlation
coefficient (numerical approach, vs inspecting a scatter plot).

Pearson’s correlation coefficient (r)

- Typical bivariate analysis

- Measures linear relationship
- Variables under study are interval or ratio, i.e. continuous
- It is parametric: non-parametric counterparts include Spearman's rank correlation.
- According to Cohen and Holliday (1982): r ≤ 0.19 = very low corr, 0.20 – 0.39 =
low corr, 0.40 – 0.69 = modest corr, 0.70 – 0.89 = high corr, 0.90 – 1 = very high
corr.
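Those cut-offs can be written as a tiny lookup function (my own sketch; the function name is hypothetical):

```python
def corr_strength(r):
    """Verbal label for |r| per Cohen and Holliday (1982)."""
    a = abs(r)
    if a <= 0.19:
        return "very low"
    if a <= 0.39:
        return "low"
    if a <= 0.69:
        return "modest"
    if a <= 0.89:
        return "high"
    return "very high"

print(corr_strength(0.203))  # low
print(corr_strength(0.739))  # high
```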

Significance level = just tells us whether the obtained coefficient (e.g. a low
correlation) has arisen by chance (i.e. sampling error) or actually exists in the
population from which the sample was selected.

Hypotheses under this test are;

Ho : There is No significant correlation between two variables under study


H1: There is significant correlation between two variables under study
(for the case of two-tailed test)

Or alternative hypothesis could be;

H1: There is significant positive correlation between two variables under study i.e.
correlation coefficient is significantly above zero (for right tailed test);
Or
H1: There is significant negative correlation between two variables under study i.e.
correlation coefficient is significantly below zero (for left tailed test);



Test statistic is t:

    t = r√(n − 2) / √(1 − r²)
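A sketch in Python (illustrative data, not the workshop file) showing that this t statistic, referred to a t distribution with n − 2 degrees of freedom, reproduces the two-tailed p-value that scipy's pearsonr reports:

```python
import math
from scipy import stats

x = [2.0, 4.0, 5.0, 7.0, 9.0, 11.0, 12.0, 15.0]
y = [3.1, 4.9, 5.2, 8.3, 8.9, 12.4, 12.0, 16.1]
n = len(x)

r, p_scipy = stats.pearsonr(x, y)                # Pearson's r and two-tailed p-value
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # the test statistic above
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)       # two-tailed p from t with n-2 df
print(round(r, 3), round(t, 2))
```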

Procedures for Pearson's correlation analysis in SPSS

Analyze → Correlate → Bivariate → select variables of interest and put them into the
Variables box → select Pearson → OK

Example;
Below is the output for the null hypothesis that annual household income before the
project was not significantly correlated with age of respondent, farm size, and number
of cattle owned, versus the alternative hypothesis that they are correlated.

File; Workshop SPSS working file1

Correlations

                                             age of     annual income   farm size   number of
                                             household  before project  (acres)     cattle
                                             head       ('000)                      owned
age of household head  Pearson Correlation   1          .203*           .175*       .265**
                       Sig. (2-tailed)       .          .013            .032        .001
                       N                     150        150             150         150
annual income before   Pearson Correlation   .203*      1               .693**      .739**
project ('000)         Sig. (2-tailed)       .013       .               .000        .000
                       N                     150        150             150         150
farm size (acres)      Pearson Correlation   .175*      .693**          1           .617**
                       Sig. (2-tailed)       .032       .000            .           .000
                       N                     150        150             150         150
number of cattle       Pearson Correlation   .265**     .739**          .617**      1
owned                  Sig. (2-tailed)       .001       .000            .000        .
                       N                     150        150             150         150
*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).



Interpretation/reporting of results

Results from Table.. reveal (indicate/show) that annual household income before the
project was significantly correlated with age of household head (r = 0.203, P = 0.013),
farm size (r = 0.693, P = 0.000), and number of cattle owned (r = 0.739, P = 0.000).

Or
Results from Table.. reveal that annual household income before the project was
significantly correlated with age of household head (r = 0.203, P < 0.05), farm size
(r = 0.693, P < 0.001), and number of cattle owned (r = 0.739, P < 0.001).

Note1;
The above output was for a two-tailed test. However, if your interest was a one-tailed
test, you could instruct the computer to produce results for a one-tailed test before
clicking OK, or divide the P-values in the above output by 2.

Example, one tailed- in the sense that hypotheses could be:


Ho: Annual household income before project is not significantly correlated with age of
household head, farm size and number of cattle owned.

Ha: Annual household income before project is significantly POSITIVELY correlated


with age of household head, farm size and number of cattle owned.

The output if you instructed the computer to produce results for a one-tailed test would
be as below.

Note2: The computer produces output for all possible comparisons; please stick to
your pre-determined comparisons.

Note3: Results above the diagonal and those below the diagonal are the same!
Therefore, use one side of the diagonal for interpretation to avoid confusion.



Correlations

                                             age       annual income   farm size   number of
                                                       before project  (acres)     cattle
                                                       ('000)                      owned
age                    Pearson Correlation   1         .203**          .175*       .265**
                       Sig. (1-tailed)       .         .006            .016        .001
                       N                     150       150             150         150
annual income before   Pearson Correlation   .203**    1               .693**      .739**
project ('000)         Sig. (1-tailed)       .006      .               .000        .000
                       N                     150       150             150         150
farm size (acres)      Pearson Correlation   .175*     .693**          1           .617**
                       Sig. (1-tailed)       .016      .000            .           .000
                       N                     150       150             150         150
number of cattle       Pearson Correlation   .265**    .739**          .617**      1
owned                  Sig. (1-tailed)       .001      .000            .000        .
                       N                     150       150             150         150
**. Correlation is significant at the 0.01 level (1-tailed).
*. Correlation is significant at the 0.05 level (1-tailed).

Interpretation/reporting of results

Results from Table.. indicate (reveal/show) that annual household income before the
project was significantly positively correlated with age of household head (r = 0.203,
P = 0.006), farm size (r = 0.693, P = 0.000), and number of cattle owned (r = 0.739,
P = 0.000).

Or
Results from Table.. indicate that annual household income before the project was
significantly positively correlated with age of household head (r = 0.203, P < 0.01),
farm size (r = 0.693, P < 0.001), and number of cattle owned (r = 0.739, P < 0.001).

Note:
A positive Pearson correlation coefficient implies that the variables are positively
correlated (i.e. an increase in X1 is associated with an increase in X2), and a negative
Pearson correlation coefficient implies that the variables are negatively correlated (i.e.
an increase in X1 is associated with a decrease in X2).



Summarizing output

Pearson's correlation coefficients

                                    Age   HH income      Farm size    Number of
                                          before proj.   (acres)      cattle owned
                                          ('000 Tsh)
Age                                 .     r = 0.203      r = 0.175    r = 0.265
                                          p = 0.006      p = 0.016    p = 0.001
HH income before proj. ('000 Tsh)         .              r = 0.693    r = 0.739
                                                         p = 0.000    p = 0.000
Farm size (acres)                                        .            r = 0.617
                                                                      p = 0.000
Number of cattle owned                                                .

Or

Pearson's correlation coefficients

                                    Age   HH income      Farm size    Number of
                                          before proj.   (acres)      cattle owned
                                          ('000 Tsh)
Age                                 .     0.203**        0.175*       0.265**
HH income before proj. ('000 Tsh)         .              0.693***     0.739***
Farm size (acres)                                        .            0.617***
Number of cattle owned                                                .

* = Significant at P < 0.05, ** = Significant at P < 0.01, *** = Significant at P < 0.001

SPEARMAN’S RANK CORRELATION COEFFICIENT (Spearman’s rho (ρ))

- Alternative to Pearson's correlation analysis

- Correlation coefficient for ordinal-by-ordinal variables, and for when the conditions
for using Pearson's correlation analysis (i.e. normal distribution, continuous
variables) are not met.

- It is non-parametric

- Rank correlations = Spearman's rho (ρ), also denoted R, and Kendall's tau (τ).
Variables under study are categorical (ordinal). The commonly used coefficient is
Spearman's rho (ρ). However, Kendall's tau (τ) is preferred when there is a large
proportion of tied ranks.

- Kendall's tau (τ)-a is for ungrouped data; Kendall's tau (τ)-b is for grouped data

Hypotheses for Spearman’s rank correlation analysis

Ho : There is No significant correlation between two variables under study


Ha: There is significant correlation between two variables under study
(for the case of two-tailed test)

Or alternative hypothesis could be;

Ha: There is significant positive correlation between two variables under study i.e.
correlation coefficient is significantly above zero (for right tailed test);
Or
Ha: There is significant negative correlation between two variables under study i.e.
correlation coefficient is significantly below zero (for left tailed test);

Procedures in SPSS

Analyze → Correlate → Bivariate → select variables of interest and put them into the
Variables box → select Spearman → OK

Note: Kendall's tau-b can also be selected by clicking the respective box.
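For readers working outside SPSS, the same coefficients can be obtained in Python with SciPy; the two sets of ranks below are illustrative, not the workshop data:

```python
from scipy import stats

# Two illustrative rankings of the same eight subjects
interview_rank = [1, 2, 3, 4, 5, 6, 7, 8]
iq_test_rank = [2, 1, 4, 3, 5, 7, 6, 8]

rho, p_rho = stats.spearmanr(interview_rank, iq_test_rank)   # Spearman's rho
tau, p_tau = stats.kendalltau(interview_rank, iq_test_rank)  # Kendall's tau-b
print(round(rho, 3), round(tau, 3))  # 0.929 0.786
```

With no tied ranks, tau-b coincides with tau-a, so `kendalltau`'s default is fine here.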

Exercise
Using the file "Workshop SPSS working file2", compare the ranking between the
interview method and Hopkin's I.Q test method

Note: Interpret your results.



Regression analysis, goodness of fit, and testing of hypotheses about regression
coefficients.
Regression can be linear (classical regression analysis) or non-linear (binary logistic
regression, multinomial logistic regression, ordered logistic regression).

3. TESTING OF HYPOTHESES FOR RELATIONSHIP AMONG


VARIABLES (REGRESSION ANALYSIS)
Regression analysis is used to establish the relationship among variables (using an
equation).

In regression analysis we usually have a dependent variable (Y) and independent
variable(s) (Xs).

Regression analysis can be simple regression (when we have a single independent
variable) or multiple regression (when we have more than one (several) independent
variables).

Regression can also be linear (when there is a linear relationship between Y and X(s)),
or non-linear (when there is a non-linear relationship between Y and X(s)).

Note: we typically use linear regression when the dependent variable (Y) is continuous,
and non-linear regression (e.g. logistic) when the dependent variable is categorical,
i.e. binary.

SIMPLE LINEAR REGRESSION

It assumes one explanatory variable influences a response variable.

Population regression equation

    Yi = α + βXi + εi

Where;
Y = the dependent variable
X = the independent (explanatory) variable
α = the intercept on the Y axis (a regression constant)
β = the gradient (slope) of the relationship (a regression coefficient)
εi = the random error in Y (disturbance term)

Note1:
α is the value of Y when the value of X is zero; and β is the amount of change in Y
when X is increased by one unit.

Note2: εi is the effect of all other variables not included in the model (sometimes
denoted as ui).

Note3: Sometimes we denote the regression constant as β0 instead of α.

In practice we are usually involved in fitting (determining) a sample regression
equation. A sample regression equation is an estimate of the population regression
equation (i.e. the sample regression equation is used to estimate the population
regression equation).

A sample regression equation can be given by;

Y_i = a + bX_i + e_i

Whereby a estimates α, and b estimates β.

Or the equation can be written as;

Y_i = α̂ + β̂X_i + e_i

In fitting a regression equation, we compute the values of a and b and test whether they
are zero or significantly different from zero (why we test whether they are zero,
especially for b, is explained later).

a and b are estimated using the least squares method or the maximum likelihood approach;
the least squares method is the more common.

Once you have values for a and b you can estimate the value of Y for a given value of X.

We get the estimated Y_i, i.e. Ŷ_i, as;

Ŷ_i = a + bX_i

This is the sample regression line.
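As an illustration, the least-squares estimates can be computed directly from the formulas
b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. The sketch below uses Python with
hypothetical farm-size/income values (illustrative only, not the workshop data set):

```python
import numpy as np

# Hypothetical data: farm size (acres) and annual income ('000 Tsh).
# These values are illustrative only, not the workshop data set.
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0, 12.0, 15.0])
y = np.array([700.0, 1100.0, 1350.0, 1800.0, 1950.0, 2400.0, 2900.0, 3400.0])

# Least-squares estimates: b = Sxy / Sxx, a = ybar - b * xbar
xbar, ybar = x.mean(), y.mean()
b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
a = ybar - b * xbar

# Fitted (estimated) values on the sample regression line: Y_hat = a + b * X
y_hat = a + b * x

print(round(a, 2), round(b, 2))
```

With real data you would read x and y from the data file; SPSS's Linear Regression
procedure performs this same computation internally.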



[Scatter plot: farm size (acres) vs household annual income ('000), with fitted linear
regression line: household annual income ('000) = 411.40 + 194.65 * fsize; R-Square = 0.48]

We usually need to specify a regression equation/model that we are going to estimate


when writing research proposals/reports. It is usually advised to write it in a simple and
more understandable form. For example, a simple regression model can be written in the
form indicated in the example below;

Example of a simple regression equation (model)

Suppose we want to establish whether there is a relationship between annual household
income before project (INCOME1) as a dependent variable and farm size (FSIZE) as an
independent variable (from our data file: Workshop SPSS working file1). The regression
equation/model can be written as;



INCOME1_i = α + β·FSIZE_i + ε_i
Whereby; INCOME1 = Annual household income before project ('000 Tsh)
FSIZE = Farm size (acres)
α = Regression constant
β = Regression coefficient
ε = Error term

OR

INCOME1_i = β_0 + β_1·FSIZE_i + ε_i
Whereby;
INCOME1 = Annual household income before project ('000 Tsh)
FSIZE = Farm size (acres)
β_0 = Regression constant
β_1 = Regression coefficient
ε = Error term
OR

Y_i = α + βX_i + ε_i
Whereby; Y_i = Annual household income before project ('000 Tsh)
X_i = Farm size (acres)
α = Regression constant
β = Regression coefficient
ε = Error term
OR

Y_i = β_0 + β_1X_i + ε_i
Whereby; Y_i = Annual household income before project ('000 Tsh)
X_i = Farm size (acres)
β_0 = Regression constant
β_1 = Regression coefficient
ε = Error term

Statistical tests in simple linear regression



i) ANOVA (more useful in multiple regression analysis)

Divides the total variation into its components, i.e. variation due to regression (due to
the X included in the equation) and variation due to the residual. It tests the
significance of the model.

TSS = ESS + RSS


Whereby TSS = total sum of squares, ESS = explained (regression) sum of squares, RSS
= Residual sum of squares

Hypotheses;
Ho: Independent variable (X) has no significant influence on dependent variable (Y)
Ha: Independent variable (X) has significant influence on dependent variable (Y)

ii) R-square (Goodness of fit)

Indicates the percentage of variation in Y that is explained by variation in X.

R² = ESS / TSS

Note: in the literature there are several versions of the formula for R²; however, the
above formula is the simplest one.

iii) t-test and regression coefficient

Tests the significance of X in influencing Y.

Hypotheses
Ho: β = 0 (Independent variable (X) has no significant influence on dependent variable (Y))
H1: β ≠ 0 (Independent variable (X) has a significant influence on dependent variable (Y))

Note:
For the case of one independent variable, you can choose either the F-test (ANOVA) or the
t-test to study the effect of X on Y (the above case), as they lead to the same
conclusion.
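The decomposition TSS = ESS + RSS, and hence R² = ESS/TSS, can be checked numerically. A
minimal sketch in Python, again with hypothetical data (not the workshop file):

```python
import numpy as np

# Illustrative (hypothetical) data, not the workshop file.
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0, 12.0, 15.0])
y = np.array([700.0, 1100.0, 1350.0, 1800.0, 1950.0, 2400.0, 2900.0, 3400.0])

# Fit the simple linear regression by least squares.
xbar, ybar = x.mean(), y.mean()
b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
a = ybar - b * xbar
y_hat = a + b * x

tss = np.sum((y - ybar) ** 2)      # total sum of squares
ess = np.sum((y_hat - ybar) ** 2)  # explained (regression) sum of squares
rss = np.sum((y - y_hat) ** 2)     # residual sum of squares

r_squared = ess / tss              # R-square (goodness of fit)
print(round(r_squared, 3))
```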



Procedures for Simple linear regression in SPSS

Analyze → Regression → Linear → choose the dependent (criterion) variable and put it in
the Dependent box → choose the independent variable and put it in the Independent(s) box
→ OK.

Example;

The following is the output from SPSS for the regression analysis of the null hypothesis
that annual household income before project (INCOME1) was not influenced by farm size, vs
the alternative hypothesis that it had an influence.
File name; Workshop SPSS working file1

Note: the model for this analysis was as follows;

INCOME1_i = α + β·FSIZE_i + ε_i
Whereby; INCOME1 = Annual household income before project ('000 Tsh)
FSIZE = Farm size (acres)
α = Regression constant
β = Regression coefficient
ε = Error term

SPSS output

Regression



Variables Entered/Removed(b)

Model  Variables Entered  Variables Removed  Method
1      farm size (acres)  .                  Enter

a. All requested variables entered.
b. Dependent Variable: annual income before project ('000)

Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      .693a  .480      .476               621.79292

a. Predictors: (Constant), farm size (acres)

ANOVA(b)

Model         Sum of Squares  df   Mean Square   F        Sig.
1 Regression  52780721        1    52780720.96   136.516  .000a
  Residual    57220713        148  386626.438
  Total       1.10E+08        149

a. Predictors: (Constant), farm size (acres)
b. Dependent Variable: annual income before project ('000)

Coefficients(a)

Model                B        Std. Error  Beta   t       Sig.
1 (Constant)         358.792  87.344             4.108   .000
  farm size (acres)  192.081  16.440      .693   11.684  .000

a. Dependent Variable: annual income before project ('000)

Interpretation/reporting of results

Results indicate farm size was a good predictor of annual household income before
project. About 48% of the variation in income was due to variation in farm size (R² =
0.48). Furthermore, results indicate farm size was significantly positively associated
with annual household income before project (t = 11.68, P < 0.001).

Note:
If we have a +ve coefficient (i.e. +β), it implies that there is a positive relationship
between X and Y (i.e. an increase in X is associated with an increase in Y), AND if we
have a −ve coefficient (i.e. −β), it implies that there is a negative relationship
between X and Y (i.e. an increase in X is associated with a decrease in Y).



MULTIPLE LINEAR REGRESSION ANALYSIS

Here we have more than one explanatory (independent) variable

It is a multivariate analysis

With p explanatory variables, a population regression equation can be written as;

Y_i = α + β_1X_1i + β_2X_2i + ... + β_pX_pi + ε_i

Whereby;
Y_i = the dependent variable
X_1i ... X_pi = independent variables
α = the regression constant
β_1 ... β_p = regression coefficients
ε_i = random error (disturbance term)

Basically,
α = the value of Y when all Xs (all independent variables) are zero (0);
β_1 = the amount of change in Y when X_1 is increased by one unit while the other
independent variables are held constant (i.e. held at their means); β_2 = the amount of
change in Y when X_2 is increased by one unit while the other independent variables are
held constant (i.e. held at their means); etc.
ε_i = the effect of all other variables NOT included in the model

Note: We can also denote the regression constant as β_0 instead of α, and the disturbance
term as u_i instead of ε_i.

A sample regression equation can be written as;

Y_i = a + b_1X_1i + b_2X_2i + ... + b_pX_pi + e_i

Or

Y_i = α̂ + β̂_1X_1i + β̂_2X_2i + ... + β̂_pX_pi + e_i

Note1: Sample regression equation is used to estimate population regression equation



Note2: Regression equations/models can also be written/expressed using matrices (not
shown here).
Note3: The coefficients (parameters) are estimated using the least squares method or the
maximum likelihood approach; the least squares method is the more common.
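Under the least squares method, the estimated coefficients solve the normal equations
(XᵀX)b = Xᵀy, where X contains a column of ones for the regression constant. A sketch in
Python with simulated (hypothetical) data loosely mimicking the income example:

```python
import numpy as np

# Simulated (hypothetical) data: income ('000) driven by education and farm size.
rng = np.random.default_rng(0)
n = 50
age = rng.uniform(20, 60, n)                  # years
educ = rng.integers(0, 14, n).astype(float)   # years in school
fsize = rng.uniform(1, 16, n)                 # acres
income = 400 + 100 * educ + 190 * fsize + rng.normal(0, 300, n)

# Design matrix with an intercept column, then solve the normal equations.
X = np.column_stack([np.ones(n), age, educ, fsize])
beta = np.linalg.solve(X.T @ X, X.T @ income)  # [a, b_age, b_educ, b_fsize]

print(np.round(beta, 1))
```

The estimates should land near the values used to simulate the data (education ≈ 100,
farm size ≈ 190, age ≈ 0), apart from sampling noise.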

Advantages of multiple regression analysis;

- You can study the effect of several independent variables collectively;

- You can study the effect of a specific independent variable while controlling for the
effect of other variables (i.e. confounders).

We usually need to specify a regression equation/model that we are going to estimate


when writing research proposals/reports. It is usually advised to write it in a simple and
more understandable way. For example a multiple linear regression model can be written
in the form indicated in the example below;

Example of a multiple linear regression equation (model)

Suppose we want to establish whether there is a relationship between annual household
income before project (INCOME1) as a dependent variable and age of respondent (years)
(AGE), education level (years in school) (EDUC2), number of cattle owned (NCATTLE) and
farm size (FSIZE) as independent variables (from our data file: Workshop SPSS working
file1). The model can be written as;

INCOME1_i = α + β_1·AGE_i + β_2·EDUC2_i + β_3·NCATTLE_i + β_4·FSIZE_i + ε_i

Whereby;
INCOME1 = annual household income before project ('000 Tsh)
AGE = Age (years)
EDUC2 = Education level (years in school)
NCATTLE = Number of cattle owned
FSIZE = Farm size (acres)
α = Regression constant
β_1 ... β_4 = Regression coefficients
ε = Error term

OR



INCOME1_i = β_0 + β_1·AGE_i + β_2·EDUC2_i + β_3·NCATTLE_i + β_4·FSIZE_i + ε_i

Whereby;
INCOME1 = annual household income before project ('000 Tsh)
AGE = Age (years)
EDUC2 = Education level (years in school)
NCATTLE = Number of cattle owned
FSIZE = Farm size (acres)
β_0 = Regression constant
β_1 ... β_4 = Regression coefficients
ε = Error term
OR

Y_i = α + β_1X_1i + β_2X_2i + β_3X_3i + β_4X_4i + ε_i
Whereby;
Y = annual household income before project ('000 Tsh)
X_1 = Age (years)
X_2 = Education level (years in school)
X_3 = Number of cattle owned
X_4 = Farm size (acres)
α = Regression constant
β_1 ... β_4 = Regression coefficients
ε = Error term

OR

Y_i = β_0 + β_1X_1i + β_2X_2i + β_3X_3i + β_4X_4i + ε_i
Whereby;
Y = annual household income before project ('000 Tsh)
X_1 = Age (years)
X_2 = Education level (years in school)
X_3 = Number of cattle owned
X_4 = Farm size (acres)
β_0 = Regression constant
β_1 ... β_4 = Regression coefficients
ε = Error term



OR

Y = f(X_1, X_2, X_3, X_4, ε)
Whereby;
Y = annual household income before project ('000 Tsh)
X_1 = Age (years)
X_2 = Education level (years in school)
X_3 = Number of cattle owned
X_4 = Farm size (acres)
ε = Error term

- Standardized vs. non-standardized regression coefficients?

Statistical tests for multiple regression

1. ANOVA- Testing the overall significance of regression.

Hypotheses;

Ho: The independent variables included in the model collectively do not significantly
influence the dependent variable (Y)

Ha: The independent variables included in the model collectively significantly influence
the dependent variable (Y)

2. R-square: the proportion of variation in Y that is explained by variation in the
independent variables included in the model.

R² = ESS / TSS

3. Test of statistical significance of individual regression coefficients



We use a t- test!!

Hypotheses:

If we have p independent variables

Ho: β_1 = β_2 = ... = β_p = 0
Ha: at least one β_j ≠ 0

Or

Ho: None of the regression coefficients is significantly different from zero (i.e. no
relationship)
Ha: At least one regression coefficient is significantly different from zero (i.e. at
least one X has a relationship with Y)

Or
Ho: There is no relationship between the dependent variable and the independent variables
included in the model (i.e. the dependent variable is not significantly influenced by the
independent variables included in the model).

Ha: There is a relationship between the dependent variable and at least one independent
variable included in the model (i.e. the dependent variable is significantly influenced
by at least one independent variable included in the model).

Note: The above hypotheses are two-tailed tests and rather general; what about the effect
of specific independent variables? To be more specific, depending on the literature or
existing information, you may indicate the expected direction of change for each
independent variable (after specifying the model or when defining the independent
variables of the model) and conduct a one-tailed test for some independent variables and
a two-tailed test for others. Example: consider a study on factors influencing adoption
of agricultural technology for Irish potato farming.



Procedures for multiple linear regression in SPSS

Analyze → Regression → Linear → choose the dependent (criterion) variable and put it in
the Dependent box → choose the independent variables and put them in the Independent(s)
box → OK.

Example;

The following is the output from SPSS for the regression analysis of the null hypothesis
that annual household income before project (INCOME1) was not influenced by age of
respondent (years) (AGE), education level (years in school) (EDUC2), number of cattle
owned (NCATTLE), and farm size (FSIZE), vs (against) the alternative hypothesis that they
had an influence.

File name; Workshop SPSS working file1

Note: The model for this analysis was as follows;

INCOME1_i = α + β_1·AGE_i + β_2·EDUC2_i + β_3·NCATTLE_i + β_4·FSIZE_i + ε_i

Whereby;
INCOME1 = Annual household income before project ('000 Tsh)
AGE = Age (years)
EDUC2 = Education level (years in school)
NCATTLE = Number of cattle owned
FSIZE = Farm size (acres)
α = Regression constant
β_1 ... β_4 = Regression coefficients
ε = Error term

SPSS Output

Regression
Variables Entered/Removed(b)

Model  Variables Entered                                Method
1      farm size (acres), age (years), education level  Enter
       (years in school), number of cattle owned

a. All requested variables entered.
b. Dependent Variable: annual income before project ('000 Tsh)



Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      .846a  .715      .707               464.75113

a. Predictors: (Constant), farm size (acres), age (years), education level
   (years in school), number of cattle owned

ANOVA(b)

Model         Sum of Squares  df   Mean Square   F       Sig.
1 Regression  78682359        4    19670589.81   91.070  .000a
  Residual    31319074        145  215993.617
  Total       1.10E+08        149

a. Predictors: (Constant), farm size (acres), age (years), education level
   (years in school), number of cattle owned
b. Dependent Variable: annual income before project ('000 Tsh)

Coefficients(a)

Model                                B         Std. Error  Beta   t       Sig.
1 (Constant)                         -478.897  219.182            -2.185  .031
  age (years)                        -2.743    4.342       -.029  -.632   .529
  education level (years in school)  114.006   18.021      .422   6.326   .000
  number of cattle owned             9.673     3.229       .219   2.996   .003
  farm size (acres)                  96.609    15.687      .348   6.159   .000

a. Dependent Variable: annual income before project ('000 Tsh)

The above output can be summarized and interpreted as follows;

Table …: Multiple Linear Regression Analysis for factors influencing annual household
income before project (in '000 Tsh), the dependent variable

Independent variable               B        Standard Error (SE)  t-value  Sig.
Constant                           -478.90  219.18               -2.19    0.031
Age (years)                        -2.74    4.34                 -0.63    0.529
Education level (years in school)  114.01   18.02                6.33     0.000
Number of cattle owned             9.67     3.23                 3.00     0.003
Farm size (acres)                  96.61    15.69                6.16     0.000
R² = 0.71; F-value = 91.07, P < 0.001



OR

Table …: Multiple Linear Regression Analysis for factors influencing annual household
income before project (in '000 Tsh), the dependent variable

Independent variable               B        Standard Error (SE)  t-value
Constant                           -478.90  219.18               -2.19*
Age (years)                        -2.74    4.34                 -0.63 NS
Education level (years in school)  114.01   18.02                6.33***
Number of cattle owned             9.67     3.23                 3.00**
Farm size (acres)                  96.61    15.69                6.16***
R² = 0.71; F-value = 91.07, P < 0.001; NS = non-significant, * = significant at P < 0.05,
** = significant at P < 0.01; *** = significant at P < 0.001

OR

Table …: Multiple Linear Regression Analysis for factors influencing annual household
income before project (in '000 Tsh) (INCOME1), the dependent variable

Independent variable     B        Standard Error (SE)  t-value
Constant                 -478.90  219.18               -2.19*
AGE (years)              -2.74    4.34                 -0.63 NS
EDUC2 (years in school)  114.01   18.02                6.33***
NCATTLE                  9.67     3.23                 3.00**
FSIZE (acres)            96.61    15.69                6.16***
R² = 0.71; F-value = 91.07, P < 0.001; NS = non-significant, * = significant at P < 0.05,
** = significant at P < 0.01; *** = significant at P < 0.001

Interpretation based on R-square, ANOVA and t-test !!! (presenting results)

Results from the table indicate that the independent variables included in the model were
good predictors of annual household income before project. About 71% of the variation in
annual household income before project was due to variation in the independent variables
included in the model. Results further indicate that the independent variables included
in the model collectively had a significant influence on annual household income before
project (F = 91.07, P < 0.001). Results for the t-tests indicate annual household income
before project had a significant relationship with education level (t = 6.33, P < 0.001),
number of cattle owned (t = 3.00, P < 0.01) and farm size (t = 6.16, P < 0.001). The
effect of age was not significant (t = -0.63, P > 0.05). Increases in education level,
number of cattle owned and farm size were associated with increased income before
project.



OR

Results from the table indicate that the independent variables included in the model were
good predictors of annual household income before project. About 71% of the variation in
annual household income before project was due to variation in the independent variables
included in the model. Results further indicate that the independent variables included
in the model collectively had a significant influence on annual household income before
project (F = 91.07, P < 0.001). Results for the t-tests indicate annual household income
before project was significantly positively related with education level (t = 6.33,
P < 0.001), number of cattle owned (t = 3.00, P < 0.01) and farm size (t = 6.16,
P < 0.001). Results also reveal that income was negatively related with age of
respondent; however, the relationship with age was not statistically significant
(t = -0.63, P > 0.05).

Note:
If we have a +ve coefficient (i.e. +β), it implies that there is a positive relationship
between X and Y (i.e. an increase in X is associated with an increase in Y), AND if we
have a −ve coefficient (i.e. −β), it implies that there is a negative relationship
between X and Y (i.e. an increase in X is associated with a decrease in Y).

Note: Highlight on the Enter method, Stepwise method, Forward selection method, and
Backward selection method for multiple regression analysis.

Exercise
Using the same file (Workshop SPSS working file1), re-run (re-analyse) the above analysis
after dropping (removing) the variable "number of cattle owned" from the model. Interpret
your results.



SPECIAL CASES

MULTIPLE REGRESSION ANALYSIS WHEN WE HAVE CATEGORICAL INDEPENDENT
VARIABLE(S) IN THE MODEL

In the above examples for regression analysis we had continuous dependent variable and
continuous explanatory variable(s).

In some occasions we can have a continuous dependent variable and a mixture of


categorical and continuous independent variables in the model.

When we have a continuous dependent variable and some of the independent variables are
categorical, we have to transform those variables into dummy variables (how?) and then
run least squares linear regression analysis as usual. This facilitates interpretation.

The number of dummy variables for a categorical variable equals the number of categories
minus 1 (i.e. # categories − 1). Therefore, for a categorical variable with two
categories we will have one dummy variable, and for one with three categories we will
have two dummy variables. (Basically, one category is chosen as the reference category.)

1. Example of dummy variables for categorical variables with two categories

Example 1;
A categorical variable "Sex of respondent (SEX)" with two categories (Male and Female)
can be coded as a dummy variable as follows;

1 = if male; 0 = otherwise (or 1 = Male, 0 = Female)

Example 2;
A categorical variable "If engaging in off-farm activities (OFFA)" with two categories
(Yes and No) can be coded as a dummy variable as follows;

1 = if Yes; 0 = otherwise (or 1 = Yes, 0 = No)

Note: strictly speaking, it is more sensible to use "otherwise" when we generate dummy
variables for a categorical variable with more than two categories (see the next
example). Therefore, for the above examples (dummies for categorical variables with two
categories), the format in brackets is more appealing (label the categories directly!).
Use that format.

2. Example of dummy variables for categorical variables with more than two
categories

For a categorical variable "District of residence" with three categories (Chamwino, Bahi
and Kongwa) we will have two dummy variables, with one district chosen as the base
(omitted) category. Suppose we omit Chamwino district; we will then have two dummy
variables, one representing whether the respondent is from Bahi and the second whether
the respondent is from Kongwa, and the coding would be as follows;

First dummy variable “If from Bahi(BAHI)” would be coded as;

1 = if Yes; 0 = otherwise

Second dummy variable “If from Kongwa (KONGWA) would be coded as

1 = if Yes; 0 = otherwise

(However, the examples given in this handout consider categorical variables with two
categories.)
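The k − 1 dummy coding described above can be generated automatically. A sketch using
pandas with hypothetical district data, dropping Chamwino as the reference category:

```python
import pandas as pd

# Hypothetical responses for a 3-category variable "district".
df = pd.DataFrame({"district": ["Chamwino", "Bahi", "Kongwa", "Bahi", "Chamwino"]})

# One 0/1 column per category, then drop the reference category,
# leaving (# categories - 1) = 2 dummy variables: BAHI and KONGWA.
dummies = pd.get_dummies(df["district"], dtype=int)
dummies = dummies.drop(columns="Chamwino")

print(dummies)
```

The remaining columns can then be joined to the data set and entered into the regression
like any other independent variables.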

Note1: Depending on the required information AND for simplicity, some authors may
collapse the responses of a categorical variable with more than two categories into just
two categories coded in dummy form (i.e. one dummy variable).

Note2: Interpretation of results for dummy variables (IMPORTANT!!)

If we have a +ve coefficient (i.e. +β), the category coded as 1 is associated with an
increase in Y (the dependent variable); AND if the coefficient is −ve (i.e. −β), the
category coded as 1 is associated with a decrease in Y (the dependent variable).

Example;

The following is a multiple linear regression analysis for the null hypothesis that
annual household income (INCOME) (in '000 Tsh) is NOT influenced by age of respondent
(AGE) (in years), sex of respondent (SEX), education level (EDUC2) (in years in school),
engagement in off-farm activities (OFFA), number of cattle owned (NCATTLE), and farm size
(FSIZE) (in acres) vs (against) the alternative hypothesis that it is influenced by those
variables.
File name: Workshop SPSS working file3

In this analysis, the independent variables SEX and OFFA were categorical variables and
were coded as dummies as follows.

SEX: 1= Male, 0 = Female


OFFA: 1 = Yes, 0 = No

Regression model for this analysis was as follows;



INCOME_i = α + β_1·AGE_i + β_2·SEX_i + β_3·EDUC2_i + β_4·OFFA_i + β_5·NCATTLE_i + β_6·FSIZE_i + ε_i
Whereby;
INCOME = Annual household income ('000 Tsh)
AGE = Age (years)
SEX = Sex (Dummy: 1 = if male, 0 = if female)
EDUC2 = Education level (years in school)
OFFA = Engagement in off-farm activities (Dummy: 1 = if yes, 0 = if not)
NCATTLE = Number of cattle owned
FSIZE = Farm size (acres)
α = Regression constant
β_1 ... β_6 = Regression coefficients
ε = Error term

Procedures in SPSS are the same as in previous multiple regression analysis

SPSS output

Regression
Variables Entered/Removed(b)

Model  Variables Entered                                          Method
1      farm size (acres), age of household head, If engaged in    Enter
       off-farm activities for income, sex of household head,
       number of cattle owned, education level (years in school)

a. All requested variables entered.
b. Dependent Variable: household annual income ('000)



Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      .855a  .731      .719               462.22192

a. Predictors: (Constant), farm size (acres), age of household head, If engaged in
   off-farm activities for income, sex of household head, number of cattle owned,
   education level (years in school)

ANOVA(b)

Model         Sum of Squares  df   Mean Square   F       Sig.
1 Regression  82912499        6    13818749.82   64.680  .000a
  Residual    30551822        143  213649.106
  Total       1.13E+08        149

a. Predictors: (Constant), farm size (acres), age of household head, If engaged in
   off-farm activities for income, sex of household head, number of cattle owned,
   education level (years in school)
b. Dependent Variable: household annual income ('000)

Coefficients(a)

Model                                           B         Std. Error  Beta   t       Sig.
1 (Constant)                                    -340.252  221.940            -1.533  .127
  age of household head                         -3.834    4.350       -.040  -.881   .380
  sex of household head                         320.241   102.803     .177   3.115   .002
  education level (years in school)             94.832    20.019      .346   4.737   .000
  If engaged in off-farm activities for income  3.702     97.099      .002   .038    .970
  number of cattle owned                        8.312     3.257       .185   2.552   .012
  farm size (acres)                             94.468    15.653      .335   6.035   .000

a. Dependent Variable: household annual income ('000)



The above output can be summarized and interpreted as follows;

Table …: Multiple Linear Regression Analysis for factors influencing annual household
income (in '000 Tsh), the dependent variable

Independent variable               B        Standard Error (SE)  t-value
Constant                           -340.25  221.94               -1.53 NS
Age of household head (years)      -3.83    4.35                 -0.88 NS
Sex of household head              320.24   102.80               3.12**
Education level (years in school)  94.83    20.02                4.74***
If engaged in off-farm activities  3.70     97.10                0.04 NS
Number of cattle owned             8.31     3.26                 2.55*
Farm size                          94.47    15.65                6.04***
R² = 0.72; F-value = 64.68 (P < 0.001); NS = non-significant, * = significant at
P < 0.05, ** = significant at P < 0.01; *** = significant at P < 0.001

Results from the table indicate that the independent variables included in the model were
good predictors of annual household income. About 72% of the variation in annual
household income was explained by variation in the independent variables included in the
model. Results further indicate that the independent variables included in the model
collectively had a significant influence (effect) on annual household income (F = 64.68,
P < 0.001). Results for the t-tests indicate annual household income had a significant
relationship with sex of household head (t = 3.12, P < 0.01), education level (t = 4.74,
P < 0.001), number of cattle owned (t = 2.55, P < 0.05) and farm size (t = 6.04,
P < 0.001). On the other hand, age of respondent and engagement in off-farm activities
had no significant influence on annual household income (t = -0.88, P > 0.05 and
t = 0.04, P > 0.05, respectively). Being male, and increases in education level, number
of cattle owned and farm size, were associated with increased annual household income.

OR

Results from the table indicate that the independent variables included in the model were
good predictors of annual household income. About 72% of the variation in annual
household income was explained by variation in the independent variables included in the
model. Results further indicate that the independent variables included in the model
collectively had a significant influence on annual household income (F = 64.68,
P < 0.001). Results for the t-tests indicate annual household income was significantly
positively related with education level (t = 4.74, P < 0.001), number of cattle owned
(t = 2.55, P < 0.05), and farm size (t = 6.04, P < 0.001). Results further indicate being
male was significantly associated with increased annual household income (i.e.
significantly positively related to income) (t = 3.12, P < 0.01). Furthermore, age of
respondent was negatively related to annual household income, while engagement in
off-farm activities was positively related to annual household income; however, their
effects were not significant (t = -0.88, P > 0.05 and t = 0.04, P > 0.05, respectively).



Exercise
Re-do the same problem by dropping age of household head in the model. Interpret your
results.

REGRESSION ANALYSIS WHEN WE HAVE A CATEGORICAL DEPENDENT


VARIABLE

An overview

The linear regression model cannot be applied directly in this situation because some of
its assumptions are violated.

Solution;
Latent variable approach (econometric approach): It assumes that there is an underlying
continuous variable (i.e. an unobserved/latent variable) associated with the categorical
response variable under study, and that we only observe a particular category of the
response variable when the underlying continuous variable is at a particular level.

Transformational approach (statistical approach), i.e. logit and probit links: The
categorical dependent variable is considered as inherently categorical and is modeled as
such. In this approach there is a one-to-one correspondence between the population
parameters of interest and the sample statistics.

The logit link is derived from the logistic distribution (to be explained later), while
the probit link is from the cumulative normal distribution.

Basically, the transformation is done so that the functions have linear properties like
those of linear regression and hence can be solved easily.

The logit link comes from logistic regression. We are going to concentrate on logistic
regressions (as they are easier to understand), specifically BINARY LOGISTIC REGRESSION.

There are several types of logistic regression depending on the nature of the categorical
dependent variable: i) binary logistic regression (for a binary response variable),
ii) multinomial logistic regression (for a nominal dependent variable with more than two
categories), and iii) ordered logistic regression (for an ordinal dependent variable).



SIMPLE BINARY LOGISTIC REGRESSION

In the social sciences we frequently encounter response variables with binary responses,
i.e. two categories of response (a BINARY VARIABLE). In this situation, it is logical to
study (examine) the probability of observing a particular category of the response
variable (Y) given particular levels of an independent variable. The relationship between
the probability of observing a particular category of the response variable and the
levels of the independent variable is NON-LINEAR.
We can model this non-linear relationship using the logistic equation (logit link) or the
cumulative normal distribution equation (probit link). In this handout we are going to
use the logit link.

Suppose we have a dependent variable (Y) with two response categories; 1 if a condition
is observed and 0 if otherwise (not observed), and one independent variable (X).

The expected probability that Y = 1 given a particular level of X (i.e. the conditional
mean of the probability), that is, the expected P(Y = 1 | X = x), denoted π(x) or simply
π for simplification, is given in the logistic equation by;

π = e^(α + βx) / (1 + e^(α + βx))

OR

π = 1 / (1 + e^−(α + βx))

Whereby;
e = a mathematical constant approximately equal to 2.718 (the base of natural logarithms)
α = regression constant
β = regression coefficient

(Note: you may encounter in the literature different notations for the expected
P(Y = 1 | X = x), such as π(x), π, P, P_i, and many others. Try to use the simple one.)
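The two algebraic forms of the logistic equation above are equivalent. A small check in
Python with hypothetical values of α and β (for illustration only):

```python
import math

# Hypothetical parameter values (for illustration only).
a, b = -2.0, 0.35  # a estimates alpha, b estimates beta
x = 4.0

# Form 1: pi = e^(a + b*x) / (1 + e^(a + b*x))
z = a + b * x
p1 = math.exp(z) / (1.0 + math.exp(z))

# Form 2: pi = 1 / (1 + e^-(a + b*x))
p2 = 1.0 / (1.0 + math.exp(-z))

# Both forms give the same probability, always between 0 and 1.
print(round(p1, 4), abs(p1 - p2) < 1e-12)
```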

The main task in the above equation is the estimation of α and β. These can easily be
estimated after transforming the above equation into linear form (i.e. the logit
transformation).

How??



If the expected probability of observing Y = 1 given a particular level of X is
P(Y = 1 | X = x) = π, then the expected probability of NOT observing Y = 1 given that
level of X (i.e. the expected probability of observing Y = 0 given that level of X) is
1 − P(Y = 1 | X = x) = 1 − π.

Therefore, the probability of observing Y = 1 given a particular level of X relative to
not observing it (i.e. relative to Y = 0 given that level of X), that is, the ODDS of
Y = 1 | X = x relative to Y = 0 | X = x (in other words, the odds of experiencing the
event), is given by;

P(Y = 1 | X = x) / [1 − P(Y = 1 | X = x)] = π / (1 − π)
It can be shown that;

 P(Y  1 | X  x)    
g = ln    ln   =   x
 1  P(Y  1 | X  x)  1  

Note: we call g (which can also be written as g(x)) the logit(π), i.e. the log-odds.

α is the value of the logit when X equals zero, and β is the amount of change in the
logit when X is increased by 1 unit (for a continuous independent variable), or when a
categorical independent variable X is changed from the base category to a
particular category.

The logit has some desirable properties like those of linear regression. It is linear in
its parameters, it is continuous, and it may range from −∞ to +∞ depending
on the range of x.

A logit equation for a sample can be written as:

ĝ = ln{ P̂(Y = 1 | X = x) / [1 − P̂(Y = 1 | X = x)] } = ln[ π̂ / (1 − π̂) ] = a + bx

OR

ĝ = ln[ π̂ / (1 − π̂) ] = α̂ + β̂x



Then a and b in the logit equation above (i.e. the estimates of α and β) are computed
by the maximum likelihood approach, i.e. by maximizing the likelihood function. The
likelihood function (l) gives the probability of obtaining the observed
data set (it is based on the concept of joint probability).

Note 1: a is the estimate of α, and b is the estimate of β.

Note 2:
During interpretation, b is sometimes transformed into an ODDS RATIO (OR) to make the
interpretation easier to understand. This is done by computing exp(b).
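The maximum likelihood idea can be sketched in code. The following is a minimal Newton-Raphson maximization of the log-likelihood for a simple logit model; the `fit_simple_logit` function and the tiny data set are illustrative inventions, not the handout's SPSS data or SPSS's internal algorithm:

```python
import math

def fit_simple_logit(x, y, iters=25):
    """Estimate (a, b) in logit(p) = a + b*x by maximizing the
    log-likelihood with Newton-Raphson steps."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        p = [1 / (1 + math.exp(-(a + b * xi))) for xi in x]
        # Gradient of the log-likelihood with respect to (a, b)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Observed information matrix (negative Hessian)
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: (a, b) += inverse(information) * gradient
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Tiny hypothetical data set: the event becomes more common as x grows
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
a, b = fit_simple_logit(x, y)
print(f"a = {a:.3f}, b = {b:.3f}, OR = exp(b) = {math.exp(b):.3f}")
```

At the maximum the gradient of the log-likelihood is zero, which is the defining property of the maximum likelihood estimates.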

What is an ODDS RATIO (OR)?

The OR is the CHANGE (i.e. factor change) in the likelihood of observing a particular category
(i.e. the category coded as 1) of the response variable when a categorical (discrete)
independent variable is changed from the reference category to a particular category, or when
a continuous independent variable is increased by one unit.

Statistical tests in simple binary logistic regression

1. Goodness of fit-test

There are several measures, e.g. the Nagelkerke R², Cox & Snell R², McFadden R², etc.

2. Testing of significance (hypotheses testing)

i). Likelihood ratio (LR) test

Does the model that includes the variable in question tell us more about the outcome
(or response) variable than a model that does not include the variable? This is the question
addressed by the likelihood ratio (LR) test.

To assess the significance of an independent variable we compare the value of D
(the deviance, equivalent to the residual SS in linear regression) with and without the
independent variable in the equation. The change in D due to the inclusion of the
independent variable in the model is obtained as:

G = −2 ln[ (likelihood without the variable) / (likelihood with the variable) ]



Under the null hypothesis that β is equal to zero, the statistic G follows a chi-square
distribution with 1 degree of freedom.

Note 1: We have 1 degree of freedom because we have 1 predictor; equivalently,
2 parameters (a constant and a coefficient) minus 1.

Note 2: G is the test statistic. The larger G is, the more likely we are to reject the null
hypothesis and hence declare that the variable in the model has a significant effect (influence)
on Y.

Hypotheses; Stated in the similar way as that for simple linear regression

The calculation of the log likelihood and the likelihood ratio test are standard features of
all logistic regression software. In the simple case of a single independent variable, we first
fit a model containing only the constant term. We then fit a model containing the
independent variable along with the constant. This gives rise to a new log likelihood (LL).
The likelihood ratio (LR) statistic is obtained by multiplying the difference between
these two values by −2. The result, along with the associated p-value for the chi-
square distribution, may be obtained from most software packages:

G = −2 (LL with only a constant − LL with constant and a variable)
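The computation of G can be sketched numerically. The log-likelihood values below are back-calculated from the AGE example later in this handout (−2LL = 154.527 for the fitted model and G = 69.301), so they are reconstructions rather than software output; the df = 1 chi-square p-value uses the identity χ²₁ = Z², so P(χ²₁ > G) = erfc(√(G/2)):

```python
import math

def lr_test(ll_null, ll_full):
    """Likelihood ratio statistic G = -2 (LL_null - LL_full) and its
    p-value under a chi-square distribution with df = 1."""
    G = -2 * (ll_null - ll_full)
    p_value = math.erfc(math.sqrt(G / 2))   # P(chi2_1 > G), since chi2_1 = Z^2
    return G, p_value

# Back-calculated log-likelihoods for the constant-only and fitted models
G, p = lr_test(ll_null=-111.914, ll_full=-77.264)
print(f"G = {G:.2f}, p = {p:.3g}")   # large G, tiny p -> reject H0
```

A G this large leaves essentially no doubt that adding the variable improves the model, matching the Sig. = .000 shown in the SPSS Omnibus Tests table.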

(For more information refer to Scott and Long, 1998; Hosmer and Lemeshow, 2003;
Agresti, 2007)

Basically:

G = −2 log( l₀ / l₁ ) = −2[ log(l₀) − log(l₁) ] = −2( L₀ − L₁ )

where L₀ and L₁ denote the maximized log-likelihood (LL) functions for a model with only the
intercept and a model with the intercept and a variable (i.e. a model with the variable added),
respectively. Under H₀: β = 0, this test statistic (G) has a large-sample chi-square
distribution with df = 1 (i.e. an asymptotic chi-square distribution). Why df = 1? Because we
introduced a single (only one) independent variable.

Note: Some authors denote this statistic (G) as G2

ii). Wald test



The Wald test is obtained by comparing the maximum likelihood estimate of the slope
parameter, β̂ (i.e. b), to an estimate of its standard error, S.E.(β̂). The resulting
ratio, under the hypothesis that β = 0, follows a standard normal distribution; therefore the
test statistic is Z:

W = β̂ / S.E.(β̂) = Z

If the ratio is squared:

W² (i.e. Z²) = [ β̂ / S.E.(β̂) ]² = χ²

Under the hypothesis that β is equal to zero (i.e. the null hypothesis), the statistic W²
follows a chi-square distribution (χ²) with 1 degree of freedom. The SPSS program produces
output for this test.

Hypotheses; Stated in the similar way as that for simple linear regression

Note: For the Z test, you can as well use a C.I. to decide whether to accept or
reject the null hypothesis (this approach can also be applied to some previously presented
statistical tests).

The concept: compute β̂ ± 1.96 S.E.(β̂); if the interval contains zero, the estimate is not
significant at the 5% level of significance (i.e. 95% C.I.). If dealing with an OR, to be
significant the interval of the OR should not contain 1.
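These two decision rules (the Z ratio and the 95% C.I.) can be sketched with the AGE coefficient reported in the first worked example below (b = 0.613, S.E. = 0.095); the `wald_and_ci` helper is an illustrative name, not an SPSS function:

```python
import math

def wald_and_ci(b, se):
    """Wald z = b / SE(b), plus 95% C.I. for b and for OR = exp(b)."""
    z = b / se
    lo, hi = b - 1.96 * se, b + 1.96 * se
    return z, (lo, hi), (math.exp(lo), math.exp(hi))

# Coefficient and standard error for AGE from the first worked example
z, ci_b, ci_or = wald_and_ci(b=0.613, se=0.095)
print(f"z = {z:.2f}, z^2 = {z * z:.1f} (Wald chi-square)")
print(f"95% C.I. for b:  ({ci_b[0]:.3f}, {ci_b[1]:.3f})")
print(f"95% C.I. for OR: ({ci_or[0]:.2f}, {ci_or[1]:.2f})")
```

The interval for b excludes zero (equivalently, the interval for the OR excludes 1), so the coefficient is significant at the 5% level, in agreement with the squared ratio matching the Wald chi-square reported by SPSS up to rounding.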

iii). Score test


As opposed to the LR test and the Wald test, it does not require the ML estimate of β.
(Consult the book by Hosmer and Lemeshow (2003).)



As noted earlier for other regression analyses, we usually need to specify the
regression equation/model that we are going to estimate when writing research
proposals/reports. It is usually advisable to write it in a simple and understandable
form. In simple binary logistic regression, a model can be written in the form indicated
in the example below;

Example of a simple binary logistic regression equation (model)

Suppose we want to establish whether there is a relationship between the variable "If ever given
birth (BIRTH)" of a female adolescent as the dependent variable (with values 1 = Ever
given birth, meaning YES, and 0 = Not ever given birth, meaning NO) and the age of the
adolescent (AGE) (in years) as the independent variable (from our file: Workshop SPSS
working file4). The simple binary logistic regression equation/model can be written as:

ln[ P(YES)ᵢ / (1 − P(YES)ᵢ) ] = α + β·AGEᵢ

Whereby;
P(YES)ᵢ = Probability that an adolescent had ever given birth
AGE = Age of an adolescent (years)
α = Regression constant
β = Regression coefficient

OR

ln[ Pᵢ / (1 − Pᵢ) ] = α + β·AGEᵢ

Whereby;
Pᵢ = Probability that an adolescent had ever given birth
AGE = Age of an adolescent (years)
α = Regression constant
β = Regression coefficient

OR

ln[ P(Yᵢ = 1) / (1 − P(Yᵢ = 1)) ] = α + β·AGEᵢ

Whereby;
P(Yᵢ = 1) = Probability that an adolescent had ever given birth
AGE = Age of an adolescent (years)
α = Regression constant
β = Regression coefficient

Note 1: in the above example we had a continuous independent variable (i.e. age). If you
have a categorical independent variable you must indicate this when defining the terms in the
model, by indicating its categories.

Note 2: to simplify interpretation, it is important to assign the higher value to the category
of interest in the dependent variable. For example, if your dependent variable is Adoption with
categories "Adopted" and "Not adopted", we may code the responses as 1 = Adopted, 0 = Not
adopted (or 2 = Adopted, 1 = Not adopted).

Procedures for simple logistic regression in SPSS

Analyze → Regression → Binary Logistic → choose the dependent variable and put it in the
Dependent box → select the independent variable and put it in the Covariates box →
click Categorical (if the independent variable is categorical) → select the categorical
independent variable and put it in the Categorical Covariates box → specify a reference
category and click the Change button (optional, but I prefer the first category!) →
Continue → OK.

Note 3: if the independent variable is continuous, just click OK after putting it in the
Covariates box.
Note 4: when changing the reference category, remember to click the Change button after
selection!
Note 5: You can also instruct the computer to print the 95% C.I. for the Odds Ratio (OR)
via the Options button.



Note 6: To interpret your data properly from the SPSS output (i.e. to know the direction of
the effect), start by studying carefully the coding in the tables "Dependent Variable
Encoding" and "Categorical Variables Codings".

Example for simple logistic regression;

The case of a continuous independent variable

In the output below are the results of the logistic regression analysis for the null hypothesis
that the probability of an adolescent reporting a birth has no relationship with her age,
against the alternative hypothesis that it is influenced by age (i.e. has a relationship with age).

Note: age of respondent was continuous (years)

File name: Workshop SPSS working file4

As shown earlier, the model for this analysis was as follows;

ln[ Pᵢ / (1 − Pᵢ) ] = α + β·AGEᵢ

Whereby;
Pᵢ = Probability that an adolescent had ever given birth (i.e. reported a birth)
AGE = Age of an adolescent (years)
α = Regression constant
β = Regression coefficient

Logistic Regression

Case Processing Summary
Unweighted Cases(a)                            N    Percent
Selected Cases    Included in Analysis       202      100.0
                  Missing Cases                0         .0
                  Total                      202      100.0
Unselected Cases                               0         .0
Total                                        202      100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value            Internal Value
Not ever given birth                   0
Ever given birth                       1

Block 0: Beginning Block

Classification Table(a,b)
                                              Predicted
Observed (birth status)             Not ever      Ever given   Percentage
                                    given birth   birth        Correct
Step 0  Not ever given birth              153              0        100.0
        Ever given birth                   49              0           .0
        Overall Percentage                                            75.7
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation
                      B     S.E.     Wald   df   Sig.   Exp(B)
Step 0   Constant  -1.139   .164   48.116    1   .000     .320

Variables not in the Equation
                              Score   df   Sig.
Step 0   Variables  AGE      61.608    1   .000
         Overall Statistics  61.608    1   .000

Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1   Step        69.301    1   .000
         Block       69.301    1   .000
         Model       69.301    1   .000

Model Summary
         -2 Log       Cox & Snell   Nagelkerke
Step     likelihood   R Square      R Square
1        154.527      .290          .434

Classification Table(a)
                                              Predicted
Observed (birth status)             Not ever      Ever given   Percentage
                                    given birth   birth        Correct
Step 1  Not ever given birth              143             10         93.5
        Ever given birth                   29             20         40.8
        Overall Percentage                                           80.7
a. The cut value is .500

Variables in the Equation
                                                               95.0% C.I. for EXP(B)
                        B      S.E.     Wald   df   Sig.  Exp(B)    Lower    Upper
Step 1(a)  AGE        .613     .095   41.738    1   .000   1.845    1.532    2.222
           Constant -12.408   1.818   46.596    1   .000    .000
a. Variable(s) entered on step 1: AGE.

Interpretation/reporting results

Based on the Model Summary and Variables in the Equation tables (i.e. the last table), it can
be said that age was a good predictor of the probability of reporting a birth by an adolescent
(Nagelkerke R² = 43%). Furthermore, the Wald statistic indicates that age had a
significant relationship with the probability (log-odds) of reporting a birth (P < 0.001). The
odds ratio (OR) indicates that a one-year increase in age was associated with a nearly two-fold
increase in the likelihood of reporting a birth by an adolescent (OR = 1.85; 95% C.I. 1.53
– 2.22).
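A minimal sketch of what these fitted coefficients imply, using the constant and AGE coefficient reported in the output above (the `p_birth` name and the chosen ages are illustrative):

```python
import math

# Fitted coefficients from the SPSS output above
a, b = -12.408, 0.613

def p_birth(age):
    """Predicted probability of reporting a birth at a given age."""
    return 1 / (1 + math.exp(-(a + b * age)))

print(f"OR for AGE: {math.exp(b):.2f}")        # ~1.85, as reported
for age in (15, 18, 21):
    print(f"age {age}: predicted P(birth) = {p_birth(age):.2f}")
```

Plugging the logit back through the logistic function like this shows the practical meaning of the model: the predicted probability of reporting a birth rises steadily with age.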

The case of a categorical independent variable with two categories

Example;
In the output below are the results of the logistic regression analysis for the null hypothesis
that the probability of an adolescent reporting a birth has no relationship with alcohol use,
against the alternative hypothesis that it is influenced by alcohol use (i.e. it has a
relationship with alcohol use by an adolescent).

Note: Alcohol use was categorical with two categories (i.e. “Not use”, and “Use”)

File name: Workshop SPSS working file4

The model for this analysis was as follows;

ln[ Pᵢ / (1 − Pᵢ) ] = α + β·ALCOHOLᵢ

Whereby;
Pᵢ = Probability that an adolescent had ever given birth (i.e. reported a birth)
ALCOHOL = If use alcohol (1 = Yes, 0 = No)
α = Regression constant
β = Regression coefficient

Note: The coding for "ALCOHOL" shown above is based on the reference category chosen
(i.e. "not use"). In the computer coding, the design (dummy) variable produced is coded
1 = if use, 0 = if not use, meaning "not use" is the reference category, even though in the
data file it was coded 1 = if not use and 2 = if use. The reason for the coding produced by
the computer is that the FIRST category was chosen as the reference category.

You can as well use directly in the data file the coding style of 0 = reference category
and 1 = non-reference category. But the file used in this example coded 1 = not use and
2 = use, and during analysis the FIRST category was chosen as the reference category;
consequently the computer coding specified 1 = use and 0 = not use. That is what has been
indicated in the model specification above. Don't confuse it with what has been coded in the
data file used!

Logistic Regression

Case Processing Summary
Unweighted Cases(a)                            N    Percent
Selected Cases    Included in Analysis       202      100.0
                  Missing Cases                0         .0
                  Total                      202      100.0
Unselected Cases                               0         .0
Total                                        202      100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value            Internal Value
Not ever given birth                   0
Ever given birth                       1

Categorical Variables Codings
                            Frequency   Parameter coding (1)
If use alcohol   Not use          152                   .000
                 Use               50                  1.000

Block 0: Beginning Block

Classification Table(a,b)
                                              Predicted
Observed (birth status)             Not ever      Ever given   Percentage
                                    given birth   birth        Correct
Step 0  Not ever given birth              153              0        100.0
        Ever given birth                   49              0           .0
        Overall Percentage                                            75.7
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation
                      B     S.E.     Wald   df   Sig.   Exp(B)
Step 0   Constant  -1.139   .164   48.116    1   .000     .320

Variables not in the Equation
                                     Score   df   Sig.
Step 0   Variables  ALCOHOL(1)      36.440    1   .000
         Overall Statistics         36.440    1   .000

Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1   Step        33.147    1   .000
         Block       33.147    1   .000
         Model       33.147    1   .000

Model Summary
         -2 Log       Cox & Snell   Nagelkerke
Step     likelihood   R Square      R Square
1        190.681      .151          .226

Classification Table(a)
                                              Predicted
Observed (birth status)             Not ever      Ever given   Percentage
                                    given birth   birth        Correct
Step 1  Not ever given birth              131             22         85.6
        Ever given birth                   21             28         57.1
        Overall Percentage                                           78.7
a. The cut value is .500

Variables in the Equation
                                                               95.0% C.I. for EXP(B)
                         B     S.E.     Wald   df   Sig.  Exp(B)    Lower    Upper
Step 1(a)  ALCOHOL(1)  2.072   .369   31.465    1   .000   7.939    3.849   16.375
           Constant   -1.831   .235   60.655    1   .000    .160
a. Variable(s) entered on step 1: ALCOHOL.

Interpretation/reporting results

Based on the Model Summary and Variables in the Equation tables (i.e. the last table), it can
be said that alcohol use predicted 23% of the variation in the probability of reporting a birth
among adolescents, and was therefore a moderately good predictor (Nagelkerke R² = 23%).
The Wald statistic further indicates that alcohol use had a significant relationship with the
log-odds of reporting a birth (P < 0.001). The odds ratio (OR) indicates that alcohol users
were eight times more likely to report a birth compared to non-users (OR = 7.94; 95%
C.I. 3.85 – 16.38).
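The reported OR and its confidence interval can be reproduced directly from B and S.E. in the last table (small rounding differences aside):

```python
import math

# Coefficient and standard error for ALCOHOL(1) from the SPSS output above
b, se = 2.072, 0.369

odds_ratio = math.exp(b)               # OR = exp(B)
ci_low = math.exp(b - 1.96 * se)       # lower 95% limit for exp(B)
ci_high = math.exp(b + 1.96 * se)      # upper 95% limit for exp(B)
print(f"OR = {odds_ratio:.2f} (95% C.I. {ci_low:.2f} - {ci_high:.2f})")
```

Because the whole interval lies above 1, the effect of alcohol use is significant at the 5% level, consistent with the Wald test's P < 0.001.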

The case of a categorical independent variable with more than two categories (e.g.
three categories)

Example;
In the output below are the results of the logistic regression analysis for the null hypothesis
that the probability of an adolescent reporting a birth has no relationship with the wealth
status of the family, against the alternative hypothesis that it has a relationship with the
wealth status of the family.

Note: Wealth status (WEALTH) was categorical with three categories (i.e. “LOW”,
“MODERATE”, and “HIGH”)

File name: Workshop SPSS working file4

The model for this analysis could be;

ln[ Pᵢ / (1 − Pᵢ) ] = α + β₁·MODERATEᵢ + β₂·HIGHᵢ

Whereby;
Pᵢ = Probability that an adolescent had ever given birth (i.e. reported a birth)
MODERATE = If an adolescent is from a moderate-income family (1 = if Yes, 0 = otherwise)
HIGH = If an adolescent is from a high-income family (1 = if Yes, 0 = otherwise)
α = Regression constant
β₁, β₂ = Regression coefficients

OR

ln[ Pᵢ / (1 − Pᵢ) ] = α + β₁·WEALTH1ᵢ + β₂·WEALTH2ᵢ

Whereby;
Pᵢ = Probability that an adolescent had ever given birth (i.e. reported a birth)
WEALTH1 = If an adolescent is from a moderate-income family (1 = if Yes, 0 = otherwise)
WEALTH2 = If an adolescent is from a high-income family (1 = if Yes, 0 = otherwise)
α = Regression constant
β₁, β₂ = Regression coefficients

Logistic Regression

Case Processing Summary
Unweighted Cases(a)                            N    Percent
Selected Cases    Included in Analysis       202      100.0
                  Missing Cases                0         .0
                  Total                      202      100.0
Unselected Cases                               0         .0
Total                                        202      100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value            Internal Value
Not ever given birth                   0
Ever given birth                       1

Categorical Variables Codings
                                           Parameter coding
                             Frequency      (1)       (2)
Wealth status    low                98     .000      .000
of a family      moderate           55    1.000      .000
                 high               49     .000     1.000

Block 0: Beginning Block

Classification Table(a,b)
                                              Predicted
Observed (birth status)             Not ever      Ever given   Percentage
                                    given birth   birth        Correct
Step 0  Not ever given birth              153              0        100.0
        Ever given birth                   49              0           .0
        Overall Percentage                                            75.7
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation
                      B     S.E.     Wald   df   Sig.   Exp(B)
Step 0   Constant  -1.139   .164   48.116    1   .000     .320

Variables not in the Equation
                                    Score   df   Sig.
Step 0   Variables  WEALTH         26.558    2   .000
                    WEALTH(1)       3.880    1   .049
                    WEALTH(2)      14.333    1   .000
         Overall Statistics        26.558    2   .000

Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1   Step        29.748    2   .000
         Block       29.748    2   .000
         Model       29.748    2   .000

Model Summary
         -2 Log       Cox & Snell   Nagelkerke
Step     likelihood   R Square      R Square
1        194.080      .137          .204

Classification Table(a)
                                              Predicted
Observed (birth status)             Not ever      Ever given   Percentage
                                    given birth   birth        Correct
Step 1  Not ever given birth              153              0        100.0
        Ever given birth                   49              0           .0
        Overall Percentage                                            75.7
a. The cut value is .500

Variables in the Equation
                                                               95.0% C.I. for EXP(B)
                         B     S.E.     Wald   df   Sig.  Exp(B)    Lower    Upper
Step 1(a)  WEALTH                     20.462    2   .000
           WEALTH(1)  -1.357   .435    9.746    1   .002    .258     .110     .604
           WEALTH(2)  -2.743   .751   13.344    1   .000    .064     .015     .280
           Constant    -.414   .206    4.024    1   .045    .661
a. Variable(s) entered on step 1: WEALTH.

Note 1: Note that the category "LOW" of the independent variable was chosen as the reference
category.

Note 2: The description of coding for a dummy variable with more than two categories
shown above is also applicable in linear regression analysis/models.

Interpretation/reporting results

Based on the Model Summary and Variables in the Equation tables (i.e. the last table), it can
be said that the wealth status of the family predicted 20% of the variation in the probability
of reporting a birth among adolescents, and was therefore a moderately good predictor
(Nagelkerke R² = 20%). The Wald statistic further indicates that family wealth had a
significant relationship with the log-odds of reporting a birth (P < 0.001). The odds ratios
(OR) indicate that being from a moderate-income family was associated with a significant
reduction in the odds (chances) of reporting a birth relative to being from a low-income
family (OR = 0.26; 95% C.I. 0.11 – 0.60). Similarly, being from a high-income family was also
associated with a significant reduction in the odds of reporting a birth compared to an
adolescent from a low-income family (OR = 0.06; 95% C.I. 0.02 – 0.28).
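The two reported ORs follow directly from the dummy coefficients in the last table; a short sketch (variable names are illustrative):

```python
import math

# Coefficients for the two WEALTH dummies from the SPSS output above
b_moderate, b_high = -1.357, -2.743

for label, b in [("moderate vs low", b_moderate), ("high vs low", b_high)]:
    or_value = math.exp(b)   # OR = exp(B), both below 1 here
    print(f"{label}: OR = {or_value:.2f} "
          f"({(1 - or_value) * 100:.0f}% reduction in odds vs low)")
```

Expressing an OR below 1 as a percentage reduction, as done here, is the same interpretation style used in the paragraph above.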

Note 3: Highlight on Crude Odds Ratio (COR) vs Adjusted Odds Ratio (AOR)



MULTIPLE BINARY LOGISTIC REGRESSIONS

Binary multiple logistic regression

It is an extension of simple binary logistic regression

Here we have more than one independent variable.

Consider a collection of p independent variables denoted by the vector x’ = (x1, x2,…,xp)

Note: x is in bold to denote vector of independent variables

If all the independent variables are at the interval/ratio scale (i.e. continuous), the
conditional probability that Y = 1 given the set of independent variables, i.e.
P(Y = 1 | x) = π(x), or π for simplicity, is given by the logistic equation:

π = e^(α + β₁x₁ + β₂x₂ + … + βₚxₚ) / (1 + e^(α + β₁x₁ + β₂x₂ + … + βₚxₚ))

The logit for the above case can be written as:

ln[ π / (1 − π) ] = α + β₁x₁ + β₂x₂ + … + βₚxₚ

g(x) = α + β₁x₁ + β₂x₂ + … + βₚxₚ

or, for simplicity,

g = α + β₁x₁ + β₂x₂ + … + βₚxₚ

A logit equation for a sample, used to estimate the logit equation for the population, can be
written as:

ĝ = α̂ + β̂₁x₁ + β̂₂x₂ + … + β̂ₚxₚ

or

ĝ = a + b₁x₁ + b₂x₂ + … + bₚxₚ

If some of the independent variables are categorical (i.e. discrete, nominal-scale variables
such as race, sex, etc.), we need to generate design variables (D), or dummy variables, using
a particular coding scheme, as shown in simple logistic regression. There are different styles
of coding scheme (refer to Hosmer and Lemeshow, 2003); however, the style shown previously in
simple logistic regression is the simplest one.

Therefore, if a categorical independent variable has k possible values (categories), there
will be k − 1 dummy variables. Suppose that the jth independent variable xⱼ has kⱼ levels
(categories). The kⱼ − 1 dummy variables can be denoted Dⱼₗ, and the coefficients of
the dummy variables can be denoted βⱼₗ, for l = 1, 2, …, kⱼ − 1.

Thus, the logit for a model with p variables, with the jth variable being categorical, would be:

g(x) = α + β₁x₁ + … + Σ_{l=1}^{kⱼ−1} βⱼₗ Dⱼₗ + βₚxₚ

Fitting the multiple logistic regression model

The coefficients can be estimated by maximizing the respective likelihood function.

Statistical tests in Binary multiple logistic regression

1. Goodness of fit-test

There are several measures, e.g. the Nagelkerke R², Cox & Snell R², McFadden R², etc.

2. Testing for significance for coefficients

i) Likelihood ratio test



Compare the model with only the regression constant α with the model that includes all the
other variables plus the regression constant, using the likelihood ratio test as in simple
binary logistic regression.

OR
Compare the reduced model vs the full model if testing the significance of additional
variable(s) in the model.

Test statistic:

G = −2 (log likelihood of the reduced model, or the model with only the intercept − log
likelihood of the full, fitted model)

G will follow a chi-square distribution with p degrees of freedom, i.e. (p + 1) − 1, under the
null hypothesis that the p "slope" coefficients for the covariates (independent variables) in
the model are equal to zero.

It performs the same task as the ANOVA in multiple linear regression.

Hypotheses: the same as in linear regression analysis for the ANOVA.

In carrying out this test you may well end up rejecting the null hypothesis and hence
concluding that at least one, and perhaps all, of the p coefficients are different from zero,
an interpretation analogous to that in multiple linear regression.

ii) Significance of individual coefficients (Wald test)

Before concluding that any or all of the coefficients are nonzero, we may wish to look at
the univariate Wald test statistics, i.e. we can use the Wald test to test the significance of
individual coefficients:

Wⱼ = β̂ⱼ / S.E.(β̂ⱼ) = Zⱼ

Note: under the hypothesis that an individual coefficient is zero, the above Wald statistic
will follow the standard normal distribution.

If the ratio is squared:

Wⱼ² (i.e. Zⱼ²) = [ β̂ⱼ / S.E.(β̂ⱼ) ]² = χ²

Under the null hypothesis that the coefficient is zero, the resulting statistic will follow a
chi-square distribution with 1 degree of freedom. The SPSS program produces output for this
test.

Hypotheses; Stated in the similar way as that for multiple linear regression

Note: For the Z test, you can as well use a C.I. to decide whether to accept or
reject the null hypothesis (this approach can also be applied to some previously presented
statistical tests).

The concept: compute β̂ ± 1.96 S.E.(β̂); if the interval contains zero, the estimate is not
significant at the 5% level of significance (i.e. 95% C.I.). If dealing with an OR, to be
significant the interval of the OR should not contain 1.

Further clarification on the use of the odds ratio (OR)

In simplest terms, the OR is the CHANGE (i.e. factor change) in the likelihood of observing
a particular category (i.e. the category coded as 1) of the response variable when a
categorical (discrete) independent variable is changed from the reference category to a
particular category, or when a continuous independent variable is increased by one unit.

OR = 1 means that changing the independent variable from the reference category to a
particular category, or increasing a continuous independent variable by one unit, has no
effect on the likelihood (odds or probability) of observing the particular category of the
response variable (i.e. that coded as 1).

OR > 1 (e.g. 1.5, 2.4, 3.1, etc.) means that changing the independent variable from the
reference category to a particular category, or increasing a continuous independent variable
by one unit, results in an increased likelihood (odds or probability) of observing the
particular category of the response variable (i.e. that coded as 1).

OR < 1 (e.g. 0.36, 0.42, 0.67) means that changing the independent variable from the
reference category to a particular category, or increasing a continuous independent variable
by one unit, results in a decreased likelihood (odds or probability) of observing the
particular category of the response variable (i.e. that coded as 1).

Example1: a case of categorical independent variable.



Suppose we want to determine the effect of area of residence (coded 0 = Rural, 1 = Urban),
the independent variable, on the likelihood of having breast cancer for a woman of
reproductive age, the dependent variable (coded 0 = diagnosed negative, 1 = diagnosed
positive).

When the OR is above 1:
An OR = 1.4 implies that women residing in urban areas were 40% more likely
to be diagnosed positive for breast cancer relative to those living in rural areas (i.e.
relative to their counterparts). Alternatively, living in an urban area increased the odds of
being diagnosed positive for breast cancer by 40% relative to rural areas.

An OR = 2.4 implies that women living in urban areas were more than twice as likely to be
diagnosed positive for breast cancer relative to those living in rural areas.

An OR = 4.1 implies that women living in urban areas were about four times more likely to be
diagnosed positive for breast cancer relative to those living in rural areas.

An OR = 1.7 implies that women living in urban areas were almost twice as likely to be
diagnosed positive for breast cancer relative to those living in rural areas. Alternatively,
this OR can be interpreted as meaning that women living in urban areas were 70% more likely
to be diagnosed positive for breast cancer relative to those living in rural areas (i.e.
relative to their counterparts), or that living in an urban area increased the odds of being
diagnosed positive for breast cancer by 70% relative to rural areas.

An OR = 3.0 implies that women living in urban areas were three times more likely to be
diagnosed positive for breast cancer relative to those living in rural areas.

When the OR is below 1:

An OR = 0.36 implies that living in an urban area was associated with a 64% reduction in the
odds (or probability) of being diagnosed positive for breast cancer; or, living in an urban
area decreased the likelihood (odds or probability) of being diagnosed positive for breast
cancer by 64%.

An OR = 0.50 implies that living in an urban area was associated with a 50% reduction in the
odds (or probability) of being diagnosed positive for breast cancer; or, living in an urban
area decreased the likelihood (odds or probability) of being diagnosed positive for breast
cancer by 50%.

An OR = 0.25 implies that living in an urban area was associated with a 75% (or three-quarters)
reduction in the odds (or probability) of being diagnosed positive for breast cancer; or,
living in an urban area decreased the likelihood (odds or probability) of being diagnosed
positive for breast cancer by 75% (or three-quarters).



Example 2: the case of a continuous independent variable.
Suppose we want to determine the effect of age recorded as it is (not categorized, i.e.
continuous), the independent variable, on the likelihood of having breast cancer for a woman
of reproductive age, the dependent variable (coded 0 = diagnosed negative, 1 = diagnosed
positive).

An OR = 0.36 implies that an increase in age by one year would be associated with a 64%
reduction in the odds (or probability) of being diagnosed positive for breast cancer; or, an
increase in age by one year would decrease the likelihood (odds or probability) of being
diagnosed positive for breast cancer by 64%.

An OR = 1.4 implies that an increase in age by one year would increase the likelihood (odds
or probability) of being diagnosed positive for breast cancer by 40%.

An OR = 3.0 implies that an increase in age by one year would result in a three-fold increase
in the likelihood (odds or probability) of being diagnosed positive for breast cancer.

Etc.

Note: different grammatical styles for interpreting the OR can be encountered in the
literature. However, the styles presented above are the common ones and are easy to
understand.

How do we calculate OR?

It can be shown that the regression coefficient β₁ is related to the OR in the following way:

OR = e^(β₁), whereby e = 2.718, a constant.

Therefore, we can compute the OR once we have the estimated coefficient, by simply
computing Exp(β).

THIS SIMPLE RELATIONSHIP BETWEEN THE COEFFICIENT AND THE ODDS
RATIO IS THE FUNDAMENTAL REASON WHY LOGISTIC REGRESSION HAS
PROVEN TO BE SUCH A POWERFUL ANALYTIC RESEARCH TOOL.

Likewise, ln(OR) = β₁.
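A one-line check of this relationship in both directions (the coefficient value 0.613 is taken from the AGE example earlier; any other value would work the same way):

```python
import math

b = 0.613                      # regression coefficient (from the AGE example)
odds_ratio = math.exp(b)       # OR = e^b
recovered_b = math.log(odds_ratio)   # ln(OR) = b, the inverse relation
print(f"b = {b} -> OR = {odds_ratio:.3f} -> ln(OR) = {recovered_b:.3f}")
```

The round trip exp followed by ln recovers the coefficient exactly, which is why reporting either b or the OR carries the same information.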

As noted earlier for other regression analyses, we usually need to specify the
regression equation/model that we are going to estimate when writing research
proposals/reports. It is usually advisable to write it in a simple and understandable
form. In multiple binary logistic regression, a model can be written in the form indicated
in the examples below;


Example of multiple binary logistic regression equation (model)

Suppose we want to establish whether there is a relationship between a variable "If ever given
birth (BIRTH)" by a female adolescent as a dependent variable and Age (AGE), Highest
education level attained (HIHESED), Religious affiliation (RELIGION), Ethnicity
(ETHNIC), Wealth status of a family (WEALTH), Living arrangement (LIVING), Type
of marriage by parents (TMARRIAG), Having close friends that are sexually active
(ACTIVE), and Use of alcohol (ALCOHOL) by an adolescent as independent variables;
(from our file: Workshop SPSS working file4)

Note:
- A dependent variable "BIRTH" was categorical with two categories (Ever given
birth, Not ever given birth).
- Independent variable “AGE” was continuous (in years).
- Independent variable “HIHESED” was categorical with two categories (Primary
and below, Secondary and above).
- Independent variable “RELIGION” was categorical with three categories
(Catholic, Protestant, Moslem).
- Independent variable “ETHNIC” was categorical with two categories (Gogo,
Others)
- Independent variable “WEALTH” was categorical with three categories (Low,
Moderate, High).
- Independent variable “LIVING” was categorical with three categories (Single
parent, Both parents, Others).
- Independent variable “TMARRIAG” was categorical with two categories
(Polygamy, Monogamy)
- Independent variable “ACTIVE” was categorical with two categories (Yes, No)
- Independent variable “ALCOHOL” was categorical with two categories (Not use,
Use)

Therefore, multiple binary logistic regression model/logit model for the above problem
can be written as;

ln[P(Yi = 1) / (1 − P(Yi = 1))] = α + β1AGEi + β2HIHESEDi + β3RELIGION1i + β4RELIGION2i + β5ETHNICi +
β6WEALTH1i + β7WEALTH2i + β8LIVING1i + β9LIVING2i + β10TMARRIAGi + β11ACTIVEi + β12ALCOHOLi

Whereby;
P(Yi = 1) = Probability that an adolescent had ever given birth
AGE = Age of an adolescent (years)
HIHESED = Highest education level attained (1 = if secondary and above, 0 = if primary
and below).
RELIGION1 = If religious affiliation is Protestant (1 = Yes, 0 = Otherwise).
RELIGION2 = If religious affiliation is Moslem (1 = Yes, 0 = Otherwise).
ETHNIC = Ethnicity (1 = if others, 0 = if Gogo)


WEALTH1= If family wealth status is moderate (1 = Yes, 0 = Otherwise)
WEALTH2= If family wealth status is high (1 = Yes, 0 = Otherwise)
LIVING1= If living with both parents (1 = Yes, 0 = Otherwise)
LIVING2= If living with others (1 = Yes, 0 = Otherwise)
TMARRIAG = Type of marriage by parents (1 = if monogamy, 0 = if polygamy)
ACTIVE = If have close friends that are sexually active (1 = Don’t have, 0 = Have)
ALCOHOL = If use alcohol (1 = Yes, 0 = No)
α = Regression constant
β1 ... β12 = Regression coefficients

OR

ln[P(Yi = 1) / (1 − P(Yi = 1))] = α + β1AGEi + β2HIHESEDi + β3PROTESTANTi + β4MOSLEMi + β5ETHNICi +
β6MODERATEi + β7HIGHi + β8BOTHi + β9OTHERSi + β10TMARRIAGi + β11ACTIVEi + β12ALCOHOLi

P(Yi = 1) = Probability that an adolescent had ever given birth


AGE= Age of an adolescent (Years)
HIHESED= Highest education level attained (1= if secondary and above, 0 = if primary
and below).
PROTESTANT = If religious affiliation is protestant (1 = Yes, 0 = Otherwise).
MOSLEM = If religious affiliation is moslem (1 = Yes, 0 = Otherwise).
ETHNIC = Ethnicity (1 = if others, 0 = if Gogo)
MODERATE = If family wealth status is moderate (1 = Yes, 0 = Otherwise)
HIGH = If family wealth status is high (1 = Yes, 0 = Otherwise)
BOTH = If living with both parents (1 = Yes, 0 = Otherwise)
OTHERS = If living with others (1 = Yes, 0 = Otherwise)
TMARRIAG = Type of marriage by parents (1 = if monogamy, 0 = if polygamy)
ACTIVE = If have close friends that are sexually active (1 = Don’t have, 0 = Have)
ALCOHOL = If use alcohol (1 = Yes, 0 = No)
α = Regression constant
β1 ... β12 = Regression coefficients

Note: You can use another format for expressing the information on the left-hand side of the
equation (i.e. the logit), as we have seen in simple logistic regression. These include;

ln[Pi / (1 − Pi)] = ...
Whereby;
Pi = Probability that an adolescent had ever given birth
OR


ln[P(Yi = 1) / (1 − P(Yi = 1))] = ...
Whereby;
P(Yi = 1) = Probability that an adolescent had ever given birth
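Whichever notation is used, the left-hand side is the log-odds (logit), and a fitted log-odds value can always be converted back to a probability with the inverse-logit transform. A brief Python sketch (the function names are mine, not from SPSS):

```python
import math

def logit(p):
    """Log-odds of a probability p: the left-hand side of the model."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Convert a log-odds value z back to a probability."""
    return 1 / (1 + math.exp(-z))

p = 0.25                 # an illustrative probability
z = logit(p)             # log-odds
print(round(inverse_logit(z), 4))   # the round trip recovers 0.25
print(inverse_logit(0.0))           # log-odds of 0 means probability 0.5
```

This is the transform used later, after estimation, to turn the fitted model into a predicted probability for a given set of covariate values.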

Procedures for multiple binary logistic regression in SPSS

Analyze → Regression → Binary Logistic → Choose a dependent variable and put it to the
Dependent box → Select independent variables and put them to the Covariates box →
Click Categorical → Select independent variables that are categorical and put them to
the Categorical Covariates box → Specify a reference category and click the Change
button (Option: but I prefer the first category!!!) → Continue → OK.
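The "Categorical Covariates" step above is where SPSS creates the indicator (dummy) variables, with the chosen reference category coded 0 throughout. The same indicator coding can be sketched by hand in Python (a toy illustration of the coding scheme, not SPSS itself):

```python
def indicator_code(values, categories):
    """Indicator (dummy) coding with the FIRST category as reference:
    one 0/1 column per non-reference category, mirroring the layout of
    SPSS's 'Categorical Variables Codings' table."""
    return {cat: [1 if v == cat else 0 for v in values]
            for cat in categories[1:]}

# WEALTH with 'low' as the reference category (coded 0 in every column)
wealth = ["low", "moderate", "high", "moderate"]
coded = indicator_code(wealth, ["low", "moderate", "high"])
print(coded["moderate"])  # [0, 1, 0, 1]
print(coded["high"])      # [0, 0, 1, 0]
```

A k-category variable thus contributes k − 1 columns (and k − 1 coefficients) to the model, which is why RELIGION, WEALTH and LIVING each appear twice in the equations above.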


Example:

Suppose we want to establish whether there is a relationship between reporting birth by a female
adolescent and various socio-demographic, socio-environmental, and behavioral
variables such as her age, education level, religious affiliation, ethnicity, wealth status of
a family, living arrangement, type of marriage by parents, peer pressure (if she has close
friends that are sexually active), and use of alcohol. The model for this analysis would be
as indicated in the examples above, i.e.;

ln[P(Yi = 1) / (1 − P(Yi = 1))] = α + β1AGEi + β2HIHESEDi + β3RELIGION1i + β4RELIGION2i + β5ETHNICi +
β6WEALTH1i + β7WEALTH2i + β8LIVING1i + β9LIVING2i + β10TMARRIAGi + β11ACTIVEi + β12ALCOHOLi

Whereby;
P(Yi = 1) = Probability that an adolescent had ever given birth
AGE = Age of an adolescent (years)
HIHESED = Highest education level attained (1 = if secondary and above, 0 = if primary
and below).
RELIGION1 = If religious affiliation is Protestant (1 = Yes, 0 = Otherwise).
RELIGION2 = If religious affiliation is Moslem (1 = Yes, 0 = Otherwise).


ETHNIC = Ethnicity (1 = if others, 0 = if Gogo)
WEALTH1= If family wealth status is moderate (1 = Yes, 0 = Otherwise)
WEALTH2= If family wealth status is high (1 = Yes, 0 = Otherwise)
LIVING1= If living with both parents (1 = Yes, 0 = Otherwise)
LIVING2= If living with others (1 = Yes, 0 = Otherwise)
TMARRIAG = Type of marriage by parents (1 = if monogamy, 0 = if polygamy)
ACTIVE = If have close friends that are sexually active (1 = Don’t have, 0 = Have)
ALCOHOL = If use alcohol (1 = Yes, 0 = No)
α = Regression constant
β1 ... β12 = Regression coefficients

Hypotheses: these would be analogous to those in multiple linear regression;

Example:

For testing overall significance (using LR test)

Ho: Independent variables included in the model collectively had no significant influence
on probability for reporting birth (i.e. had no influence on a dependent variable)
Ha: Independent variables included in the model collectively had significant influence on
probability for reporting birth (i.e. had influence on a dependent variable)

For testing significance of individual coefficients (using the Wald chi-square test)

Ho: A given independent variable included in the model had no significant influence on the
probability of reporting birth (i.e. its coefficient is zero)
Ha: The independent variable had significant influence on the probability of reporting
birth (i.e. its coefficient differs from zero)

Results

Logistic Regression


Case Processing Summary

Unweighted Cases(a)                       N     Percent
Selected Cases   Included in Analysis     202   100.0
                 Missing Cases              0      .0
                 Total                    202   100.0
Unselected Cases                            0      .0
Total                                     202   100.0
a. If weight is in effect, see classification table for the total
number of cases.

Dependent Variable Encoding

Original Value         Internal Value
Not ever given birth   0
Ever given birth       1

Categorical Variables Codings

                                                      Parameter coding
                                          Frequency   (1)      (2)
Wealth status of a family  low                  98    .000     .000
                           moderate             55   1.000     .000
                           high                 49    .000    1.000
Religion affiliation       catholic             52    .000     .000
                           protestant          134   1.000     .000
                           moslem               16    .000    1.000
Living arrangement         single parent        79    .000     .000
                           both parents        103   1.000     .000
                           others (relatives)   20    .000    1.000
Ethnicity                  gogo                153    .000
                           others               49   1.000
If use alcohol             Not use             152    .000
                           Use                  50   1.000
Type of marriage by        Polygamy            123    .000
parents                    Monogamy             79   1.000
If close friends are       Yes                 173    .000
sexually active            No                   29   1.000
Highest education          primary or below    146    .000
level                      Sec and above        56   1.000

Block 0: Beginning Block


Classification Table(a,b)

                                                      Predicted
                                           if ever given birth (birth status)
                                           Not ever      Ever given   Percentage
Observed                                   given birth   birth        Correct
Step 0  if ever given     Not ever given   153           0            100.0
        birth (birth      birth
        status)           Ever given birth  49           0               .0
        Overall Percentage                                             75.7
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation

                  B        S.E.   Wald     df   Sig.   Exp(B)
Step 0  Constant  -1.139   .164   48.116   1    .000   .320

Variables not in the Equation

                                Score     df   Sig.
Step 0  Variables  AGE          61.608    1    .000
                   HIHESED(1)    4.193    1    .041
                   RELIGION      8.003    2    .018
                   RELIGION(1)   6.796    1    .009
                   RELIGION(2)    .005    1    .942
                   ETHNIC(1)    14.333    1    .000
                   WEALTH       26.558    2    .000
                   WEALTH(1)     3.880    1    .049
                   WEALTH(2)    14.333    1    .000
                   LIVING       33.230    2    .000
                   LIVING(1)    27.550    1    .000
                   LIVING(2)      .219    1    .640
                   TMARRIAG(1)    .003    1    .956
                   ACTIVE(1)    21.762    1    .000
                   ALCOHOL(1)   36.440    1    .000
        Overall Statistics     109.874   12    .000

Block 1: Method = Enter


Omnibus Tests of Model Coefficients

                Chi-square   df   Sig.
Step 1  Step    155.839      12   .000
        Block   155.839      12   .000
        Model   155.839      12   .000


Model Summary

        -2 Log        Cox & Snell   Nagelkerke
Step    likelihood    R Square      R Square
1       67.989        .538          .803

Classification Table(a)

                                                      Predicted
                                           if ever given birth (birth status)
                                           Not ever      Ever given   Percentage
Observed                                   given birth   birth        Correct
Step 1  if ever given     Not ever given   146           7             95.4
        birth (birth      birth
        status)           Ever given birth  10          39             79.6
        Overall Percentage                                             91.6
a. The cut value is .500

Variables in the Equation

                                                                 95.0% C.I. for EXP(B)
              B         S.E.    Wald     df   Sig.   Exp(B)    Lower    Upper
Step  AGE       1.097    .226   23.618   1    .000     2.996    1.925     4.664
1(a)  HIHESED(1)   -2.703    .866    9.737   1    .002      .067     .012      .366
      RELIGION                   6.576   2    .037
      RELIGION(1)  -1.903    .870    4.781   1    .029      .149     .027      .821
      RELIGION(2)  -3.614   1.593    5.150   1    .023      .027     .001      .611
      ETHNIC(1)    -5.271   1.813    8.450   1    .004      .005     .000      .180
      WEALTH                    11.626   2    .003
      WEALTH(1)    -2.933    .907   10.460   1    .001      .053     .009      .315
      WEALTH(2)    -2.101   1.163    3.262   1    .071      .122     .013     1.196
      LIVING                    10.136   2    .006
      LIVING(1)    -2.347    .810    8.399   1    .004      .096     .020      .468
      LIVING(2)     1.247   1.268     .968   1    .325     3.479     .290    41.724
      TMARRIAG(1)  -1.399    .709    3.894   1    .048      .247     .061      .991
      ACTIVE(1)     -.156    .891     .031   1    .861      .856     .149     4.910
      ALCOHOL(1)    3.129    .878   12.711   1    .000    22.860    4.092   127.711
      Constant    -17.659   3.897   20.538   1    .000      .000
a. Variable(s) entered on step 1: AGE, HIHESED, RELIGION, ETHNIC, WEALTH, LIVING, TMARRIAG, ACTIVE, ALCOHOL.


Summarizing and interpreting the results

Table.. Multiple logistic regression for factors influencing fertility (i.e. reporting birth)
among female adolescents
Variable B S.E. Wald df Sig. Exp(B) 95.0% C.I.for EXP(B)
Lower Upper
AGE 1.097 .226 23.618 1 .000 2.996 1.925 4.664
HIHESED -2.703 .866 9.737 1 .002 .067 .012 .366
RELIGION1 -1.903 .870 4.781 1 .029 .149 .027 .821
RELIGION2 -3.614 1.593 5.150 1 .023 .027 .001 .611
ETHNIC -5.271 1.813 8.450 1 .004 .005 .000 .180
WEALTH1 -2.933 .907 10.460 1 .001 .053 .009 .315
WEALTH2 -2.101 1.163 3.262 1 .071 .122 .013 1.196
LIVING1 -2.347 .810 8.399 1 .004 .096 .020 .468
LIVING2 1.247 1.268 .968 1 .325 3.479 .290 41.724
TMARRIAG -1.399 .709 3.894 1 .048 .247 .061 .991
ACTIVE -.156 .891 .031 1 .861 .856 .149 4.910
ALCOHOL 3.129 .878 12.711 1 .000 22.860 4.092 127.711
Constant -17.659 3.897 20.538 1 .000 .000
Nagelkerke R2= 0.80
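The pseudo-R² values reported in the Model Summary can be reproduced from the model chi-square (155.839), the model −2 log likelihood (67.989) and n = 202, using the standard Cox & Snell and Nagelkerke formulas. A short Python sketch:

```python
import math

n = 202                 # cases included in the analysis
chi_square = 155.839    # model chi-square from the Omnibus test
neg2ll_model = 67.989   # -2 log likelihood of the fitted model

# -2LL of the null (constant-only) model, since
# model chi-square = (-2LL_null) - (-2LL_model)
neg2ll_null = neg2ll_model + chi_square

# Cox & Snell R^2 = 1 - exp(-chi-square / n)
cox_snell = 1 - math.exp(-chi_square / n)

# Nagelkerke R^2 rescales Cox & Snell by its maximum attainable value
max_r2 = 1 - math.exp(-neg2ll_null / n)
nagelkerke = cox_snell / max_r2

print(round(cox_snell, 3))   # 0.538, matching the Model Summary
print(round(nagelkerke, 3))  # 0.803, matching the Model Summary
```

Recomputing these by hand is a useful check that the output tables have been read correctly.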

Based on the Model Summary and Variables in the Equation tables (i.e. the last table), it can
be said that the variables included in the model were good predictors of reporting birth by an
adolescent (Nagelkerke R2 = 0.80). Wald chi-square tests indicate that age, education level,
religious affiliation, ethnicity, wealth status of a family, living arrangement, type of
marriage by parents and alcohol use had significant influence on the probability of reporting
birth (i.e. having pre-marital fertility) by an adolescent. The effect of peer pressure was not
significant. An increase in age of one year was associated with a three-fold increase in the
likelihood of reporting birth (OR = 3.0, 95% C.I. 1.93 – 4.66). Having secondary education
and above was associated with a reduced likelihood of reporting birth (OR = 0.07, 95% C.I.
0.01 – 0.37). Results also indicate that being Protestant relative to Catholic, and being
Moslem relative to Catholic, were associated with reduced odds of reporting birth by an
adolescent (OR = 0.15, 95% C.I. 0.03 – 0.81 and OR = 0.03, 95% C.I. 0.00 – 0.61,
respectively). Adolescents from other tribes were less likely to report birth compared to
those from the Gogo tribe (OR = 0.01, 95% C.I. 0.00 – 0.18). Adolescents from families of
moderate wealth status were less likely to report birth compared to those from low wealth
status families (OR = 0.05, 95% C.I. 0.01 – 0.32). The effect of high wealth status relative
to low was not significant. Regarding living arrangement, living with both parents reduced
the likelihood of reporting birth relative to living with a single parent (OR = 0.10, 95% C.I.
0.02 – 0.47), while living with others had no significant effect relative to living with a
single parent (OR = 3.5, 95% C.I. 0.29 – 41.72). Parents' marriage being monogamous rather
than polygamous reduced the likelihood of reporting birth (OR = 0.25, 95% C.I. 0.06 – 0.99),
whereas not having close friends that are sexually active had no significant effect relative
to having them (OR = 0.86, 95% C.I. 0.15 – 4.91). Alcohol use greatly increased the chances
of reporting birth relative to non-use (OR = 22.9, 95% C.I. 4.09 – 127.71).
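The fitted coefficients can also be turned into a predicted probability for a particular profile, by summing the relevant terms of the equation and applying the inverse-logit transform. A sketch using the estimates from the table above (the profile chosen is hypothetical):

```python
import math

# Coefficients from the fitted model (Variables in the Equation table)
constant = -17.659
b_age, b_alcohol = 1.097, 3.129

# Hypothetical profile: an 18-year-old with primary education, Catholic,
# Gogo, low wealth, living with a single parent, polygamous parents, and
# sexually active close friends -- all reference categories (coded 0) --
# who uses alcohol (ALCOHOL = 1). Only AGE, ALCOHOL and the constant
# therefore contribute to the log-odds.
log_odds = constant + b_age * 18 + b_alcohol * 1
probability = 1 / (1 + math.exp(-log_odds))
print(round(probability, 3))  # a very high predicted probability of reporting birth
```

Changing any dummy to 1 (e.g. secondary education, HIHESED = 1) adds that coefficient to the log-odds before the transform, which is a convenient way to see how each factor shifts the predicted probability.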


Note: Remember to explain/clarify the following;
- The problem of independent variable selection
- Forward and Backward selection vs Enter
- Doing preliminary analysis (bivariate, i.e. Pearson's chi-square) and proceeding
with the significant independent variables in multivariable regression analysis
- Systematic selection of independent variables in models and hence having
several models to compare: i.e. Model 1, Model 2, Model 3, etc.
