Purpose of Analysis Is To Answer The Research Questions Outlined in The Objectives
When you selected the variables for your study, you did so with the assumption that they either:
- would help to define your problem (dependent variables) and its different components, or
- were contributory factors to your problem (independent variables).

The purpose of data analysis is to identify whether these assumptions were correct or not, and to highlight possible new views on the problem under study. The ultimate purpose of analysis is to answer the research questions outlined in the objectives with your data.

Before we look at how variables may be affecting one another, we need to summarise the information obtained on each variable in simple tabular form or in a figure. Some of the variables may have produced numerical data, while other variables produced categorical data. In analysing our data, it is important first of all to determine the type of data that we are dealing with. This is crucial because the type of data largely determines which statistical techniques should be used to test whether the results of the study are significant.
Categorical data
There are two types of categorical data: they are nominal or ordinal (see Module 8). In NOMINAL DATA, the variables are divided into a number of named categories. These categories, however, cannot be ordered one above another (as they are not greater or lesser than each other). For example:
In ORDINAL DATA, the variables are also divided into a number of categories, but these can be ordered one above another, from lowest to highest or vice versa. For example:
Numerical data
We speak of NUMERICAL DATA if they are expressed in numbers. There are two types of numerical data: they are discrete or continuous. DISCRETE DATA are a distinct series of numbers. For example:
CONTINUOUS DATA come from variables that can be measured with greater precision, depending on the accuracy of the measuring instrument, and each value can increase or decrease without limit. For example:
- Frequency distributions
- Percentages, proportions, ratios and rates
- Figures
- Measures of central tendency

We will now discuss these operations one after the other for both categorical and numerical data.
Example 1: To identify what family planning methods were used by teenagers in Kweneng, West Botswana, teenagers were asked what method they were using. The results are presented in the following frequency distribution:
These data are NOMINAL. A frequency distribution is calculated by simply totalling the number of responses in each category. You should always check that the total number of responses agrees or tallies with the number of subjects (respondents). If necessary, there should be a category for missing answers.
We usually express frequency distributions in percentages (see Part III of this Module). By looking at the frequency distribution above you can conclude that roughly 75%, or three out of four, of the teenagers are not using family planning. For those who are using family planning methods, condoms and pills are the most commonly used methods.

Example 2: Health personnel from 148 different rural health institutions were asked the following question: "How often have you run out of drugs for the treatment of malaria in the past two years?" This was a closed question with the following possible answers: never, 1 to 2 times (rarely), 3 to 5 times (occasionally), more than 5 times (frequently). The number of responses in each category was totalled to give the following frequency distribution:
In this example, the data are ORDINAL. The ordering of the categories is important as each category from top to bottom indicates increasing severity of the problem. The frequency distribution results indicate that most clinics rarely experience shortages of anti-malarial drugs, but that it is an occasional problem in about one sixth of the clinics and a severe problem in a few.
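The tallying and checking described for Examples 1 and 2 can be sketched in a few lines of Python. This is only an illustration: the list of responses below is hypothetical and is not the data of either example, and the function name is an assumption made for this sketch.

```python
from collections import Counter

# Hypothetical answers, for illustration only (None = no answer given).
CATEGORIES = ["never", "rarely", "occasionally", "frequently"]
responses = ["rarely", "never", "rarely", "occasionally", None,
             "frequently", "rarely", "never", "rarely", "occasionally"]


def frequency_distribution(answers, categories):
    """Tally answers per category, keeping the order of the categories."""
    counts = Counter("missing" if a is None else a for a in answers)
    table = {c: counts.get(c, 0) for c in categories}
    table["missing"] = counts.get("missing", 0)
    # The total must tally with the number of respondents; an answer that
    # fits no category would make this check fail.
    assert sum(table.values()) == len(answers), "total does not tally"
    return table


table = frequency_distribution(responses, CATEGORIES)
for category, count in table.items():
    print(f"{category:<14}{count:>4}")
print(f"{'total':<14}{sum(table.values()):>4}")
```

Keeping the categories in a fixed list preserves their order, which matters for ordinal data, and the separate "missing" row makes unanswered questions visible instead of silently dropping them.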
2. Numerical data
Procedures for making frequency distributions of numerical data are very similar to those for categorical data, except that now the data have to be grouped in categories. The steps involved in making a frequency distribution are as follows:
1. Select groups for grouping the data.
2. Count the number of measurements in each group.
3. Add up and check the results.

When grouping data, the way the groups are selected can affect what the results are going to look like. There is little substitute for common sense here, but it may be necessary to change the grouping if you suspect the information is being hidden by a poor selection of the groups.
Example 3: Health centres of District X are submitting numbers of malaria cases and you wish to summarise them. Compare the daily and weekly summaries of the same data as presented in Table 22.1.

Table 22.1: Daily and weekly summaries of malaria cases in health centres in District X

Both daily and weekly data show an increasing number of malaria cases, but the improving situation shown in days 19, 20 and 21 is not reflected in the weekly summary. It would therefore be better to use the daily data if you want to indicate exactly when the numbers of reported malaria cases started going down.
When grouping data the following rules are important:
- The groups must not overlap, otherwise there is confusion concerning in which group a measurement belongs.
- There must be continuity from one group to the next, which means that there must be no gaps. Otherwise some measurements may not fit in a group.
- The groups must range from the lowest measurement to the highest measurement so that all of the measurements have a group to which they can be assigned.
- The groups should normally be of an equal width, so that the counts in different groups can easily be compared. Sometimes, however, it is valid to choose groups that are of different widths, for example if you are interested in specific age groups (e.g., less than 1 year, 1 to 4 years, 5 to 14 years).

When you start summarising data it is better to make too many groups than too few. This is because during data analysis you can combine groups to form new categories without having to go through all your data again, whereas if you have too few groups you have to go back to your raw data to make more groups. A larger number of groups will generally give a more precise picture, but when using too many groups one can lose the overview.

As a general rule choose round numbers for the lower values of the group limits. For example: 1.00-9.99, 10.00-19.99, 20.00-29.99, or: 0-4, 5-9, 10-14, etc.
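As a minimal sketch of these grouping rules, the Python below counts measurements in equal-width groups and then combines adjacent groups without going back to the raw data. The measurements are hypothetical, and the function names group_counts and combine are assumptions made for this illustration.

```python
def group_counts(values, start, width, n_groups):
    """Count values in n_groups equal-width groups [lower, upper)."""
    # Half-open intervals guarantee that groups neither overlap nor leave gaps.
    end = start + width * n_groups
    if min(values) < start or max(values) >= end:
        raise ValueError("groups do not cover all measurements")
    counts = [0] * n_groups
    for v in values:
        counts[int((v - start) // width)] += 1
    # Add up and check the results.
    assert sum(counts) == len(values)
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]


def combine(groups, k):
    """Merge every k adjacent groups into one wider group."""
    merged = []
    for i in range(0, len(groups), k):
        chunk = groups[i:i + k]
        merged.append((chunk[0][0], chunk[-1][1], sum(c for _, _, c in chunk)))
    return merged


# Hypothetical measurements, for illustration only.
measurements = [3, 12, 25, 19, 40, 7, 33, 58, 21, 15, 0, 44]

fine = group_counts(measurements, start=0, width=10, n_groups=6)
coarse = combine(fine, 2)

for lower, upper, count in fine:
    print(f"{lower:>3} to {upper - 1:<3} {count}")
```

Starting with the narrower groups and merging them afterwards mirrors the advice to make too many groups rather than too few: coarse is derived from fine alone, while the reverse would require the raw measurements.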
1. Percentages

Instead of presenting data in frequency tables using absolute numbers it is often better to calculate percentages. A PERCENTAGE is the number of units in the sample with a certain characteristic, divided by the total number of units in the sample and multiplied by 100. Percentages may also be called RELATIVE FREQUENCIES. Percentages standardise the data, which means that they make it easier to compare them with similar data obtained in another sample of different size or origin.

Example 4: 82 clinics in one district were asked to submit the number of patients treated for malaria in one month. The researchers presented both the frequency distribution and percentages (or relative frequencies):
Table 22.2: Distribution of clinics according to number of patients treated for malaria in one month
Note: Usually you do not include missing data in the calculation of percentages. The frequency of responses in each group is calculated as the percentage of those study elements for which you obtained data (or, if a question is being asked to interviewees, the percentage of those interviewees who answered the question). However, the number of missing data (e.g., people who did not respond to a question) is a useful indication of the adequacy of your data collection. Therefore this number should be mentioned, for example as a note to your table. (See Table 22.2.)

Remember that "don't know" is a special category that should NOT be counted as missing data. If applicable, "don't know" should appear as a category in the table.

One should be cautious when calculating and interpreting percentages if the total number is small, because one unit more or less would make a big difference in terms of percentages. As a general rule, percentages should not be used when the total is less than 30. Therefore it is recommended that the number of observations or total cases studied should always be given together with the percentage.
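The handling of missing data and of "don't know" answers can be illustrated with a short Python sketch. The counts below are hypothetical, and the function name and the returned small-total flag are assumptions made for this example.

```python
def percentages(counts):
    """Return (percentage per category, n, small_total flag).

    counts holds only the answers obtained, so missing data are
    excluded from the denominator n.
    """
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no observations")
    pct = {category: 100 * c / n for category, c in counts.items()}
    # Flag totals below 30, where percentages should be avoided.
    return pct, n, n < 30


# Hypothetical survey answers, for illustration only.
# "don't know" is a real category; the 2 non-responses are kept apart.
counts = {"yes": 32, "no": 40, "don't know": 8}
missing = 2

pct, n, small_total = percentages(counts)
for category, value in pct.items():
    print(f"{category:<12}{counts[category]:>4}{value:>8.1f}%")
print(f"n = {n} (missing: {missing})")
```

Printing n and the number of missing answers alongside the percentages follows the recommendation to always report the total together with the percentage.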
2. Proportions
A PROPORTION is a numerical expression that compares one part of the study units to the whole. A proportion can be expressed as a FRACTION or in DECIMALS.
Example 5: Out of a total of 55 patients attending a clinic on a specific day 22 are males and 33 are females. We may say that the proportion of males is 22/55 or 2/5, which is equivalent to 0.40. Note that when a proportion expressed in decimals is multiplied by 100, the value obtained is a percentage. In the example, 0.40 is equivalent to 40%.
3. Ratios
A RATIO is a numerical expression that indicates the relationship in quantity, amount or size between two or more parts. In Example 5 above the ratio of males to females is 22:33, or 2:3.
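The arithmetic of Example 5 can be checked with Python's standard library. The sketch below uses the figures from the example (22 males, 33 females); only the variable names are chosen for illustration.

```python
from fractions import Fraction
from math import gcd

males, females = 22, 33
total = males + females                      # 55 patients

# Proportion: one part compared to the whole.
proportion = Fraction(males, total)          # reduces 22/55 to 2/5
decimal = float(proportion)                  # 0.4
percentage = 100 * males / total             # 40.0

# Ratio: one part compared to another part.
divisor = gcd(males, females)                # 11
ratio = (males // divisor, females // divisor)

print(f"proportion of males: {proportion} = {decimal:.2f} = {percentage:.0f}%")
print(f"ratio of males to females: {ratio[0]}:{ratio[1]}")
```

Note the different denominators: the proportion divides by the whole group, while the ratio compares the two parts directly.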
4. Rates
A RATE is the quantity, amount or degree of a disease or event measured over a specified period of time. Commonly used rates in the health sector are:

Birth Rate = The number of live births per 1000 population over a period of one year
Death Rate = The number of deaths per 1000 population over a period of one year
Infant Mortality Rate (IMR) = The number of deaths of infants under one year of age per 1000 live births over a period of one year
Maternal Mortality Rate (MMR) = The number of maternal pregnancy-related deaths in one year per 100,000 total births in the same year
Incidence Rate = The number of new cases per population over a specific period of time (usually a year)
Prevalence Rate = The number of existing cases per population over a specific period of time (usually a year)
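All of these rates share the same form: a number of events divided by a denominator and multiplied by a base such as 1000 or 100,000. A minimal Python sketch follows; the district figures are hypothetical and the function name rate is an assumption made for this illustration.

```python
def rate(events, denominator, per=1000):
    """Number of events per `per` units of the denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return events * per / denominator


# Hypothetical figures for one district over one year, illustration only.
population = 60_000
live_births = 2_400
total_births = 2_500
infant_deaths = 120
maternal_deaths = 6

birth_rate = rate(live_births, population)                # per 1000 population
imr = rate(infant_deaths, live_births)                    # per 1000 live births
mmr = rate(maternal_deaths, total_births, per=100_000)    # per 100,000 total births

print(f"Birth rate: {birth_rate:.1f} per 1000 population")
print(f"IMR: {imr:.1f} per 1000 live births")
print(f"MMR: {mmr:.1f} per 100,000 total births")
```

The denominator changes from rate to rate (population, live births, total births), which is why it should always be stated with the result.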
IV. FIGURES

If your report contains many descriptive tables, it may be more readable if you present the most important ones in figures. The most frequently used figures for presenting data include:
- Bar charts (for categorical data)
- Pie charts (for categorical data)
- Histograms (for numerical data)
- Line graphs (for numerical data)
- Scatter diagrams (for numerical data)
- Maps

We will now look at examples of the above-mentioned figures that can be used for presenting data.
1. Bar chart
The data from Example 2 can be presented in a bar chart, using either absolute frequencies or relative frequencies/percentages (see Figure 22.1).

Figure 22.1: Relative frequency of shortage of anti-malaria drugs in rural health institutions (n=148)
Note that the sample size must be indicated if you present the data in percentages.
2. Pie charts
A pie chart can be used for the same set of data, providing the reader with a quick overview of the data presented in a different form. A pie chart illustrates the relative frequency of a number of items. All the segments of the pie chart should add up to 100%.

Figure 22.2: Relative frequency of shortage of anti-malaria drugs in rural health institutions (n=148)
3. Histograms
Numerical data are often presented in histograms, which are very similar to the bar charts used for categorical data. An important difference, however, is that in a histogram the bars are connected (as long as there is no gap between the data), whereas in a bar chart the bars are not connected, as the different categories are distinct entities. The data of Example 4 are presented as a histogram in Figure 22.3.

Figure 22.3: Percentage of clinics treating different numbers of malaria patients in one month (n=80)
4. Line graphs

A line graph is particularly useful for numerical data if you wish to show a trend over time. The data from Example 3 can be presented as a line graph as in Figure 22.4.

Figure 22.4: Daily number of malaria patients at the health centres in District X
It is easy to show two or more distributions in one graph, as long as the lines are easy to distinguish from one another. Thus it is possible to compare frequency distributions of different groups, e.g., the age distributions of males and females, or of cases and controls.
5. Scatter diagrams
Scatter diagrams are useful for showing information on two variables which are possibly related. The example of a scatter diagram given below is used in Module 31, where we are dealing with the concepts of association and correlation.

Figure 22.5: Weight of five-year-olds according to annual family income
Note: It is important that all figures presented in your research report have numbers, clear titles and clear labels (or keys).

In addition to the figures above, the use of maps may be considered to present information. For instance, the area where a study was carried out can be shown in a map. If the study explored the epidemiology of cholera, a map could be produced showing the geographical distribution of cholera cases, together with the distribution of protected water sources, thus illustrating that there is an association. If the study related to vaccination coverage, a map could be developed to indicate the clinic sites and the vaccination coverage among under-fives in each village, perhaps showing that home-clinic distance is an important factor associated with vaccination status.

V. MEASURES OF CENTRAL TENDENCY

Frequency distributions and histograms provide useful ways of looking at a set of observations of a variable. In many circumstances, it is essential to produce them to understand the patterns in the data. However, if one wants to further summarise a set of observations, it is often helpful to use a measure which can be expressed in a single number.
First of all, one would like to have a measure for the centre of the distribution. The three measures used for this purpose are the MEAN, the MEDIAN and the MODE.
1. Mean
The MEAN (or arithmetic mean) is also known as the AVERAGE. It is calculated by totalling the results of all the observations and dividing by the total number of observations. Note that the mean can only be calculated for numerical data.

Example 6: Measurement of the heights of 7 girls gave the following results: 141, 141, 143, 144, 145, 146, 155 cm (a total of 1015 cm for 7 measurements). The mean is thus 1015/7, which is 145 cm.
2. Median
The MEDIAN is the value that divides a distribution into two equal halves. The median is useful when some measurements are much bigger or much smaller than the rest. The mean of such data will be biased toward these extreme values. Thus the mean is not a good measure of the centre of the distribution in this case. The median is not influenced by extreme values. The median value, also called the central or halfway value, is obtained in the following way:
- List the observations in order of magnitude (from the lowest to the highest value or vice versa).
- Count the number of observations (n).
- If n is odd, the median is the value belonging to observation number (n + 1) / 2. If n is even, the median is the average of the two middle values.
Example 8: The weights of 7 pregnant women are 40, 41, 42, 43, 44, 47, 72 kg. The median value is the value belonging to observation number (7 + 1)/2, which is the fourth one: 43 kg. Note that the mean weight of this set of observations is 47 kg. This is an illustration of how the mean is affected by extreme values (in this case 72 kg) while the median is not. If the largest weight in this set of observations had been 51 kg instead of 72 kg, the median would still have been 43 kg, but the mean weight would have been 44 kg.
Note also that if there were 8 observations: 40, 41, 42, 43, 44, 47, 49 and 72 kg, the median would be 43.5 kg (the average of 43 and 44); the mean in this case would be 47.25 kg.
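The figures in Examples 6 and 8 can be reproduced with a short Python sketch. The function median_value is written out by hand for illustration, following the steps listed above; the standard library's statistics.median gives the same results.

```python
from statistics import mean, median


def median_value(observations):
    """Median following the listed steps: order, count, take the middle."""
    ordered = sorted(observations)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Observation number (n + 1) / 2, counting from 1.
        return ordered[middle]
    # Even n: average of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


heights = [141, 141, 143, 144, 145, 146, 155]       # Example 6, cm
weights = [40, 41, 42, 43, 44, 47, 72]              # Example 8, kg
weights_less_extreme = [40, 41, 42, 43, 44, 47, 51]
weights_eight = [40, 41, 42, 43, 44, 47, 49, 72]

print(mean(heights))                                # 145
for data in (weights, weights_less_extreme, weights_eight):
    print(f"mean {mean(data)}, median {median_value(data)}")
    assert median_value(data) == median(data)
```

Replacing 72 kg by 51 kg moves the mean from 47 to 44 kg but leaves the median at 43 kg, which is the sensitivity to extreme values that the example describes.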
3. Mode
The MODE is the most frequently occurring value in a set of observations. The mode is not very useful for numerical data that are continuous. It is most useful for numerical data that have been grouped. In Example 4 (number of patients treated for malaria at clinics) the mode is 0 to 19, as this outcome is recorded most frequently (25 times out of 80).

The mode can also be used for categorical data, whether they are nominal or ordinal. In Example 1 (method of family planning) the mode is "none". In Example 2 (number of clinics experiencing drug shortage) the mode is "rarely".

In summary, the mean, the median and the mode are all measures of central tendency. The mean is most widely used. It contains more information because the value of each observation is taken into account in its calculation. However, the mean is strongly affected by values far from the centre of the distribution, while the median and the mode are not. The calculation of the mean forms the beginning of more complex statistical procedures to describe and analyse data.

Figure 22.6 shows a distribution curve in which the mean, the median and the mode have different values.

Figure 22.6: Mean, median and mode in a distribution curve
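As a small illustration of the mode, the Python sketch below uses statistics.multimode, which returns every value that shares the highest frequency. The heights are those of Example 6; the list of family planning answers is hypothetical and is not the data of Example 1.

```python
from statistics import multimode

# Heights from Example 6 (cm): 141 occurs twice, all other values once.
heights = [141, 141, 143, 144, 145, 146, 155]
height_modes = multimode(heights)

# Hypothetical categorical answers, for illustration only.
methods = ["none", "condom", "none", "pill", "none", "condom", "none"]
method_modes = multimode(methods)

print(height_modes)   # [141]
print(method_modes)   # ['none']
```

The mode of the heights (141 cm) lies at the lower end of the observations rather than near their centre, which shows why the mode is of limited use for ungrouped numerical data, whereas for the categorical answers it simply names the most common category.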