Biostatistics Course
Understanding Statistics
Statistics are used in virtually all scientific disciplines such as the physical and social sciences as
well as in business, medicine, the humanities, government, and manufacturing. Statistics is a branch
of applied mathematics, drawing on calculus and linear algebra, that developed from the application of
mathematical tools to probability theory.
Its core idea is that we can learn about the properties of large sets of objects or events (a population) by
studying the characteristics of a smaller number of similar objects or events (a sample). Gathering
comprehensive data about an entire population is too costly, difficult, or impossible in many cases,
so statistical analysis starts with a sample that can be conveniently or affordably observed.
Statistics is the study of collecting data, analysing it, processing it, interpreting the results and
presenting them in such a way that the data can be understood by everyone. It is at once a science, a
method and a set of techniques. Data analysis is used to describe the phenomena studied, make
forecasts and take decisions about them. In this way, statistics is an essential tool for understanding
and managing complex phenomena.
Descriptive and Inferential Statistics
The two major areas of statistics are descriptive statistics and inferential statistics.
Descriptive statistics describes the properties of sample and population data. Inferential statistics
uses those properties to test hypotheses and draw conclusions.
Descriptive statistics include mean or average, variance, skewness, and kurtosis. Inferential statistics
include linear regression analysis, analysis of variance or ANOVA, logit/probit models, and null
hypothesis testing.
Descriptive statistics help describe and explain the features of a specific data set by giving short
summaries about the sample and measures of the data. The most recognized types of descriptive
statistics are measures of center. For example, the mean, median, and mode, which are used at
almost all levels of math and statistics, are used to define and describe a data set. The mean, or the
average, is calculated by adding all the figures within the data set and then dividing by the number
of figures within the set.
For example, the sum of the following data set is 20: (2, 3, 4, 5, 6). The mean is 4 (20/5). The mode
of a data set is the value appearing most often, and the median is the figure situated in the middle of
the data set. It is the figure separating the higher figures from the lower figures within a data set.
However, there are less common types of descriptive statistics that are still very important.
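The measures of center described above can be checked with a short sketch. It assumes Python's standard statistics module and reuses the data set (2, 3, 4, 5, 6) from the text; the variable names are illustrative only.

```python
from statistics import mean, median, multimode

# Data set taken from the example in the text.
data = [2, 3, 4, 5, 6]

# Mean: sum of the figures divided by the number of figures (20 / 5).
data_mean = mean(data)

# Median: the figure in the middle of the sorted data set.
data_median = median(data)

# Mode: the most frequent value(s). Every value appears once here,
# so multimode returns all of them as tied modes.
data_modes = multimode(data)

print("mean:", data_mean)
print("median:", data_median)
print("modes:", data_modes)
```

Because no value repeats in this data set, the mode is not informative here; it becomes useful once some values occur more often than others.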
People use descriptive statistics to repurpose hard-to-understand quantitative insights across a large
data set into bite-sized descriptions. A student's grade point average (GPA), for example, provides a
good understanding of descriptive statistics. The idea of a GPA is that it takes data points from a
range of individual course grades, and averages them together to provide a general understanding of
a student's overall academic performance. A student's personal GPA reflects their mean academic
performance.
Descriptive statistics are brief informational coefficients that summarize a given data set, which can
be either a representation of the entire population or a sample of a population. Descriptive statistics
are broken down into measures of central tendency and measures of variability (spread). Measures
of central tendency include the mean, median, and mode, while measures of variability
include standard deviation, variance, minimum and maximum values, kurtosis, and skewness.
Descriptive statistics focus mostly on the central tendency, variability, and distribution of sample
data. Central tendency refers to the estimate of the characteristics of a typical element of a sample or
population. It includes descriptive statistics such as mean, median, and mode.
Variability refers to a set of statistics that show how much difference there is among the elements of
a sample or population along the characteristics measured. It includes metrics such as
range, variance, and standard deviation.
The distribution refers to the overall “shape” of the data. This can be depicted on a chart such as a
histogram or a dot plot and includes properties such as the probability distribution function,
skewness, and kurtosis.
Descriptive statistics can also describe differences between observed characteristics of the elements
of a data set. They can help us understand the collective properties of the elements of a data sample
and form the basis for testing hypotheses and making predictions using inferential statistics.
Types of Descriptive Statistics
All descriptive statistics are either measures of central tendency or measures of variability, also
known as measures of dispersion.
Central Tendency
Measures of central tendency focus on the average or middle values of data sets, whereas measures
of variability focus on the dispersion of data. These two measures use graphs, tables, and general
discussions to help people understand the meaning of the analyzed data.
Measures of central tendency describe the center position of a distribution for a data set. A person
analyzes the frequency of each data point in the distribution and describes it using
the mean, median, or mode, which measures the most common patterns of the analyzed data set.
Measures of Variability
Measures of variability (or measures of spread) aid in analyzing how dispersed the distribution is for
a set of data. For example, while the measures of central tendency may give a person the average of
a data set, they do not describe how the data is distributed within the set. So while the average of the
data might be 65 out of 100, there can still be data points at both 1 and 100. Measures of variability
help communicate this by describing the shape and spread of the data set. Range, quartiles, absolute
deviation, and variance are all examples of measures of variability. Consider the following data set:
5, 19, 24, 62, 91, 100. The range of that data set is 95, which is calculated by subtracting the lowest
number (5) in the data set from the highest (100).
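As a minimal sketch, the range of the data set above can be computed alongside two other measures of variability. The use of Python's statistics module, and the choice of the population variance and of the default quartile method, are assumptions made for illustration.

```python
from statistics import pvariance, quantiles

# Data set taken from the example in the text.
data = [5, 19, 24, 62, 91, 100]

# Range: highest value minus lowest value (100 - 5).
data_range = max(data) - min(data)

# Quartile cut points, using the module's default ("exclusive") method.
q1, q2, q3 = quantiles(data, n=4)

# Population variance: mean squared deviation from the mean.
variance = pvariance(data)

print("range:", data_range)  # 95
print("quartiles:", q1, q2, q3)
print("variance:", variance)
```

Different textbooks and software packages compute quartiles slightly differently, so the quartile values may not match a hand calculation that follows another convention.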
Distribution
Distribution (or frequency distribution) refers to the number of times a data point occurs.
Alternatively, it can be how many times a data point fails to occur. Consider this data set: male,
male, female, female, female, other. The distribution of this data can be classified as:
The number of males in the data set is 2.
The number of females in the data set is 3.
The number of individuals identifying as other is 1.
The number of non-males is 4.
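The counts listed above can be reproduced with a short sketch. It assumes Python's collections.Counter; the relative frequencies are an added illustration and are not part of the original example.

```python
from collections import Counter

# Data set taken from the example in the text.
observations = ["male", "male", "female", "female", "female", "other"]

# Frequency distribution: number of times each value occurs.
counts = Counter(observations)

# Number of times "male" fails to occur.
non_males = len(observations) - counts["male"]

# Relative frequency: count divided by the total number of observations.
relative = {value: count / len(observations) for value, count in counts.items()}

for value, count in counts.items():
    print(value, count, round(relative[value], 3))
print("non-males:", non_males)
```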
Univariate vs. Bivariate
In descriptive statistics, univariate data analyzes only one variable. It is used to identify
characteristics of a single trait and is not used to analyze any relationships or causations.
For example, imagine a room full of high school students. Say you wanted to gather the average age
of the individuals in the room. This univariate data is only dependent on one factor: each person's
age. By gathering this one piece of information from each person, adding the ages together, and dividing by the
total number of people, you can determine the average age.
Bivariate data, on the other hand, attempts to link two variables by searching for correlation. Two
types of data are collected, and the relationship between the two pieces of information is analyzed
together. Because multiple variables are analyzed, this approach may also be referred to
as multivariate.
Let's say each high school student in the example above takes a college assessment test, and we
want to see whether older students are testing better than younger students. In addition to gathering
the ages of the students, we need to find out each student's test score. Then, using data analytics, we
mathematically or graphically depict whether there is a relationship between student age and test
scores.
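One common way to quantify such a relationship is the Pearson correlation coefficient. The sketch below is only an illustration: the ages and scores are invented, and the helper name pearson_r is hypothetical rather than taken from the course.

```python
from math import sqrt

# Hypothetical ages and test scores, invented purely for illustration.
ages = [15, 16, 16, 17, 18]
scores = [62, 70, 65, 74, 80]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(ages, scores)
print("correlation between age and score:", round(r, 3))
```

A value of r close to +1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or no linear relationship. Correlation alone does not show that age causes the difference in scores.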
Inferential Statistics
Inferential statistics is a tool used by statisticians to draw conclusions about the characteristics of a
population from the characteristics of a sample. It's also used to determine how certain
the statistician can be of the reliability of those conclusions. Based on sample size and distribution,
statisticians can calculate the probability that statistics will provide an accurate picture of the
corresponding parameters of the whole population from which the sample is drawn.
Inferential statistics are used to make generalizations about large groups such as estimating average
demand for a product by surveying the buying habits of a sample of consumers or attempting to
predict future events. This might mean projecting the future return of a security or an asset class
based on returns in a sample period.
Regression analysis is a widely used technique of statistical inference. It's used to determine the
strength and nature of the relationship between a dependent variable and one or more explanatory or
independent variables. The output of a regression model is often analyzed for statistical significance,
which means that a result generated by testing or experimentation isn't likely to have occurred
randomly or by chance.
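As a minimal sketch of regression with one explanatory variable, the line of best fit can be obtained by ordinary least squares. The data and the helper name fit_line are hypothetical, and the sketch only estimates the slope and intercept; it does not test statistical significance.

```python
# Hypothetical data, invented for illustration:
# hours of study (explanatory variable) and exam marks (dependent variable).
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 68]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(hours, marks)
print("estimated marks =", round(intercept, 2), "+", round(slope, 2), "* hours")
```

The slope describes the nature of the relationship: it estimates how much the dependent variable changes, on average, for each one-unit increase in the explanatory variable.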
Statistical significance suggests that the results are attributable to a specific cause explained by the
data.
The data studied can be of any kind, which makes statistics useful in all disciplinary fields and
explains why it is taught in all university courses, from economics to biology, psychology and of
course engineering sciences.
Statistics involves:
- Gathering data.
- Presenting and summarising the data.
- Drawing conclusions about the population studied and helping to make decisions.
Parametric and Nonparametric Statistics
Inferential statistics can be classified as either parametric or nonparametric. Nonparametric statistics
are most commonly used for variables at the nominal or ordinal level of measurement, which
basically means that they are used for variables that do not have a normal distribution. Statistical
significance is calculated using information contained only in the sample (rather than the population)
and may use measures of central tendency appropriate for nominal or ordinal level data (i.e., the
median rather than the mean).
Parametric statistics are the most common approach to inferential statistical analysis. They require
that the variables be measured at the interval or ratio level. Use of parametric statistics also relies on
other assumptions, such as the expectation that values for a given variable will be normally
distributed in the population.
Inferential statistics encompass a variety of statistical significance tests that investigators can use to
make inferences from their sample data. These tests can be divided into three basic categories
depending on their intended purpose: evaluating differences, examining relationships, and making
predictions. The decision of which procedure to use is determined, in part, by the investigator’s
research question or research design.
Section 2
Introduction
Many people are familiar with the term statistics. It denotes recording of numerical facts and figures,
for example, the daily prices of selected stocks on a stock exchange, the annual employment and
unemployment of a country, the daily rainfall in the monsoon season, etc. However, statistics deals
with situations in which the occurrence of some events cannot be predicted with certainty. It also
provides methods for organizing and summarizing facts and for using information to draw various
conclusions. Historically, the word statistics is derived from the Latin word status meaning state. For
several decades, statistics was associated solely with the display of facts and figures pertaining to
economic, demographic, and political situations prevailing in a country. As a subject, statistics now
encompasses concepts and methods that are of far-reaching importance in all enquiries that
involve planning or designing an experiment, gathering data by a process of experimentation or
observation, and finally making inferences or drawing conclusions by analyzing such data, which
eventually helps in making future decisions. Fact finding through the collection of data is not confined to
professional researchers. It is a part of the everyday life of all people who strive, consciously or
unconsciously, to know matters of interest concerning society, living conditions, the environment,
and the world at large. Sources of factual information range from individual experience to reports in
the news media, government records, and articles published in professional journals. Weather
forecasts, market reports, cost-of-living indexes, and the results of public opinion polls are some other
examples. Statistical methods are employed extensively in the production of such reports. Reports
that are based on sound statistical reasoning and careful interpretation of conclusions are truly
informative. However, the deliberate or inadvertent misuse of statistics leads to erroneous
conclusions and distortions of truths.
In this unit, we shall talk about the basics of statistics. We shall define the terms which we shall be
using again and again throughout this course. It is possible that you have read all this before. But that
might have been some years ago. So a quick look through this unit will help you to recall the relevant
facts. In case you have never been introduced to statistics before, this unit will gradually acquaint you
with its basic concepts. You will find that most of the terms we use in statistics are part of our daily
vocabulary. But we have to know their precise meaning before we use them in statistics.
On reading this unit, you should be able to:
- distinguish between a qualitative and a quantitative character;
- differentiate between a discrete and a continuous variable;
- draw up a frequency table and get the relative frequencies, cumulative frequencies and frequency
densities;
- decide upon a suitable mode of representing a frequency distribution diagrammatically.
Objectives of descriptive (or exploratory) statistics:
- To summarise and synthesise the information contained in the statistical series and highlight its
properties.
- To suggest hypotheses about the population from which the sample is drawn.
For close-ended questions, there are four types of response options:
● You can offer two options as the possible responses; these are known as dichotomous scales.
● If you offer more than two unordered options to the respondents, the scale is known as nominal-polytomous.
● In ordinal-polytomous scales, you prepare more than two options, which are also ordered.
● Finally, you can use continuous or bounded types, which offer a continuous scale as the possible
response.
B) The Mode of Administration
Questionnaires can be implemented in different ways. A face-to-face mode can be used,
which provides the chance of presenting the questions orally; paper-and-pencil types can be utilized,
with the items presented on paper; or computerized questionnaires can be used for data collection (Kabir,
2016). Questionnaires can also be administered by telephone, online, or even by post. An online
questionnaire is a cost-efficient option; however, you should consider the possibility of missing
samples due to problems with internet access. For online questionnaires, different online survey services can be
used which provide questionnaires for the purpose of the study, and the collected data can then be easily
imported into the analysis software. In all these choices, it is important to address ethical concerns such as
the confidentiality of participants. On the other hand, participants should try to answer the questions
politely and clearly.
C) General Rules for Constructing a Questionnaire
● Use simple and short questions as much as possible;
● Guide respondents clearly to avoid any difficulty and motivate participants to
answer the questions;
● Use understandable, simple, and clear statements for all respondents with different educational
levels;
● Utilize positive sentences;
● Do not use more than one question (double-barreled) in one item;
● Add an open-answer possibility after providing the listed answers and where possible;
● Avoid making assumptions for the respondents;
● Try to increase reliability by appropriate word selection;
● Avoid directing the respondent to any answer by using objective questions (without clues,
suggestions, and hints).
There are also specific challenges and concerns that may be faced when designing an
appropriate questionnaire. First, the maximum response rate should be secured, together
with as much reliability and validity as possible. The response rate can be
maximized when you:
● Can convince respondents that you will secure their information and are on their side;
● Can reward their cooperation.
On the other hand, you can gain an accurate data set by considering two points:
● Prepare a suitable set of questions;
● Select an appropriate sample size and type, which can avoid biases and unanswered questions.
D) Advantages of Questionnaires
Questionnaires provide several merits in comparison to other survey methods as listed in the
following:
● Collecting a large amount of data from a large sample size;
● Time saver;
● Cost-effective options;
● Highly structured;
● The possibility of gaining highly accurate data;
● The possibility of being carried out by other people instead of the researcher without
affecting reliability and validity, and the possibility of group administration;
● Analyzing the results easily by entering the collected data into the software quickly in the majority of
cases;
● The opportunity of more objective and scientific analysis;
● The achieved quantitative data can be used to compare and contrast the results of the study with
others to measure the changes;
● The possibility of achieving comprehensive design and tests, and administrating the research with
required details;
● Creating novel theories and/or testing an existing hypothesis using the collected quantitative data;
● Suitable in a wide range of study fields;
● Suitable and reliable in special cases.
E) Disadvantages of Questionnaires
However, there are also several demerits that are not negligible. Researchers may face several
difficulties when using questionnaires, such as the following:
● Difficulty in capturing some kinds of data, such as emotions, feelings, and
behavioral changes;
● Human errors, for example if the respondent is forgetful and cannot consider the whole concept
accurately;
● Determining the reliability of answers is not possible;
● The possibility of misunderstanding the questions which can overshadow the answers;
● The effects of differences in human beliefs on their answers in some cases since even a standard
subject can be considered good for one group and bad for others (Kabir, 2016);
● Facing difficulties when participants need clarifications for particular questions in impersonal
administrations and the possibility of failing to answer those questions (Taherdoost, 2021);
● Low response rates if respondents' lack of interest in answering the questions cannot be addressed
(Frechtling, 2002);
● The possibility of illegible answers;
● Useless and wrong answers are prevalent (Pandey & Pandey, 2015).
1.2.3 Verification
The results obtained are used to draw conclusions from the starting data.
1.2.4 Interpretation
The statistician presents the significance of the results obtained and proposes solutions with
associated risk assessments, to help the user choose between the different decisions.
1.3 Statistical vocabulary
1.3.1. Observations
These are the target data relating to a phenomenon in the course of an investigation, an experiment,
etc. They must be grouped together, corrected and ordered.
1.3.2. Interviews
In interviews, a fundamental form of social interaction, questions are asked and data is collected
from the answers provided, in contrast to the questionnaire, where data is collected indirectly.
Thus, there is also a chance of getting confidential data from interviewees;
however, interviewing requires special skills which are not necessary for questionnaires. Researchers can
employ different methods to conduct an interview: individual or group face-to-face
interviews, as well as remote interviews, for example by telephone or computer.
1.3.3. Observational Methods
In these techniques, first-hand data is gathered through the observation of events, behaviors,
interactions, processes, etc. directly to obtain an understanding of the concepts. For example,
observation is an appropriate technique to evaluate teaching methods in the classes. It can be used
when focus groups and interviews cannot help to gather data for various reasons, including
cases in which participants:
● Are not aware of the concept;
● Are not able to talk about the concept;
● Do not prefer to discuss the concept.
It can also be utilized to explore whether a study is progressing as planned, or whether the study has
been successful or not. In the evaluation of studies, these two phases are known as formative and
summative. It can also be helpful when the concept is unexplored or not well known. If it is
required to explore a subject in its natural setting, and reported information may differ from
the findings in the real setting, an observational technique should be used.
This method can collect both qualitative and quantitative data. The qualitative data is gathered as a
description of events in the setting. The quantitative data can be obtained by using the duration or
frequency of the particular subjects. During this kind of systematic observation, formal and structured
instruments and protocols with nominal, ordinal, ratio, and interval scales are utilized. Thus, template
coding sheets with specific guides can be used to record the findings if the observer is not the main
researcher. On the other hand, data achieved through this method can be used in conjunction with the
quantitative findings of other methods.
Generally, observation helps the researcher to find out what is going on in the surrounding environment;
however, as a data collection method, it goes further than just listening and looking. This method
includes an engagement with the setting, a clear expression of the events, technical improvisations,
high attention, and good recording.
A) Advantages and Disadvantages
The observational method also has several pros and cons. In this section, the most important
ones are listed. The advantages are as follows:
● Gathering direct information;
● The participation of evaluators in the natural setting;
● Flexible and natural atmosphere;
● Free from biases;
● Can be generalized as large samples can be covered in the studies;
● Highly reliable and precise data can be achieved.
These techniques also present some difficulties:
● They can be time-consuming and not economical;
● The results depend on how well the observers are trained;
● Observers can be selective and distort data;
● They can sometimes be unreliable due to misrepresentation in the measurement of qualitative data;
● They do not consider processes and the changes during them, and may not be appropriate for fresh
concepts.
B) Special Notes
After conducting an observation, the researcher should first analyze the collected data. For this purpose,
data is summarized in a process known as data reduction, and it is coded into
specific categories based on particular criteria. The reliability of the data, judged by the agreement
between independent observers, should also be considered to show how accurately the behaviors are measured. It should be noted that
the participants can act differently when they are in the research setting. These acts should be
controlled using techniques of controlling reactivity such as indirect observation, the adaptation of
participants, and unobtrusive measurements.
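One simple way to express the agreement between two independent observers is the proportion of observations they coded identically. The text does not name a specific agreement measure, so percent agreement is an assumption here, and the behavior codes and the helper name percent_agreement are hypothetical.

```python
# Hypothetical codes assigned by two independent observers to the same
# five observation intervals (invented for illustration).
observer_a = ["on-task", "off-task", "on-task", "on-task", "off-task"]
observer_b = ["on-task", "on-task", "on-task", "on-task", "off-task"]


def percent_agreement(codes_a, codes_b):
    """Proportion of observations that both observers coded identically."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both observers must code the same observations")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


agreement = percent_agreement(observer_a, observer_b)
print("agreement:", agreement)  # 4 matching codes out of 5
```

Percent agreement does not correct for agreement that would occur by chance, so it should be read as a rough first indicator only.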
Observer bias is another important point to consider: an observer's bias
can affect which behaviors are chosen and recorded; however, it can be minimized by keeping observers
unaware of the aims of the study.
1.3.4 Survey Methods
A survey is simply an appropriate method to determine feelings, opinions, and thoughts. The aim of
a survey can be either general or specific. Surveys can provide a large volume of data using
telephone calls, emails, or face-to-face interviews.
On the other hand, data can be collected in self-completion surveys or by the interviewer. A survey
can be used to explore social behaviors such as measuring the behavior of political candidates and
professional people in educational institutions. However, it is not useful when evaluating people for
government programs since in these programs, all members of the population should be studied.
Overall, in both formative and summative phases of a study, surveys are useful when it is required to
collect information from a large target population, and detailed and in-depth data are not necessary in
the project.
In a survey, a set of questions is given to a sample that is chosen from a specific target
population. This sample represents the characteristics and behaviors of the population. Surveys are
conducted to explore the population's attitudes and the differences between different populations'
behavior, and to discover possible changes over time by repeating surveys at regular time intervals.
Thus, sample selection is an important stage in this process, which can highly affect the findings.
Samples should be chosen so that every member of the population has a non-zero
chance of being selected. Therefore, samples need to be chosen using a non-volunteer and non-haphazard
selection technique.
The steps of the sampling process can be listed as follows:
● Defining the target population, such as the individuals living in a country;
● Selecting a sampling frame, that is, the actual cases from which the sample is selected;
● Choosing the method of sampling, which can be either a random or a non-random technique;
● Determining the appropriate sample size, using the related formula, to avoid biases and sampling
errors.
Biases of participants can commonly happen, especially when they need to answer about sensitive
subjects, or when the individuals need to trust the team before giving the right answer.
Questions can be designed in different ways, as follows:
● Open-ended questions, in which participants answer questions in their own ways;
● Close-ended questions, mostly based on yes/no or true/false answers;
● Multiple-choice questions, which let participants choose their preferred option.
Here, as discussed for questionnaires, the questions should be written considering several aspects
ranging from their language to their length and their order of presentation. For example, sensitive
questions should be placed among the final questions. A cover letter and an introduction should also be
provided, as discussed for the other types.
1.4. Statistical series
A set of measurements of one or more variables made on a population or sample of individuals.
1.5. Population
This is the set of elements on which the statistical study will be carried out, denoted Ω.
A population is also a set of homogeneous elements (with the same characteristics) in which we are
interested and on which our statistical study is based.
In statistics, we work with populations. The term comes from the fact that demography, the study of
human populations, played a central role in the early days of statistics, particularly through
population censuses. However, in statistics, the term population is applied to any statistical object
under study, whether students (at a university or in a country), households or any other group on
which statistical observations are made. This is how we define the notion of population.
For example: the students in a section, the bacteria in a petri dish, etc.
1.6. The sample
In statistics, a sample is a set of individuals representative of a population, drawn randomly and
exhaustively.
Exhaustive drawing of an individual: drawing without replacement. The individual is not returned
to the population after being drawn.
Non-exhaustive drawing of an individual: drawing with replacement. The individual is returned to the
population after being drawn, so an individual can be selected several times.
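The two kinds of drawing can be sketched with Python's standard random module. The population of 20 labelled individuals, the sample size of 5, and the fixed seed are all assumptions chosen for illustration.

```python
import random

# Hypothetical population of N = 20 individuals, labelled 1 to 20.
population = list(range(1, 21))

# Fixed seed (an arbitrary choice) so that the sketch is repeatable.
rng = random.Random(42)

# Exhaustive drawing (without replacement): no individual appears twice.
without_replacement = rng.sample(population, k=5)

# Non-exhaustive drawing (with replacement): an individual may repeat.
with_replacement = rng.choices(population, k=5)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```

Here random.sample never returns the same individual twice, while random.choices draws each individual independently, so repeats are possible.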
1.7. Statistical units
A population is made up of individuals. The individuals that make up this statistical population are
called statistical units.
The elements that make up a sample are also called statistical units.
1.8. Size of a population (or sample)
Represents the number of individuals in a sample or population. It is symbolised by n in the case of a
sample and by N in the case of a population.
1.9. Character (statistical variable)
Characters fall into two broad categories.
There are certain characters which take varying forms for different individuals but cannot be
expressed numerically. The brand name of motor cars plying in an Indian city is such a character; it
may be Ambassador Contessa, Premier Padmini Deluxe, Standard Herald Gazelle, Maruti 1000 or
other. The employees in a city hospital may be observed for their smoking habits; any given
employee will then be recorded as a smoker or a non-smoker. Such a character, whose possible forms
can be distinguished verbally, but not numerically, is called a qualitative character (or attribute). On
the other hand, we can express characters like the size of families, age of teachers, height of students,
weight of eggs, etc., in numerical or quantitative terms. The size of a family (i.e., the number of
members in the family) will be a positive integer: 1, 2, 3, etc. The age of a teacher may be given in
years or in years and months. The height of a student may be given in centimetres and may be
rounded off to the nearest centimetre. The weight of an egg may be recorded in grams and again may
be rounded off to the nearest tenth of a gram. Such characters are called quantitative characters (or
variables). A qualitative character, too, ultimately yields numerical data. This is because we will
finally note how many of the individuals under study have any given form of the character. This is
the particular aspect that we wish to study.
A statistical variable or characteristic: Observations concerning a particular theme have been made
on these individuals.
We have classified characters into two categories: qualitative and quantitative. Now quantitative
characters or variables, in their turn, may be classified as discrete and continuous. A discrete variable
is one that can conceivably assume only some discrete, or isolated, values. The size of families, the
proportion or the number of males in each group of 25 students, or the length of a word are variables
of this type. The size of a family or the length of a word may take values like 1, 2, 3, etc., but no
values in between. The number of males in a group of 25 students may be 0, 1, 2, ..., 24 or 25, while
the proportion of males may be 0, 0.04, 0.08, ..., 0.96 or 1; values in between these numbers are
inconceivable. A continuous variable, on the other hand, can possibly take any value in some
interval. For example, the age (in years) of teachers, the height (in cm.) of students, the weight (in
grams) of eggs are all continuous variables. Supposing the minimum age at which a person can join
the teaching profession is α years and that every member of the teaching community has to retire on
reaching the age β years, then the age of teachers must vary between α and β and can take any value
within the interval [α, β]. Indeed, the actual age of a teacher may well be 32.119237 years! However,
there will be hardly any need to record the age with this much precision! The enquirer may be
satisfied by taking the age correct to the second decimal place so that the teacher's age may be
recorded as 32.12 years. This is an example of how limitations of the measuring instruments can
introduce a discreteness into the observations of a continuous variable. Similarly, the actual monthly
income of an Indian, which is a continuous variable, has to be expressed in rupees or in rupees and
paise, since the paisa happens to be the smallest denomination coin in the Indian system of currency.
This is also the case with the score in an examination of students taking the examination. The score is
invariably expressed in integers and yet it has to be regarded as a continuous variable. This is because
the score is supposed to measure the proficiency of the students in the subject concerned, and the
proficiency may be taken to vary in a continuous manner (say, between 0 and 100).
The distinction between a discrete and a continuous variable is important. Quite often, the statistical
analysis of the data will differ accordingly. In fact, there are some techniques of statistical infer-,
which are based on the assumption that the variable under study is continuous. These are dearly
inapplicable to data on a discrete variable. - In the next section, we shall discuss the concept of
frequency distributions of qualiiative characters and variables.
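The two ideas above (a discrete variable takes only isolated values, and a continuous variable becomes discrete in practice once it is recorded to a fixed precision) can be illustrated with a minimal Python sketch. The variable names are ours and are only illustrative.

```python
# Possible proportions of males in a group of 25 students: k/25 for k = 0, 1, ..., 25.
group_size = 25
proportions = [k / group_size for k in range(group_size + 1)]

# A continuous quantity recorded to a limited precision becomes discrete in practice:
# the actual age is recorded correct to the second decimal place.
actual_age = 32.119237
recorded_age = round(actual_age, 2)

print(proportions[:3], "...", proportions[-2:])
print(recorded_age)
```

Only 26 distinct proportions are possible, whereas the age could in principle take any value in its interval before it is rounded.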
The series of observations forms what is known as a statistical variable.
For example: students' marks in the Statistics exam, the grades they obtained in their A-levels, their
sex, the colour of their eyes, the turnover per SME, the number of children per household, grouping,
etc.
For example: in the case of a group of people, we may be interested in their age, sex, height,
1.10. Modalities
These are the different possible situations of the characteristic.
Example: Sex is a characteristic with two states: female or male.
Example: As for the number of children per family, the states of this characteristic can be 0, 1, 2, ...,
10, ....
Note: The states of a characteristic must be incompatible and exhaustive; every individual must have
one and only one state.
Note: It is customary to distinguish between two types of characteristic.
1.10.1. Qualitative characteristic
Its modalities are not expressed by a number.
Example: Coat colour, blood groups, the different nucleotides in DNA, ...
1.10.2. Quantitative characteristic
Its modalities are numerical.
Example: The number of cells in a culture, the blood sugar level, the number of white or red blood
cells, ...
1.11. Statistical variable (SV)
1.11.1. Discrete statistical variable
X is said to be discrete if its set of possible values E = {x1, ..., xn} is a finite, or countably
infinite, set of isolated values, usually integers.
For example: the number of houses per neighbourhood in a town, the number of children per
household can only be 0, or 1, or 2, or 3, ...
1.11.2. Continuous statistical variable
X is said to be continuous if E = [a0, a1[ ∪ ... ∪ [an-1, an[, where ai ∈ ℝ for all i = 0, ..., n.
For example, the weight of students in a section, the height of students in a school, laboratory tests
(glucose levels, cholesterol levels, ...).
1.11.3. Qualitative variable
A variable is qualitative when its modalities are not measurable and the values it takes are
designated by names or by a code.
The modalities of a qualitative variable cannot be compared numerically; in the nominal case they
cannot even be ordered, i.e. if we consider two modalities taken at random, we cannot say that one is
less than or equal to the other.
For example, the modalities of the variable Sex are: Male and Female; the modalities of the variable
Eye Colour are: Blue, Brown, Black and Green; the modalities of the variable Baccalaureate grade
(Mention au Bac) are: TB, B, AB and P.
There are two types of qualitative variables:
- Ordinal qualitative variables
- Nominal qualitative variables.
More precisely, a qualitative variable is said to be ordinal when its modalities can be classified in a
certain natural order (this is the case, for example, with the variable Mention au Bac);
a qualitative variable is said to be nominal when its modalities cannot be classified in a natural way
(this is the case, for example, with the variable Eye Colour or the variable Sex).
1.12 Numbers and frequencies, cumulative numbers and frequencies
1.12.1 Ungrouped Frequency Distributions
We use ungrouped frequency distributions when the data are of a qualitative nature, or
when the variable under consideration is discrete. Here, we will take one example of each situation
for illustration.
Frequency Distribution of a Qualitative Character
A botanist obtained a variety of linseed by cross-breeding two pure varieties. She observed the
colour of the flowers of plants grown through inbreeding of the new mixed type (called plants of the F2
generation). On the basis of these observations, she prepared the following table.
Table 1: Classification of flowers in an F2 population of linseed by colour
Colour    Number of flowers (frequency)    Relative frequency
Blue      169                              0.538
Lilac     61                               0.194
White     62                               0.197
Pink      22                               0.070
Total     314                              0.999
The figures in the second column of Table 1 are called the frequencies of the four classes (or of the
four colours). So 'frequency' indicates how frequently the corresponding form of the character under
study (viz., colour) occurs in the collected data. The sum of the frequencies, 314 in this case, is said
to be the total frequency. The first two columns in Table 1 constitute a frequency table. Since these
indicate the manner in which the total frequency 314 (or the total number of individuals) is
distributed among the four classes, they are also said to represent the frequency distribution of colour
for the 314 flowers. Perhaps a better expression is 'the frequency distribution of the 314 flowers by
colour'. Alternatively, we can also write the frequency distribution in terms of the proportions of
blue, lilac, white and pink flowers in the group. These proportions give the relative frequencies, and
are shown in the third column of Table 1. By definition,
relative frequency of a class = (frequency of the class) / (total frequency) ... (1)
Then what is the total relative frequency? One, of course. But you can see that in Table 1, the relative
frequencies do not add up exactly to 1. This is because the individual figures are all approximate,
rounded off to a certain number of decimal places.
Note that while the distribution of frequencies answers questions of the type 'How many flowers in the
given group are blue?', the relative frequency has to do with questions like 'What is the proportion (or
percentage) of blue flowers in the group?' Further, in any situation, a frequency must be a non-negative
integer. The value 0 is admissible, for in the above situation it is conceivable that we might have a
fifth flower colour, say yellow, which was absent in the sample. A relative frequency, on the other
hand, must be a rational number in the interval [0,1].
The simplest type of classification of a group of individuals by a qualitative character is a dichotomy,
i.e., a classification with just two classes. A group of students may thus be classified by sex as boys
and girls, or by performance at an examination as successful and unsuccessful.
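As an illustration, the relative frequencies of Table 1 can be recomputed from the frequencies with a short Python sketch. The dictionary and variable names are ours. Rounding each figure to three decimal places is an assumption that reproduces the precision of the table.

```python
# Frequencies from Table 1 (flower colour in the F2 population of linseed).
frequencies = {"Blue": 169, "Lilac": 61, "White": 62, "Pink": 22}

total_frequency = sum(frequencies.values())

# Relative frequency of a class = frequency of the class / total frequency.
relative = {colour: n / total_frequency for colour, n in frequencies.items()}

# Rounded to three decimal places, as in the table (assumed precision).
rounded = {colour: round(f, 3) for colour, f in relative.items()}

for colour, n in frequencies.items():
    print(f"{colour:<6} {n:>4} {rounded[colour]:.3f}")
print(f"{'Total':<6} {total_frequency:>4} {sum(rounded.values()):.3f}")
```

The exact relative frequencies add up to 1, while the rounded ones add up to slightly less, which is the effect of rounding described in the text.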
1.12.2 Headcount, cumulative headcount
The headcount of a class (or of a value) designates the number of individuals associated with this
class (or with this value).
If, in a statistical series, the values of a characteristic can be ordered, the cumulative number of
individuals for the value x is the sum of the numbers of all the values less than or equal to x.
This is an increasing cumulative number, but a decreasing cumulative number could also be defined
by taking the sum of the numbers of all the values greater than or equal to x. The number of
individuals (denoted ni) having the value xi of the characteristic is called the headcount of xi.
1.12.3. Relative frequencies, cumulative frequencies
A relative frequency is the proportion (often expressed as a percentage) that the headcount of a value
represents in relation to the total headcount.
A cumulative relative frequency is the proportion (or percentage) that the cumulative headcount of a
value represents in relation to the total headcount.
The number fr = ni / N, where N is the total headcount, is called the relative frequency of the value xi.
Note: 0 ≤ fr ≤ 1.
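The definitions of headcount, cumulative headcount, relative frequency and cumulative relative frequency can be sketched in Python as follows. The data are hypothetical (a made-up count of children in 20 households, not taken from the course) and the variable names are ours.

```python
from itertools import accumulate

# Hypothetical data (NOT from the course): number of children in 20 households.
values = [0, 1, 2, 3, 4]        # ordered values xi
headcounts = [3, 5, 7, 4, 1]    # headcounts ni

N = sum(headcounts)             # total headcount

# Increasing cumulative headcount: individuals with a value <= xi.
cum_increasing = list(accumulate(headcounts))

# Decreasing cumulative headcount: individuals with a value >= xi.
cum_decreasing = list(accumulate(reversed(headcounts)))[::-1]

# Relative frequency ni / N and cumulative relative frequency.
rel_freq = [n / N for n in headcounts]
cum_rel_freq = [c / N for c in cum_increasing]

for x, n, ci, cd, f, cf in zip(values, headcounts, cum_increasing,
                               cum_decreasing, rel_freq, cum_rel_freq):
    print(f"x={x}  n={n}  cum_up={ci:>2}  cum_down={cd:>2}  f={f:.2f}  F={cf:.2f}")
```

The last increasing cumulative headcount equals N, so the last cumulative relative frequency equals 1.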
2. Statistical test
The aim of descriptive statistics is to study the characteristics of a set of observations, such as the
measurements obtained in an experiment. The experiment is the preliminary stage in any statistical
study: it is where we first make ‘contact’ with the observations. Generally speaking, the statistical
method is based on the following concept: a statistical test is an experiment that is deliberately provoked.
2.1. Graphical representation of a variable
2.1.1 Table presentation
When the first data on a given phenomenon are gathered, it is difficult to take advantage of them in raw
form, which is why we try to present them in the form of a table and then a graph. The steps to follow to
draw up a table are:
* Calculate the range e = Xmax - Xmin of the statistical distribution.
* Choose the number of classes k.
Class constitution rule: The number of classes should be no less than 5 and no more than 20 (it
generally varies between 6 and 15).
This choice depends on the number of observations n and their dispersion. In practice,
the Sturges formula can be used: k = 1 + 3.32 log10(n)
or the Yule formula: k = 2.5 × n^(1/4) (2.5 times the fourth root of n)
or
k = √n.
* Calculate the class length a = e/k.
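The steps above can be sketched in Python using the 15 Statistics exam scores from the data table of this chapter. Rounding k to the nearest integer and using the Sturges value to compute the class length are assumptions made for this illustration. With so few observations the three formulas give values at or below the recommended minimum of 5 classes.

```python
import math

# Statistics exam scores of the 15 students in this chapter's data table.
scores = [18, 14, 10, 12, 15, 8.75, 10, 17.5, 13.75, 9, 18, 14, 14.75, 12, 11]

n = len(scores)
e = max(scores) - min(scores)          # range e = Xmax - Xmin

k_sturges = 1 + 3.32 * math.log10(n)   # Sturges formula
k_yule = 2.5 * n ** 0.25               # Yule formula
k_sqrt = math.sqrt(n)                  # square-root rule

k = round(k_sturges)                   # assumption: round to the nearest integer
a = e / k                              # class length a = e / k

print(f"n = {n}, range e = {e}")
print(f"Sturges: {k_sturges:.2f}, Yule: {k_yule:.2f}, sqrt: {k_sqrt:.2f}")
print(f"k = {k}, class length a = {a:.2f}")
```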
For example:
For a group of 15 students, we observed the values of the variables: Eye Colour, Sex, Baccalaureate
grade and Statistics exam score, and obtained the following data table. This data will be used
frequently in this chapter.
Data Table
Individual   Eye colour   Sex      Baccalaureate grade   Statistics exam score
X1           green        female   TB (very good)        18
X2           black        female   B (good)              14
X3           blue         male     P (passable)          10
X4           black        male     AB (quite good)       12
X5           black        male     B (good)              15
X6           green        female   P (passable)          8.75
X7           black        female   AB (quite good)       10
X8           black        female   TB (very good)        17.5
X9           black        male     B (good)              13.75
X10          brown        male     P (passable)          9
X11          brown        male     TB (very good)        18
X12          black        male     B (good)              14
X13          black        male     B (good)              14.75
X14          black        female   AB (quite good)       12
X15          green        female   P (passable)          11
Note
Generally speaking, an individual belongs to one and only one modality of a qualitative variable.
Very often, among the categories of a qualitative variable, there is an Other category (for
non-respondents, missing values and the like) in which we place the individuals that we are
unable to fit into another category of this variable.
Let's look at the example of the Eye Colour variable.
We start by counting the number of individuals belonging to each of the modalities of this variable:
nblue = 1 individual has blue eyes, nbrown = 2 have brown eyes, nblack = 9 have black eyes and
ngreen = 3 have green eyes; all this can be summarised in the following summary table:
Eye colour   Headcount ni
blue         1
brown        2
black        9
green        3
Total        15
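The counting step can be checked directly against the data table with a small Python sketch. The list below copies the Eye colour column for individuals X1 to X15. The variable names are ours.

```python
from collections import Counter

# Eye colour of individuals X1 to X15, in the order of the data table.
eye_colour = ["green", "black", "blue", "black", "black", "green", "black",
              "black", "black", "brown", "brown", "black", "black", "black",
              "green"]

counts = Counter(eye_colour)   # headcount ni of each modality
N = sum(counts.values())       # total headcount

for modality, n_i in counts.most_common():
    print(f"{modality:<6} n = {n_i:>2}  f = {n_i / N:.3f}")
print(f"{'Total':<6} n = {N:>2}")
```

Each individual is counted in exactly one modality, so the headcounts add up to the 15 students.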
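Sector diagram
Each modality is represented by a sector of a circle whose angle (in degrees) is proportional to its
headcount: di = ni × 360 / N
The band diagram (organ pipes)
The categories are placed on the x-axis in an arbitrary order. Above each category we draw a rectangle
whose height is proportional to the headcount, or frequency, of that modality.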
Example
The breakdown of the number of mobile phone subscribers in Algeria in 2014 is given below.
Operator                           Djezzy   Ooredoo   Mobilis   Others   Total
Number of subscribers (millions)   9.86     19.72     29.68     1.74     61
Frequency (%)                      16.16    32.33     48.66     2.85     100
Corresponding angle (°)            58.19    116.38    175.16    10.27    360
Complete the table above.
Construct below the appropriate circular diagram showing the distribution of subscribers, specifying
for each sector: the Operator, the percentage and the corresponding angle.
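As a check on the table, the percentages and sector angles can be recomputed from the subscriber numbers with the formula di = ni × 360 / N. This is a minimal Python sketch. The dictionary keys are our own labels for the four operators, and the figures are those given in the example.

```python
# Subscribers in millions, as given in the example.
subscribers = {"Djezzy": 9.86, "Ooredoo": 19.72, "Mobilis": 29.68, "Others": 1.74}

N = sum(subscribers.values())  # total number of subscribers (millions)

# Frequency in % and angle of the corresponding sector, di = ni * 360 / N.
percentages = {op: 100 * n_i / N for op, n_i in subscribers.items()}
angles = {op: 360 * n_i / N for op, n_i in subscribers.items()}

for op in subscribers:
    print(f"{op:<8} {subscribers[op]:>6.2f}  {percentages[op]:>6.2f} %  {angles[op]:>7.2f} deg")
print(f"{'Total':<8} {N:>6.2f}  {sum(percentages.values()):>6.2f} %  {sum(angles.values()):>7.2f} deg")
```

Computing each angle from the unrounded proportion, rather than from an already rounded percentage, avoids accumulating rounding errors.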
Example
We counted 1000 leukocytes in an individual and looked at their shape (category).
Category of leukocytes   Neutrophils   Eosinophils   Basophils   Lymphocytes   Monocytes
Number ni                600           20            10          110           260
- What is the characteristic being studied and what is its nature?
- Graph this statistical series.
Solution
Characteristic studied: Category of leukocytes
Nature: a qualitative characteristic
Graphical representation