AGE 301 - NOTE - A-1
1. Introduction
Engineering research generates enormous amounts of data which, when analysed, form the basis for inferences, decisions and conclusions. For meaningful research output, the engineer must analyse the data appropriately and present the results in acceptable standard formats. This is a major role of statistics in engineering. There are two broad aspects of statistics, namely Descriptive Statistics and Inferential Statistics. The former is used to explore and summarize the information contained in the data, while the latter involves drawing inferences about the population from the data. Procedures under descriptive statistics include the use of tables, graphs, and numerical measures (computation of simple statistics such as the mean, median, mode, variance, standard deviation, etc. to describe the data). On the other hand, statistical inference involves the formulation of statistical hypotheses, testing of the hypotheses, and making inferences or drawing conclusions based on the results obtained.
Statistics is the mathematical science involving the collection, analysis and interpretation of data. A number of specialties have evolved to apply statistical theory and methods to various disciplines, e.g. engineering statistics (which combines engineering and statistics), environmental science, geosciences, operations research, quality and process control, etc.
Descriptive statistics provides simple summaries about the sample/observations that have been
made. Such summaries may be either quantitative, i.e. summary statistics, or visual, i.e. simple-to-
understand graphs. These summaries may either form the basis of the initial description of the data
as part of a more extensive statistical analysis, or they may be sufficient in and of themselves for a
particular investigation.
Inferential statistics (or inductive statistics) uses the data to learn about the population that the sample of data represents. These methods are developed on the basis of probability theory.
Deterministic data are data generated in accordance with known and precise laws, e.g. the fall of a body subject to the Earth's gravity. The defining attribute of deterministic data is that, within the precision of the measurements, the same data will be obtained under repeated experiments in well-defined conditions.
Random data are data that seem to occur in a purely haphazard way, e.g. thermal noise generated in electrical resistances, Brownian motion of tiny particles in a fluid, weather variables, financial variables such as stock exchange share prices, and gambling game outcomes (dice, cards, roulette, etc.). In none of these examples can a precise mathematical law describe the data. Also, there is no possibility of obtaining the same data in repeated experiments performed under similar conditions.
Missing data – the failure to obtain the values of one or more variables for certain objects/cases – will always undermine the degree of certainty of the statistical conclusions. Many software products provide means to cope with missing data. One approach is simply to code missing data with symbolic numbers or tags, such as "na" ("not available"), which are then ignored when performing statistical analysis operations. Another possibility is to substitute missing data with the average values of the respective variables. Yet another solution is to simply remove objects with missing data. Whatever method is used, the quality of the project is always impaired.
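As an illustration of these three options, the following is a minimal Python sketch (using the pandas library, with a hypothetical rainfall series) that codes missing entries as "na", substitutes them with the variable's mean, and removes incomplete cases:

    import pandas as pd

    # Hypothetical rainfall records with two missing values coded as "na".
    raw = pd.Series(["1200", "na", "1350", "1420", "na", "1280"])

    # Option 1: code missing data as NaN so it is ignored in statistics.
    rainfall = pd.to_numeric(raw, errors="coerce")
    print(rainfall.mean())  # NaN entries are skipped automatically

    # Option 2: substitute missing values with the variable's mean.
    filled = rainfall.fillna(rainfall.mean())

    # Option 3: simply remove cases with missing data.
    complete = rainfall.dropna()

Note that mean substitution artificially reduces the variance of the variable, which is one way in which the quality of the analysis is impaired.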
It must be stressed that many free statistical packages (freeware) are also available nowadays (e.g. MAKESENS, for trend detection in time series data). Some are even customised for specific tasks. With the knowledge of one, it becomes relatively easy to use others once the manual is available. During this course, we will focus on Microsoft Excel and other readily available software. However, it should be noted that Microsoft Excel is only a spreadsheet and not a statistical package; it therefore has serious limitations when it comes to rigorous statistical analysis. Notwithstanding, it is a very powerful tool for exploratory data analysis (EDA). Thus, a basic knowledge of MS Excel is assumed for this course (WMA 501).
2. Data Analyses
Histogram
Figure 1: Nigeria annual rainfall series shown as a histogram (the outlier is enclosed in the oval shape). The histogram is overlaid with the theoretical normal distribution curve.
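The construction of such a plot can be sketched in Python (matplotlib and scipy assumed, with a synthetic rainfall series standing in for the Nigerian data, which are not reproduced here):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # Synthetic stand-in for an annual rainfall series (mm/yr).
    rng = np.random.default_rng(1)
    rainfall = rng.normal(loc=1400, scale=200, size=100)

    # Histogram scaled to a density so the normal curve is comparable.
    plt.hist(rainfall, bins=12, density=True, edgecolor="black")

    # Theoretical normal curve from the sample mean and standard deviation.
    x = np.linspace(rainfall.min(), rainfall.max(), 200)
    plt.plot(x, stats.norm.pdf(x, rainfall.mean(), rainfall.std()))
    plt.xlabel("Annual rainfall (mm/yr)")
    plt.ylabel("Density")
    plt.show()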
Box Plot
In descriptive statistics, a box plot (also known as a box-and-whisker diagram) is a convenient way
of graphically depicting groups of numerical data through their five-number summaries: the
smallest observation (sample minimum), lower quartile (Q1), median (Q2), upper quartile (Q3),
and largest observation (sample maximum). A box plot may also indicate which observations, if
any, might be considered outliers. Figure 2 shows the annual rainfall series of Nigeria as a box plot.
Figure 2: Annual rainfall series (1901-2000) of Nigeria using a box plot (the outlier is encircled and corresponds to the 83rd data point, i.e. the 1983 rainfall value).
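The five-number summary behind a box plot, together with the common 1.5 × IQR whisker rule for flagging outliers, can be computed directly. A minimal Python sketch with a hypothetical rainfall sample:

    import numpy as np

    # Hypothetical annual rainfall sample (mm/yr).
    rainfall = np.array([980, 1100, 1180, 1250, 1310,
                         1400, 1460, 1520, 1650, 2100])

    q1, q2, q3 = np.percentile(rainfall, [25, 50, 75])
    print("five-number summary:",
          rainfall.min(), q1, q2, q3, rainfall.max())

    # The usual whisker rule flags points beyond 1.5 * IQR as outliers.
    iqr = q3 - q1
    outliers = rainfall[(rainfall < q1 - 1.5 * iqr) |
                        (rainfall > q3 + 1.5 * iqr)]
    print("potential outliers:", outliers)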
Quantitative measures of dependence include correlation (such as Pearson's r when both variables are continuous, or Spearman's rho if one or both are not) and covariance (which reflects the scale upon which the variables are measured). The slope in regression analysis also reflects the relationship between the variables.
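These measures are available in standard packages; for instance, a brief Python sketch (scipy assumed) with hypothetical paired observations:

    import numpy as np
    from scipy import stats

    # Hypothetical paired observations, e.g. evaporation vs sap flow.
    x = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 6.0])
    y = np.array([2.3, 4.1, 6.2, 8.5, 9.9, 12.4])

    r, p_r = stats.pearsonr(x, y)       # linear association, continuous data
    rho, p_rho = stats.spearmanr(x, y)  # rank-based, no linearity assumption
    cov = np.cov(x, y)[0, 1]            # scale-dependent co-variation

    print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, "
          f"covariance = {cov:.3f}")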
Graphics
Statistical graphics, also known as graphical techniques, are information graphics in the field of
statistics used to visualize quantitative data. Graphical techniques allow results to be displayed in
some sort of pictorial form. They include plots such as scatter plots (e.g. Figure 3), histograms,
etc.
If one is not using statistical graphics, then one is forfeiting insight into one or more aspects of the
underlying structure of the data.
[Figure 3 image: scatter plot of Fd (g cm-2 h-1) against Eo (W m-2), with dry, wet and transition periods distinguished.]
Figure 3: A scatter plot showing relation between sap flow (Fd) and potential evaporation (Eo) in a
cashew orchard in Ghana.
The model could be linear or curvilinear depending on the data structure and the inherent relationship between the variables.
There are different ways of fitting a curve other than a straight line to data (sketches of approaches (b) and (c) follow the list):
(a) deriving the regression formula analytically, which may be cumbersome and requires some knowledge of calculus;
(b) linearization: some nonlinear regression problems can be moved to a linear domain by a suitable transformation;
(c) using optimization algorithms to minimise the error, e.g. the Levenberg–Marquardt algorithm;
(d) using nonlinear modelling techniques, such as neural networks.
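A minimal Python sketch of the linearization and optimization approaches, using a hypothetical exponential model and synthetic data (scipy's curve_fit applies the Levenberg–Marquardt algorithm by default for unconstrained problems):

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical exponential relationship y = a * exp(b * x) with noise.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 4, 30)
    y = 2.0 * np.exp(0.5 * x) + rng.normal(0, 0.2, x.size)

    # Linearization: taking logs turns the model into ln(y) = ln(a) + b*x,
    # which an ordinary straight-line least-squares fit can handle.
    b_lin, ln_a = np.polyfit(x, np.log(y), 1)

    # Direct nonlinear fit via the Levenberg-Marquardt algorithm.
    def model(x, a, b):
        return a * np.exp(b * x)

    (a_nl, b_nl), _ = curve_fit(model, x, y, p0=(1.0, 0.1))
    print(a_nl, b_nl)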
Goodness-of-fit
Commonly used checks of goodness of fit include the coefficient of determination (R²), analysis of residuals, and hypothesis testing. Statistical significance can be checked by an F-test of the overall fit, followed by t-tests of individual parameters.
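For a straight-line fit, these checks can be obtained directly; a brief sketch with hypothetical data (with a single predictor, the t-test on the slope is equivalent to the overall F-test):

    import numpy as np
    from scipy import stats

    # Hypothetical data for a straight-line fit.
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.3])

    fit = stats.linregress(x, y)
    r_squared = fit.rvalue ** 2  # coefficient of determination
    residuals = y - (fit.intercept + fit.slope * x)

    # fit.pvalue is the two-sided p-value of the t-test that the slope
    # is zero (equivalent to the overall F-test for one predictor).
    print(f"R^2 = {r_squared:.3f}, slope p-value = {fit.pvalue:.4f}")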
[Figure 4 image: time series of annual rainfall (mm/yr) against year, 1970-2010.]
Figure 4: Time series plot of the mean annual rainfall total of Nigeria.
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series (trend) analysis techniques may be divided into parametric and non-parametric methods.
Parametric methods (such as linear regression) rest on distributional assumptions, among them that the variance of the dependent variable in the source population is constant regardless of the value of the independent variable(s), and that the residuals are independent of each other.
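When these assumptions are doubtful, a non-parametric test is preferable. The Mann-Kendall test (the test underlying MAKESENS, mentioned in Section 1) is a common choice; a minimal Python sketch, ignoring the correction for tied values:

    import numpy as np
    from scipy import stats

    def mann_kendall(x):
        """Minimal Mann-Kendall trend test (no correction for ties)."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        # S counts concordant minus discordant pairs over all i < j.
        s = sum(np.sign(x[j] - x[i])
                for i in range(n - 1) for j in range(i + 1, n))
        var_s = n * (n - 1) * (2 * n + 5) / 18.0
        # Continuity-corrected standard normal test statistic.
        if s > 0:
            z = (s - 1) / np.sqrt(var_s)
        elif s < 0:
            z = (s + 1) / np.sqrt(var_s)
        else:
            z = 0.0
        p = 2 * (1 - stats.norm.cdf(abs(z)))  # two-sided p-value
        return s, z, p

    # Hypothetical annual series with a mild upward trend.
    series = [3.1, 3.4, 3.2, 3.8, 4.0, 3.9, 4.3, 4.5, 4.4, 4.8]
    print(mann_kendall(series))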
Autocorrelation (Auto-correlogram)
Autocorrelation is used to identify a periodic signal (stochastic component) in time series datasets. Autocorrelation analysis correlates a time series dataset with itself at different time lags. It is useful in checking randomness, finding repeating patterns, or identifying the presence of a periodic signal in a time series dataset.
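A minimal sketch of an auto-correlogram in Python, using a synthetic series with a known period of 4 time steps:

    import numpy as np

    def autocorr(x, lag):
        """Sample correlation of a series with itself shifted by `lag`."""
        x = np.asarray(x, dtype=float)
        return np.corrcoef(x[:-lag], x[lag:])[0, 1]

    # Synthetic series with a period of 4 time steps plus noise.
    rng = np.random.default_rng(0)
    t = np.arange(80)
    x = np.sin(2 * np.pi * t / 4) + rng.normal(0, 0.3, t.size)

    # The correlogram: autocorrelation at successive lags; peaks at
    # multiples of 4 reveal the periodic signal.
    for lag in range(1, 9):
        print(lag, round(autocorr(x, lag), 2))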
Fourier Analysis
Also known as spectral analysis, this uses the exploration of cyclical patterns in data to decompose time series datasets into a spectrum of cycles of different lengths. This helps to uncover recurring cycles of different lengths in a time series which at first may look like random noise. It is possible to use the unfiltered datasets for the spectral analysis to retain the contribution of high-frequency signals. It is more efficient than autocorrelation as it uses variance (not correlation).
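A minimal periodogram sketch using NumPy's FFT, with a synthetic series in which a 10-step cycle is buried in noise:

    import numpy as np

    # Hypothetical unfiltered series: a 10-step cycle plus noise.
    rng = np.random.default_rng(0)
    n = 200
    t = np.arange(n)
    x = np.sin(2 * np.pi * t / 10) + rng.normal(0, 1.0, n)

    # Periodogram: the squared magnitude of the Fourier transform
    # distributes the series' variance across frequencies.
    freqs = np.fft.rfftfreq(n, d=1.0)  # cycles per time step
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2

    # The dominant frequency should be near 0.1 (a 10-step cycle).
    peak = freqs[np.argmax(power)]
    print("dominant cycle length:", 1 / peak)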
Cluster Analysis
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in
the same group (called a cluster) are more similar (in some sense or another) to each other than to
those in other groups (clusters). In other words, cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in such a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise.
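As an illustration, a minimal k-means sketch (one of many clustering algorithms; scipy assumed) that sorts hypothetical stations, described by rainfall and temperature, into two groups:

    import numpy as np
    from scipy.cluster.vq import kmeans2

    # Two hypothetical groups of stations: (rainfall mm/yr, temperature C).
    rng = np.random.default_rng(0)
    wet = rng.normal([1800, 26], [100, 0.5], size=(20, 2))
    dry = rng.normal([600, 32], [80, 0.8], size=(20, 2))
    data = np.vstack([wet, dry])

    # k-means sorts the objects into k groups by minimising the
    # within-cluster distance to each group's centroid.
    centroids, labels = kmeans2(data, 2, minit="++")
    print(centroids)
    print(labels)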
Wavelet Analysis
Wavelet analysis is becoming a common tool for analyzing localized variations of power within a time series. By decomposing a time series into time-frequency space, one is able to determine both the dominant modes of variability and how those modes vary in time. The wavelet transform has been used in numerous studies in geophysics, including tropical convection, the El Niño–Southern Oscillation (ENSO), atmospheric cold fronts, rainfall and temperature, etc.
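A minimal sketch of a continuous wavelet transform, assuming the third-party PyWavelets (pywt) package and a synthetic series whose dominant cycle changes halfway through the record:

    import numpy as np
    import pywt  # PyWavelets, a third-party package

    # Synthetic series whose dominant cycle changes over time.
    t = np.arange(400)
    x = np.where(t < 200,
                 np.sin(2 * np.pi * t / 16),  # 16-step cycle early on
                 np.sin(2 * np.pi * t / 64))  # 64-step cycle later

    # Continuous wavelet transform with the Morlet wavelet: rows of
    # `coeffs` correspond to scales, columns to time, so |coeffs|^2
    # maps how the power at each scale varies through the record.
    scales = np.arange(1, 128)
    coeffs, freqs = pywt.cwt(x, scales, "morl")
    print(coeffs.shape)  # (len(scales), len(x))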
Concluding Remarks
Please note that all the major topics presented above will be illustrated with examples during the practical sessions, using relevant software, e.g. MS Excel, SPSS and MAKESENS (freeware), as contained in the tasks and tutorials for this course.