
AGE 301- Engineering Statistics - Part I

by

Prof. Ayorinde A. Olufayo

&

Prof. Philip G. Oguntunde

Department of Agricultural & Environmental Engineering,


The Federal University of Technology, Akure, Nigeria

1. Introduction

Engineering research generates enormous amounts of data which, when analysed, form the basis for inferences,
decisions and conclusions. For meaningful research output, the engineer must analyse the data
appropriately and present the results in acceptable standard formats. This is a major role of statistics
in engineering. There are two broad aspects of statistics, namely: Descriptive Statistics and
Inferential Statistics. The former is used to explore and summarize the information contained in
the data, while the latter involves drawing inferences about the population from the data.
Procedures under descriptive statistics include the use of tables, graphs, and numerical measures
(computation of simple statistics such as the mean, median, mode, variance, standard deviation, etc. to
describe the data). Statistical inference, on the other hand, involves formulating statistical
hypotheses, testing them, and making inferences or drawing conclusions from the
results obtained.

1.1 Basic Concepts & Definitions

Statistics is the mathematical science involving the collection, analysis and interpretation of data. A
number of specialties have evolved to apply statistical theory and methods to various disciplines,
e.g. engineering statistics (combining engineering and statistics), environmental science, geosciences,
operations research, quality and process control, etc.

Descriptive statistics provides simple summaries about the sample/observations that have been
made. Such summaries may be either quantitative, i.e. summary statistics, or visual, i.e. simple-to-
understand graphs. These summaries may either form the basis of the initial description of the data
as part of a more extensive statistical analysis, or they may be sufficient in and of themselves for a
particular investigation.

Inferential statistics (or inductive statistics) use the data to learn about the population that the
sample of data is representing. These statistics are developed on the basis of probability theory.

Deterministic data are data generated in accordance with known and precise laws, e.g. the fall of a
body subject to the Earth's gravity. The defining attribute of deterministic data is that, within the
precision of the measurements, repeated experiments under well-defined conditions will yield the
same data.

Random data are data that seem to occur in a purely haphazard way, e.g. thermal noise generated
in electrical resistances, Brownian motion of tiny particles in a fluid, weather variables, financial
variables such as stock exchange shares, and gambling game outcomes (dice, cards, roulette, etc.). In
none of these examples can a precise mathematical law describe the data. Also, there is no
possibility of obtaining the same data in repeated experiments performed under similar conditions.

Datasets & Preliminaries


A statistical data analysis project starts with the data collection task. How well this
task is performed is a major determinant of the quality of the overall project. Issues such as reducing the number
of missing data, recording pertinent documentation on what the problem is and how the data
were collected, and inserting an appropriate description of the meaning of the variables involved
must be adequately addressed. Data screening and quality control are very pertinent.

Missing data – failure to obtain the values of one or more variables for certain objects/cases – will
always undermine the degree of certainty of the statistical conclusions. Many software products
provide means to cope with missing data. One option is simply coding missing data with symbolic
values or tags, such as "na" ("not available"), which are neglected when performing statistical
analysis operations. Another possibility is substituting missing data with the average values of the
respective variables. Yet another solution is to simply remove objects with missing data. Whatever
method is used, the quality of the project is always impaired.
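The three coping strategies above can be sketched in a few lines of Python (used here purely for illustration, since the course tools are Excel and SPSS; the sample values are hypothetical):

```python
import math
from statistics import mean

# A small hypothetical sample with missing values coded as NaN ("not available").
data = [12.4, float("nan"), 15.1, 13.8, float("nan"), 14.2]

# Strategy 1: neglect missing values when computing statistics.
observed = [x for x in data if not math.isnan(x)]
mean_observed = mean(observed)

# Strategy 2: substitute missing values with the variable's average.
imputed = [mean_observed if math.isnan(x) else x for x in data]

# Strategy 3: remove objects (cases) with missing data entirely.
complete_cases = [x for x in data if not math.isnan(x)]

print(mean_observed)                      # 13.875
print(len(imputed), len(complete_cases))  # 6 4
```

Note that whichever strategy is chosen, some information is lost or distorted, which is exactly the impairment the text warns about.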

Outliers and extreme values


An outlier is an observation that is numerically distant from the rest of the data – one that
appears to deviate markedly from the other members of the sample in which it occurs.
Outliers, being the most extreme observations, may include the sample maximum or sample
minimum, or both, depending on whether they are extremely high or low. However, the sample
maximum and minimum are not always outliers because they may not be unusually far from the other
observations. Outliers arise from changes in system behaviour, fraudulent behaviour, human or
instrument errors, or simply natural deviations in populations.

1.2 Application Software Tools


There are many software tools for statistical analysis, covering a broad spectrum of possibilities. At
one end we find “closed” products where the user can only perform menu operations. SPSS and
STATISTICA are examples of “closed” products. At the other end we find “open” products
allowing the user to program any arbitrarily complex sequence of statistical analysis operations.
MATLAB and R are examples of “open” products providing both a programming language and an
environment for statistical and graphic operations.

It must be stressed that many free statistical packages (freeware) are also available
nowadays (e.g. MAKESENS, for trend detection in time series data). Some are
even customised for specific tasks. With the knowledge of one, it becomes relatively easy to use
others once the manual is available. During this course, we will focus on Microsoft Excel and other
readily available software. However, it should be noted that Microsoft Excel is only a spreadsheet
and not a statistical package; it therefore has serious limitations when it comes to rigorous
statistical analysis. It is nonetheless a very powerful tool for exploratory data analysis
(EDA). Thus, a basic knowledge of MS Excel is assumed for this course.

2. Data Analyses

2.1 Univariate analysis


Univariate analysis involves describing the distribution of a single variable, including its central
tendency (e.g. mean, median, and mode) and dispersion (e.g. the range and quantiles of the dataset,
and measures of spread such as the variance and standard deviation). The shape of the
distribution may also be described via indices such as skewness and kurtosis. Characteristics of a
variable's distribution may also be depicted in graphs or tables such as histograms and box plots.
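As an illustration, the central-tendency, dispersion and shape measures listed above can be computed with Python's standard `statistics` module (the data values are hypothetical; Excel offers the same measures via AVERAGE, MEDIAN, MODE, VAR, STDEV, SKEW):

```python
from statistics import mean, median, mode, pvariance, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical measurements
n = len(data)

m = mean(data)               # central tendency
med = median(data)
mo = mode(data)

rng = max(data) - min(data)  # dispersion
var = pvariance(data)        # population variance
sd = pstdev(data)            # population standard deviation

# Shape: sample skewness (third standardized moment); 0 for a symmetric distribution.
skew = sum((x - m) ** 3 for x in data) / n / sd ** 3
print(m, med, mo, var, sd, rng, skew)
```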

Histogram

A histogram (Figure 1) is a graphical representation giving a visual impression of the
distribution of data. It is an estimate of the probability distribution of a continuous variable.

Figure 1: Nigeria annual rainfall series as a histogram (the outlier is enclosed in the oval shape).
The histogram is overlaid with the theoretical normal distribution curve.

Box Plot

In descriptive statistics, a box plot (also known as a box-and-whisker diagram) is a convenient way
of graphically depicting groups of numerical data through their five-number summaries: the
smallest observation (sample minimum), lower quartile (Q1), median (Q2), upper quartile (Q3),
and largest observation (sample maximum). A box plot may also indicate which observations, if
any, might be considered outliers. Figure 2 shows the annual rainfall series of Nigeria using box
plot.

Figure 2: Annual rainfall series (1901-2000) of Nigeria as a box plot (the outlier is encircled and
corresponds to the 83rd data point, the 1983 rainfall value).
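A minimal sketch of the five-number summary underlying a box plot, together with the common 1.5 × IQR rule many box plots use to flag possible outliers (hypothetical data; `statistics.quantiles` requires Python 3.8+):

```python
from statistics import quantiles  # Python 3.8+

data = [1, 3, 5, 7, 9, 11, 13, 15, 100]   # hypothetical, with one extreme value

# Five-number summary: minimum, Q1, median (Q2), Q3, maximum.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, q2, q3, max(data))

# A common rule flags points beyond 1.5 * IQR from the quartiles as outliers.
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]

print(five_number)
print(outliers)   # [100]
```

Note the sample maximum (100) is flagged here, but as the text says, a maximum is not automatically an outlier: it is flagged only because it lies far beyond the upper fence.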

2.2 Bivariate analysis


When a sample consists of two variables, descriptive statistics may be used to describe the
relationship between the pair of variables. In this case, the statistics include:

 Cross-tabulations and contingency tables
 Graphical representation via scatter plots
 Quantitative measures of dependence
 Descriptions of conditional distributions

Quantitative measures of dependence include correlation (such as Pearson's r when both variables
are continuous, or Spearman's rho if one or both are not) and covariance (which reflects the scale
upon which the variables are measured). The slope in regression analysis also reflects the
relationship between the variables.
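The link between covariance and Pearson's r can be made concrete: r is simply the covariance scaled by the two standard deviations, which removes the measurement scale. A sketch with hypothetical paired data:

```python
from statistics import mean, pstdev

x = [1, 2, 3, 4, 5]        # hypothetical paired observations
y = [2, 4, 5, 4, 5]

mx, my = mean(x), mean(y)

# Covariance reflects the scale on which the variables are measured.
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

# Pearson's r is the covariance scaled to lie between -1 and +1.
r = cov / (pstdev(x) * pstdev(y))
print(cov, r)
```

Rescaling x or y (say, converting mm to cm) changes the covariance but leaves r unchanged, which is why r is preferred for comparing strength of dependence.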

Graphics

Statistical graphics, also known as graphical techniques, are information graphics in the field of
statistics used to visualize quantitative data. Graphical techniques allow results to be displayed in
some sort of pictorial form. They include plots such as scatter plots (e.g. Figure 3), histograms,
etc.

Graphical statistical methods have four objectives:

 The exploration of the content of a data set
 Finding structure in the data
 Checking assumptions in statistical models
 Communicating the results of an analysis

If one is not using statistical graphics, then one is forfeiting insight into one or more aspects of the
underlying structure of the data.

Figure 3: A scatter plot showing the relation between sap flow (Fd, g cm-2 h-1) and potential
evaporation (Eo, W m-2) in a cashew orchard in Ghana, with dry, wet and transition periods
distinguished.

2.3 Regression analysis & models


Regression analysis is a statistical technique for estimating the relationships among variables. It
includes many techniques for modeling and analyzing several variables, when the focus is on the
relationship between a dependent variable and one or more independent variables. Regression
models involve the following variables:

 The unknown parameters, denoted as β, which may represent a scalar or a vector.
 The independent variables, X.
 The dependent variable, Y.

A regression model relates Y to a function of X and β.

The model could be linear or curvilinear depending on the data structure and the inherent
relationship between the variables.

Curve fitting techniques

There are different ways of fitting a curve other than a straight line to data:

 Deriving the regression formula directly, which may be cumbersome and requires some
calculus knowledge;
 Linearization: some nonlinear regression problems can be moved to a linear domain by a
suitable transformation;
 Using optimization algorithms to minimise the error, e.g. the Levenberg–Marquardt
algorithm;
 Using nonlinear modelling techniques, such as neural networks.
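The linearization approach can be sketched for an exponential model y = a·e^(bx): taking logarithms gives ln(y) = ln(a) + b·x, a straight line in (x, ln y). The sketch below uses synthetic data generated from known parameters (a = 2, b = 0.5), so the fit should recover them:

```python
import math
from statistics import mean

# Synthetic data generated from y = a * exp(b*x) with a = 2, b = 0.5.
x = [0, 1, 2, 3, 4]
y = [2 * math.exp(0.5 * xi) for xi in x]

# Linearize: ln(y) = ln(a) + b*x, then fit a straight line to (x, ln y).
ly = [math.log(yi) for yi in y]
mx, mly = mean(x), mean(ly)
b = sum((xi - mx) * (li - mly) for xi, li in zip(x, ly)) / \
    sum((xi - mx) ** 2 for xi in x)
a = math.exp(mly - b * mx)
print(round(a, 6), round(b, 6))   # 2.0 0.5
```

With noisy real data the recovered parameters are only approximate, and the log transform also changes the error structure, which is one reason the optimization-based methods above are sometimes preferred.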

Goodness-of-fit
Commonly used checks of goodness of fit include the coefficient of determination, R2, analyses of
residuals and hypothesis testing. Statistical significance can be checked by an F-test of the overall
fit, followed by t-tests of individual parameters.
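The least-squares fit and its coefficient of determination can be sketched together (hypothetical data; Excel's SLOPE, INTERCEPT and RSQ functions compute the same quantities):

```python
from statistics import mean

x = [1, 2, 3, 4, 5]              # hypothetical predictor
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical response

mx, my = mean(x), mean(y)

# Least-squares estimates of the slope (beta) and intercept (alpha).
beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
       sum((a - mx) ** 2 for a in x)
alpha = my - beta * mx

# R^2: the fraction of the variance in y explained by the fitted line.
pred = [alpha + beta * a for a in x]
ss_res = sum((b - p) ** 2 for b, p in zip(y, pred))
ss_tot = sum((b - my) ** 2 for b in y)
r2 = 1 - ss_res / ss_tot
print(beta, alpha, r2)
```

An R² near 1 means the residual sum of squares is small relative to the total variation; the analyses of residuals mentioned above then check that what remains is structureless noise.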

2.4 Time Series Analysis


In hydrometeorology (including hydrology and water resources, agrometeorology, ecohydrology,
climatology, etc.) a time series is a sequence of data points, typically measured at successive time
instants spaced at uniform intervals. An example is the time series plot of the annual rainfall total of
Nigeria (Figure 4). The goals of time series analysis include identifying patterns and predicting
future values.

Figure 4: Time series plot of the mean annual rainfall total of Nigeria (mm/yr), 1970-2010.

Time series analysis comprises methods for analyzing time series data in order to extract
meaningful statistics and other characteristics of the data. In addition, time series analysis (trend)
techniques may be divided into parametric and non-parametric methods.

2.4.1 Trend Analysis

Parametric and non parametric trend test


Generally, before embarking on a parametric trend test or least-squares regression analysis, the data
must be checked for suitability by verifying three assumptions that linear regression makes about
the data.

The regression assumes:

(i) that the source population is normally distributed,
(ii) that the variance of the dependent variable in the source population is constant regardless of the
value of the independent variable(s), and
(iii) that the residuals are independent of each other.
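When these assumptions fail, a non-parametric test is used instead. MAKESENS, mentioned earlier, is built around the non-parametric Mann-Kendall test; a simplified sketch of its S statistic (ignoring tie corrections and significance levels, which the full test includes) is:

```python
def mann_kendall_s(series):
    """Mann-Kendall S statistic (simplified: no tie correction or
    significance test). S > 0 suggests an upward trend, S < 0 a
    downward trend, S near 0 no monotonic trend."""
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)   # sign of the pairwise difference
    return s

print(mann_kendall_s([1, 2, 3, 4, 5]))   # 10 (every later value is larger)
print(mann_kendall_s([5, 4, 3, 2, 1]))   # -10
```

Because S depends only on the signs of pairwise differences, not their magnitudes, it needs no normality assumption, which is exactly why it suits hydrometeorological series.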

2.4.2 Analysis of seasonality (Cycles and Periodicities)

Autocorrelation (Auto-correlogram)
Autocorrelation is used to identify periodic signals in time series datasets. Autocorrelation
analysis correlates a time series dataset with itself at different time lags. It is useful
for checking randomness, finding repeating patterns, or identifying the presence of a periodic
signal in a time series dataset.
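The lag-k autocorrelation can be sketched directly from its definition, here applied to a synthetic periodic signal (hypothetical data for illustration):

```python
from statistics import mean

def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag`."""
    n = len(series)
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    num = sum((series[t] - m) * (series[t + lag] - m) for t in range(n - lag))
    return num / denom

# A signal repeating every 4 steps correlates strongly with itself at lag 4.
signal = [0, 1, 0, -1] * 8
print(autocorrelation(signal, 4))   # 0.875
```

Plotting this value over many lags gives the auto-correlogram: peaks at multiples of the period reveal the periodic signal, while a purely random series shows values near zero at all non-zero lags.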

Fourier Analysis
Also known as spectral analysis, Fourier analysis explores cyclical patterns in data by decomposing
time series datasets into a spectrum of cycles of different lengths. This helps to uncover recurring
cycles of different lengths in a time series which at first may look like random noise. It is possible
to use the unfiltered datasets for the spectral analysis so as to retain the contribution of
high-frequency signals. It is more efficient than autocorrelation as it uses variance (not correlation).
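The decomposition idea can be sketched with a plain discrete Fourier transform (a simplified periodogram; real spectral analysis would normally use an FFT routine and proper normalization):

```python
import cmath
import math

def periodogram(series):
    """Squared DFT magnitude at frequency index k = 1 .. n//2; a peak at
    index k indicates a cycle of length n/k samples."""
    n = len(series)
    return {k: abs(sum(series[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                       for t in range(n))) ** 2
            for k in range(1, n // 2 + 1)}

# A pure cycle of period 8 in a record of 32 samples peaks at k = 32/8 = 4.
series = [math.sin(2 * math.pi * t / 8) for t in range(32)]
power = periodogram(series)
dominant = max(power, key=power.get)
print(dominant, 32 / dominant)   # 4 8.0 -> a cycle of 8 samples
```

Each entry of the spectrum measures the variance contributed by one cycle length, which is the sense in which spectral analysis "uses variance, not correlation".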

3. Further Data Analyses


Multivariate analysis comprises a set of techniques dedicated to the analysis of data sets with more
than one variable. Several of these techniques were developed recently in part because they require
the computational capabilities of modern computers.

3.1 Cross-correlation and multiple regressions


The correlation matrix is often used for a first inspection of the interrelationships among the
variables of a multivariate dataset. Multiple linear regression is the extension of simple linear
regression in which there is generally more than one predictor. It may sometimes serve as a
follow-up to the cross-correlation analysis.
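A correlation matrix is just Pearson's r computed for every pair of variables; a sketch with hypothetical values (the variable names and numbers below are invented for illustration):

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson's r for two equally long lists of observations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

# Three variables observed on the same five objects (hypothetical values).
variables = {
    "rainfall": [800, 950, 700, 1200, 1000],
    "runoff":   [120, 150, 100, 210, 160],
    "temp":     [30, 28, 32, 25, 27],
}

names = list(variables)
matrix = {a: {b: pearson(variables[a], variables[b]) for b in names}
          for a in names}
for a in names:
    print(a, [round(matrix[a][b], 2) for b in names])
```

The diagonal is always 1 (each variable correlates perfectly with itself), and the matrix is symmetric, so only the upper or lower triangle needs inspection.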

3.2 Data Structure Analysis


Data structure analysis comprises statistical techniques that allow us to analyse the structure of
the data with the dual objective of dimensionality reduction and improved data interpretation.
These include Principal Components Analysis (PCA), Cluster Analysis (CA), Wavelet Analysis
(WA) and Self-Organising Maps (SOM). Detailed discussion of these is beyond the scope of this
course.

Principal Components Analysis


Principal component analysis (PCA) is a statistical procedure that uses an orthogonal
transformation to convert a set of observations of possibly correlated variables into a set of values
of linearly uncorrelated variables called principal components. It is often used to emphasize
variation, bring out strong patterns in a dataset, and make data easier to explore and visualize. It is
a data reduction technique that can also be used to find coupling among complex datasets.
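As an illustration of the idea (not a full PCA), the first principal component is the dominant eigenvector of the data's covariance matrix, i.e. the direction of maximum variance. The sketch below finds it by power iteration on a tiny hypothetical dataset:

```python
def first_principal_component(data, iters=200):
    """Direction of maximum variance: the dominant eigenvector of the
    covariance matrix, found here by power iteration."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in data]

    # Covariance matrix of the centred data.
    cov = [[sum(r[i] * r[j] for r in centred) / n for j in range(d)]
           for i in range(d)]

    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Points scattered along the line y = x: the first PC is close to (0.707, 0.707).
pc1 = first_principal_component([[1, 1], [2, 2.1], [3, 2.9], [4, 4.2]])
print(pc1)
```

A full PCA would extract all components and their variances; statistical packages (and libraries such as those mentioned in Section 1.2) do this with a standard eigen-decomposition.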

Cluster Analysis
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in
the same group (called a cluster) are more similar (in some sense or another) to each other than to
those in other groups (clusters). In other words, cluster analysis is an exploratory data analysis tool
which aims at sorting different objects into groups in such a way that the degree of association
between two objects is maximal if they belong to the same group and minimal otherwise.
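A minimal sketch of this grouping idea is Lloyd's k-means algorithm in one dimension (the points and starting centres below are hypothetical):

```python
def kmeans_1d(points, centres, iters=20):
    """Lloyd's k-means in one dimension: assign each point to its nearest
    centre, then move each centre to the mean of its assigned points."""
    for _ in range(iters):
        clusters = {c: [] for c in range(len(centres))}
        for p in points:
            nearest = min(range(len(centres)), key=lambda c: abs(p - centres[c]))
            clusters[nearest].append(p)
        # Empty clusters keep their old centre.
        centres = [sum(m) / len(m) if m else centres[c]
                   for c, m in clusters.items()]
    return sorted(centres)

# Two well-separated groups: the centres converge to 2 and 20.
print(kmeans_1d([1, 2, 3, 19, 20, 21], centres=[0.0, 10.0]))   # [2.0, 20.0]
```

Each iteration increases within-cluster similarity (association to the own centre) and the algorithm stops changing once every point is nearest its own cluster's mean.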

Wavelet Analysis
Wavelet analysis is becoming a common tool for analyzing localized variations of power within a
time series. By decomposing a time series into time-frequency space, one is able to determine both
the dominant modes of variability and how those modes vary in time. The wavelet transform has
been used for numerous studies in geophysics, including tropical convection, the El Niño–Southern
Oscillation (ENSO), atmospheric cold fronts, rainfall and temperature, etc.

Self Organising Maps


Invented by Teuvo Kohonen, the self-organising map is an unsupervised artificial neural network
(ANN) that uses competitive learning. It provides a mechanism for visualising complex
relationships in multi-dimensional datasets and serves as a tool for clustering, visualisation and
dimension reduction. "Given an N-dimensional cloud of data points, the SOM will seek to place an
arbitrary number of nodes within the data space such that the distribution of nodes is representative
of the multi-dimensional distribution function, with the nodes being more closely spaced in regions
of high data density."

Concluding Remarks
Please note that all the major topics presented above will be illustrated with examples during the
practical session using relevant software, e.g. MS Excel, SPSS and MAKESENS (freeware), as
contained in the tasks and tutorials for this course.
