by
Haoze Zhang
Master of Science
in
Mining Engineering
ABSTRACT
Geostatistical modeling involves data preprocessing, modeling, and postprocessing. Exploratory data analysis (EDA) is an early step in preprocessing. It provides the characteristics of data and helps identify erroneous or inconsistent data. In the context of geostatistics, missing data and below detection limit (BDL) data are important anomalies to understand in EDA. Missing data are problematic in EDA techniques such as principal component analysis (PCA). BDL data also cause problems when conducting cluster analysis and other analyses. Geostatistical models need to be built within stationary domains, so multivariate and spatial cluster analysis is another important aspect of EDA. It separates data into smaller groups in which data share similar characteristics.
This thesis covers multiple aspects of geostatistical EDA. A data map examines missing data, showing the number of missing data in each variable and at each location. A combined permutation and Kolmogorov–Smirnov (KS) test identifies whether the missingness in a variable is systematic. BDL data are investigated with univariate and bivariate methods. A BDL statistics table complements histograms, and three methods measure the spikiness of data. Bivariate analysis compares observed distributions with the expected distributions that would hold if the BDL occurrences were fully independent. The Kullback–Leibler (KL) test quantifies the difference between the distributions, identifying combinations of variables in which the BDL occurrence may be dependent. This helps in understanding the reasons for BDL data.
The handling of BDL data in cluster analysis is addressed, including a workflow that finds the
optimal number of clusters. Tests on synthetic data examine the compatibility of the workflow with
different data transformations and clustering methods. K‑means is a suitable clustering method for
dealing with BDL spikes. Four transformations compatible with the workflow are combined with
k‑means to examine clusters in real data. The trade‑off between spatial continuity and multivariate
continuity in cluster analysis is addressed. A novel classification method is proposed to find the
optimal clustering and domain labels. Ensemble clustering provides multiple sets of clustering labels that are used as inputs for the classification. The domains are assigned based on the clustering labels and two hyperparameters: the spatial weight and the number of domains. The matrix of classification results shows that a higher spatial weight results in more continuous domains. Flow simulation results show that the domain label assignment has an impact on the performance of the final geostatistical models, because flow responses are highly sensitive to the spatial continuity of the domains.
DEDICATION
To PW.
ACKNOWLEDGMENTS
I would like to thank my supervisor Dr. Clayton Deutsch. This thesis would not have been possible without your continued guidance and support. Your brilliance and diligence kept motivating me throughout my studies. I would also like to thank the Centre for Computational Geostatistics (CCG) for the financial support. Thanks to my friends at CCG for the help with my questions and, more importantly, the great memories.
TABLE OF CONTENTS
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Problems with Geostatistical Exploratory Data Analysis . . . . . . . . . . . . 3
1.2 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6 Conclusion 94
6.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
References 97
LIST OF TABLES
2.1 The first 5 rows of the table of p value. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Observed KS test results dobs for the synthetic dataset. . . . . . . . . . . . . . . . . . . . . 21
3.1 Univariate distribution information for each variable. The shortened column names are
explained in the context. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Results from the two methods. Left using the quadratic equation and right using the log
equation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 The BDL boundaries for Sn and S in Gaussian space. . . . . . . . . . . . . . . . . . . . . . 34
3.4 Expected probability for Sn and S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Observed probability for Sn and S. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1 The correctness rates for each transform and the clustering methods. . . . . . . . . . . . 61
5.1 Example data of calculating multivariate entropy. Left is the number of data within each
label and domain. Right is the corresponding probabilities. . . . . . . . . . . . . . . . . . 76
5.2 The within group variance of the domains obtained from the classification. . . . . . . . 81
5.3 The entropy measurements of domain sizes obtained from classification. . . . . . . . . . 81
5.4 The merged measurement of the domains performance. . . . . . . . . . . . . . . . . . . . 82
5.5 The correlation matrix of three variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
LIST OF FIGURES
1.1 Flow chart of Geostatistical EDA workflow . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 An example of a data table. Red represents missing data and blue represents observed
data. Data indices are on the vertical axis and the variable names are on the horizontal
axis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 The plot of the data and highlighted missing part. The figure is divided by dashed lines
showing the complete and incomplete datasets. . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 The zoomed‑in plot of columns (variables) to be dropped. . . . . . . . . . . . . . . . . . 11
2.4 The partial observations of three variables from the complete dataset. . . . . . . . . . . . 12
2.5 KS test for two distributions. The blue vertical line is the result d. . . . . . . . . . . . . . 13
2.6 The distribution of d. The orange line represents dobs . . . . . . . . . . . . . . . . . . . . . 13
2.7 The p value for all the combinations of missing and non‑missing variables. . . . . . . . . 15
2.8 The histograms of the subsets of P given the missing variable Sn. . . . . . . . . . . . . . 16
2.9 The pr dataframe after combining the p value with the relevance. . . . . . . . . . . . . . 17
2.10 The plots of dataframe showing the level of missingness considering the missing size
and relevance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.11 The synthetic data map. The highlighted columns are to be dropped. . . . . . . . . . . . 20
2.12 The histogram plots of d distributions for different variables . . . . . . . . . . . . . . . . 21
2.13 The results of the synthetic data p from permutation test. . . . . . . . . . . . . . . . . . . 22
2.14 The results of the synthetic data from permutation test considering missing size and
relevance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.9 Percentage of dependence for each combination of the two variables. Only the combina‑
tions with a percentage larger than 10% are shown. . . . . . . . . . . . . . . . . . . . . . 39
4.1 New clustering methods can handle complex clusters such as the moon shape clusters
(Fred & Jain, 2005). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 The original synthetic data (left) and the data with synthetic spikes (right). . . . . . . . . 45
4.3 Uniform transformed data, spreading out the spikes. The marginal distributions are
shown on the edges. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4 The silhouette coefficient for data and the corresponding clusters when using k‑means
and NC=5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.5 The silhouette coefficient for different NC. . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.6 The gap statistic and the corresponding log(Wk ) for reference and data in a range of NC. 48
4.7 The prediction strength over a range of N C. . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.8 K‑means results of the transformed data. . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.9 Linearly rescaled synthetic data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.10 Gap statistic (left) and silhouette coefficient (right) on the linearly transformed data. . . 53
4.11 K‑means (left) and GMM (right) clustering results on the linearly transformed data. . . 53
4.12 Synthetic data with outliers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.13 Gap statistic (left) and silhouette coefficient (right) on linearly scaled data containing
outliers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.14 Results of k‑means clustering using cluster number of 8. The right plot is a zoomed‑in
scatter plot of the region of interest. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.15 Results of k‑means clustering using cluster number of 5. . . . . . . . . . . . . . . . . . . 55
4.16 Results of gap statistic and silhouette coefficient using GMM when data are uniform
transformed with spikes spread. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.17 Results of GMM clustering when data are uniform transformed with spikes spread. . . 56
4.18 Synthetic data after uniform transform and spikes preserved. . . . . . . . . . . . . . . . 57
4.19 Results of gap statistic and silhouette coefficient using k‑means and GMM when data
are uniform transformed with spikes preserved. . . . . . . . . . . . . . . . . . . . . . . . 57
4.20 Resulting clusters from k‑means (left) and GMM (right) when data are uniform trans‑
formed with spikes preserved. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.21 Gaussian transformed synthetic data with spikes spread out. . . . . . . . . . . . . . . . . 58
4.22 Results of gap statistic (left) and silhouette coefficient using k‑means and GMM (right)
when data are Gaussian transformed with spikes spread out. . . . . . . . . . . . . . . . . 59
4.23 Clustering results using k‑means (left) and GMM (right) when data are Gaussian trans‑
formed with spikes spread. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.24 Gaussian transformed synthetic data with spikes preserved. . . . . . . . . . . . . . . . . 60
4.25 Results of gap statistic (left) and silhouette coefficient using k‑means and GMM (right)
when data are Gaussian transformed with spikes preserved. . . . . . . . . . . . . . . . . 60
4.26 Clustering results using k‑means (left) and GMM (right) when data are Gaussian trans‑
formed with spikes preserved. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.27 Ten samples on a 2d space still give two clusters. . . . . . . . . . . . . . . . . . . . . . . . 62
4.28 Gap statistic and silhouette coefficient results for different transform methods using k‑
means. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.29 The planes illustrating two clusters in real data. . . . . . . . . . . . . . . . . . . . . . . . 65
5.1 k‑means clustering results on 2D multivariate data. Left represents the multivariate la‑
bels. Right is the domain distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 k‑means clustering results on 2D spatial data. Left represents the multivariate labels.
Right is the domain distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3 An illustration of ensemble clustering method. Left four plots are individual clustering
results. Right plot is the merged ensemble clustering result. . . . . . . . . . . . . . . . . 68
5.4 The within cluster sum of squares (WCSS) and entropy are negatively correlated (Martin,
2019). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.5 An illustration of three linkage methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.6 An example of dendrogram using 20 data. x axis is the index of data. y axis is the distance
between data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.7 The distance matrix calculated from the ensemble clustering method. . . . . . . . . . . . 72
5.8 The dendrogram calculated from distance matrix. Each node on x axis represents a data
point. y axis represents the data distance. . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.9 The result of ensemble clustering on the real data. The x and y axes represent location.
Different colors represent different groups. . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.10 The result of k‑means clustering on the real data. The x and y axes represent location. Dif‑
ferent colors represent different groups. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.11 An illustration of a local search window. The window is marked as a blue circle. . . . . 76
5.12 The classification of the domains when spatial weight is set to 0. Left is the input clus‑
tering labels. Right is the classified domains. . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.13 6 sets of clustering labels obtained from ensemble clustering. . . . . . . . . . . . . . . . . 79
5.14 The matrix of domains, given multiple Wsp and number of domains. . . . . . . . . . . . 80
5.15 The correlation matrix of the data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.16 The 2D scatter plots of the multivariate data. . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.17 The cluster labels used as inputs for the domain classification. . . . . . . . . . . . . . . . 84
5.18 The domain labels and the location map of the three variables. . . . . . . . . . . . . . . . 85
5.19 The domain labels in multivariate space. Upper row for Wsp = 0.0. Lower row for
Wsp = 0.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.20 Categorical modeling of the domains with grid size 50 × 50. . . . . . . . . . . . . . . . . 86
5.21 The scatter plots of the variables after projection pursuit multivariate transform (PPMT).
Each row represents the transformed multivariate data in each domain. . . . . . . . . . 87
5.22 The variograms of variables in each domain for Wsp = 0.0. . . . . . . . . . . . . . . . . . 88
5.23 The variograms of variables in each domain for Wsp = 0.7. . . . . . . . . . . . . . . . . . 88
5.24 One of the realizations of three variables after merging the domain labels. The upper
row is for Wsp = 0.0. The lower row is for Wsp = 0.7. . . . . . . . . . . . . . . . . . . . . 89
5.25 The scatter plots of the variables from original data and the realizations of Wsp = 0.0
and Wsp = 0.7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.26 The realizations of permeability model using universal thresholds. The upper row for
Wsp = 0.0. The bottom row for Wsp = 0.7. . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.27 One realization of the flow path. The left margin has a hydraulic head of 10 m. The right
margin has a hydraulic head of 0 m. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.28 The histograms of the arrival time for quantile 0.15 (left) and quantile 0.85 (right) par‑
ticles of 100 realizations (permeability converted from universal thresholds case). Blue
histograms represent Wsp = 0.0 breakthrough times and orange histograms represent
Wsp = 0.7 breakthrough times. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
LIST OF ABBREVIATIONS
Abbreviation Description
2‑D Two‑dimensional
3‑D Three‑dimensional
BDL below detection limit
BU Bayesian Updating
CDF cumulative distribution function
EDA exploratory data analysis
GMM Gaussian mixture model
GSLIB Geostatistical software library
KL Kullback–Leibler
KS Kolmogorov–Smirnov
MAR missing at random
MCAR missing completely at random
MNAR missing not at random
NC number of clusters
PCA principal component analysis
PPMT projection pursuit multivariate transform
WCSS within cluster sum of squares
CHAPTER 1
INTRODUCTION
1.1 Background
Geostatistical modeling uses geological data for resource estimation and the workflow has several
components. Exploratory data analysis (EDA) finds the characteristics of data. If necessary, the an‑
alyzed data are cleaned and imputed (Abrevaya & Donald, 2017; Silva & Deutsch, 2018). The next
step is to conduct the modeling, including transforming data to Gaussian space, variogram infer‑
ence, kriging or simulation and back‑transformation (M. J. Pyrcz & Deutsch, 2014). Postprocessing
verifies the models using statistical tools such as cross‑validation (Browne, 2000). To generate accu‑
rate models, high quality input data are of great importance, and good data quality is more likely when EDA
is conducted appropriately.
EDA is an approach to summarize the characteristics of data. The summary can suggest how to collect new data, which data are suitable for further analysis or modeling, and possible reasons for the observed data features (Behrens, 1997; Tukey et al., 1977). The univariate data dis‑
tribution can be summarized by statistical tables. They provide information such as the mean, vari‑
ance and quantiles. Quantiles describe the univariate features more robustly when the data are
highly skewed (Takeuchi, Le, Sears, Smola, et al., 2006). Data visualization is also important in
EDA. It provides direct and concise observations of the univariate and multivariate data distribu‑
tions. Cumulative distribution function (CDF) and histograms examine the univariate properties
of data, which include the range of data, the frequency of data and the data skewness. Box plots
examine data characteristics across multiple categories. In each category, the data is summarized
with the minimum (q0 ), maximum (q100 ), median (q50 ), first quartile (q25 ) and third quartile (q75 ).
Scatter plots show bivariate relations, which can illustrate the correlations or non‑linearity between
variables.
Real data are rarely homogeneous. Data can be missing because of data collection errors. These
missing data can influence the performance of classification or modeling (Ding, Han, Zhao, & Chen,
2015). There are several methods to address the problem, including dropping the missing data, fill‑
ing in the missing data with mean or median, and predicting the missing values with regression
(Little & Rubin, 2019). Adopting which method depends on the nature of missingness. If the miss‑
ingness is random, dropping the data can be feasible. If the missingness is systematic, regression
may be applied (Efron, 1994; Van Buuren, 2018). Outliers are extremely low or high data that appear
1
1. Introduction
far from the majority of the data. Several outliers can drastically influence the results of regression.
For example, in logistic regression, outliers can shift the decision boundaries greatly (Menard, 2002).
Outliers are often omitted or capped, and there are several methods to identify them. For univariate
and bivariate data, outliers can be observed visually using box plots or scatter plots. For high di‑
mensional data, Z‑score can be applied. It measures how many standard deviations data are from
the mean value. Outliers are often identified as data beyond 3 standard deviations (Rousseeuw &
Hubert, 2011).
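As an illustration of the Z‑score screen, a minimal Python sketch is shown below; the sample values and the three‑standard‑deviation threshold are illustrative only and not taken from the thesis dataset.

import numpy as np

def zscore_outliers(values, threshold=3.0):
    # Return a boolean mask marking values more than `threshold` standard deviations from the mean
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Example: 500 well-behaved values plus one extreme value
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(1.0, 0.2, size=500), [25.0]])
print(np.where(zscore_outliers(sample))[0])  # only the last index (500) is flagged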
Advanced multivariate analysis tools explore high‑dimensional relations in data, including prin‑
cipal component analysis (PCA) (Hotelling, 1933) and cluster analysis (Fred & Jain, 2005; Romes‑
burg, 2004). PCA is a dimension reduction technique, and also creates new coordinates that decorre‑
late the original data. For high dimensional data, some dimensions do not show significant variabil‑
ity. By reducing the high dimensional data to fewer dimensions that exhibit the most variations, the
modeling can be faster. The new coordinates (principal components) are orthogonal to each other.
Starting from the first principal component, each component is the direction that minimizes the average squared distance of points to it. Each following component must be orthogonal to the previous ones while minimizing the remaining average squared distance (Abdi & Williams, 2010). Data first need to be standardized
to zero mean and a standard deviation of one. The zero mean is necessary as data need to be ro‑
tated to the new coordinates. The standard deviation of one makes the interpretation of PCA results
easier. Then, the covariance matrix is calculated and decomposed into the eigenvector matrix and
the eigenvalue matrix. The resulting PCA transformed data is calculated by multiplying data with
the eigenvector matrix. PCA calculates the eigenvalues and eigenvectors of multivariate data and
projects data to the eigenvectors. The variability of each eigenvector is reflected by the correspond‑
ing eigenvalue. The resulting coordinates may not be the same as the original ones. Projecting data
to the first several principal components can represent the majority of the data variations.
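The steps above (standardization, covariance eigendecomposition, and projection) can be sketched in a few lines of Python; this is a minimal illustration on random data, not the implementation used in this thesis.

import numpy as np

def pca_transform(data):
    # Standardize to zero mean and unit standard deviation
    X = (data - data.mean(axis=0)) / data.std(axis=0)
    # Decompose the covariance matrix into eigenvalues and eigenvectors
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(eigvals)[::-1]           # sort components by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = X @ eigvecs                        # project the data onto the eigenvectors
    return scores, eigvals / eigvals.sum()      # scores and fraction of variance explained

rng = np.random.default_rng(1)
raw = rng.multivariate_normal([0, 0, 0], [[1.0, 0.8, 0.2], [0.8, 1.0, 0.1], [0.2, 0.1, 1.0]], size=200)
scores, explained = pca_transform(raw)
print(explained)  # the first component captures most of the variation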
Cluster analysis groups similar data together, separating them into subsets that have more dis‑
tinguishable features. Further analysis conducted on the well separated clusters can provide in‑
sight into the data. Data with fewer than four dimensions may be clustered visually. For higher dimensional
data, statistical tools are necessary. The similarity of data is defined differently and this leads to
different types of clustering methods. The most common ones include connectivity‑based cluster‑
ing (hierarchical clustering) (Johnson, 1967), centroid‑based clustering (k‑means) (Krishna & Murty,
1999), distribution‑based clustering (Gaussian mixture model (GMM)) (McLachlan & Basford, 1988;
Reynolds, 2009) and density‑based clustering (Density‑based spatial clustering of applications with
noise (DBSCAN)) (Shen et al., 2016). Hierarchical clustering groups data based on their connectiv‑
ity. Data closer to each other are grouped in early stages, forming intermediate groups. Different
definitions of distance between groups result in different linkage criteria. The commonly used ones
are maximum, minimum and average criteria. Using different linkage criteria can lead to different
clustering of the intermediate groups. The clustering results and the number of clusters can be ob‑
2
1. Introduction
served using a dendrogram (Sander, Qin, Lu, Niu, & Kovarsky, 2003). K‑means clustering groups
data based on the distance between clustering centroids and data. Data are assigned to the nearest
centroids and the new centroids are calculated for the next iteration of cluster assignment. The pro‑
cess continues until the algorithm converges. The results are determined by the number of clusters
and the location of the initially generated centroids, so different initial centroids are generated and
the best clustering results are returned as the final results. GMM shares similar process as k‑means.
The difference is that multiple Gaussian kernels are generated rather than centroids, and data are
assigned based on their probabilities in different Gaussian kernels. Gaussian kernels can handle
clusters with elongate shapes better than k‑means (Lücke & Forster, 2019). The key parameter of
DBSCAN is the radius r. For each data point i, DBSCAN counts the number of data within r. If the count is above a threshold q, the point is defined as a core point. Points within r of other
core points are called directly reachable. Points are called reachable if there is a path for them to be
connected to core points. Points are called noise points when they are not reachable by other points.
For a core point, the cluster is defined as all data reachable from it. Among the many different clustering methods, each type has its own application situations, and no clustering method outperforms all others in every situation.
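For reference, the following sketch shows how three of the clustering methods discussed above (k‑means, GMM, and DBSCAN) map onto common scikit‑learn calls; the synthetic two‑cluster data and all parameter values are illustrative only.

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.3, size=(100, 2)),     # two well separated 2D clusters
               rng.normal([3, 3], 0.3, size=(100, 2))])

# Centroid-based: several random initializations, the best result is kept
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Distribution-based: each datum is assigned to its most probable Gaussian kernel
gmm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
# Density-based: eps plays the role of the radius r, min_samples the threshold q
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)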
Geostatistical EDA focuses on understanding missing data, below detection limit (BDL) data and
outliers (Prades, 2017). Geostatistical missing data can come from the high cost of acquiring drill
hole data, or data collection errors. Sometimes data are missing in a variable because the data
in other variables are below a threshold (Little & Rubin, 2019). Missing data can occur for most
variables at some locations or in several variables at many locations. It can cause problems for
multivariate analysis such as PCA. If some variables are missing, the coordinates in the multivariate
space are unknown, and these data cannot be used in PCA. To handle the missing data, the nature
of missingness needs to be understood. BDL data originate primarily from measurement equipment
limitations (Palarea‑Albaladejo & Martin‑Fernandez, 2013). The concentrations of some elements
are so low that measurement equipment cannot detect them. They are often recorded as 0.0 and
form spikes in histograms. These spikes are problematic in quantile transformation (Prades, 2017).
The data in spikes can be spread out from low to high values, a process known as despiking (Verly, 1984). How the quantiles are assigned to the data in spikes results in multiple despiking methods. Different ways of
transforming spikes can lead to different EDA results. To find an appropriate way of despiking
the BDL data, the characteristics of the BDL spikes and the dependence of BDL occurrence are
examined. Outliers are data with extremely high values, sometimes orders of magnitude larger than
the mean of data. They may come from very high concentrations or data collection errors. Outliers
can also appear as extremely low values, orders of magnitude less than the majority of data. The
corresponding distribution is negatively skewed such as Fe or SiO2 . Outliers can cause problems
when clustered with centroid‑based methods (Chawla & Gionis, 2013; Prades, 2017). They shift the
centroids drastically compared with ordinary data, resulting in inaccurate clustering results or an
incorrect number of clusters.
Applying advanced EDA methods to geostatistical data helps identify stationary domains. Clus‑
ter analysis finds such domains in multivariate space, but there are several factors that affect the
performance of cluster analysis. The multivariate clusters cannot be identified simply through uni‑
variate or bivariate plots. Different clusters may not be obvious until analyzed in high‑dimensional
space, so statistical tools are of great importance for cluster analysis. The number of clusters is the
most important hyperparameter for popular clustering methods (Milligan & Cooper, 1985; Tibshi‑
rani, Walther, & Hastie, 2001). Setting different numbers of clusters can lead to different clustering
results. Therefore, robust methods to find the correct number of clusters are needed. The anomaly
data mentioned above can also affect the performance of clustering analysis. GMM can falsely as‑
sign a Gaussian kernel only for the BDL spikes, so different transformations are compared and the
appropriate ones are used to amend the problems caused by anomaly data. Validation methods
are also important to ensure the clustering results are trustworthy. Cross‑validation applied to
clustering results finds if data are clustered or partitioned.
Geostatistical data have two components, the spatial arrangement of the values and the multi‑
variate data values. Cluster analysis groups data in multivariate space. Data are labeled to ensure
multivariate continuity, but the corresponding spatial distribution of the labels (domains) may be
scattered. Such scattered domains can lead to unstable variograms, which further influences
the performance of modeling. To obtain continuous domains, cluster analysis could be considered
with spatial data only. In this case, spatial continuity is ensured, but the clustering labels may be
scattered in multivariate space. It is not recommended to cluster spatial data directly because of the
complexity in the shape and geometry of geological domains. There is a clear trade‑off between the
multivariate and spatial continuity (Martin, 2019). It would be beneficial to modeling performance
if optimal clustering labels ensure reasonable continuity both in multivariate and spatial space.
This thesis addresses selected problems in geostatistical EDA, including missing data, BDL data, dif‑
ferent transformations of BDL in data cluster analysis, and the trade‑off effect of clustering labels
between spatial and multivariate continuity. Chapter 2 examines missing data in a geochemical
dataset. A data map shows the information about the missingness in variables and locations. Mul‑
tiple statistical tools determine if the missingness are random or systematic. Chapter 3 explores
the BDL data in the same dataset. The data are analyzed in both univariate and bivariate ways.
A BDL statistical table complements histograms. Three different methods evaluate data spikiness.
Bivariate analysis of the BDL data examines if the occurrence of BDL between two variables are in‑
dependent. Chapter 4 compares the effects of different transformations of data on cluster analysis.
Different transformations are considered to handle the potential problems caused by BDL spikes
and a workflow is proposed to identify the optimal number of clusters. This chapter also inves‑
tigates the compatibility of the workflow with different transformations and clustering methods.
Chapter 5 aims at finding an optimal set of domain labels which ensures both multivariate and spa‑
tial continuity. Ensemble clustering is used to cluster multivariate data. Then, a novel classification
method classifies domains given the clustering labels and spatial configuration of the data.
The tools covered in this thesis can formulate a flowchart shown in Fig.1.1. Missing data analysis
should be conducted first for the imputation of data. BDL analysis should be conducted next before
the spikes are despiked or preserved in cluster analysis. Cluster analysis and domain classification
can be conducted simultaneously.
CHAPTER 2
MISSING DATA ANALYSIS
2.1 Introduction
2.1.1 Background
As sampling equipment improves and technical decisions become more challenging, an increasing
amount of multivariate data are acquired for geostatistical analysis. Although multivariate data
provide extra information, not all of the data are homotopic (equally sampled). Some variables are
missing due to cost, data vintage, and other considerations, and this may cause problems during
analysis. First, it could lead to undefined values in compositional data calculation. In Aitchison
(1982); Pawlowsky‑Glahn, Egozcue, and Tolosana‑Delgado (2015), the classical definition of com‑
positional data excludes the possibility of missing data. Arbitrarily assigning mean or zero values to the missing entries does not satisfy the sum-to-unity constraint. Another problem with missing data
is encountered when conducting PCA (Abdi & Williams, 2010; Hotelling, 1933). PCA is an effective
method to reduce the dimensionality of multivariate data. It finds the dimensions that capture the
most variability and projects the full‑dimensional data onto these dimensions. The projected data
are also decorrelated, which simplifies geostatistical modeling. In PCA, the heterotopic (unequally
sampled) data cannot be used, because their locations in the full‑dimensional space are unknown.
With the potential problems caused by missing data, the missing data need to be imputed. Un‑
derstanding the nature of missingness is the first step before data imputation. It helps decide which
imputation method should be applied. For example, if the data are missing at random, the tradi‑
tional imputation methods do not introduce bias. There are three types of missingness: missing
completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR)
(Rubin, 1976). MCAR means the missing data occurrence is independent of observed and missing
samples. It can happen for some systematic reasons but the missing values are completely random
with respect to the data values. In this case, simply omitting the missing data does not introduce
bias, but MCAR rarely occurs. MAR means the missingness occurrence is dependent on observed
data but does not depend on missing data. For example, some variables are not sampled when other
variables are below a threshold. The missingness is classified as MNAR when it does not belong to
the previous two classifications. In this case, the missingness is dependent both on observed and
missing data. This is the most challenging scenario as the imputation based only on observed data
can introduce bias.
To avoid the aforementioned problems, the missing data can be dropped, but this may lead to
bias, especially when the nature of missingness is not random. Another solution is to impute the
missing data. Different from the well‑established theories of imputation in other fields (Enders,
2010), imputed geostatistical data should retain multivariate relationships and spatial structure.
One approach is to use Bayesian Updating (BU) (Barnett & Deutsch, 2015; Doyen, Den Boer, Pillet,
et al., 1996). The missing data are informed by different data sources. One source is the data
of the same variable from other locations (primary data). The other is the collocated data at the
location of the missing value (secondary data). The two results are merged and the final result
is sampled to create multiple possibilities. Gaussian mixture model (GMM) imputation (Silva &
Deutsch, 2018) considers a non‑parametric fitting of the multivariate distribution to adapt to more complex
features. The imputed data carry the uncertainty through subsequent analysis.
The real data used in this chapter come from the Government of the Northwest Territories. It is
part of the National Geochemical Reconnaissance stream sediment and water survey and the field
collected data serve the purpose of building a geochemical database for mineral potential. There
are three types of samples: stream silt samples, stream water samples, and bulk stream sediment
samples (Falck et al., 2012). The dataset consists of 51 variables (elements) and about 8500 data
samples. The dataset has missing data, below detection limit data, and outliers. Here, the missing
data are examined.
In this chapter, the difference between MCAR and MAR is examined through numerical anal‑
ysis. The term systematic missingness refers to MAR. First a data map is generated showing the
general information of the missingness, including the missing data location, variables containing
the most missing data, and an optimal dataset that contains no missingness. Then, a statistical tool is
developed to explore the nature of missingness. It compares the two subsets of complete variables,
where the target missing variable is present and absent. The developments are demonstrated with
the Northwest Territories dataset. The tool is further validated on a synthetic dataset generated from the real data.
In this section, missing data notations are introduced first. Then, basic information about the miss‑
ingness in the real dataset is presented through a data map. A procedure to find the optimal subset
of data containing no missing values is proposed and the optimal data subset are highlighted in the
data map.
Suppose there is a data table (Fig.2.1). Rows represent data and columns represent variables. The
data are denoted $\{Z_{\alpha,k};\ \alpha = 1, \cdots, N;\ k = 1, \cdots, K\}$ where $N$ represents the number of data and $K$ represents the number of variables. $Z_{\alpha,k}$ is a number when the data are recorded above or below detection limit (recorded as 0.0), and a null value or not a number (NAN) when the data are missing. Some terms are defined as follows: “data” refers to multivariate observations $\alpha$ $\{\alpha = 1, \cdots, N\}$, “variable” refers to all observations for a specific column $k$ $\{k = 1, \cdots, K\}$, and “value” refers to specific entries $Z_{\alpha,k}$ in the data table. When calculating the number of missing data $P$ in row $\alpha$,
$$P_\alpha = \sum_{k=1}^{K} 1\{Z_{\alpha,k} = \mathrm{NAN}\}, \quad \alpha = 1, \cdots, N,$$
where $1\{\mathrm{True}\} = 1$ and $1\{\mathrm{False}\} = 0$. When calculating the number of missing data $M$ in column $k$,
$$M_k = \sum_{\alpha=1}^{N} 1\{Z_{\alpha,k} = \mathrm{NAN}\}, \quad k = 1, \cdots, K.$$
The following equation holds for the total number of missing data $N_t$:
$$N_t = \sum_{\alpha=1}^{N} P_\alpha = \sum_{k=1}^{K} M_k.$$
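These counts are straightforward to compute from a data table; a small pandas sketch is given below using a hypothetical table, where NaN plays the role of the missing value indicator.

import numpy as np
import pandas as pd

# Hypothetical data table with NaN marking missing values
Z = pd.DataFrame({"Ag": [0.3, np.nan, 0.5, 0.1],
                  "Sn": [np.nan, np.nan, 2.0, 1.5],
                  "Zr": [10.0, 12.0, np.nan, 9.0]})

P = Z.isna().sum(axis=1)              # P_alpha: missing values in each row (observation)
M = Z.isna().sum(axis=0)              # M_k: missing values in each column (variable)
N_t = int(Z.isna().to_numpy().sum())  # total number of missing values

assert P.sum() == M.sum() == N_t      # the identity N_t = sum of P_alpha = sum of M_k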
Figure 2.1: An example of a data table. Red represents missing data and blue represents observed data. Data indices are on the vertical axis and the variable names are on the horizontal axis.
Consider the data map in Fig.2.2. The data are ordered based on the number of missing data in
each row (data observation) and column (variable). Blue represents available data and red repre‑
sents missing data. The plots on the edges show the marginal distributions of the number of missing
data. Data observations closer to the bottom have more missing variables. There are around 50 un‑
sampled locations. Variables closer to the right edge have more missing observations. 5 variables
(Sr, Sn, F, Zr, B) contain a large number of missing data, and Zr and B have almost 50% of miss‑
ing data. The area is divided into four regions: one region with complete rows and columns, two
regions with either incomplete rows or columns, and one region with both incomplete rows and
columns. Since there are no complete columns, the vertical line representing the complete/incom‑
plete column boundary is overlapped with the left margin. The dash line representing complete/in‑
complete rows is shown in the figure. The first table in the figure shows the basic information
about the missingness in data. For example, there are about half complete data locations and half
incomplete data locations.
If a dataset with no missing data is required and imputation is not an option, the missing data can
be dropped. Because only a complete row or column can be omitted, there is an optimal dataset that
contains the most remaining data. Since it costs more data to drop a column than a row (dropping
a column eliminates more than 8500 data, while dropping a row only costs about 50 data), and there are fewer missing data in columns close to the left margin, variables are looped starting from the leftmost column.
1. For the current column, find the rows that contain missing values in that column.
2. If dropping these rows costs less data than dropping the column, drop the rows. Otherwise,
drop the column.
3. Move to the next column. Repeat the process on the clipped dataset.
In the early stage of the procedure, rows are dropped. For many variables in the middle of the data
map, there are no missing data to be found as the missing rows have already been clipped. When the loop
reaches the last five variables, dropping the columns costs less data than dropping the missing rows.
The columns and rows to be dropped are highlighted light blue in Fig.2.2. The information about
the optimization is tabulated in the second table. There are 51 variables, and 5 variables are dropped.
90% of the variables are preserved. There are 433602 data, and 46466 data are dropped. The opti‑
mal dataset has 89% remaining data. This algorithm for choosing a large number of homotopic
data aims at preserving the most data after dropping the missing data, and it may be overridden
by understanding that some observations or some variables are important so they should not be
dropped.
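A simplified sketch of this greedy dropping procedure is given below; it follows the idea described above (visit columns from fewest to most missing values and drop whichever of the missing rows or the whole column costs fewer values), but it is not the exact implementation used for Fig.2.2.

import pandas as pd

def greedy_drop(df):
    # Visit columns from fewest to most missing entries
    for col in df.isna().sum().sort_values().index:
        missing_rows = df.index[df[col].isna()]
        row_cost = len(missing_rows) * df.shape[1]   # values lost by dropping the incomplete rows
        col_cost = df.shape[0]                       # values lost by dropping the whole column
        if row_cost <= col_cost:
            df = df.drop(index=missing_rows)
        else:
            df = df.drop(columns=[col])
    return df                                        # a complete (homotopic) subset

# complete_subset = greedy_drop(data_table)          # 'data_table' is the heterotopic data table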
Figure 2.2: The plot of the data and highlighted missing part. The figure is divided by dashed lines showing
the complete and incomplete datasets.
In this section, a method to explore the nature of missingness is introduced and the results are visu‑
alized by three different plots. For comparison, further analysis considering the relevance between
variables and the size of missing data is conducted on the observed results. As observed in Fig.2.3,
the last five columns (missing variables) are to be dropped. Doing so excludes more than 40,000 po‑
tentially useful data. Instead, the missing data should be imputed. If the missingness is MCAR, the missing
data can be imputed by traditional methods. Otherwise, more advanced techniques (BU, GMM)
should be applied.
To understand the nature of missingness, missing variables are compared with non‑missing
variables (the variables kept after optimization). Consider Fig.2.4. Sn is the missing variable and Ag
is the non‑missing variable. The two subsets of Ag where Sn is present and absent are compared,
and they are denoted $X_{Ag|Sn} = \{Z_{\alpha,Ag} \mid Z_{\alpha,Sn} \in \mathbb{R}\}$ and $X_{Ag|NoSn} = \{Z_{\beta,Ag} \mid Z_{\beta,Sn} = \mathrm{NAN}\}$
respectively, where α and β refer to data locations (rows). If the two subsets show very different
patterns, we may conclude the missingness is not random. The reason for this comparison is that
the collocated data variables are related as they are collected at the same location. The reason for
not comparing between the missing variables is clear: when Sn is present F could be absent. This
decreases the available data for comparison.
The common quantitative approaches to measure the difference between the two distributions
XAg|Sn and XAg|NoSn, such as comparing the mean or the median, ignore the shape of the distributions. The KS test (Young, 1977) solves this problem. It measures the maximum distance d between
the cumulative distribution function (CDF) of different distributions. By definition, the KS result
d ∈ [0, 1]. The bigger the d value, the more different the two distributions are. As shown in Fig.2.5,
the maximum distance between the CDFs is marked by the blue line. The two distributions are
Gaussian distributions with different means and standard deviations. The red distribution has a
mean of 1.04 and a standard deviation of 0.96, and the gray distribution has a mean of 1.82 and a
standard deviation of 2.04. The two distributions are different, so the maximum distance d is equal
to 0.31. Since the CDFs are compared, the center and the shape of the distributions are both con‑
sidered. The problem with the KS test is that different sample sizes could lead to artificial errors, such as a high d value due to few data. Another issue is that different variables have different baseline d values, so it is difficult to set a single threshold on d to distinguish MCAR and MAR for all variables.
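The d statistic itself can be obtained directly from scipy; in the sketch below the two subsets are simulated stand‑ins for XAg|Sn and XAg|NoSn, since the actual data are not reproduced here.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
ag_with_sn = rng.lognormal(mean=0.0, sigma=1.0, size=4000)    # Ag where Sn is present (simulated)
ag_without_sn = rng.lognormal(mean=0.6, sigma=1.2, size=600)  # Ag where Sn is missing (simulated)

d_obs = ks_2samp(ag_with_sn, ag_without_sn).statistic         # maximum distance between the CDFs
print(round(d_obs, 3))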
Figure 2.4: The partial observations of three variables from the complete dataset.
The permutation test (Odén, Wedel, et al., 1975) is proposed, combined with KS test to solve
the issues above. Suppose there are two subsets X1 and X2 with n1 and n2 data respectively. The
whole sample X = X1 + X2 with n = n1 + n2 data. First, X1 and X2 are compared. Then n1 and n2
samples are randomly drawn without replacement from X, m times, as if they were from the same
population. For each iteration i, the n1 subsample is denoted Xi1 and the n2 subsample is denoted
Xi2 . The m pairs of Xi1 and Xi2 are compared. If the comparison of the observed subsamples X1 and
X2 shares similar features with the comparisons of the m pairs, X1 and X2 may come from the same
population. Otherwise, they belong to different populations. KS test is conducted on XAg|Sn and
XAg|NoSn, obtaining dobs, and the sample sizes are n1 and n2 respectively. Then n1 and n2 data are
sampled from XAg , obtaining XAg1 and XAg2 . The KS test is conducted on the random sampled
subsets to obtain di . The resampling is iterated 1000 times (adequate in this case), and a pool of
d = [d1 , d2 , · · · , d1000 ] is obtained. The observed dobs is compared with d to decide how different
the two observed samples really are. Fig.2.6 shows the results of the permutation test. The vertical
Figure 2.5: KS test for two distributions. The blue vertical line is the result d.
orange line shows the value of dobs , and the sampled d form the blue histogram. The histogram
of d shows that, if two distributions are sampled from the same population, their difference in the KS test
should be around 0.03. The observed value is far from the cluster of random samples, so the subsets
XAg|Sn and XAg|NoSn are not randomly drawn from the same population, and the missingness in
Sn may be systematic.
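A minimal sketch of this permutation procedure is shown below; it pools the two subsets, redraws splits of the original sizes, and collects the null distribution of d. The subset names reuse the simulated stand‑ins from the previous sketch.

import numpy as np
from scipy.stats import ks_2samp

def permutation_ks(x1, x2, m=1000, seed=0):
    # Null distribution of the KS distance d when the two subsets come from one population
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x1, x2])
    d_obs = ks_2samp(x1, x2).statistic
    d_null = np.empty(m)
    for i in range(m):
        perm = rng.permutation(pooled)                 # random split with the original sizes
        d_null[i] = ks_2samp(perm[:len(x1)], perm[len(x1):]).statistic
    return d_obs, d_null

d_obs, d_null = permutation_ks(ag_with_sn, ag_without_sn)
p = (d_obs - d_null.mean()) / d_null.std()             # the standardized measure p defined below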
Sn can be compared with the other 46 non‑missing variables to obtain a cumulative measurement
of the systematic missingness. The same procedure is conducted on the other 4 missing variables.
The cumulative results of different missing variables show which missing variable has the most
systematic missingness. Since different combinations of missing and non‑missing variables can
generate multiple sets of d, a universal quantitative measurement of the missingness that applies
to all combinations is needed. This measurement is denoted as p and p is calculated as:
$$p = \frac{d_{obs} - d_{mean}}{\sigma_d}$$
where dmean and σd are the mean and standard deviation of d. p measures how many standard
deviations the observed d is from the mean of d. d represents the set of the KS test results when two
subsets are drawn from the same population. In this way, the dobs is standardized and a threshold
value p can be set as the distinction between MCAR and MAR. Algorithm 1 summarizes the procedure to calculate the cumulative measurement of the missingness in pseudocode.
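Algorithm 1 itself is not reproduced here; the sketch below only outlines the same looping idea, building the table of p values with the permutation_ks helper from the previous sketch. The dataframe and column names are hypothetical.

import pandas as pd

def p_table(df, missing_vars, complete_vars, m=1000):
    # Rows are the non-missing (complete) variables, columns are the missing variables
    table = pd.DataFrame(index=complete_vars, columns=missing_vars, dtype=float)
    for mv in missing_vars:
        present = df[mv].notna()
        for cv in complete_vars:
            x1 = df.loc[present, cv].dropna().to_numpy()    # subset where mv is observed
            x2 = df.loc[~present, cv].dropna().to_numpy()   # subset where mv is missing
            d_obs, d_null = permutation_ks(x1, x2, m=m)
            table.loc[cv, mv] = (d_obs - d_null.mean()) / d_null.std()
    return table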
2.4.1 p measurement
Sr Sn F Zr B
Ag 22.361275 33.367028 31.659201 17.397683 17.397683
Fe 23.506660 32.704326 19.658435 25.118166 25.118166
Zn 17.654142 30.638435 24.279118 18.707961 18.707961
Mo 15.892016 23.819519 20.156665 7.322113 7.322113
Na 18.245105 13.962434 2.526203 11.726063 11.726063
Table 2.1: The first 5 rows of the table of p value.
Table 2.1 shows the dataframe obtained from Algorithm 1. Columns are the missing variables
and rows are the non‑missing variables. p shows how different the two subsets of the non‑missing
variables are. pSn|Ag > pSn|N a means Ag indicates the systematic missingness more strongly than
Na. The dataframe is also plotted in three different formats (Fig.2.7). Each of the three plots fo‑
cuses on different aspects. The first plot shows the top 5 non‑missing variables giving the highest
p for each missing variable, and the cumulative p is also shown. The second plot arranges the non‑
Figure 2.7: The p value for all the combinations of missing and non‑missing variables.
missing variables in the same order. The magnitude of p value for a specific combination is easy to
find. The third tornado chart ranks the non‑missing variables. It is convenient to find the ranking
of the non‑missing variables and their corresponding p value, but it is hard to find a specific com‑
bination and the cumulative p value on the tornado chart. As observed from the figure, Sn exhibits
the most systematic missingness. P is the variable that indicates the missingness in Sr, Sn and F most strongly.
The difference of the subsets of P is examined in Fig.2.8. The blue histogram shows the distribu‑
tion of XP|Sn, and the orange histogram shows the distribution of XP|NoSn. Note the numbers of
data are different, and the histograms are normalized. The statistics table shows the centers of the two distributions are significantly different, which validates the high p value of this combination.
Figure 2.8: The histograms of the subsets of P given the missing variable Sn.
The results above measure the difference between the two observed subsets of the collocated vari‑
ables, but it does not necessarily indicate the difference between observed and missing data. For
example, XP|Sn and XP|NoSn are significantly different, but the difference between XSn and XNoSn
may not be significant if P and Sn are unrelated. Since the collocated data are related, this rele‑
vance can link the measurement of non‑missing variables with missing variables. At the locations
where P and Sn are both present, equal size KS test is applied to find the relevance between the
two variables. Since different variables share different units, to compare their CDFs, each variable
is standardized. In this case, KS test mainly identifies the shape difference. For variable pairs shar‑
ing low d value, the two distributions share similar shape, and the relevance between variables is
calculated as r = 1 − d. The p value considering the relevance is denoted as pr and is calculated as
pr = r · p (e.g. prAg|Sr = rAg|Sr · pAg|Sr). The pr value is plotted in Fig.2.9. The relative cumulative pr
value between missing variables is adjusted. Zr has the lowest cumulative pr value, which means
Zr has less relevance with the other variables. Sn still shows the most systematic missingness, but
the order of the important non‑missing variables has changed. This can be seen in the
tornado chart. The most significant non‑missing variable for Sr changes to Ag, which means the
difference between XAg|Sr and XAg|NoSr is more representative for XSr and XNoSr.
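The relevance adjustment can be sketched as follows; the helper assumes two collocated, equally sized samples (locations where both variables are present), and the variable names in the usage comment are illustrative only.

import numpy as np
from scipy.stats import ks_2samp

def relevance(a, b):
    # r = 1 - d between two standardized, collocated samples of equal size
    a = (np.asarray(a, dtype=float) - np.mean(a)) / np.std(a)
    b = (np.asarray(b, dtype=float) - np.mean(b)) / np.std(b)
    return 1.0 - ks_2samp(a, b).statistic

# pr_ag_sr = relevance(ag_both, sr_both) * p_ag_sr   # pr = r * p for one combination (names hypothetical)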
Furthermore, the relative size of missing and non‑missing data can be considered. If the missing
data take up only 5% of the total data, the missingness may not be a major concern, so the mea‑
surement of missingness should be low and this measurement is denoted as pn. Note that a low value
does not mean the missingness is random. In Fig.2.9, Sr has a higher cumulative pr value than Zr
Figure 2.9: The pr dataframe after combining the p value with the relevance.
while the missing data size of Sr is 20 times less than that of Zr. Taking this factor into consideration,
the pr value of each non‑missing variable is divided by the ratio of non‑missing and missing size.
For example, Sr has 8000 non‑missing data and 200 missing data, so the ratio is 40. Each pr value
in Sr is divided by 40. Fig.2.10 shows the plots after this adjustment. The relative importance of
non‑missing variables does not change for missing variables. The ranks of cumulative pn change
significantly compared with that of pr. Sr and Sn have the least pn, indicating their systematic miss‑
ingness is not very concerning considering their missing data size. Observing the middle plot, the
magnitudes of missingness can be divided into three parts. The minimum is Sr. It has the least
missing data and the cumulative pn is smaller than 10. The second part includes F and Sn. Their
cumulative pn values range from 50 to 100. Zr and B have the highest pn, above 250, indicating their
systematic missingness is the most concerning among missing variables.
Figure 2.10: The plots of dataframe showing the level of missingness considering the missing size and rele‑
vance.
In this section, a synthetic dataset containing systematic and random missing data is used for val‑
idating the previous methods. A full dataset which contains no missing data is generated. Then,
data are dropped randomly and systematically. If the method is robust, it should identify the basic
information and the mechanism of the missingness correctly. The results show the method is robust
and the mechanism of systematic missingness is also understood.
The full dataset to start with is the optimal dataset obtained in Fig.2.2, because the variables
in the synthetic data should be intrinsically related. 20 variables are randomly drawn from the
non‑missing variables, and the data observations are shuffled. Variables Sc and W are dropped randomly with missing sizes of 500 and 3000 respectively. Variables Th, Cu, Ca and La are dropped
systematically as follows:
$$z_{Ca} \in \left[\tfrac{3}{4}\, m_{Ca},\ 1.5\, m_{Ca}\right], \qquad z_{La} \in \left[\tfrac{1}{4}\, m_{La},\ 1.2\, m_{La}\right]$$
where $z_i$ represents the data dropped in variable i (i = Th, Cu, Ca, La) and $m_i$ is the full‑dataset mean of variable i. For example, in variable Th, the data below one third of the mean and above 1.5 times the mean are dropped.
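The dropping scheme can be sketched in pandas as below; the two helpers mimic the random drops (as for Sc and W) and the systematic range drops (the Ca rule above), and the dataframe names are illustrative.

import numpy as np
import pandas as pd

def drop_random(series, n, seed=0):
    # MCAR-style drop: set n randomly chosen entries to NaN
    out = series.copy()
    idx = np.random.default_rng(seed).choice(out.index, size=n, replace=False)
    out.loc[idx] = np.nan
    return out

def drop_range(series, low_frac, high_frac):
    # Systematic drop: set entries between low_frac*mean and high_frac*mean to NaN
    out = series.copy()
    m = out.mean()                                  # full-dataset mean of the variable
    out[(out >= low_frac * m) & (out <= high_frac * m)] = np.nan
    return out

# synthetic["Sc"] = drop_random(full_data["Sc"], 500)
# synthetic["Ca"] = drop_range(full_data["Ca"], 0.75, 1.5)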
The synthetic dataset is ordered based on the size of missingness, and Fig.2.11 shows the data map
of the general missingness information. The right side of the figure has the columns with the most
missing data, and the bottom of the figure has the locations containing the most missing data. 14 vari‑
ables are complete and 6 variables have missing data. The smallest size of missingness is in variable
Sc as only 500 data samples are dropped. Ca and La have the most missing data as the majority of
the data around mean value are dropped. The general missingness information is consistent with
the way the synthetic data are generated. To obtain the optimal dataset with no missingness, all
missing columns are dropped, and no rows need to be dropped. The synthetic data map does not show the pattern seen in the observed data map, where Zr and B have missing data at the same locations, because the synthetic data are dropped independently in each variable.
Figure 2.11: The synthetic data map. The highlighted columns are to be dropped.
Table 2.2 shows the KS test results of dobs for all combinations of missing and non‑missing variables.
The smaller the d is, the closer the two CDFs are. Sc and W have relatively low d values as the missing data are dropped randomly, and the other missing variables have d values at least 10 times larger, which
implies the missingness is systematic. The results are consistent with the design of the synthetic
data. The permutation test results (Fig.2.13) show a relatively low p value in variables Sc and W. p
implies how much non‑missing variables reflect the difference between the missing and observed
data of a missing variable. The results also reveal that the variables missing the middle data ranges
(Ca, La) have higher p than the variables missing the outer data ranges (Cu, Th). The magnitudes of Cu and Th are similar to those of the missing variables in the real dataset. This implies the missing
data in the five variables (Sr, Sn, F, Zr, B) lie in the outer data ranges. The differences between dobs and the sets of randomly sampled subsets are shown in Fig.2.12. The dobs of Sc falls within the majority of the
Sc Cu Th W Ca La
Cd 0.022855 0.174078 0.115534 0.026534 0.135357 0.428173
Rb 0.027880 0.237570 0.532445 0.018968 0.644970 0.944102
Ni 0.035398 0.348219 0.147874 0.013417 0.397736 0.760646
Zn 0.037868 0.246505 0.080256 0.036658 0.339642 0.560598
Eu 0.015316 0.287281 0.379795 0.007999 0.547218 0.442209
Al 0.047457 0.312095 0.362104 0.017139 0.522110 0.876179
Sb 0.035239 0.216004 0.112636 0.033488 0.214670 0.481931
Ag 0.063809 0.233299 0.105138 0.025822 0.400627 0.664889
Hf 0.036257 0.212964 0.257828 0.014200 0.615349 0.839151
Na 0.023399 0.137106 0.189646 0.015000 0.449543 0.613481
Bi 0.028937 0.233261 0.542790 0.025964 0.612153 0.874565
Br 0.042346 0.127733 0.123594 0.029850 0.220423 0.436009
Mg 0.047230 0.170895 0.264166 0.015042 0.779581 0.867649
Co 0.040629 0.364067 0.299488 0.014464 0.505220 0.884321
Table 2.2: Observed KS test results dobs for the synthetic dataset.
permutation subsets (d) whereas the dobs of Th is far from the permutation subsets d because the
missing data in Th are dropped systematically. It also explains the higher cumulative p in systematic
missing variables compared with random missing variables.
Fig.2.13 shows the plots of the cumulative p of the synthetic data, and they illustrate the missing‑
ness in variables correctly. Taking the relevance between variables and the size of missing data into consideration, the pn results are plotted in Fig.2.14. For example, La has the most missing data and
the missingness in La is the most systematic, so La has the largest cumulative pn. The two randomly
dropped variables have fairly small cumulative pn. Each one of the results in the section captures
different aspects of the missingness in the synthetic dataset. The number of missing variables, ran‑
dom and systematic missingness, and the missing sizes are illustrated. Thus, the proposed method
is robust at assessing the property of missingness in a dataset.
Figure 2.13: The results of the synthetic data p from permutation test.
2.6 Discussion
Although MNAR is difficult to analyze statistically, it can be observed from the data map. Com‑
paring the two data maps (Fig.2.2 and Fig.2.11), the major differences are in the ordering of the
missing data. In the real data map, missing data can be well clustered into the right bottom corner
while the synthetic data do not possess this feature. That is because the synthetic missing data are
Figure 2.14: The results of the synthetic data from permutation test considering missing size and relevance.
dropped independently for each variable, while the missingness in the real data occurs at the same locations. Especially for Zr and B, which are missing together, their availability depends both on observed
and missing data. When the missing data can be clustered as in Fig.2.2, the missingness is likely to
be MNAR.
The reason for using a permutation test rather than the bootstrap is that permutation is used to test a null hypothesis whereas the bootstrap is used to obtain confidence intervals. The result of the KS
test (d) is a property between two subsets, while the confidence interval is calculated within one subset. The null hypothesis here is that the two distributions XA|B and XA|NoB are drawn from the same population. If the hypothesis holds, the missingness in B is random. Otherwise, the missingness
may be systematic. Moreover, the reason for using the permutation test combined with the KS test
is that different sample sizes and variables generate multiple sets of observed d. When dobs = 0.1
represents significant difference in one variable, it may not be significant in another. d ∈ [0, 1] but a
threshold is needed to distinguish the random and systematic missingness for all variables. When
combined with the permutation test, the value of dobs is calibrated and the p value is a universal
measurement. So a threshold can be set for p. From the observations of this dataset, it is recommended that the missingness can be viewed as systematic when p > 10.
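The combination can be sketched with a short script. The following is a minimal illustration, not the thesis implementation: it assumes the per‑pair score is the fraction of permuted d values that fall below dobs (so a score near 1 means dobs stands out from the permutations), and that the cumulative p is obtained by summing this score over the non‑missing variables.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_permutation_score(x_obs, miss_mask, n_perm=1000, seed=0):
    """x_obs: values of one fully observed variable; miss_mask: True where the
    variable under study is missing. Returns (d_obs, score), where score is the
    fraction of permuted d values below d_obs (assumed per-pair contribution to p)."""
    rng = np.random.default_rng(seed)
    d_obs = ks_2samp(x_obs[miss_mask], x_obs[~miss_mask]).statistic
    d_perm = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(miss_mask)          # random missingness pattern
        d_perm[i] = ks_2samp(x_obs[perm], x_obs[~perm]).statistic
    return d_obs, float(np.mean(d_perm < d_obs))

# Cumulative p for one missing variable: sum the score over all non-missing variables, e.g.
# cumulative_p = sum(ks_permutation_score(col, miss_mask)[1] for col in nonmissing_columns)
```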
The robustness of the method is validated by the synthetic data. Different patterns of systematic
missingness lead to various magnitudes of p value, and this helps understand the mechanism of
missingness in the real data. For example, the variables missing outer data ranges (Cu, Th) share
similar p values with the missing variables in the real dataset (Sr, Sn, F, Zr, B). This implies the missing
variables in the original dataset may also miss the outer data ranges. So the mechanism (missing
which data ranges) of the missingness may be explored by dropping different data ranges of the
synthetic data and comparing the p values with the original ones. Note, this only works when the
synthetic missing data are generated from the same real data as the relations between variables
remain the same.
2.7 Conclusions
Missing data come from multiple sources. They can cause problems in data analysis such as PCA.
Different data imputation methods can be applied depending on the types of missingness, so it
is important to understand if the missingness in data is random or systematic. The technique of
missing data exploratory analysis performs well on the Northwest Territory data and it has two
major tools. First, it illustrates the general information of the missing data, that is, the number of
missing variables, boundary of missing and non‑missing data, and the optimal dataset. The second
tool is the KS‑permutation test that identifies the nature of missingness. The three types of results,
p, pr, and pn, convey different information about the missingness. The major information conveyed
in Fig.2.7 is the cumulative p value that indicates if the missingness is random. The pr value (Fig.2.9)
identifies the most relevant non‑missing variable that could be used for data imputation. The scaled
pn plots (Fig.2.10) show the most concerning missing variable considering the missing size. The
robustness of the method on this dataset is validated through a synthetic dataset generated from
the real data with no missing data. Variables dropping missing data randomly and systematically
are well identified.
CHAPTER 3
Below Detection Limit Data
3.1 Introduction
In geochemical data, there are often data below detection limit. The concentration of some elements
is so low that they are beyond the detection capability of the measurement equipment, such as
inductively coupled plasma (ICP) (Thompson, 2012), thermal ionization mass spectrometry (TIMS)
(Richter & Goldberg, 2003) and Energy‑Dispersive X‑ray spectroscopy (EDS) (d’Alfonso, Freitag,
Klenov, & Allen, 2010). The BDL data are recorded as either 0.0 or the minimum detectable value.
In either case, there are many duplicated data and they may form spikes in the distribution, which
could be problematic for exploratory data analysis and modeling (M. J. Pyrcz & Deutsch, 2014). In
cluster analysis, data are often normal score transformed to eliminate the effects of outliers and
scale variables to the N (0, 1) range (Prades, 2017). A problem arises in how to handle the BDL
data during transformation so that cluster analysis gives reasonable results. Since normal score
transformation uses quantile to quantile transform, there are two evident ways to handle the BDL
spikes. One is to spread the spike over a range of the normal distribution, in which case, each data
has a unique Gaussian data value. The other way is to retain the spikes, and the BDL data share the
same rank. For example, if 30% of the data are BDL, they are all at the 0.3 quantile or some arbitrarily
low value. With spikes preserved, the transformed units behave similarly to the original units, so
they are suitable for centroid based clustering (k‑means), but not suitable for distribution based
clustering (Gaussian Mixture Model) (Prades, 2017). When spikes are spread, the transformed data
are not suitable for centroid based method, because the cluster centroids are shifted, and the spread
spikes distort the relative distance between data. Suppose a spike consisting of 50% of the data is transformed to Gaussian space. The original data are all 0, but most of the spread BDL data are distributed from ‑3 to 0.
There are two major methods to spread the spikes: random despiking and local average despik‑
ing (M. J. Pyrcz & Deutsch, 2014; Verly, 1984). These two methods have limitations of a nugget effect that is too high or too low, respectively. With random despiking, the quantiles of BDL data are assigned randomly. Thus, the variogram has a high nugget effect. With local average despiking, the BDL data are ranked based on averages of surrounding data. High local averages give the BDL data high ranks, while low local averages give data low ranks. This may cause the transformed data to be too smooth spatially, and a low nugget effect. Prades (2017) proposes to combine the local and random despiking methods. BDL data are ranked based on local averages first, and data values are assigned incrementally from the BDL value X1 to the nearest value X2. A random value ranging between X1 and X2 is added. Its weight is controlled by a hyper‑parameter W1. The results show setting W1 to 0.5 can achieve a suitable trade‑off between random despiking and local despiking.
Before applying the methods to handle spikes in data, the first step is to understand the nature
of spikes and this helps choose the despiking method. In this chapter, BDL spikes and duplicate
data spikes are analysed with univariate and bivariate methods. First, the univariate distribution
is summarized by an information table to overcome the binning effect of histogram plots. Three
measurements of spikiness are developed to reveal different types of spike distributions, including
few spikes containing many data in each spike, many spikes containing few data, and spikes differ‑
ent from the expected distribution. The bivariate analysis uses KL divergence (Kullback & Leibler,
1951) to measure the discrepancy between the observed bivariate BDL distribution and an indepen‑
dent bivariate distribution. When the occurrence of BDL data are dependent, further investigation
can be done to understand the relation between variables. The same Northwest Territory data from
Section.2.1.2 (Falck et al., 2012) are used for demonstration.
For univariate analysis, it is easy to plot histograms and examine the distributions visually. His‑
tograms show the data distribution over a range of values. BDL data are binned with surrounding
low value data, which makes BDL spikes less obvious. A statistics table (Table 3.1) is created to show information on the BDL data.
Variables Min Value BDL Num. Aval. Data Sec. Min. Value Sec. Num. Average Ave. Exclude Min.
Au 0.0 6981 8486 0.30 1 0.96 5.41
B 0.0 2416 4554 1.00 342 1.79 3.81
Ba 0.0 462 8490 50.00 10 1357.60 1435.73
Bi 0.0 469 8441 0.02 328 0.26 0.28
Br 0.0 183 8466 0.50 18 4.11 4.20
Ce 0.0 191 8466 5.00 43 58.76 60.11
Cs 0.0 600 8466 0.50 102 4.23 4.55
Eu 0.0 5276 8466 1.00 1493 0.68 1.83
Hf 0.0 944 8466 1.00 438 6.47 7.28
Hg 0.0 703 8486 5.00 126 41.06 44.77
Lu 0.0 3831 8466 0.20 534 0.25 0.46
Na 0.0 122 8490 0.001 144 0.01 0.01
Rb 0.0 122 8466 5.00 37 71.75 72.80
S 0.0 2159 8441 0.01 158 0.09 0.12
Se 0.0 594 8445 0.10 585 0.95 1.02
Sn 0.0 2983 7515 0.10 6 1.34 2.22
Ta 0.0 2745 8466 0.50 256 0.75 1.11
Tb 0.0 3165 8466 0.50 344 0.62 1.00
Te 0.0 3269 8445 0.02 1114 0.029 0.04
Ti 0.0 694 8445 0.001 2008 0.007 0.008
W_ 0.0 5384 8490 1.00 1517 1.51 4.14
Yb 0.0 4618 8466 2.00 1190 1.47 3.23
Zr 0.0 3085 4557 200.00 31 133.04 411.88
Table 3.1: Univariate distribution information for each variable. The shortened column names are explained in the text.
The Northwest Territory data consist of 51 variables and about 8500 data samples. Most of the
variables contain BDL data. Table.3.1 shows information about the variables containing more than
100 BDL data. The relative size of the BDL spike influences the performance of quantile transform.
The first column shows the names of variables. The second column represents the recorded value
of BDL data. Here, all BDL values are recorded as 0.0. The third and fourth columns show the
number of BDL data and the available data. The number of available data varies as some variables
contain missing data. Fig.3.1 shows the percentage of the BDL data in each variable, along with the
number of BDL data and the number of available data. The variables are ranked based on their BDL
percentage. Au has the most BDL data, so it is ranked first. Zr has fewer BDL data but is ranked second because the number of available data in Zr is only around 5000. The proportions of the BDL data
are used in bivariate analysis.
The ’Sec. Min. Value’ column means the minimum detectable value. Note different variables
may use different units. Some use ppm and some use ppb. The ’Sec. Num.’ column is the number
of the second minimum data. These two columns are a measurement of the precision of equipment
and the general spike size in the variables. When the number of duplicated data in measured values
is small, the BDL spike can be problematic.
The second last column shows the average of data, considering the BDL data. The last column
shows the average value excluding the BDL data. In variables that contain many BDL data, the average excluding the BDL data is significantly higher than the overall average. This measures how far the detectable mean is from the BDL spike. It can also help with despiking. The despiked BDL data
should not be far from the detectable mean to retain a realistic data distribution. If the BDL data
is too far from the detectable mean, the spike may need to be preserved close to the second mini‑
mum value. If the BDL data is close to the detectable mean, the spike may be spread between the
minimum and the second minimum value.
Similar to BDL data, there are other spikes originating from the round up or round down effect. If
the precision of equipment is 0.01 ppm, the mineral content of 1.122 ppm and 1.123 ppm are both
recorded as 1.12 ppm. Unlike the large spikes created by BDL data, these smaller spikes are not easy
to identify from histograms. So the number of spikes and the size of spikes need to be measured
quantitatively. These features are the spikiness of data.
There are two major types of spikes, which are demonstrated in Fig.3.2. One is the ”few spikes
but many data” type. In the distribution of Au, there are only several spikes but they take more than
70% of the available data. The other type is the ”many spikes but few data” type. In the distribution
of Fe, there are multiple spikes and each spike takes only about 5% of the data. The histogram
illustration may look different because data are binned together. Different measurements are
developed for the two types of spikes.
Suppose there are K variables, and Ni is the number of data at data value i. The number of unique
data value is L and each data value is denoted as l. For example, if there are 50 data with a value
of 0.3, N0.3 = 50. Only data with Ni larger than 1% of the total number of data are considered
as spikes. To identify the variables with few spikes but many data in each spike, the quadratic
equation is applied:
M_g(k) = \sqrt{\sum_{i=1}^{L} N_i^2(k)}, \qquad k = 1, \cdots, K
and to identify the variables with many spikes but few data in each spike, the log equation is applied:
M_s(k) = \sum_{i=1}^{L} \log(N_i(k)), \qquad k = 1, \cdots, K
where Mg (k) and Ms (k) are the scores of spikiness for variable k. g stands for giant spikes and s
stands for small spikes. The value of Mg (k) is dominated by the size of large spikes, while the value
of Ms (k) is dominated by the number of spikes. The difference between the two measurements is
the relative importance of the size of spikes and the number of spikes. The quadratic term gives
large spikes more weight in the Mg (k) measurement, and the logarithm term gives the number of spikes
more weight in Ms (k) measurement.
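As an illustration, the two measures can be computed directly from the counts of duplicated values. This is a minimal sketch under the 1% rule stated above; the function and variable names are illustrative.

```python
import numpy as np

def spikiness_scores(values, min_frac=0.01):
    """Quadratic (M_g) and log (M_s) spikiness scores for one variable.

    Only values repeated more than min_frac of the data are counted as spikes."""
    values = np.asarray(values)
    _, counts = np.unique(values, return_counts=True)
    spikes = counts[counts > min_frac * values.size].astype(float)
    if spikes.size == 0:
        return 0.0, 0.0                       # no spikes in this variable
    m_g = np.sqrt(np.sum(spikes ** 2))        # dominated by the size of large spikes
    m_s = np.sum(np.log(spikes))              # dominated by the number of spikes
    return m_g, m_s
```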
When using these two methods, the number of duplicate data is more important than the pro‑
portion of duplicate data. Fig.3.3 shows the scores of spikiness for all variables, using different
measurements. Variables are ranked based on their scores in each method. The top figure shows
the scores using the quadratic method, and the bottom figure shows the scores using the log method.
Au ranks first in the quadratic method, which means it has the largest spike. The ranks of variables using the quadratic method generally follow the proportion of BDL data in each variable. The higher the BDL proportion, the higher the Mg (k) spikiness score, which is consistent with the purpose of the quadratic method. Cs ranks first in the log method, which implies it has the most spikes. The variable ranks are generally different from those of the quadratic method. This is anticipated as
Figure 3.3: Measurement of spikiness using the quadratic and logarithm methods.
there are not many spikes with great size when the number of data is limited. The zero values in
both figures indicate there are no spikes in the corresponding variables. Table.3.2 also shows the
methods work as expected. The left table shows the quadratic method can find large spikes. The
five highest ranked variables all have giant spikes with Ni > 3000. The right table shows the log
method finds evenly distributed spikes. When the size of spikes are small and evenly distributed,
the number of spikes can be large.
Mes. Spikiness Five biggest count Mes. Spikiness Five biggest count
Au 7,008.18 [6981, 468, 309, 235, 139] Cs 103.62 [600, 224, 161, 158, 143]
W 5,672.25 [5384, 1517, 917, 216, 120] U 95.12 [227, 225, 214, 201, 196]
Eu 5,631.11 [5276, 1493, 1252, 294, 84] Br 95.04 [203, 199, 192, 192, 192]
Yb 5,070.30 [4618, 1543, 1190, 751, 194] Sc 90.20 [275, 254, 252, 251, 247]
Lu 4,294.60 [3831, 1280, 1042, 663, 534] La 88.05 [263, 258, 246, 237, 232]
Table 3.2: Results from the two methods. Left using the quadratic equation and right using the log equation.
Kullback–Leibler divergence
The entropy of a discrete random variable is
H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)
where X is a discrete random variable and x1 , · · · , xn are the possible outcomes. P (xi ) is the probability of xi . The more information each observation can provide, the higher H(X) is. The equation
for the relative entropy is as follows:
D(P \| Q) = \sum_{x} P(x) \log\left(\frac{P(x)}{Q(x)}\right) \qquad (3.2)
where P (x) is the observed distribution and Q(x) is the expected distribution. It compares the
difference between two discrete distributions. The higher the D(P ||Q), the more different they are.
When two distributions are the same, the log term is equal to zero, and D(P ||Q) is equal to zero,
which is also the minimum value. The divergence indicates how much information is lost if Q(x)
is used to represent P(x). For example, suppose there is a coin and it is assumed to be fair, so the
probability of head and tail are both 0.5 (Q(head) = Q(tail) = 0.5). To verify the assumption,
the coin is tossed 1000 times with 700 heads and 300 tails (P (head) = 0.7, P (tail) = 0.3). The
observation deviates from the assumption, and the divergence is calculated using Eq.(3.2).
In the following sections, KL divergence is used to compare the difference between discrete distri‑
butions. The distributions can be univariate or multivariate.
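A minimal sketch of Eq.(3.2) applied to the coin example, using the natural logarithm (the base of the logarithm only rescales the divergence):

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy D(P||Q) between two discrete distributions (Eq. 3.2)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # terms with P(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Coin example from the text: assumed fair vs. observed 700 heads / 300 tails.
print(kl_divergence([0.7, 0.3], [0.5, 0.5]))   # about 0.082 in natural-log units
```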
Scaled method
The scores of the quadratic and the log methods become higher when the available data increase. To
amend this issue, the scaled method can be applied. It measures how different the observed spike
distribution is from the expected spike distribution. Here, spikes are defined differently. Instead
of using the number of data at value i (Ni ), the probability of data at value i (Pi ) is used. Suppose
Figure 3.4: An illustration of spikes in the scaled method. Variable A and B represent two random variables.
Figure 3.5: The results of the measurement of spikiness, using the scaled method.
there are 1,000 data in total and 200 data have the value of 4.3 ppm. Then P4.3 = 0.2. If there
are 50 unique values in the 1,000 data, the expected probability for each value i is Pi = 0.02. This
is defined as the expected distribution, which is compared with the observed distribution. For
example, in Fig.3.4, there are three possible values. The expected probability for each value (the
blue bars) is 0.33. Variable A has the same distribution as the expected one, so there is no spike in
A. For variable B, the Pi distribution is different from the expected distribution and the difference is
measured using KL divergence. The same measurement can be applied to the variables in the real
data and each variable has an expected distribution and an observed distribution. The spikiness
score for a variable is calculated using Eq.(3.2).
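A minimal sketch of the scaled measurement: the observed probabilities over the unique values are compared with the equal‑probability expected distribution using Eq.(3.2). The function name is illustrative.

```python
import numpy as np

def scaled_spikiness(values):
    """Spikiness as the KL divergence between the observed distribution over
    unique values and the expected (equal-probability) distribution."""
    values = np.asarray(values)
    _, counts = np.unique(values, return_counts=True)
    p_obs = counts / counts.sum()                   # observed P_i
    q_exp = np.full_like(p_obs, 1.0 / p_obs.size)   # expected P_i = 1/L
    return float(np.sum(p_obs * np.log(p_obs / q_exp)))
```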
Fig.3.5 shows the results of comparing the observed distribution with the expected distribution
using KL divergence. The ranks are similar to those in the quadratic method, but they do not follow
the same principle. The first ranked variable Cs in the log method is ranked low in this case. It
means the observed spike distribution in Cs is similar to the expected spike distribution.
The purpose of bivariate analysis is to identify if the occurrence of BDL data between variables
is dependent. If so, more investigation can be conducted on the pairs of variables to understand
the reasons for BDL occurrence. The bivariate distribution is divided by BDL boundaries into four
regions ‑ one region where both dimensions are at BDL, two regions where either one of the di‑
mensions is at BDL, and one region where both dimensions are not at BDL (Fig. 3.6). Based on the
univariate analysis in the previous section, the proportions of BDL in variables are known. Assum‑
ing two variables are independent, the bivariate distribution of BDL occurrence can be calculated.
If the observed probabilities are far from the independent distribution, we may conclude the BDL
data occurrence is dependent.
From probability theory, if two events A1 and A2 are independent, the joint probability is simply
the multiplication of the probabilities of each event:
f(A_1, A_2) = f(A_1) f(A_2)
where f represents the probability of an event. Since the probability of BDL for each variable is
known, it is easy to calculate the probabilities Pind (B1 , B2 ), Pind (B1 , N2 ), Pind (N1 , B2 ) and Pind (N1 , N2 ),
where Pind is the probability when two events are independent, and Bi and Ni represent variable
i is at BDL and not at BDL respectively. Fig.3.6 shows the four regions in a 2D plane. The univari‑
ate BDL proportions are marked by red dashed lines. However, Fig.3.7 shows some variables are
correlated. When calculating the expected probability, the correlations have to be considered. The
highest negative correlation is ‑0.16. In this case, Pind should be replaced by Pexp , which means the
expected probability of BDL considering correlations.
To obtain Pexp , a simple way is to sample in multi‑Gaussian space and count the number that
falls in each region. Dividing the number in each region by the sample size gives the expected
probability. Now, the BDL boundaries in Gaussian units given the univariate BDL proportions need
to be calculated. This is achieved through quantile transform. The boundary of BDL in Gaussian
space is calculated by taking the inverse of Gaussian cumulative distribution function (CDF) given
the BDL proportions in the original space
b_k = G^{-1}(P(B_k))
where bk is the BDL boundary of variable k in Gaussian space, P (Bk ) is the probability of BDL
in variable k. Table.3.3 shows the BDL boundaries in Gaussian space for Sn and S. The column
”Probability” refers to the probability of BDL. ”Gauss Unit” shows the converted BDL boundaries
in Gaussian space. Note the Gaussian space is standard. To calculate the expected probability of
each region, for example the region where both Sn and S are BDL, the samples below ‑0.26 for Sn and
Figure 3.6: Illustration of the four BDL regions in a bivariate setting. The original data are quantile transformed
and the spikes are spread. The random despiked BDL data are shown as the diagonal line. The marginal
distributions are Gaussian.
‑0.61 for S are counted, and divided by the total number of samples. The expected probabilities using
10000 samples are shown in Table.3.4. ”GBoth_bdl” means the expected probability that data are BDL for both variables. ”GVar1_bdl” means the expected probability that data are BDL for variable 1 (Sn). ”GVar2_bdl” means the expected probability that data are BDL for variable 2 (S). ”GNon_bdl” means
the expected probability of no data being BDL. Since the correlation between Sn and S is 0.009, the
expected probabilities of the four regions are similar to the independent probabilities.
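The expected probabilities can be approximated with a short Monte Carlo sketch such as the one below; names and sample size are illustrative (the values in Table.3.4 come from the thesis' own implementation). The observed proportions counted from the data can then be compared with these expected probabilities using Eq.(3.2).

```python
import numpy as np
from scipy.stats import norm

def expected_bdl_probs(p_bdl1, p_bdl2, corr, n=10_000, seed=0):
    """Expected probabilities of the four BDL regions assuming a bi-Gaussian
    dependence with the given correlation between the two variables."""
    rng = np.random.default_rng(seed)
    b1, b2 = norm.ppf(p_bdl1), norm.ppf(p_bdl2)     # BDL boundaries b_k = G^-1(P(B_k))
    cov = [[1.0, corr], [corr, 1.0]]
    x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    bdl1, bdl2 = x[:, 0] < b1, x[:, 1] < b2
    return {"both_bdl": np.mean(bdl1 & bdl2),       # analogous to "GBoth_bdl"
            "var1_bdl": np.mean(bdl1 & ~bdl2),
            "var2_bdl": np.mean(~bdl1 & bdl2),
            "non_bdl":  np.mean(~bdl1 & ~bdl2)}
```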
To obtain the observed probabilities Pobs (B1 , B2 ), Pobs (B1 , N2 ), Pobs (N1 , B2 ) and Pobs (N1 , N2 ), the
number of data in the four regions are counted and divided by the number of available data. Ta‑
ble.3.5 shows the results of the observed probabilities for Sn and S. ”Perct_both” means the percentage of data that are BDL for both variables. ”Perct_Col1” means the percentage of data that are BDL for
Figure 3.7: The correlation matrix between variables with over 100 BDL data. The minimum correlation is
‑0.16.
variable 1 (Sn). ”Perct_Col2” means the percentage of data that are BDL for variable 2 (S). ”Perct_None” means the percentage of data that are not BDL in both variables. The observed bivariate distribu‑
tion is different from the expected distribution. It has higher probabilities for Pobs (B1 , B2 ) and
Pobs (N1 , N2 ). KL divergence measures the difference between the two discrete distributions (Pexp and Pobs ) quantitatively. Given the data from Table.3.4 and Table.3.5, the difference is calculated using Eq.(3.2).
The same procedure of sampling expected bivariate distribution, calculating observed distribu‑
tion, and calculating the difference using KL divergence is applied on combinations of two variables
having more than 1000 BDL data in Table.3.1. The variables include Au, B, Eu, Lu, S, Sn, Ta, Tb, Te,
W, Yb and Zr. If the resulting Dobs is large, we may conclude that the BDL occurrence between
the variables is not independent and further inspection can be conducted. Fig.3.8 shows Dobs for
different combinations. Only combinations with Dobs larger than 0.1 are shown in the figure. The
right figures show the probability distributions of four regions for observed (blue) and expected
Figure 3.8: D values for each combination of the two variables. Only the combinations with D value larger
than 0.1 are shown.
(orange) distribution. Col1 refers to the first variable in combination and Col2 refers to the second
variable. The largest Dobs value is 0.22 for Sn and Tb and only 10 combinations are larger than 0.1
in all 66 (12 × 11/2) combinations. Considering the theoretical maximum of Dobs is +∞, the dif‑
ference may not seem to be large, but the distributions are fairly different in the right column of
Fig.3.8. To reveal more combinations of variables, simply lowering the threshold of Dobs can work,
for example using 0.01 rather than 0.1, but different datasets have different ranges of D. It is diffi‑
cult to set a threshold suitable for all cases. In fact, the range of Dobs is constrained by the possible
observed bivariate distribution Pobsp . The possible bivariate distributions are constrained by the
univariate BDL proportions. The maximum Dmax can be found to standardize the observed Dobs ,
and the scaled D should provide more combinations that show difference between the observed
and expected distributions. A threshold can also be set as the scaled D is between 0 and 1.
Find Dmax
The range of Dobs can be calculated when treating Pexp as constant and Pobsp as variable. Pexp
represents independent BDL occurrence. When Pobsp is close to Pexp , Dobs is small and the observed
BDL occurrence is also independent. When Pobsp is very different from Pexp , Dobs is large and the
BDL occurrence is dependent. To find the range of Dobs , we need to find the range of Pobsp . First,
how the univariate BDL probabilities constrain the bivariate distributions is shown. Suppose the
probability of BDL in variable 1 is x and the probability of BDL in variable 2 is y. The following
equations hold:
P_{obsp}(B_1, B_2) + P_{obsp}(B_1, N_2) = x
P_{obsp}(N_1, B_2) + P_{obsp}(N_1, N_2) = 1 - x
P_{obsp}(B_1, B_2) + P_{obsp}(N_1, B_2) = y
P_{obsp}(B_1, N_2) + P_{obsp}(N_1, N_2) = 1 - y
This is visualized in Fig.3.6. x, y, 1 − x, 1 − y are the marginal probabilities. Although there are 4
equations, only 3 are independent (any three equations can derive the fourth). Since there are 3 in‑
dependent equations with 4 unknown parameters, once one of the four probabilities (Pobsp (B1 , B2 ), Pobsp (B1 , N2 ), Pobsp (N1 , B2 ) and Pobsp (N1 , N2 )) is set, the other three are determined. The possible
range for each region is also bounded as P ∈ [0, 1]. To show the range of possible bivariate distri‑
butions, it is convenient to treat the region with P ∈ [0, min(x, y, 1 − x, 1 − y)] as the independent
variable and the other three as the dependent variables. Now it is shown when Dmax can be reached.
To make the demonstration convenient, assume x < y < 0.5. (If they are set larger than 0.5, 1 − x
and 1 − y are less than 0.5. The demonstration remains the same and only symbols change.) In this
case, x is the minimum, and the independent probability region is Pobsp (B1 , B2 ) ∈ [0, x]. Denote
Pobsp (B1 , B2 ) = v, and the probabilities of the other regions are
Pobsp (B1 , N2 ) = x − v,
Pobsp (N1 , B2 ) = y − v,
Pobsp (N1 , N2 ) = 1 − x − y + v.
The independent variable of expected distribution is denoted as Pexp (B1 , B2 ) = p (Note p is con‑
stant) and the other regions are denoted as
Pexp (B1 , N2 ) = x − p,
Pexp (N1 , B2 ) = y − p,
Pexp (N1 , N2 ) = 1 − x − y + p.
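One way to see where the maximum lies is to scan D over the admissible observed distributions numerically; the brute‑force sketch below complements the analytical argument and assumes x < y < 0.5 and 0 < p < x as above.

```python
import numpy as np

def d_max(x, y, p_exp, n_grid=1001):
    """Scan D(P_obsp || P_exp) over all observed bivariate distributions consistent
    with the marginal BDL proportions x and y, parameterized by v = P_obsp(B1, B2)."""
    q = np.array([p_exp, x - p_exp, y - p_exp, 1 - x - y + p_exp])  # expected regions
    best = 0.0
    for v in np.linspace(0.0, min(x, y), n_grid):
        p = np.array([v, x - v, y - v, 1 - x - y + v])              # candidate observed regions
        mask = p > 0                                                # zero terms contribute nothing
        best = max(best, float(np.sum(p[mask] * np.log(p[mask] / q[mask]))))
    return best
```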
3.3.3 Discussion
Although the problem of small Dobs value is solved by scaling Dobs using Dmax , the actual threshold
for determining the dependence is unclear. 100% represents full dependence and 0 means complete
independence. The intermediate percentage needs to be examined and a threshold for concern set.
When calculating the expected probability, only the linear relation between variables in the orig‑
inal space is considered. There are two reasons for that. First, when sampling in a multi‑Gaussian
space, besides the mean, only the correlations impact the shape of the Gaussian distribution. Another
reason is, no matter how non‑linear the data are, the number of data in each region does not change,
and non‑linearity can only be observed in the region where data are not at BDL. It could be a prob‑
lem when the area is divided finer, but here only considering the linear dependence is reasonable.
Figure 3.9: Percentage of dependence for each combination of the two variables. Only the combinations with a percentage larger than 10% are shown.
3.4 Conclusion
BDL data come from the detection limit of measurement equipment. They are recorded as the same value and
form a spike. The spike can be problematic for geostatistical modeling and multiple despiking
methods are proposed. Before despiking the BDL data, the characteristics of data spikes need to
be understood. The BDL data table, which reveals more details on the BDL side, complements
histograms. When measuring the spikiness of data, three different methods are applied. Each
method has a unique application. The quadratic method reveals variables with large spikes, the log method reveals variables with many spikes, and the scaled method focuses on variables whose spike distribution differs from the expected one.
The proposed bivariate method detects the dependence of BDL occurrence between variables.
Observed bivariate distributions are compared with expected distributions which assume indepen‑
dent BDL occurrence. When calculating the expected distribution, correlations between variables
are considered. Since joint probabilities are difficult to calculate directly, expected distributions
are simulated in a multi‑Gaussian space considering correlations. The observed probabilities are
compared with the expected one, using KL divergence and obtaining Dobs . Considering the theo‑
retical value of KL divergence ranging from 0 to infinity, it is difficult to set a threshold to deter‑
mine whether BDL occurrence is dependent. The bivariate distributions are bounded by univariate
BDL probabilities, so a maximum Dmax can be obtained, representing the full dependence of BDL
occurrence. The observed results are scaled by the maximum value to evaluate the level of BDL occurrence dependence. The procedure is summarized as follows:
• Convert the probability boundary to standard Gaussian unit boundary bk through quantile
transform
• Sample data in a standard bivariate Gaussian space considering the correlation between vari‑
ables
• Calculate the observed distribution by counting the number of data in each region
With the scaled D, 20 combinations show dependence of BDL occurrence when the threshold is
set to 0.1. In the real data, 5 combinations have 20% BDL occurrence dependence, which may be worth further investigation. The strong dependence of the BDL occurrence can come from the dilution during the measurement. When one variable is diluted, other variables contained in the same solution are diluted at the same time. If the concentrations of some variables are already low, the dilution can result in the concentrations of these variables being at BDL simultaneously, which
is reflected as the dependence of the BDL occurrence.
CHAPTER 4
Multivariate Cluster Analysis
4.1 Introduction
The purpose of exploratory data analysis is to understand the nature of data. It includes exploring
univariate and multivariate distributions, finding outliers and duplicated data, and evaluating sum‑
mary statistics of the data. If data have multiple clusters, the statistical inference may not provide
accurate information, because the statistics are averaged across clusters. Therefore, it is important to apply cluster analy‑
sis. Further statistical calculations and geostatistical modeling can be conducted on representative
groups.
The idea of cluster analysis is straightforward: keeping similar data together. Different ways of
measuring similarity between data result in different types of clustering methods, such as distance
based and distribution based clustering. Distance based methods use distance between data as a
measurement of similarity. The closer the distance, the more similar the data are. Most distance
based clustering methods use the Euclidean distance to quantify the similarity of data:
d(x, y) = \sqrt{\sum_{i=1}^{k} \| x_i - y_i \|^2}
where d(x, y) is the distance between data vector x and y, and i represents variable i and k is the
number of variables. Distribution based methods fit multiple kernels to the data distribution. The probabilities that data belong to the same kernel show their similarity. Gaussian kernels are often used
because a small number of parameters can formulate multivariate Gaussian distributions.
K‑means clustering is the most commonly used distance based method (Abubaker & Ashour,
2013; MacQueen et al., 1967; Pedregosa et al., 2011). K‑means clusters data by assigning them to
their closest centroids. The number of centroids is the number of clusters. The procedure is as
follows: centroids are assigned randomly at the beginning. With the centroids assigned, data are
clustered to their closest centroids. Then the cluster centroids are recalculated. Data are assigned
again to the closest new centroids. The iterations continue until the centroids remain unchanged.
The process is equivalent to finding the clusters giving the minimum within cluster sum of squares
(WCSS):
WCSS = \sum_{i=1}^{N} \min_{\mu_j \in C} \| x_i - \mu_j \|^2
where N is the number of data, µj is the mean of cluster j, and C is the set of cluster centroids. The
minimum term shows the sum of squares is only calculated for data to their closest centroids (within
the same cluster). The algorithm can settle in local minima, and the final centroids are determined
by the initial assignment of centroids, so multiple realizations of initial centroids are generated, and
the final clusters are the ones giving the minimum WCSS.
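A minimal k‑means sketch with scikit‑learn (the library cited above); the toy data and parameter choices are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three isotropic Gaussian clusters in 2D.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=m, scale=0.5, size=(300, 2))
                  for m in ([0, 0], [4, 0], [0, 4])])

# n_init restarts guard against poor local minima of the WCSS objective.
km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(data)
print(km.inertia_)       # WCSS of the best of the 20 restarts
labels = km.labels_      # cluster assignment of each datum
```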
Gaussian mixture model (GMM) is a common distribution based clustering method (Pedregosa
et al., 2011; Reynolds, 2009). It fits Gaussian density kernels to clusters and data are assigned based
on their probabilities in each kernel. The clustering procedure is similar to k‑means. Gaussian ker‑
nels are assigned in the initial state, and data are assigned to the kernels in which they have the
maximum probability. Then new Gaussian kernels are generated that give the maximum likeli‑
hood of data in the same cluster. Data are assigned again based on the new kernels. This process
continues until the Gaussian kernels stay the same.
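A corresponding GMM sketch, reusing the toy data from the k‑means example above; covariance_type="full" is an illustrative choice that allows elongated cluster shapes:

```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type="full",
                      n_init=5, random_state=0).fit(data)
hard_labels = gmm.predict(data)       # assign each datum to its most probable kernel
soft_probs = gmm.predict_proba(data)  # probability of each datum under each kernel
```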
Although the fitting process is similar for k‑means and GMM, each method has its own advan‑
tages. k‑means is more stable than GMM. When there are too few data to calculate a non‑singular
covariance matrix, GMM can diverge (Yamazaki & Watanabe, 2003). The cluster shape is more
flexible for GMM. The covariance matrix in GMM can handle elongated cluster shapes, while k‑
means assumes clusters have isotropic shape. Different methods need to be adopted for different
situations.
There are new clustering methods which can handle data with special shapes shown in Fig.4.1
(Fred & Jain, 2005). Using robust and efficient clustering methods such as k‑means and GMM
should be sufficient for identifying clusters in geostatistical data. In this chapter, k‑means and
GMM are used to examine clusters in data.
Figure 4.1: New clustering methods can handle complex clusters such as the moon shape clusters (Fred & Jain,
2005).
Most data need to be processed prior to cluster analysis. One common practice is rescaling the
data to [0,1] for equal scale in every dimension. This is important especially for distance based
clustering methods. If the scale of one dimension is orders of magnitude larger than others, the
clustering of data would be dominated by this dimension. The advantage of rescaling is that it
keeps the original shape of data, but it is not reliable when outliers and spikes are present. Outliers
are the data with extremely high value, sometimes 100 times higher than the mean. This can be
problematic for clustering, because the cluster centroids can be shifted drastically by outliers. Spikes
are duplicated data, mainly coming from the precision limitations of the measurement equipment.
The biggest spike in real data can be the BDL spike. They are the data below the detection limit
of the measurement equipment, and can be recorded as 0.0 or the minimum detectable value. The
BDL data form a large spike at the low value region. This can cause distribution based clustering to
fit a narrow kernel only for the BDL data, while they could belong to other groups if the true values
were known.
Quantile transformation is another data transform used commonly in geostatistical modeling to
address outliers and skewed distributions. Suppose z represents data in the original units and y
represents the transformed units. F (z) is the cumulative distribution function (CDF) of data in the
original units and G(y) is the CDF in the space to be transformed to. When data are normal score
transformed, G(y) is the Gaussian density (M. Pyrcz & Deutsch, 2018). The following equation is
used to transform the univariate z to y:
y = G^{-1}(F(z)) \qquad (4.1)
Eq.(4.1) can transform the original univariate distribution to a Gaussian distribution. In multivari‑
ate cases, the transformation is conducted in each dimension separately. For example, in normal
score transform, the 1D marginal distribution has Gaussian shape but the multivariate data do not
necessarily have a multi‑Gaussian shape. This is desired as cluster analysis depends on the multi‑
variate relations. The transformed outliers are closer to the main data. The transformation of spikes
is also important. If spikes are preserved, data in spikes are assigned the same quantile. If spikes
are spread, data in spikes are assigned different quantiles.
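A minimal sketch of the univariate quantile transform of Eq.(4.1), with the two treatments of spikes discussed above; the (rank − 0.5)/n convention and the random tie‑breaking are illustrative choices, not the thesis implementation:

```python
import numpy as np
from scipy.stats import norm, rankdata

def normal_score(z, preserve_spikes=True, seed=0):
    """Normal score transform y = G^-1(F(z)) of one variable."""
    z = np.asarray(z, float)
    if preserve_spikes:
        ranks = rankdata(z, method="average")        # tied (spike) values share one rank
    else:
        rng = np.random.default_rng(seed)
        order = np.lexsort((rng.random(z.size), z))  # sort by value, ties broken randomly
        ranks = np.empty(z.size)
        ranks[order] = np.arange(1, z.size + 1)      # random despiking of the spike
    return norm.ppf((ranks - 0.5) / z.size)          # quantile -> standard Gaussian value
```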
Prades (2017) proposes to normal score transform data with the spikes spread when using GMM
and with the spikes preserved when using k‑means. However, transforming data to a uniform dis‑
tribution could be better, since it does not create an artificial cluster in the center of the space. In
this chapter, data rescaling, normal score transform and uniform transform of data are compared,
along with different treatments for data spikes.
The number of clusters (NC) is another important aspect that affects clustering performance. If
the NC is not optimal, data are partitioned rather than clustered. It is relatively easy to visualize
NC in a space less than 4 dimensions. For higher dimensional data, statistical tools are required.
The statistical tools used in this chapter are Hopkins statistic (Lachheb, 2021; Lawson & Jurs, 1990),
gap statistic (Tibshirani et al., 2001), silhouette coefficient (Aranganayagi & Thangavel, 2007) and
prediction strength (Tibshirani & Walther, 2005). Hopkins statistic determines if there is any cluster
in the distribution by comparing data with uniform samples. Gap statistic and silhouette coefficient
find the optimal NC. Prediction strength uses cross‑validation to verify if data are truly clustered
or partitioned. A workflow combining these four tools is proposed to determine the optimal NC.
In this chapter, a series of tools are introduced for handling spikes when conducting cluster analysis.
First, a workflow to find the optimal NC is introduced with a detailed explanation of the statistic
tools used. Its application is demonstrated using synthetic data. Then the workflow’s compatibil‑
ity with different transformations and clustering methods is examined. The clustered results are
compared to the true labels and a measurement of the correctness rate is used to determine the
appropriate transforms and clustering methods. The synthetic data show k‑means clustering com‑
bined with linear transform, uniform transform and Gaussian transform with spikes preserved are
feasible. The workflow and the appropriate transforms are used on high‑dimensional real data. The
real data come from the Northwest Territories with the missing data eliminated (Falck et al., 2012).
From the previous chapters, the number of BDL data is significant. The large spikes are handled
when conducting cluster analysis.
Figure 4.2: The original synthetic data (left) and the data with synthetic spikes (right).
Figure 4.3: Uniform transformed data, spreading out the spikes. The marginal distributions are shown on the
edges.
In most clustering methods, NC is an important parameter for correct clustering results. This sec‑
tion introduces a workflow for determining NC. It has 3 aspects: the detection of cluster existence using the Hopkins statistic, finding NC using the gap statistic and silhouette coefficient, and the cross‑
validation of the resulting NC.
Fig.4.2 shows the synthetic data used for illustration purposes. The left figure shows the original
data. The data consist of 3000 samples and 5 clusters with Gaussian shape, and the 5 clusters have
different mean and variances. The right figure shows the synthetic data with spikes. 20% low value
data are treated as BDL. To handle spikes, the data are uniform transformed with spikes spread.
Fig.4.3 shows the uniform transformed data. The range of the uniform space is from 0 to 1. The
BDL data quantiles are randomly assigned. The left bottom line represents the data that are BDL
in both dimensions. They appear as a single dot at the left bottom in Fig.4.2. Data with only one
dimension BDL are on the bottom or left margins in Fig.4.2. The BDL boundary is at 0.2 because the
uniform space range is [0,1] and 20% data are set to be BDL. If the range is [0,2], the BDL boundary
is at 0.4. If 30% data are BDL, the boundary is at 0.3. From the marginal histogram distribution,
the data are uniformly transformed. The five clusters are still distinguishable. The transformed
cluster shapes are relatively isotropic, which is important for applying k‑means.
The first step of cluster analysis is to determine if there are any clusters in data. It is also called
clustering tendency. Lawson and Jurs (1990) proposed the Hopkins statistic for this task. The idea
is to compare the dataset with uniform distributed samples (used as references). If the results are
very different from the reference, there can be clusters.
Suppose there are N data (dtj , j = 1, · · · , N ). N samples are simulated in a uniform space that
shares the same range with the data, which are denoted as si , i = 1, · · · , N . The Hopkins statistic
H is calculated as
H = \frac{\sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} y_i + \sum_{j=1}^{N} x_j}
where yi is the distance of random sample si to its nearest neighbor sk (k denotes the nearest neigh‑
bor), and xj is the distance of data dtj to its nearest neighbor dtk . From the equation, if there are
clusters, the sum of xj is much smaller than the sum of yi , so H is close to 1. When there are no
clusters, the sum of xj is similar to the sum of yi . H is close to 0.5. Since the reference y is sampled throughout the uniform space, the Hopkins statistic is not very accurate when outliers are present. The
empty space between outliers and main data is uniformly sampled, and the sum of reference data
is much larger than the original data, resulting in H close to 1. The Hopkins statistic is more infor‑
mative when the data are uniform transformed. The uniform transformed synthetic data in Fig.4.2
have an H value of 0.8, which indicates cluster existence.
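A minimal sketch of the Hopkins statistic following the definition above (nearest‑neighbour distances within the data versus within uniform samples over the same bounding box); names are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(data, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data, float)
    n, dim = data.shape
    uniform = rng.uniform(data.min(axis=0), data.max(axis=0), size=(n, dim))

    def nn_dist(points):
        # distance of each point to its nearest other point (column 0 is the point itself)
        return NearestNeighbors(n_neighbors=2).fit(points).kneighbors(points)[0][:, 1]

    x = nn_dist(data)      # x_j in the equation above
    y = nn_dist(uniform)   # y_i in the equation above
    return y.sum() / (y.sum() + x.sum())
```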
Silhouette coefficient and gap statistic are used to determine the optimal NC. Silhouette coefficient
measures the sparsity of clusters. For each data i, the silhouette coefficient Si is calculated as
S_i = \frac{b_i - a_i}{\max(a_i, b_i)}
Figure 4.4: The silhouette coefficient for data and the corresponding clusters when using k‑means and NC=5.
where ai is the average distance between i and the data in its own cluster, and bi is the average
distance between i and the data in the nearest cluster. Here, the distance means the Euclidean
distance between data. From the equation, Si ∈ [−1, 1]. When Si is close to 1, it indicates data i
should belong to the current cluster whereas Si close to ‑1 indicates the data may belong to other
clusters. The silhouette coefficient for the dataset SN is the average Si over all data, where i =
1, 2, · · · , N . SN close to 1 means the clusters are well separated. When SN is close to 0, it indicates
there may not be distinguishable clusters. When using Euclidean distance, silhouette coefficient
assumes clusters have isotropic shapes, so when SN is negative, it does not necessarily mean clusters
are wrongly grouped. It is also possible that clusters have irregular shapes. As shown in Fig.4.4, the
transformed data are clustered into 5 groups using k‑means. In the left figure, each horizontal line
represents Si for each data (3000 lines in total). The vertical red dashed line represents the averaged
silhouette coefficient SN . Data in cluster number 3 have a high silhouette coefficient because they
are compact and distant from other clusters. There are some negative values in cluster number 2
and 0. As observed from the right figure, they may be the data to the top‑left corner of cluster 0 and
bottom‑right corner of cluster 2, which are wrongly clustered.
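A minimal sketch of scanning NC with the silhouette coefficient using scikit‑learn; `data` stands for any (n_samples, n_features) array such as the transformed synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

for nc in range(2, 11):
    labels = KMeans(n_clusters=nc, n_init=10, random_state=0).fit_predict(data)
    s_mean = silhouette_score(data, labels)     # S_N averaged over all data
    s_each = silhouette_samples(data, labels)   # S_i per datum (Fig. 4.4 style plot)
    print(nc, round(s_mean, 3))                 # the highest S_N suggests the optimal NC
```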
The silhouette coefficients are plotted against a range of NC, and the optimal one is indicated by
the highest silhouette coefficient. Fig.4.5 shows the optimal NC is 5, which corresponds to the true NC of the synthetic data. Note ”Clustering Number” in the figures means the number of clusters. The silhouette coefficients range from 0.43 to 0.53. The difference is not very great considering how differently the data can be clustered using different NC. Especially when NC is 4, the silhouette
coefficient is fairly close to that of NC equal to 5. If the true NC is unknown, it is difficult to decide
between 4 or 5 clusters, so another measurement to validate the results is needed.
Similar to Hopkins statistic, the gap statistic uses uniform distributed samples as a reference to
Figure 4.6: The gap statistic and the corresponding log(Wk ) for reference and data in a range of NC.
compare with data (Tibshirani et al., 2001). Suppose N data are divided into K clusters. Within a
specific cluster r (r = 1, · · · , K), the sum of pairwise squared Euclidean distance Dr is calculated
as
D_{i,j} = \| x_i - x_j \|^2, \qquad D_r = \sum_{i,j \in C_r} D_{i,j} \qquad (4.2)
where Di,j is the squared Euclidean distance between data i and j, and Cr represents cluster r. For
all K clusters,
W_K = \sum_{r=1}^{K} \frac{1}{2 N_r} D_r
where Nr is the number of data in Cr . The gap statistic is equal to
Gap(K) = E^{*}[\log(W_K)] - \log(W_K')
where the first term represents the expected value of log(WK ) obtained from multiple realizations of N samples from the multivariate uniform distribution. The second term log(WK ′ ) is calculated from the observed data. The optimal K is the NC which gives the maximum Gap(K).
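A minimal gap statistic sketch; it uses the k‑means inertia as WK (which for Euclidean distance equals the pairwise form above), and the number of reference realizations is an illustrative choice:

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(data, k_max=10, n_ref=10, seed=0):
    """Gap(K): mean log(W_K) of uniform references minus log(W_K) of the data."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, float)
    lo, hi = data.min(axis=0), data.max(axis=0)

    def log_wk(x, k):
        return np.log(KMeans(n_clusters=k, n_init=10, random_state=0).fit(x).inertia_)

    gaps = []
    for k in range(1, k_max + 1):
        ref = [log_wk(rng.uniform(lo, hi, size=data.shape), k) for _ in range(n_ref)]
        gaps.append(np.mean(ref) - log_wk(data, k))
    return np.array(gaps)   # index of the maximum (plus one) is the suggested NC
```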
Fig.4.6 shows the results of the gap statistic on the uniform transformed data. The NC is from
1 to 10. The left figure implies the optimal NC is 5, and Gap(4) is much smaller than Gap(5). It
provides another measure to help choose the optimal NC when silhouette coefficients are close. The
right figure gives an intuitive explanation for the gap statistic. Increasing NC decreases the overall
Wk , and the reference data provide a reference to illustrate the tendency of the decrease. When
′
K is not the optimal NC, the log(WK ) of observed data decreases similarly to the reference data.
When K is the optimal number, observed data are well separated and the WK decreases faster
′
than other cases, thus giving the maximum gap between the observed log(WK ) and the reference
log(WK ). Note the optimal NC selection criterion is modified to give a more intuitive explanation.
The disadvantage of the gap statistic is similar to that of the Hopkins statistic. When outliers are
present, the decrease of WK may not be of the same magnitude for reference data and observed data. The gap statistic is more reliable when data are uniform transformed, so it is mainly used as a complement to the silhouette coefficient.
The cross validation for cluster analysis is also called ”prediction strength”. It is used to validate
the NC obtained from previous steps. The main idea is to conduct clustering twice on different
proportions of data and compare the results. The first time can be the whole data and the second
time randomly sampled 80% of the data. Since the proportion of data are randomly sampled, the
general shape of data should remain the same. If the results are similar, data are well clustered. If
the results are fairly different, data can be partitioned because the NC is not optimal or the clustering
method is not suitable.
First, all data are clustered (training data), obtaining cluster centroids. Then part of data are
sampled (testing data). They can be a proportion of data with or without replacement. 80% of data
without replacement are used here. Testing data are clustered, obtaining cluster labels. For testing
data sharing the same label, they are classified using cluster centroids from the training data. If
data in the same testing cluster are classified into the same group, the prediction strength for this
cluster is high. If they are classified into multiple different groups, the prediction strength is low.
The resulting prediction strength is the lowest among all clusters.
Suppose the training and testing data are clustered into K clusters separately. The prediction
strength:
ps(K) = \min_{r = 1, \cdots, K} \frac{1}{N_r (N_r - 1)} \sum_{i \ne i' \in C_r} D[C(X_{tr}, K), X_{te}]_{i,i'}
where Nr is the number of data in testing cluster Cr , i and i′ are the data in cluster Cr , C(Xtr , K) means clustering the training data Xtr into K clusters, and the K training centroids are then used to cluster the testing data Xte . D[C(Xtr , K), Xte ]i,i′ = 1 if testing data i and i′ are clustered together using the training centroids, and D[C(Xtr , K), Xte ]i,i′ = 0 otherwise. The summation term calculates the pairwise similarity in testing cluster Cr . From the equation, ps(K) ∈ [0, 1]. The minimum is used because the cluster giving the minimum prediction strength is where partitioning happens. The more similarly the training and testing data are clustered, the more robust the resulting NC.
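A minimal sketch of the prediction strength using k‑means; the 80% sampling without replacement follows the text, while the helper names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def prediction_strength(data, k, test_frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data, float)
    test = data[rng.choice(data.shape[0], int(test_frac * data.shape[0]), replace=False)]

    train_model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)  # training centroids
    train_labels_on_test = train_model.predict(test)                         # classify testing data
    test_labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(test)

    ps = []
    for r in range(k):
        members = test_labels == r
        n_r = members.sum()
        if n_r < 2:
            ps.append(0.0)
            continue
        counts = np.bincount(train_labels_on_test[members], minlength=k)
        ps.append(counts @ (counts - 1) / (n_r * (n_r - 1)))  # co-assigned pairs / all pairs
    return min(ps)
```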
Fig.4.7 shows the prediction strength over a range of K values on the uniform transformed
data. ps(1) is always 1.0 because the training and testing data have only one cluster, so all data in the one testing cluster are grouped together by the only centroid. The 5 clusters have a high prediction strength of 0.95. This validates 5 as the optimal NC and the suitability of k‑means for the data. The plot
should not be used to determine the optimal NC as different testing data size changes the K giving
maximum prediction strength. The optimal NC does not always promise the highest prediction
strength. It should be used to validate the choice of NC and the compatibility of clustering methods
with data. A prediction strength higher than 0.9 indicates the NC is trustworthy.
Fig.4.8 shows the results of k‑means clustering using 5 clusters. In general, the 5 clusters are well
identified. Some data are mis‑clustered, such as the top‑left corner data in the blue cluster. The
k‑means results are reasonable when data are uniform transformed with spikes spread. For high
dimensional data, there is no luxury of visualizing the results, so the workflow is more important.
Its compatibility with different clustering methods and different data transformation is examined
in the next section.
In this section, the compatibility of the proposed workflow with different transformations is examined. The transformations include linear scaling, uniform score transformation with spikes spread and preserved, and normal score transformation with spikes spread and preserved. The clustering methods used are k‑means and GMM. The synthetic data containing spikes from the previous section are used here. The clustering methods and transformations deemed appropriate are then used to examine the real data.
The linear transform rescales variables into the same range. Otherwise, dimensions with larger
scale may distort distance based clustering. Suppose a dataset with N samples and M dimensions
dij , i = 1, · · · , N, j = 1, · · · , M . Consider in dimension j:
dr_{ij} = \frac{d_{ij} - \min(d_j)}{\max(d_j) - \min(d_j)}
where drij is the rescaled data, and dj represents all N samples in dimension j. The transform
rescales all variables to a range of [0,1]. The advantage of the linear transform is that it preserves
the original shape of data, which is preferred in cluster analysis. The method also has several limi‑
tations. It does not have alternative ways to handle spikes, and it is not robust to centroid shifting
caused by outliers.
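A short sketch of the rescaling; it assumes every column has a non‑zero range:

```python
import numpy as np

def rescale_01(data):
    """Min-max rescale every column to [0, 1]."""
    data = np.asarray(data, float)
    lo, hi = data.min(axis=0), data.max(axis=0)
    return (data - lo) / (hi - lo)   # assumes hi > lo in every dimension
```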
Fig.4.9 shows the rescaled data. The shape remains the same as in Fig.4.2 but the data are
rescaled from 0 to 1. The Hopkins statistic of the data is 0.94, indicating cluster existence. Gap
statistic and silhouette coefficients are calculated over a range of NC. Fig.4.10 shows the results
when data are clustered using k‑means and GMM. Both clustering tools show the optimal NC is 5,
except for the silhouette coefficient plot when using GMM. The results of k‑means and GMM clus‑
tering using 5 clusters are shown in Fig.4.11. It is obvious that the k‑means method gives better results.
GMM clusters the BDL data as one single cluster, which is shown at the bottom left corner in the
right figure. This comes from GMM fitting a kernel specific to the spike formed by BDL data. The
spike also results in the wrong optimal NC using silhouette coefficient and GMM in Fig.4.10. The
cross‑validation result of k‑means is 0.98 while that of GMM is 0.52, which means the results of GMM change drastically when a different proportion of the data is used. GMM is not compatible with this transformation when spikes are present.
Since the true label is available for the dataset, the percentage of data correctly clustered can be
calculated. More specifically, consider one cluster at a time and calculate the percentage of data that
truly belong to that cluster. The minimum percentage among all clusters is used as a measurement
for clustering performance and is referred to as the correctness rate. Suppose there are K clusters after
clustering. In a specific cluster Cr , r = 1, · · · , K, there are Tr true labels and each label is denoted
as t. The correctness rate Rr in this cluster is calculated as
R_r = \max_{t = 1, \cdots, T_r} P(t)
where P (t) is the proportion of the true label t in cluster Cr . The correctness rate of the clustering
results R is calculated as
R = min(Rr ) r = 1, · · · , K (4.4)
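A minimal sketch of the correctness rate, assuming Rr is the proportion of the dominant true label within each predicted cluster, which is consistent with the worked values (e.g. 0.72) discussed below:

```python
import numpy as np

def correctness_rate(pred_labels, true_labels):
    """R = minimum over clusters of the proportion of the dominant true label (Eq. 4.4)."""
    pred_labels, true_labels = np.asarray(pred_labels), np.asarray(true_labels)
    rates = []
    for c in np.unique(pred_labels):
        in_cluster = true_labels[pred_labels == c]
        _, counts = np.unique(in_cluster, return_counts=True)
        rates.append(counts.max() / in_cluster.size)   # R_r for this cluster
    return min(rates)
```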
The k‑means results have a correctness rate of 0.90 while the correctness rate of GMM results is only
0.72. This means that in the cluster labeled orange, only 72% of the data truly belong to that cluster. GMM does not appear appropriate for the linearly transformed data when spikes are present.
Now consider the scenario of outliers, which is very common in real data. Fig.4.12 is the same
synthetic dataset but with some outliers added. Some outliers are 10 times larger than the main
data. Since GMM does not work well when spikes are present, only the performance of k‑means
is examined here. The Hopkins statistic is 0.97, showing cluster existence in the data. Fig.4.13
shows the plots of gap statistic and silhouette coefficient, indicating the optimal NC should be 8.
Fig.4.14 shows the eight clusters using k‑means. Although the optimal NC is not the same as the
Figure 4.10: Gap statistic (left) and silhouette coefficient (right) on the linearly transformed data.
Figure 4.11: K‑means (left) and GMM (right) clustering results on the linearly transformed data.
true one, all five major clusters are well clustered and the outliers are separately clustered. By using
k‑means, data with outliers can still be well clustered. The outliers only affect the analysis of NC. However, the cross‑validation result is 0.0 when using 8 clusters. This could come from the testing samples containing outliers. The number of data in outlier clusters is small, so it is possible that the outliers in the testing data are clustered very differently from those in the training data. If the real NC is used, the data are poorly clustered as shown in Fig.4.15. This is because the outliers change the cluster centroids drastically. Thus, using k‑means for linearly transformed data is reasonable, but when outliers are present, the number of data in each cluster may need to be examined.
There are two reasons for the uniform transformation. Firstly, it scales different dimensions to a range of [0,1]. Secondly, the quantile transformation handles outliers. Outliers distort the clustering by shifting the cluster centroids drastically or changing the real NC, as shown in the previous section; after the quantile transformation, outliers appear as normal data and their effect on the clustering results is minimized. The quantile transformation can also treat spikes differently, which increases its compatibility with different clustering methods.
First, the synthetic data are uniform transformed with spikes spread out. The robustness of using
Figure 4.13: Gap statistic (left) and silhouette coefficient (right) on linearly scaled data containing outliers.
Figure 4.14: Results of k‑means clustering using a cluster number of 8. The right plot is a zoomed‑in scatter plot of the region of interest.
Figure 4.16: Results of gap statistic and silhouette coefficient using GMM when data are uniform transformed
with spikes spread.
k‑means on this transformation is demonstrated in the previous section, so only GMM is examined in this part. Fig.4.3 shows the transformed data. The Hopkins statistic of the data is 0.8. Fig.4.16 shows the plots of the gap statistic and silhouette coefficient against different NC, and the results indicate the optimal NC is 5. Fig.4.17 shows the results of GMM clustering. The 5 clusters have similar shapes to those in the k‑means results (Fig.4.8), but there are more wrongly clustered data in the upper right and lower right corners of the red cluster. The cross‑validation score is 0.84 for the GMM results, which is lower than that of k‑means. The correctness rate of the resulting clusters is 0.78; in the worst scenario (the red cluster), only 78% of the data truly belong to that cluster. By contrast, the correctness rate of k‑means is 0.9. Therefore, when data are uniform transformed with spikes spread out, it is better to use the k‑means method.
Now the data are uniform transformed with spikes preserved. Fig.4.18 shows the transformed data. Outliers are at the margins, and they appear much closer to the main data than in the original units. The spikes are obvious on the marginal histograms, and 5 clusters are visually distinguishable. The Hopkins statistic is 0.83. It is higher than in the spikes‑spread case because the spike data remain
Figure 4.17: Results of GMM clustering when data are uniform transformed with spikes spread.
close together. Fig.4.19 shows the gap statistic and silhouette coefficient plots using k‑means and GMM. The k‑means results indicate the optimal NC should be 5, but the GMM results have problems identifying an optimal NC. Fig.4.20 shows the results of the two clustering methods. K‑means performs better than GMM: the 5 clusters are separated better, although there are some mis‑clustered data in the red cluster. The outliers do not influence the clustering as much as in Fig.4.15; their contributions to shifting the cluster centroids are similar to those of the other data. The GMM clusters show similar features to the linearly transformed case. The spike is separated independently rather than being grouped with other data; the corresponding cluster is shown as a single dot in the figure. This leads to the wrong grouping of the green and orange clusters. The cross‑validation score for k‑means is 0.96 while the score for GMM is only 0.46, which means the data are merely partitioned by GMM. The correctness rate for k‑means is 0.91 while the rate is only 0.61 for GMM. When clustering uniform transformed data with spikes preserved, it appears better to use the k‑means method.
The Gaussian transform is another type of quantile transformation and is widely used in geostatistical modeling. It shares similar advantages with the uniform transformation. The transformed data are concentrated in the center of the multi‑Gaussian space. This can make the Hopkins statistic and gap statistic inaccurate, as they use the uniform distribution as their reference distribution. These two statistics could be adapted to use the Gaussian distribution as reference in future work.
Fig.4.21 shows the Gaussian transformed data with spikes spread, as seen on the marginal histograms. The Hopkins statistic is 0.93. This high value comes from the 5 clusters and from the Gaussian transformation concentrating data in the center of the space. The five clusters are not as visually distinguishable as in the previous transforms, because the data are centered around the (0,0) coordinate.
Figure 4.18: Synthetic data after uniform transform and spikes preserved.
Figure 4.19: Results of gap statistic and silhouette coefficient using k‑means and GMM when data are uniform
transformed with spikes preserved.
Figure 4.20: Resulting clusters from k‑means (left) and GMM (right) when data are uniform transformed with
spikes preserved.
Figure 4.21: Gaussian transformed synthetic data with spikes spread out.
Fig.4.22 shows the gap statistic and silhouette coefficient plots. Only the gap statistic on the k‑means results gives the right NC, which indicates this transformation is not very compatible with the workflow. Fig.4.23 shows the clustering results using 5 clusters. Neither approach gives reasonable clustering results. GMM gives worse results by clustering the transformed spike as a single cluster. Although k‑means performs relatively better, many data are mis‑clustered into the orange cluster. The clustering does not separate the data well, especially at the center of the space where the data distribution is dense. The cross‑validation scores are 0.54 and 0.52 for k‑means and GMM respectively, which means the transformed data are not reasonably clustered using these two methods. The correctness rate of k‑means is only 0.77, which is much lower than that of the uniform transformed data, and the correctness rate for GMM is 0.43. So transforming data to Gaussian units with spikes spread may not be compatible with the workflow and clustering methods.
Fig.4.24 shows the Gaussian transformed data with spikes preserved. The five clusters are more visually distinguishable than in the spikes‑spread case. The Hopkins statistic is 0.92. Fig.4.25 has similar patterns to Fig.4.22: only the gap statistic on the k‑means results gives the correct NC, and the other methods cannot give a reasonable estimate of the optimal NC. Fig.4.26 shows the GMM and k‑means clustering results using 5 clusters. K‑means clusters the data relatively better. The five clusters are well separated, despite some data being mis‑clustered in the blue and purple clusters. GMM fits a specific kernel for the spike, which is shown as the red cluster on the left margin. Four major clusters are mixed into two clusters, and the outliers are clustered into one group. The GMM clustering on the transformed data is not successful. The cross‑validation results are 0.95 and 0.48 for k‑means
Figure 4.22: Results of gap statistic (left) and silhouette coefficient using k‑means and GMM (right) when data
are Gaussian transformed with spikes spread out.
Figure 4.23: Clustering results using k‑means (left) and GMM (right) when data are Gaussian transformed
with spikes spread.
and GMM respectively; GMM only partitions the data. The correctness rate for k‑means is 0.91, while the correctness rate for GMM is 0.60. The workflow on GMM results cannot provide the correct optimal NC and the clustering results are poorly separated, so GMM may not be compatible with the Gaussian transformed data with spikes preserved.
Table 4.1 summarizes the compatibility of the data transformations and clustering methods with the workflow. The correctness rates of k‑means are generally higher than those of GMM. When BDL data and outliers are present in the data, k‑means is more compatible with the transformations than GMM. The Gaussian transformation with spikes spread is not an appropriate transformation, as both clustering methods give low correctness rates. The linear transformation does not handle outliers well, as it either gives an incorrect NC or fails the cross‑validation test. Although the Gaussian transformation with spikes preserved performs well using k‑means, only the gap statistic gives the correct NC; when analyzing real data, this transformation may have difficulty determining the optimal NC. The methods with low correctness rates either do not give the optimal NC or partition the data rather than cluster them. Since the synthetic data do not have complicated distributions to
Figure 4.25: Results of gap statistic (left) and silhouette coefficient using k‑means and GMM (right) when data
are Gaussian transformed with spikes preserved.
Figure 4.26: Clustering results using k‑means (left) and GMM (right) when data are Gaussian transformed
with spikes preserved.
Transformation        k‑means   GMM
Linear Transform        0.90    0.72
Uni. Spread             0.90    0.78
Uni. Preserved          0.91    0.61
Gauss. Spread           0.77    0.43
Gauss. Preserved        0.91    0.60
Table 4.1: The correctness rates for each transformation and clustering method.
be clustered, the appropriate transformations that can be applied to real data should have correctness rates above 0.9; these include the linear transform, the uniform transform with spikes spread or preserved, and the Gaussian transform with spikes preserved.
There are two features worth noting: the poor clustering performance of GMM when spikes are preserved, and the poor compatibility of the Gaussian transformation with the workflow. GMM is a distribution‑based clustering method. When spikes are preserved, their distribution is very different from that of the rest of the data. Therefore, GMM tends to fit a sharp Gaussian model with almost zero variance to the spikes. The rest of the data cannot be grouped into this Gaussian kernel, even though they are close to the spike data. The synthetic clusters are relatively isotropic; GMM may perform better when the clusters are highly anisotropic. The poor compatibility between the Gaussian transform and the workflow could come from the mechanism of the transformation. The reference state of the workflow is the uniform distribution, but the Gaussian transformation pushes data toward the center of the space while spreading the marginal data farther apart, which changes the reference distribution of the data. Thus, the Gaussian transformed data are not very compatible with the workflow.
In this section, the appropriate transformations are applied to examine clusters in real data. As shown above, k‑means outperforms GMM when BDL data are present, so only k‑means clustering is applied here. The proposed workflow is applied to determine the NC. The real data used in this chapter come from the Government of the Northwest Territories (Falck et al., 2012). The missing data are eliminated. The data consist of 8500 observations and 46 variables. Some of the variables contain over 1000 BDL data as shown in the previous chapter. Since the NC in high‑dimensional data cannot be visually examined, the workflow is of great importance.
Some may argue there are too few data for the 46‑dimensional space, a concern known as "the curse of dimensionality" (Bellman, 1966; McLachlan, 2004; Taylor, 1993). For an M‑dimensional space, the number of data should be on the order of 10^M to make meaningful statistical inference. This is necessary for inferring means and variances, but it is not a problem for cluster analysis. In Fig.4.27, there are only 10 data in a 2D space but the two clusters are still distinguishable. In cluster analysis,
the proximities between data are more important than the number of data.
The Hopkins statistic for the 46‑dimensional data is 0.74, indicating that clusters may exist. The data are transformed using the transformations with correctness rates higher than 0.9 from the previous section. Fig.4.28 shows the gap statistic and silhouette coefficients on the k‑means clustering results. The gap statistic and silhouette coefficient show different trends. The optimal NC is between 6 and 10 for the gap statistic, and different transformations give different plots. This uncertainty indicates that the NC provided by the gap statistic may not be reliable, and the gap statistic should only be used as a complement to the silhouette coefficient. All silhouette coefficient plots indicate the optimal NC is 2, and the value at 2 is much higher than at other NC, so 2 clusters are used for the further analysis. The cross‑validation results on the four transformations are all above 0.9, indicating that k‑means separates the two clusters in the real data well.
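A minimal sketch of this NC‑selection step is shown below, assuming the transformed data are available as a matrix X (one row per observation); scikit‑learn's KMeans and silhouette_score are used as stand‑ins for the implementation used in this work, and the function name is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_nc(X, nc_range=range(2, 11), random_state=0):
    """Silhouette coefficient of k-means results over a range of NC.

    A sketch: X is the (already transformed) data matrix, one row per sample.
    The NC with the highest score is taken as the optimal number of clusters.
    """
    scores = {}
    for nc in nc_range:
        labels = KMeans(n_clusters=nc, n_init=10, random_state=random_state).fit_predict(X)
        scores[nc] = silhouette_score(X, labels)
    return scores

# Usage on hypothetical transformed data X (n_samples x 46):
# scores = silhouette_by_nc(X)
# best_nc = max(scores, key=scores.get)
```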
It is difficult to visualize the 46‑dimensional space to validate the two clusters, but the high‑dimensional data can be projected onto an easily visualized 2D plane. If there are two obvious clusters on such a plane, we can conclude that there are at least two clusters in the real data. Since the workflow has eliminated other possible NC, two clusters on a 2D plane are sufficient to validate the existence of two clusters in the high‑dimensional space.
Algorithm 2 explains the procedure for finding such planes. The real data are projected onto a 2D plane, k‑means with 2 clusters is conducted on the 2D data, and the corresponding silhouette coefficient is obtained. The process iterates thousands of times, and the 2D plane with the highest silhouette coefficient is returned. While there are an infinite number of 2D planes, the algorithm only needs to repeat enough times to find some plane that shows the clusters. The procedure for sampling planes effectively is explained in Algorithm 3. Imagine there are 3 clusters in a 3D space. To view the most separable clusters on a 2D plane, the plane should be the one created by the 3 cluster centroids. When there are only 2 clusters, planes are sampled parallel to the line connecting the two cluster centroids. The same logic can be applied to higher dimensional spaces. When there are more than 3 clusters,
Figure 4.28: Gap statistic and silhouette coefficient results for different transform methods using k‑means.
3 cluster centroids are randomly sampled to generate a plane. The algorithm samples around the cluster centroids for more flexibility.
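The following sketch illustrates the idea behind Algorithms 2 and 3 (it is not the thesis implementation): candidate planes are built from a jittered centroid‑difference direction plus a random second direction, and the plane whose projected data give the highest 2‑cluster silhouette coefficient is kept. The function and parameter names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_projection_plane(X, n_trials=500, jitter=0.1, random_state=0):
    """Search for a 2D plane on which two clusters are most visible."""
    rng = np.random.default_rng(random_state)
    centroids = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(X).cluster_centers_
    base = centroids[1] - centroids[0]          # direction between the two centroids
    best_score, best_basis = -1.0, None
    for _ in range(n_trials):
        # jittered copy of the centroid direction plus a random second direction
        v1 = base + jitter * np.linalg.norm(base) * rng.standard_normal(X.shape[1])
        v2 = rng.standard_normal(X.shape[1])
        # Gram-Schmidt orthonormalization of the two directions
        v1 /= np.linalg.norm(v1)
        v2 -= v1 * (v1 @ v2)
        v2 /= np.linalg.norm(v2)
        basis = np.vstack([v1, v2])             # 2 x M projection matrix
        proj = X @ basis.T                      # N x 2 projected data
        labels = KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(proj)
        score = silhouette_score(proj, labels)
        if score > best_score:
            best_score, best_basis = score, basis
    return best_basis, best_score
```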
The purpose of the algorithms is not to find the plane that gives the globally highest silhouette coefficient; it only needs to find one plane that shows the two clusters in the projected 2D data. After applying the algorithm to the differently transformed data, Fig.4.29 shows the 2D projected data that reveal 2 clusters. The coordinates do not have to be aligned with any variables. The two clusters in the left two figures are closely connected with some overlap; they have elongated shapes and can be distinguished by their different orientations. The two clusters in the two right figures have elliptical shapes, and they can be distinguished by their different sizes and cluster centers. All four transformed datasets indicate 2 clusters on a 2D plane. We can therefore conclude that there can be two clusters in the real high‑dimensional data.
4.5 Conclusion
This chapter covers several aspects of cluster analysis on data with spikes. First, the workflow for determining the optimal NC is introduced. The workflow consists of several statistical tools. The Hopkins statistic examines whether there are any clusters in the data. The silhouette coefficient and gap statistic find the optimal NC by plotting their values against a range of NC; the optimal NC should have the highest value of these measurements. Prediction strength validates whether the chosen NC is reliable by applying the clustering to testing data. This chapter also examines the compatibility of the workflow with different transformations and clustering methods. The two clustering methods are k‑means and GMM. Given the true labels of the synthetic data, the correctness rate measures the proportion of data that truly belong to the same cluster, and it is used to evaluate the performance of the clustering results. For the synthetic data, the suitable transformations are linear rescaling, the uniform transform with spikes spread or preserved, and the Gaussian transform with spikes preserved. The appropriate clustering method is k‑means. These transformations and k‑means clustering are applied to real high‑dimensional data with many BDL spikes. The results show 2 clusters in all four differently
transformed datasets. To validate the results visually, the high‑dimensional data are projected onto 2D planes, using the proposed algorithm to sample planes efficiently. All four figures indicate there are two clusters in the 2D projected data.
CHAPTER 5
Ensemble Clustering and Classification
5.1 Introduction
5.1.1 Motivation
In exploratory data analysis (EDA), clustering techniques group large datasets into smaller groups, making further analysis more precise. When clustering geostatistical data, the incompatibility between the properties of the multivariate space and of the spatial coordinates can be problematic. Most clustering techniques ensure continuity in multivariate space, but the corresponding spatial domains are scattered, and this causes difficulties in geostatistical modeling (M. J. Pyrcz & Deutsch, 2014). For example, scattered domains increase the difficulty of variogram inference and of predicting which domain label prevails at an unsampled location. In this chapter, labels refer to multivariate labels and domains refer to spatial labels.
In Fig.5.1, the bivariate data are clustered using k‑means (Krishna & Murty, 1999). Note that XCOO represents the x coordinates and YCOO represents the y coordinates. Although the bivariate space is grouped into two continuous clusters, the spatial domains are scattered. In Fig.5.2, the k‑means clustering is conducted on the spatial data, although this is not recommended because the
clustering on spatial data cannot accommodate the complex spatial shapes of geological features. While the spatial continuity is assured, the multivariate clustering result is scattered. In these examples, there is a trade‑off between the continuity of the multivariate clusters and of the spatial domains (they are the same labels, but shown in different spaces).
Figure 5.1: k‑means clustering results on 2D multivariate data. Left represents the multivariate labels. Right
is the domain distribution.
Figure 5.2: k‑means clustering results on 2D spatial data. Left represents the multivariate labels. Right is the
domain distribution.
To address this trade‑off, Martin (2019) proposes a clustering method that considers spatial continuity and is combined with ensemble clustering. This work extends Martin's work into an optimization framework. First, ensemble clustering is introduced. Ensemble clustering assembles multiple clustering techniques to obtain an optimal clustering result (Fred & Jain, 2005). The individual clusterings used in the ensemble are also called weak clusterings (the terms weak clustering and individual clustering are used interchangeably). As shown in Fig.5.3, the individual k‑means clusterings (weak clusterings) do not group the clusters well, but the merged ensemble clustering result looks correct.
Figure 5.3: An illustration of ensemble clustering method. Left four plots are individual clustering results.
Right plot is the merged ensemble clustering result.
The weak clusterings are generated in a manner similar to the samples in random forests (Svetnik et al., 2003). Each individual clustering uses a different clustering technique, set of data, and number of clusters. These parameters do not have to be correct, but they must be varied enough to generate approximately independent and identically distributed samples. From the ensemble of these weak clusterings, a similarity matrix is obtained, and this matrix is used in hierarchical clustering (Johnson, 1967) to determine the final clustering results.
In Martin (2019), the individual clustering technique considers both the spatial domains and the multivariate clusters. The performance of each individual result is measured quantitatively. The spatial continuity is measured using entropy (Shannon, 2001) and the multivariate continuity using the within‑cluster sum of squares (WCSS). High entropy represents scattered domains and high WCSS represents scattered clusters. These two measurements are negatively correlated: as observed from the previous figures (Fig.5.1 and Fig.5.2), when the spatial entropy of a clustering result is high, the WCSS is low, and vice versa (Fig.5.4). Reasonable clustering results should have low entropy and low WCSS. To obtain the optimal ensemble clustering result, the entropy and WCSS of each individual clustering in the ensemble are calculated and only the individual clusterings with entropy and WCSS values below specific thresholds are used for the ensemble merging.
Although Martin's method provides reasonable results, subjective choices of the thresholds determine the quality of the final labels. When deciding the thresholds, practitioners need to examine dozens of individual results. However, there are thousands of results that could be merged, so the quality of the ensemble result is uncertain and subjective.
Hierarchical clustering is a classic and robust clustering method in addition to k‑means and the Gaussian mixture model (GMM). It is used for merging the individual clusterings in ensemble clustering.
Figure 5.4: The WCSS and entropy are negatively correlated (Martin, 2019).
The main difference of hierarchical clustering from the other two methods is that the clusters do not have centroids, so the resulting clusters can take any shape. This feature is useful when the shapes of the clusters are uncertain. There are two types of hierarchical clustering: agglomerative and divisive. The first is also referred to as the "bottom‑up" approach and the latter the "top‑down" approach. Here, the "bottom‑up" approach is considered. This approach treats each data point as its own cluster at the start, and merges the closest pair at each step. The process iterates until all data are merged into one group.
Similar to k‑means, the distances between data are the key to hierarchical clustering results. Most commonly, the Euclidean distance or the Manhattan distance is used. In the intermediate steps, two groups are to be clustered together; since each group contains many data, there are several ways to determine the distance between groups. The between‑group distance is also called the linkage criterion. Common linkage criteria include complete‑linkage, single‑linkage and average‑linkage (Fig.5.5). As observed from the figure, complete‑linkage takes the maximum distance between data as the distance between groups, single‑linkage takes the minimum distance, and average‑linkage takes the average distance. There is also Ward's criterion (Murtagh & Legendre, 2014; Ward Jr, 1963): at each merge step, Ward's method merges the two groups that lead to the minimum group variance, and it provides results similar to average‑linkage. In the ensemble clustering, the Euclidean distance and average‑linkage are applied.
The last step of hierarchical clustering is to determine the number of clusters. This can be achieved by observing the dendrogram. Fig.5.6 is an example using only 20 data. The x axis shows the data index, and the y axis is the distance between groups. Each data point is shown as a leaf node, and all data are clustered into one group when the distance reaches its maximum. The smaller the distance between groups, the earlier they are grouped together. From the figure, there
are three major groups. To reveal these major groups, 3 clusters are chosen and the corresponding distance is 1.1. The number of clusters can be chosen directly, or the group distance can be set and the corresponding number of clusters easily found. In this chapter, the number of clusters is chosen directly.
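A minimal sketch of this step with SciPy is given below, assuming a small hypothetical data matrix X; average linkage with Euclidean distances is used, and the number of clusters is either requested directly or obtained by cutting the tree at a chosen distance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

# Hypothetical data: 20 samples in 2D, as in the small dendrogram example
X = np.random.default_rng(0).normal(size=(20, 2))

# Euclidean distances, average linkage
Z = linkage(pdist(X, metric="euclidean"), method="average")

# Either request a number of clusters directly ...
labels_by_count = fcluster(Z, t=3, criterion="maxclust")
# ... or cut the tree at a chosen inter-group distance (e.g. 1.1)
labels_by_dist = fcluster(Z, t=1.1, criterion="distance")

# dendrogram(Z) draws the tree used to pick the number of clusters visually.
```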
Figure 5.6: An example of dendrogram using 20 data. x axis is the index of data. y axis is the distance between
data.
A novel workflow is introduced in this chapter. The traditional ensemble clustering is conducted first. The individual clustering technique used is k‑means, with parameters sampled from a set of choices. The purpose is to generate independent and identically distributed random samples: the less correlated the individual clustering results are, the more robust the final ensemble result. The next step is to use the ensemble clustering results as inputs for classification. An objective function in the classification considers both spatial and multivariate continuity, and their relative importance is adjusted through a spatial weight parameter. The inputs of the classification can be
multiple clustering results with different numbers of clusters. At the beginning, the domains are assigned randomly, and the objective function is calculated. Then each data point is resampled through all possible domains. If the new domain improves the objective function, it is preserved. The process keeps iterating until the algorithm converges. Since the algorithm may converge to a local minimum, multiple initial states are generated and the best result is kept. The number of domains and the spatial weight are used as hyper‑parameters. Practitioners can generate a matrix of results given a range of numbers of domains and spatial weights, then choose the optimal one considering the geological understanding of the domains. The robustness of the workflow is checked, and the effect of the spatial weight is demonstrated through geostatistical modeling. With a high spatial weight, the modeling simulates more connected values, while with a low spatial weight, the modeling simulates more random, disconnected values.
The proposed workflow addresses the trade‑off between spatial and multivariate continuity in two steps. The first step considers only the multivariate continuity, generating multiple sets of clustering labels. These labels are used as inputs in the second step, in which the importance of spatial continuity is controlled by the spatial weight. To obtain optimal spatial and multivariate continuity, the quality of the input clustering labels is important. In this section, the procedure for obtaining clustering labels using ensemble clustering is explained, and the superior quality of its results compared with k‑means is demonstrated.
As explained above, ensemble clustering merges multiple individual clustering results to obtain a better one. The individual clustering used here is k‑means with a set of different parameters. To sample independent realizations, each k‑means realization samples 80% of the data with replacement, and the number of clusters ranges from 10 to 25. 100 realizations of k‑means results are used for one realization of ensemble clustering. The similarity matrix used for the merging step is generated by calculating the data's pairwise occurrence in the same group.
Suppose there are N data z(u_i), i = 1, \cdots, N, and individual clusterings are conducted M times on z(u). Each individual clustering k = 1, \cdots, M has its own data samples, denoted z_k(u). The clustering label for z(u_i) is denoted y_k(u_i). If z(u_i) is not sampled in z_k(u), y_k(u_i) = NaN. S is an N × N matrix representing the pairwise similarity. The similarity between z(u_i) and z(u_j) is calculated as:
S_{ij} = \frac{\sum_{k=1}^{M} 1\{y_k(u_i) = y_k(u_j)\}}{\sum_{k=1}^{M} 1\{y_k(u_i) \neq \mathrm{NaN} \ \text{and} \ y_k(u_j) \neq \mathrm{NaN}\}}, \quad i, j = 1, \cdots, N \qquad (5.1)
where 1{True} = 1 and 1{False} = 0. The similarity between data i and j is the proportion of times they share the same label among their co‑occurrences in the M iterations. If two data are grouped together most of the time, they likely belong to the same cluster. For example, in 100 realizations,
z(ui ) and z(uj ) are present together 80 times and are grouped together 60 times. Then Sij is 0.75.
After obtaining the similarity matrix from Eq.(5.1), the distance matrix of the data is simply
calculated as
Mij = 1 − Sij , i, j = 1, · · · , N
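A direct, if slow, sketch of Eq.(5.1) is shown below, assuming the weak clustering labels are stored as an M × N array with np.nan marking data not sampled in a given realization; the commented lines indicate how the resulting distance matrix could feed an average‑linkage merge in SciPy. The names are illustrative only.

```python
import numpy as np

def ensemble_similarity(labels):
    """Pairwise similarity from M weak clusterings (Eq. 5.1).

    `labels` is an M x N array: labels[k, i] is the cluster of datum i in
    realization k, or np.nan when datum i was not sampled in that realization.
    """
    M, N = labels.shape
    S = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            both = ~np.isnan(labels[:, i]) & ~np.isnan(labels[:, j])   # co-occurrence
            same = both & (labels[:, i] == labels[:, j])               # same group
            S[i, j] = same.sum() / both.sum() if both.sum() > 0 else 0.0
    return S

# Distance used for the hierarchical merge: M_ij = 1 - S_ij
# D = 1.0 - ensemble_similarity(labels)
# from scipy.spatial.distance import squareform
# from scipy.cluster.hierarchy import linkage
# Z = linkage(squareform(D, checks=False), method="average")
```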
The real data used in the ensemble clustering come from the "Kola Ecogeochemistry Project" (Filzmoser, Garrett, & Reimann, 2005; Reimann, 2005; Reimann, Filzmoser, & Garrett, 2005). The site is famous for its rich mineral deposits. The geochemical data consist of 618 data points and 26 variables. The data are standardized before clustering. The distance matrix (Fig.5.7) is used in hierarchical clustering to determine the ensemble clustering results. As observed from the figure, the distance is equal to 0 for the diagonal elements, as the distance from a data point to itself is 0. The larger the distance between data, the less likely they belong to the same cluster. The next step is to use it as the predefined distance matrix in hierarchical clustering. The advantage of hierarchical clustering is that the number of clusters does not change the clustering mechanism and the optimal number of clusters can be observed from the dendrogram. As shown in Fig.5.8, closer data are grouped earlier. The small clusters grouped last, on the left of the figure, have a long distance to the rest of the data, indicating they could be outliers. The average‑linkage merging criterion is used here.
Figure 5.7: The distance matrix calculated from the ensemble clustering method.
Fig.5.9 and Fig.5.10 show the results of the ensemble clustering and k‑means clustering respectively. The number of clusters ranges from 3 to 8. The silhouette coefficient is used to evaluate the performance of the clustering results: the higher the value, the better the data are grouped in multivariate space. Considering the generally higher silhouette coefficient, the ensemble method
Figure 5.8: The dendrogram calculated from distance matrix. Each node on x axis represents a data point. y
axis represents the data distance.
outperforms k‑means. It is worth noting that in the ensemble clustering, when the number of clusters is low, the clustering technique mainly isolates outliers. In contrast, k‑means groups the outliers with the main clusters, which can be the reason for its lower silhouette coefficient. The performance is only evaluated in the multivariate space; the spatial plot is only used for displaying the label distribution.
Figure 5.9: The result of ensemble clustering on the real data. The x and y axes represent location. Different colors represent different groups.
Figure 5.10: The result of k‑means clustering on the real data. The x and y axes represent location. Different colors represent different groups.
5.3 Classification
One advantage of the proposed workflow is that multiple sets of clustering labels can be used as inputs for the classification. Since there is some randomness in ensemble clustering, multiple ensemble labels can be generated, and the reasonable ones are used as inputs. The difficult problem of finding the single best clustering label is thereby avoided in the first step.
Clustering ensures continuity in multivariate space, but since it does not make sense to conduct clustering on spatial data, given the complexity of geological shapes, a method is needed to ensure the spatial continuity of the domains. Another issue is that the clustering results may identify many small outlier groups; in the case of 8 clusters in Fig.5.9, there are only 2 to 3 main clusters. Finding the major clusters is more important than identifying small outlier groups for this problem. The domain classification mitigates both problems.
Like other classification methods, the domain classification needs an objective function to indicate progress toward an optimum. Since there is a trade‑off, a hyper‑parameter (Wsp, the spatial weight) is needed to adjust the importance of the spatial continuity. The number of domains also needs to be specified. The general format of the objective function is as follows:
where O(d, y) is the objective function value, M(d, y) is the multivariate entropy, S(d) is the spatial entropy, d represents the spatial domains and y represents the clustering labels. The objective
function has the following features: when the number of domains is fixed and Wsp is equal to 0, complete multivariate continuity is ensured; when Wsp increases, the domains are redistributed and deviate from the labels; when Wsp is close to 1, complete spatial continuity is ensured. The classified domains should give the minimum value of the objective function in each circumstance.
Now, the details of each term in Eq.(5.2) are explained. The multivariate entropy measures the discrepancy between the multivariate labels and the spatial domains. Suppose the number of domains is D and the number of clusters is C. In domain i (i = 1, \cdots, D), the probability of finding label j (j = 1, \cdots, C), P_{ij}, is calculated as the number of data with label j in domain i, N(y_j^i), divided by the number of data in domain i, N(d_i). The entropy of domain i is then E_i = -\sum_{j=1}^{C} P_{ij} \log P_{ij}.
The multivariate entropy of the whole dataset is the weighted sum of the E_i and is calculated as:

M(d, y) = \frac{1}{N} \sum_{i=1}^{D} N(d_i) \, E_i \qquad (5.5)
where N is the total number of data. When M (d, y) is low, domains are consistent with labels.
When M (d, y) is high, the labels are distributed randomly within domains. For example, Table.5.1
shows the calculation of the probabilities. There are 3 labels and 2 domains. The probability is
calculated within each domain. After obtaining the probabilities, the entropy for domain 1 and 2
are:
E1 = −(0.133 log 0.133 + 0.2 log 0.2 + 0.667 log 0.667) = 0.86
E2 = −(0.3 log 0.3 + 0.5 log 0.5 + 0.2 log 0.2) = 1.02
The entropy for the whole dataset is the weighted average of E_1 and E_2:

M = \frac{150 \cdot E_1 + 100 \cdot E_2}{250} = 0.924
The distribution of labels in domain 2 is more random, which means the domain and labels are less consistent. When the number of domains is the same as the number of labels and there is only one unique label in each domain, the multivariate entropy is zero, and this represents full multivariate continuity. In practice, there can be P sets of labels used for the multivariate entropy calculation, and the final entropy is the average over the P sets:
M(d, y_1, \cdots, y_P) = \frac{1}{P} \sum_{p=1}^{P} M(d, y_p) \qquad (5.6)
The final objective function is adjusted from Eq.(5.2) to
Table 5.1: Example data for calculating the multivariate entropy. Left: the number of data within each label and domain. Right: the corresponding probabilities.

           Label 1  Label 2  Label 3  Total      Label 1  Label 2  Label 3
Domain 1      20       30      100     150        0.133     0.2     0.667
Domain 2      30       50       20     100        0.3       0.5     0.2
where p(k) is the proportion of domain k within the window and K is the number of available domains. A value of p(k) = 0 contributes 0 to the entropy (in the limit). The average S(d) over all data is simply calculated as

S(d) = \frac{1}{N} \sum_{q=1}^{N} S(d_q)
The more scattered the spatial labels are, the higher S(d) is, and this is not desirable. The optimal
domain distribution should give the lowest possible O(d, y1 , · · · , yP ).
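The sketch below illustrates one plausible implementation of the two entropy terms and of the objective. It assumes the objective combines them as a weighted sum, (1 − Wsp)·M + Wsp·S, and uses a circular local window of hypothetical radius for the spatial entropy, following the search‑window idea of Fig.5.11; this is an assumed form, not the thesis code.

```python
import numpy as np

def multivariate_entropy(domains, labels):
    """M(d, y): weighted average over domains of the label entropy (Eq. 5.5)."""
    total = 0.0
    for d in np.unique(domains):
        members = labels[domains == d]
        p = np.unique(members, return_counts=True)[1] / members.size
        total += members.size * (-(p * np.log(p)).sum())
    return total / domains.size

def spatial_entropy(domains, coords, radius):
    """S(d): average entropy of domain proportions inside a local window.

    `radius` is a hypothetical window size, not a value from the thesis.
    """
    S = np.zeros(domains.size)
    for q in range(domains.size):
        window = np.linalg.norm(coords - coords[q], axis=1) <= radius
        p = np.unique(domains[window], return_counts=True)[1] / window.sum()
        S[q] = -(p * np.log(p)).sum()
    return S.mean()

def objective(domains, label_sets, coords, radius, w_sp):
    """Assumed weighted form of the objective: lower values are preferred."""
    M = np.mean([multivariate_entropy(domains, y) for y in label_sets])
    return (1.0 - w_sp) * M + w_sp * spatial_entropy(domains, coords, radius)
```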
Figure 5.11: An illustration of a local search window. The window is marked as a blue circle.
With the objective function established, it is used to classify the spatial domains. The spatial weight and the number of domains need to be specified. Suppose there are 3 domains. The spatial domains are randomly assigned at the beginning: domains d are uniformly sampled from 1, 2 and 3, and the initial measurement O_init from Eq.(5.7) is obtained. Then the domains are resampled one by one in a random order. If O decreases compared with the previous state, the new domain label is preserved; otherwise it is dismissed. All data are visited once in each iteration, and are revisited in a different order in the next iteration until the algorithm converges. As shown in Algorithm 4, the objective function is recalculated every time a data domain is changed. If there are 3 domains and 600 data, the objective function is calculated 1800 times in each iteration. The maximum number of iterations is set to 10. In practice, the algorithm converges much faster, though it may settle in a local minimum. To overcome this problem, the algorithm is run multiple times, which is equivalent to initiating multiple starting states, and the domains giving the lowest O are preserved.
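A sketch of this greedy classification loop, reusing the objective sketch above, might look as follows; the parameter names and defaults are illustrative only, and this is not the thesis implementation of Algorithm 4.

```python
import numpy as np

def classify_domains(label_sets, coords, n_domains, w_sp, radius,
                     n_restarts=5, max_iter=10, seed=0):
    """Greedy domain classification sketch: random start, visit data in random
    order, keep a new domain only if it lowers the objective; the best of
    several random restarts is returned."""
    rng = np.random.default_rng(seed)
    n = coords.shape[0]
    best_d, best_o = None, np.inf
    for _ in range(n_restarts):
        d = rng.integers(0, n_domains, size=n)            # random initial domains
        o = objective(d, label_sets, coords, radius, w_sp)
        for _ in range(max_iter):
            improved = False
            for q in rng.permutation(n):                  # random visiting order
                current = d[q]
                for k in range(n_domains):                # try every domain at q
                    if k == current:
                        continue
                    d[q] = k
                    o_new = objective(d, label_sets, coords, radius, w_sp)
                    if o_new < o:
                        o, current, improved = o_new, k, True
                    else:
                        d[q] = current                    # revert if no improvement
            if not improved:
                break
        if o < best_o:
            best_d, best_o = d.copy(), o
    return best_d, best_o
```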
With the classification procedure explained, its robustness needs to be validated. When Wsp = 0, the algorithm considers only the multivariate continuity. If there is only one set of input clustering labels and the number of domains is set equal to the number of clusters, then even when starting from a random distribution of domains, a robust algorithm should return a domain distribution identical to the clustering labels. They may have different label names, but the spatial distribution should be the same. Fig.5.12 shows the resulting domains in this situation. The domains are identical to the input clustering labels. Although the colors are different, within a given domain there is only one type of clustering label.
Figure 5.12: The classification of the domains when spatial weight is set to 0. Left is the input clustering labels.
Right is the classified domains.
The algorithm is now used to classify clustering labels with increasing spatial weight Wsp: the spatial continuity increases and the multivariate continuity decreases. Fig.5.13 shows the input clustering labels for the algorithm. The multivariate data used for the ensemble clustering are standardized, which gives each variable a mean of 0 and a variance of 1; this decreases the influence of highly skewed distributions. The labels are obtained from multiple realizations of ensemble clustering with different numbers of clusters. The 6 inputs are denoted y_1, \cdots, y_6. When calculating the objective function, O has the form O(d, y_1, \cdots, y_6). The number of clusters ranges from 8 to 14, of which some are tiny outlier groups. When considering the number of domains, only the major groups are considered; here, the number of domains is set from 3 to 5. Wsp is another important hyper‑parameter; multiple Wsp values are tested and the change in spatial continuity is demonstrated.
Fig.5.14 shows the domains obtained from the input clusters in Fig.5.13. Wsp ranges from 0 to 1, showing how the domain classification places progressively more emphasis on spatial continuity. In each small figure, MV represents the multivariate entropy and SP represents the spatial entropy; the lower the entropy value, the higher the corresponding continuity. In each row (when the number of domains is fixed), as Wsp increases, the multivariate continuity decreases and the spatial continuity increases, which is the behaviour the algorithm is designed to achieve. In each column (when Wsp is fixed), as the number of domains increases, the multivariate continuity increases and the spatial continuity decreases. This is anticipated, as more domains group the data more finely in multivariate space and lead to less continuous domains. With Wsp lower than 0.25, the domains are fairly scattered, while with Wsp larger than 0.75, the domains are too continuous. In practice, Wsp can be set between 0.25 and 0.75, and then fine‑tuned.
When Wsp is zero, classified domains can be viewed as the averaged clustering labels. In some
high‑Wsp figures, the actual number of domains is smaller than the defined value, because when the number of domains decreases, the continuity of the domains increases. Since the domains are assigned randomly at the beginning, the final results have high uncertainty when Wsp is close to 1, because no clustering information constrains the classification. In practice, clustering labels are considered and Wsp is set in a reasonable range, which stabilizes the algorithm. Practitioners can generate their own matrix of domains and make decisions considering external geological knowledge.
The proposed workflow is conducted on data for the purpose of dividing them into groups with distinguishable features, and this can be verified by testing the within‑group variance (Kasim & Raudenbush, 1998). The relative size of the domains is another measure of the classification performance. These quantitative measurements can also be used to determine appropriate hyper‑parameters.
The total variance of the data represents how scattered they are. When the data are grouped into smaller clusters, there are within‑group variance and between‑group variance. If the groups are well clustered, the variance within each cluster should be small, and the corresponding between‑group variance should be large. This also means the differences between data within the same group are small and the differences between data in different groups are large.
Suppose there are N data, and they are classified into K groups. The following variance decomposition holds:
Figure 5.14: The matrix of domains, given multiple Wsp and number of domains.
\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{1}{N} \sum_{j=1}^{K} N_j (\bar{x}_j - \bar{x})^2 + \frac{1}{N} \sum_{j=1}^{K} \sum_{i=1}^{N_j} (x_{ij} - \bar{x}_j)^2 \qquad (5.9)
where N_j is the number of data in group j, \bar{x} is the grand mean over the N data and \bar{x}_j is the mean of group j. The first term is referred to as the total variance, the second term as the between‑group variance and the third term as the within‑group variance. The total variance is a constant when the data are fixed; when the data are classified into different groups, the latter two terms change. When dealing with multivariate data, the data are standardized in each dimension, the variances are calculated in each dimension separately, and the average variances over all dimensions are the inputs for Eq.(5.9). For example, consider data with 5 variables. The standardized data have a mean of 0 and a standard deviation of 1 in every dimension, so the total variance, calculated as the average variance over the 5 variables, is always 1. When the data are clustered, suppose that in one of the clusters the within‑group variances are 0.5, 0.6, 0.7, 0.8 and 0.9 for the 5 dimensions respectively; the average within‑group variance of 0.7 is used in Eq.(5.9).
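The decomposition in Eq.(5.9) can be checked numerically with a short sketch such as the following, where X is an assumed standardized data matrix and groups holds the domain labels; the names are hypothetical.

```python
import numpy as np

def variance_decomposition(X, groups):
    """Total, between-group and within-group variance per Eq. (5.9),
    averaged over the columns of X (assumed standardized)."""
    N = X.shape[0]
    xbar = X.mean(axis=0)
    total = ((X - xbar) ** 2).sum(axis=0) / N
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for g in np.unique(groups):
        Xg = X[groups == g]
        between += Xg.shape[0] * (Xg.mean(axis=0) - xbar) ** 2 / N
        within += ((Xg - Xg.mean(axis=0)) ** 2).sum(axis=0) / N
    # averages over dimensions; total is (numerically) between + within
    return total.mean(), between.mean(), within.mean()

# Hypothetical check on random data of the same size as the Kola dataset:
# rng = np.random.default_rng(0)
# X = rng.normal(size=(618, 26)); g = rng.integers(0, 4, size=618)
# print(variance_decomposition(X, g))
```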
Table 5.2 shows the within‑group variance of the results obtained from the domain classification. The within‑group variance may not decrease dramatically because the variances are averaged over 26 variables, and some variables may not be very informative for the classification. Suppose an extreme case where half of the dimensions have a within‑group variance of 0 and the
Table 5.2: The within group variance of the domains obtained from the classification.
Table 5.3: The entropy measurements of domain sizes obtained from classification.
other half have a within‑group variance of 1; the resulting average within‑group variance is 0.5, so a 0.2 decrease of the average variance can be significant. When Wsp is around 0.5, the within‑group variance is smaller, which indicates the data are better grouped. From the table, the optimal choice of hyper‑parameters can be 5 domains with Wsp equal to 0.25. The results could be improved if Wsp were tuned more finely.
Another aspect of evaluating the classification performance is the domain size. The domain sizes should be similar, although there can be cases where some small domains are very different from the rest. In general, similar‑size domains represent well‑grouped data. To quantify this, the entropy of the domain proportions is measured. For K domains,
E = -\sum_{i=1}^{K} P_i \log P_i \qquad (5.10)
where P_i is the proportion of data grouped into domain i. From the equation, if the entropy is large, the domain sizes are similar; if the entropy is small, one of the domains dominates. Table 5.3 shows the entropy values for the domains in Fig.5.14. These values are consistent with the distributions in the figure: 3 domains with Wsp equal to 0.5 lead to one large domain, and the corresponding entropy is only 0.66, while with 5 domains and Wsp equal to 0.5, the domains have similar sizes and the entropy is 1.57. The entropy combined with the within‑group variance can provide information about distinguishable groups of similar size. Since a low within‑group variance and a high entropy value are preferred, simply dividing the values in Table 5.2 element‑wise by those in Table 5.3 gives the desired measurement. The merged results are shown in Table 5.4, in which lower values represent better classification results. From the table, choosing 5 domains with Wsp between 0.25 and 0.5 gives reasonable results.
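A short sketch of the domain‑size entropy of Eq.(5.10), together with the ratio used to merge Table 5.2 and Table 5.3, assuming a 1D array of domain labels; names are illustrative only.

```python
import numpy as np

def domain_size_entropy(domains):
    """Entropy of domain proportions (Eq. 5.10): high when sizes are similar."""
    p = np.unique(domains, return_counts=True)[1] / domains.size
    return -(p * np.log(p)).sum()

# Combined measure (lower is better), element-wise per hyper-parameter choice:
# score = within_group_variance / domain_size_entropy(domains)
```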
        Al        Si        Mg
Al   1.000000  0.348090  0.046925
Si   0.348090  1.000000  0.004131
Mg   0.046925  0.004131  1.000000
Table 5.5: The correlation matrix of the three variables selected for multivariate modeling.
In this section, the effects of the spatial weight Wsp on geostatistical modeling are demonstrated. The workflow includes simulating a gridded multivariate model in the region of Fig.5.11, relating permeability models to the multivariate models, and running flow simulation on the permeability models. 100 realizations of the multivariate models are simulated for each Wsp. Several realizations are plotted for visual checking, and flow simulation is used to assess the results of all 100 realizations. If the multivariate models are different, the resulting permeability models are different, and this is reflected in highly sensitive, non‑linear response variables such as breakthrough time.
The first step is to choose the variables to be used for multivariate modeling. The variables should be as uncorrelated as possible to assess the influence of different variables. The performance of multivariate modeling is also affected by the number of available data; since there are 618 data, using 3 variables is appropriate. From Fig.5.15, some variables are strongly correlated, such as Cr and As. Modeling these correlated variables does not provide much extra information, so Al, Si and Mg are chosen as the variables for the multivariate modeling. Table 5.5 shows the correlations of the three variables: Mg is not correlated with Al and Si, and Al and Si are only weakly correlated. Fig.5.16 shows the scatter plots of the data. From the figure, the data are more clustered in the low‑value regions and are generally skewed. There are no extreme outliers. Note that the data are standardized, which is the reason for the negative values. Although a negative value has no physical meaning, it does not influence the flow simulation, as the permeability models are generated from the relative values of the data.
The next step is to obtain the domain labels. The domain labels are generated using the procedure demonstrated in Section 5.3. The clustering inputs are shown in Fig.5.17. To demonstrate the effect of different spatial weights, the spatial weight is chosen to be 0.0 and 0.7. The reason for not choosing
Wsp larger than 0.7 is that full spatial continuity ignores the clustering inputs and the final domains may contain artificial errors. The number of domains is set to 3. Fig.5.18 shows the results of the domain classification and the location maps of the variables. As observed from the figure, the domain distribution for Wsp = 0.0 is scattered, as only multivariate continuity is considered, while the domain distribution for Wsp = 0.7 is more continuous, as expected. Note that the domain names can differ in each case; for example, domain 1 for Wsp = 0.0 is referred to as domain 0 for Wsp = 0.7. The data distributions are shown in the second row. For Al, high‑value data cluster in the southeast region. For Si, the southern region has generally higher values than the northern half. In contrast, Mg has higher values in the northern part.
Fig.5.19 shows the domain labels in multivariate space, and the distributions are consistent with the statement of this chapter: spatially scattered domains have continuous multivariate clusters, while spatially continuous domains have scattered multivariate clusters. Fig.5.20 illustrates 3 realizations of the domain models for each Wsp. The simulated domain layouts are consistent with the domain labels. The grid size is 50 × 50. The data in Fig.5.18 are modeled independently within each domain, and then merged together based on the domain models in Fig.5.20. For example, data labeled as domain 0 are used as inputs for multivariate modeling, and the results are kept only where the grid cell is labeled 0.
[Figure 5.17 panels: ensemble clustering inputs with 8, 12 and 14 clusters.]
Figure 5.17: The cluster labels used as inputs for the domain classification.
When conducting multivariate modeling, it is important to model the variables efficiently and retain their multivariate shape. The projection pursuit multivariate transform (PPMT) is used for this purpose (Barnett, Manchuk, & Deutsch, 2014; Barnett, Manchuk, Deutsch, et al., 2016). The idea is to use a series of transformations to bring the multivariate data to a multi‑Gaussian shape, model the transformed variables independently and back‑transform, restoring the original multivariate relations. The transformation methods include linear decorrelation such as principal component analysis (Abdi & Williams, 2010) and min/max autocorrelation factors (Vargas‑Guzmán & Dimitrakopoulos, 2003). These transformations start with sphered data (normal scored and with a correlation of 0).
Figure 5.18: The domain labels and the location map of the three variables.
Figure 5.19: The domain labels in multivariate space. Upper row for Wsp = 0.0. Lower row for Wsp = 0.7 .
Figure 5.20: Categorical modeling of the domains with grid size 50 × 50.
Correlation only describes linear relations, so the sphered data are not guaranteed to have a multi‑Gaussian shape. If data have a multi‑Gaussian shape, then when projected onto any 1D direction they should retain a univariate Gaussian shape, and this direction does not have to be aligned with the coordinates.
The PPMT method projects the sphered data onto the 1D direction that currently exhibits the most non‑Gaussian shape and normal scores the data along that direction. The multivariate data are then projected onto the next most non‑Gaussian direction and normal scored again. The process is repeated iteratively until the multivariate data exhibit the desired level of multi‑Gaussianity. These 1D transforms are recorded for the back‑transformation. Fig.5.21 shows the plots of the multivariate data after PPMT when Wsp = 0.7, and they are multi‑Gaussian in shape. Note that the multivariate models are simulated independently in each domain, so the data are transformed separately based on their domain label. With the data in a multi‑Gaussian shape, the variables can be modeled independently.
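For illustration only, the following is a heavily simplified, forward‑only sketch of the projection pursuit idea, not the PPMT implementation of Barnett et al. (2014): the data are normal scored and sphered, and an approximately most non‑Gaussian direction is repeatedly found by random search and Gaussianized. The back‑transform bookkeeping that PPMT records is omitted, and the search parameters are hypothetical.

```python
import numpy as np
from scipy import stats

def normal_score(x):
    """Rank-based transform of a 1D array to standard normal quantiles."""
    ranks = stats.rankdata(x)
    return stats.norm.ppf((ranks - 0.5) / x.size)

def ppmt_like_forward(X, n_iter=20, n_directions=200, seed=0):
    """Simplified forward projection-pursuit Gaussianization (sketch only)."""
    rng = np.random.default_rng(seed)
    # normal score each variable, then sphere (decorrelate) with PCA whitening
    Y = np.column_stack([normal_score(X[:, j]) for j in range(X.shape[1])])
    vals, vecs = np.linalg.eigh(np.cov(Y, rowvar=False))
    Y = (Y @ vecs) / np.sqrt(np.maximum(vals, 1e-12))
    for _ in range(n_iter):
        # random search for the direction whose projection looks least Gaussian
        dirs = rng.standard_normal((n_directions, Y.shape[1]))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        ks = [stats.kstest(Y @ d, "norm").statistic for d in dirs]
        d = dirs[int(np.argmax(ks))]
        # normal score along that direction, leaving the orthogonal part unchanged
        p = Y @ d
        Y += np.outer(normal_score(p) - p, d)
    return Y
```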
Since the data are modeled within each domain and each variable is modeled independently, there are 9 variograms inferred for each Wsp. As observed from Fig.5.22 and Fig.5.23, the variogram ranges are longer for Al and Mg. When Wsp = 0.7, the variograms are relatively more stable, because the variogram inference has more available pairs when the domains are more continuous. The major direction is 110° azimuth and the minor direction is 20° azimuth. When the resulting variograms are not stable, an omnidirectional search is used. Note that the variograms are inferred from the normal score transformed data, not the PPMT transformed data (refer to Barnett et al., 2014). Each variogram generates 100 realizations of the univariate models. These univariate models are back‑transformed to the original units first and then combined with the domain realizations in Fig.5.20, giving 100 merged multivariate realizations for each Wsp.
Figure 5.21: The scatter plots of the variables after PPMT. Each row represents the transformed multivariate
data in each domain.
Figure 5.22: The variograms of variables in each domain for Wsp = 0.0.
Figure 5.23: The variograms of variables in each domain for Wsp = 0.7.
In the Wsp = 0.0 case, the conditioning data are scattered in multivariate space, so the resulting multivariate simulations are not as smooth. This feature may be less obvious when the number of domains increases and the domain simulations become more scattered, but their effect on the simulation smoothness may not be as dominant as that of the conditioning data.
Figure 5.24: One of the realizations of three variables after merging the domain labels. The upper row is for
Wsp = 0.0. The lower row is for Wsp = 0.7.
Fig.5.25 illustrates the scatter plots of the merged results. The top row shows the original data (618 data), the middle row shows one realization of the Wsp = 0.0 results (2500 data), and the bottom row shows one realization of the Wsp = 0.7 results (2500 data). Both simulation results retain the original multivariate shape of the data. The difference lies in the proportion of low and high values: the Wsp = 0.0 results have a larger proportion of low values, while the Wsp = 0.7 results have a larger proportion of high values. This feature may come from the data merging. In the upper row of Fig.5.20, the high values are mostly in the blue domain; when merged in the last step of modeling, most of the high values are clipped. In the Wsp = 0.7 case, high values are distributed over the three domains and are more easily preserved in the merging step.
The observations on Fig.5.24 and Fig.5.25 are based on a visual check of several realizations. Flow simulation is applied to all realizations to validate these observations. First, the multivariate models are converted to univariate permeability models. Since the data are standardized, when adding the variables together they should contribute similarly to the sum. A new variable D is defined as
Figure 5.25: The scatter plots of the variables from original data and the realizations of Wsp = 0.0 and Wsp =
0.7.
D(u_i) = Al(u_i) + Si(u_i) - Mg(u_i), \quad i = 1, \cdots, N

where u_i represents location i and N is the total number of available data. For the modeling case, N is equal to 2500. Note that Mg enters with a negative sign, because the spatial distribution of Mg values is opposite to that of Al and Si.
To demonstrate the different proportions of low and high values observed in Fig.5.25, D(u) is generated for each case (D_{0.0}(u) and D_{0.7}(u) for Wsp = 0.0 and Wsp = 0.7 respectively), and two universal thresholds (T_high and T_low) are defined for converting D(u) to permeability. When D(u) is above T_high, the collocated permeability is set to 10 mD. When D(u) is below T_low, the collocated
permeability is set to 0.1 mD. When D(u) is in between, the collocated permeability is set to 1 mD. In this case, T_high is chosen as the 0.8 quantile of the 100 realizations of D_{0.0}(u) and D_{0.7}(u) (1.1), and T_low is chosen as the 0.2 quantile of the 100 realizations (-1.45). Fig.5.26 shows 3 realizations for each Wsp. The realizations do not show significant differences, as the multivariate models are generated from the same conditioning data; the low‑ and high‑value regions do not vary significantly. When converted to permeability, the two thresholds group the multivariate data into three categories, making the differences less obvious. The difference between the permeability models generated for the two Wsp values needs to be examined through flow simulation. Note that the proportions of high, medium and low values differ in each realization.
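A sketch of this thresholding step is shown below, with the quantile thresholds quoted above used as defaults and D assumed to be the gridded array of D(u) for one realization; the D = Al + Si − Mg form follows the assumed definition given earlier, and the function name is hypothetical.

```python
import numpy as np

def permeability_from_d(D, t_low=-1.45, t_high=1.1):
    """Convert the summed variable D(u) to a three-category permeability model."""
    perm = np.full(D.shape, 1.0)      # middle range -> 1 mD
    perm[D >= t_high] = 10.0          # high D -> 10 mD
    perm[D <= t_low] = 0.1            # low D -> 0.1 mD
    return perm

# D = Al + Si - Mg on the 50 x 50 grid for one realization (assumed form):
# perm_model = permeability_from_d(D)
```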