Texts in Computer Science
Series Editors
David Gries, Department of Computer Science, Cornell University, Ithaca, NY,
USA
Orit Hazzan, Faculty of Education in Technology and Science, Technion—Israel Institute of Technology, Haifa, Israel
Titles in this series now included in the Thomson Reuters Book Citation Index!
‘Texts in Computer Science’ (TCS) delivers high-quality instructional content for
undergraduates and graduates in all areas of computing and information science,
with a strong emphasis on core foundational and theoretical material but inclusive
of some prominent applications-related content. TCS books should be reasonably
self-contained and aim to provide students with modern and clear accounts of topics
ranging across the computing curriculum. As a result, the books are ideal for
semester courses or for individual self-study in cases where people need to expand
their knowledge. All texts are authored by established experts in their fields,
reviewed internally and by the series editors, and provide numerous examples,
problems, and other pedagogical tools; many contain fully worked solutions.
The TCS series comprises high-quality, self-contained books that have
broad and comprehensive coverage and are generally in hardback format and
sometimes contain color. For undergraduate textbooks that are likely to be more
brief and modular in their approach, require only black and white, and are under
275 pages, Springer offers the flexibly designed Undergraduate Topics in Computer
Science series, to which we refer potential authors.
Tomas Hrycej • Bernhard Bermeitinger •
Matthias Cetto • Siegfried Handschuh
Mathematical Foundations of Data Science

Tomas Hrycej, Institute of Computer Science, University of St. Gallen, St. Gallen, Switzerland

Bernhard Bermeitinger, Institute of Computer Science, University of St. Gallen, St. Gallen, Switzerland
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Data Science is a rapidly expanding field of increasing relevance. There are correspondingly numerous textbooks about the topic. They usually focus on the various Data Science methods. In a growing field, there is a danger that the number of methods grows, too, at a pace that makes it difficult to compare their specific merits and application focus.
Faced with this avalanche of methods, the user is left alone with the judgment about which method to select. He or she can be helped only if some basic principles, such as fitting a model to data, generalization, and the abilities of numerical algorithms, are thoroughly explained, independently of the methodical approach. Unfortunately, these principles are hardly covered in the textbook variety. This book aims to close this gap.
Besides students as the intended audience, we also see a benefit for researchers in the field who want to gain a proper understanding of the mathematical foundations instead of mere computing experience, as well as for practitioners, who will get a mathematical exposition directed at making the causalities clear.
Comprehension Checks
In all chapters, important theses are summarized in their own paragraphs. All
chapters have comprehension checks for the students.
Acknowledgments
During the writing of this book, we have greatly benefited from students taking our
course and providing feedback on earlier drafts of the book. We would like to
explicitly mention the help of Jonas Herrmann for thorough reading of the manu-
script. He gave us many helpful hints for making the explanations comprehensible,
in particular from a student’s viewpoint. Further, we want to thank Wayne Wheeler
and Sriram Srinivas from Springer for their support and their patience with us in
finishing the book.
Finally, we would like to thank our families for their love and support.
Part II Applications
6 Specific Problems of Natural Language Processing . . . . . . . . . . . . . . 167
6.1 Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.2 Semantic Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.3 Recurrent Versus Sequence Processing Approaches . . . . . . . . . . . 171
6.4 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.5 Attention Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.6 Autocoding and Its Modification . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.7 Transformer Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.7.1 Self-attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.7.2 Position-Wise Feedforward Networks . . . . . . . . . . . . . . . . 184
6.7.3 Residual Connection and Layer Normalization . . . . . . . . . 184
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
7 Specific Problems of Computer Vision . . . . . . . . . . . . . . . . . . . . . . . 195
7.1 Sequence of Convolutional Operators . . . . . . . . . . . . . . . . . . . . . 196
7.1.1 Convolutional Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.1.2 Pooling Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Acronyms
AI Artificial Intelligence
ARMA Autoregressive Moving Average
BERT Bidirectional Encoder Representations from Transformers
CNN Convolutional Neural Network
CV Computer Vision
DL Deep Learning
DS Data Science
FIR Finite Impulse Response
GRU Gated Recurrent Unit
IIR Infinite Impulse Response
ILSVRC ImageNet Large Scale Visual Recognition Challenge
LSTM Long Short-Term Memory Neural Network
MIMO Multiple Input/Multiple Output
MSE Mean Square Error
NLP Natural Language Processing
OOV Out-of-Vocabulary
PCA Principal Component Analysis
ReLU Rectified Linear Unit
ResNet Residual Neural Network
RNN Recurrent Neural Network
SGD Stochastic Gradient Descent
SISO Single Input/Single Output
SVD Singular Value Decomposition
SVM Support Vector Machine
1 Data Science and Its Tasks
As the name Data Science (DS) suggests, it is a scientific field concerned with data.
However, this definition would encompass the whole of information technology.
This is not the intention behind delimiting Data Science. Rather, the focus is on
extracting useful information from data.
In the last decades, the volume of processed and digitally stored data has reached
huge dimensions. This has led to a search for innovative methods capable of coping
with large data volumes. A natural analogue is the intelligent information processing performed by higher living organisms. They are supplied with a continuous stream of voluminous sensory data (delivered by senses such as vision, hearing, or touch) and use this stream for immediate or delayed action favorable to the organism. This fact makes the field of Artificial Intelligence (AI) a natural source of
potential ideas for Data Science. These technologies complement the findings and
methods developed by classical disciplines concerned with data analysis, the most
prominent of which is statistics.
The research subject of Artificial Intelligence (AI) is all aspects of sensing, recog-
nition, and acting necessary for intelligent or autonomous behavior. The scope of
Data Science is similar but focused on the aspects of recognition. Given the data,
collected by sensing or by other data accumulation processes, the Data Science tasks consist in recognizing patterns that are interesting or important in some defined sense. More concretely, these tasks can take (but are not limited to) the following forms:
Depending on the character of the task, the data processing may be static or
dynamic. The static variant is characterized by a fixed data set in which a pattern is
to be recognized. This corresponds to the mathematical concept of a mapping: Data
patterns are mapped to their pattern labels. Static recognition is a widespread setting
for image processing, text search, fraud detection, and many others.
With dynamic processing, the recognition takes place on a stream of data provided continuously in time. The pattern sought can be found only by observing this stream and its dynamics. A typical example is speech recognition.
Historically, the first approaches to solving these tasks date back several centuries and have been continually developed since. The traditional disciplines have been statistics and systems theory, the latter investigating dynamic system behavior.
These disciplines provide a large pool of scientifically founded findings and meth-
ods. Their natural focus on linear systems results from the fact that these systems are
substantially easier to treat analytically. Although some powerful theory extensions
to nonlinear systems are available, a widespread approach is to treat the nonlinear
systems as locally linear and use linear theory tools.
AI has passed through several phases. Its origins in the 1950s focused on simple learning principles, mimicking basic aspects of the behavior of biological neuron cells. The information to be processed was represented by real-valued vectors. The corresponding computing procedures belong to the domain of numerical mathematics. The complexity of the algorithms was limited by the computing power of the information processing devices available at that time. The typical tasks solved were simple classification problems involving the separation of two classes.
Limitations of this approach with the information processing technology of the time led to an alternative view: logic-based AI. Instead of focusing on sensor information, logical statements and, correspondingly, logically sound conclusions were investigated. Such data represented some body of knowledge, which motivated calling the approach knowledge-based. The software systems for such processing were labeled “expert systems” because of the necessity of encoding expert knowledge in an appropriate logical form.
This field reached a considerable degree of maturity in the machine processing of logical statements. However, the next obstacle had to be surmounted: the possibility of describing the real world in logical terms showed its limits. Many relationships important for intelligent information processing and behavior turned out to be too diffuse for the unambiguous language of logic. Although some attempts to extend logic by probabilistic or pseudo-probabilistic attributes (fuzzy logic) delivered applicable results, the next change of paradigm took place.
With the fast increase of computing power, also using interconnected computer
networks, the interest in the approach based on numerical processing of real-valued
data revived. The computing architectures are, once more, inspired by neural systems
of living organisms. In addition to the huge growth of computing resources, this phase
The authors hope to present concise and transparent answers to these questions
wherever allowed by the state of the art.
Part I
Mathematical Foundations
2 Application-Specific Mappings and Measuring the Fit to Data
The mappings considered in the following are parameterized mappings f(x, w) with a parameter vector w. For linear mappings of type (2.2), the parameter vector w consists of the elements of the matrix B.
There are several basic application types with their own interpretation of the
mapping sought. The task of fitting a mapping of a certain type to the data requires a
measure of how good this fit is. An appropriate definition of this measure is important
for several reasons:
• In most cases, a perfect fit with no deviation is not possible. To select from alternative solutions, comparing the values of the fit measure is necessary.
• For optimum mappings of a simple type such as linear ones, analytical solutions
are known. Others can only be found by numerical search methods. To control the
search, repeated evaluation of the fit measure is required.
• The most efficient search methods require smooth fit measures with existing or
even continuous gradients, to determine the search direction where the chance for
improvement is high.
For some mapping types, these two groups of requirements are difficult to meet
in a single fit measure.
There are also requirements concerning the correspondence of the fit measure
appropriate from the viewpoint of the task on one hand and of that used for (mostly
numerical) optimization on the other hand:
• The very basic requirement is that both fit measures should be the same. This
seemingly trivial requirement may be difficult to satisfy for some tasks such as
classification.
• It is desirable that a perfect fit leads to a zero minimum of the fit measure. This is also not always satisfied, for example, with likelihood-based measures.
Difficulties in satisfying these requirements frequently lead to using different measures for the search on one hand and for the evaluation of the fit on the other hand. In such cases, it is preferable if both measures have at least a common optimum.
The most straightforward application type is using the mapping as what it mathe-
matically is: a mapping of real-valued input vectors to equally real-valued output
vectors. This type encompasses many physical, technical, and econometric applica-
tions. Examples of this may be:
• Failure rates (y) determined from operation time and conditions of a component
(x).
• Credit scoring, mapping the descriptive features (x) of the credit recipient to a
number denoting the creditworthiness (y).
• Macroeconomic magnitudes such as inflation rate (y) estimated from others such
as unemployment rate and economic growth (x).
For a vector mapping f(x, w), the error (2.4) is a column vector. The vector product $e^\top e$ is the sum of the squares of the errors of the individual output vector elements. Summing these errors over the K training examples results in the error measure

$$E = \sum_{k=1}^{K} e_k^\top e_k = \sum_{k=1}^{K} \sum_{m=1}^{M} e_{mk}^2 \qquad (2.5)$$
Different scaling of the individual elements of the vector patterns can make scaling weights $S = \mathrm{diag}\,(s_1, \dots, s_M)$ appropriate. Also, some training examples may be more important than others, which can be expressed by additional weights $r_k$. The error measure (2.5) has then the generalized form

$$E = \sum_{k=1}^{K} r_k\, e_k^\top S\, e_k = \sum_{k=1}^{K} \sum_{m=1}^{M} s_m r_k\, e_{mk}^2 \qquad (2.6)$$
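As a concrete illustration of (2.5) and (2.6), the following minimal sketch (Python with NumPy; the data arrays are hypothetical placeholders) evaluates both error measures for a batch of K training examples with M-dimensional outputs.

```python
import numpy as np

def error_measures(Y_true, Y_pred, s=None, r=None):
    """Evaluate the summed square error (2.5) and its weighted form (2.6).

    Y_true, Y_pred : arrays of shape (K, M) -- K training examples, M outputs.
    s : optional output weights (length M), the diagonal of S in (2.6).
    r : optional example weights (length K) in (2.6).
    """
    E_km = (Y_pred - Y_true) ** 2            # squared errors e_mk^2
    E_plain = E_km.sum()                     # equation (2.5)
    if s is None:
        s = np.ones(Y_true.shape[1])
    if r is None:
        r = np.ones(Y_true.shape[0])
    E_weighted = (r[:, None] * E_km * s[None, :]).sum()   # equation (2.6)
    return E_plain, E_weighted

# Small synthetic example (placeholder data):
rng = np.random.default_rng(0)
Y_true = rng.normal(size=(5, 3))
Y_pred = Y_true + 0.1 * rng.normal(size=(5, 3))
print(error_measures(Y_true, Y_pred, s=np.array([1.0, 2.0, 0.5])))
```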
For linear mappings (2.2), explicit solutions for reaching zero in the error measure
(2.5) and (2.6) are known. Their properties have been thoroughly investigated and
some important aspects are discussed in Chap. 4. Unfortunately, most practical ap-
plications deviate to a greater or lesser extent from the linearity assumption. Good
analytical tractability may be a good motivation to accept a linear approximation if
the expected deviations from the linearity assumption are not excessive. However, a
lot of applications will not allow such approximation. Then, some nonlinear approach
is to be used.
Modeling nonlinearities in the mappings can be done in two ways that differ strongly in their application.
The first approach preserves linearity in the parameters. The mapping (2.3) is expressed as

$$y = B\, h(x) \qquad (2.7)$$

with a nonparametric function h(x) which plays the role of the input vector x itself. In other words, h(x) can be substituted for x in all algebraic relationships valid for linear systems. This also includes the explicit solutions for the Mean Square Errors (MSEs) (2.5) and (2.6).
The function h (x) can be an arbitrary function but a typical choice is a polynomial
in vector x. This is motivated by the well-known Taylor expansion of an arbitrary
multivariate function [7]. This expansion enables an approximation of a multivariate
function by a polynomial of a given order on an argument interval, with known error
bounds.
For a vector x with two elements $x_1$ and $x_2$, a quadratic polynomial is

$$h(x_1, x_2) = \begin{pmatrix} 1 & x_1 & x_2 & x_1^2 & x_2^2 & x_1 x_2 \end{pmatrix}^\top \qquad (2.8)$$

For a vector x with three elements $x_1$, $x_2$, and $x_3$, it is already as complex as follows:

$$h(x_1, x_2, x_3) = \begin{pmatrix} 1 & x_1 & x_2 & x_3 & x_1^2 & x_2^2 & x_3^2 & x_1 x_2 & x_1 x_3 & x_2 x_3 \end{pmatrix}^\top \qquad (2.9)$$

For a vector x of length N, the length of the vector h(x) is

$$1 + N + N + \frac{(N-1)N}{2} = 1 + \frac{N^2 + 3N}{2} \qquad (2.10)$$
For a polynomial of order p, the size of vector h (x) grows with the pth power of
N . This is the major shortcoming of the polynomial approach for typical applications
of DS where input variable numbers of many thousands are common. Already with
quadratic polynomials, the input width would increase to millions and more.
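To make this growth concrete, the following sketch builds the quadratic feature vector h(x) of (2.8)/(2.9) for an arbitrary input dimension N, checks its length against (2.10), and fits the coefficient matrix of the linear-in-parameters model (2.7) by ordinary least squares. The function names and the synthetic data are illustrative assumptions, not taken from the book.

```python
import numpy as np

def quadratic_features(x):
    """Quadratic polynomial expansion h(x) as in (2.8)/(2.9):
    constant, linear terms, squares, and pairwise products."""
    x = np.asarray(x, dtype=float)
    n = x.size
    cross = [x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.concatenate(([1.0], x, x ** 2, cross))

N = 3
x = np.arange(1.0, N + 1.0)
h = quadratic_features(x)
assert h.size == 1 + (N ** 2 + 3 * N) // 2        # length formula (2.10)

# Linear-in-parameters fit y = B h(x): closed-form least squares over K samples.
rng = np.random.default_rng(0)
K, M = 200, 2
X = rng.normal(size=(K, N))
H = np.stack([quadratic_features(xk) for xk in X])   # K x (1 + (N^2+3N)/2)
Y = rng.normal(size=(K, M))                          # placeholder targets
B, *_ = np.linalg.lstsq(H, Y, rcond=None)            # minimizes the MSE (2.5)
print(B.shape)   # (number of features, M); here y ~ h(x) @ B, i.e. B is transposed
```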
Another disadvantage is the growth of higher polynomial powers outside of the
interval covered by the training set—a minor extrapolation may lead to excessively
high output values.
So, modeling the multivariate nonlinearities represented by polynomials is practi-
cal only for low-dimensional problems or problems in which it is justified to refrain
from taking full polynomials (e.g., only powers of individual scalar variables). With
such problems, it is possible to benefit from the existence of analytical optima and
statistically well-founded statements about the properties of the results.
These properties of parameterized mappings linear in parameters have led to
the high interest in more general approximation functions. They form the second
approach: mappings nonlinear in the parameters. A prominent example is neural networks, discussed in detail in Chap. 3. In spite of intensive research, practical statements about their representational capacity are scarce and overly general, although there are some interesting concepts such as the Vapnik–Chervonenkis dimension [21].
Neural networks with bounded activation functions such as sigmoid do not exhibit
the danger of unbounded extrapolation. They frequently lead to good results if the
number of parameters scales linearly with the input dimension, although the optimal-
ity or appropriateness of their size is difficult to show. Determining their optimum
size is frequently a result of lengthy experiments.
Minimizing the MSE (2.5) or (2.6) leads to a mapping making a good (or even
perfect, in the case of a zero error) forecast of the output vector y. This corresponds
to the statistical concept of point estimation of the expected value of y.
In the presence of effects unexplained by the input variables, or of some type of noise, the true values of the output will usually not be exactly equal to their expected values.
Rather, they will fluctuate around these expected values according to some probability
distribution. If the scope of these fluctuations is different for different input patterns
x, the knowledge of the probability distribution may be of crucial interest for the
application. In this case, it would be necessary to determine a conditional probability
distribution of the output pattern y conditioned on the input pattern x
g (y | x) (2.11)
If the expected probability distribution type is parameterized by parameter vector
p, then (2.11) extends to
g (y | x, p) (2.12)
From the statistical viewpoint, the input/output mapping (2.3) maps the input
pattern x directly to the point estimator of the output pattern y. However, we are
free to adopt a different definition: input pattern x can be mapped to the conditional
parameter vector p of the distribution of output pattern y. This parameter vector
has nothing in common with the fitted parameters of the mapping—it consists of
parameters that determine the shape of a particular probability distribution of the
output patterns y, given an input pattern x. After the fitting process, the conditional
probability distribution (2.12) becomes
g (y, f (x, w)) (2.13)
It is an unconditional distribution of output pattern y with distribution parameters
determined by the function f (x, w). The vector w represents the parameters of the
mapping “input pattern x ⇒ conditional probability distribution parameters p” and
should not be confused with the distribution parameters p themselves. For example,
in the case of mapping f () being represented by a neural network, w would corre-
spond to the network weights. Distribution parameters p would then correspond to
the activation of the output layer of the network for a particular input pattern x.
This can be illustrated on the example of a multivariate normal distribution with
a mean vector m and covariance matrix C. The distribution (2.12) becomes
$$g(y \mid x, p) = N\left(m(x), C(x)\right) = \frac{1}{\sqrt{(2\pi)^N \left|C(x)\right|}}\, e^{-\frac{1}{2}\left(y - m(x)\right)^\top C(x)^{-1} \left(y - m(x)\right)} \qquad (2.14)$$
The vector y can, for example, represent the forecast of temperature and humidity
for the next day, depending on today’s meteorological measurements x. Since a point forecast would scarcely hit tomorrow’s state exactly and would thus be of limited use, it will be substituted by the forecast that the temperature/humidity vector is expected
to have the mean m (x) and the covariance matrix C (x), both depending on today’s
measurement vector x. Both the mean vector and the elements of the covariance ma-
trix together constitute the distribution parameter vector p in (2.12). This parameter
vector depends on the vector of meteorological measurements x as in (2.13).
What remains is to choose an appropriate method to find the optimal mappings
m (x) and C (x) which depend on the input pattern x. In other words, we need
some optimality measure for the fit, which is not as simple as in the case of point
estimation with its square error. The principle widely used in statistics is that of
maximum likelihood. It consists of selecting distribution parameters (here: m and C)
such that the probability density value for the given data is maximum.
For a training set pattern pair $(x_k, y_k)$, the probability density value is

$$\frac{1}{\sqrt{(2\pi)^N \left|C(x_k)\right|}}\, e^{-\frac{1}{2}\left(y_k - m(x_k)\right)^\top C(x_k)^{-1} \left(y_k - m(x_k)\right)} \qquad (2.15)$$

For independent samples $(x_k, y_k)$, the likelihood of the entire training set is the product

$$\prod_{k=1}^{K} \frac{1}{\sqrt{(2\pi)^N \left|C(x_k)\right|}}\, e^{-\frac{1}{2}\left(y_k - m(x_k)\right)^\top C(x_k)^{-1} \left(y_k - m(x_k)\right)} \qquad (2.16)$$
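A minimal sketch of evaluating the logarithm of the likelihood (2.16), assuming the conditional means m(x_k) and covariance matrices C(x_k) have already been produced by some model; it uses generic NumPy linear algebra rather than the triangular decomposition discussed below, and all data are placeholders.

```python
import numpy as np

def gaussian_log_likelihood(Y, M, C):
    """Logarithm of the likelihood product (2.16) for K samples.

    Y : (K, N) observed output patterns y_k
    M : (K, N) conditional means m(x_k)
    C : (K, N, N) conditional covariance matrices C(x_k)
    """
    K, N = Y.shape
    total = 0.0
    for k in range(K):
        diff = Y[k] - M[k]
        _, logdet = np.linalg.slogdet(C[k])
        quad = diff @ np.linalg.solve(C[k], diff)
        # log of the density value (2.15)
        total += -0.5 * (N * np.log(2 * np.pi) + logdet + quad)
    return total

# Placeholder example: a hypothetical model predicting mean and covariance.
rng = np.random.default_rng(1)
K, N = 4, 2
Y = rng.normal(size=(K, N))
M = np.zeros((K, N))
C = np.stack([np.eye(N) for _ in range(K)])
print(gaussian_log_likelihood(Y, M, C))
```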
• Every symmetric positive definite matrix such as $C(x_k)^{-1}$ can be expressed as a product of a lower triangular matrix L and its transpose $L^\top$, that is, $C(x_k)^{-1} = L(x_k)\, L(x_k)^\top$.
• The determinant of a lower triangular matrix L is the product of its diagonal elements.
• The determinant of $L L^\top$ is the square of the determinant of L.
• The inverse $L^{-1}$ of a lower triangular matrix L is a lower triangular matrix, and its determinant is the reciprocal of the determinant of L.
We are then seeking the parameter pair (β(x), η(x)), depending on the input pattern x, such that the log-likelihood over the training set

$$\sum_{k=1}^{K}\left[\ln\frac{\beta(x_k)}{\eta(x_k)} + \left(\beta(x_k)-1\right)\ln\frac{y_k}{\eta(x_k)} - \left(\frac{y_k}{\eta(x_k)}\right)^{\beta(x_k)}\right]$$
$$= \sum_{k=1}^{K}\left[\ln\beta(x_k) - \beta(x_k)\ln\eta(x_k) + \left(\beta(x_k)-1\right)\ln y_k - \left(\frac{y_k}{\eta(x_k)}\right)^{\beta(x_k)}\right] \qquad (2.23)$$

is maximal. The parameter pair can, for example, be the output layer (of size 2) activation vector

$$\begin{pmatrix}\beta & \eta\end{pmatrix} = f(x, w) \qquad (2.24)$$
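For illustration, the sketch below evaluates the log-likelihood (2.23) for given arrays of outputs and predicted parameters; the expression is that of a Weibull-type distribution with shape β and scale η. In practice, β(x_k) and η(x_k) would come from the output layer (2.24) of a fitted model; here they are fixed placeholder values.

```python
import numpy as np

def log_likelihood_2_23(y, beta, eta):
    """Log-likelihood (2.23) summed over the training set.

    y    : (K,) observed positive outputs y_k
    beta : (K,) shape parameters beta(x_k) predicted from the inputs
    eta  : (K,) scale parameters eta(x_k) predicted from the inputs
    """
    return np.sum(
        np.log(beta) - beta * np.log(eta)
        + (beta - 1.0) * np.log(y)
        - (y / eta) ** beta
    )

# Placeholder usage: beta and eta would come from f(x, w), e.g. a small network
# with a two-unit output layer as in (2.24); here they are fixed for illustration.
y = np.array([0.5, 1.2, 2.0])
beta = np.array([1.5, 1.5, 1.5])
eta = np.array([1.0, 1.0, 1.0])
print(log_likelihood_2_23(y, beta, eta))
```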
2.2 Classification
Typical examples of classification tasks are:

• images in which the object type is sought (e.g., a face, a door, etc.);
• radar signature assigned to flying objects;
• object categories on the road or in its environment during autonomous driving.
Sometimes, the classes are only discrete substitutes for a continuous scale. Dis-
crete credit scores such as “fully creditworthy” or “conditionally creditworthy” are
only distinct values of a continuous variable “creditworthiness score”. Also, many
social science surveys classify the answers as “I fully agree”, “I partially agree”, “I am indifferent”, “I partially disagree”, and “I fully disagree”, which can be mapped
to a continuous scale, for example [−1, 1]. Generally, this is the case whenever the
classes can be ordered in an unambiguous way.
Apart from this case with inherent continuity, the classes may be an order-free
set of exclusive alternatives. (Nonexclusive classifications can be viewed as separate
tasks—each nonexclusive class corresponding to a dichotomy task “member” vs.
“nonmember”.) For such class sets, a basic measure of the fit to a given training or test
set is the misclassification error. The misclassification error for a given pattern may
be defined as a variable equal to zero if the classification by the model corresponds to
the correct class and equal to one if it does not. More generally, assigning the object
with the correct class i erroneously to the class j is evaluated by a nonnegative real
number called the loss $L_{ij}$. The loss of a correct class assignment is $L_{ii} = 0$.
The so-defined misclassification loss is a transparent measure, frequently directly
reflecting application domain priorities. By contrast, it is less easy to make it opera-
tional for fitting or learning algorithms. This is due to its discontinuous character—a
class assignment can only be correct or wrong. So far, solutions have been found
only for special cases.
Let us consider a simple problem with two classes and two-dimensional patterns
[x1 , x2 ] as shown in Fig. 2.3. The points corresponding to Class 1 and Class 2
can be completely separated by a straight line, without any misclassification. This
is why such classes are called linearly separable. The attainable misclassification
error is zero.
The existence of a separating line guarantees the possibility of defining regions in the pattern vector space corresponding to the individual classes. What is further needed is a function whose value indicates the membership of a pattern in a particular class. Such a function for the classes of Fig. 2.3 is shown in Fig. 2.4. Its value is unity for patterns from Class 1 and zero for those from Class 2.
Unfortunately, this function has properties disadvantageous for treatment by nu-
merical algorithms. It is discontinuous along the separating line and has zero gradient
elsewhere. This is why it is usual to use an indicator function of type shown in Fig. 2.5.
It is a linear function of the pattern variables. The patterns are assigned to Class
1 if this function is positive and to Class 2 otherwise.
Many, or even most, class pairs cannot be separated by a linear hyperplane. It is not easy to determine whether they can be separated by an arbitrary function if the family of these functions is not fixed. However, some classes can be separated by
simple surfaces such as quadratic ones. An example of this is given in Fig. 2.6. The
separating curve corresponds to the points where the separating function of Fig. 2.7
intersects the plane with y = 0.
The discrete separating function such as that of Fig. 2.4 can be viewed as a
nonlinear step function of the linear function of Fig. 2.5, that is,
$$s\left(b^\top x\right) = \begin{cases} 1 & \text{for } b^\top x \ge 0 \\ 0 & \text{for } b^\top x < 0 \end{cases} \qquad (2.25)$$
To avoid explicitly mentioning the absolute term, it will be assumed that the last
element of input pattern vector x is equal to unity, so that
$$b^\top x = \begin{pmatrix} b_1 & \cdots & b_{N-1} & b_N \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_{N-1} \\ 1 \end{pmatrix} = \begin{pmatrix} b_1 & \cdots & b_{N-1} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_{N-1} \end{pmatrix} + b_N$$
The misclassification sum for a training set with input/output pairs (xk , yk ) is
equal to
$$E = \sum_{k=1}^{K} \left( s\left(b^\top x_k\right) - y_k \right)^2 \qquad (2.26)$$
Here, yk is the class indicator of the kth training pattern with values 0 or 1. For
most numerical minimization methods for error functions E, the gradient of E with
regard to parameters b is required to determine the direction of descent towards low
values of E. The gradient is
$$\frac{\partial E}{\partial b} = 2 \sum_{k=1}^{K} \left( s\left(b^\top x_k\right) - y_k \right) \frac{ds}{dz}\, x_k \qquad (2.27)$$
with z being the argument of function s (z).
However, the derivative of the nonlinear step function (2.25) is zero everywhere except for the discontinuity at z = 0, where it does not exist. To obtain a useful descent direction, the famous perceptron rule [16] has used a gradient modification. This pioneering algorithm iteratively updates the weight vector b in the direction of
the (negatively taken) modified gradient
$$\frac{\partial E}{\partial b} = \sum_{k=1}^{K} \left( s\left(b^\top x_k\right) - y_k \right) x_k \qquad (2.28)$$
This modified gradient can be viewed as (2.27) with $\frac{ds}{dz}$ substituted by unity (the derivative of the linear function s(z) = z). Taking a continuous gradient approximation is an idea used by optimization algorithms for non-smooth functions, called subgradient algorithms [17].
The algorithm using the perceptron rule converges to zero misclassification rate
if the classes, as defined by the training set, are separable. Otherwise, convergence
is not guaranteed.
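A minimal sketch of the batch form of this update: the weight vector b is moved against the modified gradient (2.28) until no pattern is misclassified. The learning rate, the epoch limit, and the toy data are hypothetical choices, not prescribed by the perceptron rule itself.

```python
import numpy as np

def perceptron_fit(X, y, lr=0.1, epochs=100):
    """Iterate the perceptron rule: move b against the modified gradient (2.28).

    X : (K, N) patterns, last column equal to 1 (absorbs the absolute term)
    y : (K,) class indicators, 0 or 1
    """
    b = np.zeros(X.shape[1])
    for _ in range(epochs):
        s = (X @ b >= 0).astype(float)      # step function (2.25)
        if not np.any(s != y):              # zero misclassifications: stop
            break
        grad = (s - y) @ X                  # modified gradient (2.28)
        b -= lr * grad
    return b

# Linearly separable toy data (hypothetical): class by sign of x1 + x2 - 1.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 2, size=(100, 2))
y = (X[:, 0] + X[:, 1] - 1 > 0).astype(float)
Xb = np.hstack([X, np.ones((100, 1))])      # append the constant element
b = perceptron_fit(Xb, y)
print("misclassified:", int(np.sum((Xb @ b >= 0) != y)))
```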
An error measure focusing on critical patterns in the proximity of separating
line is used by the approach called the support vector machine (SVM) [2]. This
approach is looking for a separating line with the largest orthogonal distance to the
nearest patterns of both classes. In Fig. 2.8, the separating line is surrounded by
the corridor defined by two boundaries against both classes, touching the respective
nearest points. The goal is to find a separating line for which the width of this corridor
is the largest. In contrast to the class indicator of Fig. 2.4 (with unity for Class 1
and zero for Class 2), the support vector machine rule is easier to represent with a
symmetric class indicator y equal to 1 for one class and to −1 for another one. With
this class indicator and input pattern vector containing the element 1 to provide for the
absolute bias term, the classification task is formulated as a constrained optimization
task with constraints
$$y_k\, b^\top x_k \ge 1 \qquad (2.29)$$
If these constraints are satisfied, the product $b^\top x_k$ is at least 1 for Class 1 and at most −1 for Class 2.
The separating function $b^\top x$ of (2.29) is a hyperplane crossing the x1/x2-coordinate plane at the separating line (red line in Fig. 2.8). At the boundary lines, $b^\top x$ is equal to constants larger than 1 (boundary of Class 1) and smaller than −1 (boundary of Class 2). However, there are infinitely many such separating functions. In the cross section perpendicular to the separating line (i.e., viewing the x1/x2-coordinate plane “from aside”), they may appear as in Fig. 2.9.
There are infinitely many such hyperplanes (appearing as dotted lines in the cross section of Fig. 2.9), some of which become very “steep”. The most desirable variant would be the one exactly touching the critical points of both classes at a unity “height” (solid line). This is why the optimal solution of the SVM is the one with the minimum norm of the vector b:
simple: “separated” (successful fit) and “non-separated” (failing to fit). The absence
of intermediary results makes the problem of discontinuous misclassification error
or loss irrelevant—every separation is a full success.
For Gaussian classes with column vector means $m_1$ and $m_2$, and a common covariance matrix C, the matrix A and some parts of the constant d become zero. The discriminant function becomes linear:

$$b^\top x + d > 0$$

with

$$b^\top = (m_1 - m_2)^\top C^{-1}$$
$$d = -\frac{1}{2}\, b^\top (m_1 + m_2) + \ln\frac{p_1}{p_2} = -\frac{1}{2}\, (m_1 - m_2)^\top C^{-1} (m_1 + m_2) + \ln\frac{p_1}{p_2} \qquad (2.37)$$
This linear function is widely used in the linear discriminant analysis.
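A short sketch of evaluating the linear discriminant (2.37), assuming the class means, the common covariance matrix, and the prior probabilities have already been estimated; the numerical values are hypothetical.

```python
import numpy as np

def linear_discriminant(m1, m2, C, p1, p2):
    """Linear discriminant (2.37): classify x as Class 1 if b^T x + d > 0."""
    b = np.linalg.solve(C, m1 - m2)                 # b = C^{-1} (m1 - m2)
    d = -0.5 * b @ (m1 + m2) + np.log(p1 / p2)
    return b, d

# Hypothetical class parameters for illustration.
m1, m2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
C = np.array([[1.0, 0.2], [0.2, 1.0]])
b, d = linear_discriminant(m1, m2, C, p1=0.5, p2=0.5)
x = np.array([0.8, -0.1])
print("Class 1" if b @ x + d > 0 else "Class 2")
```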
Interestingly, the separating function (2.37) can, under some assumptions, also be obtained with a least squares approach. For simplicity, it will be assumed that the mean over both classes, $p_1 m_1 + p_2 m_2$, is zero. Class 1 and Class 2 are coded by 1 and −1, and the pattern vector x contains 1 at the last position.
The zero gradient is reached at

$$b^\top X^\top X = y^\top X \qquad (2.38)$$

By dividing both sides by the number of samples, the matrices $X^\top X$ and $y^\top X$ contain sample moments (means and covariances). The expected values are

$$E\left[\frac{1}{K}\, b^\top X^\top X\right] = E\left[\frac{1}{K}\, y^\top X\right] \qquad (2.39)$$
The expression $X^\top X$ corresponds to the sample second moment matrix. With the zero mean, as assumed above, it is equal to the sample covariance matrix. Every covariance matrix over a population divided into classes can be decomposed into the intraclass covariance C (in this case, identical for both classes) and the interclass covariance

$$M = \begin{pmatrix} m_1 & m_2 \end{pmatrix}, \qquad P = \begin{pmatrix} p_1 & 0 \\ 0 & p_2 \end{pmatrix}, \qquad C_{cl} = M P M^\top \qquad (2.40)$$

This can then be rewritten as

$$b^\top \begin{pmatrix} C + M P M^\top & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} p_1 m_1^\top - p_2 m_2^\top & p_1 - p_2 \end{pmatrix} \qquad (2.41)$$
resulting in

$$b^\top = \begin{pmatrix} p_1 m_1^\top - p_2 m_2^\top & p_1 - p_2 \end{pmatrix} \begin{pmatrix} C + C_{cl} & 0 \\ 0 & 1 \end{pmatrix}^{-1} = \begin{pmatrix} \left(p_1 m_1^\top - p_2 m_2^\top\right)\left(C + C_{cl}\right)^{-1} & p_1 - p_2 \end{pmatrix} \qquad (2.42)$$
It is interesting to compare the linear discriminant (2.37) with the least squares solution (2.42). With the additional assumption that both classes have identical prior probabilities $p_1 = p_2$ (and identical counts in the training set), the absolute term of both (2.37) and (2.42) becomes zero. The matrix $C_{cl}$ contains covariances of only two classes and is thus of maximum rank two. The additional condition of the overall mean being equal to zero reduces the rank to one. This results in the least squares-based separating vector b being only rescaled in comparison with that of the separating function (2.37). This statement can be inferred in the following way.
In the case of identical prior probabilities of both classes, the condition of zero mean of the distribution of all patterns is $m_1 + m_2 = 0$, or $m_2 = -m_1$. It can be rewritten as $m_1 = m$ and $m_2 = -m$ with the help of a single column vector of class means m. The difference of both means is $m_1 - m_2 = 2m$. The matrix $C_{cl}$ is
$$C_{cl} = \begin{pmatrix} m_1 & m_2 \end{pmatrix} \begin{pmatrix} \tfrac{1}{2} & 0 \\ 0 & \tfrac{1}{2} \end{pmatrix} \begin{pmatrix} m_1^\top \\ m_2^\top \end{pmatrix} = \frac{1}{2}\left(m_1 m_1^\top + m_2 m_2^\top\right) = m\, m^\top \qquad (2.43)$$
with rank equal to one—it is an outer product of only one vector m with itself.
The equation for the separating function b of the linear discriminant is

$$b^\top C = 2 m^\top \qquad (2.44)$$

while for the separating function $b_{LS}$ of least squares, it is

$$b_{LS}^\top \left(C + C_{cl}\right) = 2 m^\top \qquad (2.45)$$

Let us assume the proportionality of both solutions by a factor d:

$$b_{LS} = d\, b \qquad (2.46)$$

Then

$$d\, b^\top \left(C + C_{cl}\right) = 2 d\, m^\top + 2 d\, m^\top C^{-1} C_{cl} = 2 m^\top \qquad (2.47)$$

or

$$m^\top C^{-1} C_{cl} = m^\top C^{-1} m\, m^\top = \frac{1 - d}{d}\, m^\top = e\, m^\top \qquad (2.48)$$

with

$$e = \frac{1 - d}{d} \qquad (2.49)$$

and

$$d = \frac{1}{1 + e} \qquad (2.50)$$
The scalar proportionality factor e in (2.48) can always be found since $C_{cl} = m m^\top$ is a rank-one operator: it maps every vector, i.e., also the vector $m^\top C^{-1}$, into the one-dimensional space spanned by the vector m. In other words, these two vectors are always proportional. Consequently, a scalar proportionality factor d for the separating functions can always be determined via (2.50). This means that the proportional separating functions are equivalent since they separate identical regions.
The result of this admittedly tedious argument is that the least squares solution fitting the training set to the class indicators 1 and −1 is equivalent to the optimum linear discriminant, under the assumption of
This makes the least squares solution interesting since it can be applied without assumptions about the distribution—of course, with the caveat that it is not Bayes-optimal for other distributions. This seems to be the foundation of the popularity of this approach beyond the statistical community, for example, in neural network-based classification.
Its weakness is that the MSE reached cannot be interpreted in terms of misclassifi-
cation error—we only know that in the MSE minimum, we are close to the optimum
separating function. The reason for this lack of interpretability is that the function
values of the separating function are growing with the distance from the hyperplane
separating both classes while the class indicators (1 and −1) are not—they remain
constant at any distance. Consequently, the MSE attained by optimization may be
large even if the classes are perfectly separated. This can be seen by imagining a “lateral view” of the vector space given in Fig. 2.10. It is a cross section in the direction of the class separating line. The class indicators are constant: 1 (Class 1 to the left) and −1 (Class 2 to the right).
More formally, the separating function (for the case of separable classes) assigns the patterns, according to the test $b^\top x + d > 0$ for Class 1 membership, to the respective correct class. However, the value of $b^\top x + d$ is not equal to the class indicator y (1 or −1). Consequently, the MSE $\left(b^\top x + d - y\right)^2$ is far from zero in the optimum. Although alternative separating functions with identical separating lines can have different slopes, none of them can reach zero MSE. So, the MSE does not reflect the misclassification rate.
This shortcoming can be alleviated by using a particular nonlinear function of the term $b^\top x + d$. Since this function is usually used in the form producing class indicators 1 for Class 1 and zero for Class 2, it will reflect the rescaled linear situation of Fig. 2.11.
The nonlinear function is called the logistic or logit function in statistics and econometrics. With neural networks, it is usually referred to as the sigmoid function, related via rescaling to the hyperbolic tangent (tanh). It is a function of a scalar argument z:

$$y = s(z) = \frac{1}{1 + e^{-z}} \qquad (2.51)$$

This function maps the argument z ∈ (−∞, ∞) to the interval [0, 1], as shown in Fig. 2.12.
Applying (2.51) to the linear separating function $b^\top x + d$, that is, using the nonlinear separating function

$$y = s\left(b^\top x + d\right) = \frac{1}{1 + e^{-(b^\top x + d)}} \qquad (2.52)$$

will change the picture of Fig. 2.11 to that of Fig. 2.13. The forecast class indicators (red crosses) are now close to the original ones (blue and green circles).
The MSE is

$$\left( s\left(b^\top x + d\right) - y \right)^2 \qquad (2.53)$$

For separable classes, the MSE can be made arbitrarily close to zero, as depicted in Fig. 2.14. The proximity of the forecast and true class indicators can be increased
where the exponents yk and 1 − yk acquire values 0 or 1 and thus “select” the correct
alternative from (2.55).
For a sample (or training set) of mutually independent samples, the likelihood
over this sample is the product
$$\prod_{k=1}^{K} f(x_k, w)^{y_k}\, \left(1 - f(x_k, w)\right)^{1 - y_k} = \prod_{k=1,\, y_k=1}^{K} f(x_k, w) \prod_{k=1,\, y_k=0}^{K} \left(1 - f(x_k, w)\right) \qquad (2.57)$$
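The following sketch evaluates the logarithm of the likelihood (2.57), which is numerically safer than the raw product; the sigmoid-of-linear-score model standing in for f(x, w) and the clipping constant are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # logistic function (2.51)

def log_likelihood_2_57(p, y):
    """Logarithm of the likelihood product (2.57).

    p : (K,) model outputs f(x_k, w), interpreted as P(y_k = 1)
    y : (K,) class indicators, 0 or 1
    """
    eps = 1e-12                           # guard against log(0)
    p = np.clip(p, eps, 1 - eps)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Placeholder model: a linear score passed through the sigmoid (2.52).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(float)
b, d = np.array([2.0, -2.0]), 0.0         # hypothetical parameters
p = sigmoid(X @ b + d)
print(log_likelihood_2_57(p, y))
```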
If the training set is a representative sample from the statistical population as-
sociated with pattern xk , the expected value of likelihood per pattern L/K can be
evaluated. The only random variable in (2.58) is the class indicator y, with probability
p of being equal to one and 1 − p of being zero:
With a parameterized approximator f (x, w) that can exactly compute the class
probability for a given pattern x and some parameter vector w, the exact fit is
at both the maximum of likelihood and the MSE minimum (i.e., least squares).
Of course, to reach this exact fit, an optimization algorithm that is capable of
finding the optimum numerically has to be available. This may be difficult for
strongly nonlinear approximators.
Least squares with a logistic activation function thus seems to be the approach to classification that satisfies relatively well the requirements formulated at the beginning of Sect. 2.2.
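The following sketch illustrates the earlier claim that, for separable classes, the MSE (2.53) of the sigmoid-squashed separating function (2.52) can be pushed arbitrarily close to zero by scaling up the parameters; the separating function and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Separable toy data: class indicator 1 on one side of the line x1 = 0, else 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

b, d = np.array([1.0, 0.0]), 0.0          # a separating function (hypothetical)
for scale in (1, 5, 25, 125):
    pred = sigmoid(scale * (X @ b + d))   # nonlinear separating function (2.52)
    mse = np.mean((pred - y) ** 2)        # MSE (2.53), averaged over patterns
    print(f"scale {scale:4d}: MSE {mse:.6f}")
# The MSE shrinks toward zero as the slope grows, while the class assignment
# (pred > 0.5) is already correct for every pattern.
```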
It has to be pointed out that convergence of the perceptron rule to a stable state
for non-separable classes is not guaranteed without additional provisions.
Also, multiple class separation by SVM is possible by decomposing to a set of
two-class problems [4].
The generalization of the Bayesian two-class criterion (2.32) to a multiclass prob-
lem is straightforward. The posterior probability of membership of pattern x in the
ith class is
$$P_i = \frac{f(x \mid i)\, p_i}{\sum_{j=1}^{M} f(x \mid j)\, p_j} \qquad (2.67)$$
Seeking the class with the maximum posterior probability (2.67), the denominator, identical for all classes, can be omitted.
Under the assumption that the patterns of every class follow multivariate normal
distribution, the logarithm of the numerator of (2.67) is
$$\ln\frac{1}{\sqrt{(2\pi)^N \left|C_i\right|}} - \frac{1}{2}\left(x - m_i\right)^\top C_i^{-1}\left(x - m_i\right) + \ln p_i$$
$$= -\frac{N}{2}\ln(2\pi) - \frac{1}{2}\ln\left|C_i\right| - \frac{1}{2}\, x^\top C_i^{-1} x + m_i^\top C_i^{-1} x - \frac{1}{2}\, m_i^\top C_i^{-1} m_i + \ln p_i \qquad (2.68)$$

which can be organized into the quadratic expression

$$q_i(x) = x^\top A_i x + b_i^\top x + d_i \qquad (2.69)$$

with

$$A_i = -\frac{1}{2}\, C_i^{-1}, \qquad b_i^\top = m_i^\top C_i^{-1}, \qquad d_i = -\frac{N}{2}\ln(2\pi) - \frac{1}{2}\ln\left|C_i\right| - \frac{1}{2}\, m_i^\top C_i^{-1} m_i + \ln p_i \qquad (2.70)$$
The Bayesian optimum, which simultaneously minimizes the misclassification rate, is to assign pattern $x_k$ to the class i with the largest $q_i(x_k)$. Assuming identical covariance matrices $C_i$ for all classes, the quadratic terms become identical (because of identical matrices $A_i$) and are irrelevant for the determination of the misclassification rate minimum. Then, the separating functions become linear, as in the case of two classes.
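A minimal sketch of this multiclass rule: the coefficients A_i, b_i, d_i of (2.70) are computed from (estimated) class means, covariance matrices, and priors, and a pattern is assigned to the class with the largest q_i(x) of (2.69). The two-class numerical example is hypothetical.

```python
import numpy as np

def quadratic_discriminants(means, covs, priors):
    """Coefficients A_i, b_i, d_i of the quadratic discriminants (2.69)/(2.70)."""
    out = []
    N = means[0].size
    for m, C, p in zip(means, covs, priors):
        Cinv = np.linalg.inv(C)
        A = -0.5 * Cinv
        b = Cinv @ m
        _, logdet = np.linalg.slogdet(C)
        d = (-0.5 * N * np.log(2 * np.pi) - 0.5 * logdet
             - 0.5 * m @ Cinv @ m + np.log(p))
        out.append((A, b, d))
    return out

def classify(x, discriminants):
    """Assign x to the class i with the largest q_i(x) = x^T A_i x + b_i^T x + d_i."""
    scores = [x @ A @ x + b @ x + d for A, b, d in discriminants]
    return int(np.argmax(scores))

# Hypothetical two-class example with different covariance matrices.
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]
disc = quadratic_discriminants(means, covs, priors)
print(classify(np.array([1.8, 2.1]), disc))
```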
of scaled impulses. For example, if the signal consists of the sequence (3, 2, 4) at times t = 0, 1, 2, it is equivalent to the sum of a triple impulse at time t = 0, a double impulse at t = 1, and a quadruple impulse at t = 2. So, the output at time t is the sum of correspondingly delayed impulse responses. This is exactly what the equality (2.71) expresses: the output $y_t$ is the sum of impulse responses $b_h$ to delayed impulses scaled by the input values $x_{t-h}$, respectively.
The response of most dynamical systems is theoretically infinite. However, if
these systems belong to the category of stable systems (i.e., those that do not diverge
to infinity), the response will approach zero in practical terms after some finite time
[1]. In the system shown in Fig. 2.17, the response is close to vanishing after about
15 time steps. Consequently, a mapping of the input pattern vector consisting of the
input measurements with delays h = 0, . . . , 15 may be a good approximation of the
underlying system dynamics.
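A sketch of such a finite-impulse-response mapping of the form described by (2.71), fitted by least squares with the delays h = 0, ..., 15 suggested above; the input signal and the decaying "true" impulse response are hypothetical stand-ins for measured data.

```python
import numpy as np

H = 15                                   # number of delayed inputs kept
rng = np.random.default_rng(0)
T = 500
x = rng.normal(size=T)                   # input signal (hypothetical)

# Hypothetical stable system: a decaying true impulse response, plus noise.
true_b = 0.8 ** np.arange(H + 1)
y = np.convolve(x, true_b)[:T] + 0.01 * rng.normal(size=T)

# Build input patterns (x_t, x_{t-1}, ..., x_{t-H}) for t = H .. T-1.
X = np.stack([x[t - H:t + 1][::-1] for t in range(H, T)])
b_hat, *_ = np.linalg.lstsq(X, y[H:], rcond=None)   # least-squares fit
print(np.round(b_hat[:5], 3))            # estimated impulse response b_0 .. b_4
```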
For more complex system behavior, the sequence length for a good approximation
may grow. The impulse response of the system depicted in Fig. 2.18 is vanishing
in practical terms only after about 50 time steps. This plot describes the behavior
of a second-order system. An example of such system is a mass fixed on an elastic
spring, an instance of which being the car and its suspension. Depending on the
damping of the spring component, this system may oscillate or not. Strong damping
as implemented by faultless car shock absorbers will prevent the oscillation while
insufficient damping by damaged shock absorbers will not.
With further growing system complexity, for example, for systems oscillating
with multiple different frequencies, the pattern size may easily grow to thousands of
time steps. Although the number of necessary time steps depends on the length of
the time step, too long time steps will lead to a loss of precision and at some limit
value even cease to represent the system in an unambiguous way.
The finite response representation of dynamic systems is easy to generalize to
nonlinear systems. The mapping sought is
$$y_t = f\left(x_t, \ldots, x_{t-h},\, w\right) \qquad (2.72)$$
for training samples with different indices t, with correspondingly delayed input
patterns. A generalization to multivariate systems (with vector output yt and a set
of delayed input vectors xt−h ) is equally easy. However, the length of a necessary
The first two miles of our descent we by no means found difficult, but
wishing to take a minute survey of the picturesque Pass of Llanberris,
we changed the route generally prescribed to strangers, and
descended a rugged and almost perpendicular path, in opposition to
the proposals of our guide, who strenuously endeavoured to dissuade
us from the attempt; alleging the difficulty of the steep, and relating
a melancholy story of a gentleman, who many years back had broken
his leg. This had no effect: we determined to proceed; and the vale
of Llanberris amply rewarded us for the trouble.
Mr. Williams of Llandigai, in his observations on the Snowdon
mountains (which, from his having been a resident on the spot, may
be considered as entitled to the greatest credit,) makes the following
remarks on the probable derivation of their names, and the customs
and manners of their inhabitants.
“It would be endless to point out the absurd conjectures and
misrepresentations of those who have of late years undertaken to
describe this country. Some give manifestly wrong interpretations of
the names of places, and others, either ignorantly or maliciously,
have as it were caricatured its inhabitants. Travellers from England,
often from want of candour, and always from defect of necessary
knowledge, impose upon the world unfavourable as well as false
accounts of their fellow-subjects in Wales; yet the candour of the
Welsh is such, that they readily ascribe such misrepresentations to an
ignorance of their language, and a misconception of the honest,
though perhaps warm temper of those that speak it. And it may be,
travellers are too apt to abuse the Welsh, because they cannot or will
not speak English. Their ignorance ought not to incur disgust: their
reluctance proceeds not from stubbornness, but from diffidence, and
the fear of ridicule.
“NATIVES OF ERYRI.
“The inhabitants of the British mountains are so humane and
hospitable, that a stranger may travel amongst them without
incurring any expense for diet or lodging. Their fare an Englishman
may call coarse; however, they commonly in farm-houses have three
sorts of bread, namely, wheat, barley, and oatmeal; but the oatmeal
they chiefly use; this, with milk, butter, cheese, and potatoes, is their
chief summer food. They have also plenty of excellent trout, which
they eat in its season. And for the winter they have dry salted beef,
mutton, and smoked rock venison, which they call Côch ar Wyden,
i.e. The Red upon the Withe, being hung by a withe, made of a
willow or hazel twig. They very seldom brew ale, except in some of
the principal farm-houses: having no corn of their own growing, they
think it a superfluous expense to throw away money for malt and
hops, when milk, or butter-milk mixed with water, quenches the thirst
as well.
“They are hardy and very active; but they have not the perseverance
and resolution which are necessary for laborious or continued
undertakings, being, from their infancy, accustomed only to ramble
over the hills after their cattle. In summer they go barefoot, but
seldom barelegged, as has been lately asserted by a traveller. They
are shrewd and crafty in their bargains, and jocular in their
conversation; very sober, and great economists; though a late tourist
has given them a different character. Their greetings, when they
meet any one of their acquaintance, may to some appear tedious and
disagreeable: their common mode of salutation is ‘How is thy heart?
how the good wife at home, the children, and the rest of the family?’
and that often repeated. When they meet at a public house, they will
drink each other’s health, or the health of him to whom the mug goes
at every round. They are remarkably honest.
“Their courtships, marriages, &c. differ in nothing from what is
practised on these occasions among the lowlanders or other Welsh
people; but as there are some distinct and local customs in use in
North Wales, not adopted in other parts of Great Britain, I shall, by
way of novelty, relate a few of them:—When Cupid lets fly his shaft at
a youthful heart, the wounded swain seeks for an opportunity to have
a private conversation with the object of his passion, which is usually
obtained at a fair, or at some other public meeting; where he, if bold
enough, accosts her, and treats her with wine and cakes. But he that
is too bashful will employ a friend to break the ice for him, and
disclose the sentiments of his heart: the fair one, however, disdains
proxies of this kind, and he that is bold, forward, and facetious, has a
greater chance of prevailing; especially if he has courage enough to
steal a few kisses: she will then probably engage to accept of his
nocturnal visit the next Saturday night. When the happy hour
arrives, neither the darkness of the night, the badness of the
weather, nor the distance of the place, will discourage him, so as to
abandon his engagement. When he reaches the spot, he conceals
himself in some out-building, till the family go to rest. His fair friend
alone knows of and awaits his coming. After admittance into the
house a little chat takes place at the fireside, and then, if every thing
is friendly, they agree to throw themselves on a bed, if there is an
empty one in the house; when Strephon takes off his shoes and coat,
and Phillis only her shoes; and covering themselves with a blanket or
two, they chat there till the morning dawn, and then the lover steals
away as privately as he came. And this is the bundling or courting in
bed, [168] for which the Welsh are so much bantered by strangers.
“This courtship often lasts for years, ere the swain can prevail upon
his mistress to accept of his hand. Now and then a pregnancy
precedes marriage; but very seldom, or never, before a mutual
promise of entering into the marriage state is made. When a
matrimonial contract is thus entered into, the parents and friends of
each party are apprised of it, and an invitation to the wedding takes
place; where, at the appointed wedding-day, every guest that dines
drops his shilling, besides payment for what he drinks: the company
very often amounts to two or three hundred, and sometimes more.
This donation is intended to assist the young couple to buy bed-
clothes, and other articles necessary to begin the world. Nor does
the friendly bounty stop here: when the woman is brought to bed,
the neighbours meet at the christening, out of free good-will, without
invitation, where they drop their money; usually a shilling to the
woman in the straw, sixpence to the midwife, and sixpence to the
cook; more or less, according to the ability and generosity of the
giver.
“MODE OF BURYING.
“When the parish-bell announces the death of a person, it is
immediately inquired upon what day the funeral is to be; and on the
night preceding that day, all the neighbours assemble at the house
where the corpse is, which they call Ty Corph, i.e. ‘the corpse’s
house.’ The coffin, with the remains of the deceased, is then placed
on the stools, in an open part of the house, covered with black cloth;
or, if the deceased was unmarried, with a clean white sheet, with
three candles burning on it. Every person on entering the house falls
devoutly on his knees before the corpse, and repeats to himself the
Lord’s prayer, or any other prayer that he chooses. Afterwards, if he
is a smoker, a pipe and tobacco are offered to him. This meeting is
called Gwylnos, and in some places Pydreua. The first word means
Vigil; the other is, no doubt, a corrupt word from Paderau, or
Padereuau, that is, Paters, or Paternosters. When the assembly is
full, the parish-clerk reads the common service appointed for the
burial of the dead: at the conclusion of which, psalms, hymns, and
other godly songs are sung; and since Methodism is become so
universal, some one stands up and delivers an oration on the
melancholy subject, and then the company drop away by degrees.
On the following day the interment takes place, between two and
four o’clock in the afternoon, when all the neighbours assemble
again. It is not uncommon to see on such occasions an assembly of
three or four hundred people, or even more. These persons are all
treated with warm spiced ale, cakes, pipes and tobacco; and a dinner
is given to all those that come from far: I mean, that such an
entertainment is given at the funerals of respectable farmers. [170a]
They then proceed to the church; and at the end of that part of the
burial service, which is usually read in the church, before the corpse
is taken from the church, every one of the congregation presents the
officiating minister with a piece of money; the deceased’s next
relations usually drop a shilling each, others sixpence, and the poorer
sort a penny a-piece, laying it on the altar. This is called offering,
and the sum amounts sometimes to eight, ten, or more pounds at a
burial. The parish-clerk has also his offering at the grave, which
amounts commonly to about one-fourth of what the clergyman
received. After the burial is over the company retire to the public-
house, where every one spends his sixpence for ale; [170b] then all
ceremonies are over.”—Mr. W. then proceeds to explain the good and
ill resulting from the prevalence of Methodism, and those fanatics
termed Ranters, &c., and states, that “the mountain-people preserve
themselves, in a great measure, a distinct race from the lowlanders:
they but very seldom come down to the lowlands for wives; nor will
the lowlander often climb up the craggy steeps, and bring down a
mountain spouse to his cot. Their occupations are different, and it
requires that their mates should be qualified for such different modes
of living.
“I will not scruple to affirm, that these people have no strange blood
in their veins,—that they are the true offspring of the ancient Britons:
they, and their ancestors, from time immemorial, have inhabited the
same districts, and, in one degree or other, they are all relations.”
The vale of Llanberris is bounded by the steep precipices of
Snowdon, and two large lakes, communicating by a river. It was
formerly a large forest, but the woods are now entirely cut down.
We here dismissed our Cambrian mountaineer, and easily found our
way to Dolbadern (pronounced Dolbathern) Castle, situated between
the two lakes, and now reduced to one circular tower, thirty feet in
diameter, with the foundations of the exterior buildings completely in
ruins: in this, Owen Gôch, brother to Llewellin, last prince, was
confined in prison. This tower appears to have been the keep or
citadel, about ninety feet in height, with a vaulted dungeon. At the
extremity of the lower lake are the remains of a British fortification,
called Caer cwm y Glô: and about half a mile from the castle, to the
south, at the termination of a deep glen, is a waterfall, called
Caunant Mawr; it rushes over a ledge of rocks upwards of twenty
yards in height, falls some distance in an uninterrupted sheet, and
then dashes with a tremendous roar through the impeding fragments
of the rock, till it reaches the more quiet level of the vale. Returning
to the lakes, you have a fine view of the ruins, with the promontory
on which they are situated; and that with greatly heightened effect, if
favoured by their reflection on the glassy surface of the waters, to
which you add the rocky heights on each side; Llanberris church,
relieving the mountain scenery, and the roughest and most rugged
cliffs of Snowdon in the back-ground topping the whole, which give
together a grand and pleasing coup d’œil.
In this vicinity are large slate quarries, the property of Thomas
Asheton Smith, Esq.; and a rich vein of copper ore. These afford
employ to great numbers of industrious poor: to the men, in
obtaining the ore and slates, and the women and children in
breaking, separating, and preparing the different sorts for
exportation, or for undergoing farther preparatory processes to fit
them for smelting. From hence a rugged horse-path brought us to
the Caernarvon turnpike-road, about six miles distant; the high
towers of the castle, the very crown and paragon of the landscape, at
last pointed out the situation of
CAERNARVON;
and having crossed a handsome modern stone bridge thrown over
the river Seiont, and built by “Harry Parry, the modern Inigo, a.d.
1791,” we soon entered this ancient town, very much fatigued from
our long excursion.
The town of Caernarvon, beautifully situated and regularly built, is in
the form of a square, enclosed on three sides with thick stone walls;
and on the south side defended by the Castle.
The towers are extremely elegant; but not being entwined with ivy,
do not wear that picturesque appearance which castles generally
possess. Over the principal entrance, which leads into an oblong
court, is seated, beneath a great tower, the statue of the founder,
holding in his left hand a dagger; this gateway was originally fortified
with four portcullises. At the west end, the eagle tower, remarkably
light and beautiful, in a polygon form; three small hexagon turrets
rising from the middle, with eagles placed on their battlements; from
thence it derives its name. In a little dark room [173a] in this tower,
measuring eleven feet by seven, was born King Edward II. April 25,
1284. The thickness of the wall is about ten feet. To the top of the
tower we reckoned one hundred and fifty-eight steps; from whence
an extensive view of the adjacent country is seen to great
advantage. On the south are three octagonal towers, with small
turrets, and similar ones on the north. All these towers
communicate with each other by a gallery on the ground,
middle, and upper floors, formed within the immense thickness of the
walls, in which are cut narrow slips, at convenient distances, for the
discharge of arrows.
This building, founded on a rock, is the work of King Edward I., the
conqueror of the principality; the form of it is a long irregular square,
enclosing an area of about two acres and a half. From the
information of the Sebright manuscript, Mr. Pennant says, that, by
the united efforts of the peasants, it was erected within the space of
one year.
Having spent near three hours in surveying one of the noblest castles
in Wales, we walked round the environs of the town. The terrace
[173b]
round the castle wall, when in existence, was exceedingly
pleasing, being in front of the Menai, which is here upwards of a mile
in breadth, forming a safe harbour, and is generally crowded with
vessels, exhibiting a picture of national industry; whilst near it a
commodious quay presents an ever-bustling scene, from whence a
considerable quantity of slate, and likewise copper, from the
Llanberris mine, is shipped for different parts of the kingdom.
Caernarvon may certainly be considered as one of the handsomest
and largest towns in North Wales; and under the patronage of Lord
Uxbridge promises to become still more populous and extensive.
In Bangor-street, is the Uxbridge Arms hotel, a large and most
respectable inn; where, as well as at the Goat, the charges are
moderate and the accommodations excellent.
Caernarvon is only a township and chapelry to Llanbeblic. Its market
is on a Saturday, and is well supplied and reasonable; this, with the
spirited improvements made to the town and harbour, has been the
means of greatly increasing its population: according to the late
returns it contains 1008 houses, and 6000 inhabitants. The church,
or rather chapel, has been rebuilt by subscription. Service is
performed here in English, and at the mother church at Llanbeblic
[174]
in Welsh.
The Port, although the Aber sand-banks, forming a dangerous bar,
must ever be a great drawback upon it, has not only been
wonderfully improved, but is in that progressive state of improvement
by the modern mode of throwing out piers, that vessels of
considerable tonnage can now lie alongside the quay, and discharge or take in
their cargoes in perfect safety; this bids fair, as may be seen by the
rapid increase of its population and tonnage, to make it a place of
trade and considerable resort: yet still it only ranks as a creek, and its
custom-house is made dependent on that of the haven of Beaumaris;
to the comptroller of which its officer is obliged to report: this must
be a considerable hindrance to its trade, particularly in matters out of
the customary routine. The county hall, which is near the castle, is a
low building, but sufficiently commodious within to hold with
convenience the great sessions. Caernarvon possessed such great
favour with Edward the 1st. as to have the first royal charter granted
in Wales given to it. It is by that constituted a free borough: it has
one alderman, one deputy mayor, two bailiffs, a town-clerk, two
serjeants-at-mace, and a mayor; who, for the time, is governor of the
castle, and is allowed 200l. per annum to keep it in repair; it, jointly
with Conway, Nevin, Criccaeth, and Pwllheli, sends a member to
parliament; for the return of whom, every inhabitant, resident or
non-resident, who has been admitted to the freedom of the place,
possesses a vote.
It is allowed to have a prison for petty offences independent of the
sheriff. Its burgesses likewise were exempt throughout the kingdom
from tollage, lastage, passage, murage, pontage, and all other
impositions of whatever kind, with other privileges, too numerous to
insert.
The county prison is likewise near the castle. It was erected in the
year 1794. The new market-house, containing the butchers’
shambles, &c. is a well-contrived and convenient building, affording
good storage for corn and other articles left unsold.
The site of the ancient town of Segontium, which lies about half a
mile south of the present one, will be found worthy the attention of
the traveller; it was the only Roman station of note in this part of
Cambria, on which a long chain of minor forts and posts were
dependent. It is even maintained, and that by respectable
authorities, that it was not only the residence, but burial-place of
Constantius, father of Constantine the Great; but most probably this
arises from confusing Helena, the daughter of Octavius, duke of
Cornwall, who was born at Segontium, and married to Maximus, first
cousin of Constantine, with Helena his mother, whom these
authorities assert to have been the daughter of a British king. A
chapel, said to have been founded by Helen, and a well which bears
her name, are amongst the ruins still pointed out.
Since the numerous late improvements have been going forward, at
and near Caernarvon, new and interesting lights have been thrown
on the ruins in its vicinity, which will form a rich treat to the
antiquary.
Near the banks of the Seint, from which Segontium took its name,
and which runs from the lower lake of Llanberris, are the remains of
a fort, which appears to have been calculated to cover a landing-
place from the river at the time of high-water: it is of an oblong
shape, and includes an area of about an acre; one of the walls which
is now standing is about seventy-four yards, and the other sixty-four
yards long, in height from ten to twelve feet, and nearly six feet in
thickness. The peculiar plan of the Roman masonry is here
particularly discernible, exhibiting alternate layers, the one regular,
the other zig-zag; on these their fluid mortar was poured, which
insinuated itself into all the interstices, and set so strong as to form
the whole into one solid mass; retaining its texture even to the
present day, to such a degree, that the bricks and stone in the
Roman walls yield as easily as the cement.
English history has spoken so fully on this place, as connected with
Edward the 1st., on the title, which he, from his son being born in
this castle, so artfully claimed for him, and the future heirs apparent
to the British throne, as affording to the Welsh a prince of their own,
agreeable to their wishes, and the quiet annexation of the principality
to his dominions, which Edward by this means obtained, that it
appears superfluous to enlarge upon it in this work.
Several excursions may be made from Caernarvon, with great
satisfaction to the tourist; the principal of which is a visit to
PLAS-NEWYDD,
the elegant seat of the Marquis of Anglesea, situated in the Isle of
Anglesea, and distant about six miles from Caernarvon: if the wind
and tide prove favourable, the picturesque scenery of the Menai will
be viewed to great advantage by hiring a boat at the quay. [178] But if
this most advisable plan should not be approved of, the walk to the
Moel-y-don ferry, about five miles on the Bangor road, will prove
highly gratifying: the Menai, whose banks are studded with
gentlemen’s seats, appearing scarcely visible between the rich foliage
of the oak, which luxuriates to the water’s brink, is filled with vessels,
whose shining sails, fluttering in the wind, attract and delight the
observing eye; whilst the voice of the sailors, exchanging some salute
with the passing vessel, is gently wafted on the breeze.
Crossing the ferry, we soon reached the ancient residence of the
arch-druid of Britain, where was formerly stationed the most
celebrated of the ancient British academies: from this circumstance,
many places in this island still retain their original appellation, as
Myfyrim, the place of studies; Caer Idris, the city of astronomy;
Cerrig Boudin, the astronomer’s circle. The shore to the right soon
brought us to the plantations of Plâs-Newydd, consisting chiefly of
the most venerable oaks, and noblest ash in this part of the country:
BANGOR,
the oldest episcopal see in Wales; being founded in 516.
The situation is deeply secluded, “far from the bustle of a jarring
world,” and must have accorded well with monastic melancholy; for
the Monks, emerging from their retired cells, might here indulge in
that luxurious gloominess, which the prospect inspires, and which
would soothe the asperities inflicted upon them by the severe
discipline of superstition. The situation of Bangor appears more like
a scene of airy enchantment than reality; and the residences of the
Canons are endeared to the votaries of landscape by the prospect
they command. On the opposite shore, the town of Beaumaris was
seen straggling up the steep declivity, with its quay crowded with
vessels, and all appeared bustle and confusion; the contrast, which
the nearer prospect inspired, was too evident to escape our notice,
where the
MENAI.
This Strait, which separates Anglesea from the main land, although
bearing only the appearance of a river, is an arm of the sea, and
most dangerous in its navigation at particular periods of the tide, and
in boisterous weather: during the flood, from the rush of water at
each extremity, it has a double current, the clash of which, termed
Pwll Ceris, it is highly rash and dangerous to encounter. In the space
of fifteen miles, there are six established ferries: the first of which to
the south is Abermenai, the next near Caernarvon, and three miles
north from the first is Tal y foel; four miles further, Moel y don; three
miles beyond which is the principal one, called Porthaethwy, but more
generally known as Bangor Ferry; it is the narrowest part of the
Strait, and is only about half a mile wide; this is the one over which
the mails and passengers pass on their route to and from Holyhead,
and near which is the bridge, of which a particular description and
plan is for the first time given; a mile further north is the fifth, Garth
Ferry; and the sixth, and widest ferry at high water, is between the
village of Aber and Beaumaris. Yet notwithstanding these ferries, the
principal part of the horned cattle that pass from Anglesea are
compelled by their drivers to swim over the passage at Bangor Ferry,
to the terror and injury of the animals, and the disgust and horror of
the bystanders.
There appears but little doubt of Anglesea having been once
connected with the main land, as evident traces of an isthmus are
discernible near Porthaethwy; where a dangerous line of rocks
nearly cross the channel, and cause such eddies at the first flowing of
the tide, that the contending currents of the Menai seem here to
struggle for superiority. This isthmus once destroyed, and a channel
formed, it has been the work of ages, by the force of spring tides and
storms, gradually to deepen and enlarge the opening; as it appears
by history, that both Roman and British cavalry, at low water, during
neap tides, forded or swam over the Strait, and covered the landing
of the infantry from flat-bottomed boats.
The violent rush of water, and consequent inconvenience, delay, and
danger, when the wind and tide are unfavourable to the passage over
Bangor Ferry, in the present state of constant and rapid
communication with Ireland, gave rise to the idea of forming a bridge
over the Menai. Various estimates and plans were submitted to the
public consideration by our most celebrated engineers, and men of
science; when, after numerous delays, Mr. Telford’s design for one on
the suspension principle was adopted, and money granted by
parliament for carrying it into effect. The first stone of this
magnificent structure was laid on the 10th of August, 1819, without
any ceremony, by the resident engineer, Mr. Provis, and the
contractors for the masonry.
“When on entering the Straits,” [189] says a recent author, “the bridge
is first seen, suspended as it were in mid air, and confining the view
of the fertile and richly-wooded shores, it seems more like a light
ornament than a massy bridge, and shows little of the strength and
solidity which it really possesses. But as we approached it nearer,
whilst it still retained its light and elegant appearance, the
stupendous size and immensity of the work struck us with awe; and
when we saw that a brig, with every stick standing, had just passed
under it,—that a coach going over appeared not larger than a child’s
toy, and that foot-passengers upon it looked like pigmies, the
vastness of its proportions was by contrast fully apparent.” The
whole surface of the bridge is 1,000 feet in length, of which the part
immediately dependent upon the chains is 590 feet, the remaining
distance being supported by seven arches, four on one side and three
on the other, which fill up the distance from the main piers to the
shore. These main piers rise above the level of the road 50 feet, and
through them, two archways, each 12 feet wide, admit a passage.
Over the top of these piers, four rows of chains, the extremities of
which are firmly secured in the rocks at each end of the bridge, are
thrown; two of them nearly in the centre, about four feet apart, and
one at each side. The floor of the road is formed of logs of wood,
well covered with pitch, and then strewn over with granite broken
very small, forming a solid body by its adhesion to the pitch
impervious to the wet. A light lattice work of wrought iron to the
height of about six feet, prevents the possibility of accidents by falling
over, and allows a clear view of the scenery on both sides, which can
be seen to great advantage from this height. Having expressed our
admiration of the skill evident in the construction, at once so simple
and so useful, and having satisfied our curiosity on the top, we
descended by a precipitous path to the level of the water, and gazed
upwards with wonder, at the immense flat surface above us, and its
connecting gigantic arches. The road is 100 feet above high water,
and the arches spring at the height of 60 feet from abutments of
solid masonry, with a span of 52 feet. These abutments taper
gradually from their base to where the arch commences, and
immense masses as they are, show no appearance of heaviness;
indeed, taking the whole of the Menai Bridge together, a more perfect
union of beauty with utility cannot be conceived. It has been erected
to bear a weight upon the chains of 2,000 tons; the whole weight at
present imposed is only 500, leaving an available strength of 1,500
tons; so that there is an easy remedy for a complaint which has been
made of its too great vibration in a gale of wind, by laying additional
weight upon it. The granite of which the piers and arches are built, is
a species of marble, admitting a very high polish; of this the
peasantry in the neighbourhood avail themselves, and every one has
some specimen of polished marble ready to offer the tourist. There
is so much magnificence, beauty, and elegance, in this grand work of
art, that it harmonizes and accords perfectly with the natural scenery
around, and though itself an object of admiration, still in connection it
heightens the effect of the general view.
BEAUMARIS,
the largest and best built town in Anglesea, is pleasantly situated on
the western shore of the bay of that name, and commands a fine
view of the sea and the Caernarvonshire mountains. Its original
name was Porth Wygyr. Its harbour is well sheltered, and affords
ample protection for coasters, and ships of considerable burthen,
which, during northerly winds, are driven there in great numbers, to
avoid the dangers of a lee shore. As no manufactures of
consequence are carried on in its neighbourhood, it is rather
calculated for great retirement, than for active bustle; but being the
county town, it is now and then enlivened by the gaieties attendant
upon assizes, elections, and other public meetings.
The castle, built by Edward I. in 1295, stands in the estate of Lord
Bulkeley, close to the town, and covers a considerable space of
ground; but from its low situation it was always inferior in point of
strength to the castles of Conway and Caernarvon.
Close above the town is Baron Hill, the seat of Lord Viscount Warren
Bulkeley, delightfully situated on the declivity of a richly wooded
bank, and possessing a complete command of every object which can
add to the charms of picturesque scenery. The park extends to, and
nearly surrounds, the west and north sides of the town; whilst the
rising ground, upon which the mansion stands, shelters the town
from the rude blasts that would otherwise assail it; thus giving it that
protection from the raging of the elements which the noble owner
ever affords to its inhabitants, when sorrow and adversities assail
their domestic peace. To enumerate all the acts of Lord Bulkeley’s
munificence and kindness would be impossible, but a few of them
may be seen in the neighbourhood of Beaumaris.
The beautiful road of four miles and a half, along the shore of the
Menai to Bangor Ferry, was made at the expense of Lord and Lady
Bulkeley in 1804: it cost about £3000, and, when completed, was
presented to the public and has since been maintained at his
lordship’s expense. A road possessed of greater picturesque beauty
is not to be found in Britain.
The church is kept in repair by his lordship, to which he has
presented an excellent organ, a set of elegant communion plate, a
clock, and a peal of six fine toned bells; together, costing about
£1200. He has also given a good house to the rector for the time
being. The national school, as well as the minister’s house, was built
by public subscription, on land given by Lord Bulkeley; and the
master’s and mistress’s salaries have since been paid by him and his
lady.
Many more acts of their liberality might be enumerated, but these are
sufficient to prove them zealous protecting friends, and kind
neighbours. Their numerous deeds of private charity ought not to be
blazoned to the world, but they will live long in the grateful
remembrance of those around them.
Beaumaris, situated 249 miles from London, had, in 1811, 249
houses, and 1,810 inhabitants; and in 1821 a population of 2,205. It
is governed by a mayor, recorder, two bailiffs, twenty-four capital
burgesses, and several inferior officers. It formerly possessed an
extensive trade; but has declined since the rise of Liverpool.
From Beaumaris we proceeded, by Dulas and Red Wharf Bay, to
Amlwch; the distance is about sixteen miles, through a pleasant
country, in parts greatly resembling England. About a mile from Red
Wharf Bay you pass the village of Pentraeth, The End of the Sands.
The situation is pleasant; and Mr. Grose was so taken with the
picturesque beauty of its small church, as to give a view of it in his
Antiquities.
Near this, in a field at Plâs Gwynn, the seat of the Panton family, are
two stones, placed, as tradition says, to mark the bounds of an
astonishing leap; which obtained for the active performer of it the
wife of his choice; but it appears, that as he leaped into her
affections with difficulty, he ran away from them with ease; for going
to a distant part of the country, where he had occasion to reside
several years, he found, on his return, that his wife had, on that very
morning, been married to another person. Einson, on hearing this,
took his harp, and, sitting down at the door, explained in Welsh metre
who he was, and where he had been resident. His wife narrowly
scrutinized his person, unwilling to give up her new spouse, when he
exclaimed:
At Llanfair, which is about a mile distant from this road, was born the
celebrated scholar and poet, Goronwy Owen, who, notwithstanding
his acknowledged and admired abilities, was, after a series of
hardships and struggles, obliged to expatriate himself to the wilds of
Virginia, where he was appointed pastor of the Church. He was well
versed in the Latin, Greek, and oriental languages, was a skilful
antiquary, and an excellent poet. His Latin odes are greatly admired;
but his Welsh poems rank him among the most distinguished bards of
his country.
About five miles west of Beaumaris is Peny-mynydd, the birth-place
of Owen Tudor, a private gentleman, who, having married Catherine
of France, the Dowager of our Henry V., in 1428, became the
ancestor of a line of monarchs. They had three sons and one
daughter. The daughter died in her infancy: Edmund was created
Earl of Richmond, and marrying a daughter of the Duke of Somerset,
had Henry, afterwards Henry VII. Jasper was created Earl of
Pembroke; and Owen became a monk. By means of his marriage,
therefore, Owen Tudor not only became father to a line of kings; but
in his son, as Gray says, Wales came to be governed again by their
own princes.
The Tudor family became extinct in Richmond Tudor, who died in
1657, and the estate belongs to Lord Bulkeley. In the Church is one
of their monuments, removed from Lanvaes Abbey at its dissolution.
LLANELIAN
is about two miles east of Amlwch, near the coast: Mr. Bingley’s
account of which, and the superstitious ceremonies still attaching to
it, is both curious and entertaining:
AMLWCH,
or the Winding Loch, is a dirty-looking straggling town, founded on
rocks. It owes its support chiefly to the copper works in its vicinity.
The church is a neat modern structure, dedicated to Elaeth, a British
saint: the port, which is but small, is, notwithstanding, excellently
adapted for the trade which is carried on; it is narrow, capable of
containing only two vessels abreast, of about 200 tons burthen each,
and of these it will furnish room for about thirty; the entrance is by a
chasm between two rocks.
The Parys mountain, like the works at Merthyr, shews what the
industry of man is capable of accomplishing in removing rocks,
mountains, and dragging forth the bowels of the earth. To those who
possess good nerves, the view of this scene of wealth and industry
will afford gratification unalloyed; but to those not so blessed, the
horrific situations in which the principal actors of the scene are
placed, poised in air, exposed to the blasting of the rocks, and to the
falling of materials which they themselves are sending aloft, or which,
misdirected in their ascent from the workings of others, strike
against projecting crags, seem to threaten death in so
many varied shapes, that the wonder and admiration excited by the
place are lost in pity and anxiety for the hardy miners.
From the top of the mountain, the dreadful yawning chasm, with the
numerous stages erected over the edge of the precipice, appal rather
than gratify the observer. To see the mine to advantage, you must
descend to the bottom, and be provided with a guide, to enable you
to shun the danger, which would otherwise be considerable, from the blasts and
falling materials; the workmen generally not being able to see those
whom their operations may endanger.
The Mona mine is the entire property of the Marquis of Anglesea.
The Parys mine is shared.
The mountain has been worked with varied success for about sixty-
five years: it is now believed to be under the average; but whether
that arises from the low price of the article, or the mine being
exhausted, I am unable to say: for a considerable period, it produced
20,000 tons annually. One bed of ore was upwards of sixty feet in
thickness. In blasting the rock, to procure the ore, from six to
eight tons of gunpowder are yearly consumed.
“This celebrated mountain,” says Mr. Evans, “is easily distinguished
from the rest; for it is perfectly barren from the summit to the plain
below: not a single shrub, and hardly a blade of grass, being able to
live in its sulphurous atmosphere.”
HOLYHEAD,
called in Welsh Caergybi, is situated on an island at the western
extremity of Anglesea. It has lately changed its aspect from a poor
fishing village to a decent looking town, in consequence of its being
the chief resort for passengers to and from Dublin. The distance
across the channel is about fifty-five miles; and there are sailing