Texts in Computer Science
Series Editors
David Gries, Department of Computer Science, Cornell University, Ithaca, NY,
USA
Orit Hazzan, Faculty of Education in Technology and Science, Technion—Israel Institute of Technology, Haifa, Israel
Titles in this series now included in the Thomson Reuters Book Citation Index!
‘Texts in Computer Science’ (TCS) delivers high-quality instructional content for
undergraduates and graduates in all areas of computing and information science,
with a strong emphasis on core foundational and theoretical material but inclusive
of some prominent applications-related content. TCS books should be reasonably
self-contained and aim to provide students with modern and clear accounts of topics
ranging across the computing curriculum. As a result, the books are ideal for
semester courses or for individual self-study in cases where people need to expand
their knowledge. All texts are authored by established experts in their fields,
reviewed internally and by the series editors, and provide numerous examples,
problems, and other pedagogical tools; many contain fully worked solutions.
The TCS series comprises high-quality, self-contained books that have
broad and comprehensive coverage and are generally in hardback format and
sometimes contain color. For undergraduate textbooks that are likely to be more
brief and modular in their approach, require only black and white, and are under
275 pages, Springer offers the flexibly designed Undergraduate Topics in Computer
Science series, to which we refer potential authors.
Tomas Hrycej • Bernhard Bermeitinger •
Matthias Cetto • Siegfried Handschuh
Mathematical Foundations of Data Science

Tomas Hrycej, Institute of Computer Science, University of St. Gallen, St. Gallen, Switzerland

Bernhard Bermeitinger, Institute of Computer Science, University of St. Gallen, St. Gallen, Switzerland
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Data Science is a rapidly expanding field of increasing relevance. There are correspondingly numerous textbooks about the topic. They usually focus on the various Data Science methods. In a growing field, there is a danger that the number of methods grows, too, at a pace that makes it difficult to compare their specific merits and application focus.
Faced with this avalanche of methods, the user is left alone with the judgment about which method to select. He or she can be helped only if some basic principles, such as fitting a model to data, generalization, and the abilities of numerical algorithms, are thoroughly explained, independently of the methodical approach. Unfortunately, these principles are hardly covered in the textbook variety. This book aims to close this gap.
Besides students as the intended audience, we also see a benefit for researchers in the field who want to gain a proper understanding of the mathematical foundations instead of mere computing experience, as well as for practitioners, who will get a mathematical exposition directed at making the causalities clear.
Comprehension Checks
In all chapters, important theses are summarized in their own paragraphs. All
chapters have comprehension checks for the students.
Acknowledgments
During the writing of this book, we have greatly benefited from students taking our
course and providing feedback on earlier drafts of the book. We would like to
explicitly mention the help of Jonas Herrmann for thorough reading of the manu-
script. He gave us many helpful hints for making the explanations comprehensible,
in particular from a student’s viewpoint. Further, we want to thank Wayne Wheeler
and Sriram Srinivas from Springer for their support and their patience with us in
finishing the book.
Finally, we would like to thank our families for their love and support.
Part II Applications
6 Specific Problems of Natural Language Processing . . . . . . . . . . . . . . 167
6.1 Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.2 Semantic Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.3 Recurrent Versus Sequence Processing Approaches . . . . . . . . . . . 171
6.4 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.5 Attention Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.6 Autocoding and Its Modification . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.7 Transformer Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.7.1 Self-attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.7.2 Position-Wise Feedforward Networks . . . . . . . . . . . . . . . . 184
6.7.3 Residual Connection and Layer Normalization . . . . . . . . . 184
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
7 Specific Problems of Computer Vision . . . . . . . . . . . . . . . . . . . . . . . 195
7.1 Sequence of Convolutional Operators . . . . . . . . . . . . . . . . . . . . . 196
7.1.1 Convolutional Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.1.2 Pooling Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Acronyms
AI Artificial Intelligence
ARMA Autoregressive Moving Average
BERT Bidirectional Encoder Representations from Transformers
CNN Convolutional Neural Network
CV Computer Vision
DL Deep Learning
DS Data Science
FIR Finite Impulse Response
GRU Gated Recurrent Unit
IIR Infinite Impulse Response
ILSVRC ImageNet Large Scale Visual Recognition Challenge
LSTM Long Short-Term Memory Neural Network
MIMO Multiple Input/Multiple Output
MSE Mean Square Error
NLP Natural Language Processing
OOV Out-of-Vocabulary
PCA Principal Component Analysis
ReLU Rectified Linear Unit
ResNet Residual Neural Network
RNN Recurrent Neural Network
SGD Stochastic Gradient Descent
SISO Single Input/Single Output
SVD Singular Value Decomposition
SVM Support Vector Machine
1 Data Science and Its Tasks
As the name Data Science (DS) suggests, it is a scientific field concerned with data.
However, this definition would encompass the whole of information technology.
This is not the intention behind delimiting Data Science. Rather, the focus is on
extracting useful information from data.
In the last decades, the volume of processed and digitally stored data has reached
huge dimensions. This has led to a search for innovative methods capable of coping
with large data volumes. A natural analogue is the intelligent information processing performed by higher living organisms. They are supplied with a continuous stream of voluminous sensory data (delivered by senses such as vision, hearing, or touch) and use this stream for immediate or delayed action favorable to the organism. This fact makes the field of Artificial Intelligence (AI) a natural source of
potential ideas for Data Science. These technologies complement the findings and
methods developed by classical disciplines concerned with data analysis, the most
prominent of which is statistics.
The research subject of Artificial Intelligence (AI) is all aspects of sensing, recog-
nition, and acting necessary for intelligent or autonomous behavior. The scope of
Data Science is similar but focused on the aspects of recognition. Given the data,
collected by sensing or by other data accumulation processes, the Data Science tasks consist in recognizing patterns that are interesting or important in some defined sense. More concretely, these tasks can take (but are not limited to) the following forms:
Depending on the character of the task, the data processing may be static or
dynamic. The static variant is characterized by a fixed data set in which a pattern is
to be recognized. This corresponds to the mathematical concept of a mapping: Data
patterns are mapped to their pattern labels. Static recognition is a widespread setting
for image processing, text search, fraud detection, and many others.
With dynamic processing, the recognition takes place on a stream of data provided continuously in time. The pattern sought can be found only by observing this stream and its dynamics. A typical example is speech recognition.
Historically, the first approaches to solving these tasks date back several centuries and have been continually developed since. The traditional disciplines have been statistics and systems theory, the latter investigating dynamic system behavior.
These disciplines provide a large pool of scientifically founded findings and meth-
ods. Their natural focus on linear systems results from the fact that these systems are
substantially easier to treat analytically. Although some powerful theory extensions
to nonlinear systems are available, a widespread approach is to treat the nonlinear
systems as locally linear and use linear theory tools.
AI has passed through several phases. Its origins in the 1950s focused on simple learning principles, mimicking basic aspects of the behavior of biological neuron cells. The information to be processed was represented by real-valued vectors. The corresponding computing procedures belong to the domain of numerical mathematics. The complexity of the algorithms was limited by the computing power of the information processing devices available at that time. The typical tasks solved were simple classification problems involving the separation of two classes.
Limitations of this approach with the information processing technology of the time led to an alternative view: logic-based AI. Instead of focusing on sensor information, logical statements and, correspondingly, logically sound conclusions were investigated. Such data represented some body of knowledge, which motivated calling the approach knowledge-based. The software systems for such processing were labeled “expert systems” because of the necessity of encoding expert knowledge in an appropriate logical form.
This field reached a considerable degree of maturity in the machine processing of logical statements. However, the next obstacle had to be surmounted: the possibility of describing the real world in logical terms showed its limits. Many relationships important for intelligent information processing and behavior turned out to be too diffuse for the unambiguous language of logic. Although some attempts to extend logic by probabilistic or pseudo-probabilistic attributes (fuzzy logic) delivered applicable results, the next change of paradigm took place.
With the fast increase of computing power, also using interconnected computer
networks, the interest in the approach based on numerical processing of real-valued
data revived. The computing architectures are, once more, inspired by neural systems
of living organisms. In addition to the huge growth of computing resources, this phase
The authors hope to present concise and transparent answers to these questions
wherever allowed by the state of the art.
Part I
Mathematical Foundations
2 Application-Specific Mappings and Measuring the Fit to Data
The mappings considered in the following are parameterized mappings f(x, w) with a parameter vector w. For linear mappings of type (2.2), the parameter vector w consists of the elements of the matrix B.
There are several basic application types with their own interpretation of the
mapping sought. The task of fitting a mapping of a certain type to the data requires a
measure of how good this fit is. An appropriate definition of this measure is important
for several reasons:
• In most cases, a perfect fit with no deviation is not possible. To select from alternative solutions, comparing the values of the fit measure is necessary.
• For optimum mappings of a simple type such as linear ones, analytical solutions
are known. Others can only be found by numerical search methods. To control the
search, repeated evaluation of the fit measure is required.
• The most efficient search methods require smooth fit measures with existing or
even continuous gradients, to determine the search direction where the chance for
improvement is high.
For some mapping types, these two groups of requirements are difficult to meet
in a single fit measure.
There are also requirements concerning the correspondence of the fit measure
appropriate from the viewpoint of the task on one hand and of that used for (mostly
numerical) optimization on the other hand:
• The very basic requirement is that both fit measures should be the same. This
seemingly trivial requirement may be difficult to satisfy for some tasks such as
classification.
• It is desirable that a perfect fit leads to a zero minimum of the fit measure. This is also not always satisfied, for example, with likelihood-based measures.
Difficulties in satisfying these requirements frequently lead to using different measures for the search on one hand and for the evaluation of the fit on the other hand. In such cases, it is preferable if both measures have at least a common optimum.
The most straightforward application type is using the mapping as what it mathe-
matically is: a mapping of real-valued input vectors to equally real-valued output
vectors. This type encompasses many physical, technical, and econometric applica-
tions. Examples of this may be:
• Failure rates (y) determined from operation time and conditions of a component
(x).
• Credit scoring, mapping the descriptive features (x) of the credit recipient to a
number denoting the creditworthiness (y).
• Macroeconomic magnitudes such as inflation rate (y) estimated from others such
as unemployment rate and economic growth (x).
For a vector mapping f(x, w), the error (2.4) is a column vector. The vector product $e^\top e$ is the sum of the squares of the errors of the individual output vector elements. Summing these errors over the K training examples results in the error measure

$$E = \sum_{k=1}^{K} e_k^\top e_k = \sum_{k=1}^{K} \sum_{m=1}^{M} e_{mk}^2 \qquad (2.5)$$
Different scaling of the individual elements of the vector patterns can make scaling weights $S = \mathrm{diag}\,(s_1, \dots, s_M)$ appropriate. Also, some training examples may be more important than others, which can be expressed by additional weights $r_k$. The error measure (2.5) has then the generalized form

$$E = \sum_{k=1}^{K} r_k\, e_k^\top S\, e_k = \sum_{k=1}^{K} \sum_{m=1}^{M} s_m r_k\, e_{mk}^2 \qquad (2.6)$$
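As a concrete illustration of (2.5) and (2.6), the following minimal sketch (Python with NumPy; the data arrays are hypothetical placeholders) evaluates both error measures for a batch of K training examples with M-dimensional outputs.

```python
import numpy as np

def error_measures(Y_true, Y_pred, s=None, r=None):
    """Evaluate the summed square error (2.5) and its weighted form (2.6).

    Y_true, Y_pred : arrays of shape (K, M) -- K training examples, M outputs.
    s : optional output weights (length M), the diagonal of S in (2.6).
    r : optional example weights (length K) in (2.6).
    """
    E_km = (Y_pred - Y_true) ** 2            # squared errors e_mk^2
    E_plain = E_km.sum()                     # equation (2.5)
    if s is None:
        s = np.ones(Y_true.shape[1])
    if r is None:
        r = np.ones(Y_true.shape[0])
    E_weighted = (r[:, None] * E_km * s[None, :]).sum()   # equation (2.6)
    return E_plain, E_weighted

# Small synthetic example (placeholder data):
rng = np.random.default_rng(0)
Y_true = rng.normal(size=(5, 3))
Y_pred = Y_true + 0.1 * rng.normal(size=(5, 3))
print(error_measures(Y_true, Y_pred, s=np.array([1.0, 2.0, 0.5])))
```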
For linear mappings (2.2), explicit solutions for reaching zero in the error measure
(2.5) and (2.6) are known. Their properties have been thoroughly investigated and
some important aspects are discussed in Chap. 4. Unfortunately, most practical ap-
plications deviate to a greater or lesser extent from the linearity assumption. Good
analytical tractability may be a good motivation to accept a linear approximation if
the expected deviations from the linearity assumption are not excessive. However, a
lot of applications will not allow such approximation. Then, some nonlinear approach
is to be used.
Modeling nonlinearities in the mappings can be done in two ways that differ strongly in their application.
The first approach preserves linearity in the parameters. The mapping (2.3) is expressed as

$$y = B\, h(x) \qquad (2.7)$$

with a nonparametric function h(x) which plays the role of the input vector x itself. In other words, h(x) can be substituted for x in all algebraic relationships valid for linear systems. This also includes the explicit solutions for the Mean Square Errors (MSEs) (2.5) and (2.6).
The function h (x) can be an arbitrary function but a typical choice is a polynomial
in vector x. This is motivated by the well-known Taylor expansion of an arbitrary
multivariate function [7]. This expansion enables an approximation of a multivariate
function by a polynomial of a given order on an argument interval, with known error
bounds.
For a vector x with two elements $x_1$ and $x_2$, a quadratic polynomial is

$$h(x_1, x_2) = \begin{pmatrix} 1 & x_1 & x_2 & x_1^2 & x_2^2 & x_1 x_2 \end{pmatrix}^\top \qquad (2.8)$$

For a vector x with three elements $x_1$, $x_2$, and $x_3$, it is already as complex as follows:

$$h(x_1, x_2, x_3) = \begin{pmatrix} 1 & x_1 & x_2 & x_3 & x_1^2 & x_2^2 & x_3^2 & x_1 x_2 & x_1 x_3 & x_2 x_3 \end{pmatrix}^\top \qquad (2.9)$$

For a vector x of length N, the length of the vector h(x) is

$$1 + N + N + \frac{(N-1)N}{2} = 1 + \frac{N^2 + 3N}{2} \qquad (2.10)$$
For a polynomial of order p, the size of vector h (x) grows with the pth power of
N . This is the major shortcoming of the polynomial approach for typical applications
of DS where input variable numbers of many thousands are common. Already with
quadratic polynomials, the input width would increase to millions and more.
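To make this growth concrete, the following sketch builds the quadratic feature vector h(x) of (2.8)/(2.9) for an arbitrary input dimension N, checks its length against (2.10), and fits the coefficient matrix of the linear-in-parameters model (2.7) by ordinary least squares. The function names and the synthetic data are illustrative assumptions, not taken from the book.

```python
import numpy as np

def quadratic_features(x):
    """Quadratic polynomial expansion h(x) as in (2.8)/(2.9):
    constant, linear terms, squares, and pairwise products."""
    x = np.asarray(x, dtype=float)
    n = x.size
    cross = [x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.concatenate(([1.0], x, x ** 2, cross))

N = 3
x = np.arange(1.0, N + 1.0)
h = quadratic_features(x)
assert h.size == 1 + (N ** 2 + 3 * N) // 2        # length formula (2.10)

# Linear-in-parameters fit y = B h(x): closed-form least squares over K samples.
rng = np.random.default_rng(0)
K, M = 200, 2
X = rng.normal(size=(K, N))
H = np.stack([quadratic_features(xk) for xk in X])   # K x (1 + (N^2+3N)/2)
Y = rng.normal(size=(K, M))                          # placeholder targets
B, *_ = np.linalg.lstsq(H, Y, rcond=None)            # minimizes the MSE (2.5)
print(B.shape)   # (number of features, M); here y ~ h(x) @ B, i.e. B is transposed
```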
Another disadvantage is the growth of higher polynomial powers outside of the
interval covered by the training set—a minor extrapolation may lead to excessively
high output values.
So, modeling the multivariate nonlinearities represented by polynomials is practi-
cal only for low-dimensional problems or problems in which it is justified to refrain
from taking full polynomials (e.g., only powers of individual scalar variables). With
such problems, it is possible to benefit from the existence of analytical optima and
statistically well-founded statements about the properties of the results.
These properties of parameterized mappings linear in parameters have led to
the high interest in more general approximation functions. They form the second
approach: mappings nonlinear in the parameters. A prominent example is neural networks, discussed in detail in Chap. 3. In spite of intensive research, practical statements about their representational capacity are scarce and overly general, although there are some interesting concepts such as the Vapnik–Chervonenkis dimension [21].
Neural networks with bounded activation functions such as sigmoid do not exhibit
the danger of unbounded extrapolation. They frequently lead to good results if the
number of parameters scales linearly with the input dimension, although the optimal-
ity or appropriateness of their size is difficult to show. Determining their optimum
size is frequently a result of lengthy experiments.
Minimizing the MSE (2.5) or (2.6) leads to a mapping making a good (or even
perfect, in the case of a zero error) forecast of the output vector y. This corresponds
to the statistical concept of point estimation of the expected value of y.
In the presence of effects unexplained by the input variables, or of some type of noise, the true values of the output will usually not be exactly equal to their expected values.
Rather, they will fluctuate around these expected values according to some probability
distribution. If the scope of these fluctuations is different for different input patterns
x, the knowledge of the probability distribution may be of crucial interest for the
application. In this case, it would be necessary to determine a conditional probability
distribution of the output pattern y conditioned on the input pattern x
g (y | x) (2.11)
If the expected probability distribution type is parameterized by parameter vector
p, then (2.11) extends to
g (y | x, p) (2.12)
From the statistical viewpoint, the input/output mapping (2.3) maps the input
pattern x directly to the point estimator of the output pattern y. However, we are
free to adopt a different definition: input pattern x can be mapped to the conditional
parameter vector p of the distribution of output pattern y. This parameter vector
has nothing in common with the fitted parameters of the mapping—it consists of
parameters that determine the shape of a particular probability distribution of the
output patterns y, given an input pattern x. After the fitting process, the conditional
probability distribution (2.12) becomes
g (y, f (x, w)) (2.13)
It is an unconditional distribution of output pattern y with distribution parameters
determined by the function f (x, w). The vector w represents the parameters of the
mapping “input pattern x ⇒ conditional probability distribution parameters p” and
should not be confused with the distribution parameters p themselves. For example,
in the case of mapping f () being represented by a neural network, w would corre-
spond to the network weights. Distribution parameters p would then correspond to
the activation of the output layer of the network for a particular input pattern x.
This can be illustrated on the example of a multivariate normal distribution with
a mean vector m and covariance matrix C. The distribution (2.12) becomes
$$g(y \mid x, p) = N\left(m(x), C(x)\right) = \frac{1}{\sqrt{(2\pi)^N \left|C(x)\right|}}\, e^{-\frac{1}{2}\left(y - m(x)\right)^\top C(x)^{-1} \left(y - m(x)\right)} \qquad (2.14)$$
The vector y can, for example, represent the forecast of temperature and humidity
for the next day, depending on today’s meteorological measurements x. Since a point forecast would scarcely hit tomorrow’s state exactly and would thus be of limited use, it will be substituted by the forecast that the temperature/humidity vector is expected
to have the mean m (x) and the covariance matrix C (x), both depending on today’s
measurement vector x. Both the mean vector and the elements of the covariance ma-
trix together constitute the distribution parameter vector p in (2.12). This parameter
vector depends on the vector of meteorological measurements x as in (2.13).
What remains is to choose an appropriate method to find the optimal mappings
m (x) and C (x) which depend on the input pattern x. In other words, we need
some optimality measure for the fit, which is not as simple as in the case of point
estimation with its square error. The principle widely used in statistics is that of
maximum likelihood. It consists of selecting distribution parameters (here: m and C)
such that the probability density value for the given data is maximum.
For a training set pattern pair $(x_k, y_k)$, the probability density value is

$$\frac{1}{\sqrt{(2\pi)^N \left|C(x_k)\right|}}\, e^{-\frac{1}{2}\left(y_k - m(x_k)\right)^\top C(x_k)^{-1} \left(y_k - m(x_k)\right)} \qquad (2.15)$$

For independent samples $(x_k, y_k)$, the likelihood of the entire training set is the product

$$\prod_{k=1}^{K} \frac{1}{\sqrt{(2\pi)^N \left|C(x_k)\right|}}\, e^{-\frac{1}{2}\left(y_k - m(x_k)\right)^\top C(x_k)^{-1} \left(y_k - m(x_k)\right)} \qquad (2.16)$$
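A minimal sketch of evaluating the logarithm of the likelihood (2.16), assuming the conditional means m(x_k) and covariance matrices C(x_k) have already been produced by some model; it uses generic NumPy linear algebra rather than the triangular decomposition discussed below, and all data are placeholders.

```python
import numpy as np

def gaussian_log_likelihood(Y, M, C):
    """Logarithm of the likelihood product (2.16) for K samples.

    Y : (K, N) observed output patterns y_k
    M : (K, N) conditional means m(x_k)
    C : (K, N, N) conditional covariance matrices C(x_k)
    """
    K, N = Y.shape
    total = 0.0
    for k in range(K):
        diff = Y[k] - M[k]
        _, logdet = np.linalg.slogdet(C[k])
        quad = diff @ np.linalg.solve(C[k], diff)
        # log of the density value (2.15)
        total += -0.5 * (N * np.log(2 * np.pi) + logdet + quad)
    return total

# Placeholder example: a hypothetical model predicting mean and covariance.
rng = np.random.default_rng(1)
K, N = 4, 2
Y = rng.normal(size=(K, N))
M = np.zeros((K, N))
C = np.stack([np.eye(N) for _ in range(K)])
print(gaussian_log_likelihood(Y, M, C))
```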
• Every symmetric positive definite matrix such as $C(x_k)^{-1}$ can be expressed as a product of a lower triangular matrix L and its transpose $L^\top$, that is, $C(x_k)^{-1} = L(x_k)\, L(x_k)^\top$.
• The determinant of a lower triangular matrix L is the product of its diagonal elements.
• The determinant of $L L^\top$ is the square of the determinant of L.
• The inverse $L^{-1}$ of a lower triangular matrix L is a lower triangular matrix, and its determinant is the reciprocal of the determinant of L.
We are then seeking the parameter pair (β(x), η(x)), depending on the input pattern x, such that the log-likelihood over the training set

$$\sum_{k=1}^{K}\left[\ln\frac{\beta(x_k)}{\eta(x_k)} + \left(\beta(x_k)-1\right)\ln\frac{y_k}{\eta(x_k)} - \left(\frac{y_k}{\eta(x_k)}\right)^{\beta(x_k)}\right]$$
$$= \sum_{k=1}^{K}\left[\ln\beta(x_k) - \beta(x_k)\ln\eta(x_k) + \left(\beta(x_k)-1\right)\ln y_k - \left(\frac{y_k}{\eta(x_k)}\right)^{\beta(x_k)}\right] \qquad (2.23)$$

is maximal. The parameter pair can, for example, be the output layer (of size 2) activation vector

$$\begin{pmatrix}\beta & \eta\end{pmatrix} = f(x, w) \qquad (2.24)$$
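For illustration, the sketch below evaluates the log-likelihood (2.23) for given arrays of outputs and predicted parameters; the expression is that of a Weibull-type distribution with shape β and scale η. In practice, β(x_k) and η(x_k) would come from the output layer (2.24) of a fitted model; here they are fixed placeholder values.

```python
import numpy as np

def log_likelihood_2_23(y, beta, eta):
    """Log-likelihood (2.23) summed over the training set.

    y    : (K,) observed positive outputs y_k
    beta : (K,) shape parameters beta(x_k) predicted from the inputs
    eta  : (K,) scale parameters eta(x_k) predicted from the inputs
    """
    return np.sum(
        np.log(beta) - beta * np.log(eta)
        + (beta - 1.0) * np.log(y)
        - (y / eta) ** beta
    )

# Placeholder usage: beta and eta would come from f(x, w), e.g. a small network
# with a two-unit output layer as in (2.24); here they are fixed for illustration.
y = np.array([0.5, 1.2, 2.0])
beta = np.array([1.5, 1.5, 1.5])
eta = np.array([1.0, 1.0, 1.0])
print(log_likelihood_2_23(y, beta, eta))
```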
2.2 Classification
Typical examples of classification tasks are:

• images in which the object type is sought (e.g., a face, a door, etc.);
• radar signature assigned to flying objects;
• object categories on the road or in its environment during autonomous driving.
Sometimes, the classes are only discrete substitutes for a continuous scale. Dis-
crete credit scores such as “fully creditworthy” or “conditionally creditworthy” are
only distinct values of a continuous variable “creditworthiness score”. Also, many
social science surveys classify the answers as “I fully agree”, “I partially agree”, “I am indifferent”, “I partially disagree”, and “I fully disagree”, which can be mapped
to a continuous scale, for example [−1, 1]. Generally, this is the case whenever the
classes can be ordered in an unambiguous way.
Apart from this case with inherent continuity, the classes may be an order-free
set of exclusive alternatives. (Nonexclusive classifications can be viewed as separate
tasks—each nonexclusive class corresponding to a dichotomy task “member” vs.
“nonmember”.) For such class sets, a basic measure of the fit to a given training or test
set is the misclassification error. The misclassification error for a given pattern may
be defined as a variable equal to zero if the classification by the model corresponds to
the correct class and equal to one if it does not. More generally, assigning the object
with the correct class i erroneously to the class j is evaluated by a nonnegative real
number called the loss $L_{ij}$. The loss of a correct class assignment is $L_{ii} = 0$.
The so-defined misclassification loss is a transparent measure, frequently directly
reflecting application domain priorities. By contrast, it is less easy to make it opera-
tional for fitting or learning algorithms. This is due to its discontinuous character—a
class assignment can only be correct or wrong. So far, solutions have been found
only for special cases.
Let us consider a simple problem with two classes and two-dimensional patterns
[x1 , x2 ] as shown in Fig. 2.3. The points corresponding to Class 1 and Class 2
can be completely separated by a straight line, without any misclassification. This
is why such classes are called linearly separable. The attainable misclassification
error is zero.
The existence of a separating line guarantees the possibility of defining regions in the pattern vector space corresponding to the individual classes. What is further needed is a function whose value indicates the membership of a pattern in a particular class. Such a function for the classes of Fig. 2.3 is shown in Fig. 2.4. Its value is unity for patterns from Class 1 and zero for those from Class 2.
Unfortunately, this function has properties disadvantageous for treatment by nu-
merical algorithms. It is discontinuous along the separating line and has zero gradient
elsewhere. This is why it is usual to use an indicator function of type shown in Fig. 2.5.
It is a linear function of the pattern variables. The patterns are assigned to Class
1 if this function is positive and to Class 2 otherwise.
Many, or even most, class pairs cannot be separated by a linear hyperplane. It is not easy to determine whether they can be separated by an arbitrary function if the family of these functions is not fixed. However, some classes can be separated by
simple surfaces such as quadratic ones. An example of this is given in Fig. 2.6. The
separating curve corresponds to the points where the separating function of Fig. 2.7
intersects the plane with y = 0.
The discrete separating function such as that of Fig. 2.4 can be viewed as a
nonlinear step function of the linear function of Fig. 2.5, that is,
$$s\left(b^\top x\right) = \begin{cases} 1 & \text{for } b^\top x \ge 0 \\ 0 & \text{for } b^\top x < 0 \end{cases} \qquad (2.25)$$
To avoid explicitly mentioning the absolute term, it will be assumed that the last
element of input pattern vector x is equal to unity, so that
$$b^\top x = \begin{pmatrix} b_1 & \cdots & b_{N-1} & b_N \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_{N-1} \\ 1 \end{pmatrix} = \begin{pmatrix} b_1 & \cdots & b_{N-1} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_{N-1} \end{pmatrix} + b_N$$
The misclassification sum for a training set with input/output pairs (xk , yk ) is
equal to
$$E = \sum_{k=1}^{K} \left( s\left(b^\top x_k\right) - y_k \right)^2 \qquad (2.26)$$
Here, yk is the class indicator of the kth training pattern with values 0 or 1. For
most numerical minimization methods for error functions E, the gradient of E with
regard to parameters b is required to determine the direction of descent towards low
values of E. The gradient is
$$\frac{\partial E}{\partial b} = 2 \sum_{k=1}^{K} \left( s\left(b^\top x_k\right) - y_k \right) \frac{ds}{dz}\, x_k \qquad (2.27)$$
with z being the argument of function s (z).
However, the derivative of the nonlinear step function (2.25) is zero everywhere except for the discontinuity at z = 0, where it does not exist. To obtain a useful descent direction, the famous perceptron rule [16] has used a gradient modification. This pioneering algorithm iteratively updates the weight vector b in the direction of
the (negatively taken) modified gradient
$$\frac{\partial E}{\partial b} = \sum_{k=1}^{K} \left( s\left(b^\top x_k\right) - y_k \right) x_k \qquad (2.28)$$
This modified gradient can be viewed as (2.27) with $\frac{ds}{dz}$ substituted by unity (the derivative of the linear function s(z) = z). Taking a continuous gradient approximation is an idea used by optimization algorithms for non-smooth functions, called subgradient algorithms [17].
The algorithm using the perceptron rule converges to zero misclassification rate
if the classes, as defined by the training set, are separable. Otherwise, convergence
is not guaranteed.
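A minimal sketch of the batch form of this update: the weight vector b is moved against the modified gradient (2.28) until no pattern is misclassified. The learning rate, the epoch limit, and the toy data are hypothetical choices, not prescribed by the perceptron rule itself.

```python
import numpy as np

def perceptron_fit(X, y, lr=0.1, epochs=100):
    """Iterate the perceptron rule: move b against the modified gradient (2.28).

    X : (K, N) patterns, last column equal to 1 (absorbs the absolute term)
    y : (K,) class indicators, 0 or 1
    """
    b = np.zeros(X.shape[1])
    for _ in range(epochs):
        s = (X @ b >= 0).astype(float)      # step function (2.25)
        if not np.any(s != y):              # zero misclassifications: stop
            break
        grad = (s - y) @ X                  # modified gradient (2.28)
        b -= lr * grad
    return b

# Linearly separable toy data (hypothetical): class by sign of x1 + x2 - 1.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 2, size=(100, 2))
y = (X[:, 0] + X[:, 1] - 1 > 0).astype(float)
Xb = np.hstack([X, np.ones((100, 1))])      # append the constant element
b = perceptron_fit(Xb, y)
print("misclassified:", int(np.sum((Xb @ b >= 0) != y)))
```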
An error measure focusing on critical patterns in the proximity of separating
line is used by the approach called the support vector machine (SVM) [2]. This
approach is looking for a separating line with the largest orthogonal distance to the
nearest patterns of both classes. In Fig. 2.8, the separating line is surrounded by
the corridor defined by two boundaries against both classes, touching the respective
nearest points. The goal is to find a separating line for which the width of this corridor
is the largest. In contrast to the class indicator of Fig. 2.4 (with unity for Class 1
and zero for Class 2), the support vector machine rule is easier to represent with a
symmetric class indicator y equal to 1 for one class and to −1 for another one. With
this class indicator and input pattern vector containing the element 1 to provide for the
absolute bias term, the classification task is formulated as a constrained optimization
task with constraints
$$y_k\, b^\top x_k \ge 1 \qquad (2.29)$$
If these constraints are satisfied, the product $b^\top x_k$ is at least 1 for Class 1 and at most −1 for Class 2.
The separating function $b^\top x$ of (2.29) is a hyperplane crossing the x1/x2-coordinate plane at the separating line (red line in Fig. 2.8). At the boundary lines, $b^\top x$ is equal to constants larger than 1 (boundary of Class 1) and smaller than −1 (boundary of Class 2). However, there are infinitely many such separating functions. In the cross section perpendicular to the separating line (i.e., viewing the x1/x2-coordinate plane “from aside”), they may appear as in Fig. 2.9.
There are infinitely many such hyperplanes (appearing as dotted lines in the cross section of Fig. 2.9), some of which become very “steep”. The most desirable variant would be the one exactly touching the critical points of both classes at a unity “height” (solid line). This is why the optimal solution of the SVM is the one with the minimum norm of the vector b:
simple: “separated” (successful fit) and “non-separated” (failing to fit). The absence
of intermediary results makes the problem of discontinuous misclassification error
or loss irrelevant—every separation is a full success.
For Gaussian classes with column vector means $m_1$ and $m_2$, and a common covariance matrix C, the matrix A and some parts of the constant d become zero. The discriminant function becomes linear:

$$b^\top x + d > 0$$

with

$$b^\top = (m_1 - m_2)^\top C^{-1}$$
$$d = -\frac{1}{2}\, b^\top (m_1 + m_2) + \ln\frac{p_1}{p_2} = -\frac{1}{2}\, (m_1 - m_2)^\top C^{-1} (m_1 + m_2) + \ln\frac{p_1}{p_2} \qquad (2.37)$$
This linear function is widely used in the linear discriminant analysis.
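A short sketch of evaluating the linear discriminant (2.37), assuming the class means, the common covariance matrix, and the prior probabilities have already been estimated; the numerical values are hypothetical.

```python
import numpy as np

def linear_discriminant(m1, m2, C, p1, p2):
    """Linear discriminant (2.37): classify x as Class 1 if b^T x + d > 0."""
    b = np.linalg.solve(C, m1 - m2)                 # b = C^{-1} (m1 - m2)
    d = -0.5 * b @ (m1 + m2) + np.log(p1 / p2)
    return b, d

# Hypothetical class parameters for illustration.
m1, m2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
C = np.array([[1.0, 0.2], [0.2, 1.0]])
b, d = linear_discriminant(m1, m2, C, p1=0.5, p2=0.5)
x = np.array([0.8, -0.1])
print("Class 1" if b @ x + d > 0 else "Class 2")
```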
Interestingly, the separating function (2.37) can, under some assumptions, also be obtained with a least squares approach. For simplicity, it will be assumed that the mean over both classes, $p_1 m_1 + p_2 m_2$, is zero. Class 1 and Class 2 are coded by 1 and −1, and the pattern vector x contains 1 at the last position.
The zero gradient is reached at

$$b^\top X^\top X = y^\top X \qquad (2.38)$$

By dividing both sides by the number of samples, the matrices $X^\top X$ and $y^\top X$ contain sample moments (means and covariances). The expected values are

$$E\left[\frac{1}{K}\, b^\top X^\top X\right] = E\left[\frac{1}{K}\, y^\top X\right] \qquad (2.39)$$
The expression $X^\top X$ corresponds to the sample second moment matrix. With the zero mean, as assumed above, it is equal to the sample covariance matrix. Every covariance matrix over a population divided into classes can be decomposed into the intraclass covariance C (in this case, identical for both classes) and the interclass covariance

$$M = \begin{pmatrix} m_1 & m_2 \end{pmatrix}, \qquad P = \begin{pmatrix} p_1 & 0 \\ 0 & p_2 \end{pmatrix}, \qquad C_{cl} = M P M^\top \qquad (2.40)$$

This can then be rewritten as

$$b^\top \begin{pmatrix} C + M P M^\top & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} p_1 m_1^\top - p_2 m_2^\top & p_1 - p_2 \end{pmatrix} \qquad (2.41)$$
resulting in

$$b^\top = \begin{pmatrix} p_1 m_1^\top - p_2 m_2^\top & p_1 - p_2 \end{pmatrix} \begin{pmatrix} C + C_{cl} & 0 \\ 0 & 1 \end{pmatrix}^{-1} = \begin{pmatrix} \left(p_1 m_1^\top - p_2 m_2^\top\right)\left(C + C_{cl}\right)^{-1} & p_1 - p_2 \end{pmatrix} \qquad (2.42)$$
It is interesting to compare the linear discriminant (2.37) with the least squares solution (2.42). With the additional assumption that both classes have identical prior probabilities $p_1 = p_2$ (and identical counts in the training set), the absolute term of both (2.37) and (2.42) becomes zero. The matrix $C_{cl}$ contains covariances of only two classes and is thus of maximum rank two. The additional condition of the overall mean being equal to zero reduces the rank to one. This results in the least squares-based separating vector b being only rescaled in comparison with that of the separating function (2.37). This statement can be inferred in the following way.
In the case of identical prior probabilities of both classes, the condition of zero mean of the distribution of all patterns is $m_1 + m_2 = 0$, or $m_2 = -m_1$. It can be rewritten as $m_1 = m$ and $m_2 = -m$ with the help of a single column vector of class means m. The difference of both means is $m_1 - m_2 = 2m$. The matrix $C_{cl}$ is
$$C_{cl} = \begin{pmatrix} m_1 & m_2 \end{pmatrix} \begin{pmatrix} \tfrac{1}{2} & 0 \\ 0 & \tfrac{1}{2} \end{pmatrix} \begin{pmatrix} m_1^\top \\ m_2^\top \end{pmatrix} = \frac{1}{2}\left(m_1 m_1^\top + m_2 m_2^\top\right) = m\, m^\top \qquad (2.43)$$
with rank equal to one—it is an outer product of only one vector m with itself.
The equation for the separating function b of the linear discriminant is

$$b^\top C = 2 m^\top \qquad (2.44)$$

while for the separating function $b_{LS}$ of least squares, it is

$$b_{LS}^\top \left(C + C_{cl}\right) = 2 m^\top \qquad (2.45)$$

Let us assume the proportionality of both solutions by a factor d:

$$b_{LS} = d\, b \qquad (2.46)$$

Then

$$d\, b^\top \left(C + C_{cl}\right) = 2 d\, m^\top + 2 d\, m^\top C^{-1} C_{cl} = 2 m^\top \qquad (2.47)$$

or

$$m^\top C^{-1} C_{cl} = m^\top C^{-1} m\, m^\top = \frac{1 - d}{d}\, m^\top = e\, m^\top \qquad (2.48)$$

with

$$e = \frac{1 - d}{d} \qquad (2.49)$$

and

$$d = \frac{1}{1 + e} \qquad (2.50)$$
The scalar proportionality factor e in (2.48) can always be found since $C_{cl} = m m^\top$ is a rank-one operator: it maps every vector, i.e., also the vector $m^\top C^{-1}$, into the one-dimensional space spanned by the vector m. In other words, these two vectors are always proportional. Consequently, a scalar proportionality factor d for the separating functions can always be determined via (2.50). This means that the proportional separating functions are equivalent since they separate identical regions.
The result of this admittedly tedious argument is that the least squares solution fitting the training set to the class indicators 1 and −1 is equivalent to the optimum linear discriminant, under the assumption of
This makes the least squares solution interesting since it can be applied without assumptions about the distribution—of course, with the caveat that it is not Bayes-optimal for other distributions. This seems to be the foundation of the popularity of this approach beyond the statistical community, for example, in neural network-based classification.
Its weakness is that the MSE reached cannot be interpreted in terms of misclassifi-
cation error—we only know that in the MSE minimum, we are close to the optimum
separating function. The reason for this lack of interpretability is that the function
values of the separating function are growing with the distance from the hyperplane
separating both classes while the class indicators (1 and −1) are not—they remain
constant at any distance. Consequently, the MSE attained by optimization may be
large even if the classes are perfectly separated. This can be seen by imagining a “lateral view” of the vector space given in Fig. 2.10. It is a cross section in the direction of the class separating line. The class indicators are constant: 1 (Class 1 to the left) and −1 (Class 2 to the right).
More formally, the separating function (for the case of separable classes) assigns the patterns, according to the test $b^\top x + d > 0$ for Class 1 membership, to the respective correct class. However, the value of $b^\top x + d$ is not equal to the class indicator y (1 or −1). Consequently, the MSE $\left(b^\top x + d - y\right)^2$ is far from zero in the optimum. Although alternative separating functions with identical separating lines can have different slopes, none of them can reach zero MSE. So, the MSE does not reflect the misclassification rate.
This shortcoming can be alleviated by using a particular nonlinear function of the term $b^\top x + d$. Since this function is usually used in the form producing class indicators 1 for Class 1 and zero for Class 2, it will reflect the rescaled linear situation of Fig. 2.11.
The nonlinear function is called the logistic or logit function in statistics and econometrics. With neural networks, it is usually referred to as the sigmoid function, related via rescaling to the hyperbolic tangent (tanh). It is a function of a scalar argument z:

$$y = s(z) = \frac{1}{1 + e^{-z}} \qquad (2.51)$$

This function maps the argument z ∈ (−∞, ∞) to the interval [0, 1], as shown in Fig. 2.12.
Applying (2.51) to the linear separating function $b^\top x + d$, that is, using the nonlinear separating function

$$y = s\left(b^\top x + d\right) = \frac{1}{1 + e^{-(b^\top x + d)}} \qquad (2.52)$$

will change the picture of Fig. 2.11 to that of Fig. 2.13. The forecast class indicators (red crosses) are now close to the original ones (blue and green circles).
The MSE is

$$\left( s\left(b^\top x + d\right) - y \right)^2 \qquad (2.53)$$

For separable classes, the MSE can be made arbitrarily close to zero, as depicted in Fig. 2.14. The proximity of the forecast and true class indicators can be increased
where the exponents yk and 1 − yk acquire values 0 or 1 and thus “select” the correct
alternative from (2.55).
For a sample (or training set) of mutually independent samples, the likelihood
over this sample is the product
$$\prod_{k=1}^{K} f(x_k, w)^{y_k}\, \left(1 - f(x_k, w)\right)^{1 - y_k} = \prod_{k=1,\, y_k=1}^{K} f(x_k, w) \prod_{k=1,\, y_k=0}^{K} \left(1 - f(x_k, w)\right) \qquad (2.57)$$
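The following sketch evaluates the logarithm of the likelihood (2.57), which is numerically safer than the raw product; the sigmoid-of-linear-score model standing in for f(x, w) and the clipping constant are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # logistic function (2.51)

def log_likelihood_2_57(p, y):
    """Logarithm of the likelihood product (2.57).

    p : (K,) model outputs f(x_k, w), interpreted as P(y_k = 1)
    y : (K,) class indicators, 0 or 1
    """
    eps = 1e-12                           # guard against log(0)
    p = np.clip(p, eps, 1 - eps)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Placeholder model: a linear score passed through the sigmoid (2.52).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(float)
b, d = np.array([2.0, -2.0]), 0.0         # hypothetical parameters
p = sigmoid(X @ b + d)
print(log_likelihood_2_57(p, y))
```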
If the training set is a representative sample from the statistical population as-
sociated with pattern xk , the expected value of likelihood per pattern L/K can be
evaluated. The only random variable in (2.58) is the class indicator y, with probability
p of being equal to one and 1 − p of being zero:
With a parameterized approximator f (x, w) that can exactly compute the class
probability for a given pattern x and some parameter vector w, the exact fit is
at both the maximum of likelihood and the MSE minimum (i.e., least squares).
Of course, to reach this exact fit, an optimization algorithm that is capable of
finding the optimum numerically has to be available. This may be difficult for
strongly nonlinear approximators.
Least squares with a logistic activation function thus seems to be the approach to classification that satisfies relatively well the requirements formulated at the beginning of Sect. 2.2.
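The following sketch illustrates the earlier claim that, for separable classes, the MSE (2.53) of the sigmoid-squashed separating function (2.52) can be pushed arbitrarily close to zero by scaling up the parameters; the separating function and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Separable toy data: class indicator 1 on one side of the line x1 = 0, else 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

b, d = np.array([1.0, 0.0]), 0.0          # a separating function (hypothetical)
for scale in (1, 5, 25, 125):
    pred = sigmoid(scale * (X @ b + d))   # nonlinear separating function (2.52)
    mse = np.mean((pred - y) ** 2)        # MSE (2.53), averaged over patterns
    print(f"scale {scale:4d}: MSE {mse:.6f}")
# The MSE shrinks toward zero as the slope grows, while the class assignment
# (pred > 0.5) is already correct for every pattern.
```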
It has to be pointed out that convergence of the perceptron rule to a stable state
for non-separable classes is not guaranteed without additional provisions.
Also, multiple class separation by SVM is possible by decomposing to a set of
two-class problems [4].
The generalization of the Bayesian two-class criterion (2.32) to a multiclass prob-
lem is straightforward. The posterior probability of membership of pattern x in the
ith class is
$$P_i = \frac{f(x \mid i)\, p_i}{\sum_{j=1}^{M} f(x \mid j)\, p_j} \qquad (2.67)$$
Seeking the class with the maximum posterior probability (2.67), the denominator, identical for all classes, can be omitted.
Under the assumption that the patterns of every class follow multivariate normal
distribution, the logarithm of the numerator of (2.67) is
$$\ln\frac{1}{\sqrt{(2\pi)^N \left|C_i\right|}} - \frac{1}{2}\left(x - m_i\right)^\top C_i^{-1}\left(x - m_i\right) + \ln p_i$$
$$= -\frac{N}{2}\ln(2\pi) - \frac{1}{2}\ln\left|C_i\right| - \frac{1}{2}\, x^\top C_i^{-1} x + m_i^\top C_i^{-1} x - \frac{1}{2}\, m_i^\top C_i^{-1} m_i + \ln p_i \qquad (2.68)$$

which can be organized into the quadratic expression

$$q_i(x) = x^\top A_i x + b_i^\top x + d_i \qquad (2.69)$$

with

$$A_i = -\frac{1}{2}\, C_i^{-1}, \qquad b_i^\top = m_i^\top C_i^{-1}, \qquad d_i = -\frac{N}{2}\ln(2\pi) - \frac{1}{2}\ln\left|C_i\right| - \frac{1}{2}\, m_i^\top C_i^{-1} m_i + \ln p_i \qquad (2.70)$$
The Bayesian optimum, which simultaneously minimizes the misclassification rate, is to assign pattern $x_k$ to the class i with the largest $q_i(x_k)$. Assuming identical covariance matrices $C_i$ for all classes, the quadratic terms become identical (because of identical matrices $A_i$) and are irrelevant for the determination of the misclassification rate minimum. Then, the separating functions become linear, as in the case of two classes.
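A minimal sketch of this multiclass rule: the coefficients A_i, b_i, d_i of (2.70) are computed from (estimated) class means, covariance matrices, and priors, and a pattern is assigned to the class with the largest q_i(x) of (2.69). The two-class numerical example is hypothetical.

```python
import numpy as np

def quadratic_discriminants(means, covs, priors):
    """Coefficients A_i, b_i, d_i of the quadratic discriminants (2.69)/(2.70)."""
    out = []
    N = means[0].size
    for m, C, p in zip(means, covs, priors):
        Cinv = np.linalg.inv(C)
        A = -0.5 * Cinv
        b = Cinv @ m
        _, logdet = np.linalg.slogdet(C)
        d = (-0.5 * N * np.log(2 * np.pi) - 0.5 * logdet
             - 0.5 * m @ Cinv @ m + np.log(p))
        out.append((A, b, d))
    return out

def classify(x, discriminants):
    """Assign x to the class i with the largest q_i(x) = x^T A_i x + b_i^T x + d_i."""
    scores = [x @ A @ x + b @ x + d for A, b, d in discriminants]
    return int(np.argmax(scores))

# Hypothetical two-class example with different covariance matrices.
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]
disc = quadratic_discriminants(means, covs, priors)
print(classify(np.array([1.8, 2.1]), disc))
```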
of scaled impulses. For example, if the signal consists of the sequence (3, 2, 4) at times t = 0, 1, 2, it is equivalent to the sum of a triple impulse at time t = 0, a double impulse at t = 1, and a quadruple impulse at t = 2. So, the output at time t is the sum of correspondingly delayed impulse responses. This is exactly what the equality (2.71) expresses: the output $y_t$ is the sum of impulse responses $b_h$ to delayed impulses scaled by the input values $x_{t-h}$, respectively.
The response of most dynamical systems is theoretically infinite. However, if
these systems belong to the category of stable systems (i.e., those that do not diverge
to infinity), the response will approach zero in practical terms after some finite time
[1]. In the system shown in Fig. 2.17, the response is close to vanishing after about
15 time steps. Consequently, a mapping of the input pattern vector consisting of the
input measurements with delays h = 0, . . . , 15 may be a good approximation of the
underlying system dynamics.
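A sketch of such a finite-impulse-response mapping of the form described by (2.71), fitted by least squares with the delays h = 0, ..., 15 suggested above; the input signal and the decaying "true" impulse response are hypothetical stand-ins for measured data.

```python
import numpy as np

H = 15                                   # number of delayed inputs kept
rng = np.random.default_rng(0)
T = 500
x = rng.normal(size=T)                   # input signal (hypothetical)

# Hypothetical stable system: a decaying true impulse response, plus noise.
true_b = 0.8 ** np.arange(H + 1)
y = np.convolve(x, true_b)[:T] + 0.01 * rng.normal(size=T)

# Build input patterns (x_t, x_{t-1}, ..., x_{t-H}) for t = H .. T-1.
X = np.stack([x[t - H:t + 1][::-1] for t in range(H, T)])
b_hat, *_ = np.linalg.lstsq(X, y[H:], rcond=None)   # least-squares fit
print(np.round(b_hat[:5], 3))            # estimated impulse response b_0 .. b_4
```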
For more complex system behavior, the sequence length for a good approximation
may grow. The impulse response of the system depicted in Fig. 2.18 is vanishing
in practical terms only after about 50 time steps. This plot describes the behavior
of a second-order system. An example of such system is a mass fixed on an elastic
spring, an instance of which being the car and its suspension. Depending on the
damping of the spring component, this system may oscillate or not. Strong damping
as implemented by faultless car shock absorbers will prevent the oscillation while
insufficient damping by damaged shock absorbers will not.
With further growing system complexity, for example, for systems oscillating
with multiple different frequencies, the pattern size may easily grow to thousands of
time steps. Although the number of necessary time steps depends on the length of
the time step, too long time steps will lead to a loss of precision and at some limit
value even cease to represent the system in an unambiguous way.
The finite response representation of dynamic systems is easy to generalize to
nonlinear systems. The mapping sought is
$$y_t = f\left(x_t, \ldots, x_{t-h},\, w\right) \qquad (2.72)$$
for training samples with different indices t, with correspondingly delayed input
patterns. A generalization to multivariate systems (with vector output yt and a set
of delayed input vectors xt−h ) is equally easy. However, the length of a necessary
The first two miles of our descent we by no means found difficult, but
wishing to take a minute survey of the picturesque Pass of Llanberris,
we changed the route generally prescribed to strangers, and
descended a rugged and almost perpendicular path, in opposition to
the proposals of our guide, who strenuously endeavoured to dissuade
us from the attempt; alleging the difficulty of the steep, and relating
a melancholy story of a gentleman, who many years back had broken
his leg. This had no effect: we determined to proceed; and the vale
of Llanberris amply rewarded us for the trouble.
Mr. Williams of Llandigai, in his observations on the Snowdon
mountains (which, from his having been a resident on the spot, may
be considered as entitled to the greatest credit,) makes the following
remarks on the probable derivation of their names, and the customs
and manners of their inhabitants.
“It would be endless to point out the absurd conjectures and
misrepresentations of those who have of late years undertaken to
describe this country. Some give manifestly wrong interpretations of
the names of places, and others, either ignorantly or maliciously,
have as it were caricatured its inhabitants. Travellers from England,
often from want of candour, and always from defect of necessary
knowledge, impose upon the world unfavourable as well as false
accounts of their fellow-subjects in Wales; yet the candour of the
Welsh is such, that they readily ascribe such misrepresentations to an
ignorance of their language, and a misconception of the honest,
though perhaps warm temper of those that speak it. And it may be,
travellers are too apt to abuse the Welsh, because they cannot or will
not speak English. Their ignorance ought not to incur disgust: their
reluctance proceeds not from stubbornness, but from diffidence, and
the fear of ridicule.
“NATIVES OF ERYRI.
“The inhabitants of the British mountains are so humane and
hospitable, that a stranger may travel amongst them without
incurring any expense for diet or lodging. Their fare an Englishman
may call coarse; however, they commonly in farm-houses have three
sorts of bread, namely, wheat, barley, and oatmeal; but the oatmeal
they chiefly use; this, with milk, butter, cheese, and potatoes, is their
chief summer food. They have also plenty of excellent trout, which
they eat in its season. And for the winter they have dry salted beef,
mutton, and smoked rock venison, which they call Côch ar Wyden,
i.e. The Red upon the Withe, being hung by a withe, made of a
willow or hazel twig. They very seldom brew ale, except in some of
the principal farm-houses: having no corn of their own growing, they
think it a superfluous expense to throw away money for malt and
hops, when milk, or butter-milk mixed with water, quenches the thirst
as well.
“They are hardy and very active; but they have not the perseverance
and resolution which are necessary for laborious or continued
undertakings, being, from their infancy, accustomed only to ramble
over the hills after their cattle. In summer they go barefoot, but
seldom barelegged, as has been lately asserted by a traveller. They
are shrewd and crafty in their bargains, and jocular in their
conversation; very sober, and great economists; though a late tourist
has given them a different character. Their greetings, when they
meet any one of their acquaintance, may to some appear tedious and
disagreeable: their common mode of salutation is ‘How is thy heart?
how the good wife at home, the children, and the rest of the family?’
and that often repeated. When they meet at a public house, they will
drink each other’s health, or the health of him to whom the mug goes
at every round. They are remarkably honest.
“Their courtships, marriages, &c. differ in nothing from what is
practised on these occasions among the lowlanders or other Welsh
people; but as there are some distinct and local customs in use in
North Wales, not adopted in other parts of Great Britain, I shall, by
way of novelty, relate a few of them:—When Cupid lets fly his shaft at
a youthful heart, the wounded swain seeks for an opportunity to have
a private conversation with the object of his passion, which is usually
obtained at a fair, or at some other public meeting; where he, if bold
enough, accosts her, and treats her with wine and cakes. But he that
is too bashful will employ a friend to break the ice for him, and
disclose the sentiments of his heart: the fair one, however, disdains
proxies of this kind, and he that is bold, forward, and facetious, has a
greater chance of prevailing; especially if he has courage enough to
steal a few kisses: she will then probably engage to accept of his
nocturnal visit the next Saturday night. When the happy hour
arrives, neither the darkness of the night, the badness of the
weather, nor the distance of the place, will discourage him, so as to
abandon his engagement. When he reaches the spot, he conceals
himself in some out-building, till the family go to rest. His fair friend
alone knows of and awaits his coming. After admittance into the
house a little chat takes place at the fireside, and then, if every thing
is friendly, they agree to throw themselves on a bed, if there is an
empty one in the house; when Strephon takes off his shoes and coat,
and Phillis only her shoes; and covering themselves with a blanket or
two, they chat there till the morning dawn, and then the lover steals
away as privately as he came. And this is the bundling or courting in
bed, [168] for which the Welsh are so much bantered by strangers.
“This courtship often lasts for years, ere the swain can prevail upon
his mistress to accept of his hand. Now and then a pregnancy
precedes marriage; but very seldom, or never, before a mutual
promise of entering into the marriage state is made. When a
matrimonial contract is thus entered into, the parents and friends of
each party are apprised of it, and an invitation to the wedding takes
place; where, at the appointed wedding-day, every guest that dines
drops his shilling, besides payment for what he drinks: the company
very often amounts to two or three hundred, and sometimes more.
This donation is intended to assist the young couple to buy bed-
clothes, and other articles necessary to begin the world. Nor does
the friendly bounty stop here: when the woman is brought to bed,
the neighbours meet at the christening, out of free good-will, without
invitation, where they drop their money; usually a shilling to the
woman in the straw, sixpence to the midwife, and sixpence to the
cook; more or less, according to the ability and generosity of the
giver.
“MODE OF BURYING.
“When the parish-bell announces the death of a person, it is
immediately inquired upon what day the funeral is to be; and on the
night preceding that day, all the neighbours assemble at the house
where the corpse is, which they call Ty Corph, i.e. ‘the corpse’s
house.’ The coffin, with the remains of the deceased, is then placed
on the stools, in an open part of the house, covered with black cloth;
or, if the deceased was unmarried, with a clean white sheet, with
three candles burning on it. Every person on entering the house falls
devoutly on his knees before the corpse, and repeats to himself the
Lord’s prayer, or any other prayer that he chooses. Afterwards, if he
is a smoker, a pipe and tobacco are offered to him. This meeting is
called Gwylnos, and in some places Pydreua. The first word means
Vigil; the other is, no doubt, a corrupt word from Paderau, or
Padereuau, that is, Paters, or Paternosters. When the assembly is
full, the parish-clerk reads the common service appointed for the
burial of the dead: at the conclusion of which, psalms, hymns, and
other godly songs are sung; and since Methodism is become so
universal, some one stands up and delivers an oration on the
melancholy subject, and then the company drop away by degrees.
On the following day the interment takes place, between two and
four o’clock in the afternoon, when all the neighbours assemble
again. It is not uncommon to see on such occasions an assembly of
three or four hundred people, or even more. These persons are all
treated with warm spiced ale, cakes, pipes and tobacco; and a dinner
is given to all those that come from far: I mean, that such an
entertainment is given at the funerals of respectable farmers. [170a]
They then proceed to the church; and at the end of that part of the
burial service, which is usually read in the church, before the corpse
is taken from the church, every one of the congregation presents the
officiating minister with a piece of money; the deceased’s next
relations usually drop a shilling each, others sixpence, and the poorer
sort a penny a-piece, laying it on the altar. This is called offering,
and the sum amounts sometimes to eight, ten, or more pounds at a
burial. The parish-clerk has also his offering at the grave, which
amounts commonly to about one-fourth of what the clergyman
received. After the burial is over the company retire to the public-
house, where every one spends his sixpence for ale; [170b] then all
ceremonies are over.”—Mr. W. then proceeds to explain the good and
ill resulting from the prevalence of Methodism, and those fanatics
termed Ranters, &c., and states, that “the mountain-people preserve
themselves, in a great measure, a distinct race from the lowlanders:
they but very seldom come down to the lowlands for wives; nor will
the lowlander often climb up the craggy steeps, and bring down a
mountain spouse to his cot. Their occupations are different, and it
requires that their mates should be qualified for such different modes
of living.
“I will not scruple to affirm, that these people have no strange blood
in their veins,—that they are the true offspring of the ancient Britons:
they, and their ancestors, from time immemorial, have inhabited the
same districts, and, in one degree or other, they are all relations.”
The vale of Llanberris is bounded by the steep precipices of
Snowdon, and two large lakes, communicating by a river. It was
formerly a large forest, but the woods are now entirely cut down.
We here dismissed our Cambrian mountaineer, and easily found our
way to Dolbadern (pronounced Dolbathern) Castle, situated between
the two lakes, and now reduced to one circular tower, thirty feet in
diameter, with the foundations of the exterior buildings completely in
ruins: in this, Owen Gôch, brother to Llewellin, last prince, was
confined in prison. This tower appears to have been the keep or
citadel, about ninety feet in height, with a vaulted dungeon. At the
extremity of the lower lake are the remains of a British fortification,
called Caer cwm y Glô: and about half a mile from the castle, to the
south, at the termination of a deep glen, is a waterfall, called
Caunant Mawr; it rushes over a ledge of rocks upwards of twenty
yards in height, falls some distance in an uninterrupted sheet, and
then dashes with a tremendous roar through the impeding fragments
of the rock, till it reaches the more quiet level of the vale. Returning
to the lakes, you have a fine view of the ruins, with the promontory
on which they are situated; and that with greatly heightened effect, if
favoured by their reflection on the glassy surface of the waters, to
which you add the rocky heights on each side; Llanberris church,
relieving the mountain scenery, and the roughest and most rugged
cliffs of Snowdon in the back-ground topping the whole, which give
together a grand and pleasing coup d’œil.
In this vicinity are large slate quarries, the property of Thomas
Asheton Smith, Esq.; and a rich vein of copper ore. These afford
employ to great numbers of industrious poor: to the men, in
obtaining the ore and slates, and the women and children in
breaking, separating, and preparing the different sorts for
exportation, or for undergoing farther preparatory processes to fit
them for smelting. From hence a rugged horse-path brought us to
the Caernarvon turnpike-road, about six miles distant; the high
towers of the castle, the very crown and paragon of the landscape, at
last pointed out the situation of
CAERNARVON;
and having crossed a handsome modern stone bridge thrown over
the river Seiont, and built by “Harry Parry, the modern Inigo, a.d.
1791,” we soon entered this ancient town, very much fatigued from
our long excursion.
The town of Caernarvon, beautifully situated and regularly built, is in
the form of a square, enclosed on three sides with thick stone walls;
and on the south side defended by the Castle.
The towers are extremely elegant; but not being entwined with ivy,
do not wear that picturesque appearance which castles generally
possess. Over the principal entrance, which leads into an oblong
court, is seated, beneath a great tower, the statue of the founder,
holding in his left hand a dagger; this gateway was originally fortified
with four portcullises. At the west end, the eagle tower, remarkably
light and beautiful, in a polygon form; three small hexagon turrets
rising from the middle, with eagles placed on their battlements; from
thence it derives its name. In a little dark room [173a] in this tower,
measuring eleven feet by seven, was born King Edward II. April 25,
1284. The thickness of the wall is about ten feet. To the top of the
tower we reckoned one hundred and fifty-eight steps; from whence
an extensive view of the adjacent country is seen to great
advantage. On the south are three octagonal towers, with small
turrets, and similar ones on the north. All these towers
communicate with each other by a gallery on the ground,
middle, and upper floors, formed within the immense thickness of the
walls, in which are cut narrow slips, at convenient distances, for the
discharge of arrows.
This building, founded on a rock, is the work of King Edward I., the
conqueror of the principality; the form of it is a long irregular square,
enclosing an area of about two acres and a half. From the
information of the Sebright manuscript, Mr. Pennant says, that, by
the united efforts of the peasants, it was erected within the space of
one year.
Having spent near three hours in surveying one of the noblest castles
in Wales, we walked round the environs of the town. The terrace
[173b]
round the castle wall, when in existence, was exceedingly
pleasing, being in front of the Menai, which is here upwards of a mile
in breadth, forming a safe harbour, and is generally crowded with
vessels, exhibiting a picture of national industry; whilst near it a
commodious quay presents an ever-bustling scene, from whence a
considerable quantity of slate, and likewise copper, from the
Llanberris mine, is shipped for different parts of the kingdom.
Caernarvon may certainly be considered as one of the handsomest
and largest towns in North Wales; and under the patronage of Lord
Uxbridge promises to become still more populous and extensive.
In Bangor-street, is the Uxbridge Arms hotel, a large and most
respectable inn; where, as well as at the Goat, the charges are
moderate and the accommodations excellent.
Caernarvon is only a township and chapelry to Llanbeblic. Its market
is on a Saturday, and is well supplied and reasonable; this, with the
spirited improvements made to the town and harbour, has been the
means of greatly increasing its population: according to the late
returns it contains 1008 houses, and 6000 inhabitants. The church,
or rather chapel, has been rebuilt by subscription. Service is
performed here in English, and at the mother church at Llanbeblic
[174]
in Welsh.
The Port, although the Aber sand-banks, forming a dangerous bar,
must ever be a great drawback upon it, has not only been
wonderfully improved, but is in that progressive state of improvement
by the modern mode of throwing out piers, that vessels of
considerable tonnage can now lie alongside the quay, and discharge or take in
their cargoes in perfect safety; this bids fair, as may be seen by the
rapid increase of its population and tonnage, to make it a place of
trade and considerable resort: yet still it only ranks as a creek, and its
custom-house is made dependent on that of the haven of Beaumaris;
to the comptroller of which its officer is obliged to report: this must
be a considerable hindrance to its trade, particularly in matters out of
the customary routine. The county hall, which is near the castle, is a
low building, but sufficiently commodious within to hold with
convenience the great sessions. Caernarvon possessed such great
favour with Edward the 1st. as to have the first royal charter granted
in Wales given to it. It is by that constituted a free borough: it has
one alderman, one deputy mayor, two bailiffs, a town-clerk, two
serjeants-at-mace, and a mayor; who, for the time, is governor of the
castle, and is allowed 200l. per annum to keep it in repair; it, jointly
with Conway, Nevin, Criccaeth, and Pwllheli, sends a member to
parliament; for the return of whom, every inhabitant, resident or
non-resident, who has been admitted to the freedom of the place,
possesses a vote.
It is allowed to have a prison for petty offences independent of the
sheriff. Its burgesses likewise were exempt throughout the kingdom
from tollage, lastage, passage, murage, pontage, and all other
impositions of whatever kind, with other privileges, too numerous to
insert.
The county prison is likewise near the castle. It was erected in the
year 1794. The new market-house, containing the butchers’
shambles, &c. is a well-contrived and convenient building, affording
good storage for corn and other articles left unsold.
The site of the ancient town of Segontium, which lies about half a
mile south of the present one, will be found worthy the attention of
the traveller; it was the only Roman station of note in this part of
Cambria, on which a long chain of minor forts and posts were
dependent. It is even maintained, and that by respectable
authorities, that it was not only the residence, but burial-place of
Constantius, father of Constantine the Great; but most probably this
arises from confusing Helena, the daughter of Octavius, duke of
Cornwall, who was born at Segontium, and married to Maximus, first
cousin of Constantine, with Helena his mother, whom these
authorities assert to have been the daughter of a British king. A
chapel, said to have been founded by Helen, and a well which bears
her name, are amongst the ruins still pointed out.
Since the numerous late improvements have been going forward, at
and near Caernarvon, new and interesting lights have been thrown
on the ruins in its vicinity, which will form a rich treat to the
antiquary.
Near the banks of the Seint, from which Segontium took its name,
and which runs from the lower lake of Llanberris, are the remains of
a fort, which appears to have been calculated to cover a landing-
place from the river at the time of high-water: it is of an oblong
shape, and includes an area of about an acre; one of the walls which
is now standing is about seventy-four yards, and the other sixty-four
yards long, in height from ten to twelve feet, and nearly six feet in
thickness. The peculiar plan of the Roman masonry is here
particularly discernible, exhibiting alternate layers, the one regular,
the other zig-zag; on these their fluid mortar was poured, which
insinuated itself into all the interstices, and set so strong as to form
the whole into one solid mass; retaining its texture even to the
present day, to such a degree, that the bricks and stone in the
Roman walls yield as easily as the cement.
English history has spoken so fully on this place, as connected with
Edward the 1st., on the title, which he, from his son being born in
this castle, so artfully claimed for him, and the future heirs apparent
to the British throne, as affording to the Welsh a prince of their own,
agreeable to their wishes, and the quiet annexation of the principality
to his dominions, which Edward by this means obtained, that it
appears superfluous to enlarge upon it in this work.
Several excursions may be made from Caernarvon, with great
satisfaction to the tourist; the principal of which is a visit to
PLAS-NEWYDD,
the elegant seat of the Marquis of Anglesea, situated in the Isle of
Anglesea, and distant about six miles from Caernarvon: if the wind
and tide prove favourable, the picturesque scenery of the Menai will
be viewed to great advantage by hiring a boat at the quay. [178] But if
this most advisable plan should not be approved of, the walk to the
Moel-y-don ferry, about five miles on the Bangor road, will prove
highly gratifying: the Menai, whose banks are studded with
gentlemen’s seats, appearing scarcely visible between the rich foliage
of the oak, which luxuriates to the water’s brink, is filled with vessels,
whose shining sails, fluttering in the wind, attract and delight the
observing eye; whilst the voice of the sailors, exchanging some salute
with the passing vessel, is gently wafted on the breeze.
Crossing the ferry, we soon reached the ancient residence of the
arch-druid of Britain, where was formerly stationed the most
celebrated of the ancient British academies: from this circumstance,
many places in this island still retain their original appellation, as
Myfyrim, the place of studies; Caer Idris, the city of astronomy;
Cerrig Boudin, the astronomer’s circle. The shore to the right soon
brought us to the plantations of Plâs-Newydd, consisting chiefly of
the most venerable oaks, and noblest ash in this part of the country:
BANGOR,
the oldest episcopal see in Wales; being founded in 516.
The situation is deeply secluded, “far from the bustle of a jarring
world,” and must have accorded well with monastic melancholy; for
the Monks, emerging from their retired cells, might here indulge in
that luxurious gloominess, which the prospect inspires, and which
would soothe the asperities inflicted upon them by the severe
discipline of superstition. The situation of Bangor appears more like
a scene of airy enchantment than reality; and the residences of the
Canons are endeared to the votaries of landscape by the prospect
they command. On the opposite shore, the town of Beaumaris was
seen straggling up the steep declivity, with its quay crowded with
vessels, and all appeared bustle and confusion; the contrast, which
the nearer prospect inspired, was too evident to escape our notice,
where the
MENAI.
This Strait, which separates Anglesea from the main land, although
bearing only the appearance of a river, is an arm of the sea, and
most dangerous in its navigation at particular periods of the tide, and
in boisterous weather: during the flood, from the rush of water at
each extremity, it has a double current, the clash of which, termed
Pwll Ceris, it is highly rash and dangerous to encounter. In the space
of fifteen miles, there are six established ferries: the first of which to
the south is Abermenai, the next near Caernarvon, and three miles
north from the first is Tal y foel; four miles further, Moel y don; three
miles beyond which is the principal one, called Porthaethwy, but more
generally known as Bangor Ferry; it is the narrowest part of the
Strait, and is only about half a mile wide; this is the one over which
the mails and passengers pass on their route to and from Holyhead,
and near which is the bridge, of which a particular description and
plan is for the first time given; a mile further north is the fifth, Garth
Ferry; and the sixth, and widest ferry at high water, is between the
village of Aber and Beaumaris. Yet notwithstanding these ferries, the
principal part of the horned cattle that pass from Anglesea are
compelled by their drivers to swim over the passage at Bangor Ferry,
to the terror and injury of the animals, and the disgust and horror of
the bystanders.
There appears but little doubt of Anglesea having been once
connected with the main land, as evident traces of an isthmus are
discernible near Porthaethwy; where a dangerous line of rocks
nearly cross the channel, and cause such eddies at the first flowing of
the tide, that the contending currents of the Menai seem here to
struggle for superiority. This isthmus once destroyed, and a channel
formed, it has been the work of ages, by the force of spring tides and
storms, gradually to deepen and enlarge the opening; as it appears
by history, that both Roman and British cavalry, at low water, during
neap tides, forded or swam over the Strait, and covered the landing
of the infantry from flat-bottomed boats.
The violent rush of water, and consequent inconvenience, delay, and
danger, when the wind and tide are unfavourable to the passage over
Bangor Ferry, in the present state of constant and rapid
communication with Ireland, gave rise to the idea of forming a bridge
over the Menai. Various estimates and plans were submitted to the
public consideration by our most celebrated engineers, and men of
science; when, after numerous delays, Mr. Telford’s design for one on
the suspension principle was adopted, and money granted by
parliament for carrying it into effect. The first stone of this
magnificent structure was laid on the 10th of August, 1819, without
any ceremony, by the resident engineer, Mr. Provis, and the
contractors for the masonry.
“When on entering the Straits,” [189] says a recent author, “the bridge
is first seen, suspended as it were in mid air, and confining the view
of the fertile and richly-wooded shores, it seems more like a light
ornament than a massy bridge, and shows little of the strength and
solidity which it really possesses. But as we approached it nearer,
whilst it still retained its light and elegant appearance, the
stupendous size and immensity of the work struck us with awe; and
when we saw that a brig, with every stick standing, had just passed
under it,—that a coach going over appeared not larger than a child’s
toy, and that foot-passengers upon it looked like pigmies, the
vastness of its proportions was by contrast fully apparent.” The
whole surface of the bridge is 1,000 feet in length, of which the part
immediately dependent upon the chains is 590 feet, the remaining
distance being supported by seven arches, four on one side and three
on the other, which fill up the distance from the main piers to the
shore. These main piers rise above the level of the road 50 feet, and
through them, two archways, each 12 feet wide, admit a passage.
Over the top of these piers, four rows of chains, the extremities of
which are firmly secured in the rocks at each end of the bridge, are
thrown; two of them nearly in the centre, about four feet apart, and
one at each side. The floor of the road is formed of logs of wood,
well covered with pitch, and then strewn over with granite broken
very small, forming a solid body by its adhesion to the pitch
impervious to the wet. A light lattice work of wrought iron to the
height of about six feet, prevents the possibility of accidents by falling
over, and allows a clear view of the scenery on both sides, which can
be seen to great advantage from this height. Having expressed our
admiration of the skill evident in the construction, at once so simple
and so useful, and having satisfied our curiosity on the top, we
descended by a precipitous path to the level of the water, and gazed
upwards with wonder, at the immense flat surface above us, and its
connecting gigantic arches. The road is 100 feet above high water,
and the arches spring at the height of 60 feet from abutments of
solid masonry, with a span of 52 feet. These abutments taper
gradually from their base to where the arch commences, and
immense masses as they are, show no appearance of heaviness;
indeed, taking the whole of the Menai Bridge together, a more perfect
union of beauty with utility cannot be conceived. It has been erected
to bear a weight upon the chains of 2,000 tons; the whole weight at
present imposed is only 500, leaving an available strength of 1,500
tons; so that there is an easy remedy for a complaint which has been
made of its too great vibration in a gale of wind, by laying additional
weight upon it. The granite of which the piers and arches are built, is
a species of marble, admitting a very high polish; of this the
peasantry in the neighbourhood avail themselves, and every one has
some specimen of polished marble ready to offer the tourist. There
is so much magnificence, beauty, and elegance, in this grand work of
art, that it harmonizes and accords perfectly with the natural scenery
around, and though itself an object of admiration, still in connection it
heightens the effect of the general view.
BEAUMARIS,
the largest and best built town in Anglesea, is pleasantly situated on
the western shore of the bay of that name, and commands a fine
view of the sea and the Caernarvonshire mountains. Its original
name was Porth Wygyr. Its harbour is well sheltered, and affords
ample protection for coasters, and ships of considerable burthen,
which, during northerly winds, are driven there in great numbers, to
avoid the dangers of a lee shore. As no manufactures of
consequence are carried on in its neighbourhood, it is rather
calculated for great retirement, than for active bustle; but being the
county town, it is now and then enlivened by the gaieties attendant
upon assizes, elections, and other public meetings.
The castle, built by Edward I. in 1295, stands in the estate of Lord
Bulkeley, close to the town, and covers a considerable space of
ground; but from its low situation it was always inferior in point of
strength to the castles of Conway and Caernarvon.
Close above the town is Baron Hill, the seat of Lord Viscount Warren
Bulkeley, delightfully situated on the declivity of a richly wooded
bank, and possessing a complete command of every object which can
add to the charms of picturesque scenery. The park extends to, and
nearly surrounds, the west and north sides of the town; whilst the
rising ground, upon which the mansion stands, shelters the town
from the rude blasts that would otherwise assail it; thus giving it that
protection from the raging of the elements which the noble owner
ever affords to its inhabitants, when sorrow and adversities assail
their domestic peace. To enumerate all the acts of Lord Bulkeley’s
munificence and kindness would be impossible, but a few of them
may be seen in the neighbourhood of Beaumaris.
The beautiful road of four miles and a half, along the shore of the
Menai to Bangor Ferry, was made at the expense of Lord and Lady
Bulkeley in 1804: it cost about £3000, and, when completed, was
presented to the public and has since been maintained at his
lordship’s expense. A road possessed of greater picturesque beauty
is not to be found in Britain.
The church is kept in repair by his lordship, to which he has
presented an excellent organ, a set of elegant communion plate, a
clock, and a peal of six fine toned bells; together, costing about
£1200. He has also given a good house to the rector for the time
being. The national school, as well as the minister’s house, was built
by public subscription, on land given by Lord Bulkeley; and the
master’s and mistress’s salaries have since been paid by him and his
lady.
Many more acts of their liberality might be enumerated, but these are
sufficient to prove them zealous protecting friends, and kind
neighbours. Their numerous deeds of private charity ought not to be
blazoned to the world, but they will live long in the grateful
remembrance of those around them.
Beaumaris, situated 249 miles from London, had, in 1811, 249
houses, and 1,810 inhabitants; and in 1821 a population of 2,205. It
is governed by a mayor, recorder, two bailiffs, twenty-four capital
burgesses, and several inferior officers. It formerly possessed an
extensive trade; but has declined since the rise of Liverpool.
From Beaumaris we proceeded, by Dulas and Red Wharf Bay, to
Amlwch; the distance is about sixteen miles, through a pleasant
country, in parts greatly resembling England. About a mile from Red
Wharf Bay you pass the village of Pentraeth, The End of the Sands.
The situation is pleasant; and Mr. Grose was so taken with the
picturesque beauty of its small church, as to give a view of it in his
Antiquities.
Near this, in a field at Plâs Gwynn, the seat of the Panton family, are
two stones, placed, as tradition says, to mark the bounds of an
astonishing leap; which obtained for the active performer of it the
wife of his choice; but it appears, that as he leaped into her
affections with difficulty, he ran away from them with ease; for going
to a distant part of the country, where he had occasion to reside
several years, he found, on his return, that his wife had, on that very
morning, been married to another person. Einson, on hearing this,
took his harp, and, sitting down at the door, explained in Welsh metre
who he was, and where he had been resident. His wife narrowly
scrutinized his person, unwilling to give up her new spouse, when he
exclaimed:
At Llanfair, which is about a mile distant from this road, was born the
celebrated scholar and poet, Goronwy Owen, who, notwithstanding
his acknowledged and admired abilities, was, after a series of
hardships and struggles, obliged to expatriate himself to the wilds of
Virginia, where he was appointed pastor of the Church. He was well
versed in the Latin, Greek, and oriental languages, was a skilful
antiquary, and an excellent poet. His Latin odes are greatly admired;
but his Welsh poems rank him among the most distinguished bards of
his country.
About five miles west of Beaumaris is Peny-mynydd, the birth-place
of Owen Tudor, a private gentleman, who, having married Catherine
of France, the Dowager of our Henry V., in 1428, became the
ancestor of a line of monarchs. They had three sons and one
daughter. The daughter died in her infancy: Edmund was created
Earl of Richmond, and marrying a daughter of the Duke of Somerset,
had Henry, afterwards Henry VII. Jasper was created Earl of
Pembroke; and Owen became a monk. By means of his marriage,
therefore, Owen Tudor not only became father to a line of kings; but
in his son, as Gray says, Wales came to be governed again by their
own princes.
The Tudor family became extinct in Richmond Tudor, who died in
1657, and the estate belongs to Lord Bulkeley. In the Church is one
of their monuments, removed from Lanvaes Abbey at its dissolution.
LLANELIAN
is about two miles east of Amlwch, near the coast: Mr. Bingley’s
account of which, and the superstitious ceremonies still attaching to
it, is both curious and entertaining:
AMLWCH,
or the Winding Loch, is a dirty-looking straggling town, founded on
rocks. It owes its support chiefly to the copper works in its vicinity.
The church is a neat modern structure, dedicated to Elaeth, a British
saint: the port, which is but small, is, notwithstanding, excellently
adapted for the trade which is carried on; it is narrow, capable of
containing only two vessels abreast, of about 200 tons burthen each,
and of these it will furnish room for about thirty; the entrance is by a
chasm between two rocks.
The Parys mountain, like the works at Merthyr, shews what the
industry of man is capable of accomplishing in removing rocks,
mountains, and dragging forth the bowels of the earth. To those who
possess good nerves, the view of this scene of wealth and industry
will afford gratification unalloyed; but to those not so blessed, the
horrific situations in which the principal actors of the scene are
placed, poised in air, exposed to the blasting of the rocks, and to the
falling of materials which they themselves are sending aloft, or which,
misdirected in their ascent from the workings of others, strike
against projecting crags, seem to threaten death in so
many varied shapes, that the wonder and admiration excited by the
place are lost in pity and anxiety for the hardy miners.
From the top of the mountain, the dreadful yawning chasm, with the
numerous stages erected over the edge of the precipice, appal rather
than gratify the observer. To see the mine to advantage, you must
descend to the bottom, and be provided with a guide, to enable you
to shun the danger, which would otherwise be considerable, from the blasts and
falling materials; the workmen generally not being able to see those
whom their operations may endanger.
The Mona mine is the entire property of the Marquis of Anglesea.
The Parys mine is shared.
The mountain has been worked with varied success for about sixty-
five years: it is now believed to be under the average; but whether
that arises from the low price of the article, or the mine being
exhausted, I am unable to say: for a considerable period, it produced
20,000 tons annually. One bed of ore was upwards of sixty feet in
thickness. In blasting the rock, to procure the ore, from six to
eight tons of gunpowder are yearly consumed.
“This celebrated mountain,” says Mr. Evans, “is easily distinguished
from the rest; for it is perfectly barren from the summit to the plain
below: not a single shrub, and hardly a blade of grass, being able to
live in its sulphurous atmosphere.”
HOLYHEAD,
called in Welsh Caergybi, is situated on an island at the western
extremity of Anglesea. It has lately changed its aspect from a poor
fishing village to a decent looking town, in consequence of its being
the chief resort for passengers to and from Dublin. The distance
across the channel is about fifty-five miles; and there are sailing