Mathematical Foundations of Data Science Using R
Part I Introduction to R
2 Overview of programming paradigms
2.1 Introduction
2.2 Imperative programming
2.3 Functional programming
2.4 Object-oriented programming
2.5 Logic programming
2.6 Other programming paradigms
2.7 Compiler versus interpreter languages
2.8 Semantics of programming languages
2.9 Further reading
2.10 Summary
3 Setting up and installing the R program
3.1 Installing R on Linux
3.2 Installing R on MAC OS X
3.3 Installing R on Windows
3.4 Using R
3.5 Summary
4 Installation of R packages
4.1 Installing packages from CRAN
4.2 Installing packages from Bioconductor
4.3 Installing packages from GitHub
4.4 Installing packages manually
4.5 Activation of a package in an R session
4.6 Summary
5 Introduction to programming in R
5.1 Basic elements of R
5.2 Basic programming
5.3 Data structures
5.4 Handling character strings
5.5 Sorting vectors
5.6 Writing functions
5.7 Writing and reading data
5.8 Useful commands
5.9 Practical usage of R
5.10 Summary
6 Creating R packages
6.1 Requirements
6.2 R code optimization
6.3 S3, S4, and RC object-oriented systems
6.4 Creating an R package based on the S3 class system
6.5 Checking the package
6.6 Installation and usage of the package
6.7 Loading and using a package
6.8 Summary
Part II Graphics in R
7 Basic plotting functions
7.1 Plot
7.2 Histograms
7.3 Bar plots
7.4 Pie charts
7.5 Dot plots
7.6 Strip and rug plots
7.7 Density plots
7.8 Combining a scatterplot with histograms: the layout function
7.9 Three-dimensional plots
7.10 Contour and image plots
7.11 Summary
8 Advanced plotting functions: ggplot2
8.1 Introduction
8.2 qplot()
8.3 ggplot()
8.4 Summary
9 Visualization of networks
9.1 Introduction
9.2 igraph
9.3 NetBioV
9.4 Summary
2.1 Introduction
Programming paradigms form the conceptual foundations of practical programming
languages used to control computers [→79], [→122]. Before the 1940s, computers
were programmed by wiring several systems together [→122]; the programmer
operated switches to execute a program. In a modern sense, such a procedure does
not constitute a programming language [→122]. Afterwards, the von
Neumann computer architecture [→79], [→122], [→136] heavily influenced the
development of programming languages (especially those using imperative
programming, see Section →2.2). The von Neumann computer architecture is based
on the assumption that the machine’s memory contains both commands and data
[→110]. As a result of this development, languages that are strongly machine-
dependent, such as Assembler, have been introduced. Assembler belongs to the family
of so-called low-level programming languages [→122]. By contrast, modern
programming languages are high-level languages, which possess a higher level of
abstraction [→122]. Their functionality comprises simple, standard constructions, such
as loops, allocations, and case differentiations. Nowadays, modern programming
languages are often developed based on a much higher level of abstraction and novel
computer architectures. An example of such an architecture is parallel processing
[→122]. This development led to the insight that programming languages should not
be solely based on a particular machine or processing model, but rather describe the
processing steps in a general manner [→79], [→122].
The programming language concept has been defined as follows [→79], [→122]:
Definition 2.1.1.
A programming language is a notational system for communicating computations to
a machine.
Louden [→122] pointed out that the above definition evokes some important
concepts, which merit brief explanation here. Computation is usually described using
the concept of Turing machines, where such a machine must be powerful enough to
perform any computation that a real computer can do. This has been proven true and,
moreover, Church’s thesis claims that it is impossible to construct machines which are
more powerful than a Turing machine.
In this chapter, we examine the most widely used programming paradigms,
namely imperative programming, object-oriented programming, functional
programming, and logic programming. Note that so-called “declarative”
programming languages are also often considered to be a programming paradigm.
The defining characteristic of an imperative program is that it expresses how
commands should be executed in the source code. In contrast, a declarative program
expresses what the program should do. In the following, we describe the most
important features of these programming paradigms and provide examples, as an
understanding of these paradigms will assist program designers. →Figure 2.1 shows
the classification of programming languages into the aforementioned paradigms.
In LISP or Scheme, the program is simply (+ a b), but a and b need to be predefined,
e. g., as
The first program declares two variables a and b, and stores the result of the
computation in a new variable sum. The second program first defines two constants,
a and b, and binds them to certain values. Next, we call the function (+) (a function
call is always indicated by an opening and a closing bracket) and provide two input
parameters for this function, namely a and b. Note also that define is itself a
function, as we write (define ... ). In summary, the functional character of the
program is reflected by calling the function (+) instead of storing the sum of the two
integer numbers in a new variable using the elementary operation “+”. In this case,
the result of the purely functional program is (+ 4 5) = 9.
Another example is the square function for real values expressed by
and
This program is typically imperative, as we use a loop structure (while ... do) and
the variables b and n change their values to finally compute n! (state change). As loop
structures do not exist in functional programming, the corresponding program must
be recursive. In purely mathematical terms, this can be expressed as follows:
n! = f(n) = n ⋅ f(n − 1) if n > 1, and f(n) = 1 if n ≤ 1 (note that, by definition, 0! = 1). The implementation of
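A minimal sketch in R of both variants described above (the function names are ours):

```r
# Recursive variant: no state change, the function calls itself
fact_rec <- function(n) {
  if (n <= 1) return(1)  # base case: 0! = 1! = 1
  n * fact_rec(n - 1)
}

# Imperative variant: the variables b and i change their values (state change)
fact_iter <- function(n) {
  b <- 1
  i <- 1
  while (i <= n) {
    b <- b * i
    i <- i + 1
  }
  b
}

fact_rec(5)   # 120
fact_iter(5)  # 120
```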
to be true (see also the Peano axioms [→12]). Here, the symbol → stands for the
logical implication. Informally speaking, this means that if 1 is a natural number and
if, for every natural number n, its successor is also a natural number, then 3 is a
natural number. To prove the statement, we apply the last two
logical statements as so-called axioms [→35], [→175], and obtain
Logic programming languages often use so-called Horn clauses to implement and
evaluate logical statements [→35], [→175]. Using Prolog, the evaluation of these
statements is given by the following:
This version uses the concept of recursion for representing the formula
sum(n) = n + sum(n-1) (see also Section →2.3). As mentioned in Section →2.3, recursive
solutions, especially when calling a function with large values, may be less efficient
than iterative ones that use variables in the sense of imperative programming.
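A minimal R sketch of both versions (the function names are ours):

```r
# Recursive version: sum(n) = n + sum(n - 1)
sum_rec <- function(n) {
  if (n == 0) return(0)
  n + sum_rec(n - 1)
}

# Iterative version in the sense of imperative programming
sum_iter <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + i
  s
}

sum_rec(100)   # 5050
sum_iter(100)  # 5050
```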
To conclude this section, we demonstrate the object-oriented programming
paradigm in R. For this, we employ the object-oriented programming system S4
[→129] and implement the same problem as above. The result is shown in Listing
2.11.
First, we use the predefined class series_operation with a predefined data type.
Then, we define a prototype of the method fun_sum_object_oriented using the standard
class series_operation. Using the setMethod command, we define the method
fun_sum_object_oriented concretely, and also create a new object from
series_operation with a concrete value. Finally, calling the method gives the desired
result.
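Since Listing 2.11 is not reproduced above, the following is a hypothetical S4 sketch along the lines of this description; the slot name n and the concrete series (the sum 1 + 2 + ⋯ + n) are assumptions:

```r
library(methods)

# Define the class series_operation with a numeric slot (slot name assumed)
setClass("series_operation", representation(n = "numeric"))

# Prototype (generic) of the method
setGeneric("fun_sum_object_oriented",
           function(object) standardGeneric("fun_sum_object_oriented"))

# Concrete definition of the method via setMethod
setMethod("fun_sum_object_oriented", "series_operation",
          function(object) sum(seq_len(object@n)))

# Create a new object with a concrete value and call the method
obj <- new("series_operation", n = 100)
fun_sum_object_oriented(obj)  # 5050
```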
Figure 2.3 The basic principle of an interpreter (left) and compiler (right) [→122].
A distinct property of an interpreter is that the program p ∈ L and the sequence
of input symbols are executed simultaneously without using any prior information
[→202]. Typical interpreter languages include functional programming languages,
such as Lisp and Miranda, but other examples include Python and Java (see [→122],
[→168]). A key advantage of interpreter languages is that the debugging process is
often more efficient than for a compiled language, as the code is executed only at
runtime. A weakness of interpreter programs is their frequent inefficiency, because all
fragments of the program, such as loops, must be translated anew every time the
program is executed.
Next, we sketch the compiler approach to translate computer programs. A
compiler translates an input program as a preprocessing step into another form,
which can then be executed more efficiently. This preprocessing step can be
understood as follows: A program written in a programming language (source
language) is translated into machine language (target language) [→202]. In
mathematical terms, this equals a mapping C : L₁ ⟶ L₂ that maps programs of a
source language L₁ into programs of a target language L₂. After this process, a
target program can then be executed directly. Typical compiler
languages include C, Pascal, and Fortran (see [→122], [→168]). We emphasize that
compiler languages are extremely efficient compared to interpreter languages.
However, when changing the source code, the program must be compiled again,
which can be time consuming and resource intensive. →Figure 2.3 shows the
principle of a compiler schematically.
2.10 Summary
The study of programming paradigms has a long history and is relatively complex.
Nevertheless, we considered it important to introduce this fundamental aspect to
show that programming is much more than writing code. Indeed, although
programming is generally perceived as practical, it has a well-defined mathematical
foundation. As such, programming is less practical than it may initially appear, and
this knowledge can be utilized by programmers in their efforts to enhance their
coding skills.
3 Setting up and installing the R program
In this chapter, we show how to install R on three major operating systems that are
widely used: Linux, MAC OS X, and Windows. We would like to remark that this
order reflects our personal preference for these operating systems, based on the
experience we have gained over many years of intensive computer use.
From our experience, Linux is the most stable and reliable operating system of
these three and is also freely available. An example of such a Linux-operating system
is Ubuntu, which can be obtained from the web page →http://www.ubuntu.com/. We
have been using Ubuntu for many years and can recommend it to anyone, whether
for professional or private use. Linux is in many ways similar to the famous
operating system Unix, developed by the AT&T Bell Laboratories and released in
1969, but without the need to acquire a license. Typically, a research
environment of professional laboratories has a computer infrastructure consisting of
Linux computers, because of the above-mentioned advantages in addition to the free
availability of all major programming languages (e. g., C/C++, python, perl, and Java)
and development tools. This makes Linux an optimal tool for developers.
Interestingly, the MAC OS X system is Unix-based like Linux, and hence, shares
some of the same features with Linux. However, a crucial difference is that one
requires a license for many programs because it is a commercial operating system.
Fortunately, R is freely available for all operating systems.
Alternatively, one can install R by using the Ubuntu software center, which is
similar to an App store. For other Linux distributions the installation is similar, but
details change. For instance, for Fedora, the installation via terminal uses the
command:
3.4 Using R
The above installation, regardless of the operating system, allows you to execute R
in a terminal. This is the most basic way to use the programming language. That
means one needs, in addition, an editor for writing the code. For Linux, we
recommend emacs, and for MAC OS X, Sublime (which is similar to emacs). Both are
freely available. However, there are many other editors that can be used. Just try to
find the editor that best suits your needs (e. g., nice command highlighting or
additional tools for writing or debugging the code) and allows you to comfortably write code.
Some people like this vi-feeling of programming; however, others prefer a
graphical user interface that offers some utilities. In this case, RStudio
(→https://www.rstudio.com/) might be the right choice for you. In →Fig. 3.1, we show
an example of how an RStudio session looks. Essentially, the window is split into four
parts: a terminal for executing commands (bottom-left), an editor (top-left) to write
scripts, a help window showing information about R or displaying plots (bottom-right),
and a part displaying the variables available in the workspace (top-right).
Figure 3.1 Window of an Rstudio session.
3.5 Summary
For using the base functionality of R, the installation shown in this chapter is
sufficient. That means essentially everything we will discuss in Chapter →5 regarding
the introduction to programming can be done with this installation. For this reason,
we suggest skipping the next chapter, which discusses the installation of external
packages, and coming back to it when such packages need to be installed.
4 Installation of R packages
After installing the base version of R, the program is fully functional. However, one of
the advantages of using R is that we are not limited to the functionality that comes
with the base installation, but we can extend it easily by installing additional
packages. There are two major sources from which such packages are available. One
is the COMPREHENSIVE R ARCHIVE NETWORK (CRAN) and the other is BIOCONDUCTOR. Recently,
GitHub has been emerging as a third major repository. In what follows, we explain
how to install packages from these and other sources.
Here, package.name is the name of the package of interest. In order to find the
name of a package we want to install, one can go to the CRAN web page
(→http://cran.r-project.org/) and browse or search the list of available packages. If
such a package is found, then we just need to execute the above command within an
R session, and the package will be installed automatically. It is clear that, in order for
this to work properly, we need to have a web connection.
As an example, we install the bc3net package, which enables inferring networks
from gene expression data [→45].
At the time of writing this book, CRAN provided 14,435 available packages. This is
an astonishing number, and one of the reasons for the widespread use of R, since all
of these packages are freely available.
That means, first, the package devtools from CRAN needs to be installed, and then
a package from GitHub with the name ID/packagename can be installed. For instance, in
order to install ggplot2, one uses the command
Only after the execution of the above command is the content of the package
package.name available. For instance, we activate the package bc3net as follows:
To see what functions are provided by a package we can use the function help:
4.6 Summary
In this chapter, we showed how to install external packages from different package
repositories. Such packages are optional and are not needed for utilizing the base
functionality of R. However, there are many useful packages available that make
programming more convenient and efficient. For instance, in an academic
environment it is common to provide an R package when publishing a scientific
article, allowing the conducted analysis to be reproduced. This makes the replication
of such an analysis very easy, because one does not need to rewrite the corresponding scripts.
5 Introduction to programming in R
In principle, the symbol “=” can also be used for an assignment, but there are
cases where this leads to problems, and for this reason we suggest always using the
“<−” operator, because it works in all cases.
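A small sketch of the difference (the variable names are ours):

```r
x <- 5   # recommended assignment operator, works everywhere
y = 5    # also an assignment, but only valid at the top level

# Inside a function call, "=" denotes a named argument, not an assignment:
mean(z <- 1:5)  # assigns 1:5 to z and passes it to mean(); returns 3
# mean(z = 1:5) # error: mean() has no argument named 'z'
```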
The basic elements of R, to which different values can be assigned, are called
objects. There are different types of objects and some of them are listed in →Table 5.1.
This will result in a character string showing the full path to the current working
directory of the R session. In case one would like to change the directory, one can use
the set working directory function setwd():
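For example (using a temporary directory so the sketch is self-contained):

```r
old <- getwd()    # character string with the current working directory
setwd(tempdir())  # change the working directory
getwd()           # now points to the temporary directory
setwd(old)        # change back
```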
The information resulting from args() is usually only informative if one is already
familiar with the function of interest, but just forgot details about its arguments. For
more information, we need to use the function help(), which is described in detail in
the next section.
In the following, we use the term “function” and “command” interchangeably,
although a command has a more general meaning than a function.
At this early stage in the book, we would like to highlight the fact that R provides
helpful information about functions, but this does not necessarily mean that this
information will be to the extent you expect or would wish for. Instead, the provided
help information is usually rather short and not sufficient (or intended) to fully
explain all the details of the function of interest.
However, most help information comes with R examples at the end of the help
file. This allows you to reproduce, at least in part, the capabilities of the described
functions by using the provided example code. It is not necessary to type these
examples manually; there is a useful function available, called example(), that
executes the provided example code automatically:
That means you do not need to manually copy-and-paste (or type) the example
code, but just apply the example() command to the function you wish to learn more
about.
5.2.1 If-clause
A basic element of every programming language is an if-clause. An if-clause can be
used to test the truth of a logical statement. For instance, the logical statement in the
example below is: a > 2 . If the variable a is larger than 2, then this statement is true,
and the code in the first {} brackets is executed. However, if this statement is false,
then the code that follows the brackets {} after else will be executed.
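A minimal sketch following the description above:

```r
a <- 4
if (a > 2) {
  res <- "a is larger than 2"     # executed, since a > 2 is TRUE
} else {
  res <- "a is not larger than 2" # executed otherwise
}
res
```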
The usage of an if-clause is very flexible, allowing the else clause to be omitted,
but also further conditional statements to be included by means of the else if construct.
We would like to note that, e. g., the statement “a=4” is not a logical statement,
but an assignment, and will therefore not work as the argument of an if-clause.
5.2.2 Switch
The switch() command is conceptually similar to an if-clause. However, the difference
is that one can test more than one condition at the same time. For instance, in the
example below, the switch() command tests 3 conditions, because it has 3 executable
components, indicated by the “{ }” environments. If the variable “a” is 1, the first
commands are executed, if “a” is 2 the second, and so on.
For all other values of “a”, there will be no true condition and, hence, none of the
above commands will be executed.
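A sketch of the described behavior (the executed expressions are placeholders of ours):

```r
a <- 2
res <- switch(a,
  { "first" },   # executed if a is 1
  { "second" },  # executed if a is 2
  { "third" }    # executed if a is 3
)
res                       # "second"
switch(4, "x", "y", "z")  # NULL: no matching component
```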
For clarity, we want to mention that, for better readability, we split the switch()
command in the above example over three different lines. This way one can see that
it consists of three executable components. When you write your own programs, you
will see that such formatting is in general very helpful for getting a quick overview
of a program, because it increases the readability of the code.
5.2.3 Loops
In R, there are two different ways to realize a looping behavior. The first is by using a
for-loop, and the second by using a while-loop. A looping behavior means the
consecutive execution of the same procedure for a number of steps. The number of
steps can be fixed, or variable.
5.2.4 For-loop
A for-loop repeats a statement for a predefined number of times. In the following
example, i is successively assigned the values 1 to 3, and the command print(i) is
executed 3 times, for three different values of i:
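Based on this description, the loop looks as follows:

```r
for (i in 1:3) {
  print(i)  # prints 1, then 2, then 3
}
```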
5.2.5 While-loop
Another looping function is while(). Its syntax is
One needs to make sure that the argument of the while() function becomes
logically false at some point during the looping process, because otherwise the loop
is iterated infinitely. This is a frequent programming bug.
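For example, a terminating while-loop (the counter update is what eventually makes the condition false):

```r
i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1  # without this update, the condition never becomes FALSE
}
```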
In order to keep the logic simple, let us assume that the argument of the for-loop is
argument = i in 1:N (5.1)
This argument contains one variable and one parameter. Here i is the variable,
because its value changes systematically with every loop that is executed, and N is a
parameter, because its value is fixed throughout the whole loop. The values that i
can assume are determined by 1:N, because the argument says i in 1:N. If you
define N=4 and execute 1:N in an R session, you get
That means 1:N is a vector of integers of length N. To see this, you can define a <-
1:N and access the components of vector a by a[1], e. g., for the first component.
The values that i can assume are systematically assigned according to the order of the
vector 1:N, i. e., in loop 1, i is equal to 1; in loop 2, i is equal to 2; until finally in loop N,
i is equal to N. For this reason, the “argument” of a for-loop is—in our example—
dependent on the variable i, i. e.,
argument(i), (5.2)
Note that this is just a symbolic writing to emphasize that the argument of a loop is
connected to the step of the loop. Here, it is important to realize that the variable of
the “argument” changes its value in every loop step.
Due to the fact that the argument itself is a function of the loop step, we have the
following dependency chain:
body(argument(loop step)). (5.5)
5.2.6.3 For-function
The third part is an actual R function. In R, you can always recognize a function by its
name, followed by round brackets “()” containing, optionally, an argument. In the
case of the for-function, it contains an argument, as discussed above. The purpose of
the for-function is to execute the body consecutively.
To make this clear, especially with respect to the argument of the body, which
depends on the number of the loop step, let us consider the following example:
The for-function converts this into the following consecutive execution of the
body, as a function of the argument:
First, the value of the variable i changes with every loop step, according to the
argument. In our case, “i” just assumes the values 1, 2, 3. Then the concrete value of i
is used in every loop step, leading to different values of a. From a more general point
of view, this means that the for-function not only executes the body of the function
consecutively, but also changes the content of the workspace, which is the memory
of an R session, with every loop step. This is the exact meaning of body(argument(loop
step)).
5.2.7 Break
Both loop functions can be interrupted at any time during the execution of the loop
using the break statement. Frequently, this is used in combination with an if-clause
within a loop to test for a specific condition that should lead to the interruption of the
loop.
Combining loops with if-clauses and the break statement allows creating very
flexible constructs that can exhibit rich behavior.
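A small sketch of this combination (the threshold is arbitrary):

```r
for (i in 1:10) {
  if (i > 3) {
    break  # interrupts the loop as soon as the condition is TRUE
  }
  print(i)
}
i  # 4: the loop stopped long before reaching 10
```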
5.2.8 Repeat-loop
For completeness, we want to mention that there is actually a third type of loop in R,
the repeat-loop. However, in contrast to a for-loop and a while-loop, it does not
come with an interruption condition, but is in fact an infinite loop that never stops
on its own. For this reason, the repeat statement always needs to be used in
combination with the break statement:
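A minimal example of this pattern:

```r
i <- 1
repeat {
  print(i)
  i <- i + 1
  if (i > 3) break  # without break, repeat would run forever
}
```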
5.3.1 Vector
A vector is a 1-dimensional data structure. As the example below shows, a vector can
be easily defined by using the combine function c(). This function concatenates its
elements forming a vector. Individual elements can be accessed in various ways,
using squared brackets.
There are many functions to obtain properties of a vector, e. g., its length or the
sum of its elements. In order to make sure that, e. g., the sum of the elements can be
computed, the functions mode() and typeof() allow determining the data type of the
elements. Examples of different types are character, double, logical, or NULL.
Accessing elements of a vector can be done either individually (a[3] gives the
third element of vector a) or collectively by specifying the indices of the elements
(a[c(2,5)] gives the second and fifth element).
One can also assign names to elements of a vector using the command names().
When accessing an element via its name, one needs to make sure to use the correct
index. In the example below, one needs to use "C" and not C, because the latter
indicates a variable rather than the capital letter itself.
There are also several functions and constants available to generate vectors, e. g.,
a sequence of numbers (seq()) or letters (letters). A general characteristic of a vector
is that, whatever the type of its elements, they all need to be of the same type. This is
in contrast with lists, discussed in Section →5.3.3.
It is also possible to define a vector of a given length and mode, initialized with zeros.
For example, vector(mode = "numeric", length = 10) results in a numeric vector of
length 10, where each element is initialized with 0.
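The points above can be summarized in a short sketch (the concrete values are ours):

```r
a <- c(2, 5, 9, 1, 7)      # define a vector with the combine function
length(a)                  # 5
mode(a)                    # "numeric"
sum(a)                     # 24
a[3]                       # 9: individual access
a[c(2, 5)]                 # 5 7: collective access
names(a) <- c("A", "B", "C", "D", "E")
a["C"]                     # access by name; C without quotes would be a variable
seq(2, 10, by = 2)         # 2 4 6 8 10
vector(mode = "numeric", length = 3)  # 0 0 0
```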
It is also possible to apply a function element-wise to a vector without the need to
access its elements, e. g., in a for-loop.
Other useful functions that can be either applied to a vector or used to generate
vectors are provided in →Table 5.2.
If we want to add an element to a vector, we can use the command append():
Here the option after allows specifying a subscript, after which the values are to be
appended.
Table 5.2 Examples of functions that can be applied to vectors or can be used to
generate vectors.
Command name Description
LETTERS capital letters
letters lower case letters
month.name month names
rep("hello", times=3) repeats the first argument n-times
sum sum of all elements in the vector
length length of vector
rev reverse order of elements
This command also allows us to demonstrate the usefulness of the NULL object
type, introduced in Section →5.1.
Although the variable a does not contain an element with a value, it contains one
initialized element as a placeholder of type NULL. Repeating the above example with
an uninitialized object would result in an error message.
A simplified form of the above can be written as follows:
5.3.2 Matrix
A matrix is a 2-dimensional data structure. It can be constructed with the command
matrix(), see Listing 5.28.
Here, the option byrow allows controlling how a matrix is filled. Specifically, by
setting it to “FALSE” (the default), the matrix is filled by columns; otherwise, the
matrix is filled by rows. Accessing the elements of a matrix is similar to a vector, by
using the squared brackets. Again, this can be done either individually (a[1,2] giving
the element in row 1 and column 2) or collectively by specifying the indices of the
elements (a[c(1,3),] gives all elements of rows 1 and 3). It is interesting to note
that by leaving the place before or after the “,” empty, all rows or columns are selected.
There are several commands available to obtain the properties of a matrix. Some
of these commands are provided in →Table 5.3.
Sometimes, it is useful to assign names to rows and columns. This can be
achieved by the commands rownames() and colnames().
There are alternative ways to create a matrix. For instance, by using the
commands cbind(), rbind(), or dim():
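For example (the values and dimension names are ours):

```r
a <- matrix(1:6, nrow = 2, byrow = TRUE)  # filled by rows
a[1, 2]        # 2
a[, 2]         # entire second column
a[c(1, 2), ]   # rows 1 and 2 (here: the whole matrix)
rownames(a) <- c("r1", "r2")
colnames(a) <- c("c1", "c2", "c3")

b <- rbind(1:3, 4:6)  # the same values, built row by row
```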
5.3.3 List
A list is a more complex data structure than the previous ones, because it can contain
elements of different types. This was not allowed for either of the previous data
structures. Formally, a list is defined using the function list(), see Listing 5.33.
In the above example, the list b consists of 4 elements, which are different data
structures. In order to access an element of a list, the double-squared brackets can be
used, e. g., b[[2]], to access the second element. This appears similar to a vector,
discussed in Section →5.3.1. In fact, there are many commands for vectors that can
also be applied to lists, e. g., length() or names(). If name attributes are assigned to the
elements of a list, then these can be accessed by the “$” operator.
In this case, the usage of double-squared brackets or the “$” operator provides the
same result. It is also possible to assign names to the elements when defining a list.
The following example shows that even a partial assignment is possible. In this case,
the first two elements can be accessed by their names, whereas the latter two can
only be accessed by using indices, e. g., b[[3]] for the third element.
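A sketch of a partially named list (the element values are ours):

```r
b <- list(x = 42, y = "hello", 1:3, matrix(0, 2, 2))  # partially named
length(b)   # 4
names(b)    # "x" "y" "" ""
b[[2]]      # "hello": access by index
b$y         # "hello": the same element via its name
b[[3]]      # unnamed elements can only be reached by index
```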
5.3.4 Array
Arrays are a generalization of vectors and matrices in the sense that they can have
an arbitrary number of dimensions; as for vectors, all their elements need to be of the
same type.
Elements of an array can be accessed using squared brackets, and the number of
indices corresponds to the number of dimensions.
Again, the command names() can be used to identify the names of the elements in
a data frame. Interestingly, the “$” operator can be used with "x" as well as x to
access elements. →Table 5.4 provides an overview of further commands for data
frames.
Table 5.4 Some examples of commands that can be used with data frames.
Command name Description
dim dimension of a data frame: c(nrow, ncol)
ncol number of columns
nrow number of rows
length total number of elements
names names of the elements
5.3.6 Environment
An environment is similar to a list, however, it needs named elements. That means,
the name of an element needs to be a character string. The command ls() provides a
list of the names of all elements in an environment.
Alternatively, one can use the function assign() to assign a new element to an
environment:
If we want to delete many variables, we need to specify the “list” argument of the
command rm(), providing a character vector naming the objects to be removed. We
can also delete all variables in the current workspace in the following way:
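A minimal sketch (the variable names are ours):

```r
e <- new.env()
assign("x", 10, envir = e)   # add an element via assign()
e$y <- 20                    # alternative syntax
ls(e)                        # "x" "y": names of all elements
rm(list = ls(e), envir = e)  # delete all variables in the environment
ls(e)                        # character(0)
```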
5.3.8 Factor
For analyzing data containing categorical variables, a data structure called factor is
frequently used. A factor is like a label or a tag that is assigned to a certain
category to represent it. In principle, one could define a list containing the same
information; however, the R implementation of the data structure factor is more
efficient. An example of defining a factor is given below.
Here, we assign 4 values to the factor height, but only three of these “values” are
different. In the case of a factor, the different values are called levels. The levels of
a factor can also be obtained with the command levels().
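For example (the concrete levels are ours, following the description above):

```r
height <- factor(c("small", "tall", "medium", "tall"))  # 4 values, 3 levels
levels(height)  # "medium" "small" "tall"
table(height)   # counts per level
```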
In the above example, the factor represented a categorical variable, meaning that the
levels have no particular ordering. An extension of this is to define such an ordering
between the levels. This can be done implicitly or explicitly.
Each of these functions results in an R object of a specific type. The first function
returns an object of class POSIXct and the second of class Date. The reason for this is
that objects of the same type can be manipulated in a convenient way, e. g., using
subtraction, we can get the time difference between two time points or dates.
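For example (the dates are ours):

```r
d1 <- as.Date("2020-01-01")
d2 <- as.Date("2020-03-01")
d2 - d1  # time difference of 60 days (2020 is a leap year)
```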
5.3.10 Information about R objects
In the above sections, we showed how to define basic R objects of different types. In
all these cases, we knew the types of these objects, because we defined them explicitly.
However, when using packages, we may not always have this information. For such
cases, R provides various commands to get information about the types of objects.
The function attributes() gives information about different attributes an R object can
have, including information about class, dim, dimnames, names, row.names, or levels. In
case the attributes() function does not provide information about the class of an R
object, one can obtain this information with the command class().
The sep option allows specifying what separator is used for concatenating the
strings; the default introduces a blank between two strings.
It is also possible to include a variable to form a new string.
This is useful if we want to read many files from a directory within a loop and their
names vary in a systematic way, e. g., by an enumeration. It can also be used to create
names for an environment (see Sec. →5.3.6), because an environment needs strings
as indices for elements.
Furthermore, the function paste() can be used to connect more than just two
strings:
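These uses of paste() might look as follows (the file names are illustrative):

```r
paste("Hello", "world")              # default sep introduces a blank: "Hello world"
paste("Hello", "world", sep = "")    # no separator: "Helloworld"
# Including a variable, e.g., to enumerate file names within a loop
for (i in 1:3) {
  fname <- paste("file_", i, ".txt", sep = "")
  print(fname)
}
# Connecting more than two strings
paste("one", "two", "three", sep = "-")  # "one-two-three"
```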
If we want to overwrite parts of a string s with another string, we can use the
replacement form of the function substring() with start, specifying where the
overwriting begins. Alternatively, we can use the replacement form of the function
substr(), which additionally takes a stop position:
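A small sketch of both replacement forms; note that both overwrite characters in place:

```r
s <- "abcdef"
substring(s, 3) <- "XY"    # overwrite s starting at position 3
s                          # "abXYef"
s2 <- "abcdef"
substr(s2, 3, 4) <- "XY"   # overwrite exactly positions 3 to 4
s2                         # "abXYef"
```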
The reason why a "." does not work as a split symbol, but "[.]" does, is that the
argument split is interpreted as a regular expression (see Section →5.4.5).
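This can be seen in the following sketch:

```r
strsplit("a.b.c", split = "[.]")  # splits at the dots: "a" "b" "c"
strsplit("a.b.c", split = ".")    # "." matches any character, so only empty strings remain
```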
The first argument of the above function characterizes the pattern we try to find
and text is the string to be searched.
Both functions result in similar outputs, but displayed in different ways. While
regexpr() returns an integer vector of the same length as text, whose components
give the position of a match, or −1 if there is no match, gregexpr() returns a list
containing this information. Furthermore, both functions return the attribute
match.length, which indicates the number of elements that are actually matched.
One may wonder how the length of a match can differ from the length of the
pattern. This is where (nontrivial) regular expressions come into play.
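A small sketch comparing the two functions (the input strings are illustrative):

```r
text <- c("abcabc", "xyz")
m <- regexpr("bc", text)
m                          # positions of the first match: 2 and -1 (no match)
attr(m, "match.length")    # lengths of the matches: 2 and -1
gregexpr("bc", text)       # a list with all match positions per string
```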
A regular expression is a pattern that can include special symbols, as listed in
→Table 5.5 below.
Table 5.5 Some special symbols that can be used in regular expressions.
Symbols Meaning of the symbols
* an asterisk matches zero or more of the preceding character
. a dot matches any single character
+ a plus sign matches one or more of the preceding character
[...] the square brackets enclose a list of characters that can be matched alternatively
{min, max} the preceding element is matched between min and max times
\s matches any single whitespace character
| the vertical bar separates two or more alternatives
\t match a tab
\r match a carriage return
\n match a linefeed
[0-9] match any digit between 0 and 9
[A-Z] match any uppercase letter between A and Z
[a-z] match any lowercase letter between a and z
For example, the regular expression x+ matches any of the following within a
string: "x", "xx", "xxx", etc. This means that the length of the regular expression is not
equal to the length of the matched pattern. By using special symbols, it is possible to
generate quite flexible search patterns, and the resulting patterns are not necessarily
easy to recognize from the regular expression.
To demonstrate the complexity of regular expressions, let us consider the
following example. Suppose that we want to identify a pattern in a string, of which we
do not know the exact composition. However, we know certain components. For
example, we know that it starts with a “G” and is followed by zero or more letters or
numbers, but we do not know by how many. After this, there is a sequence of digits,
which is between 1 and 4 elements long:
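A sketch of such a search; the string txt is illustrative, chosen so that the match starts at position 4:

```r
txt <- "xyzGGA0234"
m <- regexpr(pattern = "G[[:alnum:]]*[0-9]{1,4}", text = txt)
m                        # the match starts at position 4
attr(m, "match.length")  # the match is 7 elements long
```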
The above code realizes such a search and it finds at position 4 of txt a match that
is 7 elements long.
In order to extract the matched substring of txt, the function regmatches() can be
used. It expects as arguments the original string used to match a pattern and the
result from the function regexpr():
For our above example the matched substring is “GGA0234”:
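A self-contained sketch of this extraction (txt as in the search example):

```r
txt <- "xyzGGA0234"
m <- regexpr("G[[:alnum:]]*[0-9]{1,4}", txt)
regmatches(txt, m)  # extracts the matched substring: "GGA0234"
```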
This example demonstrates that with regular expressions it is not only possible to
match substrings that are exactly known, but also to match substrings that are only
partially known. This flexibility is very powerful.
If we are interested in the positions of the sorted elements in the original vector x
we can get these indices by using the function order().
A somewhat related function to order() is rank(). However, rank(x) gives the rank
numbers (in increasing order) of the elements of the input vector x:
In the case of ties, there are several options available to handle the situation, and
one of them is ties.method.
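The behavior of order(), rank(), and ties.method can be sketched as follows (the vector x is illustrative):

```r
x <- c(30, 10, 20, 10)
sort(x)                        # 10 10 20 30
order(x)                       # 2 4 3 1 -- indices of the sorted elements in x
rank(x)                        # 4.0 1.5 3.0 1.5 -- ties receive averaged ranks by default
rank(x, ties.method = "min")   # 4 1 3 1 -- ties receive the minimum rank
```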
Here, fct.name is the name of the new function you want to define, argument is the
argument you submit to this function, and body is a list of commands that are
executed, applied to argument.
A new definition for a function utilizes itself an R function called function. If the
body of the new function consists merely of one command, one can use the simplified
syntax:
However, for reasons of clarity and readability of the code, we recommend always
to define the body of the function, starting with a “{” and ending with a “}”.
Let us consider an example defining a new function that adds 1 to a real number
given by the argument x:
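The definition might look like this:

```r
add.one <- function(x) {
  y <- x + 1
  return(y)
}
add.one(5)  # 6
```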
In this example, the name of the new function is add.one(). One should always pay
attention to the name, so as not to accidentally mask some existing function. For
instance, if we called the new function sqrt(), then the square-root function, part
of the R base package, would be masked by our definition.
It is good practice to finish the body with the command return() that contains as
its argument the variable we would like to get as a result from the application of the
new function. However, the following will result in the exact same behavior as the
function add.one():
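Such a variant might be written as follows (the name add.one.variant is ours, to keep it distinct from the definition above):

```r
add.one.variant <- function(x) {
  x + 1  # the value of the last expression is returned automatically
}
add.one.variant(5)  # 6
```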
Here, it is important not to write y <- x + 1, but instead x+1, without assignment
to a variable. We do not recommend this syntax, especially not for beginners, because
it is less explicit in its meaning.
We would like to note that the above-defined function is just a simple example
that does not include checks in order to avoid errors. For instance, one would like to
ensure that the argument of the function, x, is actually a number, because otherwise
operations in the body of the function may result in errors. This can be done, for
instance, using the command is.numeric().
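A sketch of such a check (the function name add.one.safe is illustrative):

```r
add.one.safe <- function(x) {
  if (!is.numeric(x)) {
    stop("argument x must be numeric")
  }
  return(x + 1)
}
add.one.safe(5)      # 6
# add.one.safe("a")  # would raise the error defined above
```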
The usage of such a self-defined function is the same as for a systems function,
namely fct.name(x). The following is an example:
Some reasons for writing your own functions are to help you to
organize your programs
make your programs more readable
limit the scope of variables
The last point is very important and shall be visualized with the following example.
Start a new R session (this is important!) and copy the following code into the R
workspace:
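The code in question might look like this (a minimal sketch; the body of fct.test() is illustrative):

```r
fct.test <- function(x) {
  yzxv <- x + 1       # defined only within the scope of the function
  return(NULL)
}
fct.test(1)
# print(yzxv)         # raises: Error: object 'yzxv' not found
exists("yzxv")        # FALSE -- the variable is not visible outside the function
```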
What will be the output of print(yzxv)? It will result in an error message, because
the variable yzxv is defined within the scope of the function fct.test(), and as such, it is
not directly accessible from outside the function. This is actually the reason why we
need to specify with the return() function the variables we want to return from the
function. If we could just access all variables defined within the body of a function,
there would be no need to do this.
The rationale behind our recommendation to start a new R session is to clear any
variable already defined in the session, including a possibly existing yzxv; in this case, print(yzxv) would
output that existing variable rather than the value calculated inside the function
fct.test(). For the specific choice of our variable name, this may be unlikely (that is why
we used yzxv), but for more common variable names, such as a, i or m, there is a real
possibility that this could happen.
In general, functions allow us to separate our R workspace into different parts,
each containing their own variables. For this reason, it is also possible to reuse the
same variable name in different functions without the danger of collisions.
This point addresses the so-called scope of a variable, which is an important issue,
because it is the source of common bugs in programs.
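A function returning several variables via a list might be sketched as follows (the function name fct.multi and its contents are illustrative):

```r
fct.multi <- function() {
  a <- 1
  b <- "two"
  d <- c(3, 4, 5)
  y <- list(a, b, d)  # collect all desired output variables in one list
  return(y)
}
y <- fct.multi()
y[[3]]  # access the third component: 3 4 5
```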
In this case, the list variable y serves as a container to transmit all desired
variables. That means, formally, one has just one output variable, but this variable
contains additional output variables that can be accessed via the components of the
list. For example, we can access its third component by y[[3]].
R provides the useful command args(), which gives some information on the input
arguments of a function. Try, for example, args(matrix).
Here, the option file defines the name of the file in which we want to save the
data. In principle, any name is allowed, with or without extension. However, it is
helpful to name this file filename.RData, where the extension RData indicates that it is a
binary R data file. Here, binary file means that if we open this file within any text
editor, its content is not visible because of its coding format. Hence, in order to view
its content, we need to load this file again in an R workspace.
If we want to save more than one R object, two different syntax variations exist
that can be used. The first way to save more than one R object is to just name these
objects, separated by a comma:
The second way is to define a list that contains the variable names as character
elements:
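The two variations might look as follows (the file name mydata.RData is illustrative):

```r
x <- 1:10
y <- "some text"
# First variation: name the objects, separated by a comma
save(x, y, file = "mydata.RData")
# Second variation: a character vector of variable names
save(list = c("x", "y"), file = "mydata.RData")
```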
If we want to save all the variables in the current workspace and not just the
selected ones, we can use the function save.image():
This function is a shortcut for the following script, which accomplishes the same
task:
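A sketch of the shortcut and its longer equivalent (the file name is illustrative):

```r
# Save the entire workspace
save.image(file = "all.RData")
# Equivalent longer form: list all variables explicitly
save(list = ls(all.names = TRUE), file = "all.RData")
```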
For the above examples, we did not need to care about the formatting of the file
to which we save the data, because R essentially makes a copy of the workspace, either for
selected variables or for all variables. This is a very convenient and fast way to save
variables to a file. One disadvantage of this way is that these files can only be loaded
with R itself, but not with other programs or programming languages. This is a
problem if we plan to exchange data with other people, friends, or collaborators and
we are unsure whether they either have access to R or do not want to use it, for some
reason. Therefore, R provides additional functions that are more generic in this
respect. In the following, we discuss three of them in detail.
There are three functions in the base package, namely, write.table(), write.csv(), and
write.csv2(), that allow saving tables to a text file. All of these functions have the
following syntax:
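A sketch of such calls (the file names are illustrative):

```r
M <- matrix(1:6, nrow = 2)
write.table(M, file = "table.txt", sep = "\t")  # tab-separated text file
write.csv(M, file = "table.csv")                # comma-separated values
```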
Here, M is a matrix or a data frame, and the option sep specifies the symbol used
to separate elements in M from each other. As a result, the data saved in file can be
viewed by any text editor, because the information is saved as a text rather than a
binary file as the one generated, for example, by the function save(). The functions
write.csv() and write.csv2() provide a convenient interface to Microsoft EXCEL, because
the resulting file format is directly recognized by this program. This means we can
load these files directly in EXCEL.
A potential disadvantage of these three functions appears when the output of an R
program is not just one table, but several tables of different size and additional data
structures in the form of, e. g., lists, environments, or scalar variables. In such cases, a
function like write.table() would not suffice, because you can only save one table. On
the other hand, the functions save() or save.image() can be used without the need to
combine all data structures into just one table.
Since the function save() essentially makes a copy of the workspace, or of parts of it,
and saves it to a file, the function load() simply pastes it back into the workspace.
Hence, there are no formatting problems that we need to take care of.
In contrast, if tabular data are provided in a text file, we need to read this file
differently. R provides 5 functions to read such data, namely, read.table(), read.csv(),
read.csv2(), read.delim(), and read.delim2(). For example, the function read.table() has
the following syntax:
The option header is a logical value that indicates whether the file contains the
names of the variables as its first line. The option skip is an integer value indicating
the number of lines that should be skipped when we start reading the file. This is
useful when the file contains at the beginning some explanations about its content or
general comments.
Let us consider an example:
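The call might look as follows; here we first create a small example file so the sketch is self-contained (normally infile already exists, with the content shown in →Fig. 5.3):

```r
# Illustrative setup: one comment line, a header line, then comma-separated rows
writeLines(c("# example data file", "names,value", "A,1.5", "B,2.7"),
           con = "infile")
dat <- read.table(file = "infile", header = TRUE, sep = ",", skip = 1)
colnames(dat)  # "names" "value"
dat[[1]]       # the first column
dat$names      # the same column, accessed by name
```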
The content of the file infile is shown in →Fig. 5.3. This file contains a comment at its
beginning spanning one row. For this reason, we skip this line with the option skip=1.
Furthermore, this file contains a header giving information about the columns it
contains. By using “header=TRUE” this information is converted into the column names
of the table we are creating using the function read.table(). Using colnames(dat) will
give us this information. Most importantly, we need to specify the symbol that is used
to separate the numbers in the input file. This is accomplished by setting “sep=","”. As
a result, the variable dat will be a data frame containing the tabular data in the input
file having the information about the corresponding columns as column names. We
can access the information in the individual columns by using either dat[[1]], e. g., for
the first column or dat$names. Try to access the information in the second column. Is
there a problem?
Figure 5.3 File content of infile and the effect the options in the command
read.table() have on its content.
If n is a negative value, the whole file will be read; otherwise, exactly this number of
lines will be read. The advantage of this way of reading a file is that the formatting of
the file may change and does not need to be fixed.
For text files with a complex, irregular formatting it is necessary to read these files
line-by-line in order to adopt the formatting separately for each line. This can be done
in the following way:
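A sketch of this pattern; the file is created first so the example is self-contained:

```r
writeLines(c("first line", "second line", "third line"), con = "infile")  # setup
con <- file(description = "infile", open = "r")
line1 <- readLines(con, n = 1)  # reads exactly one line: "first line"
line2 <- readLines(con, n = 1)  # repeated calls advance through the file
close(con)
```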
The function file() opens a connection to the file specified by the option
description, using the mode given by the option open. Then, calling readLines() with
n=1 reads exactly one line from the file. That means, if called repeatedly,
for example within a for-loop, it gives one line after the other, and these can be
processed individually. In this way, arbitrarily formatted files can be read and stored
in variables so that the information provided by the file can be used in an R session. If
we want to restart reading from this file, we just need to apply the function file()
again.
Figure 5.4 File content of infile used to demonstrate reading a file line-by-line
with the function readLines().
To demonstrate the usage of the function readLines(), let us consider the following
example reading data from the file shown in →Fig. 5.4. In this case, our file contains
some irregular rows, and we would either like to entirely omit some of them, such as
row 5, or only use them partially, e. g., row 4. The following code reads the file and
accomplishes this task:
This corresponds to the information in the input file, skipping row 5 and omitting
the second element in row 4. From this example, we can see that “low-level
functions” offer a great degree of flexibility, which, however, translates into a
considerable amount of additional coding that we need to do to process an input file.
There is another function similar to readLines(), called scan(). The function scan()
does not result in a data frame, but a list or vector object. Another difference with
readLines() is that it allows specifying the data-types to be read by setting the what
option. Possible values of this option are, e. g., double, integer, numeric, character, or
raw. The following code shows an example for its usage:
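A sketch of such a call; again, the file is created first so the example is self-contained:

```r
writeLines("A,1.5,B,2.7", con = "infile")  # illustrative setup
dat <- scan(file = "infile", what = "character", sep = ",")
dat  # a character vector with one component per field: "A" "1.5" "B" "2.7"
```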
As one can see, the object dat is a vector and the components of the input file,
separated according to sep, form the components of this vector. In our experience,
the function readLines() is the better choice for complex data files.
We just would like to mention without discussion that the function writeLines
allows a similar functionality and flexibility for writing data to a file in a line-by-line
manner.
When we are interested in identifying the indices of a matrix that have a certain
value, one can use the option arr.ind=TRUE to get the matrix indices:
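A small sketch (the matrix A is illustrative):

```r
A <- matrix(c(1, 2, 2, 1), nrow = 2)
which(A == 2, arr.ind = TRUE)   # row and column indices of the matching elements
which(A == 2, arr.ind = FALSE)  # linear (vector) indices: 2 3
```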
If we set this option to FALSE (which is the default value), the result is a vector of
linear indices of the TRUE elements, rather than their row and column indices.
Here, X corresponds to a matrix or array, FUN is the function that should be applied
to X, and MARGIN indicates the dimension of X to which FUN will be applied. The
following example calculates the sum of the rows of a matrix:
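A sketch of this use of apply() (the matrix A is illustrative):

```r
A <- matrix(1:6, nrow = 2)
apply(X = A, MARGIN = 1, FUN = sum)  # row sums: 9 12
apply(X = A, MARGIN = 2, FUN = sum)  # column sums: 3 7 11
```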
A similar result could be obtained by using a for-loop over the rows of the matrix
A.
In the case where the variable X is a vector, there exists a similar function called
sapply(). This function has the following syntax:
There are two differences compared to apply(). First, no MARGIN argument is
needed, because the function FUN will be applied to each component of the vector X.
Second, there is an option called simplify resulting in a simplified output of the
function sapply(). If set to TRUE, the result will have the form of a vector, whereas if set
to FALSE the result will be a list. Which form one prefers depends on the intended
usage, but a vector is usually most suitable for visual inspection. Similar results
can also be obtained with the command lapply().
The next example results in a vector, where each element is the third power of the
components of the vector X.
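A sketch of this example:

```r
X <- 1:4
sapply(X, FUN = function(x) x^3)                    # a vector: 1 8 27 64
sapply(X, FUN = function(x) x^3, simplify = FALSE)  # the same values as a list
```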
The function union() returns a set containing all elements of the two sets given as
its arguments, without duplication.
Other commands for sets include intersect(), which returns only the elements that are
in both sets, and setdiff(), which returns only the elements that are in the first, but
not in the second set; i. e., if X = setdiff(Y, Z), then all elements of the set X are also
in the set Y, but not in the set Z. →Table 5.8 provides an overview of set operations.
Table 5.8 Each of these commands will discard any duplicated values in its
arguments.
Command name Description
union(x,y) combines the values in x and y
intersect(x,y) finds the common elements in x and y
setdiff(x,y) removes the elements in x that are also in y
setequal(x,y) returns the logical value true if x is equal to y and false otherwise
is.element(x, y) returns the logical value true if x is an element of y and false otherwise
This can be useful if we want to use the values in the vector x as indices, and we
want to use each index only once.
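The set operations above, and unique(), can be sketched as follows:

```r
x <- c(1, 2, 2, 3)
y <- c(3, 4)
union(x, y)      # 1 2 3 4
intersect(x, y)  # 3
setdiff(x, y)    # 1 2
unique(x)        # 1 2 3 -- duplicated values are discarded
```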
Table 5.9 Each of these commands allows testing its argument and returns a logical
value.
Command name Description
is.numeric(x) returns TRUE if argument is numerical value (double or integer)
is.character(x) returns TRUE if argument is of character type
is.logical(x) returns TRUE if argument is of logical type
is.list(x) returns TRUE if argument is a list
is.matrix(x) returns TRUE if argument is a matrix
is.environment(x) returns TRUE if argument is an environment
is.na(x) returns TRUE if argument is NA
is.null(x) returns TRUE if argument is of null type (NULL)
Here, x is a vector, from which elements will be sampled. The option size indicates
the number of elements that will be sampled, and replace indicates if the sampling is
with (TRUE) or without (FALSE) replacement. In the case replace = FALSE, the option
size must not exceed the number of elements (the length) of the vector x.
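A sketch of both sampling strategies (the vector x is illustrative; the results are random, so no fixed output is shown):

```r
x <- c("a", "b", "c", "d", "e")
sample(x, size = 3, replace = FALSE)  # three distinct elements of x
sample(x, size = 7, replace = TRUE)   # repeats are possible; size may exceed length(x)
```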
In →Figure 5.5, we visualize the two different sampling strategies. The column x
(before) indicates the possible values that can be sampled, and the column x (after)
contains the elements that are “left” after drawing a certain number of elements
from it. In the case of sampling with replacement, there is no difference since each
element that is “removed” from x is replaced with the same element. However, for
sampling without replacement, the number of elements in x decreases. It is important
to note that in the case of sampling with replacement, we can sample the same
element multiple times (see green ball in →Figure 5.5). This is not possible without
replacement.
Figure 5.5 Visualization of different sampling strategies. A: Sampling with
replacement. B: Sampling without replacement.
The option prob allows assigning a probability distribution to the elements of the
vector x. By default, a uniform distribution is assumed, i. e., selecting all elements in x
with the same probability.
In the case where we just want to sample all integer values from 1 to n, the
following version of the function sample() can be used:
In order to get the error message, one can execute either of the following
commands:
The difference between the two commands is that the function geterrmessage() gives
only the last error message in the current R session. That means, if you execute
further commands that also result in an error, you cannot go back in the history of
failed calls.
In order to use the functionality of the function try() within a program, one can
test if the output of try() is as expected or not. For our example above, this can be
done as follows:
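A sketch of such a test; the failing call log("a") is illustrative:

```r
res <- try(log("a"), silent = TRUE)  # log() of a character raises an error
if (is.numeric(res)) {
  print(res)                          # a numeric output can be used further
} else {
  print("the call failed; handle the error here")
}
class(res)  # "try-error"
```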
In this way, a numeric output can be used in some way, whereas an error
message, resulting in a FALSE for this test, can be handled in a different manner.
One may wonder how a command within a “functional” program could possibly result
in an error. The answer is that before a program is functional, it needs to be tested,
and during the testing stage there may be some irregularities, which the function
try() may help to find. Aside from this, R may use an
external input, e. g., provided by an input file, containing information that is outside
the definition of the program. Hence, it may contain information that is not as
expected in a certain context.
In addition to the function try(), R provides the function tryCatch(), which is a more
advanced version for handling errors and warning events.
5.8.8 The function system()
There is an easy way to invoke operating-system (OS) specific commands by using the
command system(). This command allows the execution of OS commands, such as pwd or
ls, as if they were executed from a terminal. However, the real utility of the function
system() is that it can also be used to execute scripts.
The input of the function source() is a character string containing the name of the
file.
The advantage of writing an R program in a file and then executing it is that the
results are easily reproducible in the future. This is particularly important if we are
writing a scientific paper or a report and we would like to make sure that no detail
about the generation of the results is lost. In this respect, it can be considered a good
practice to store all of our programs in files.
Aside from this, it is also very helpful since we do not need to remember every
detail of a program, which is anyway hardly possible if a program is getting more and
more complex and lengthy. In this way, we can create over time our own library of
programs, which we can use to look up how we solved certain problems, in case we
cannot remember.
5.10 Summary
In this chapter, we provided an introduction to programming with R that covered all
base elements of programming. This is sufficient for the remainder of the book and
should also allow you to write your own programs for a large number of different
problems. A very good free online resource for getting more details about functions,
options, and packages is STHDA →http://www.sthda.com/english developed by
Alboukadel Kassambara. For unlocking advanced features of R, we recommend the
book by [→46]. This is not a cookbook, but provides in-depth explanations and
discussions.
6 Creating R packages
6.1 Requirements
A user can utilize the functions available in the R environment to create their own
packages, rather than creating all functions or objects from scratch. Below are examples
to get a list of all functions in these packages.
6.1.2 R repositories
R repositories provide a large number of packages for statistical analysis, machine
learning, modeling, visualization, web mining, and web applications. A list of currently
available R repositories is shown in →Table 6.1.
6.1.3 Rtools
Rtools is required for building R packages. It is installed with R-base on Linux and
macOS, but on Windows it needs to be installed separately. The ".exe" file of Rtools for
installation can be obtained at the following address: →http://cran.r-
project.org/bin/windows/Rtools/.
6.2 R code optimization
To make R function efficiently, a developer should provide code that is efficient and
fast. Developers are always advised to profile their code to check the memory
consumed, the execution time, and the performance of each instruction, in order to
identify possible performance improvements. The function debug() in R-base allows a
user to step through the code execution line by line. Furthermore, the function
traceback() helps a user to find the line where the code crashed.
The check command creates a folder with the package name and .Rcheck
extension. All the error logs and warning files are created inside this folder. The user
can check all these files to evaluate the package.
6.6 Installation and usage of the package
When a package is built and checked properly, and all errors and warnings have been
addressed, then it is ready for installation in R. The following command is used to
install a package in R:
Package: trgpkg
Type: Package
Title: Example package for package creation
Version: 1.0
Date: 2019-01-20
Author: Shailesh Tripathi and Frank Emmert Streib
Maintainer: Shailesh Tripathi <shailesh.tripathy@gmail.com>
Description: This provides simple example for package creation in
R
License: GPL (>= 2)
LazyLoad: yes
exportPattern("^[[:alpha:]]+")
export(plot.trg, trgval)
importFrom("graphics", "abline", "axis", "plot",
"points", "segments", "text")
importFrom("stats", "runif")
\name{plot.trg}
\alias{plot.trg}
\title{
Plots the sin function and the input value of "trg" class.
}
\description{
Plots the sin function and the input value of "trg" class.
}
\usage{
plot.trg(trg, wave = T, minang = -450,
maxang = 450, ...)
}
\arguments{
\item{trg}{
this is a "trg" class object generated using "trgval" function.
}
\item{wave}{
It is a logical value. If true gives a wave plot of sin
function.
}
\item{minang}{
the minimum value of the domain of sin function for visualizing
sin function on the x-axis.
}
\item{maxang}{
the maximum value of the domain of the sin function for
visualizing the sin function on the x-axis.
}
\item{\dots}{
all other input types as available in the "plot" function
}
}
\value{
Provides a graphic view of the sin function
}
\author{
Shailesh Tripathi and Frank Emmert Streib}
\seealso{
\code{\link{plot}}
}
\examples{
zz <- trgval(90)
plot(zz)
plot(zz, wave=FALSE)
}
\name{trgval}
\alias{trgval}
%- Also NEED an ’\alias’ for EACH other topic documented here.
\title{
A generic function which is used to calculate
trigonometric values.
}
\description{
A generic function which is used to calculate
trigonometric values.}
\usage{
trgval(x, ...)
}
\arguments{
\item{x}{
is a numeric value or vector}
\item{\dots}{
}
}
\value{
returns a "trg" class object
}
\author{
Shailesh Tripathi and Frank Emmert-Streib}
\seealso{
plot.trg, trgval.default}
\examples{
zz <- trgval(c(30, 60, 90))
plot(zz)
plot(zz, wave=FALSE)
}
6.8 Summary
In this chapter, we provided a brief introduction to creating an R package. This topic
can be considered advanced, and it is not required for the remainder of this book.
However, in a professional context, the creation of R packages is necessary to
simplify the usage and exchange of a large number of individually created functions.
Nowadays, many published scientific articles provide accompanying R packages to
ensure that all obtained results can be reproduced. Despite the intuitive appeal of this
practice, the reproducibility of results has recently sparked heated discussions,
especially regarding the provisioning of the underlying data [→70].
Part II Graphics in R
7 Basic plotting functions
In this chapter, we introduce plotting capabilities of R that are part
of the base installation. We will see that there is a large number of
different plotting functions that allow a multitude of different
visualizations.
7.1 Plot
The most basic plotting tool in R is provided by the plot()
function, which allows visualizing y as a function of x. The following
script gives two simple examples (see →Figure 7.1 (A) and (B)):
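A minimal sketch of such a script, using 50 points of the sine function:

```r
x <- seq(from = 0, to = 2 * pi, length.out = 50)
plot(x, sin(x))              # (A) point-based visualization
plot(x, sin(x), type = "l")  # (B) points connected by line segments
```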
The second call gives the same result, but with the line option for the type of the
visualization. The difference is that, in this case, the 50 pairs of
points are connected by line segments, resulting in a
smooth line visualization.
At first glance, →Figure 7.1 (B) may appear to be the natural
visualization of a sine function, because we know it is a smooth
function. However, it is important to realize that computer
graphics are always pixel-based, i. e., a line is always a sequence of
points. But what, then, is the difference from a point-based graphic?
It is the spacing between consecutive points (and their size). In this
sense, →Figure 7.1 (B) is realized, internally by R, as a sequence of
points that are so close to each other that the resulting plot
appears as a continuous line.
For instance, change the value of the option length.out in the
seq command to see what consequence this has on the resulting
plot.
Figure 7.1 Examples for the basic plotting function plot().
Figure 7.3 Example for adding horizontal and vertical lines to a plot
using the function abline().
7.1.3 Opening a new figure window
In order to plot a function in a new figure by keeping a figure that is
already created, one needs to open a new plotting window using
one of the following commands:
X11(), for Linux, and for a Mac when using R within a terminal
quartz(), for a Mac operating system
windows(), for a Windows operating system
7.2 Histograms
An important graphical function to visualize the distribution of
data is hist(). The command hist() shows the histogram of a data set.
For instance, we are drawing n = 200 samples from a normal
distribution with a mean of zero, a standard deviation of one, and
saving the resulting values in a vector called x, see the code below.
The left →Figure 7.4 shows a histogram of the data with 25 bars of
an equal width, set by the option breaks. Here, it is important to
realize that the data in x are raw data. That means, the vector x does
not provide directly the information displayed in →Figure 7.4 (Left),
but indirectly. For this reason, the number of occurrences of values
in x, e. g., in the interval 0.5 ≤ x ≤ 0.6 , need to be calculated by
the hist() function. However, in order to do that one needs to specify
what are the boundaries of the intervals to conduct such
calculations. The function hist() supports two different ways to do
that. The first one is to just set the total number of bars the
histogram should contain. The second one is by providing a vector
containing the boundary values explicitly.
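Both ways can be sketched as follows:

```r
x <- rnorm(n = 200, mean = 0, sd = 1)  # 200 samples from a normal distribution
hist(x, breaks = 25)                   # set the total number of bars
# Alternatively, provide the boundary values explicitly
hist(x, breaks = seq(min(x), max(x), length.out = 26))
```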
In the example below, we use two new options. The first one is
space, which allows adjusting the spatial distance between adjacent bars.
The second one is names.arg, which allows specifying the labels that
appear below each bar. For specifying the labels, we use the
built-in constant LETTERS to conveniently assign the first 5 capital letters of
the alphabet to names.arg. Alternatively, the constant letters can be
used to obtain lowercase letters.
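A sketch of such a bar plot (the counts are illustrative):

```r
counts <- c(3, 5, 2, 8, 4)
barplot(counts, space = 0.5, names.arg = LETTERS[1:5])
```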
f(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty, \qquad (7.1)
7.11 Summary
Despite the fact that all of the commands discussed in this chapter
are part of the base installation of R, they provide a vast variety of
options for the visualization of data, as we have seen in the previous
sections. Extension packages either address specific problems,
e. g., the visualization of networks, or provide different
visual aesthetics.
8 Advanced plotting functions: ggplot2
8.1 Introduction
The package ggplot2 was introduced by Hadley Wickham [→200].
The difference between this package and many others is that it does not
only provide a set of commands for the visualization of data, but it
implements Leland Wilkinson’s idea of the Grammar of Graphics
[→204]. This makes it more flexible, allowing the creation of many
different kinds of visualizations that can be tailored in a problem-
specific manner. In addition, its aesthetic realizations are superb.
The ggplot2 package is available from the CRAN repository and
can be installed and loaded into an R session by
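A sketch of the standard installation and loading steps:

```r
# install ggplot2 from CRAN (needed only once)
install.packages("ggplot2")

# load the package into the current R session
library("ggplot2")
```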
8.2 qplot()
The function qplot() is similar to the basic plot function in R. The q in
front of plot stands for “quick”, in the sense that it does not give
access to the full potential provided by the package ggplot2.
The full potential is accessible via ggplot(), discussed in Section →8.3.
To demonstrate the functionality of qplot(), we use the penguin
data provided in the package FlexParamCurve.
For this, we use only the first 10 observation points. As we can see
in →Fig. 8.3, in addition to these 10 data points, there is a smooth
curve added as a result of the smoothing function. We would like
to note that here we used a vector to define the option geom,
because we wanted to show the data points in addition to the
smooth curve.
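A sketch of such a call; the column names ckage (chick age) and weight of the penguin.data data frame are assumptions here, as is the choice of variables to plot:

```r
library("FlexParamCurve")
library("ggplot2")

# use only the first 10 observation points
d <- penguin.data[1:10, ]

# geom is given as a vector so that the data points are
# shown in addition to the smooth curve
qplot(ckage, weight, data = d, geom = c("point", "smooth"))
```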
Figure 8.3 An example for smoothing data with qplot().
8.3 ggplot()
The underlying idea of the function ggplot() is to construct a figure
according to a certain grammar that allows adding the desired
components, features, and aspects to a figure and then generating
the final plot. Each such component is added as a layer to the
plot.
The base function ggplot() requires two input arguments:
data: a data frame of the data set to be visualized
aes(): a function containing aesthetic settings of the plot
The data set contains only three variables (tree, age, and
circumference), whereas the variable “Tree” is an indicator variable
for a particular tree.
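Assuming the data set referred to is the Orange data frame included with base R (which contains exactly these three variables), a minimal sketch is:

```r
library("ggplot2")

# data: the data frame to be visualized
# aes(): aesthetic mappings from variables to visual properties
p <- ggplot(data = Orange, aes(x = age, y = circumference, colour = Tree))

# components are added as layers to the plot
p + geom_point() + geom_line()
```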
8.3.3 geoms()
There is a total of 37 different geom() functions available to specify
the geometry of plotted data. So far, we used only geom_point() and
geom_line(). In the following, we will discuss some additional
functions listed below.
In →Table 8.3, we list geom() functions with their counterpart in
the R base package.
Table 8.3 Functions associated with ggplot() and geom() and their
corresponding counterparts in the R base package.
ggplot function Base plot function
geom_point() points()
geom_line() lines()
geom_curve() curve()
geom_hline() hline()
geom_vline() vline()
geom_rug() rug()
geom_text() text()
geom_smooth(method = "lm") abline(lm(y ~ x))
geom_density() lines(density(x))
geom_smooth() lines(loess(x, y))
geom_boxplot() boxplot()
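For instance, the linear-fit row of →Table 8.3 can be illustrated side by side; the data used here are hypothetical:

```r
library("ggplot2")
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)
d <- data.frame(x = x, y = y)

# base R: scatterplot plus regression line
plot(x, y)
abline(lm(y ~ x))

# ggplot2 counterpart
ggplot(d, aes(x = x, y = y)) + geom_point() + geom_smooth(method = "lm")
```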
9.1 Introduction
In this chapter, we discuss two R packages, igraph and NetBioV
[→42], [→187]. Both have been specifically designed to visualize
networks. Nowadays, network visualization plays an important role
in many fields, as networks can be used to visualize complex
relationships between a large number of entities. For instance, in
the life sciences, various types of biological, medical, and gene
networks, e. g., ecological networks, food networks, protein
networks, or metabolic networks serve as a mathematical
representation of ecological, molecular, and disease processes
[→9], [→71]. Furthermore, in the social sciences and economics,
networks are used to represent, e. g., acquaintance networks,
consumer networks, transportation networks, or financial networks
[→74]. Finally, in chemistry and physics, networks are used to
encode molecules, rational drugs, and complex systems [→20],
[→55].
All these fields, and many more, benefit from a sensible
visualization of networks, which enables gaining an intuitive
understanding of the meaning of structural relationships between
the entities within the network. Generally, such a visualization
precedes a quantitative analysis and informs further research
hypotheses.
9.2 igraph
Type Syntax
star graph.star(n, mode = c("in", "out", "mutual", "undirected"))
random network erdos.renyi.game(n, p.or.m, type = c("gnp", "gnm"), directed = F, loops = F)

Layout categories Functions in R
Layered layout level.plot
Spiral layout plot.spiral.graph
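The generators from the table above can be used as follows; the node counts and the edge probability are arbitrary example values:

```r
library("igraph")

# star graph with 6 nodes
g1 <- graph.star(6, mode = "undirected")

# Erdos-Renyi random network: 50 nodes, edge probability 0.1
g2 <- erdos.renyi.game(50, 0.1, type = "gnp", directed = FALSE, loops = FALSE)

plot(g1)
plot(g2)
```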
The spiral layout included in the NetBioV package provides the user
with several options to visualize networks in different spiral forms.
The aesthetics of the spirals can be influenced by setting a tuning
parameter for the angle of the spiral. In addition, a wide range of
color options is provided as input to highlight, e. g., the degrees
of nodes. Finally, the placement of nodes can be determined either
by standard layout functions or by a user-defined function.
10.1 Introduction
In data science, all problems are approached computationally. For this reason, we started this
book with an introduction to the programming language R. The next step is to understand the
mathematical methods needed for the data analysis models, because all
analysis models are based on mathematics and statistics. However, before we present the
mathematical basis of data science in the subsequent chapters, we want to emphasize in this
chapter a more general point concerning the mathematical language itself. This point refers to
the abstract nature of data science.
In →Figure 10.1, we show a very general visualization that holds for every data analysis problem.
The key point here is that every data analysis is conducted via a computer program that
represents methodological ideas from statistics and machine learning, and every computer
program consists of instructions (commands) that enable the communication with the processor
of a computer to perform computations electronically. Due to the fact that every data analysis is
conducted via a computer program that contains instructions in a programming language, a good
data scientist needs to speak a programming language fluently. However, the basis of any
programming language for data analysis is mathematics, and its key characteristic is
abstractness. For this reason, a simplified message from the above discussion can be summarized
as follows:
Thinking in abstract mathematical terms makes you a better programmer and, hence, a better data scientist.
This is also the reason why mathematics is sometimes called the language of science [→185],
[→188] (as already pronounced by Galileo).
Before we proceed, we would like to add a few notes for clarification. First, by a programmer
we actually mean a scientific programmer who is concerned with the conversion of statistical and
machine learning ideas into a computer program, rather than a general programmer who
implements graphical user interfaces (GUIs) or web sites. The crucial difference is that the level of
mathematics needed for, e. g., the implementation of a GUI is minimal compared to the
implementation of a data analysis method. Also, such programming is usually purely
deterministic and not probabilistic. However, the nature of a data analysis is to deal with
measurement errors and other imperfections of the data. Hence, probabilistic and statistical
methods cannot be avoided in data science, but are integral pillars of it.
Figure 10.1 Generic visualization of any data analysis problem. Data analysis is conducted via a
computer program that has been written based on statistical- and machine-learning methods
informed with domain-specific knowledge, e. g., from biology, medicine, or the social sciences.
Second, it is certainly not necessary to implement every method for conducting a data analysis;
however, a good data scientist could implement every method. Third, the natural language we are
speaking, e. g., English, does not translate equally well into a computer language like R, but there
are certain terms and structures that translate better. For instance, when we speak about a
“vector” and its components, we will not have a problem to capture this in R, because in Chapter
→5 we have seen how to define a vector. Furthermore, in Chapter →12, we will learn much more
about vectors in the context of linear algebra. This is not a coincidence, but the meaning of a
vector is informed by its mathematical concept. Hence, whenever we use this term in our natural
language, we have an immediate correspondence to its mathematical concept. This implies that
the more we know about mathematics, the more we become familiar with terms that are well
defined mathematically, and such terms can be swiftly translated into a computer program for
data analysis.
We would like to finish by adding one more example that demonstrates the importance of
“language” and its influence on the way humans think. Suppose, you have a twin sibling and you
both are separated right after birth. You grow up in the way you did, and your twin grows up on a
deserted island without civilization. Then, let us say after 20 years, you both are independently
asked a series of questions and given tasks to solve. Given that you both share the same DNA, one
would expect that both of you have the same potential in answering these questions. However,
practically it is unlikely that your twin will perform well, because of basic communication problems
in the first place. In our opinion the language of “mathematics” plays a similar role with respect to
“questions” and “tasks” from a data analysis perspective.
In the remainder of this chapter, we provide a discussion of some basic abstract mathematical
symbols and operations we consider very important to (A) help formulating concise mathematical
statements, and (B) shape the way of thinking.
and b being any integer number; and R is the set of all real numbers, e. g., 1.4271.
There is a natural connection between these number systems in the way that
N ⊂ Z ⊂ Q ⊂ R ⊂ C. (10.1)
That means, e. g., that every integer number is also a real number, but not every integer number
is a natural number. Furthermore, the special sets Z₊ and R₊ denote the sets of all positive
integer numbers and all positive real numbers, respectively.
Intervals
When defining functions, it is common to limit the value of numbers to specific intervals. One
distinguishes finite from infinite intervals. Specifically, finite intervals can be defined in four
different ways:
[a, b] = {x ∣ a ≤ x ≤ b} closed interval, (10.2)
Modulo operation
The modulo operation gives the remainder of a division of two positive numbers a and b. It is
defined for a ∈ R₊ and b ∈ R₊ ∖ {0}, and written a mod b. In
programming, the modulo operation is frequently used for integer numbers a and b, because a
cyclic mapping can be easily realized, i. e., N + 1 → 1 can be obtained by
modulo(N + 1, N). (10.12)
Example 10.2.1.
For a = 3 and b = 7, the modulo operation yields 3 mod 7 = 3.
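In R, the modulo operation is available as the binary operator %%:

```r
# remainder of a division
3 %% 7   # 3
10 %% 3  # 1

# cyclic mapping as in equation (10.12): N + 1 is mapped back to 1
N <- 12
(N + 1) %% N  # 1
```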
Rounding operations
The floor and ceiling operations round a real number down or up, respectively, to its nearest
integer value. The corresponding functions are denoted by
⌊x⌋ floor function, (10.13)
⌈x⌉ ceiling function. (10.14)
Sign function
The sign of a real number x ∈ R is given by
sign(x) = { 1 if x > 0; 0 if x = 0; −1 if x < 0 }. (10.15)
Absolute value
The absolute value of a real number x ∈ R is
abs(x) = |x| = { +x if x ≥ 0; −x if x < 0 }. (10.16)
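These operations correspond directly to the base R functions floor(), ceiling(), sign(), and abs():

```r
floor(2.7)    # 2
ceiling(2.3)  # 3
sign(-4.2)    # -1
sign(0)       # 0
abs(-3.5)     # 3.5
```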
is the set containing chess pieces, and D = {A , C } is a set of sets. From these examples, one
can see that an object is something very generic, and a set is just a container for objects. Usually,
the objects of a set are enclosed by the brackets “{” and “}”.
The symbol ∈ denotes the membership relation to indicate that an object is contained in a set.
For instance, 2 ∈ A, and ∘ ∈ B. Here, the objects 2 and ∘ are also called elements of their
corresponding sets. The symbol ∈ is a relation, because it establishes a connection between an
object and a set and, hence, relates both with each other.
If we have two sets, A₁ and A₂, and every element in A₁ is also contained in A₂, but there
are also elements in A₂ that are not in A₁, we write A₁ ⊂ A₂. In this case, A₁ is called a subset of
A₂. If every element in A₁ is also contained in A₂, and every element in A₂ is also contained in A₁,
we write A₁ = A₂, because both sets contain the same elements. Finally, if every
element in A₁ is also contained in A₂, and there is at least one additional element in A₂, we write
A₂ ⊃ A₁, i. e., A₂ is a superset of A₁.
A special set is the empty set, denoted by ∅, which does not contain any element. |A| is the
cardinality of A, i. e., the number of its elements. It is possible that a set contains a finite or infinite
number of elements. For instance, for the above set B, we have |B| = 2 , and for the set of
natural numbers |N| = ∞ .
The set A₁ ∪ A₂ = {x : x ∈ A₁ ∨ x ∈ A₂} is called the union of A₁ and A₂. To define such
sets, we used the colon symbol “:” within the curly brackets. This symbol means “with the
property”. Hence, the set {x : x ∈ A₁ ∨ x ∈ A₂} can be read explicitly as every x that is an element
of A₁ or of A₂.
Figure 10.2 Visualization of set operations. Left: The union of two sets. Right: The cut set of A₁
and A₂.
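In R, finite sets can be represented as vectors, and the operations from →Figure 10.2 are available as built-in functions:

```r
A1 <- c(1, 2, 3)
A2 <- c(3, 4, 5)

union(A1, A2)      # the union: 1 2 3 4 5
intersect(A1, A2)  # the cut set: 3
setdiff(A1, A2)    # elements of A1 not in A2: 1 2

# membership relation and cardinality
2 %in% A1   # TRUE
length(A1)  # |A1| = 3
```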
An alphabet Σ is a finite set of atomic symbols, e. g., Σ = {a, b, c}. That means, Σ contains all
elements for a given setting. No other elements can exist.
Σ⋆ is the set of all words over Σ. For example, if Σ = {b}, then Σ⋆ = {ϵ, b, bb, bbb, bbbb, …}.
Definition 10.3.1.
The expression ∀ means for all. For example, if A = {a₁, a₂, a₃}, then by ∀x ∈ A, we mean all
elements in set A, i. e., a₁, a₂, a₃.
Definition 10.3.2.
The expression ∃ means there exists. For example, if B = {−1, 2, 3}, then by ∃x ∈ B : x < 3, we
mean that in set B there exists an element which is less than 3. Possibly, there is more than one
such element, as is the case for B.
Definition 10.3.3.
The expression ∃! means there exists exactly one. For example, ∃! x ∈ B : x < 2 means that in the
set B there exists exactly one element which is less than 2.
10.4 Boolean logic
The operators ∧ and ∨ denote the logical and and or, respectively. They are used to
combine logical variables v, q ∈ {1, 0}. Sometimes the logical variables are expressed as
{true, false}. By using operations from the set
of logical operators, we can easily construct logical formulas. For instance, the formulas
v ∨ q, ¬(v ∨ q), (v ∨ q) ∧ ¬(v ∨ q) (10.19)
represent valid logical formulas as they are derived by using the operators in (→10.18). However,
according to this definition, the formulas
(v ∨ q)qq, (v q) (10.20)
the set operators O , similar to the ones given in equation (→10.19). The following statements
about logical formulas hold:
Theorem 10.4.1 (Commutative laws [→98]).
S1 ∧ S2 ⟺ S2 ∧ S1 (10.21)
S1 ∨ S2 ⟺ S2 ∨ S1 (10.22)
Theorem →10.4.1 says that the logical arguments can be switched for the logical operators
and and or. Theorem →10.4.2 says that we may successively shift the brackets to the right.
Similarly, when expanding expressions over the reals, for instance x(x + 1) = x² + x, the
distributive law is applied. An example of De Morgan’s laws, expressed for sets, is
$\overline{A \cap B} = \overline{A} \cup \overline{B}$. (10.30)
The resulting statements (or forms) are called normal forms, and important examples thereof
are the disjunctive normal form and conjunctive normal form of logical expressions, see [→98].
where
Sᵢ = S_{j1} ∧ S_{j2} ∧ ⋯ ∧ S_{jk}. (10.32)
The terms S_{ji} are literals, i. e., logical variables or the negation thereof.
Two examples for logical formulas given in disjunctive normal form are
(v ∧ q) ∨ (¬v ∧ ¬q) (10.33)
or
v ∨ (v ∧ q). (10.34)
Here we denote the literals by using the notations v and q for logical variables.
where
Sᵢ = S_{j1} ∨ S_{j2} ∨ ⋯ ∨ S_{jk}. (10.36)
or
v ∧ (v ∨ q). (10.38)
In practice, the application of Boolean functions [→98] has been important to develop
electronic chips for computers, mobile phones, etc. A logic gate [→98] represents an electronic
component that realizes (computes) a Boolean function f(v₁, …, v_n) ∈ {0, 1}; the vᵢ are logical
variables. These logic gates use the logical operators ∧, ∨, ¬ and transform input signals into
output signals. →Figure 10.3 shows the elementary logic gates and their corresponding truth
tables.
Figure 10.3 Elementary logic gates of Boolean functions and their corresponding truth table. The
top symbol corresponds to the IEC, and the bottom to the US standard symbols.
We see in →Figure 10.3 that the OR-gate is based on the functionality of the operator ∨. That
means, the output signal of the OR-gate equals 1 as soon as one of its input signals is 1.
The output signal of the AND-gate equals 1 if and only if all input signals equal 1. As soon as
one input signal equals 0, the value of the Boolean function computed by this gate is 0.
The NOT-gate computes the logical negation of the input signal. If the input signal is 1, the
NOT-gate gives 0, and vice versa.
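The truth tables of these elementary gates can be reproduced with R’s logical operators & (and), | (or), and ! (not):

```r
v <- c(FALSE, FALSE, TRUE, TRUE)
q <- c(FALSE, TRUE, FALSE, TRUE)

# truth table for the AND-, OR-, and NOT-gates
data.frame(v = v, q = q, AND = v & q, OR = v | q, NOT.v = !v)
```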
summarized by the sum operation (∑) and the product operation (∏).
Sum
The sum, ∑, is defined for numbers aᵢ involving all integer indices i_l, i_u ∈ N from i_l, i_l + 1, …, i_u,
i. e.,
∑_{i=i_l}^{i_u} aᵢ = a_{i_l} + a_{i_l+1} + ⋯ + a_{i_u}. (10.39)
Here, “l” indicates “lower”, whereas “u” means “upper”, to denote the beginning and ending of
the indices. For i_l = 1 and i_u = n, we obtain the sum over all elements of A,
The latter form needs to be used if only selected indices should be used for the summation. For
instance, suppose, I = {2, 4, 5} is an index set containing the desired indices for the summation
then
∑_{i∈I} aᵢ = ∑_{i∈{2,4,5}} aᵢ = a₂ + a₄ + a₅. (10.41)
Product
Similar to the sum, the product, ∏, is also defined for numbers aᵢ involving all integer indices
i_l, i_u ∈ N from i_l, i_l + 1, …, i_u, i. e.,
∏_{i=i_l}^{i_u} aᵢ = a_{i_l} · a_{i_l+1} · ⋯ · a_{i_u}; (10.42)
and, analogous to the sum, for an index set I = {i_l, …, i_u},
∏_{i∈I} aᵢ = a_{i_l} · a_{i_l+1} · ⋯ · a_{i_u}. (10.43)
Remark 10.5.1.
In the above discussions of the sum and product, we assumed integer indices for the
identification of the numbers aᵢ, i. e., i ∈ N. However, we would like to remark that, in principle,
this can be generalized to arbitrary “labels”. For instance, for the set A = {a_△, a_∘, a_⊗}, we can
write
∑_{i∈{△,∘,⊗}} aᵢ = a_△ + a_∘ + a_⊗, (10.44)
∏_{i∈{△,∘,⊗}} aᵢ = a_△ · a_∘ · a_⊗. (10.45)
Hence, from a mathematical point of view, the nature of the indices is flexible. However, whenever
we implement a sum or a product with a programming language, integer values for the indices
are advantageous, because, e. g., the indexing of vectors or matrices is accomplished via integer
indices.
In R, the most flexible way to realize sums and products is via loops. However, if one just wants a
sum or a product over all elements in a vector A, from i_l = 1 to i_u = N, one can use the
following commands:
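A sketch of the two built-in commands, together with loop versions for comparison:

```r
A <- c(2, 4, 5)

sum(A)   # 11
prod(A)  # 40

# equivalent loop implementations
s <- 0
for (a in A) s <- s + a
p <- 1
for (a in A) p <- p * a
```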
Binomial coefficients
For all natural numbers k, n ∈ N with 0 ≤ k ≤ n , the binomial coefficient, denoted C(n, k) , is
defined by
C(n, k) = n! / (k! (n − k)!). (10.46)
It is interesting to note that a binomial coefficient is a natural number itself, i. e., C(n, k) ∈ N .
For the definition of a binomial coefficient the factorial “!” of a natural number is used. The
factorial of n is just the product of the numbers from 1 to n, i. e.,
n! = ∏_{i=1}^{n} i = 1 · 2 · ⋯ · n. (10.47)
The binomial coefficient has the combinatorial meaning that from n objects, there are C(n, k)
ways to select k objects without considering the order in which the objects have been selected. In
→Figure 10.4, we show an urn with n = 4 objects. From this urn, we can draw k = 2 objects in 6
different ways.
Also, the factorial n! has a combinatorial meaning. It gives the number of different arrangements
of n objects by considering the order. For instance, the objects {1,2,3} can be arranged in 3!=6
different ways:
(1, 2, 3) − (1, 3, 2) − (2, 3, 1) − (2, 1, 3) − (3, 1, 2) − (3, 2, 1). (10.48)
Figure 10.4 Visualization of the meaning of the Binomial coefficient C(4, 2) .
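Both quantities are available in R as choose() and factorial(); the urn example C(4, 2) = 6 and the arrangement count 3! = 6 can be verified directly. The enumeration of permutations via expand.grid() is just one illustrative base-R approach:

```r
choose(4, 2)  # 6 ways to draw 2 of 4 objects
factorial(3)  # 6 arrangements of 3 objects

# enumerate all 3! permutations of {1, 2, 3}:
# build all triples and keep those with 3 distinct entries
perms <- expand.grid(1:3, 1:3, 1:3)
perms <- perms[apply(perms, 1, function(r) length(unique(r)) == 3), ]
nrow(perms)  # 6
```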
C(n, n) = 1, (10.50)
C(n, k) = C(n, n − k), (10.51)
∀n ∈ N, and 0 ≤ k ≤ n .
The following recurrence relation for binomial coefficients is called Pascal’s rule:
C(n + 1, k + 1) = C(n, k) + C(n, k + 1). (10.52)
In →Figure 10.5, we visualize the result of Pascal’s rule for n ∈ {0, … , 6} . The resulting object is
called Pascal’s triangle.
Figure 10.5 Pascal’s triangle for Binomial coefficients. Visualized is the recurrence relation for
Binomial coefficients in equation (→10.52).
a*_max = max_{i=1,…,n}{A} = {aᵢ ∣ aᵢ ∈ A and aᵢ ≥ aⱼ ∀ j ≠ i}. (10.54)
If there is more than one element that is minimum or maximum, then the corresponding sets
a*_min and a*_max contain more than one element.
i*_max = argmax_{i=1,…,n}{A} = {i ∣ aᵢ ∈ A and aᵢ ≥ aⱼ ∀ j ≠ i}. (10.56)
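In R, the value and the index of a maximum are obtained with max() and which.max(); note that which.max() returns only the first index if the maximum is not unique:

```r
A <- c(3, 1, 4, 1, 5, 9, 2, 6)

max(A)        # a*_max = 9
which.max(A)  # i*_max = 6

# all indices attaining the maximum (handles ties)
which(A == max(A))  # 6
```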
Logical statements
A logical statement may be defined verbally or mathematically, and has the values true or false.
For simplicity, we define the Boolean value 1 for true, and 0 for false. One can show that the set
{true, f alse} is isomorphic to the set {0,1}.
The Boolean value of the statement “The next autumn comes for sure” equals 1 and, hence, the
statement is true. From a probabilistic point of view, this event is certain and its probability equals
one. Therefore, we may conclude that this statement does not contain any information, see also
[→169]. The following inequalities and equations
i = −5, (10.57)
50 + 20 + 30 = 100, (10.58)
−1 ≥ 5, (10.59)
1 < 2, (10.60)
∑_{j=1}^{n} j = n(n + 1)/2, n ∈ N, (10.61)
are mathematical statements, which are true or false. The first equation is false, as i = √−1,
where i is the imaginary unit of a complex number z = a + ib. The second equation is obviously
true, as 50 + 20 + 30 equals 100. The third statement is false, since a negative number cannot be
greater than or equal to a positive number; its Boolean value is therefore false. The fourth statement
represents an inequality too, and is true. Strictly speaking, the fifth equation is a statement form
(Sf) over the natural numbers, as it contains the variable n ∈ N.
In general, statement forms contain variables and are true or false. In the case of equation (→10.61),
we can write ⟨Sf(n)⟩ = ⟨∑_{j=1}^{n} j = n(n + 1)/2⟩. This statement form is true for all n ∈ N and can be
statements. The statement S₁ ∧ S₂ means that S₁ and S₂ hold. This statement may have the
value true or false, see →Fig. 10.3. For instance, S₁ := (2 + 2 = 4) ∧ S₂ := (3 + 3 = 6) is true, but
S₁ := (2 + 2 = 4) ∧ S₃ := (3 + 3 = 9) is false. Similarly, S₁ ∨ S₂ means that S₁ or S₂ holds.
true as well. The logical negation of the statement S is usually denoted by ¬S. The well-known
triangle inequality,
|x₁ + x₂| ≤ |x₁| + |x₂|, x₁, x₂ ∈ R, (10.63)
This means
|x1 + x2 | > |x1 | + |x2 | (10.65)
is generally false.
Statement: ⇒
The logical implication S₁ ⟹ S₂ means that S₁ implies S₂. Verbally, one can say S₁ “logically
implies” S₂.
Statement: ⇔
The statement S₁ ⟺ S₂ is stronger, because S₁ holds if and only if S₂ holds.
For the above statements, it is important to note that to go from the left statement to the right
one, or vice versa, one needs to apply logical operators (¬, ∧, ∨) or algebraic operations (+, −, /,
etc.). For instance, by assuming the true statement n² ≥ 2n, n > 1, we obtain the implications
n² ≥ 2n ⟹ n² − 2n ≥ 0 ⟹ n² − 2n + 1 = (n − 1)² ≥ 0. (10.66)
Finally, we want to remark that a false statement may imply a true statement; i² = 1 (false, as
i² = −1) implies 0 · i² = 0 · 1 (true).
Definition →10.7.1 defines the sum of two real numbers based on the trivial definition of the
symbol “+”.
Definition 10.7.2.
Let a, b ∈ R . The function fL : R ⟶ R, given by
fL (x) := ax + b, (10.68)
is a linear function whose root is given by x = −b/a (for a ≠ 0). The root can be obtained
by performing elementary calculations. Specifically, the first elementary calculation is
subtracting b from ax + b = 0. Second, we divide the resulting equation by a and obtain the
result.
Another example is the famous binomial theorem.
Theorem 10.7.2.
Let a, b ∈ R and n ≥ 1. Then,
(a + b)^n = ∑_{k=0}^{n} C(n, k) a^{n−k} b^k. (10.70)
Corollary 10.7.1.
(a + b)² = a² + 2ab + b². (10.71)
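The theorem can be checked numerically in R with choose(); the values a = 2, b = 3, n = 5 are arbitrary example choices:

```r
a <- 2; b <- 3; n <- 5

lhs <- (a + b)^n
rhs <- sum(choose(n, 0:n) * a^(n - 0:n) * b^(0:n))

all.equal(lhs, rhs)  # TRUE: both sides equal 5^5 = 3125
```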
10.8 Summary
In general, the mathematical language is meant to help with the precise formulation of problems.
If one is new to the field, such formulations can be intimidating at first and verbal formulations
may appear as sufficient. However, with a bit of practice one realizes quickly that this is not the
case, and one starts to appreciate and to benefit from the power of mathematical symbols.
Importantly, the mathematical language has a profound implication on the general mathematical
thinking capabilities, which translate directly to analytical problem-solving strategies. The latter
skills are key for working successfully on data science projects, e. g., in business analytics, because
the process of analyzing data requires a full comprehension of all involved aspects, and the often
abstract relations.
11 Computability and complexity
This chapter provides a theoretical underpinning for the programming in R that we introduced in
the first two parts of this book. Specifically, we introduced R practically by discussing various
commands for computing solutions to certain problems. However, computability can be defined
mathematically in a generic way that is independent of a programming language. This paves the
way for determining the complexity of algorithms. Furthermore, we provide a mathematical
definition of a Turing machine, which is a mathematical model for an electronic computer. To
place this in its wider context, this chapter also provides a brief overview of several major
milestones in the history of computer science.
11.1 Introduction
Nowadays, the use of information technologies and the application of computers are ubiquitous.
Almost everyone uses computer applications to store, retrieve, and process data from various
sources. A simple example is a relational database system for querying financial data from stock
markets, or finding companies’ telephone numbers. More advanced examples include programs
that facilitate risk management in life insurance companies or the identification of chemical
molecules that share similar structural properties in pharmaceutical databases [→54], [→170].
The foundation of computer science is based on theoretical computer science [→163], [→164].
Theoretical computer science is a relatively young discipline that, put simply, deals with the
development and analysis of abstract models for information processing. Core topics in
theoretical computer science include formal language theory and compilers [→121], [→160],
computability [→22], complexity [→37], and semantics of programming languages [→122],
[→126], [→167] (see also Section →2.8). More recent topics include the analysis of algorithms
[→37], the theory of information and communication [→40], and database theory [→124]. In
particular, the mathematical foundations of theoretical computer science have influenced modern
applications tremendously. For example, results from formal language theory [→160] have
influenced the construction of modern compilers [→121]. Formal languages have been used for
the analysis of automata. The automata model of a Turing machine has been used to formalize
the term algorithm, which plays a central role in computer science. When dealing with algorithms,
an important question is whether they are computable (see Section →2.2). Another crucial issue
relates to the analysis of algorithms’ complexity, which provides upper and lower bounds on their
time complexity (see Section →11.5.1). Both topics will be addressed in this chapter.
Given an alphabet Σ, one can print only one character c ∈ Γ in each field. A special character (e.
g., $) is used to fill the empty fields (blank symbol).
The transition function δ is crucial for the control unit, and encodes the program of the Turing
machine (see →Figure 11.1). The Turing table conveys information about the current and
subsequent stages of the machine after it reads a character c ∈ Γ . This initiates certain actions of
the read/write head, namely
l: moving the head exactly one field to the left.
r: moving the head exactly one field to the right.
x: overwriting the content of a field with x ∈ Γ ∪ {$} without moving the head.
A fundamental question of theoretical computer science concerns the types of functions that are
computable using Turing machines. For example, it emerged that functions defined on words (e.
g., f : Σ⋆ ⟶ Σ⋆) are Turing-computable if there is at least one Turing machine that stops after
a finite number of steps in the final state. We wish to emphasize that this also holds for other
functions (e. g., multivariate functions over several variables).
We conclude this section with an important observation regarding Turing completeness. This
term is relevant for basic paradigms of programming languages (see Chapter →2). A
programming language is deemed Turing-complete if all functions that can be computed by a
universal Turing machine are also computable with this language. For example, most modern
programming languages (from different paradigms), such as Java, C++, and Scheme, are Turing-
complete [→122].
11.4 Computability
We now turn to a fundamental problem in theoretical computer science: the determination as to
whether or not a function is computable [→164]. This problem can be discussed intuitively as well
as mathematically. We begin with the intuitive discussion, and then provide its mathematical
formulation. It is generally accepted that function f : N ⟶ N is computable if an algorithm to
compute f exists. Therefore, assuming an arbitrary n ∈ N as input, the algorithm should stop
after a finite number of computation steps with output f (n) . When discussing this simple model,
we did not take into account any considerations regarding a particular processor or memory.
Evidently, however, it is necessary to specify such steps to implement an algorithm. In practical
terms, this is complex, and can only be accomplished by a general mathematical definition to
decide whether a function f : N ⟶ N is computable.
A related problem is whether any arbitrary problem can be solved using an algorithm, and, if
not, whether the algorithm can identify the problem as noncomputable. This is known as the
decision problem formulated by Hilbert, which turned out to be unsolvable [→36]. A counter-example
is Gödel’s well-known incompleteness theorem [→36]. Put simply, it states that no algorithm exists
that can verify whether an arbitrary statement over N is true or false. To explore Gödel’s
statement in depth, several formulations of the term algorithm as a computational procedure
have been proposed. A prominent example thereof was proposed by Church, who explored the
well-known Lambda calculus, which can be understood as a mathematical programming
language (see [→122]). It was in this context also that Turing developed the concept of a Turing
machine [→36], [→122] (see Section →11.3). Another contribution by Gödel is an alternative
computational procedure based on the definition of complex mathematical functions composed
of simple functions. The result of all these developments was that the Church-Turing thesis, which
states that all the above-mentioned computational processes (algorithms) are equivalent, was
proven.
Furthermore, it has been proven that computability does not depend on a specific
programming language (see [→122]). In other words, most programming languages are
equipotent [→122]. For example, suppose that we solve a problem by using an imperative
programming language, such as Fortran (see Section →2.2). Then, an equivalent algorithm exists
that can be implemented using a functional language, such as Scheme (see Section →2.3).
A mathematical definition of computable can be formulated as follows:
a finite number of computation steps in the case where f is defined for n₁, n₂, …, n_k. In the case
where f is not defined for the input, the algorithm does not terminate.
We wish to note that a similar definition can be given for functions defined on words (e. g.,
f : Σ⋆ ⟶ Σ⋆; see [→36], [→164]). Examples of computable functions include the following:
11.5.1 Bounds
Let n be the input size of an algorithm (i. e., the number of data elements to be processed). The
time complexity of an algorithm is determined by the maximal number of steps (e. g., value
assignments, arithmetic operations, memory allocations, etc.) in relation to input size required to
obtain a specific result.
In the following, we describe how to measure the time complexity of an algorithm asymptotically,
and describe several forms thereof. First, we state an upper bound for the time complexity that
will be attained in the worst case (O-notation). To begin, we provide a definition of real
polynomials, as they play a crucial role in the asymptotic measurement of algorithms’ time
complexity.
f(x) = a_n x^n + a_{n−1} x^{n−1} + ⋯ + a₀, a_n ≠ 0, a_k ∈ R, k = 0, 1, …, n, (11.8)
is called a real polynomial. Generally speaking, a polynomial is called real if its coefficients are real.
To define an asymptotic upper bound for the time complexity of an algorithm, the O-notation is
required.
Definition →11.5.2 means that g(n) is an asymptotic upper bound of f(n) if a constant c > 0
and a natural number n₀ exist such that f(n) is less than or equal to c · g(n) for n ≥ n₀.
In contrast to the worst case, described by the O-notation, we now define an asymptotic lower
bound that describes the “least” complexity. This is provided by the Ω-notation.
To simultaneously define upper and lower bounds for the time complexity, the Θ-notation is used.
According to Definition →11.5.4, g(n) is an exact asymptotic bound of f(n) if two constants c_1, c_2 > 0 exist, and a natural number n_0, such that f(n) lies in between c_1 ⋅ g(n) and c_2 ⋅ g(n) if n ≥ n_0.
11.5.2 Examples
In this section, some examples are given to illustrate the definitions of the asymptotic bounds. In
practice, the O-notation is the most important and widely used. Hence, the following examples will
focus on it.
To simplify the notation, we denote the number of calculation steps in an algorithm by f(n). Let f(n) := n^2 + 3n. To determine the complexity class O(n^k), k ∈ N, the constants c and n_0 must be determined. Using Definition →11.5.2, setting c = 4 and g(n) = n^2, the following inequalities can be verified:
n^2 + 3n ≤ 4n^2, or 3n ≤ 3n^2. (11.12)
These hold for all n ≥ 1; hence, with n_0 = 1,
n^2 + 3n ≤ c ⋅ n^2. (11.13)
Thus, we obtain n^2 + 3n ∈ O(n^2).
A second example is the function
f(n) := c_5 n^5 + c_4 n^4 + c_3 n^3 + c_2 n^2 + c_1 n + c_0. (11.14)
Proceeding as above, we can bound the lower-order terms using a suitable constant c, and obtain f(n) ∈ O(n^5). We see that the O-notation always emphasizes the fastest-growing term of a function. More generally, consider a polynomial of degree k,
f(n) = a_k n^k + a_{k−1} n^{k−1} + ⋯ + a_1 n + a_0,  a_k ≠ 0, (11.15)
and obtain
f(n) = |a_k n^k + a_{k−1} n^{k−1} + ⋯ + a_1 n + a_0|
     ≤ n^k (|a_k| + |a_{k−1}|/n + |a_{k−2}|/n^2 + ⋯ + |a_0|/n^k). (11.16)
Inequality (→11.16) has been obtained using the triangle inequality [→178]. By setting
c := |a_k| + |a_{k−1}| + |a_{k−2}| + ⋯ + |a_0|,
we find f(n) ≤ c ⋅ n^k for all n ≥ 1, and since c = O(1), we conclude f(n) ∈ O(n^k).
In the final example, we use a simple imperative program (see Section →2.2) to calculate the sum of the first n natural numbers ( sum = 1 + 2 + ⋯ + n ). Basically, the pseudocode of this program consists of the initialization step, sum = 0, and a for-loop with variable i and body sum = sum + i for 1 ≤ i ≤ n. The first value assignment requires constant costs, say, c_1. In each step of the for-loop, incrementing the value of the variable sum requires constant costs c_2. Then we obtain the upper bound for the time complexity
f(n) = c_1 + n ⋅ c_2 ∈ O(n). (11.17)
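The pseudocode above can be written directly in R; the following is a minimal sketch (the function name is ours):

```r
# Sum of the first n natural numbers.
# The initialization costs c1; the loop body runs n times at cost c2 each,
# so the total running time is c1 + n * c2, i.e., O(n).
sum_first_n <- function(n) {
  sum <- 0
  for (i in 1:n) {
    sum <- sum + i
  }
  sum
}
sum_first_n(100)  # 5050
```

Of course, the closed form n(n + 1)/2 yields the same value in O(1) steps.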
In view of the importance of the O-notation for practical use, several of its properties are listed below:
c = O(1) for any constant c, (11.18)
O(c ⋅ f(n)) = O(f(n)), (11.19)
O(f(n)) + O(f(n)) = O(f(n)), (11.20)
O(log_b(n)) = O(log(n)), (11.21)
O(f(n)) + O(g(n)) = O(max(f(n), g(n))), (11.22)
O(f(n)) ⋅ O(g(n)) = O(f(n) ⋅ g(n)). (11.23)
An algorithm with a constant number of steps has time complexity O(1) (see equation (→11.18)). The second rule, given by equation (→11.19), means that constant factors can be neglected. If we execute a program with time complexity O(f(n)) sequentially, the final program will have the same complexity (see equation (→11.20)). According to equation (→11.21), the logarithmic complexity does not depend on the base b. Moreover, the sequential execution of two programs with different time complexities has the complexity of the program with the higher time complexity (see equation (→11.22)). Finally, the overall complexity of a nested program (for example, two nested loops) is the product of the individual complexities (see equation (→11.23)).
Finally, we list some examples of algorithm complexity classes:
O(1) consists of programs with constant time complexity (e. g., value assignments, searching procedures).
O(n^2) consists of programs with quadratic time complexity (e. g., a simple sorting algorithm, or the shortest path problem proposed by Dijkstra [→58], where n is the number of vertices in a network).
O(n^k) generally consists of programs with polynomial time complexity. Obviously, O(n) and O(n^2) are special cases thereof.
O(log(n)) consists of programs with logarithmic time complexity (e. g., binary searching [→37]).
O(2^n) consists of programs with exponential time complexity (e. g., enumeration problems). Such algorithms could possibly be used when searching for graph isomorphisms or cycles in graphs with bounded vertex degrees (see [→130]).
11.6 Summary
At this juncture, it is worth reiterating that, despite the apparent novelty of the term data science,
the fields on which it is based have long histories, among them theoretical computer science
[→61]. The purpose of this chapter has been to show that computability, complexity, and the
computer, in the form of a Turing machine, are mathematically defined. This aspect can easily be
overlooked in these terms’ practical usage.
The salient point is that data scientists should recognize that all these concepts possess
mathematical definitions which are neither heuristic nor ad-hoc. As such, they may be revisited if
necessary (e. g., to analyze an algorithm’s runtime). Our second point is that not every detail
about these entities must be known. Given the intellectual complexity of these topics, this is
encouraging, because acquiring an in-depth understanding of these is a long-term endeavor.
However, even a basic understanding is valuable and helps to improve practical programming and data analysis skills.
12 Linear algebra
One of the most important and widely used subjects of mathematics is linear algebra [→27]. For
this reason, we begin this part of the book with this topic. Furthermore, linear algebra plays a
pivotal role for the mathematical basics of data science.
This chapter opens with a brief introduction to some basic elements of linear algebra, e. g.,
vectors and matrices, before discussing advanced operations, transformations, and matrix
decompositions, including Cholesky factorization, QR factorization, and singular value
decomposition [→27].
12.1.1 Vectors
Vectors define quantities, which require both a magnitude, i. e., a length, and a direction to be
fully characterized. Examples of vectors in physics are velocity or force. Hence, a vector extends a
scalar, which defines a quantity fully described by its magnitude alone. From an algebraic point of
view, a vector, in an n-dimensional real space, is defined by an ordered list of n real scalars,
x_1, x_2, …, x_n, arranged in an array.
Definition 12.1.1.
A vector is said to be a row vector if its associated array is arranged horizontally, i. e.,
(x_1, x_2, …, x_n),
whereas a vector is called a column vector when its array is arranged vertically, i. e., (x_1, x_2, …, x_n)^T.
Geometrically, a vector can be regarded as a displacement between two points in space, and it is often denoted using a symbol surmounted by an arrow, e. g., V⃗ ; in the following, we simply write V for a vector when no confusion can arise.
Definition 12.1.2.
Let V = (x_1, x_2, …, x_n) be an n-dimensional real vector. Then, the p-norm of V, denoted ‖V‖_p, is defined by the following quantity:
‖V‖_p = (∑_{i=1}^{n} |x_i|^p)^{1/p}. (12.1)
In particular,
1. the 1-norm of the vector V is defined by
‖V‖_1 = ∑_{i=1}^{n} |x_i|;
2. the 2-norm of the vector V is defined by
‖V‖_2 = (∑_{i=1}^{n} |x_i|^2)^{1/2}.
Definition 12.1.3.
A distance on the n-dimensional Euclidean space E^n is a function
d : E^n × E^n ⟶ R.
Remark 12.1.1.
The Euclidean distance between two points x and y in E^n is given by d(x, y) = ‖x − y‖_2.
Definition 12.1.4.
The magnitude of a vector AB is defined by the non-negative scalar given by its Euclidean norm, denoted ‖AB‖_2 or simply ‖AB‖.
Specifically, the magnitude of a 2-dimensional vector AB is given by
‖AB‖ = √((x_B − x_A)^2 + (y_B − y_A)^2).
In applications, the Euclidean norm is sometimes also referred to as the Euclidean distance.
Using R, the norm of a vector can be computed as illustrated in Listing 12.1.
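For illustration, the Euclidean norm can be computed as follows (a minimal sketch; the variable names are ours):

```r
# Euclidean (2-)norm of a vector
v <- c(3, 4)
sqrt(sum(abs(v)^2))          # 5
# The same result via the built-in norm() function:
norm(matrix(v), type = "2")  # 5
```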
Definition 12.1.5.
Two n-dimensional vectors V and W are said to be parallel if they have the same direction.
Definition 12.1.6.
Two n-dimensional vectors V and W are said to be equal if they have the same direction and the same magnitude.
Various transformations and operations can be performed on vectors, and some of the most
important will be presented in the following sections.
Figure 12.2 An example where the Euclidean distance ‖x − x_i‖ is used for the classifier k-NN. A point x is assigned a label based on a majority vote, considering its nearest k neighbors. In this example, k = 4.
Example 12.1.1.
For supervised learning, k-NN (k nearest neighbors) [→96] is a simple yet efficient way to classify data. Suppose that we have a high-dimensional data set with two classes, whose data points represent vectors. Let x be a point that we wish to assign to one of these two classes. To predict the class label of a point x, we calculate the Euclidean distance, introduced above (see Remark →12.1.1), between x and all other points x_i, i. e., d_i = ‖x − x_i‖. Then, we order these distances d_i in an increasing order. The k-NN classifier now uses the nearest k distances to obtain a majority vote for the prediction of the label for the point x. For instance, in →Figure 12.2, a two-dimensional example is shown for k = 4. Among the four nearest neighbors of x are three red points and one blue point. This means the predicted class label of x would be “red”. In the extreme case k = 1, the point x would be assigned to the class of the single nearest neighbor. The k-NN method is an example of an instance-based learning algorithm. There are many variations of the k-NN approach presented here, e. g., considering weighted voting to overcome the limitations of majority voting in case of ties.
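A classifier along these lines can be sketched in R as follows (the function name, data, and labels are ours, for illustration):

```r
# Minimal k-nearest-neighbors classifier using the Euclidean distance.
# train: numeric matrix (one row per point), labels: vector of class labels,
# x: numeric vector to classify, k: number of neighbors.
knn_predict <- function(train, labels, x, k = 4) {
  # Euclidean distances between x and every training point
  d <- sqrt(rowSums(sweep(train, 2, x)^2))
  nearest <- order(d)[1:k]          # indices of the k nearest points
  votes <- table(labels[nearest])   # majority vote among the neighbors
  names(votes)[which.max(votes)]
}

train <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
labels <- c("red", "red", "red", "blue", "blue", "blue")
knn_predict(train, labels, c(0.5, 0.5), k = 3)  # "red"
```

Note that ties are resolved here by which.max(), which simply picks the first maximum; weighted voting, as mentioned above, is one way to handle ties more carefully.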
12.1.1.1 Vector translation
Let V = AB denote the displacement between two points A and B. The same displacement V, starting from a point A′ to another point B′, defines a vector A′B′.
The vector A′B′ is called a translation of the vector AB, and the two vectors AB and A′B′ are equal when they have the same direction and ‖AB‖ = ‖V‖ = ‖A′B′‖. Hence, the translation of a vector AB is a transformation that maps a pair of points A and B to another pair of points A′ and B′, such that
1. AA′ = BB′;
2. AA′ and BB′ are parallel.
In a two-dimensional space, the vector AB and its translation A′B′ form opposite sides of a parallelogram.
Definition 12.1.7.
A vector V = A′B′ is called a standard vector if its initial point (i. e., the point A′) coincides with the origin of the coordinate system. Hence, using vector translation, any given vector can be transformed into a standard vector, as illustrated in →Figure 12.3 (a).
A vector transformation, which changes the direction of a vector while its initial point remains unchanged, is called a rotation. This results in an angle between the original vector and its rotated counterpart, called the rotation angle. Let V = (x_A, y_A) be a 2-dimensional vector, and V′ = (x_{A′}, y_{A′}) its rotation by an angle θ (see →Figure 12.3 (b)). Then, the following property holds:
‖V‖ = ‖V′‖.
Various operations can be carried out on vectors, including the product of a vector by a scalar,
the sum, the difference, the scalar or dot product, the cross product, and the mixed product. In
the following sections, we will discuss such operations.
Figure 12.3 Vector transformation in a 2-dimensional space: (a) Translation of a vector. (b)
Rotation of a vector.
Let V = (v_1, v_2, …, v_n) be an n-dimensional vector, and let k be a scalar. Then, the product of k with V, denoted k × V, is a vector U defined as follows:
U = (k × v_1, k × v_2, …, k × v_n).
Geometrically, the vector U is aligned with V, but |k| times longer or shorter.
Definition 12.1.8.
If V is non-null (i. e., not all of its components are zero), then V and U = kV are said to be parallel if k > 0, and anti-parallel if k < 0. In the particular case where k = −1, V and U are said to be opposite vectors (see →Figure 12.4 (b) for an illustration).
Figure 12.4 Vector transformation in a 2-dimensional space: (a) Orthogonal projection of a
vector. (b) Vector scaling.
Figure 12.5 Vector operations in a two-dimensional space: (a) Sum of two vectors. (b) Difference
between two vectors.
Definition 12.1.9.
For any scalar k, the vectors V and U = k × V are said to be collinear.
12.1.1.4 Vector sum
Let V = (v_1, v_2, …, v_n) and W = (w_1, w_2, …, w_n) be two n-dimensional vectors. Then, the sum of V and W, denoted V + W, is a vector S defined as follows:
S = V + W = (v_1 + w_1, v_2 + w_2, …, v_n + w_n).
In a two-dimensional space, the vector sum S can be obtained geometrically, as illustrated in →Figure 12.5 (a); i. e., we translate the vector W until its initial point coincides with the terminal point of V. Since translation does not change a vector, the translated vector is identical to W. Then, the vector S is given by the displacement from the initial point of V to the terminal point of the translation of W. Note that the sum of vectors is commutative, i. e.,
V + W = W + V.
This means that if V had been translated instead of W, the result would be the same sum vector S. This is illustrated in →Figure 12.5 (a).
Let V = (v_1, v_2, …, v_n) and W = (w_1, w_2, …, w_n) be two n-dimensional vectors. Then, the difference between V and W, denoted V − W, is a vector D defined by the sum of V and the opposite of W, i. e.,
D = V + (−W) = (v_1 − w_1, v_2 − w_2, …, v_n − w_n).
This is illustrated geometrically in a 2-dimensional space in →Figure 12.5 (b). Note that, in contrast with the sum, the difference between two vectors is not commutative, i. e., V − W ≠ W − V.
It is often convenient to decompose a vector V into the vector components V_∥ and V_⊥, which are respectively parallel and perpendicular to the direction of another vector W, and such that
V = V_∥ + V_⊥.
In this case, the vector components of V are given by
V_∥ = ((V ⋅ W) / (W ⋅ W)) W,
V_⊥ = V − V_∥.
In a two-dimensional orthonormal space, the standard components of a vector V = OA, where O denotes the origin of the coordinate system, with respect to the x-axis and the y-axis, are simply the coordinates of the point A, i. e.,
V = OA = (x_A − x_O, y_A − y_O) = (x_A − 0, y_A − 0) = (x_A, y_A).
If α denotes the angle between the vector V and the x-axis, as illustrated in →Figure 12.1 (b), then we have the following relationships:
x_A = cos(α) × ‖V‖,
y_A = sin(α) × ‖V‖, (12.3)
y_A / x_A = tan(α).
The projection of an n-dimensional vector V onto the direction of a vector W, in an m-dimensional space, is a transformation that maps the terminal point of the vector V to a point in the space associated with the direction of W. This results in a vector P that is collinear to W.
Definition 12.1.10.
Let θ denote the angle between V and W. If the magnitude of the vector P is given by
‖P‖ = cos(θ) × ‖V‖,
then the projection of V onto the direction of W is said to be orthogonal.
Clearly, in a two-dimensional space, as depicted in →Figure 12.4 (a), the vector P, the orthogonal projection of V onto the direction of W, is nothing but the vector component of V parallel to W. Thus,
P = (x_P, y_P) = ((V ⋅ W) / (W ⋅ W)) (x_W, y_W).
The reflection of an n-dimensional vector V with respect to the direction of a vector W, in an m-dimensional space, is a transformation that maps the vector V to an n-dimensional vector U, such that (see →Figure 12.6)
U = V − 2 ((V ⋅ W) / ‖W‖^2) W.
In a two-dimensional orthonormal space, the components of the vector U are given by
(x_U, y_U) = (x_V, y_V) − k (x_W, y_W), with k = 2 (V ⋅ W) / ‖W‖^2.
Figure 12.6 Vector reflection in a two-dimensional space.
Let V = (v_1, v_2, …, v_n) and W = (w_1, w_2, …, w_n) be two n-dimensional vectors. The dot product, also called the scalar product, of V and W, denoted V ⋅ W, is a scalar p, defined as follows:
p = V ⋅ W = v_1 w_1 + v_2 w_2 + ⋯ + v_n w_n.
Geometrically, the dot product can be defined through the orthogonal projection of a vector onto another. Let α be the angle between two vectors V and W. Then,
V ⋅ W = cos(α) × ‖V‖ × ‖W‖, with cos(α) = (V ⋅ W) / (‖V‖ × ‖W‖).
In →Figure 12.4 (a), the norm of the projected vector P can be interpreted through the dot product between the vectors V and W.
Definition 12.1.11.
When the angle α between two vectors, V and W, is π/2 + kπ, where k is an integer, then the two vectors are said to be perpendicular or orthogonal to each other, and their dot product is given by
cos(π/2 + kπ) × ‖V‖ × ‖W‖ = 0 × ‖V‖ × ‖W‖ = 0.
The cross product is applicable to vectors in an n-dimensional space, with n ≥ 3. To illustrate this, let V and W be two three-dimensional standard vectors defined as follows:
V = OA = (x_A, y_A, z_A)^T and W = OB = (x_B, y_B, z_B)^T.
Then, the cross product of the vector V by the vector W, denoted V × W, is a vector C, perpendicular to both V and W, defined by
C = (x_C, y_C, z_C)^T = (y_A z_B − y_B z_A, −x_A z_B + x_B z_A, x_A y_B − x_B y_A)^T,
or
C = ‖V‖ × ‖W‖ × sin(θ) × u,
where u is the unit vector normal to both V and W, and θ is the angle between V and W. Thus,
‖C‖ = ‖V‖ × ‖W‖ × sin(θ) = A,
where A denotes the area of the parallelogram spanned by V and W, as illustrated in →Figure 12.7.
This is an operation on vectors, which involves both a cross and a scalar product. To illustrate this, let V, U, and W denote three three-dimensional vectors. Then, the mixed product between V, U, and W is a scalar p, defined by
p = (V × U) ⋅ W = (U × W) ⋅ V = (W × V) ⋅ U
  = V ⋅ (U × W) = U ⋅ (W × V) = W ⋅ (V × U) (12.4)
  = ±Vol,
where Vol denotes the volume of the parallelepiped spanned by V, U, and W.
In R, the above operations can be carried out using the scripts in Listing 12.2.
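A minimal sketch of such operations follows (the helper function cross() is ours; base R provides no built-in cross product):

```r
v <- c(1, 2, 3)
w <- c(4, 5, 6)

v + w        # vector sum
v - w        # vector difference
2 * v        # product with a scalar
sum(v * w)   # dot (scalar) product: 32

# Cross product of two 3-dimensional vectors
cross <- function(a, b) {
  c(a[2] * b[3] - a[3] * b[2],
    a[3] * b[1] - a[1] * b[3],
    a[1] * b[2] - a[2] * b[1])
}
cross(v, w)  # c(-3, 6, -3)

u <- c(1, 0, 1)
sum(cross(v, u) * w)  # mixed product (v x u) . w
```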
12.1.2 Vector representations in other coordinates systems
For various problems, the quantities characterized by vectors must be described in different
coordinate systems [→27]. Depending on the dimension of its space, a vector can be represented
in different ways. For instance, in a two-dimensional space, a standard vector V = OA, where O denotes the origin point, can be specified either by:
1. The pair (x_A, y_A), where x_A and y_A denote the coordinates of the point A, the terminal point of V, in a two-dimensional Euclidean space. The pair (x_A, y_A) defines the representation of the vector V in cartesian coordinates.
2. The pair (r, θ), where r = ‖V‖ is the magnitude of V and θ is the angle between the vector V and a reference axis in a cartesian system, e. g., the x-axis. The pair (r, θ) defines the representation of the vector V in polar coordinates.
The polar coordinates can be recovered from cartesian coordinates, and vice versa. Let V = OA = (x_A, y_A)^T be a standard vector in a two-dimensional cartesian space, as depicted in →Figure 12.8. Then, the polar coordinates of V can be obtained as follows:
r = √(x_A^2 + y_A^2),
θ = tan^{−1}(y_A / x_A), (12.5)
and, conversely,
x_A = r cos(θ),
y_A = r sin(θ).
Figure 12.8 Representation of a 2-dimensional vector in a polar coordinates system.
In R, the above coordinate transformations can be carried out using the commands in Listing
12.3.
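A minimal sketch of these transformations (the variable names are ours):

```r
# Cartesian -> polar coordinates of a 2-dimensional vector
xA <- 1; yA <- sqrt(3)
r     <- sqrt(xA^2 + yA^2)  # magnitude: 2
theta <- atan2(yA, xA)      # angle: pi/3
# Polar -> cartesian coordinates
c(r * cos(theta), r * sin(theta))  # c(1, sqrt(3))
```

Note that atan2() handles all quadrants correctly, whereas tan^{−1}(y_A / x_A) alone is only valid for x_A > 0.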
In a three-dimensional space, a standard vector V = OA, where O denotes the origin point, can be specified by one of the following:
1. The triplet (x_A, y_A, z_A), where x_A, y_A, and z_A denote the coordinates of the point A, the terminal point of V, in a three-dimensional Euclidean space. The triplet (x_A, y_A, z_A) defines the representation of the vector V in cartesian coordinates (see →Figure 12.9 (a) for illustration).
Figure 12.9 Representation of a point in a three-dimensional space in different coordinate systems: (a) cartesian coordinates system; (b) cylindrical coordinates system; (c) spherical coordinates system.
2. The triplet (ρ, θ, z_A), where ρ is the magnitude of the projection of V on the x-y plane, θ is the angle between the projection of the vector V on the x-y plane and the x-axis, and z_A is the third coordinate of A in a cartesian system. The triplet (ρ, θ, z_A) defines the representation of the vector V in cylindrical coordinates (see →Figure 12.9 (b) for illustration).
3. The triplet (r, θ, φ), where r = ‖V‖ is the magnitude of V, θ is the angle between the projection of the vector V on the x-y plane and the x-axis, and φ is the angle between the vector V and the z-axis. The triplet (r, θ, φ) defines the representation of the vector V in spherical coordinates (see →Figure 12.9 (c) for illustration).
Mutual relationships exist between cartesian, cylindrical, and spherical coordinates. Let V = OA = (x_A, y_A, z_A)^T be a standard vector in a three-dimensional cartesian space. From cartesian coordinates, the cylindrical coordinates can be obtained as follows:
ρ = √(x_A^2 + y_A^2),
θ = tan^{−1}(y_A / x_A), (12.7)
z_A = z_A.
Conversely, the cartesian coordinates are recovered from the cylindrical coordinates through
x_A = ρ cos(θ),
y_A = ρ sin(θ), (12.8)
z_A = z_A.
From cartesian coordinates, the spherical coordinates are given by
r = √(x_A^2 + y_A^2 + z_A^2),
θ = tan^{−1}(y_A / x_A), (12.9)
φ = cos^{−1}(z_A / r),
and, conversely,
x_A = r sin(φ) cos(θ),
y_A = r sin(φ) sin(θ), (12.10)
z_A = r cos(φ).
Relationships between cylindrical and spherical coordinates also exist. From cylindrical coordinates, the spherical coordinates can be obtained as follows:
r = √(ρ^2 + z_A^2),
θ = θ, (12.11)
φ = tan^{−1}(ρ / z_A),
whereas, from spherical coordinates, the cylindrical coordinates are given by
ρ = √(r^2 − z_A^2),
θ = θ, (12.12)
z_A = r cos(φ).
In R, the above coordinate system transformations can be carried out using the scripts in
Listing 12.4.
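The three-dimensional case can be sketched analogously (the variable names are ours):

```r
# Cartesian -> spherical coordinates
xA <- 1; yA <- 1; zA <- sqrt(2)
r     <- sqrt(xA^2 + yA^2 + zA^2)  # 2
theta <- atan2(yA, xA)             # pi/4
phi   <- acos(zA / r)              # pi/4
# Spherical -> cartesian coordinates
c(r * sin(phi) * cos(theta),
  r * sin(phi) * sin(theta),
  r * cos(phi))                    # c(1, 1, sqrt(2))
```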
Example 12.1.2.
Classification methods are used extensively in data science [→41], [→64]. An important
classification technique for high-dimensional data is referred to as support vector machine (SVM)
classification [→41], [→64] (see also Section →18.5.2).
For high-dimensional data, the problem of interest is to classify the (labeled) data by determining a separating hyperplane. When using linear classifiers, it is necessary to construct a hyperplane for optimal separation of the data points. To this end, it is necessary to determine the distance between a point representing a vector and the hyperplane.
Let H : δ_1 ⋅ x + δ_2 ⋅ y + δ_3 ⋅ z − a = 0 be a three-dimensional hyperplane with normal vector δ = (δ_1, δ_2, δ_3)^T and
‖δ‖ = √(δ_1^2 + δ_2^2 + δ_3^2). (12.13)
Then, the distance between a point (x, y, z) and the hyperplane H is given by
d_{a,H} = (δ_1 ⋅ x + δ_2 ⋅ y + δ_3 ⋅ z − a) / ‖δ‖. (12.14)
A two-dimensional hyperplane is shown in →Figure 12.10 to illustrate the SVM concept for a two-class classification problem. The data points represented by rectangles and circles represent the two classes, respectively. →Figure 12.10 illustrates a case of a two-class problem, where a linear classifier, represented by a hyperplane, can be used to separate the two classes. The optimal hyperplane is the one whose distance from the points representing the support vectors (SV) is maximal.
Figure 12.10 Constructing a two-dimensional hyperplane for SVM-classification.
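The point-to-hyperplane distance can be evaluated in R as follows (the numbers are ours, for illustration):

```r
# Distance between a point p and the hyperplane
# H: d1*x + d2*y + d3*z - a = 0, with normal vector delta = (d1, d2, d3)
delta <- c(1, 2, 2)
a <- 3
p <- c(2, 1, 1)
(sum(delta * p) - a) / sqrt(sum(delta^2))  # signed distance: 1
```

Taking the absolute value of this quantity yields the unsigned distance.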
A complex number is a number of the form x + iy, where x and y are real numbers and i is the imaginary unit, such that i = √−1, see [→158]. Specifically, the number x is called the real part of the complex number x + iy, whereas y is called the imaginary part. The set of complex numbers is commonly denoted by C; R is a subset of C, since any real number can be viewed as a complex number for which the imaginary part is zero, i. e., y = 0. Any complex number z = x + iy can be represented by the pair of reals (x_z, y_z); thus, a complex number can be viewed as a particular two-dimensional standard real vector. Let θ_z denote the angle between the vector z = (x_z, y_z) and the x-axis. Then, using the vector decomposition in a two-dimensional space, we have
x_z = r_z cos(θ_z), y_z = r_z sin(θ_z), where r_z = √(x_z^2 + y_z^2). (12.15)
The number r_z is called the modulus or the absolute value of z, whereas θ_z is called the argument of z.
From (→12.15), we can deduce the following alternative description of a complex number z = x_z + iy_z:
z = x_z + iy_z = r_z [cos(θ_z) + i sin(θ_z)] = r_z e^{iθ_z}. (12.16)
Let z = x_z + iy_z, and let w = x_w + iy_w be two complex numbers. Then, the following basic operations are defined:
Complex multiplication:
z × w = (x_z + iy_z) × (x_w + iy_w) = (x_z x_w − y_z y_w) + i(x_z y_w + y_z x_w).
Complex division:
z / w = (x_z + iy_z) / (x_w + iy_w) = [(x_z x_w + y_z y_w) + i(y_z x_w − x_z y_w)] / (x_w^2 + y_w^2).
Complex exponentiation:
z^w = (x_z + iy_z)^{x_w + iy_w} = (x_z^2 + y_z^2)^{(x_w + iy_w)/2} e^{iθ_z (x_w + iy_w)},
where θ_z is the argument and r_z = (x_z^2 + y_z^2)^{1/2} the complex modulus of z.
In R, the above basic operations on complex numbers can be performed using the script in Listing
12.5.
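R supports complex numbers natively; a minimal sketch:

```r
z <- 2 + 3i
w <- 1 - 1i
z * w    # multiplication: 5+1i
z / w    # division
z^w      # exponentiation
Mod(z)   # modulus r_z: sqrt(13)
Arg(z)   # argument theta_z
```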
An n-dimensional complex vector is a vector of the form V = (x_1, x_2, …, x_n), whose components x_i are complex numbers.
12.1.3 Matrices
In the two foregoing sections, we have presented some basic concepts of vector analysis. In this
section, we will discuss a generalization of vectors also known as matrices.
Let m and n be two positive integers. We call A an m × n real matrix if it consists of an ordered set of m vectors in an n-dimensional space. In other words, A is defined by a set of m × n scalars a_ij ∈ R, with i = 1, …, m and j = 1, …, n, represented in the following rectangular array:

A =
⎛ a_11 a_12 ⋯ a_1n ⎞
⎜ a_21 a_22 ⋯ a_2n ⎟ (12.17)
⎜  ⋮    ⋮   ⋱   ⋮  ⎟
⎝ a_m1 a_m2 ⋯ a_mn ⎠

The matrix A, defined by (→12.17), has m rows and n columns. For any entry a_ij, with i = 1, …, m and j = 1, …, n, of the matrix A, the index i is called the row index, whereas the index j is the column index. The set of entries (a_i1, a_i2, …, a_in) is called the i-th row of A, and the set of entries (a_1j, a_2j, …, a_mj) is called the j-th column of A.
In the case m = 1 and n > 1 , the matrix A is reduced to one row, and it is called a row vector;
likewise, when m > 1 and n = 1 , the matrix A is reduced to a single column, and it is called a
column vector. In the case m = n = 1 , the matrix A is reduced to a single value, i. e., a real scalar.
An m × n matrix A can be viewed as a list of m n-dimensional row vectors or a list of n m-dimensional column vectors. If the entries a_ij are complex numbers, then A is called a complex matrix.
The R programming environment provides a wide range of functions for matrix manipulation
and matrix operations. Several of these are illustrated in the three listings below.
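A few basic manipulations can be sketched as follows:

```r
A <- matrix(1:6, nrow = 2)  # 2 x 3 matrix, filled column by column
dim(A)                      # c(2, 3)
t(A)                        # transpose: a 3 x 2 matrix
A[1, ]                      # first row
A[, 2]                      # second column
```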
Example 12.1.3.
Consider the graph G, shown in →Figure 12.12, representing a connected network. The application of Definition →16.2.2 in Section →16.2.1 to this graph yields the following matrix:

A(G) =
⎡ 0 1 1 0 0 ⎤
⎢ 1 0 0 0 0 ⎥
⎢ 1 0 0 1 1 ⎥ (12.18)
⎢ 0 0 1 0 0 ⎥
⎣ 0 0 1 0 0 ⎦

Squaring the adjacency matrix gives

A²(G) =
⎡ 2 0 0 1 1 ⎤
⎢ 0 1 1 0 0 ⎥
⎢ 0 1 3 0 0 ⎥ (12.19)
⎢ 1 0 0 1 1 ⎥
⎣ 1 0 0 1 1 ⎦

The power of the adjacency matrix (→12.18) (here 2) gives the length of the walks counted. The entry a_ij of A²(G) gives the number of walks of length 2 from v_i to v_j. For instance, a_11 = 2 means there exist two walks of length 2 from vertex 1 to vertex 1. Moreover, a_14 = 1 means there exists only one walk of length 2 from vertex 1 to vertex 4. These numbers can be understood by inspection of the network, G, shown in →Figure 12.12.
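These numbers can be reproduced in R; the adjacency matrix below assumes the symmetric form for the undirected network (edges 1–2, 1–3, 3–4, 3–5, as read off the entries of A(G)):

```r
A <- matrix(c(0, 1, 1, 0, 0,
              1, 0, 0, 0, 0,
              1, 0, 0, 1, 1,
              0, 0, 1, 0, 0,
              0, 0, 1, 0, 0), nrow = 5, byrow = TRUE)
A2 <- A %*% A  # entry (i, j): number of walks of length 2 from v_i to v_j
A2[1, 1]       # 2
A2[1, 4]       # 1
```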
An important operation on matrices is matrix multiplication. Let A and B be two matrices; the matrix multiplication, A × B, requires the number of columns of the matrix A to be equal to the number of rows of the matrix B, i. e., if A is an m × n real matrix, then B must be an n × l real matrix. The result of this operation is an m × l real matrix, C, whose entries c_ik, for i = 1, …, m and k = 1, …, l, are given by
c_ik = ∑_{j=1}^{n} a_ij × b_jk. (12.20)
When m = 1 , then, the result is a product between a row vector and a matrix.
Note that, even if both products A × B and B × A are defined, i. e., if l = m , A × B generally
differs from B × A .
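In R, matrix multiplication is performed with the operator %*% (a minimal sketch):

```r
A <- matrix(1:6, nrow = 2, byrow = TRUE)  # 2 x 3
B <- matrix(1:6, nrow = 3, byrow = TRUE)  # 3 x 2
A %*% B  # a 2 x 2 matrix
B %*% A  # a 3 x 3 matrix, which differs from A %*% B
```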
Definition 12.2.1.
Let A be an n × n matrix, and let 0_n denote the n-dimensional null vector. A is said to be indefinite if and only if there exist x and y ∈ R^n such that x^T Ax > 0 and y^T Ay < 0.
Sparse matrix: A matrix is called sparse if it has relatively few nonzero entries. The sparsity of an n × m matrix A, generally expressed in %, is given by r/(nm) %, where r is the number of nonzero entries in A.
Definition 12.3.1.
Let A be an n × n squared matrix, and let I_n be the n × n identity matrix. If there exists an n × n matrix B such that
AB = I_n = BA, (12.21)
then B is called the inverse of A. If a squared matrix, A, has an inverse, then A is called an invertible or nonsingular matrix; otherwise, A is called a singular matrix.
The inverse of the identity matrix is the identity matrix, whereas the inverse of a lower (respectively, upper) triangular matrix is also a lower (respectively, upper) triangular matrix.
Note that for a matrix to be invertible, it must be a squared matrix.
Using R, the inverse of a squared matrix, A, can be computed as follows:
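The built-in function solve() computes the inverse (a minimal sketch; the matrix is ours):

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)
B <- solve(A)  # inverse of A
A %*% B        # the 2 x 2 identity matrix (up to rounding errors)
```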
12.4 Trace and determinant of a matrix
Let A be an n × n real matrix. The trace of A, denoted tr(A) , is the sum of the diagonal entries of
A, that is,
n
tr(A) = ∑ aii .
i=1
The determinant of A, denoted det(A), can be computed using the following recursive relation [→27]:
det(A) = a_11, if n = 1,
det(A) = ∑_{i=1}^{n} (−1)^{i+j} a_ij det(M_ij), if n > 1,
where M_ij denotes the (n−1) × (n−1) submatrix obtained by deleting the i-th row and the j-th column of A.
Let A and B be two n × n matrices and k a real scalar. Some useful properties of the determinant for A and B include the following:
1. det(AB) = det(A) det(B);
2. det(A^T) = det(A);
3. det(kA) = k^n det(A);
4. det(A) ≠ 0 if and only if A is nonsingular.
Remark 12.4.1.
If an n × n matrix, A, is a diagonal, upper triangular, or lower triangular matrix, then
det(A) = ∏_{i=1}^{n} a_ii,
i. e., the determinant of a triangular matrix is the product of the diagonal entries. Therefore, the most practical means of computing a determinant of a matrix is to decompose it into a product of lower and upper triangular matrices.
Using R, the trace and the determinant of a matrix are computed as follows:
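For instance (a minimal sketch; the matrix is ours):

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)
sum(diag(A))  # trace: 5
det(A)        # determinant: 2 * 3 - 1 * 1 = 5
```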
Definition 12.5.1.
Consider m vectors {A_i, i = 1, 2, …, m}, where A_i ∈ R^n. If the only set of scalars λ_i for which
∑_{i=1}^{m} λ_i A_i = 0_n
is λ_1 = λ_2 = ⋯ = λ_m = 0, then the vectors are said to be linearly independent. Otherwise, the vectors are said to be linearly dependent.
Definition 12.5.2.
A subspace of R^n is a nonempty subset of R^n, which is also a vector space.
Definition 12.5.3.
The set of all linear combinations of a set of m vectors {A_i, i = 1, 2, …, m} in R^n is a subspace, called the span of these vectors, defined as
span{A_1, …, A_m} = {V : V = ∑_{i=1}^{m} λ_i A_i with λ_i ∈ R ∀ i = 1, …, m}.
Definition 12.5.4.
A linearly independent set of vectors, which spans a subspace S, is called a basis of S.
All the bases of a subspace S have the same number of elements, and this number is called the dimension of S, denoted dim(S).
Two key subspaces are associated with any m × n matrix A, i. e., A ∈ R^{m×n}:
1. the subspace
im(A) = {b ∈ R^m : b = Ax for some x ∈ R^n},
called the image of A;
2. the subspace
ker(A) = {x ∈ R^n : Ax = 0_m},
called the kernel of A.
Definition 12.5.5.
The rank of a matrix, A, denoted rank(A), is the maximum number of linearly independent rows or columns of the matrix A, and it is given by rank(A) = dim(im(A)).
Some useful properties of the rank include the following:
rank(A^T A) = rank(AA^T) = rank(A) = rank(A^T);
rank(AB) ≤ min(rank(A), rank(B));
If A is an n × n matrix with rank(A) = n, then rank(AB) = rank(B);
rank(A) + rank(B) − n ≤ rank(AB) (this is known as Sylvester's rank inequality).
Definition 12.5.6.
Let A be an m × n real matrix, i. e., A ∈ R^{m×n}, and u ∈ R^m and v ∈ R^n. Then, the matrix B ∈ R^{m×n} such that
B = A + uv^T (12.22)
is called a rank-1 change of A.
Let A be a nonsingular n × n matrix, and let U and V be two n × k matrices. Then,
(A + UV^T)^{−1} = A^{−1} − A^{−1} U (I_k + V^T A^{−1} U)^{−1} V^T A^{−1}, (12.23)
which is known as the Woodbury formula. In the special case of a rank-1 change, it reduces to
(A + uv^T)^{−1} = A^{−1} − (A^{−1} u v^T A^{−1}) / (1 + v^T A^{−1} u). (12.24)
This formula provides the easiest way to compute the inverse of the matrix B, the rank-1 change of A. Thus,
B^{−1} = (I_n − (1 / (1 + v^T A^{−1} u)) (A^{−1} u) v^T) A^{−1}.
Let A be an n × n matrix. The eigenvalues of A, denoted λ_1(A), λ_2(A), …, λ_n(A), are the roots of the characteristic polynomial det(A − λI_n), i. e., the eigenvalues are solutions to the following equation [→27]:
det(A − λI_n) = 0. (12.25)
A squared matrix, A, is called nonsingular if and only if all its eigenvalues are nonzero. Moreover, the trace of A equals the sum of its eigenvalues, i. e.,
tr(A) = ∑_{i=1}^{n} a_ii = ∑_{i=1}^{n} λ_i(A),
and, if A is nonsingular,
λ_i(A^{−1}) = 1/λ_i(A), for i = 1, 2, …, n.
Definition 12.6.1.
The spectral radius of a squared matrix A, denoted ρ(A), is given by
ρ(A) = max_{i=1,2,…,n} |λ_i(A)|.
Definition 12.6.2.
A non-null vector x, such that
Ax = λ_i(A)x,
is called a (right) eigenvector of A associated with the eigenvalue λ_i(A). For each eigenvalue λ_i(A), its right eigenvector x is found by solving the system
(A − λ_i(A)I_n)x = 0_n, for i = 1, …, n.
Remark 12.6.1.
If A is diagonal, upper triangular, or lower triangular, then its eigenvalues are given by its diagonal entries, i. e.,
λ_i(A) = a_ii, for i = 1, 2, …, n.
If A is symmetric, then there exists an orthogonal matrix Q such that
Q^T AQ = D,
where D is a diagonal matrix whose diagonal entries are the eigenvalues of A.
A is said to be indefinite if and only if λ_i(A) > 0 for some i and λ_j(A) < 0 for some j. If a nonsingular n × n symmetric matrix, A, is positive semi-definite (respectively, negative semi-definite), then A is positive definite (respectively, negative definite).
Using R, the eigenvalues for a matrix, A, and their associated eigenvectors, as well as the spectral radius of the matrix A, can be computed as follows:
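For instance, using the built-in function eigen() (the matrix is ours):

```r
A <- matrix(c(2, 1, 1, 2), nrow = 2)
e <- eigen(A)
e$values            # eigenvalues: 3 and 1
e$vectors           # associated eigenvectors (as columns)
max(abs(e$values))  # spectral radius: 3
```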
Definition 12.7.1.
A matrix norm, denoted by ‖·‖, is a scalar function defined from R^{m×n} to R, such that
1. ‖A‖ ≥ 0 for all A ∈ R^{m×n};
2. ‖A‖ = 0 ⟺ A = 0_{m×n}, where 0_{m×n} denotes an m × n null matrix;
3. ‖kA‖ = |k| × ‖A‖ for any scalar k and all A ∈ R^{m×n};
4. ‖A + B‖ ≤ ‖A‖ + ‖B‖ for all A, B ∈ R^{m×n}.
Furthermore, if
‖AB‖ ≤ ‖A‖ × ‖B‖, for all A ∈ R^{m×n} and B ∈ R^{n×q},
then the matrix norm is said to be consistent.
Definition 12.7.2.
Let A ∈ R m×n
and x ∈ R . Then, the subordinate matrix p-norm of A, denoted ∥A∥ , is defined
n
p
In particular,
1. the subordinate matrix 1-norm of A is defined by
m
∥Ax∥1
∥A∥1 =max = max ∑ |aij |.
x≠0n ∥x∥1 j=1,2,…,n
i=1
if m = n , then
Furthermore, if m = n , we have
Remark 12.7.1.
The subordinate matrix p-norm is consistent and, for any A ∈ R m×n
and x ∈ R , n
Definition 12.7.3.
The Frobenius norm of a matrix A is defined by
1
m n 2
2 T
∥Ax∥F = (∑ ∑ |aij | ) = tr(AA ).
i=1 j=1
12.8.1 LU factorization
Let A ∈ R^{n×n}. The LU factorization of A consists of the decomposition of A into a product of a unit lower triangular matrix L and an upper triangular matrix U, that is,

A = LU,

where

L = ⎡ 1    0    0    ⋯  0 ⎤      U = ⎡ u11  u12  u13  ⋯  u1n ⎤
    ⎢ l21  1    0    ⋯  0 ⎥          ⎢ 0    u22  u23  ⋯  u2n ⎥
    ⎢ ⋮    ⋮    ⋮    ⋱  ⋮ ⎥   and    ⎢ ⋮    ⋮    ⋮    ⋱  ⋮   ⎥
    ⎣ ln1  ln2  ln3  ⋯  1 ⎦          ⎣ 0    0    0    ⋯  unn ⎦.

The following relation links the entries of A to those of the matrices L and U:

a_ij = ∑_{k=1}^{min(i,j)} l_ik u_kj.

However, when a principal submatrix of A is singular, then a permutation, i. e., the reordering of the rows of A, is required. If A is nonsingular, then there exists a permutation matrix P ∈ R^{n×n} such that

P A = LU. (12.26)

Equivalently, one can write P A = LDÛ, where D is a diagonal matrix and Û is a unit upper triangular matrix.
Using R, the LU factorization of a matrix, A, is performed using the command expand(lu(A)), which
outputs three matrices L, U, and P. The matrices L and U are the lower and upper triangular
matrices we are looking for, whereas the matrix P contains all the row permutation operations
that have been carried out on the original matrix A for the purpose of obtaining L and U.
Therefore, the product LU gives a row-permuted version of A, whereas the product P LU
enables the recovery of the original matrix A.
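As a sketch, the command above assumes the Matrix package, which provides both lu() and expand(); the matrix A below is an illustrative example:

```r
# LU factorization with partial pivoting via the Matrix package.
# The matrix A is an illustrative example.
library(Matrix)

A <- Matrix(c(2, 1, 1,
              4, 3, 3,
              8, 7, 9), nrow = 3, byrow = TRUE)

f <- expand(lu(A))  # list with components L, U, and P
f$L                 # unit lower triangular factor
f$U                 # upper triangular factor
f$P                 # permutation matrix

f$P %*% f$L %*% f$U # recovers the original matrix A
```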
12.8.2 Cholesky factorization
Let A be an n × n symmetric matrix. If A has an LU factorization, then there exist a unit lower triangular matrix L ∈ R^{n×n} and a diagonal matrix D ∈ R^{n×n} such that

A = LDL^T. (12.27)

If a principal submatrix of A is singular, then a permutation, i. e., a reordering of both rows and columns of A, is required, and this results in the following factorization:

P A P^T = LDL^T, (12.28)

where P ∈ R^{n×n} is a permutation matrix. If, in addition, A is positive definite, then the diagonal entries of D are positive, and

A = LDL^T = L̃L̃^T, (12.29)

where L̃ = LD^{1/2} and D^{1/2} is the diagonal matrix with entries d_ii^{1/2}. The factorization (12.29) is known as the Cholesky factorization of A.
12.8.3 QR factorization
Let A ∈ R^{m×n}. Then,
1. if m = n, and A is nonsingular, then there exists an orthogonal matrix Q ∈ R^{n×n} and an upper triangular matrix R ∈ R^{n×n} such that A = QR;
2. if m > n and rank(A) = n, then there exists an orthogonal matrix Q ∈ R^{m×m} such that the QR factorization of A is defined by

A = Q [ R ; 0_{m−n,n} ] ⟺ Q^T A = [ R ; 0_{m−n,n} ], (12.30)

since Q is orthogonal, i. e., Q^T Q = I_m. Here, 0_{m−n,n} denotes the (m − n) × n matrix of zeros.

When rank(A) < n, i. e., a principal submatrix of A is singular, then a permutation, i. e., the reordering, of the columns of A is introduced, and the QR factorization of A is defined by

Q^T A P = [ R ; 0_{m−n,n} ],

where P ∈ R^{n×n} is a permutation matrix for reordering the columns of A.
Let V ∈ R^{m×n} and W ∈ R^{m×(m−n)} denote the n first columns and (m − n) last columns of Q, respectively. Then

Q^T Q = [ V^T V  V^T W ; W^T V  W^T W ] = I_m,

i. e., V^T V = I_n, W^T W = I_{m−n}, V^T W = 0_{n,m−n}, and W^T V = 0_{m−n,n} (that is, V and W are orthogonal). Substituting Q with [V W] in (→12.30) gives

Q^T A = [ R ; 0_{m−n,n} ] ⟺ [ V^T ; W^T ] A = [ R ; 0_{m−n,n} ],

and therefore

V^T A = R ⟺ A = V R, (12.31)
W^T A = 0_{m−n,n}. (12.32)

Equations (→12.31) and (→12.32) yield several important results, which link the QR factorization of a matrix, A, to its subspaces im(A) (i. e., the range of A) and ker(A) (i. e., the kernel or the null space of A). In particular,
1. since V is an orthogonal matrix, then, thanks to (→12.31), the columns of V form an orthogonal basis for the subspace im(A), that is, A is uniquely determined by the linear combination of the columns of V through A = V R. Consequently, the matrix V V^T provides an orthogonal projection onto the subspace im(A).
12.8.4 Singular value decomposition
Let A ∈ R^{m×n}. Then there exist orthogonal matrices U ∈ R^{m×m} and V ∈ R^{n×n}, i. e., U^T U = I_m and V^T V = I_n, such that

A = U D V^T,

where D ∈ R^{m×n} is a diagonal matrix whose diagonal entries, d_ii or simply d_i, for i = 1, 2, …, min(m, n), are rearranged in descending order.

Definition 12.8.1.
The singular values of A, denoted σ_i(A), are defined by

σ_i(A) = √(λ_i(A^T A)), for i = 1, 2, …, p,

where p = min(m, n) and λ_i(A^T A), i = 1, 2, …, p, are the nonzero eigenvalues of the matrix A^T A. If A has rank r, then A has exactly r positive singular values, σ_1(A) ≥ σ_2(A) ≥ ⋯ ≥ σ_r(A) > 0; that is, the rank of the matrix is the number of its nonzero singular values. However, due to rounding errors, this approach to determine the rank is not straightforward in practice, as it is unclear how small a singular value should be to be considered as zero.

Furthermore, if A has a full rank, i. e., r = min(m, n), then the condition number of A, denoted κ(A), is given by

κ(A) = σ_1(A)/σ_r(A).
12.9 Systems of linear equations
A system of m linear equations in n unknowns consists of a set of algebraic relationships of the form

∑_{j=1}^{n} a_ij x_j = b_i, i = 1, …, m, (12.33)

where x_j are the unknowns, whereas a_ij, the coefficients of the system, and b_i, the entries of the right-hand side, are given constants, for i = 1, …, m and j = 1, …, n. In matrix form, the system (→12.33) reads

Ax = b, (12.34)

where A = (a_ij) ∈ R^{m×n}, x ∈ R^n, and b ∈ R^m.

Theoretically, the system (→12.34) has a solution if and only if b ∈ im(A). If, in addition, ker(A) = {0}, then the solution is unique. When a solution exists for the system (→12.34) and A is square and nonsingular, it can be expressed through Cramer's rule:

x_j = det(M_j)/det(A), j = 1, …, n, (12.35)

where M_j is the matrix obtained by substituting the j-th column of A with the right-hand side term b.

However, when the size of the matrix A is large, Cramer's method is not sustainable, and computing the solution, x, requires efficient numerical methods. More often, the efficiency with which these methods work depends on the patterns or structure of the matrix A. Depending on the form of the matrix A, the systems of the form (→12.34) can be categorized as follows:
1. Triangular linear systems: If A = L ∈ R^{n×n} is a nonsingular lower triangular matrix, then the solution to the system (→12.34) can be readily obtained using the following method, known as forward substitution:

x_1 = b_1/l_11, (12.36)
x_i = (b_i − ∑_{j=1}^{i−1} l_ij x_j)/l_ii, for i = 2, 3, …, n. (12.37)

Similarly, if A = U ∈ R^{n×n} is a nonsingular upper triangular matrix, then the solution to the system (→12.34) can easily be obtained using the following method, known as backward substitution:

x_n = b_n/u_nn, (12.38)
x_i = (b_i − ∑_{j=i+1}^{n} u_ij x_j)/u_ii, for i = n − 1, n − 2, …, 1. (12.39)

2. Factorized systems: if A admits the LU factorization P A = LU, then the system (→12.34) is equivalent to

P^{−1}LUx = b ⟺ LUx = P b, (12.40)

which can be solved by a forward substitution (for Lz = Pb) followed by a backward substitution (for Ux = z).
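The forward-substitution recursion can be sketched directly in R (a didactic implementation for illustration; in practice, one would use solve()):

```r
# Forward substitution for a nonsingular lower triangular system L x = b.
forward_sub <- function(L, b) {
  n <- length(b)
  x <- numeric(n)
  x[1] <- b[1] / L[1, 1]
  for (i in 2:n) {
    # subtract the already-computed part of row i, then divide by the pivot
    x[i] <- (b[i] - sum(L[i, 1:(i - 1)] * x[1:(i - 1)])) / L[i, i]
  }
  x
}

# Illustrative lower triangular system
L <- matrix(c(2, 0, 0,
              1, 3, 0,
              4, 5, 6), nrow = 3, byrow = TRUE)
b <- c(2, 5, 20)

x <- forward_sub(L, b)
L %*% x   # reproduces b
```

Backward substitution is the mirror image: the loop runs from i = n − 1 down to 1 and sums over the columns j > i.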
2. Over-determined systems: If m > n, then the system (→12.34) has more equations than variables, and such a system is said to be over-determined. When b ∈ im(A), then the system (→12.34) has a solution, and so does the associated system of normal equations

A^T Ax = A^T b.

3. Under-determined systems: If m < n, then the system (→12.34) has fewer equations than variables, and such a system is said to be under-determined. If the equations in (→12.34) are consistent, then the system has an infinite number of solutions. If A is a full rank matrix, that is, rank(A) = m, then the matrix AA^T is a nonsingular matrix. Thus, one of the solutions to the system is given by x = A^T(AA^T)^{−1}b.
Let us consider the following system of linear equations:

⎧ 3x + 2y − 5z = 12,
⎨ x − 3y + 2z = −13, (12.41)
⎩ 5x − y + 4z = 10.

Thus,

A = ⎡ 3   2  −5 ⎤          ⎡ 12  ⎤
    ⎢ 1  −3   2 ⎥ and b =  ⎢ −13 ⎥.
    ⎣ 5  −1   4 ⎦          ⎣ 10  ⎦
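Using R, the system (→12.41) can be solved with the base function solve():

```r
# Solve the 3x3 system (12.41): A x = b.
A <- matrix(c(3,  2, -5,
              1, -3,  2,
              5, -1,  4), nrow = 3, byrow = TRUE)
b <- c(12, -13, 10)

x <- solve(A, b)
x
A %*% x   # reproduces b, confirming the solution
```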
12.10 Exercises
1. Let U⃗ = (−1, −2, 4) and V⃗ = (2, 3, 5) be two 3-dimensional vectors:
(a) …
(b) …
2. …
3. Express U⃗ in cylindrical and … coordinates.
4. Solve the following systems of linear equations using R:
(a)–(m) …
13 Analysis
Like linear algebra, analysis [→158] is omnipresent in nearly all applications of mathematics. In general, analysis deals with examining convergence, limits of functions, differentiation, and integration, as well as metrics. In this chapter, we provide an introduction to these topics and demonstrate how to conduct a numerical analysis using R.
13.1 Introduction
Differentiation and integration are fundamental mathematical concepts, having a wide range of
applications in many areas of science, particularly in physics, chemistry, and engineering [→158].
Both concepts are intimately connected, as integration is the inverse process of differentiation,
and vice versa. These concepts are especially important for descriptive models, e. g., providing
information about the position of an object in space and time (physics) or the temporal evolution
of the price of a stock (finance). Such models require the precise definition of the functional,
describing the system of interest, and related mathematical objects defining the dynamics of the
system.
Before we begin investigating limiting values, we introduce a class of functions, namely, real sequences. For limiting values of complex sequences or functions, we refer the reader to [→158].
Definition 13.2.1.
A real sequence is a function a_n: N ⟶ R.

We also write (a_n)_{n∈N} = (a_1, a_2, a_3, …, a_n, …). Typical examples of sequences include:

(a_n)_{n∈N} = (1/2, 1/4, 1/6, …), (13.1)
(b_n)_{n∈N} = (1³, 2³, 3³, …). (13.2)

From the above sequences, it can be observed that a_n and b_n have the closed forms a_n = 1/(2n) and b_n = n³, respectively. Now, we are ready to define the limiting value of a real sequence.
Definition 13.2.2.
A number l is called the limiting value or limes of a given real sequence (a_n)_{n∈N}, if for all ε > 0 there exists N_0(ε) such that |a_n − l| < ε for all n > N_0(ε). In this case, the sequence (a_n)_{n∈N} is said to converge to l, and the following short-hand notation is used to summarize the previous statement: lim_{n⟶∞} a_n = l.

Example 13.2.1.
Consider the sequence (a_n)_{n∈N} with a_n = 1 − 1/n. Then

|a_n − 1| = |(1 − 1/n) − 1| = 1/n < ε. (13.3)

Thus, we find n > 1/ε =: N_0(ε). In summary, for all ε > 0, there exists a number N_0(ε) := 1/ε such that |a_n − 1| < ε for all n > N_0(ε). For example, if we set ε = 1/10, then n ≥ 11. This means that for all elements of the given sequence a_n = 1 − 1/n, starting from n = 11, |a_n − 1| < 1/10 holds.
Before we give some examples of basic limiting values of sequences, we provide the following
proposition, which is necessary for the calculations that follow [→158].
Proposition 13.2.1.
Let (a_n)_{n∈N} and (b_n)_{n∈N} be two convergent sequences with lim_{n⟶∞} a_n = a and lim_{n⟶∞} b_n = b. Then, the following relationships hold:

lim_{n→∞} (a_n + b_n) = a + b, (13.4)
lim_{n→∞} (a_n · b_n) = a · b, (13.5)
lim_{n→∞} a_n/b_n = (lim_{n→∞} a_n)/(lim_{n→∞} b_n) = a/b, provided b_n ≠ 0 and b ≠ 0. (13.6)
Example 13.2.2.
Let us examine the convergence of the following two sequences:
a_n = (3n + 1)/(n + 5), (13.7)
b_n = (−1)^n. (13.8)

For a_n, we have

lim_{n→∞} (3n + 1)/(n + 5) = lim_{n→∞} [n(3 + 1/n)]/[n(1 + 5/n)]
                           = [3 + lim_{n→∞}(1/n)]/[1 + lim_{n→∞}(5/n)] = (3 + 0)/(1 + 0) = 3. (13.9)

Here, both (1/n)_{n∈N} and (5/n)_{n∈N} converge to 0. By examining the values of b_n, we observe that its values alternate, i. e., they always flip between −1 and 1. According to Definition →13.2.2, the sequence b_n is not convergent and, hence, does not have a limiting value.
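The convergence of a_n can be checked numerically in R by evaluating the sequence for increasing n:

```r
# Numerical check: a_n = (3n + 1) / (n + 5) approaches 3 as n grows.
a <- function(n) (3 * n + 1) / (n + 5)
a(c(10, 100, 1000, 100000))
# approximately 2.067, 2.867, 2.986, 2.99986
```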
Definition 13.2.3.
Let f(x) be a real function and (x_n) a sequence that belongs to the domain of f(x). If all sequences of the values f(x_n) converge to l, then l is called the limiting value for x → ±∞, and we write lim_{x→±∞} f(x) = l.
For a general function, f: X → Y, we call the set X the domain and Y the co-domain of the function f.
Example 13.2.3.
Let us determine the limiting value of the function f(x) = (2x − 1)/x for large and positive x. This means, we examine lim_{x→∞} (2x − 1)/x and find

lim_{x→∞} (2x − 1)/x = lim_{x→∞} (2 − 1/x) = 2 − lim_{x→∞} (1/x) = 2. (13.10)

Here, we used Proposition →13.2.1 for functions, as it can be formulated accordingly (see [→158]). The limiting value of f(x) = (2x − 1)/x for large x can be seen in →Figure 13.2.
We conclude this section by stating the definition for the convergence of a function, f (x) , if x
tends to a finite value x . 0
Definition 13.2.4.
Let f(x) be a real function defined in a neighborhood of x_0. If for all sequences (x_n) in the domain of f(x) with x_n → x_0, the values f(x_n) converge to l, then l is called the limiting value of f(x) for x → x_0, and we write lim_{x→x_0} f(x) = l.
Example 13.2.4.
Let us calculate the limiting value lim_{x→−2} (2x³ − 8x)/(x + 2). Note that the function f(x) = (2x³ − 8x)/(x + 2) is not defined at x = −2. However, factorizing the numerator gives 2x³ − 8x = 2x(x + 2)(x − 2), and hence

lim_{x→−2} (2x³ − 8x)/(x + 2) = lim_{x→−2} [2x(x + 2)(x − 2)]/(x + 2) = lim_{x→−2} 2x(x − 2) = 16. (13.11)
The following sections will utilize the concepts introduced here to define differentiation and
integration.
13.3 Differentiation
Let f: R ⟶ R be a given continuous function. Then, f is called differentiable at the point x_0 if the limit lim_{h⟶0} (f(x_0 + h) − f(x_0))/h exists. In this case, the derivative of f at x_0 can be approximated by

f′(x_0) = df(x_0)/dx ≈ (f(x_0 + h) − f(x_0))/h, for h ⟶ 0. (13.13)

Therefore, the derivative of a function f at a point x_0 can be viewed as the slope of the tangent line of f(x) at the point x_0, as illustrated geometrically in →Figure 13.3. The tangent line in →Figure 13.3 (left) corresponds to the limit of the displacement of the secant line in →Figure 13.3 (left) when h tends to zero, i. e., when x_0 + h is getting closer to x_0. The intermediate dashed lines correspond to the different positions of the secant line as h decreases to 0. →Figure 13.3 (right) shows the tangent line when h ≈ 0, i. e., x_0 + h ≈ x_0, which corresponds approximately to equation (→13.13).
The above approximation can be extended to multivariate real functions as follows: Let f be a scalar-valued multivariable real function, i. e., f: R^n ⟶ R. Then, the first-order partial derivative of f with respect to x_i is defined by

∂f(x)/∂x_i = lim_{h⟶0} (f(x + he_i) − f(x))/h, (13.14)

where e_i denotes the i-th unit vector.
Definition 13.3.2.
The gradient of a function f, denoted ∇f, is defined, in Cartesian coordinates, as follows:

∇f = (∂f/∂x_1) e_1 + (∂f/∂x_2) e_2 + ⋯ + (∂f/∂x_i) e_i + ⋯ + (∂f/∂x_n) e_n,

where the e_k, k = 1, …, n, are the orthogonal unit vectors pointing in the coordinate directions, i. e.,

e_i = (0, …, 0, 1, 0, …, 0)^T, (13.15)

with the 1 in the i-th position. Thus, in (R^n, ∥·∥_2), where ∥x∥_2 = √⟨x, x⟩ is the Euclidean norm, the gradient of f can be rewritten as follows:

∇f = df^T = (∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n)^T.
Definition 13.3.3.
The Hessian of f, denoted ∇²f, is an n × n matrix of second-order partial derivatives of f, if these exist, organized as follows:

∇²f = ⎡ ∂²f/∂x_1∂x_1  ∂²f/∂x_1∂x_2  …  ∂²f/∂x_1∂x_n ⎤
      ⎢ ∂²f/∂x_2∂x_1  ∂²f/∂x_2∂x_2  …  ∂²f/∂x_2∂x_n ⎥
      ⎢ ⋮              ⋮              ⋱  ⋮            ⎥
      ⎣ ∂²f/∂x_n∂x_1  ∂²f/∂x_n∂x_2  …  ∂²f/∂x_n∂x_n ⎦.
Thus, the Hessian matrix describes the local curvature of the function f.
Example 13.3.1.
Let f: R³ ↦ R be defined by f(x) = f(x_1, x_2, x_3) = x_1³ + x_2² + log(x_3). Then,

∇f(x) = (3x_1², 2x_2, 1/x_3)^T,

and

∇²f(x) = ⎡ 6x_1  0  0       ⎤
         ⎢ 0     2  0       ⎥
         ⎣ 0     0  −1/x_3² ⎦.
Definition 13.3.4.
Let f be a multivalued function, i. e., f: R^n ⟶ R^m. Then, the Jacobian of f, denoted J_f, is an m × n matrix of the first-order partial derivatives, if they exist, of the m real-valued component functions f_1, …, f_m:

J_f = ⎡ ∂f_1/∂x_1  …  ∂f_1/∂x_n ⎤
      ⎢ ⋮          ⋱  ⋮         ⎥
      ⎣ ∂f_m/∂x_1  …  ∂f_m/∂x_n ⎦.
The Jacobian generalizes the gradient of a scalar-valued function of several variables to m real-
valued component functions. Therefore, the Jacobian for a scalar-valued multivariable function, i.
e., when m = 1 , is the gradient.
Example 13.3.2.
Let f: R³ ↦ R² be defined by

f(x) = f(x_1, x_2, x_3) = ( f_1(x_1, x_2, x_3) ; f_2(x_1, x_2, x_3) ) = ( x_1 x_2² ; x_3² + 2x_1 x_2 ).

Then,

J_f(x) = ⎡ x_2²   2x_1x_2  0    ⎤
         ⎣ 2x_2   2x_1     2x_3 ⎦.
Using R, the gradient of a function f at a point x is computed using the command grad(f, x).
Since the gradient of a function of a single variable is nothing but the first derivative of the
function, then the same command is used to compute the derivative of a function of one variable
at a given point. By contrast, the Hessian and the Jacobian of a function f at a point x are
computed using the command hessian(f, x) and jacobian(f, x), respectively.
For example, the gradient, the Hessian, and the Jacobian of the function f(x, y, z) = x²y + sin(z) at the point (x = 2, y = 2, z = 5) can be computed using R as follows:
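One possible listing, assuming the numDeriv package (grad(), hessian(), and jacobian() are also provided by pracma):

```r
# Gradient, Hessian, and Jacobian of f(x, y, z) = x^2 * y + sin(z)
# at the point (2, 2, 5), using the numDeriv package.
library(numDeriv)

f <- function(v) v[1]^2 * v[2] + sin(v[3])
x0 <- c(2, 2, 5)

grad(f, x0)      # analytically: (2xy, x^2, cos(z)) = (8, 4, cos(5))
hessian(f, x0)   # analytically: rows (2y, 2x, 0), (2x, 0, 0), (0, 0, -sin(z))
jacobian(f, x0)  # for a scalar-valued f, a 1 x 3 matrix equal to the gradient
```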
Let us consider the following example, from economics, for determining extreme values of
economic functions [→182] (see also Example →13.8.1). In this example, the economic functions
of interest are real polynomials [→135].
Example 13.3.3.
Let

C(x) = (1/3)x³ − (3/2)x² + 7 (13.16)

be an economic cost function [→182] describing the costs depending on a quantity unit x. To find the minima of C(x), we use its derivative, i. e.,

C′(x) = x² − 3x. (13.17)

By solving

C′(x) = x² − 3x = 0, (13.18)

we obtain the candidate points x = 0 and x = 3. To classify them, we evaluate the second derivative C″(x) = 2x − 3 at x = 3. This yields C″(3) = 3 > 0. Hence, we found a minimum of C(x) at x = 3. This can also be seen in →Figure 13.4.
Figure 13.4 An example of an economic cost function C(x) with its minimum located at x = 3.
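The minimum of C(x) can also be verified numerically, e. g., with the base R function optimize(); the search interval below is chosen for illustration:

```r
# Numerical minimization of the cost function C(x) = (1/3)x^3 - (3/2)x^2 + 7.
C <- function(x) (1/3) * x^3 - (3/2) * x^2 + 7
optimize(C, interval = c(0, 10))
# $minimum is approximately 3, $objective approximately C(3) = 2.5
```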
Definition 13.4.1.
Let D denote the domain of a function f. A point x* ∈ D is called a global maximum of f if f(x) ≤ f(x*) for all x ∈ D.
Definition 13.4.2.
Let D denote the domain of a function f. A point x* ∈ D is called a global minimum of f if f(x) ≥ f(x*) for all x ∈ D.
Definition 13.4.3.
Let D denote the domain of a function f, and let J ⊂ D. A point x* ∈ J is called a local maximum of f if f(x) ≤ f(x*) for all x ∈ J, i. e., x* is a global maximum of f in J.
Definition 13.4.4.
Let D denote the domain of a function f, and let J ⊂ D. A point x* ∈ J is called a local minimum of f if f(x) ≥ f(x*) for all x ∈ J, i. e., x* is a global minimum of f in J.
To characterize extrema of continuous functions, we invoke the well-known Weierstrass
extreme value theorem.
Theorem 13.4.1 (Weierstrass extreme value theorem).
Let D denote the domain of a function f, and let J = [a, b] ⊂ D. If f is continuous on J, then f achieves both its maximum value, denoted M, and its minimum value, denoted m. In other words, there exist x*_M and x*_m in J such that

f(x*_M) = M and f(x*_m) = m,

and, for all x ∈ J,

m ≤ f(x) ≤ M.
We want to emphasize that extrema of functions possess horizontal tangents. These extrema can be calculated using basic calculus. Suppose that we have a real and continuous function on a domain D. The points x ∈ D satisfying the equation f′(x) = 0 are candidates for extrema (maximum or minimum). In Section →13.3, we explained that the first derivative of a function at a point x corresponds to the slope of the tangent at x. Therefore, after solving the equation f′(x) = 0 and after calculating f″(x), it is necessary to distinguish the following cases:

f″(x_0) > 0 ⟹ f(x) has a minimum at x_0 ∈ D;
f″(x_0) < 0 ⟹ f(x) has a maximum at x_0 ∈ D;
f″(x_0) = 0 ⟹ no conclusion can be drawn without further examination.
For a numerical solution to this problem, the package ggpmisc in R can be used to find extrema of a
function, as illustrated in Listing 13.2. The plot from the output of the script is shown in →Figure
13.6, where the colored dots correspond to the different extrema of the function
f (x) = 23.279 − 29.3598 exp (−0.00093393x) sin (0.00917552x + 20.515),
A Taylor series expansion breaks the nonlinearity of a function down into its polynomial components. This yields a function that is more linear than f(x). The simplest, yet most frequently used, approximation is the linearization of a
function. Taylor series expansions have many applications in mathematics, physics, and
engineering. For instance, they are used to approximate solutions to differential equations, which
are otherwise difficult to solve.
Definition 13.5.1.
A one-dimensional Taylor series expansion of an infinitely differentiable function f(x), at a point x = x_0, is given by

f(x) = ∑_{n=0}^{∞} [f^(n)(x_0)/n!] (x − x_0)^n
     = f(x_0) + [f′(x_0)/1!](x − x_0) + [f″(x_0)/2!](x − x_0)² + [f^(3)(x_0)/3!](x − x_0)³ + ⋯,

where f^(n) denotes the n-th derivative of f. If x_0 = 0, then the expansion may also be called a Maclaurin series.
Below, we provide examples of Taylor series expansions for some common functions, at a point x = x_0:

exp(x) = exp(x_0)[1 + (x − x_0) + (1/2)(x − x_0)² + (1/6)(x − x_0)³ + ⋯],
ln(x) = ln(x_0) + (x − x_0)/x_0 − (x − x_0)²/(2x_0²) + (x − x_0)³/(3x_0³) − ⋯,
cos(x) = cos(x_0) − sin(x_0)(x − x_0) − (1/2)cos(x_0)(x − x_0)² + (1/6)sin(x_0)(x − x_0)³ + ⋯,
sin(x) = sin(x_0) + cos(x_0)(x − x_0) − (1/2)sin(x_0)(x − x_0)² − (1/6)cos(x_0)(x − x_0)³ + ⋯.
The accuracy of a Taylor series expansion depends on the function to be approximated, the point at which the approximation is made, and the number of terms used in the approximation, as illustrated in →Figure 13.7 and →Figure 13.8.
Several packages in R can be used to obtain the Taylor series expansion of a function. For
instance, the library Ryacas can be used to obtain the expression of the Taylor series expansion of
a function, which can then be evaluated. The library pracma, on the other hand, provides an
approximation of the function at a given point using its corresponding Taylor series expansion.
The scripts below illustrate the usage of these two packages.
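For example, with pracma, a Taylor polynomial of exp(x) around x_0 = 0 can be computed as follows (order n = 4 chosen for illustration):

```r
# Taylor polynomial of exp(x) around x0 = 0 using pracma.
library(pracma)

p <- taylor(function(x) exp(x), x0 = 0, n = 4)  # polynomial coefficients,
p                                               # highest power first
polyval(p, 0.1)   # approximates exp(0.1) = 1.10517...
```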
→Figure 13.7, produced using Listing 13.4, shows the graph of the function f(x) = exp(x) alongside its corresponding Taylor approximation of order n = 5, for x ∈ [−1, 1]. It is clear that this Taylor series approximation of the function f(x) is quite accurate for x ∈ [−1, 1], since the graphs of both functions match in this interval.
Figure 13.7 Taylor series approximation of the function f (x) =exp (x) . The approximation
order is n = 5 .
On the other hand, →Figure 13.8, produced using Listing 13.5, shows the graph of the function f(x) = 1/(1 − x) alongside its corresponding Taylor approximation of order n = 5, for x ∈ [−1, 1]. The Taylor series approximation of the function f(x) is accurate on most of the interval x ∈ [−1, 1], except near 1, where the function f(x) and its Taylor approximation diverge. In fact, when x tends to 1, f(x) tends to ∞, and the corresponding Taylor series approximation cannot keep pace with the growth of the function f(x).
Figure 13.8 Taylor series approximation of the function f(x) = 1/(1 − x). The approximation order is n = 5.
13.6 Integrals
The integral of a function f(x) over the interval [a, b], denoted ∫_a^b f(x)dx, is given by the area enclosed by the graph of f(x), the x-axis, and the vertical lines x = a and x = b. If no limits are specified, the indefinite integral of f(x) is denoted

∫ f(x)dx,

and there exists a function F with F′(x) = f(x) such that

∫ f(x)dx = F(x) + C.

The function F is also referred to as the antiderivative of f, whereas C is called the integration constant.
Theorem 13.6.1 (Uniqueness Theorem).
If two functions, F and G, are antiderivatives of a function f on an interval I, then there exists a constant
C such that
F (x) = G(x) + C.
This result justifies the integration constant C for the indefinite integral.
Theorem 13.6.2 (First fundamental theorem of calculus).
Let f be a bounded function on the interval [a, b] and continuous on (a, b). Then, the function

F(x) = ∫_a^x f(z)dz, a ≤ x ≤ b,

is continuous on [a, b] and differentiable on (a, b), with F′(x) = f(x).
The results from the above theorems demonstrate that differentiation is simply the inverse of integration.
Numerically, the integral can be approximated by partitioning [a, b] into n subintervals of widths Δx_i with sample points x_i, i. e.,

∫_a^b f(x)dx ≈ ∑_{i=1}^{n} f(x_i)Δx_i. (13.20)

The last term in equation (→13.20) is known as the Riemann sum. When n tends to ∞, then Δx_i tends to 0, for all i = 1, 2, 3, …, n, and, consequently, the Riemann sum tends toward the real value of the integral of f(x) over the interval [a, b], as illustrated in →Figure 13.9.
Figure 13.9 Geometric interpretation of the integral. Left: Exact form of an integral. Right:
Numerical approximation.
Using R, a one-dimensional integral over a finite or infinite interval is computed using the command integrate(f, lowerLimit, upperLimit), where f is the function to be integrated, and lowerLimit and upperLimit are the lower and upper limits of the integral, respectively.
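For example, ∫_0^1 x² dx = 1/3 can be computed as:

```r
# One-dimensional numerical integration with base R.
integrate(function(x) x^2, lower = 0, upper = 1)
# approximately 0.3333333 (= 1/3), with an error estimate
```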
Using R, an n-fold integral over a finite or infinite interval is computed using the command adaptIntegrate(f, lowerLimit, upperLimit) from the package cubature.
The integral ∫_0^3 ∫_1^5 ∫_{−2}^{−1} sin(x) cos(yz) dx dy dz can be computed as follows:
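A sketch of the corresponding R call, assuming the cubature package and the variable ordering (x, y, z) with x ∈ [−2, −1], y ∈ [1, 5], and z ∈ [0, 3]:

```r
# Triple integral of sin(x) * cos(y * z) via the cubature package.
library(cubature)

f <- function(v) sin(v[1]) * cos(v[2] * v[3])  # v = (x, y, z)
adaptIntegrate(f, lowerLimit = c(-2, 1, 0), upperLimit = c(-1, 5, 3))
# returns a list with the approximated integral and an error estimate
```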
13.7 Polynomial interpolation
In many applications, results of experimental measurements are available in the form of
discrete data sets. However, efficient exploitation of these data requires their synthetic
representation by means of elementary (continuous) functions. Such an approximation, also
termed data fitting, is the process of finding a function, generally a polynomial, whose graph will
pass through a given set of data points.
Let (x_i, y_i), i = 0, …, m, be m + 1 given pairs of data. Then, the problem of interest is to find a polynomial of degree n,

P_n(x) = a_n x^n + a_{n−1} x^{n−1} + ⋯ + a_1 x + a_0,

such that

P_n(x_i) = a_n x_i^n + a_{n−1} x_i^{n−1} + ⋯ + a_1 x_i + a_0 = y_i, i = 0, …, m. (13.21)

Note that this approach was developed by Lagrange, and the resulting interpolation polynomial is referred to as the Lagrange polynomial [→99], [→131]. When n = 1 and n = 2, the process is called a linear interpolation and quadratic interpolation, respectively.
Let us consider the following data points:
xi 1 2 3 4 5 6 7 8 9 10
yi −1.05 0.25 1.08 −0.02 −0.27 0.79 −1.02 −0.17 0.97 2.06
Using R, the Lagrange polynomial interpolation for the above pairs of data points (x, y) can
be carried out using Listing 13.8. In →Figure 13.10 (left), which is an output of Listing 13.8, the
interpolation points are shown as dots, whereas the corresponding Lagrange polynomial is
represented by the solid line.
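A sketch of such an interpolation using pracma (the book's Listing 13.8 may differ): since there are 10 distinct data points, the fitted degree-9 polynomial passes exactly through all of them.

```r
# Degree-9 interpolating polynomial through the 10 data points, via pracma.
library(pracma)

x <- 1:10
y <- c(-1.05, 0.25, 1.08, -0.02, -0.27, 0.79, -1.02, -0.17, 0.97, 2.06)

p <- polyfit(x, y, n = 9)         # coefficients of the interpolating polynomial
xs <- seq(1, 10, length.out = 200)
plot(x, y, pch = 19)              # the interpolation points (dots)
lines(xs, polyval(p, xs))         # the interpolation polynomial (solid line)
```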
Figure 13.10 Left: Polynomial interpolation of the data points in blue. Right: Roots of the
interpolation polynomial.
When f(x) is a polynomial with real or complex-valued coefficients, established results are available to determine the roots analytically by closed expressions. Let

f(x) = a_n x^n + a_{n−1} x^{n−1} + ⋯ + a_1 x + a_0 (13.22)

be a real polynomial, i. e., its coefficients are real numbers, and n is the degree of this polynomial. Then we write deg(f(x)) = n.
If n = 2, the equation

f(x) = a_2 x² + a_1 x + a_0 = 0 (13.23)

has the solutions

x_{1,2} = (−a_1 ± √(a_1² − 4a_2 a_0))/(2a_2). (13.24)

For n = 3,

f(x) = a_3 x³ + a_2 x² + a_1 x + a_0 = 0 (13.25)
leads to the formulas due to Cardano [→135]. For some special cases where n = 4 , analytical
expressions are also known. In general, the well-known theorem due to Abel and Ruffini [→184]
states that general polynomials with deg (f (x)) ≥ 5 are not solvable by radicals. Radicals are n th
root expressions that depend on the polynomial coefficients. Another classical theorem proves
the existence of a zero of a continuous function.
Theorem 13.8.1 (Intermediate value theorem).
Let f: R ⟶ R be a continuous function, and let a and b ∈ R with a < b and such that f(a) and f(b) are nonzero and of opposite signs. Then, there exists x* with a < x* < b such that f(x*) = 0.
Using R, the root(s) of a function, within a specified interval, can be obtained via the package
rootSolve.
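For example, the roots of f(x) = x² − 4 on [−5, 5] (a simple illustrative function) can be found as follows:

```r
# Finding all roots of a function on an interval with rootSolve.
library(rootSolve)

f <- function(x) x^2 - 4        # must be vectorized for uniroot.all()
uniroot.all(f, interval = c(-5, 5))
# approximately -2 and 2
```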
Consider the function

f(x) = a_0 + a_1x + a_2x² + a_3x³ + a_4x⁴ + a_5x⁵ + a_6x⁶ + a_7x⁷ + a_8x⁸ + a_9x⁹, (13.26)

where a_0 = −229, a_1 = 641.943, a_2 = −728.7627, a_3 = 445.0133, a_4 = −162.3738, …
Using the function uniroot.all from the package rootSolve, the root(s) of the function (→13.26)
within the interval [1,10] can be obtained using Listing 13.9. In →Figure 13.10 (right), which is an
output of Listing 13.9, the function f (x) is represented by the solid line, whereas its
corresponding roots in the interval [0,10] are shown as dots. Obviously, all the roots lie on the
horizontal line f (x) = 0 .
Example 13.8.1.
In economics, for example, root-finding methods and basic derivatives find frequent application
[→182]. For instance, these are used to explore profit and revenue functions (see [→182]).
Generally, the revenue function R(x) and the profit function P (x) are defined by
R(x) = px, (13.27)
and
P (x) = R(x) − C(x), (13.28)
respectively [→182]. Here, x is a unit of quantity, p is the sales price per unit of quantity, and C(x)
is a cost function. The unit of quantity x is the variable and p is a parameter, i. e., a fixed number.
Suppose that we have a specific profit function defined by

P(x) = −(1/10)x² + 50x − 480. (13.29)

This profit function P(x) is shown in →Figure 13.11, and to find its maximum, we need to find the zeros of its derivative:

P′(x) = −(2/10)x + 50 = 0. (13.30)

From this, we find x = 250, the profit-maximizing unit of quantity. The corresponding maximum profit is P(250) = 5770. To find the so-called break-even points, it is necessary to identify the zeros of P(x), i. e.,

P(x) = −(1/10)x² + 50x − 480 = 0. (13.31)
Between the two zeros of P(x), we make a profit; outside this interval, we make a loss. Multiplying (→13.31) by −10 yields

x² − 500x + 4800 = 0, (13.32)

whose solutions are

x_{1,2} = 500/2 ± √((500/2)² − 4800). (13.33)

Specifically, x_1 = 490.21 and x_2 = 9.79. This means that the profit interval of P(x) is given by (9.79, 490.21).
Theorem 13.8.2.
Let

f(z) = a_n z^n + a_{n−1} z^{n−1} + ⋯ + a_0, a_n ≠ 0, a_k ∈ C, k = 0, 1, …, n, (13.34)

be a complex polynomial. All the zeros of f(z) lie in the closed disk |z| ≤ ρ. Here, ρ is the positive root of another equation, namely

|a_n| z^n = |a_{n−1}| z^{n−1} + ⋯ + |a_1| z + |a_0|. (13.35)
Theorem 13.8.3.
Let f (z) be a complex polynomial given by equation (→13.34). All the zeros of f (z) lie in the closed
disk
|z| ≤ 1 + max_{0≤j≤n−1} |a_j/a_n|. (13.36)
Below, we provide some examples, which illustrate the results from these two theorems.
Example 13.8.2.
Let f(z) := z³ + 4z² + 1000z + 99 be a polynomial, whose real and complex zeros are as follows:

z_1 = −0.099, (13.37)
z_{2,3} ≈ −1.95 ± 31.56i.

Using Theorem →13.8.2 and Theorem →13.8.3 gives the bounds ρ = 33.78 and 1001, respectively. Considering that the largest modulus among the zeros of f(z) is max_i |z_i| = 31.616, Theorem →13.8.2 gives a good result. The bound given by Theorem →13.8.3 is useless for f(z). This example demonstrates the complexity of the problem of determining zero bounds efficiently (see [→50], [→135], [→155]).
13.10 Exercises
1. Evaluate the gradient, the Hessian, and the Jacobian matrix of the following functions using R:
(a) f(x, y) = y² cos(x²) + √(xy) at the point (x = π, y = 5);
(b) g(x, y, z) = e^{sin(xy)} + zy² cos(xy) + xy³ + √z at the point (x = 5, y = 3, z = 21).
2. Use R to find the extremum of the function f(x) = 3x^x, and determine whether it is a minimum or a maximum.
3. Use R to find the global extrema of the function g(y) = y³ − 6y² − 15y + 100 in [−1, 1].
…
Find the critical numbers of the function f.
7. Use R to find the Taylor series expansion of the function f(x) = ln(1 + x). Plot the graph of the function f and the corresponding Taylor series approximation for x ∈ [−1, 1].
…
9. Use R to find the polynomial interpolation of the following pairs of data points:
x_i: 1 2 3 4 5 6 7 8 9 10
y_i: −2.05 0.75 1.8 −0.02 −0.75 1.71 −2.12 −0.25 1.70 3.55
10. Use R to find the roots of the following functions:
g(x) = 23x² − 3x − 1,
h(x) = 23x⁸ − 3x⁷ + x⁴ − x² − 20.
14 Differential equations
Differential equations can be seen as applications of the methods from analysis, discussed in the
previous chapter. The general aim of differential equations is to describe the dynamical behavior
of functions [→78]. This dynamical behavior is the result of an equation that contains derivatives
of a function. In this chapter, we introduce ordinary differential equations and partial differential
equations [→3]. We discuss the general properties of such equations and demonstrate specific
solution techniques for selected differential equations, including the heat equation and the wave
equation. Due to the descriptive nature of differential equations, physical laws, as well as biological and economic models, are often formulated in terms of such equations.
A general ODE problem for a function y(t) ∈ R^n can be written as

y′(t) = f(y(t), t, k), (14.1)

where f is referred to as the vector-valued function, which controls how y changes over t, and k is a vector of parameters. When n = 1, the problem is called a single scalar ODE.
By itself, the ODE problem (→14.1) does not provide a unique solution function y(t) . If, in
addition to equation (→14.1), the initial state at t = t , y(t ) , is known, then the problem is called
0 0
an initial value ODE problem. On the other hand, if some conditions are specified at the extremes ("boundaries") of the independent variable t, e. g., y(t_0) = C_1 and y(t_max) = C_2 with C_1 and C_2 given, then the problem is called a boundary value ODE problem. For an initial value problem, we consider the system from the time t_0 onward, and we are seeking a function y(t), which describes the state of the system as a function of t. Thus, a general formulation of a first-order initial value ODE problem can be written as follows:
dy(t)/dt = y′(t) = f(y(t), t, k), for t > t_0, (14.2)
y(t_0) = C, (14.3)

where C is given.
Some examples of initial value ODE problems, depicted in →Figure 14.1, illustrate the
evolution of the ODE’s solution, depending on its initial condition.
Equation (→14.2) may represent a system of ODEs, where
T
y(t) = (y1 (t), … , yn (t)) and f (y(t), t, k) = (f1 (y(t), t, k), … , fn (y(t), t, k)),
and each entry of f (y(t), t, k) can be a nonlinear function of all the entries of y.
The system (→14.2) is called linear if the function f(y(t), t, k) can be written as follows:

f(y(t), t, k) = G(t, k)y(t) + h(t, k), (14.4)

where G(t, k) is an n × n matrix and h(t, k) a vector of length n. If G(t, k) = G is constant and h(t, k) ≡ 0, then the system (→14.4) is called homogeneous. The solution to the homogeneous system, y′(t) = Gy(t) with data y(t_0) = C, is given by

y(t) = Ce^{G(t−t_0)}.
An ODE’s order is determined by the highest-order derivative of the solution function y(t)
appearing in the ODEs or the systems of ODEs. Higher-order ODEs or systems of ODEs can be
transformed into equivalent first-order system of ODEs.
Let

y^(n) = f(t, y, y′, y″, …, y^(n−1)) (14.5)

be an ODE of order n. Using the substitution

y_1(t) = y(t), y_2(t) = y′(t), …, y_n(t) = y^(n−1)(t), (14.6)

equation (→14.5) can be rewritten in the form of a system of n first-order ODEs as follows:

y_1′(t) = y_2(t),
y_2′(t) = y_3(t),
y_3′(t) = y_4(t),
⋮
y_n′(t) = f(t, y_1(t), y_2(t), …, y_n(t)).
Analytical solutions to ODEs consist of closed-form formulas, which can be evaluated at any
point t. However, the derivation of such closed-form formulas is generally nontrivial. Thus,
numerical methods are generally used to approximate values of the solution function at a discrete
set of points. Since higher-order ODEs can be reduced to a system of first-order ODEs, most
numerical methods for solving ODEs are designed to solve first-order ODEs.
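Most such solvers advance the solution step by step from the initial state. The simplest scheme of this kind is the forward Euler method; the short Python sketch below (the book's own listings use R's deSolve, not shown here, and the function name `euler` is ours) illustrates the idea on y' = y, whose exact solution is e^t.

```python
import math

def euler(f, y0, t0, t1, n):
    """Approximate y(t1) for y'(t) = f(y, t), y(t0) = y0,
    using n forward-Euler steps of size h = (t1 - t0) / n."""
    h = (t1 - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = y + h * f(y, t)  # y_{k+1} = y_k + h * f(y_k, t_k)
        t = t + h
    return y

# Example: y' = y, y(0) = 1, with exact solution y(t) = e^t.
approx = euler(lambda y, t: y, 1.0, 0.0, 1.0, 100000)
print(approx)  # close to e = 2.71828...; the error shrinks like 1/n
```

Production solvers such as ode() use adaptive, higher-order schemes, but the principle of marching forward from t_0 is the same.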
In R, numerical methods for solving ODE problems are implemented within the package
deSolve, and the function ode() from the package deSolve is dedicated to solving initial value ODE
problems. Further details about solving differential equations using R can be found in [→177] and
[→176].
Let us use the function ode() to solve the following ODE problem:
y' = k t y, (14.7)
dy_2/dt = k_2 y_1 y_3, (14.9)
dy_3/dt = k_3 y_1 y_2, (14.10)
with the initial conditions y_1(0) = −1 , y_2(0) = 0 , y_3(0) = 1 , and where k_1 , k_2 and k_3 are given parameters.
A very simplistic formulation of a boundary value ODE problem can be written as follows:
dy(t)/dt = y'(t) = f(y(t), t, k) for t > t_0 , (14.11)
In R, the function bvpshoot() from the package bvpSolve is dedicated to solving boundary value ODE problems.
Let us use the function bvpshoot() to solve the following boundary value ODE problem:
y''(t) − 2y(t)^2 − 4t y(t) y'(t) = 0, (14.13)
with y(−1) = 1/4, y(1) = 1/3.
Since the problem (→14.13) is a second-order ODE problem, it is necessary to write its equivalent first-order ODE system. Using the substitution (→14.6), the second-order ODE (→14.13) can be rewritten in the following form:
y_1'(t) = y_2(t), (14.14)
y_2'(t) = 2y_1(t)^2 + 4t y_1(t) y_2(t). (14.15)
Then, the problem (→14.14)–(→14.15) can be solved in R using Listing 14.3. →Figure 14.2 (right), which is an output of Listing 14.3, shows the evolution of the solution (y_1(t), y_2(t)) to the problem (→14.14)–(→14.15).
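As a quick sanity check of the reduction, the right-hand side of the system (→14.14)–(→14.15) can be coded and evaluated at a point; the sketch below is in Python rather than R, and the function name `rhs` is ours.

```python
def rhs(t, y):
    """Right-hand side of the first-order system equivalent to
    y'' = 2*y**2 + 4*t*y*y', with state y = (y1, y2) = (y, y')."""
    y1, y2 = y
    return [y2, 2.0 * y1**2 + 4.0 * t * y1 * y2]

print(rhs(0.0, [1.0, 1.0]))  # [1.0, 2.0]: y1' = y2 = 1, y2' = 2*1 + 0 = 2
print(rhs(1.0, [1.0, 1.0]))  # [1.0, 6.0]: y2' = 2*1 + 4*1*1*1 = 6
```

A function with exactly this structure is what a boundary value solver integrates repeatedly while adjusting the unknown initial slope.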
The PDE (→14.18) can be classified according to the values assumed by A, B, and C at a given point (x, y) . The PDE (→14.18) is called hyperbolic if B^2 − 4AC > 0 , parabolic if B^2 − 4AC = 0 , and elliptic if B^2 − 4AC < 0 at that point.
A Neumann boundary condition prescribes the normal derivative of the solution,
∂u/∂n = g on the boundary ∂R,
where n denotes the direction normal to the boundary. Dirichlet conditions can only be applied if the solution is known on the boundary and if the function f is analytic. These are frequently used for the flow (velocity) into a domain. Neumann conditions occur more frequently [→102].
14.2.4 Well-posed PDE problems
A mathematical PDE problem is considered well-posed, in the sense of Hadamard, if
the solution exists,
the solution is unique,
the solution depends continuously on the auxiliary data (e. g., boundary and initial
conditions).
Parabolic PDE
In this section, we will illustrate the solution to the heat equation, which is a prototype parabolic PDE. The heat equation, in a one-dimensional space with zero production and consumption, can be written as follows:
∂u(x, t)/∂t − D ∂²u(x, t)/∂x² = 0, x ∈ (a, b). (14.19)
Let us use R to solve the equation (→14.19) with a = 0 , b = 1 , i. e., x ∈ [0, 1] , and the following boundary and initial conditions:
u(x, 0) = cos(πx/2), u(0, t) = sin(t), u(1, t) = 0. (14.20)
The heat equation (→14.19)–(→14.20) can be solved using Listing 14.4. The corresponding
solution, u(x, t) , is depicted in →Figure 14.3 for color levels (left) and a contour plot (right).
Figure 14.3 Solution to the heat equation in equation (→14.19) with the boundary and initial
conditions provided in equation (→14.20).
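Listing 14.4 is not reproduced here, but the underlying idea, the method of lines with explicit finite differences, can be sketched in a few lines of Python. The grid size and time step below are our own choices, picked so that the stability condition D·dt/dx² < 1/2 holds.

```python
import math

# Explicit finite differences for u_t = D u_xx on [0, 1] with
# u(x,0) = cos(pi*x/2), u(0,t) = sin(t), u(1,t) = 0, as in (14.20).
D, nx, dt, t_end = 1.0, 21, 0.001, 0.1
dx = 1.0 / (nx - 1)                  # D*dt/dx**2 = 0.4 < 0.5: stable
x = [i * dx for i in range(nx)]
u = [math.cos(math.pi * xi / 2) for xi in x]
t = 0.0
while t < t_end - 1e-12:
    un = u[:]
    for i in range(1, nx - 1):       # centered second difference in space
        un[i] = u[i] + D * dt / dx**2 * (u[i-1] - 2*u[i] + u[i+1])
    t += dt
    un[0], un[-1] = math.sin(t), 0.0  # boundary conditions of (14.20)
    u = un
print(u[0], u[-1], max(abs(v) for v in u))
```

Because the scheme satisfies a discrete maximum principle for D·dt/dx² ≤ 1/2, the computed solution stays bounded by the initial and boundary data.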
Hyperbolic PDE
A prototype of hyperbolic PDEs is the wave equation, which can be written as follows:
∂²u/∂t² = ∇ ⋅ (c² ∇u). (14.21)
Let us use R to solve the wave equation with the following initial conditions:
∂u/∂t (t = 0, x, y) = 0,
u(t = 0, x, y) = e^{−(x² + y²)}. (14.23)
The wave equation (→14.22)–(→14.23) can be solved using Listing 14.5. The corresponding
solution, u(t, x, y) , is depicted in →Figure 14.4, for t = 0 , t = 1 , t = 2 and t = 3 , respectively.
Figure 14.4 Solution to the wave equation in equation (→14.22) with the boundary and initial
conditions provided in equation (→14.23).
Elliptic PDE
A prototype of elliptic PDEs is the Poisson’s equation. Let us use R to solve the following Poisson’s
equation in a two-dimensional space:
γ_1 ∂²u(x, y)/∂x² + γ_2 ∂²u(x, y)/∂y² = x² + y², x ∈ (a, b), y ∈ (c, d), (14.24)
Figure 14.5 Solution to the Poisson’s equation in equation (→14.24) with the boundary and initial
conditions provided in equation (→14.25).
14.3 Exercises
Use R to solve the following differential equations:
1. Solve the heat equation (→14.19) with a = 0 , b = 1 , i. e., x ∈ [0, 1] , and the following boundary and initial conditions:
(a) u(x, 0) = 6 sin(πx/2) , u(0, t) = cos(t) , u(1, t) = 0 .
(b) … , u(1, t) = 0 .
2. Solve the wave equation (→14.21) with the following initial conditions:
∂u/∂t (t = 0, x, y) = 0,
u(t = 0, x, y) = e^{−(2x³ + 3y²)}.
3. Solve the wave equation (→14.21) with the following initial conditions:
∂u/∂t (t = 0, x, y) = 0,
u(t = 0, x, y) = e^{−(5x² + 7y³)}.
15.1 Introduction
The theory of dynamical systems can be viewed as the most natural way of describing the
behavior of an integrated system over time [→56], [→109]. In other words, a dynamical system
can be cast as the process by which a sequence of states is generated on the basis of certain
dynamical laws. Generally, this behavior is described through a system of differential equations
describing the rate of change of each variable as a function of the current values of the other
variables influencing the one under consideration. Thus, the system states form a continuous
sequence, which can be formulated as follows. Let x = (x_1, x_2, … , x_n) be a point in C^n that describes the state of the system.
Suppose that the laws, which describe the rate and direction of the change of x(t) , are known and defined by the following equations:
dx(t)/dt = f(x(t)), t ∈ R, x ∈ C^n, x(t_0) = x_0. (15.1)
However, when those states form a discrete sequence, a discrete-time formulation of the system (→15.1) can be written as follows:
x(k + 1) = f(x(k)), k ∈ Z, x(k) ∈ C^n ∀ k, x(0) = x_0. (15.2)
Definition 15.1.1.
A sequence, x(t) , is called a dynamical system if it satisfies the set of ordinary differential equations (→15.1) (respectively (→15.2)) for a given time interval [t_0, t] .
Definition 15.1.2.
A curve C = {x(t)} , which satisfies the equations (→15.1) (respectively (→15.2)), is called the orbit of the dynamical system x(t) .
Definition 15.1.3.
A point x* ∈ C^n is said to be a fixed point, also called a critical point, or a stationary point, if it satisfies f(x*) = 0 .
Definition 15.1.4.
A critical point x* is said to be stable if every orbit, originating near x* , remains near x* , i. e., ∀ ε > 0 , ∃ ξ > 0 such that
∥x(0) − x*∥ < ξ ⟹ ∥x(t) − x*∥ < ε ∀ t > 0.
A critical point x* is said to be asymptotically stable if every orbit, originating sufficiently near x* , converges to x* , i. e., ∥x(t) − x*∥ ⟶ 0 as t ⟶ +∞ .
Definition 15.1.5.
A point x̄ ∈ C^n is called a periodic point of period k if f^k(x̄) = x̄ and f^j(x̄) ≠ x̄ for j = 1, … , k − 1 . The integer k is called the period of the point x̄ .
Definition 15.1.6.
An attractor is a minimal set of points A ⊂ C^n such that every orbit originating within its neighborhood converges asymptotically towards the set A. A stable fixed point is an attractor known as a map sink. A dynamical system may have more than one attractor. The set of states leading to a given attractor is called its basin of attraction.
Depending on the form of the functions f_i and the initial conditions x_0 in (→15.1) (respectively (→15.2)), the evolution of a dynamical system can lead to one of the following regimes:
1. steady state: In such a regime, in response to any change in the initial condition, the dynamical system restores itself and resumes its original course again, leading to the formation of relatively stable patterns; thus, the system is wholly or largely insensitive to the alteration of its initial conditions.
2. periodic: In this regime, in response to any change in the initial condition, the trajectory of the system will eventually stabilize and alternate periodically between relatively stable patterns.
3. chaotic: In such a regime, in response to any change in the initial condition, the dynamical system generates a totally different orbit, i. e., any small perturbations can lead to different trajectories. Hence, the system is highly sensitive to the alteration of its initial conditions.
In the subsequent sections, we will illustrate the use of R to simulate and visualize some basic
dynamical systems, including population growth models, cellular automata, Boolean networks,
and other “abstract” dynamical systems, such as strange attractors and fractal geometries. These
dynamical systems are well known for their sensitivity to initial conditions, which is the defining
feature of chaotic systems.
x(t) = x(0) e^{rt}. (15.4)
The solution (→15.4) can be plotted in R using Listing 15.1, and the corresponding output,
which shows the evolution of the population for x(0) = 2 , r = 0.03 and t ∈ [0, 100] , is depicted
in →Figure 15.1 (left).
Figure 15.1 Left: Exponential population growth for r = 0.03 and x_0 = 2 . Center: Logistic population growth model for r = 0.1 , x_0 = 0.1 < K = 10 . Right: Logistic population growth model for r = 0.1 , x_0 = 20 > K = 10 .
The derivative dx/dt is zero if x = 0 or x = K . Thus, the solution to the equation (→15.5) is given by
x(t) = K x(0) e^{rt} / (K + x(0)(e^{rt} − 1)). (15.6)
The solution (→15.6) can be plotted in R using Listing 15.2. The corresponding output, which
shows the evolution of the population over the time interval [0,100], is depicted in →Figure 15.1
(center) for x(0) = 0.1 , r = 0.1 , K = 10 , and in →Figure 15.1 (right) for x(0) = 20 , r = 0.1 ,
K = 10 .
the population number of the next generation. When the growth rate is assumed to be a linearly decreasing function of y_n , then we get the following logistic equation:
y_{n+1} = r y_n (1 − y_n / K), (15.7)
where y_{n+1} denotes the population size of the next generation, whereas y_n is the population size of the current generation; and r is a positive constant denoting the growth rate of the population between generations.
The graph x_{n+1} versus x_n is called the cobweb graph of the logistic map.
For any initial condition, over time, the population x_n will settle into one of the following types of behavior:
1. fixed, i. e., the population approaches a stable value
2. periodic, i. e., the population alternates between two or more fixed values
3. chaotic, i. e., the population will eventually visit any neighborhood in a subinterval
of (0,1).
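The map itself, equations (→15.8)–(→15.9), is not reproduced in this excerpt; assuming the standard normalized form x_{n+1} = r x_n (1 − x_n), the first type of behavior can be checked directly: for 1 < r < 3 the orbit settles at the fixed point x* = 1 − 1/r. A Python sketch (function name ours):

```python
def logistic_orbit(r, x0, n):
    """Iterate the normalized logistic map x_{k+1} = r*x_k*(1 - x_k)."""
    xs = [x0]
    for _ in range(n):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

r = 2.5
orbit = logistic_orbit(r, 0.2, 1000)
x_star = 1.0 - 1.0 / r   # nontrivial fixed point, stable for 1 < r < 3
print(orbit[-1], x_star)  # the orbit settles at x* = 0.6
```

Stability follows from |f'(x*)| = |r(1 − 2x*)| = |2 − r| < 1 on that range, so nearby orbits are contracted towards x*.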
When 1 < r < 3 , the point x* is asymptotically stable, i. e., for any x in the neighborhood of x* , the sequence generated by the map (→15.9)—the orbit of x—remains close to or converges to x* . In R, such dynamics of the system can be illustrated using the scripts provided in Listing 15.3 and Listing 15.4.
Due to its discrete nature, regulation of the growth rate in the logistic map (→15.8) operates with
a one period delay, leading to overshooting of the dynamical system. Beyond the value r = 3 , the
dynamical system (→15.8) is no longer asymptotically stable, but exhibits some periodic behavior.
The parameter value r = 3 is known as a bifurcation point. This behavior can be illustrated, in R,
using Listing 15.5.
Figure 15.3 Left: Cobweb graph of periodic fixed points for r = 3.2 . Center: Cobweb graph of
periodic fixed points for r = 3.4 ; Right: Dynamics of the population number over time.
→Figure 15.3 (left) and →Figure 15.3 (center), produced using Listing 15.4, show the cobweb
graphs of the logistic map for r = 3.2 and r = 3.4 , which both correspond to periodic fixed
points. →Figure 15.3 (right), produced using Listing 15.3, illustrates the dynamics of the
populations over time for both cases.
For larger values of r in the logistic map (→15.8), further bifurcations occur, and the number of
periodic points explodes. For instance, for r ≥ 3 , the structure of the orbits of the dynamical
system becomes complex and, hence, chaotic behavior ensues. Such behavior can be illustrated in
R, using the scripts provided in Listing 15.3 and Listing 15.4.
→Figure 15.4 (left) and →Figure 15.4 (center), produced using Listing 15.4, show the cobweb
graphs of the logistic map for r = 3.8 and r = 3.9 , which both correspond to chaotic motions.
→Figure 15.4 (right), produced using Listing 15.3, illustrates the dynamics of the populations over
time for both cases, where the chaotic evolution of the populations can be clearly observed.
Figure 15.4 Left: Cobweb graph of a chaotic motion for r = 3.8 . Center: Cobweb graph of a
chaotic motion for r = 3.9 ; Right: Dynamics of the population number over time.
→Figure 15.5 (left), (center), and (right), produced using Listing 15.5, illustrates the bifurcation
phenomenon, which can be visualized through the graph of the growth rate, r, versus the
population size, x. Such a graph is also known as the bifurcation diagram of a logistic map model.
→Figure 15.5 (left) depicts the bifurcation diagram for 0 ≤ r ≤ 4 , whereas →Figure 15.5 (center)
and →Figure 15.5 (right) show the zoom corresponding to the ranges 3 ≤ r ≤ 4 and
3.52 ≤ r ≤ 3.92 , respectively.
Figure 15.5 Bifurcation diagram for the logistic map model—growth rate r versus population size
x: Left 0 ≤ r ≤ 4 . Center: zoom for 3 ≤ r ≤ 4 . Right: zoom for 3.52 ≤ r ≤ 3.92 .
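A bifurcation diagram of this kind is produced by iterating the map for each value of r, discarding a transient, and recording the remaining orbit points; the Python sketch below (function name ours, normalized map assumed) counts the attractor points for two values of r.

```python
def attractor_points(r, x0=0.2, transient=2000, keep=200):
    """Discard a transient, then sample the orbit of the logistic map;
    the distinct sampled values approximate the attractor for this r."""
    x = x0
    for _ in range(transient):
        x = r * x * (1.0 - x)
    pts = []
    for _ in range(keep):
        x = r * x * (1.0 - x)
        pts.append(round(x, 6))
    return sorted(set(pts))

print(len(attractor_points(2.5)))  # 1: stable fixed point
print(len(attractor_points(3.2)))  # 2: period-2 cycle past r = 3
```

Plotting these attractor points against r for a fine grid of r values reproduces the diagram in the figure.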
which feeds on the prey, and whose population number or concentration at time t is x_2(t) .
Furthermore, the model is based on the following assumptions about the environment, as well as
the evolution of the populations of the two species:
1. The prey population has an unlimited food supply, and it grows exponentially in
the absence of interaction with the predator species.
2. The rate of predation upon the prey species is proportional to the rate at which
the predator species and the prey meet.
The model describes the evolution of the population numbers x_1 and x_2 over time through the following relationships:
dx_1/dt = x_1(α − β x_2), (15.10)
dx_2/dt = −x_2(γ − δ x_1),
where dx_1/dt and dx_2/dt denote the growth rates of the two populations over time; α is the growth rate of the prey population in the absence of interaction with the predator species; β is the death rate of the prey species caused by the predator species; γ is the death (or emigration) rate of the predator species in the absence of interaction with the prey species; and δ is the growth rate of the predator population.
The predator–prey model (→15.10) is a system of ODEs. Thus, it can be solved using the
function ode() in R. When the parameters α, β, γ, and δ are set to 0.2, 0.002, 0.1, and 0.001,
respectively, the system (→15.10) can be solved in R, using the scripts provided in Listing 15.6 and
Listing 15.7.
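Listings 15.6 and 15.7 are not shown here; as an illustration of what a fixed-step solver does with (→15.10), the Python sketch below integrates the system with a classical Runge–Kutta (RK4) step, using the parameter values above and an initial state of our own choosing within the range used in the figure. The quantity V = δx_1 − γ ln x_1 + βx_2 − α ln x_2 is a known first integral of the Lotka–Volterra system, so its numerical drift is a convenient accuracy check.

```python
import math

def lv_step(state, h, a=0.2, b=0.002, g=0.1, d=0.001):
    """One RK4 step for the predator-prey system (15.10);
    a, b, g, d play the roles of alpha, beta, gamma, delta."""
    def f(s):
        x1, x2 = s
        return (x1 * (a - b * x2), -x2 * (g - d * x1))
    k1 = f(state)
    k2 = f((state[0] + h/2*k1[0], state[1] + h/2*k1[1]))
    k3 = f((state[0] + h/2*k2[0], state[1] + h/2*k2[1]))
    k4 = f((state[0] + h*k3[0], state[1] + h*k3[1]))
    return (state[0] + h/6*(k1[0] + 2*k2[0] + 2*k3[0] + k4[0]),
            state[1] + h/6*(k1[1] + 2*k2[1] + 2*k3[1] + k4[1]))

s = (50.0, 25.0)   # x1(0) = 50, x2(0) = 25 (our choice)
V0 = 0.001*s[0] - 0.1*math.log(s[0]) + 0.002*s[1] - 0.2*math.log(s[1])
for _ in range(10000):   # integrate to t = 100 with h = 0.01
    s = lv_step(s, 0.01)
V1 = 0.001*s[0] - 0.1*math.log(s[0]) + 0.002*s[1] - 0.2*math.log(s[1])
print(s, V0, V1)  # V is conserved along the closed orbit
```

The near-constancy of V reflects the closed orbits seen in the phase-plane plots of →Figure 15.6.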
The corresponding outputs are shown in →Figure 15.6, where the solution in the phase plane (x_1, x_2) for x_2(0) = 25 , the evolution of the population of the species over time for x_2(0) = 25 , and the solution in the phase plane (x_1, x_2) for 10 ≤ x_1(0) ≤ 150 are depicted in →Figure 15.6 (left), →Figure 15.6 (center), and →Figure 15.6 (right), respectively.
Figure 15.6 Left: solution in the phase plane (x_1, x_2) for x_2(0) = 25 . Center: evolution of the population of the species over time for x_2(0) = 25 . Right: solution in the phase plane (x_1, x_2) for 10 ≤ x_1(0) ≤ 150 .
At each time point, the state of each cell of the grid is updated according to a specified rule, so
that the new state of a given cell depends on the state of its neighborhood, namely the current
state of the cell under consideration and its adjacent cells, as illustrated below:
The cells at the boundaries do not have two neighbors, and thus require special treatments. These
cells are called the boundary conditions, and they can be handled in different ways:
The cells can be kept with their initial condition, i. e., they will not be updated at all during
the simulation process.
The cells can be updated in a periodic way, i. e., the first cell on the left is a neighbor of the
last cell on the right, and vice versa.
The cells can be updated using a desired rule.
Depending on the rule specified for updating the cell and the initial conditions, the evolution of
elementary cellular automata can lead to the following system states:
Steady state: The system will remain in its initial configuration, i. e., the initial spatiotemporal
pattern can be a final configuration of the system elements.
Periodic cycle: The system will alternate between coherent periodic stable patterns.
Self-organization: The system will always converge towards a coherent stable pattern.
Chaos: The system will exhibit some chaotic patterns.
For a finite number of cells N, the number of possible configurations for the system is also finite and is given by 2^N . Hence, at a certain time point, all configurations will be visited, and the CA will enter a periodic cycle by repeating itself indefinitely. Such a cycle corresponds to an attractor of the system for the given initial conditions. When a cellular automaton models an orderly system, then the corresponding attractor is generally small, i. e., it has a cycle with a small period.
Using the R Listing 15.8, we illustrate some spatiotemporal evolutions of an elementary
cellular automaton using both deterministic and random initial conditions, whereby the cells at
the boundaries are kept to their initial conditions during the simulation process.
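Listing 15.8 is not reproduced here, but one synchronous update of an elementary cellular automaton with fixed boundary cells can be sketched as follows (Python rather than R; Rule 90 is used only as a familiar test case, while the figures use Rules 182, 210, and 89).

```python
def ca_step(cells, rule):
    """One synchronous update of an elementary cellular automaton.
    Boundary cells keep their current state (fixed boundary conditions);
    interior cell i is updated from (cells[i-1], cells[i], cells[i+1])
    via bit (4*l + 2*c + r) of the Wolfram rule number."""
    nxt = cells[:]
    for i in range(1, len(cells) - 1):
        idx = 4 * cells[i-1] + 2 * cells[i] + cells[i+1]
        nxt[i] = (rule >> idx) & 1
    return nxt

row = [0, 0, 0, 1, 0, 0, 0]   # single seed in the middle
print(ca_step(row, 90))       # [0, 0, 1, 0, 1, 0, 0]
```

Stacking successive rows of such updates produces exactly the spatiotemporal patterns shown in →Figure 15.7 and →Figure 15.8.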
→Figure 15.7 shows the spatiotemporal patterns of an elementary cellular automaton with a
simple deterministic initial condition, i. e., all the cells are set to 0, except the middle one, which is
set to 1. Complex localized stable structures (using Rule 182), self-organization (using Rule 210)
and chaotic patterns (using Rule 89) are depicted in →Figure 15.7 (left), →Figure 15.7 (center), and
→Figure 15.7 (right), respectively.
→Figure 15.8 shows spatiotemporal patterns of an elementary cellular automaton with a
random initial condition, i. e., the states of the cells are allocated randomly. Complex localized
stable structures (using Rule 182), self-organization (using Rule 210) and chaotic patterns (using
Rule 89) are depicted in →Figure 15.8 (left), →Figure 15.8 (center), and →Figure 15.8 (right),
respectively.
Figure 15.7 Spatiotemporal patterns of an elementary cellular automaton with a simple
deterministic initial condition, i. e., all the cells are set to 0 except the middle one, which is set to 1.
Left: complex localized stable structures (Rule 182). Center: self-organization (Rule 210). Right:
chaotic patterns (Rule 89).
Figure 15.8 Spatiotemporal patterns of an elementary cellular automaton with a random initial
condition, i. e., the states of the cells are allocated randomly. Left: complex localized stable
structures (Rule 182). Center: self-organization (Rule 210). Right: chaotic patterns (Rule 89).
Boolean functions, called transition or regulation functions. Let x_i(t) represent the state of the node x_i at time t, which takes the value of either 1 (on) or 0 (off). Then, the vector x(t) = (x_1(t), … , x_N(t)) represents the state of all the nodes in X , at the time step t. The total number of possible states for each time step is 2^N . The state of a node x_i at the next time step is determined by the states of the immediate predecessors (or input nodes) of x_i . If all the N nodes have the same number of input nodes, K, then the RBN is referred to as an NK network, and K is also called the number of connections of the network. Like most dynamical systems, RBNs also enjoy three main regimes which, for an NK network, are correlated with the number of connections K [→109]. In particular,
if K < 2 the evolution of the RBN leads to stable (ordered) dynamics,
if K = 2 the evolution of the RBN leads to periodic (critical) dynamics,
if K ≥ 3 the evolution of the RBN leads to a chaotic regime.
RBNs can be viewed as a generalization of cellular automata, in the sense that, in Boolean
networks,
a cell neighborhood is not necessarily restricted to its immediate adjacent cells,
the size of the neighborhood of a cell and the position of the cells within the neighborhood
are not necessarily the same for every cell of the grid,
the state transition rules are not necessarily identical or unique for every cell of the grid,
the updating process of the cells is not necessarily synchronous.
The updating process of the nodes in a Boolean network can be synchronous or asynchronous,
deterministic or nondeterministic. According to the specified update process, Boolean networks
can be cast in different categories [→85], including the following:
1. Classical random Boolean networks (CRBNs): In RBNs of this type, at each discrete
time step, all the nodes in the network are updated synchronously in a
deterministic manner, i. e., the nodes are updated at time t + 1 , taking into
account the state of the network at time t.
2. Asynchronous random Boolean networks (ARBNs): In RBNs of this type, at each time
step, a single node is chosen at random and updated, and thus the update
process is asynchronous and nondeterministic.
3. Deterministic asynchronous random Boolean networks (DARBNs): For this class of
Boolean networks, each node is labeled with two integers u, v ∈ N ( u < v ). Let m
denote the number of time steps from the beginning of the simulation to the
current time. Then, the only nodes to be updated during the current time step are
those such that u = (m mod v) . If several nodes have to be updated at the
same time step, then the changes, made in the network by updating one node,
are taken into account during the updating process of the next node. Hence, the
update process is asynchronous and deterministic.
4. Generalized asynchronous random Boolean networks (GARBNs): For this class of
Boolean networks, at each time step, a random number of nodes are selected and
updated synchronously; i. e., if several nodes have to be updated at the same time
step, then the changes, made in the node-states by updating one node, are not
taken into account during the updating process of the next node. Thus, the
update process is semi-synchronous and nondeterministic.
5. Deterministic generalized asynchronous random Boolean networks (DGARBNs): This
type of Boolean networks is similar to the DARBN, except that, in this case, if
several nodes have to be updated at the same time step, the changes, made in
the node-states by updating one node, are not taken into account during the
updating process of the next node. Thus, the update process is semi-synchronous
and deterministic.
In the context of genomics, a gene regulatory network (GRN) can be modeled as a Boolean
network, where the status of a given gene (active/expressed or inactive/not expressed) is
represented as a Boolean variable, whereas the interactions/dependencies between genes are
described through the transition functions, and the input nodes for a gene x_i consist of the genes regulating x_i . Let us consider the following simple GRN with three genes A, B, C, i. e.,
f_1 = f_1(x_1, x_3) = x_1 ∨ x_3,
f_2 = f_2(x_1, x_3) = x_1 ∧ x_3, (15.11)
f_3 = f_3(x_1, x_2) = ¬x_1 ∨ x_2,
where ∨ , ∧ , and ¬ are the logical disjunction (OR), conjunction (AND), and negation (NOT), respectively.
At a given time point t, the state-vector is x(t) = (x_1(t), x_2(t), x_3(t)) , and the state evolution at the time point t + 1 is given by
x_1(t + 1) = f_1(x_1(t), x_3(t)), x_2(t + 1) = f_2(x_1(t), x_3(t)), x_3(t + 1) = f_3(x_1(t), x_2(t)).
The corresponding truth table, i. e., the node states at time t + 1 for any given configuration of the state vector x at time t, is as follows:
x(t) = (x_1(t), x_2(t), x_3(t)) 000 001 010 011 100 101 110 111
x(t + 1) = (x_1(t + 1), x_2(t + 1), x_3(t + 1)) 001 101 001 101 100 110 101 111
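The truth table above can be verified mechanically by evaluating the three transition functions on all 2³ states; a Python sketch (function names ours):

```python
from itertools import product

# Transition functions of the three-gene example network:
# f1 = x1 OR x3, f2 = x1 AND x3, f3 = (NOT x1) OR x2.
def step(x1, x2, x3):
    return (x1 | x3, x1 & x3, (1 - x1) | x2)

table = {}
for x in product((0, 1), repeat=3):
    table["".join(map(str, x))] = "".join(map(str, step(*x)))
print(table)  # e.g. state 010 maps to 001
```

Iterating `step` from any initial state walks through the state transition graph whose attractors are analyzed below.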
An RBN with N nodes can be represented by an N by N matrix, known as the adjacency matrix, for which the value of the component (i, j) is 1 if there is an edge from node i to node j, and 0 otherwise. If we substitute the nodes x_1, x_2, x_3 with their associated gene labels A, B, and C, we obtain the following adjacency matrix:
A B C
A 1 1 1
B 0 0 1
C 1 1 0
To draw the corresponding network using the package igraph in R, we can save the adjacency
matrix as a csv (comma separated values) or a text file and then load the file in R. The
corresponding text or csv file, which we will call here “ExampleBN1.txt”, will be in the following
format:
Using the R package BoolNet [→140], we can also draw a given Boolean network, generate an
RBN and analyze it, e. g., find the associated attractors and plot them. However, the dependency
relations of the network must be written into a text file using an appropriate format. For instance,
the dependency relations (→15.11) can be written in a textual format as follows:
Here, the symbols |, & and ! respectively denote the logical disjunction (OR), conjunction
(AND) and negation (NOT). Let us call the corresponding text file “ExampleBN1p.txt”, and this
must be in the current working R directory.
→Figure 15.9, produced using Listing 15.10, shows the visualization and analysis of the
Boolean network represented in the text file “ExampleBN1p.txt”. The network graph, the state
transition graph as well as attractor basins, and the state transition table when the initial state is
(010) i. e., (A = 0, B = 1, C = 0) , are depicted in →Figure 15.9 (top), →Figure 15.9 (bottom left)
and →Figure 15.9 (bottom right), respectively.
Figure 15.9 Visualization and analysis of a Boolean network—Example 1. Top: network graph.
Bottom left: state transition graph and attractor basins. Bottom right: state transition table when
the initial state is (010), i. e., (A = 0, B = 1, C = 0) .
→Figure 15.10, produced using Listing 15.11, shows the visualization and analysis of an RBN
generated within the listing. The network graph, the state transition graph, as well as attractor
basins, and the state transition table when the initial state is (11111111) are depicted in →Figure
15.10 (top), →Figure 15.10 (bottom left), and →Figure 15.10 (bottom right), respectively.
Figure 15.10 Visualization and analysis of a Boolean network—Example 2. Top: network graph.
Bottom left: state transition graph and attractor basins. Bottom right: state transition table when
the initial state is (11111111).
Figure 15.11 Spatiotemporal patterns of RBNs with N = 1000 . Left: critical dynamics (K = 2) .
Right: chaotic patterns (K = 7) .
→Figure 15.11, produced using Listing 15.12, shows spatiotemporal patterns of RBNs with
N = 1000 . Critical dynamics (for K = 2 ) and chaotic patterns (for K = 7 ) are shown in →Figure
15.11 (left) and →Figure 15.11 (right), respectively.
15.6 Case studies of dynamical system models with complex attractors
In this section, we will provide implementations, in R, for some exemplary dynamical system
models, which are known for their complex attractors.
written as:
dx_1/dt = a(x_2 − x_1), (15.12)
dx_2/dt = r x_1 − x_2 − x_1 x_3,
dx_3/dt = x_1 x_2 − b x_3,
iterations. Representations of the attractor in the plane (x, y) , in the space (x, y, z) , and in the plane (x, z) are given in →Figure 15.12 (left), →Figure 15.12 (center), and →Figure 15.12 (right), respectively.
Figure 15.12 Lorenz attractor for a = 10 , r = 28 , b = 8/3 , (x_0, y_0, z_0) = (0.01, 0.01, 0.01) , dt = 0.02 after 10^6 iterations: Left in the plane (x, y) . Center in the space (x, y, z) . Right in the plane (x, z) .
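Listing 15.13 is not shown here; a minimal Python sketch of the Lorenz system (→15.12), integrated with RK4 steps (our step size and iteration count differ from the dt = 0.02 and 10^6 iterations used in the figure), illustrates that the orbit remains on a bounded attractor for the standard parameters.

```python
def lorenz_rk4(p, h, a=10.0, r=28.0, b=8.0/3.0):
    """One RK4 step of the Lorenz system (15.12)."""
    def f(s):
        x, y, z = s
        return (a * (y - x), r * x - y - x * z, x * y - b * z)
    def add(s, k, c):
        return (s[0] + c*k[0], s[1] + c*k[1], s[2] + c*k[2])
    k1 = f(p); k2 = f(add(p, k1, h/2)); k3 = f(add(p, k2, h/2)); k4 = f(add(p, k3, h))
    return tuple(p[i] + h/6*(k1[i] + 2*k2[i] + 2*k3[i] + k4[i]) for i in range(3))

p = (0.01, 0.01, 0.01)
traj = [p]
for _ in range(10000):   # t in [0, 100] with h = 0.01
    p = lorenz_rk4(p, 0.01)
    traj.append(p)
print(max(abs(v) for st in traj for v in st))  # the orbit stays bounded
```

Plotting pairs of coordinates from `traj` reproduces the butterfly-shaped projections of →Figure 15.12.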
→Figure 15.13 (center) shows the output of Listing 15.14 when a = −1.4 , b = 1.6 , c = 1 , d = 0.7 , (x_0, y_0) = (π/2, π/2) after 1.5 × 10^6 iterations; →Figure 15.13 (right) shows the output of Listing 15.14.
Figure 15.13 Clifford attractor. Left: a = −1.4 , b = 1.6 , c = 1 , d = 0.3 , (x_0, y_0) = (π/2, π/2) after 2 × 10^6 iterations.
where z_k = x_k + i y_k .
The resulting orbit of the map (→15.14) is generally visualized by plotting z_k in the real–imaginary plane (x, y) , also called the phase-plot. In R, the orbit of the Ikeda attractor can be obtained using Listing 15.15. →Figure 15.14 (left), produced using Listing 15.15, shows a representation of the Ikeda attractor in the plane (x, y) .
Figure 15.14 Left: Ikeda attractor for a = 0.85 , b = 0.9 , k = 0.4 , p = 7.7 , z_0 = 0 after 1.5 × 10^6 iterations. Center: de Jong attractor (→15.16) for a = 1.4 , b = 1.56 , c = 1.4 , d = −6.56 , (x_0, y_0) = (0, 0) after 1.5 × 10^6 iterations. Right: de Jong attractor (→15.15).
In R, the orbit of the Peter de Jong attractor can be obtained using Listing 15.16. →Figure 15.14 (center), produced using Listing 15.16, shows a representation of the de Jong attractor (→15.16), in the plane (x, y) , for a = 1.4 , b = 1.56 , c = 1.4 , d = −6.56 , (x_0, y_0) = (0, 0) after 1.5 × 10^6 iterations. →Figure 15.14 (right), also produced using Listing 15.16, shows a representation of the de Jong attractor (→15.15).
dx/dt = −y − z, (15.17)
dy/dt = x + a y,
dz/dt = b + z(x − c),
where, a, b, and c are the parameters of the attractor. This attractor is known to have some
chaotic behavior for certain values of the parameters.
In R, the system (→15.17) can be solved and its results visualized using Listing 15.17. →Figure 15.15, produced using Listing 15.17, shows some visualizations of the Rössler attractor for different values of its parameters and initial conditions. →Figure 15.15 (left) shows the output of Listing 15.17 when a = 0.5 , b = 2 , c = 4 , (x_0, y_0, z_0) = (0.3, 0.4, 0.5) , dt = 0.03 after 2 × 10^6 iterations. →Figure 15.15 (center) shows the output of Listing 15.17 when a = 0.5 , b = 2 , c = 4 , (x_0, y_0, z_0) = (0.03, 0.04, 0.04) , dt = 0.03 after 2 × 10^6 iterations. →Figure 15.15 (right) shows the output of Listing 15.17.
15.7 Fractals
There exist various definitions of the word “fractal”, and the simplest of these is the one
suggested by Benoit Mandelbrot [→125], who refers to a “fractal” as an object, which possesses
self-similarity. In this section, we will provide examples of implementations for some classical
fractal objects, using R.
x_n =
( x_{n−1} x_{n−1} x_{n−1} )
( x_{n−1} I_{n−1} x_{n−1} )
( x_{n−1} x_{n−1} x_{n−1} ),
where I_{n−1} is a 3^{n−1} by 3^{n−1} matrix of unity elements.
The construction and visualization of the Sierpińsky carpet can be carried out, in R, using Listing 15.18. →Figure 15.16 (left), which is an output of Listing 15.18, shows the visualization of the Sierpińsky carpet after six iterations.
The Sierpińsky triangle can be constructed using the following iterative steps:
Step 1: Select three points (vertices of the triangle) in a two-dimensional plane. Let us call them x_a, x_b, x_c ;
… point p_n ;
In R, the construction and the visualization of the Sierpińsky triangle can be achieved using Listing 15.19. →Figure 15.16 (center), which is an output of Listing 15.19, shows the visualization of the Sierpińsky triangle after 5 × 10^5 iterations.
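The iterative steps above are the so-called chaos game; a minimal Python sketch (vertex coordinates and iteration count our own):

```python
import random

# Chaos game for the Sierpinski triangle: start from any point, then
# repeatedly jump halfway towards a randomly chosen triangle vertex.
random.seed(1)
vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
p = (0.2, 0.2)
points = []
for _ in range(5000):
    vx, vy = random.choice(vertices)
    p = ((p[0] + vx) / 2.0, (p[1] + vy) / 2.0)
    points.append(p)
print(len(points))
```

Scatter-plotting `points` reveals the triangle's self-similar gasket structure after only a few thousand iterations.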
and f, respectively.
Set n = n + 1 and go to Step 3;
Step 5: Plot the points of the sequence (x_0, y_0), (x_1, y_1), … , (x_N, y_N) .
In R, the construction and the visualization of the Barnsley fern can be achieved using Listing 15.20. →Figure 15.16 (right), which is an output of Listing 15.20, shows the visualization of the Barnsley fern after 10^6 iterations.
Figure 15.16 Left: The Sierpińsky carpet. Center: The Sierpińsky triangle. Right: The Barnsley fern.
z_{n+1} = z_n^m + c. (15.18)
For a given value of c, the associated Julia set [→103] is defined by the boundary between the set of z_0 values that have bounded orbits, and those which do not. For instance, when m = 2 , the corresponding set is called the quadratic Julia set.
In R, the construction and the visualization of the Julia set can be done using Listing 15.21 and
Listing 15.22. Figures →15.17, →15.18, →15.19, which have all been produced using Listing 15.21,
illustrate the evolution of the quadratic Julia set according to the value of the complex parameter
c.
Figure 15.17 Quadratic Julia sets. Left: c = 0.7 ; Center: c = −0.074543 + 0.11301i ; Right:
c = 0.770978 + 0.08545i .
Figure 15.18 Quadratic Julia sets. Left: c = 0.7 . Center: c = −0.74543 + 0.11301i . Right:
c = 0.770978 + 0.08545i .
where a and b denote the real and imaginary parts of c = a + ib , respectively. For instance, when m = 2 , the system (→15.19) is called the quadratic Mandelbrot set, and it can be reformulated in R as follows:
x_0 = y_0 = 0,
x_{n+1} = x_n² − y_n² + a, (15.21)
y_{n+1} = 2 x_n y_n + b.
In R, the construction and the visualization of the quadratic Mandelbrot set can be done using
the scripts in Listing 15.23 and Listing 15.24. The graphs in →Figure 15.20, produced using Listing
15.23 and Listing 15.24, illustrate some visualization of the quadratic Mandelbrot set depending
on the values of its parameters.
Figure 15.20 Left: z_{n+1} = z_n² + c . Center: z_{n+1} = c ⋅ cos(z_n)/√0.8 . Right: z_{n+1} = z_n³ − z_n + c − (2/3)/√3 .
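The escape-time test behind such pictures is simple: c belongs to the quadratic Mandelbrot set if the orbit of 0 under z ↦ z² + c never leaves the disk |z| ≤ 2. A Python sketch (function name and iteration cap ours):

```python
def in_mandelbrot(c, max_iter=100):
    """Iterate z <- z**2 + c from z = 0; c is kept if |z| never
    exceeds 2 within max_iter iterations (escape-time test)."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return False
    return True

print(in_mandelbrot(0j), in_mandelbrot(-1 + 0j), in_mandelbrot(1 + 0j))
# True True False
```

Coloring each pixel of the (a, b) plane by the iteration at which the orbit escapes yields the familiar images in →Figure 15.20.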
15.8 Exercises
1. Consider the following dynamical system:
x_{n+1} = x_n² for n = 0, 1, 2, 3, …
Use R to simulate the dynamics of x_n using the initial condition x_0 = 3 , for n = 1, … , 500 .
Plot the corresponding cobweb graph, as well as the graph of the evolution of x_n over time.
2. Consider the following dynamical system:
z_{t+1} − z_t = z_t(1 − z_t) for t = 0, 1, 2, 3, …
Use R to simulate the dynamics of z_t using the initial conditions z_0 = 0.2 and z_0 = 5 , for t = 1, … , 500 .
Plot the corresponding cobweb graph, as well as the graph of the evolution of z_t over time.
3. Let x_n be the number of fish in generation n in a lake, whose evolution is described by a dynamical system.
Use R to simulate the dynamics of the fish population using the initial conditions x_0 = 1 and x_0 = log(8), for n = 1, … , 500.
Plot the corresponding cobweb graph, as well as the graph of the dynamics of the population number, over time.
4. Consider the following predator–prey model for x and y:
dx/dt = Ax − Bxy, (15.22)
dy/dt = −Cy + Dxy.
Use R to solve the system (→15.22) using the following initial conditions and values of the parameters for t ∈ [0, 200]:
(a) x(0) = 81, y(0) = 18, A = 1.5, B = 1.1, C = 2.9, D = 1.2;
Plot the corresponding solutions in the phase plane (x, y) , and the evolution of
the population of both species over time.
5. Use R to plot, in 3D, the Lorenz system (→15.12) using the parameters a = 15, r = 32, b = 3, and the following initial conditions: x_1(0) = 0.03, x_2(0) = 0.03, x_3(0) = 0.03; x_1(0) = 0.5, x_2(0) = 0.21, x_3(0) = 0.55.
16 Graph theory and network analysis
This chapter provides a mathematical introduction to networks and graphs. To facilitate this
introduction, we will focus on basic definitions and highlight basic properties of defining
components of networks. In addition to quantifying network measures for complex networks, e. g., distance- and degree-based measures, we also survey some important graph algorithms,
including breadth-first search and depth-first search. Furthermore, we discuss different classes of
networks and graphs that find widespread applications in biology, economics, and the social
sciences [→10], [→23], [→53].
16.1 Introduction
A network G = (V , E) consists of nodes v ∈ V and edges e ∈ E , see [→94]. Often, an
undirected network is called a graph, but in this chapter we will not distinguish between a network
and a graph and use both terms interchangeably. In →Figure 16.1, we show some examples of
undirected and directed networks. The networks shown on the left-hand side are called undirected
networks, whereas those on the right-hand side are called directed networks since each edge has a
direction pointing from one node to another. Furthermore, all four networks, depicted in →Figure
16.1, are connected [→94], i. e., none of them has isolated vertices. For example, removing the edge between the two nodes of an undirected network with only two vertices leaves merely two isolated nodes.
Weighted networks are obtained by assigning weights to each edge. →Figure 16.2 depicts two weighted, undirected networks (left) and two weighted, directed networks (right). A weight between two vertices, w_AB, is usually a real number. The range of these weights depends on the application context. For example, w_AB could be a positive real number indicating the distance
Figure 16.2 Weighted undirected and directed graphs with two vertices.
the network, which means that the vertices i and j are connected with each other. →Figure 16.3 shows an example of a network with V = {1, 2, 3, 4, 5} and E = {E_12, E_23, E_34, E_14, E_35}. For example, node 3 ∈ V and edge E_34 are part of the network shown by →Figure 16.3. From →Figure 16.3, we further see that node 3 is connected with node 4, but also, node 4 is connected with node 3. For this reason, we call such an edge undirected. In fact, the graph shown by →Figure 16.3 is an undirected network. It is evident that in an undirected network the symbol E_ij has the same meaning as E_ji, because the order of the nodes in this network is not important.
Definition 16.2.1.
An undirected network G = (V, E) consists of a set of vertices V and a set of edges E ⊆ (V choose 2). Here, E ⊆ (V choose 2) means that all edges of G belong to the set of subsets of vertices with 2 elements.
The size of G is the cardinality of the node set V, and is often denoted by |V|. The notation |E| stands for the number of edges in the network. From →Figure 16.3, we see that this network has 5 vertices (|V| = 5) and 5 edges (|E| = 5).
In order to encode a network mathematically, we use a matrix representation. The adjacency matrix A is a square matrix with |V| rows and |V| columns. The matrix elements A_ij of the adjacency matrix provide the connectivity of a network.
Definition 16.2.2.
The adjacency matrix of an undirected network is defined by
A_ij = 1 if there is an edge between i and j in G, and A_ij = 0 otherwise, (16.1)
for i, j ∈ V.
As an example, let us consider the graph in →Figure 16.3. The corresponding adjacency matrix is

        | 0 1 0 1 0 |
        | 1 0 1 0 0 |
    A = | 0 1 0 1 1 |    (16.2)
        | 1 0 1 0 0 |
        | 0 0 1 0 0 |

Since this network is undirected, its adjacency matrix is symmetric, that means A_ij = A_ji.
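As a brief sketch, the adjacency matrix above can be entered directly in R and converted into a network object; the igraph package, which is also used by the listings later in this chapter, is assumed to be installed.

```r
library(igraph)

# Adjacency matrix of the network in Figure 16.3
A <- matrix(c(0, 1, 0, 1, 0,
              1, 0, 1, 0, 0,
              0, 1, 0, 1, 1,
              1, 0, 1, 0, 0,
              0, 0, 1, 0, 0),
            nrow = 5, byrow = TRUE)

# The matrix is symmetric, as expected for an undirected network
isSymmetric(A)      # TRUE

# Convert the matrix into an igraph object
g <- graph_from_adjacency_matrix(A, mode = "undirected")
vcount(g)           # |V| = 5
ecount(g)           # |E| = 5
```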
Definition 16.2.3.
A directed network G = (V, E) consists of a set of vertices V and a set of directed edges E ⊆ V × V.
E ⊆ V × V means that all directed edges of G are subsets of all possible combinations of directed edges. The expression V × V is a Cartesian product, and the corresponding result is a set of directed edges. If u, v ∈ V, then we write (u, v) to express that there exists a directed edge from u to v.
The definition of the adjacency matrix of a directed graph is very similar to that of an undirected graph.
Definition 16.2.4.
The adjacency matrix of a directed network is defined by
A_ij = 1 if there is a directed edge from i to j in G, and A_ij = 0 otherwise, (16.3)
for i, j ∈ V.
In contrast with equation (→16.1), here, we choose the start vertex (i) and the end vertex (j) of a directed edge. →Figure 16.5 presents a directed network with the following adjacency matrix:

        | 0 0 0 0 0 |
        | 1 0 1 0 0 |
    A = | 0 0 0 1 1 |    (16.4)
        | 1 0 0 0 0 |
        | 0 0 0 0 0 |

Here, we can see that A ≠ A^t, i. e., the adjacency matrix of a directed network is, in general, not symmetric. For example, the edge set of the directed network, depicted in →Figure 16.5, is
E = {(2, 1), (2, 3), (4, 1), (3, 4), (3, 5)}.
Definition 16.2.5.
The adjacency matrix of a weighted network is defined by
W_ij = w_ij if there is a connection from i to j in G, and W_ij = 0 otherwise, (16.5)
for i, j ∈ V.
In equation (→16.5), w_ij ∈ R denotes the weight associated with an edge from vertex i to vertex j.
→Figure 16.6 depicts the weighted, directed network with the following adjacency matrix:

        | 0 0 0 0 0 |
        | 2 0 1 0 0 |
    W = | 0 0 0 3 3 |    (16.6)
        | 1 0 0 0 0 |
        | 0 0 0 0 0 |

From the adjacency matrix W, we can identify the following (real) weights: w_21 = 2, w_23 = 1, w_34 = 3, w_35 = 3, w_41 = 1.
Definition 16.2.6.
A walk w of length μ in a network is a sequence of μ edges, which are not necessarily different. We write w = v_1v_2, v_2v_3, … , v_{μ−1}v_μ. We also call the walk w closed if v_1 = v_μ.
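Returning to the weighted, directed network of →Figure 16.6, such a network can be sketched in R from its weight matrix; the matrix below is assembled from the weights w_21 = 2, w_23 = 1, w_34 = 3, w_35 = 3, w_41 = 1 identified in the text, and igraph is assumed to be installed.

```r
library(igraph)

# Weight matrix of the network in Figure 16.6, built from the weights
# w_21 = 2, w_23 = 1, w_34 = 3, w_35 = 3, w_41 = 1 given in the text
W <- matrix(c(0, 0, 0, 0, 0,
              2, 0, 1, 0, 0,
              0, 0, 0, 3, 3,
              1, 0, 0, 0, 0,
              0, 0, 0, 0, 0),
            nrow = 5, byrow = TRUE)

g <- graph_from_adjacency_matrix(W, mode = "directed", weighted = TRUE)
as_edgelist(g)   # the directed edges (2,1), (2,3), (3,4), (3,5), (4,1)
E(g)$weight      # the corresponding edge weights
```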
Definition 16.2.7.
A path P is a special walk, where all the edges and all the vertices are different.
In a directed graph, a closed path is also called a cycle.
Let us illustrate these definitions by way of the network examples depicted in →Figure 16.7. If we consider the upper graph on the left-hand side, we see that 12, 23, 34 is an undirected path, as all vertices and edges are different. This path has a length of 3. On the other hand, in the upper graph on the right-hand side, 12, 23, 32 is a walk of length 3. By considering the same graph, we also find that 14, 43, 34, 41 is a closed walk, as it starts and ends in vertex 1. This closed walk has a length of 4.
Now, let us consider the lower graph on the left-hand side of →Figure 16.7. In this graph, 12, 23, 34 is a directed path of length 3, as the underlying graph is directed.
In the lower graph on the right-hand side, the path 23, 34, 41 has a length of 3, but does not represent a cycle, as its start and end vertices are not the same.
Now, we define the term distance between vertices in a network.
Definition 16.2.8.
The number of edges in the shortest path connecting the vertices u and v is the topological
distance d(u, v) .
Again, we consider the upper graph on the right hand side of →Figure 16.7. For instance, the
path 12, 23, 34, for going from vertex 1 to vertex 4 has length 3 and is obviously not the shortest
one. Calculating the shortest path yields d(1, 4) = 1 .
Definition 16.3.1.
Let G = (V, E) be a network. The degree k_i of the vertex v_i is the number of edges that are incident with v_i.
Definition 16.3.2.
The degree distribution of a network G with N vertices is defined by P(k) = N_k / N, (16.7) where N_k denotes the number of vertices possessing degree k.
It is clear that equation (→16.7) represents the proportion of vertices in G possessing degree k.
Degree-based statistics have been used in various application areas in computer science. For example, it is known that the vertex degrees of many real-world networks, such as www-graphs and social networks [→2], [→23], [→24], are not Poisson distributed. Instead, the following power law holds:
P(k) ∼ k^(−γ), γ > 1. (16.8)
The local clustering coefficient of a vertex i is defined as follows:
C_i = 2 e_i / (n_i (n_i − 1)) = e_i / t_i. (16.9)
Here, n_i is the number of neighbors of vertex i, and e_i is the number of pairs of these neighbors that are themselves connected with each other; t_i = n_i (n_i − 1)/2 is the total number of such pairs. →Figure 16.8 depicts an example graph, as well as the calculation of the corresponding local clustering coefficient.
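In igraph, the local clustering coefficient (16.9) is provided by transitivity() with type = "local". The small example graph below is our own choice, not the one in →Figure 16.8.

```r
library(igraph)

# A small example graph: A-B-C form a triangle, D hangs off C
g <- make_graph(~ A-B, A-C, B-C, C-D)

# Local clustering coefficient C_i for every vertex (equation (16.9))
Ci <- transitivity(g, type = "local")
Ci
# Vertex A has neighbors {B, C}, which are connected: C_A = 1.
# Vertex C has neighbors {A, B, D}; only the pair (A, B) is
# connected, so C_C = 1/3. Vertex D has one neighbor, so C_D is NaN.
```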
Definition 16.3.3.
The distance matrix is defined by
(d(v_i, v_j))_{v_i, v_j ∈ V}. (16.10)
Definition 16.3.4.
The average distance of a network G with N vertices is defined by
d̄(G) := (1 / (N choose 2)) Σ_{1 ≤ i < j ≤ N} d(v_i, v_j). (16.11)
We also define other well-known distance-based graph measures [→94] that have been used
extensively in various disciplines [→57], [→197].
Definition 16.3.5.
Definition 16.3.6.
Definition 16.3.7.
The degree centrality of a vertex v is defined by
C_D(v) = k_v. (16.15)
When analyzing directed networks, the degree centrality can be defined straightforwardly by
utilizing the definition of the in-degree and out-degree [→94]. Now, let us define the well-known
betweenness centrality measure [→80], [→81], [→159], [→197].
Definition 16.3.9.
The betweenness centrality of a vertex v_k is defined by
C_B(v_k) = Σ_{v_i, v_j ∈ V, v_i ≠ v_j} σ_{v_i v_j}(v_k) / σ_{v_i v_j},
where σ_{v_i v_j} denotes the number of shortest paths between v_i and v_j, and σ_{v_i v_j}(v_k) the number of those paths passing through v_k. The ratio σ_{v_i v_j}(v_k) / σ_{v_i v_j} can be seen as the probability that v_k lies on a shortest path connecting v_i with v_j.
Definition 16.3.10.
When there exists more than one shortest path connecting v_k with v_i, d(v_k, v_i) remains unchanged.
The measure C_C(v_k) has often been used to determine how close a vertex is to the other vertices of a network.
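The centrality measures discussed in this section are all available in igraph. As a sketch, consider a star graph, where the intuition is easy to check; the graph itself is our own example.

```r
library(igraph)

# A star graph: vertex A is connected to four leaves
g <- make_graph(~ A-B, A-C, A-D, A-E)

degree(g)        # degree centrality C_D: A has degree 4, the leaves 1
betweenness(g)   # A lies on all 6 shortest paths between pairs of leaves
closeness(g)     # A has the smallest total distance to all other vertices
```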
Figure 16.10 The last two graphs when running BFS on the input graph shown in →Figure 16.9.
Figure 16.11 The first graph is the input graph to run DFS. The start vertex is vertex 0. The steps
are shown together with a stack showing the visited and parent vertices.
Figure 16.12 The last five graphs when running DFS on the input graph shown in →Figure 16.11.
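The traversals illustrated in →Figures 16.9 to 16.12 can be reproduced with igraph's bfs() and dfs() functions. The example graph below is our own; note that igraph numbers vertices from 1, whereas the figures use 0 as the start vertex.

```r
library(igraph)

# A small undirected example graph
g <- make_graph(edges = c(1,2, 1,3, 2,4, 3,4, 4,5), directed = FALSE)

# Breadth-first search: visit the start vertex, then all its
# neighbors, then their neighbors, and so on
bfs(g, root = 1)$order

# Depth-first search: follow one branch as deep as possible
# before backtracking
dfs(g, root = 1)$order
```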
16.4.3 Shortest paths
Determining the shortest paths in networks has been a long-standing problem in graph theory
[→38], [→58]. For instance, finding the flight with the earliest arrival time in a given aviation
network [→117] requires the determination of all shortest paths. Other examples for the
application of shortest paths are graph optimization problems, e. g., for transportation networks
of production processes [→156].
A classical algorithm for determining the shortest paths within networks is due to Dijkstra
[→58]. It is interesting to note that many problems in algorithmic graph theory, e. g., determining
minimum spanning trees (see Section →16.4.4) and breadth first search also utilize Dijkstra’s
method, see [→38].
Dijkstra’s method can be described as follows. Given a network G = (V , E) and a starting
vertex v ∈ V, the algorithm finds the shortest paths to all other vertices in G. In this case,
Dijkstra’s algorithm [→58] generates a so-called shortest path tree containing all the vertices that
lie on the shortest path.
We describe the basic steps of the algorithm of Dijkstra in order to determine the shortest paths
starting from a given vertex to all other vertices in G. Here, we assume that the input graph has
vertex labels and real edge labels [→38], [→58]:
We create the set of shortest path trees (SPTS), containing the vertices that are in a shortest path tree. These vertices have the property that they have minimum distance from the starting vertex. Before starting, it holds SPTS = ∅.
We assign the initial distance value ∞ to all vertices in the input graph, and we set the distance value of the starting vertex equal to zero.
While the vertex set of SPTS does not contain all vertices of the input graph, the following steps apply:
Select a vertex v ∈ V that is not contained in the vertex set of SPTS and has minimum distance.
Put v ∈ V into the vertex set of SPTS.
Update the distance values of all vertices that are adjacent to v ∈ V. To update the distances, we iterate over all adjacent vertices. For each vertex u ∈ V adjacent to v ∈ V, we perform the following: if the sum of the distance value of v (from the starting vertex) and the weight of the edge {v, u} is less than the distance value of u, update the distance value of u.
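The steps above can be sketched directly in base R. The function below is a minimal from-scratch implementation, kept deliberately simple (no priority queue, no predecessor tracking); the weight matrix at the end is our own example, not the graph of →Figure 16.13.

```r
# Dijkstra's algorithm on a weight matrix W, where W[i, j] > 0 means
# an edge of that weight and W[i, j] == 0 means "no edge"
dijkstra <- function(W, start) {
  n <- nrow(W)
  dist <- rep(Inf, n); dist[start] <- 0   # initial distance values
  in_spts <- rep(FALSE, n)                # vertices already in SPTS
  for (step in seq_len(n)) {
    # select the vertex outside SPTS with minimum distance
    cand <- which(!in_spts)
    v <- cand[which.min(dist[cand])]
    in_spts[v] <- TRUE
    # update the distance values of all vertices adjacent to v
    for (u in which(W[v, ] > 0)) {
      if (dist[v] + W[v, u] < dist[u]) dist[u] <- dist[v] + W[v, u]
    }
  }
  dist
}

# A small weighted, undirected example graph
W <- matrix(0, 4, 4)
W[1, 2] <- W[2, 1] <- 1
W[2, 3] <- W[3, 2] <- 2
W[1, 3] <- W[3, 1] <- 5
W[3, 4] <- W[4, 3] <- 1
dijkstra(W, 1)   # shortest distances from vertex 1: 0 1 3 4
```

The direct edge 1–3 of weight 5 is correctly replaced by the cheaper route 1–2–3 of total weight 3.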
Now we demonstrate the application of this algorithm with an example. The input graph (A) is given in →Figure 16.13. The vertex set of SPTS is initially empty, and we choose vertex 1 as the start node. The initial distance values can be seen in →Figure 16.13 (B). We perform some steps to see how the set of shortest paths emerges; see →Figure 16.14. The vertices highlighted in red are the ones in the shortest path tree. The graph shown in →Figure 16.14 in situation D is the final shortest path tree, containing all vertices of the input graph in →Figure 16.13. That means the set of shortest path trees gives all shortest paths from vertex 1 to all other vertices.
Figure 16.13 (A) The input graph. (B) The graph with initial vertex weights.
Figure 16.14 Steps when running Dijkstra’s algorithm for the graph in (A) shown in →Figure
16.13.
As a remark, we would like to note that the graph shown in →Figure 16.13 is a weighted, undirected graph (see Section →16.2.3). Using the algorithm of Dijkstra [→58] makes sense for edge-weighted graphs, as the shortest path between two vertices of a graph depends on these weights. Interestingly, the shortest path problem becomes simpler if we consider unweighted networks. If all edges in a network are unweighted, we may set all edge weights to 1. Then, Dijkstra's algorithm reduces to the search for the topological distances between vertices, see Definition →16.2.8.
Let us consider the graph A in →Figure 16.15. If we determine all shortest paths from vertex 1 to vertex 4, we see that there exists more than one shortest path between these two vertices. We find the shortest paths 1-3-4 and 1-2-4. So, the shortest path problem does not
possess a unique solution. The same holds when considering the shortest paths between vertex 1
and vertex 5. The calculations yield the two shortest paths 1-3-4-5 and 1-2-4-5.
Another example of calculating shortest paths in unweighted networks is given by the graph in B shown by →Figure 16.15. The graph shown, P_n, is referred to as the path graph [→186] with n vertices. We observe that there exist n − 1 pairs of vertices with d(u, v) = 1, n − 2 pairs of vertices with d(u, v) = 2, and so forth. Finally, we see that there exists only n − (n − 1) = 1 pair with d(u, v) = n − 1. Here, d(u, v) = n − 1 is just the diameter of P_n.
In Listing 16.1, we show an example of how shortest paths can be found by using R. For this example, we use a small-world network with n = 25 nodes. The command distances() gives only the lengths of paths, whereas the command shortest_paths() provides one shortest path. In contrast, all_shortest_paths() returns all shortest paths.
A spanning tree T = (V_T, E_T) of G is a tree, where V_T = V_G. In this case, we say that the tree T spans G, as the vertex sets of the two graphs are the same and every edge of T belongs to G.
→Figure 16.16 shows an input graph G with a possible spanning tree T. It is obvious, by definition,
that there often exists more than one spanning tree of a given graph G. The problem of
determining spanning trees gets more complex if we consider weighted networks. In case we
start with an edge-labeled graph, one could determine the so-called minimum spanning tree
[→38]. This can be achieved by adding up the costs of all edge weights and, finally, searching for
the tree with minimum cost among all existing spanning trees. Again, the minimum spanning tree
for a given network is not unique. For instance, well-known algorithms to determine the minimum
spanning tree are due to Prim and Kruskal, see, e. g., [→38]. We emphasize that the application of
those methods may result in different minimum spanning trees. Here, we just demonstrate
Kruskal’s algorithm [→38] representing a greedy approach. Let G = (V , E) be a connected
graph with real edge weights. The main steps are the following:
We arrange the edges according to their weights in ascending order
We add edges to the resulting minimum spanning tree as follows: we start with the smallest
weight and end with the largest weight by consecutively adding edges according to their
weight costs
We only add the described edges if the process does not create a cycle
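The steps above can be sketched in base R with a simple union-find structure to detect cycles. The function and the edge list below are our own minimal example, not the graph of →Figure 16.17.

```r
# Kruskal's algorithm: edges is a matrix with rows (from, to, weight),
# n is the number of vertices
kruskal <- function(edges, n) {
  parent <- seq_len(n)   # union-find: each vertex is its own root
  find <- function(x) { while (parent[x] != x) x <- parent[x]; x }
  # arrange the edges according to their weights in ascending order
  edges <- edges[order(edges[, 3]), , drop = FALSE]
  mst <- matrix(numeric(0), ncol = 3)
  for (i in seq_len(nrow(edges))) {
    ru <- find(edges[i, 1])
    rv <- find(edges[i, 2])
    if (ru != rv) {              # the edge does not create a cycle
      parent[ru] <- rv           # merge the two components
      mst <- rbind(mst, edges[i, ])
    }
  }
  mst
}

edges <- rbind(c(1, 2, 4), c(2, 3, 1), c(1, 3, 2), c(3, 4, 3))
kruskal(edges, 4)   # picks the edges of weight 1, 2, and 3 (total 6)
```

The edge (1, 2) of weight 4 is skipped, because vertices 1 and 2 are already connected via vertex 3 when it is considered.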
→Figure 16.17 shows the sequence of steps when applying Kruskal's algorithm to the shown input graph G. We choose any edge with the smallest weight, as depicted in situation A. In B, we choose the next smallest edge, and so on. We repeat this procedure according to the algorithmic steps above, skipping any edge that would create a cycle. Note that intermediate trees can be disconnected (see C). One possible minimum spanning tree is shown in situation E. Differences between the algorithms due to Kruskal and Prim are explained in, e. g., [→38].
In Listing 16.2, we show an example of how the minimum spanning tree can be found by using R. For this example, we use a small-world network with n = 25 nodes. The command mst() gives the underlying minimum spanning tree.
16.5.1 Trees
We start with the formal definition of a tree [→94], already briefly introduced in Section →16.4.4.
Definition 16.5.1.
A tree is a graph G = (V , E) that is connected and acyclic. A graph is acyclic if it does not contain
any cycle.
In fact, there exist several characterizations for trees which are equivalent [→100].
Theorem 16.5.1.
Let G = (V , E) be a graph, and let |V | := N . The following assertions are equivalent:
1. G = (V , E) is a tree.
Definition 16.5.2.
A rooted tree is a tree containing one designated root vertex. There is a unique path from the root
vertex to all other vertices in the tree, and all other vertices are directed away from the root.
→Figure 16.18 presents a rooted tree, in which the root is at the very top of a tree, whereas all
other vertices are placed on some lower levels. The tree in →Figure 16.18 is an unordered tree,
that means, the order of the vertices is arbitrary. For instance, the order of the green and orange
vertex can be swapped.
Figure 16.18 A rooted tree with its designated root vertex.
Definition 16.5.3.
An ordered tree is a rooted tree assigning a fixed order to the children of each vertex.
Definition 16.5.4.
A binary tree is an ordered tree, where each vertex has at most two children.
Definition 16.5.5.
A generalized tree GT is defined by a vertex set V, an edge set E, a level set L, and a multilevel function L. The edge set E will be defined in Definition →16.5.7. The vertex and edge sets define the connectivity, and the level set and the multilevel function induce a hierarchy between the nodes of GT. The index r ∈ V indicates the root.
The multilevel function is defined as follows [→63].
Definition 16.5.6.
Definition 16.5.7.
A generalized tree as defined by Definition →16.5.5 has three edge types [→63]:
Edges with |L(m) − L(n)| = 1 are called kernel edges (E_1).
Note that for an ordinary rooted tree as defined by Definition →16.5.2, we always obtain
|L (m) − L (n)| = 1 for all pairs (m, n) . From the above given definitions and the visualization
in →Figure 16.19, it is clear that a generalized tree is a tree-like graph with a hierarchy, and may
contain cycles.
In what follows, we survey important properties of random networks [→59]. For instance, the degree distribution of a vertex v_i follows a binomial distribution,
P(k_i = k) = (N−1 choose k) p^k (1 − p)^(N−1−k), (16.20)
since the maximum degree of the vertex v_i is at most N − 1; in fact, the probability that the vertex has k edges equals p^k (1 − p)^(N−1−k), and there exist (N−1 choose k) possibilities to choose k edges from N − 1 vertices.
Considering the limit N → ∞, Equation (→16.20) yields
P(k_i = k) ∼ z^k exp(−z) / k!. (16.21)
We emphasize that z = p(N − 1) is the expected number of edges for a vertex. This implies that
if N goes to infinity, the degree distribution of a vertex in a random network can be approximated
by the Poisson distribution. For this reason, random networks are often referred to as Poisson
random networks [→142].
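The Poisson approximation (16.21) is easy to check numerically: generate a random network and compare its empirical degree distribution with dpois(). The values of N and p below are arbitrary choices for illustration.

```r
library(igraph)

set.seed(1)
N <- 1000; p <- 0.005
g <- sample_gnp(N, p)            # an Erdos–Renyi random network

z <- p * (N - 1)                 # expected degree of a vertex
k <- 0:15
empirical <- tabulate(degree(g) + 1, nbins = 16) / N
poisson <- dpois(k, lambda = z)

# The two columns should be close for large N
round(cbind(k, empirical, poisson), 3)
```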
In addition, one can demonstrate that the degree distribution of the whole random network also approximately follows a Poisson distribution:
P(X_k = r) ∼ z^r exp(−z) / r!. (16.22)
This means that there exist X_k = r vertices in the network that possess degree k [→4].
As an application, we recall the already introduced clustering coefficient C_i for a vertex v_i, represented by equation (→16.9). In general, this quantity has been defined as the ratio of the number |E_i| of existing connections among the k_i nearest neighbors of v_i divided by the total number of possible connections among them. Therefore, C_i is the probability that two neighbors of v_i are connected with each other in a random network, which yields
C_i ∼ z / N. (16.24)
vertex, the connection to its next neighbor (1st neighbor) is highlighted in blue, and the connection to its second next neighbor (2nd neighbor) in red.
Second, start with an arbitrary vertex i and rewire its connection to its nearest neighbor on, e. g., the right side with probability p_rw to any other vertex j in the network. Then, choose the next vertex in the ring in a clockwise direction and repeat this procedure.
Third, after all first-neighbor connections have been checked, repeat this procedure for the second and all higher-order neighbors, if present, successively.
This algorithm guarantees that each connection occurring in the network is chosen exactly once and rewired with probability p_rw. Hence, the rewiring probability, p_rw, controls the disorder of the resulting network topology: p_rw = 1 results in a random network, whereas intermediate values 0 < p_rw < 1 give a topological structure that is between a regular lattice and a random network.
The generation of a small-world network by using the Watts–Strogatz algorithm consists of two main parts:
First, the adjacency matrix is initialized in a way that only the nearest k/2 neighbor vertices are connected. The order of the vertices is arbitrarily induced by the labeling of the vertices from 1 to N. This allows identifying, e. g., i + f as the fth neighbor of vertex i with f ∈ N. For instance, f = 1 corresponds to the next neighbor of i. The modulo function is used to ensure that the neighbor indices f remain in the range {1, … , N}. Due to this fact, the vertices can be seen as organized on a ring. We would like to emphasize that, for the algorithm to work, the number of neighbors k needs to be an even number.
Second, each connection in the network is tested once to decide whether it should be rewired with probability p_rw. To do this, a random number, c, between 0 and 1 is uniformly sampled, and the connection is rewired if c < p_rw. In this case, we first need to remove the old connection between these vertices and then draw a random integer, d, from {1, … , N} ∖ {i} to select a new vertex to connect with i. We would like to note that, in order to avoid a self-connection of vertex i, we need to remove the index i from the set {1, … , N}.
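The two parts described above are implemented in igraph's sample_smallworld() function, which builds the ring lattice and then rewires each edge with the given probability. The parameter values below are arbitrary choices for illustration.

```r
library(igraph)

set.seed(1)
# Ring of 25 vertices, each connected to its 2 nearest neighbors on
# either side (so k = 4), with rewiring probability p_rw = 0.1
g <- sample_smallworld(dim = 1, size = 25, nei = 2, p = 0.1)

# p = 0 would leave the ring lattice unchanged; p = 1 yields an
# (almost) random network
transitivity(g)    # relatively high clustering
mean_distance(g)   # short average path length
```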
To explain this common feature, Barabási and Albert introduced a model [→8], now known as the Barabási–Albert (BA) or preferential attachment model [→142]. This model results in so-called scale-free networks, which have a degree distribution following a power law [→8]. A major difference
between the preferential attachment model and the other algorithms, described above, for
generating random or small-world networks is that the BA model does not assume a fixed
number of vertices, N, and then rewires them iteratively with a fixed probability, but in this model
N grows. Each newly added vertex is connected with a certain probability (which is not constant)
to other vertices already present in the network. The attachment probability defined by
p_i = k_i / Σ_j k_j (16.26)
is proportional to the degree k_i of these vertices, explaining the name of the model. This way, vertices with a high degree are more likely to attract new connections than low-degree vertices.
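The preferential attachment model is available in igraph as sample_pa(); the network size and the number of edges per new vertex below are our own choices.

```r
library(igraph)

set.seed(1)
# Grow a network to n = 1000 vertices; each new vertex attaches to
# m = 2 existing vertices with probability (16.26)
g <- sample_pa(n = 1000, power = 1, m = 2, directed = FALSE)

# The degree distribution is heavy-tailed: a few hubs, many
# low-degree vertices
max(degree(g))
fit_power_law(degree(g))$alpha   # estimate of the exponent gamma
```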
16.7 Summary
Despite the fact that graph theory is a mathematical subject, similar to linear algebra and analysis, it has a close connection to practical applications. For this reason, many real-world networks have been studied in many disciplines, such as chemistry, computer science, and economics [→64], [→65], [→143]. A possible explanation for this is provided by the intuitive representation of many natural networks, e. g., transportation networks of trains and planes, acquaintance networks between friends, or social networks in Twitter or Facebook. Also, many attributes of graphs, e. g., paths or the
degrees of nodes, have a rather intuitive meaning. This motivates the widespread application of
graphs and networks in nearly all application areas. However, we have also seen in this chapter
that the analysis of graphs can be quite intricate, requiring a thorough understanding of the
previous chapters.
16.8 Exercises
1. Let G = (V, E) be a graph with V = {1, 2, 3, 4, 5} and E = {{1, 2}, {2, 4}, {1, 3}, {3, 4}, {4, 5}}. Use R to obtain the following results:
Calculate all vertex degrees of G.
Calculate all shortest paths of G.
Calculate diam(G).
Calculate the number of cycles of G.
2. Generate 5 arbitrary trees with 10 vertices. Calculate their number of edges by using R, and confirm |E| = 10 − 1 = 9 for all 5 generated trees.
3. Let G = (V, E) be a graph with V = {1, 2, 3, 4, 5, 6} and E = {{1, 2}, {2, 4}, {1, 3}, {3, 4}, {4, 5}, {5, 6}}. Calculate the number of
Example 17.1.1.
If we toss a coin once, there are two possible outcomes. Either we obtain a “head” (H) or a “tail” (T). Each of these outcomes is called an elementary event, ω_i (or a sample point). In this case, the sample space is Ω = {H, T}.
Example 17.1.2.
If we toss a coin three times, the sample space is Ω = {(H, H, H), (T, H, H), (H, T, H), … , (T, T, T)}, and the elementary outcomes are triples.
From the second example, it is clear that although there are only two elementary outcomes, i. e., H and T, the size of the sample space can grow by repeating such base experiments.
Definition 17.2.1.
A set, A, containing no elements is called an empty set, and it is denoted by ∅.
Definition 17.2.2.
If for every element a ∈ A we also have a ∈ B , then A is a subset of B, and this relationship is
denoted by A ⊂ B .
Definition 17.2.3.
The complement of a set A with respect to the entire space Ω, denoted Ā or A^c, is the set of all elements of Ω that are not in A.
There is a helpful graphical visualization of sets, called a Venn diagram, that allows an insightful representation of set operations. In →Figure 17.1 (left), we visualize the complement of a set A. In this figure, the entire space Ω is represented by the large square, and the set A is the inner circle (blue), whereas its complement Ā is the area around it (white). In contrast, in →Figure 17.1 (right), the set Ā is the outer shaded area, and A is the inner circle (white).
Definition 17.2.4.
Two sets A and B are called equivalent if A ⊂ B and B ⊂ A . In this case A = B .
Definition 17.2.5.
The intersection of two sets A and B consists only of the points that are in A and in B, and such a
relationship is denoted by A ∩ B , i. e., A ∩ B = {x ∣ x ∈ A and x ∈ B} .
Definition 17.2.6.
The union of two sets A and B consists of all points that are either in A or in B, or in A and B, and
this relationship is denoted by A ∪ B , i. e., A ∪ B = {x ∣ x ∈ A or x ∈ B} .
→Figure 17.2 provides a visualization of the intersection (left) and the union (right) of two sets
A and B.
Definition 17.2.7.
The set difference between two sets, A and B, consists of the points that are only in A, but not in B,
and this relationship is denoted by A ∖ B , i. e., A ∖ B = {x ∣ x ∈ A and x ∉ B} .
Using R, the four aforementioned set operations can be carried out as follows:
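A sketch using base R's built-in set functions (the book's original listing is not reproduced here):

```r
A <- c(1, 2, 3, 4)
B <- c(3, 4, 5)

setequal(A, B)    # equivalence of sets: FALSE here
intersect(A, B)   # 3 4
union(A, B)       # 1 2 3 4 5
setdiff(A, B)     # 1 2
```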
These commands represent the computational realization of the above Definitions →17.2.4 to
→17.2.7, which describe the equivalence, intersection, union, and set difference of sets.
Figure 17.2 Venn diagrams of two sets. Left: Intersection of A and B, A ∩ B . Right: Union of A and
B, A ∪ B .
Theorem 17.2.1.
For three given sets A, B, and C, the following relations hold:
1. Commutativity: A ∪ B = B ∪ A , and A ∩ B = B ∩ A .
2. Associativity: A ∪ (B ∪ C) = (A ∪ B) ∪ C , and A ∩ (B ∩ C) = (A ∩ B) ∩ C .
3. Distributivity: A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C) , and
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) .
4. (A^c)^c = A.
For the complement of a set, a bar over the symbol is frequently used instead of the superscript “c”, i. e., Ā = A^c.
Definition 17.2.8.
Two sets A_1 and A_2 are called mutually exclusive if the following holds: A_1 ∩ A_2 = ∅.
If n sets A_i with i ∈ {1, … , n} are mutually exclusive, then A_i ∩ A_j = ∅ holds for all i and j with i ≠ j.
De Morgan's law states that
(A ∩ B)^c = A^c ∪ B^c. (17.2)
From the above relationship, a negation of a union leads to an intersection, and vice versa. Therefore, De Morgan's laws provide a means for interchanging a union and an intersection via an application of a negation.
Axiom 17.3.1.
For every event A,
Pr (A) ≥ 0. (17.3)
Axiom 17.3.2.
For the sample space Ω,
Pr (Ω) = 1. (17.4)
Axiom 17.3.3.
For every infinite sequence of mutually exclusive events {A_1, A_2, …},
Pr(A_1 ∪ A_2 ∪ ⋯) = Σ_{i=1}^∞ Pr(A_i). (17.5)
Definition 17.3.1.
We call Pr (A) a probability of event A if it fulfills all the three axioms above.
Such a probability is also called a probability measure on the sample space Ω. For clarity, we repeat that Ω contains the outcomes of all possible events. There are different conventions to denote such a probability; frequent choices are “Pr” or “P”. In the following, we use the latter for brevity.
These three axioms form the basis of probability theory, from which all other properties can
be derived.
From the definition of a probability and the three axioms above, a couple of useful identities follow, including:
1. If A ⊂ B, then P(A) ≤ P(B).
2. For every event A, 0 ≤ P(A) ≤ 1.
3. For every event A, P(A^c) = 1 − P(A).
4. P(∅) = 0.
Probabilities are called coherent if they obey the rules from the three axioms above. Examples for
the contrary will be given below.
We would like to note that the above definition of probability does not describe how to quantify it. Classically, Laplace provided such a quantification for equiprobable elementary outcomes, i. e., for p(ω_i) = 1/m with Ω = {ω_1, … , ω_m}. In this case, the probability of an event A is given by the number of elements in A divided by the total number of possible events, i. e., p(A) = |A|/m. In practice, not all problems can be captured by this approach, because usually the elementary outcomes are not equiprobable. For this reason, a frequentist or a Bayesian quantification of probability, which holds for general probability values, is used [→91], [→161].
P(A) = Σ_{i=1}^k P(A|B_i) P(B_i). (17.9)
Proof.
From the identity
A = A ∩ Ω (17.10)
we have
A = A ∩ (B1 ∪ ⋯ ∪ Bk ), (17.11)
since {B_1, … , B_k} is a partition of Ω. Because the events A ∩ B_i are mutually exclusive, we obtain P(A) as follows:
P(A) = P((A ∩ B_1) ∪ ⋯ ∪ (A ∩ B_k)) (17.13)
= P(A ∩ B_1) + ⋯ + P(A ∩ B_k) (17.14)
= Σ_{i=1}^k P(A|B_i) P(B_i). (17.16)
Definition 17.5.1.
Two events A and B are called independent, or statistically independent, if one of the following conditions holds:
1. P(AB) = P(A)P(B);
3. Ā and B̄ are independent.
The extension to more than two events deserves attention, because it requires independence
among all subsets of the events.
Definition 17.5.2.
The n events A_1, A_2, … , A_n ∈ A are called independent if the following condition holds for all subsets I of {1, … , n}:
P(∩_{i∈I} A_i) = ∏_{i∈I} P(A_i). (17.17)
Definition 17.6.1.
For a given sample space Ω, a random variable X is a function that assigns to each event A ∈ Ω a
real number, i. e., X(A) = x ∈ R with X : Ω → R . The codomain of the function X is
C = {x ∣ x = X(A), A ∈ Ω} ⊂ R .
In the above definition, we emphasized that a random variable is a function, assigning real numbers to events. For brevity, this is mostly neglected when one speaks about random variables. However, it should not be forgotten.
Furthermore, we want to note that the probability function has not been used explicitly in the
definition. However, it can be used to connect a random variable to the probability of an event.
For example, given a random variable X and a subset of its codomain S ⊂ C , we obtain
P (X ∈ S) = P ({a ∈ Ω ∣ X(a) ∈ S}), (17.18)
since {a ∈ Ω ∣ X(a) ∈ S} ⊂ Ω .
Similarly, for a single element, S = {x}, we obtain
P (X = x) = P ({a ∈ Ω ∣ X(a) = x}). (17.19)
In this way, the probability values for events are clearly defined.
Definition 17.6.2.
The cumulative distribution function of a random variable X is a function F_X : R → [0, 1] defined by

F_X(x) = P(X ≤ x).   (17.20)

In this definition, the right-hand side term is interpreted as in equations (→17.18) and (→17.19) by

P(X ≤ x) = P({a ∈ Ω ∣ X(a) ≤ x}).   (17.21)
Example 17.6.1.
Suppose that we have a fair coin and define a random variable by X(H ) = 1 and X(T ) = 0 for
a probability space with Ω = {H , T } . We can find a piecewise definition of the corresponding
distribution function as follows:

F_X(x) = P(∅) = 0 for x < 0;
F_X(x) = P({T}) = 1/2 for 0 ≤ x < 1;
F_X(x) = P({T, H}) = 1 for x ≥ 1.   (17.22)

The circle at the end of the steps in →Fig. 17.3 means that the end points are not included, but all points up to the end points themselves are. Mathematically, this corresponds to an interval that is open on the right, indicated by ")", e. g., [0, 1) for the second step in →Fig. 17.3.

Theorem 17.6.1.
The cumulative distribution function, F(x), has the following properties:
1. F(−∞) = lim_{x→−∞} F(x) = 0;
2. F(∞) = lim_{x→∞} F(x) = 1;
3. F(x+) = F(x), i. e., F is continuous from the right;
4. P(X > x) = 1 − F(x);
5. P(x_1 < X ≤ x_2) = F(x_2) − F(x_1);
6. P(X = x) = F(x) − F(x−);
7. P(x_1 ≤ X ≤ x_2) = F(x_2) − F(x_1−).
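The step function of equation (→17.22) can be sketched in R with the base function stepfun, which produces exactly the right-continuous steps shown in →Fig. 17.3:

```r
# The distribution function F_X of eq. (17.22) as a right-continuous
# step function: 0 for x < 0, 1/2 for 0 <= x < 1, 1 for x >= 1.
F_X <- stepfun(x = c(0, 1), y = c(0, 0.5, 1))

F_X(-0.5)  # 0   : x < 0
F_X(0)     # 0.5 : right-continuity at the jump point
F_X(0.3)   # 0.5 : 0 <= x < 1
F_X(2)     # 1   : x >= 1
# plot(F_X, verticals = FALSE)  # reproduces the steps of Fig. 17.3
```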
From the connection between a random variable and its probability value, given by equation
(→17.18), we can now introduce the definition of discrete and continuous random variables as
well as their corresponding distributions.
Definition 17.7.1.
If a random variable, X, can only assume a finite number of different values, e. g., x_1, …, x_n, then X is called a discrete random variable.
Definition 17.7.2.
Let X be a discrete random variable. The probability function of X, denoted f (x) , is defined for
every real number, x, as follows:
f (x) = P (X = x). (17.23)
Given these two definitions and the properties of probability values, it can be shown that the
following conditions hold:
1. f (x) = 0 , if x is not a possible value of the random variable X;
2. ∑_{i=1}^{n} f(x_i) = 1, if the x_i are all the possible values of the random variable X.
Definition 17.7.3.
If a random variable, X, can assume an infinite number of values in an interval, e. g., between a
and b ∈ R , then X is called a continuous random variable. The probability of X being within an
interval [a, b] is given by the integral
P(a ≤ X ≤ b) = ∫_a^b f(x) dx.   (17.24)
Here, the nonnegative function f (x) is called the probability density function of X.
It can be shown that
∫_{−∞}^{∞} f(x) dx = 1.   (17.25)
It is important to note that the probability for a single point x 0 ∈ R is zero, because
P(x_0 ≤ X ≤ x_0) = ∫_{x_0}^{x_0} f(x) dx = 0.   (17.26)
In Section →17.12, we will discuss some important continuous distributions. However, here we
want to give an example for such a distribution.
The notation Unif ([a, b]) is often used to denote a uniform distribution in the interval [a, b] .
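In R, the uniform distribution is available through the functions dunif, punif, and runif; a minimal sketch, here with the illustrative choice a = 0 and b = 2:

```r
# The uniform distribution Unif([a, b]) in R, for a = 0, b = 2.
a <- 0; b <- 2
dunif(1, min = a, max = b)  # density: 1/(b - a) = 0.5
punif(1, min = a, max = b)  # distribution function: P(X <= 1) = 0.5
set.seed(1)
x <- runif(1000, min = a, max = b)  # 1000 random samples
mean(x)                     # close to the theoretical mean (a + b)/2 = 1
```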
Definition 17.8.1.
The expectation value of a random variable X is defined by

E[X] = ∑_i x_i f(x_i), for a discrete random variable X,   (17.28)
E[X] = ∫ x f(x) dx, for a continuous random variable X.   (17.29)

More generally, the expectation value of a function g of a random variable X is given by

E[g(X)] = ∑_i g(x_i) f(x_i), for a discrete random variable X,   (17.30)
E[g(X)] = ∫ g(x) f(x) dx, for a continuous random variable X.   (17.31)
From the definition of the expectation value of a random variable follow several important properties that hold for discrete and continuous random variables.
Theorem 17.8.1.
Suppose that X and X_1, …, X_n are random variables. Then the following results hold:
1. If Y = aX + b, with constants a and b, then E[Y] = aE[X] + b.   (17.32)
2. E[∑_{i=1}^{n} X_i] = ∑_{i=1}^{n} E[X_i].   (17.33)
3. If X_1, …, X_n are independent random variables and E[X_i] is finite for every i, then

E[∏_{i=1}^{n} X_i] = ∏_{i=1}^{n} E[X_i].   (17.34)
17.8.2 Variance
An important special case for an expectation value of a function is given by

g(X) = (X − μ)².   (17.35)
Due to the importance of this expression, it has its own name. It is called the variance of X. If the
mean of X, μ, is not finite, or if it does not exist, then Var(X) does not exist.
There is a related measure, called the standard deviation, which is just the square root of the variance of X, denoted sd(X) = √Var(X). Frequently, the Greek symbol σ² is used to denote the variance of X. In this case, the standard deviation assumes the form sd(X) = √Var(X) = σ.
Property (3) has important practical implications, because it says that the mean of a sample of size n of random variables that all have the same variance has a variance that is reduced by the factor 1/n. If we take the square root of Var(X̄) = Var(X)/n, we get the standard deviation of X̄, given by

SE = sd(X̄) = sd(X)/√n.   (17.38)

This quantity is also called the standard error of the mean. For repeated measurements with errors E_i, this means that we are interested in the variability of the mean of the errors, (1/10)∑_{i=1}^{10} E_i for ten measurements, and not in the variance of the individual errors E_i.
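The 1/√n reduction of equation (→17.38) can be seen in a short simulation; the sample size and the distribution below are illustrative choices:

```r
# The standard error of the mean shrinks like sigma/sqrt(n), eq. (17.38).
set.seed(123)
n <- 25; sigma <- 2
# draw 10000 samples of size n and compute their means
xbar <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))
sd(xbar)         # empirical standard error of the mean
sigma / sqrt(n)  # theoretical value: 2/5 = 0.4
```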
17.8.3 Moments
Along the same principle, as for the definition of the variance of a random variable X, one can
define further expectation values.
Definition 17.8.2.
The kth central moment of a random variable X is defined by

m_k = E[g(X)^k] = E[(X − μ)^k], with g(X) = X − μ.   (17.40)

For k = 2, the central moment of X is just the variance of X. Analogously, one defines the kth raw moment.

Definition 17.8.3.
The kth raw moment of a random variable X is defined by

m′_k = E[g(X)^k] = E[X^k], with g(X) = X.   (17.42)
Definition 17.8.4.
The linear correlation, often referred to as simply correlation, between two random variables X and Y is defined by

Cor(X, Y) = E[(X − E[X])(Y − E[Y])]/√(Var(X) Var(Y))   (17.44)
= Cov(X, Y)/√(Var(X) Var(Y)).   (17.45)
Theorem 17.9.1.
Let X and Y be two discrete random variables with joint probability function f(x, y). If (x_a, y_a) is not in the definition range of (X, Y), then f(x_a, y_a) = 0. Furthermore,

∑_{∀i} f(x_i, y_i) = 1,   (17.48)

and

P((X, Y) ∈ Z) = ∑_{(x,y)∈Z} f(x, y).   (17.49)
For evaluating such a discrete joint probability function, the corresponding probabilities can be presented in the form of a table. In →Table 17.1, we present an example of a discrete joint probability function f(x, y) with X ∈ {x_1, x_2} and Y ∈ {y_1, y_2, y_3}.

Table 17.1 An example of a discrete joint probability function f(x, y) with X ∈ {x_1, x_2} and Y ∈ {y_1, y_2, y_3}.
generalize naturally. However, the practical characterization of such distributions, e. g., in the form of tables like →Table 17.1, causes problems, because 3-, 4-, or 100-dimensional tables are not
manageable. Fortunately, for random variables that have a dependency structure that can be
represented by a directed acyclic graph (DAG), there is a simple representation.
By application of the chain rule, one can show that every joint probability distribution factorizes in
the following way:
P(X_1, …, X_n) = ∏_{i=1}^{n} p(X_i | pa(X_i)).   (17.50)

Here, pa(X_i) denotes the "parents" of variable X_i. In →Figure 17.4 (left), we show an example of a DAG for which the joint probability distribution factorizes as

P(X_1, …, X_5) = p(X_1)p(X_2)p(X_3|X_1)p(X_4|X_1, X_2)p(X_5|X_1, X_2).   (17.51)
Similarly, the joint probability distribution, for →Figure 17.4 (right), can be written as follows
P (X1 , … , X5 ) = p(X1 )p(X2 )p(X3 )p(X4 |X1 , X2 , X3 )p(X5 |X4 ). (17.52)
The advantage of such factorization is that the numerical specification of the joint probability
distribution is distributed over the terms p(X_i | pa(X_i)). Importantly, each of these terms can be specified independently of the others.
The DAGs shown in →Figure 17.4, together with the factorizations of their joint probability distributions, are examples of so-called Bayesian networks [→114], [→149]. Bayesian networks are special examples of probabilistic models called graphical models [→116].
P (X = 0) = 1 − p. (17.54)
As a short notation, we write X ∼ Bern(p) for a random variable, X, drawn from a Bernoulli
distribution with parameter p. Hence, the symbol ∼ means “is drawn from” or “is sampled from”.
The R-package Rlab provides the Bernoulli distribution. With the help of the command rbern, we
can draw 10 random variables from a distribution with p = 0.5 .
An alternative is to use the sample command. Here it is important to sample with replacement.
A simple example for a discrete random variable with a Bernoulli distribution is a coin toss.
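A minimal sketch of the two sampling approaches mentioned above; the rbern call assumes that the Rlab package is installed, so it is only indicated as a comment:

```r
# Drawing 10 Bernoulli(p = 0.5) random variables.
# With the Rlab package (assumed to be installed):
# library(Rlab)
# x <- rbern(10, prob = 0.5)

# Base-R alternative: sample from {0, 1} with replacement
set.seed(1)
x <- sample(c(0, 1), size = 10, replace = TRUE, prob = c(0.5, 0.5))
x
```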
Suppose that we toss such a coin N times, where each toss X_i satisfies P(X_i = 1) = p. Then the probability to observe n "1s" (e. g., heads) from the N tosses is given by

P(X = n) = (N choose n) p^n (1 − p)^(N−n).   (17.55)
In →Figure 17.5, we visualize two Binomial distributions with different parameters. Each bar corresponds to P(X = n) for a specific value of n.
Figure 17.5 Binomial distribution, Binom(N = 6, p = 0.3) (left) and Binom(N = 6, p = 0.1)
(right).
For N → ∞ and values of p not too close to 0 or 1, the Binomial distribution can be approximated by a normal distribution (discussed in detail in Section →17.12.4). In this case, one can set the mean value to μ = Np, and the standard deviation to σ = √(Np(1 − p)), for the normal distribution. The
advantage of such approximation is that the normal distribution is computationally easier to
handle than the Binomial distribution. As a rule of thumb, this approximation can be used if
N p(1 − p) > 9 . Alternatively, it can be used if N p > 5 (for p ≤ 0.5 ) or N (1 − p) > 5 (for
p > 0.5 ).
To illustrate how to generate figures such as →Figure 17.5 (right), we provide below a listing, using ggplot, for producing such a figure.
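The original listing is not reproduced here; the following is a minimal sketch in its spirit, assuming the ggplot2 package is installed, for drawing the bars of Binom(N = 6, p = 0.1):

```r
# Bar plot of a Binomial distribution, Binom(N = 6, p = 0.1).
n <- 0:6
df <- data.frame(n = n, p = dbinom(n, size = 6, prob = 0.1))  # P(X = n)

# assuming the ggplot2 package is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(df, aes(x = n, y = p)) +
          geom_col() +
          labs(x = "n", y = "P(X = n)"))
}
```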
In the following, we do not provide the scripts for the visualizations of similar figures, but only for
the values of the distributions. However, by following the example in Listing 17.5, such
visualizations can be generated easily.
Figure 17.6 Binomial distribution, pbinom(n, size=6, prob=0.6) (left) and qbinom(p, size=6, prob=0.6)
(right).
So far we have seen that R provides for each available distribution a function to sample random
variables from this distribution, and a function to obtain the corresponding probability density.
For the Binomial distribution, these functions are called rbinom and dbinom. For other distributions, the following pattern for the names applies:
r’name-of-the-distribution’: draw random samples from the distribution;
d’name-of-the-distribution’: density of the distribution.
There are two more standard functions available that provide useful information about a
distribution. The first one is the distribution function, also called cumulative distribution function,
because it provides P (X ≤ n) , i. e., the probability up to a certain value of n, which is given by
P(X ≤ n) = ∑_{m=0}^{n} P(X = m).   (17.56)
The second function is the quantile function, which provides information about the value of n, for
which P (X ≤ n) = p holds. In R, the names of these functions follow the pattern:
p’name-of-the-distribution’: distribution function;
q’name-of-the-distribution’: quantile function.
For instance, sampling from X ∼ nbinom(r = 6, p = 0.2) using R can be done as follows:
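The corresponding listing is not shown above; a minimal sketch could look as follows (note that R parameterizes the negative binomial by size = r and prob = p):

```r
# The four standard functions for the negative binomial distribution
# with r = 6 and p = 0.2.
set.seed(1)
x <- rnbinom(10, size = 6, prob = 0.2)  # draw 10 random samples
dnbinom(5, size = 6, prob = 0.2)        # density: P(X = 5)
pnbinom(5, size = 6, prob = 0.2)        # distribution function: P(X <= 5)
qnbinom(0.5, size = 6, prob = 0.2)      # quantile function: the median
```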
It is worth noting that the Poisson distribution can be obtained from a Binomial distribution
for N → ∞ and p → 0 , assuming that λ = N p remains constant. This means that for large N
and small p we can use the Poisson distribution with λ = N p to approximate a Binomial
distribution, because the former is easier to handle computationally. Two rules of thumb say that
this approximation is good if N ≥ 20 and p ≤ 0.05 , or if N ≥ 100 and N p ≤ 10 .
This approximation explains also why the Poisson distribution is used to describe rare events
that have a small probability to occur, e. g., radioactive decay of chemical elements. Other
examples of rare events include spelling errors on a book page, the number of visitors of a certain
website, or the number of infections due to a virus.
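The quality of this Poisson approximation can be checked numerically; the values N = 100 and p = 0.05 below are one of the rule-of-thumb cases mentioned above:

```r
# Poisson approximation of the Binomial distribution for N = 100,
# p = 0.05, i.e., lambda = N * p = 5.
N <- 100; p <- 0.05; lambda <- N * p
n <- 0:15
binom_probs <- dbinom(n, size = N, prob = p)
pois_probs  <- dpois(n, lambda = lambda)
max(abs(binom_probs - pois_probs))  # small maximal deviation
```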
The parameter λ of the exponential distribution must be strictly positive, i. e., λ > 0 .
Figure 17.8 Exponential distribution: dexp(rate = 1) (left) and pexp(rate = 1) (right).
In the denominator of the definition of the Beta distribution appears the Beta function, which is
defined by
B(α, β) = ∫_0^1 x^(α−1) (1 − x)^(β−1) dx.   (17.62)
The parameters α and β must be strictly positive. In the denominator of the density appears the
gamma function, Γ, which is defined as follows:
Γ(α) = ∫_0^∞ t^(α−1) exp(−t) dt.   (17.64)
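Equation (→17.64) can be verified numerically in R by comparing the built-in gamma function with a numerical integration (the value α = 2.5 is an arbitrary test point):

```r
# Checking eq. (17.64): gamma(alpha) equals the integral of t^(alpha-1) e^(-t).
alpha <- 2.5
num <- integrate(function(t) t^(alpha - 1) * exp(-t), lower = 0, upper = Inf)
num$value     # numerical value of the integral
gamma(alpha)  # built-in gamma function; the two values agree
```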
Figure 17.10 Gamma distribution: dgamma(α = 2, β = 2) (left) and pgamma(α = 2, β = 2) (right).
An important special case of the normal distribution is the standard normal distribution defined
by
f(x) = (1/√(2π)) exp(−x²/2), −∞ < x < ∞.   (17.66)
f(x) = c exp(−(1/(2(1 − ρ²)))[(x_1 − μ_1)²/σ_1² + (x_2 − μ_2)²/σ_2² − 2ρ(x_1 − μ_1)(x_2 − μ_2)/(σ_1 σ_2)]),   (17.67)

x = (x_1, x_2) ∈ R²,

c = 1/(2π σ_1 σ_2 √(1 − ρ²)).   (17.68)
Figure 17.12 Two-dimensional normal distribution. In addition, projections on the x_1- and x_2-axes are shown. In contrast, →Figure 17.13 shows a contour plot of this distribution. Such a plot shows parallel slices of the x_1–x_2 plane.
Figure 17.13 Two-dimensional normal distribution: heat map and contour plot.
Here, x ∈ R^n is an n-dimensional random variable, and the parameters of the density are its mean, μ ∈ R^n, and the n × n covariance matrix Σ; |Σ| is the determinant of Σ. For n = 2, we obtain the two-dimensional normal distribution given in equation (→17.67).
Figure 17.14 Chi-square distribution. Left: Different values of the degree of freedom
k ∈ {2, 7, 20} . Right: Cumulative distribution function.
An example of the application of the Chi-square distribution is the sampling distribution for a
Chi-square test, which is a statistical hypothesis test that can be used to study the variance or the
distribution of data [→171].
f(x) = (Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))) (1 + x²/ν)^(−(ν+1)/2), −∞ < x < ∞.   (17.71)
The Student’s t-distribution is also used as a sampling distribution for hypothesis tests.
Specifically, it is used for a t-test that can be used to compare the mean value of one or two
populations, i. e., groups of measurements, each with a certain number of samples [→171].
Figure 17.15 Student’s t-distribution. Left: Different values of the degree of freedom
k ∈ {2, 7, 20} . Right: QQnormal plot for t-distribution with k = 100 .
The log-normal distribution, shown in →Figure 17.16, has the following location measures:

mean: exp(μ + σ²/2),   (17.74)
variance: exp(2μ + σ²)(exp(σ²) − 1),   (17.75)
mode: exp(μ − σ²).   (17.76)
The Weibull distribution, shown in →Figure 17.17, has the following location measures:
mean: λΓ(1 + 1/β),   (17.78)
variance: λ²[Γ(1 + 2/β) − (Γ(1 + 1/β))²],   (17.79)
mode: λ((β − 1)/β)^(1/β), for β > 1.   (17.80)
In biostatistics, the log-normal distribution and the Weibull distribution find their applications
in survival analysis [→112]. Specifically, these distributions are used as a parametric model for the
baseline hazard function of a Cox proportional hazard model, which can be used to model time-
to-event processes by considering covariates.
Its proof follows directly from the definition of conditional probabilities and the commutativity of
the intersection.
The terms in the above equation have the following names:
P(H) is called the prior probability, or prior.
P(D|H) is called the likelihood.
P(H|D) is called the posterior probability, or posterior.
The letters denoting the above variables, i. e., D and H, are arbitrary, but by using D for “data” and
H for “hypothesis”, one can interpret equation →17.81 as the change of the probability for a
hypothesis (given by the prior) after considering new data about this hypothesis (given by the
posterior).
Bayes’ theorem can be generalized to more variables.
Theorem 17.13.1 (Bayes' theorem).
Let the events B_1, …, B_k be a partition of the space S such that P(B_i) > 0 for all i ∈ {1, …, k}, and let A be an event with P(A) > 0. Then, for i ∈ {1, …, k},

P(B_i|A) = P(A|B_i)P(B_i) / (∑_{j=1}^{k} P(A|B_j)P(B_j)).   (17.82)
To understand the utility of the Bayes’ theorem, let us consider the following example:
Suppose that a medical test for a disease is performed on a patient, and this test has a reliability
of 90 % . That means, if a patient has this disease, the test will be positive with a probability of
90 % . Furthermore, assume that if the patient does not have the disease, the test will be positive
with a probability of 10 % . Let us assume that a patient tests positive for this disease. What is the
probability that this patient has this disease? The answer to this question can be obtained using
Bayes’ theorem.
In order to make the usage of Bayes' theorem more intuitive, we adopt the formulation in equation (→17.82). Specifically, let us denote a positive test by A = T⁺, a sick patient that has the disease (D) by B_1 = D⁺, and a healthy patient that does not have the disease by B_2 = D⁻. The events D⁺ and D⁻ form a partition of the sample space (either the patient is sick or healthy). From the provided information about the medical test, see above, we can identify the following entities:
P(T⁺|D⁺) = 0.9,   (17.84)
P(T⁺|D⁻) = 0.1.   (17.85)
At this point, the following observation can be made: the knowledge about the medical test is not enough to calculate the probability P(D⁺|T⁺), because we also need information about P(D⁺) and P(D⁻).
These probabilities correspond to the prevalence of the disease in the population and are independent of the characteristics of the performed medical test. Let us consider two different diseases: one is a common disease and one is a rare disease. For the common (c) disease, we assume P_c(D⁺) = 1/1000, and for the rare (r) disease, P_r(D⁺) = 1/1000000. That means, for the common disease, one person in 1000 is, on average, sick, whereas, for the rare disease, only one person in 1000000 is sick. This gives us
Common disease: P_c(D⁺) = 1/10³, P_c(D⁻) = 1 − 1/10³,   (17.86)
Rare disease: P_r(D⁺) = 1/10⁶, P_r(D⁻) = 1 − 1/10⁶.   (17.87)
Applying Bayes' theorem, equation (→17.82), yields

Common disease: P_c(D⁺|T⁺) = 8.93 ⋅ 10⁻³,   (17.88)
Rare disease: P_r(D⁺|T⁺) = 8.99 ⋅ 10⁻⁶.   (17.89)
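These posterior probabilities can be computed directly from equation (→17.82); the helper function posterior below is introduced only for this illustration:

```r
# P(D+|T+) from Bayes' theorem, eq. (17.82), for the medical-test example.
posterior <- function(prev, sens = 0.9, fpr = 0.1) {
  # P(D+|T+) = P(T+|D+) P(D+) / (P(T+|D+) P(D+) + P(T+|D-) P(D-))
  sens * prev / (sens * prev + fpr * (1 - prev))
}
posterior(1/1000)     # common disease: approx. 8.93e-03
posterior(1/1000000)  # rare disease:   approx. 8.99e-06
```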
It is worth noting that although the used medical test has the exact same characteristics, given by P(T⁺|D⁺) and P(T⁺|D⁻) (see equations (→17.84) and (→17.85)), the resulting probabilities differ considerably:

P_c(D⁺|T⁺) = 991.1 ⋅ P_r(D⁺|T⁺),   (17.90)

which makes it almost 1000 times more likely to suffer from the common disease than the rare disease, if tested positive.
The above example demonstrates that the context, as provided by P(D⁺) and P(D⁻), is crucial. →Figure 17.18 shows P(D⁺|T⁺) as a function of the prevalence P(D⁺) for values from [0, 1]. We can see that for any prevalence value of P(D⁺) below about 0.6 %, the probability to have the disease, if tested positive, is below 5 %. Furthermore, we can see that the functional relation between P(D⁺) and P(D⁺|T⁺) is strongly nonlinear. Such a functional behavior makes it difficult to make good guesses for the values of P(D⁺|T⁺) without doing the underlying mathematics properly.
After this example, demonstrating the use of Bayes' theorem, we will now provide the proof of the theorem.

Figure 17.18 P(D⁺|T⁺) as a function of the prevalence probability P(D⁺) for a common disease. The horizontal line corresponds to 5 %.
Proof.
From the definition of a conditional probability for two events A and B,

P(A|B) = P(A ∩ B)/P(B),   (17.91)

we obtain

P(B_i|A) = P(B_i ∩ A)/P(A)   (17.92)
= P(A|B_i)P(B_i)/P(A),   (17.93)

since P(A ∩ B_i) = P(B_i ∩ A). Using the law of total probability and assuming that {B_1, …, B_k} is a partition of the sample space, we can write

P(A) = ∑_{j=1}^{k} P(A|B_j)P(B_j).   (17.94)

Inserting this expression for P(A) completes the proof. □
17.14.1 Entropy
Shannon defined the entropy of a discrete random variable X, assuming values in {X_1, …, X_n}, by

H(X) = −∑_{i=1}^{n} P(X_i) log(P(X_i)).   (17.95)
Usually, the logarithm is base 2, because the entropy is expressed in bits (that means its unit is
a bit). However, sometimes, other bases are used, hence, attention to this is required.
The entropy is a measure of the uncertainty of a random variable. Specifically, it quantifies the
average amount of information needed to describe the random variable.
For a random variable X′ obtained from X by permuting its probability values, the entropy does not change, i. e.,

H(X) = H(X′).   (17.97)

Maximum: The maximum of the entropy is assumed for P(X_i) = 1/n = const. ∀i, for {X_1, …, X_n}.
The definition of the entropy can be extended to a continuous random variable, X, with probability density function f(x) and X ∈ D, as follows:

H(X) = E[−log(f(X))] = −∫_{x∈D} f(x) log(f(x)) dx.   (17.98)
Clearly, the entropy is positive for all values of p, and assumes its maximum for p = 0.5, with H(p = 0.5) = 1 bit. In order to plot the entropy, we used n = 50 different values for p, obtained from an equidistant partition of the interval [0, 1].
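A minimal sketch for computing and plotting this entropy curve of a Bernoulli random variable (the grid of 50 values of p avoids the end points 0 and 1, where the terms p log p are defined by continuity as 0):

```r
# Entropy of a Bernoulli random variable, in bits, for 50 values of p.
p <- seq(0.01, 0.99, length.out = 50)
H <- -(p * log2(p) + (1 - p) * log2(1 - p))
max(H)  # close to the maximum of 1 bit, attained at p = 0.5
# plot(p, H, type = "l", xlab = "p", ylab = "H(p) [bit]")
```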
Similar to the joint probability and the conditional probability, there are also extensions of the entropy along these lines. Let X and Y be two random variables assuming values in X_1, …, X_n and Y_1, …, Y_m, respectively. Furthermore, let p_ij = P(X_i, Y_j) be their joint probability distribution. Then, the joint entropy of X and Y is given by

H(X, Y) = −∑_{i=1}^{n} ∑_{j=1}^{m} p_ij log(p_ij),   (17.100)

and the conditional entropy of Y given X, H(Y|X), is given by

H(Y|X) = ∑_{i=1}^{n} p_i H(Y|X = x_i) = −∑_{i=1}^{n} ∑_{j=1}^{m} p_i p_{j|i} log(p_{j|i}).   (17.101)
Figure 17.20 An example for the Kullback–Leibler divergence. On the left-hand side, we show the probability distributions p(x) (a gamma distribution) and q(x) (a normal distribution). On the right-hand side, we show only the logarithm, log(p(x)/q(x)), of both distributions. The vertical dashed lines indicate the intersection points between both distributions. At these points, the sign of the logarithm changes, as shown on the right-hand side, since log(x) is positive for x > 1 and negative for x < 1.
Let X and Y be two random variables assuming values in X_1, …, X_n and Y_1, …, Y_m, with the joint probability distribution p_ij = P(X_i, Y_j) and marginal distributions p_i and p_j. The mutual information between X and Y is defined by

I(X; Y) = ∑_{i=1}^{n} ∑_{j=1}^{m} p_ij log(p_ij/(p_i p_j)).

As a special case, the mutual information of X with itself equals the entropy,

I(X; X) = H(X).
In →Figure 17.21, we visualize the relationships between entropies and mutual information.
This graphical representation of the abstract relationships helps in summarizing these nontrivial
dependencies and in gaining an intuitive understanding.
Figure 17.21 Visualization of the nontrivial relationships between entropies and mutual
information.
In contrast with the correlation discussed in Section →17.8.4, mutual information measures
linear and nonlinear dependencies between X and Y. This extension makes this measure a popular
choice for practical applications. For instance, the mutual information has been used to estimate
the regulatory effects between genes [→44] to construct gene regulatory networks [→69], [→72],
[→139]. It has also been used to estimate finance networks representing the relationships
between stocks from, e. g., the New York stock exchange [→66] or investor trading networks [→7].
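For a discrete joint distribution given as a table, the mutual information can be computed in a few lines; the joint probabilities below are purely illustrative (this is also the computation needed for the exercises at the end of this chapter):

```r
# Mutual information (in bits) from a discrete joint probability table.
P <- matrix(c(0.3, 0.1,
              0.2, 0.4), nrow = 2, byrow = TRUE)  # joint P(X, Y)
px <- rowSums(P)  # marginal distribution of X
py <- colSums(P)  # marginal distribution of Y
MI <- sum(P * log2(P / outer(px, py)))  # I(X;Y)
MI  # positive, since X and Y are dependent here
```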
Proof.
To prove the Chebyshev inequality, we set Y = (X − E[X])². This guarantees P(Y ≥ 0) = 1, because Y is nonnegative. Furthermore, E[Y] = Var(X), per definition of the variance. Now, application of the Markov inequality, setting s = t², gives

P(|X − E[X]| ≥ t) = P((X − E[X])² ≥ t²)   (17.107)
= P(Y ≥ s) ≤ E[Y]/s   (17.108)
= Var(X)/t².   (17.109)

□
It is important to emphasize that the two above inequalities hold for every probability distribution
with the required conditions. Despite this generality, it is possible to make a specific statement
about the distance of a random sample from the mean of the distribution. For example, for
t = 4σ , we obtain
P(|X − E[X]| ≥ 4σ) ≤ 1/16 ≈ 0.063.   (17.110)

That means, for every distribution, the probability that the distance between a random sample X and E[X] is larger than four standard deviations is less than 6.3 %.
At the beginning of this chapter, we stated briefly the result of the law of large numbers. Before
we formulate it formally, we have one last point that requires some clarification. This point relates
to the mean of a sample. Suppose that we have a random sample of size n, given by X_1, …, X_n, and each X_i is drawn from the same distribution with mean μ and variance σ². Furthermore, each X_i is drawn independently from the other samples. We call such samples independent and identically distributed (iid). This implies

E[X_1] = ⋯ = E[X_n] = μ,   (17.111)

and

Var(X_1) = ⋯ = Var(X_n) = σ².   (17.112)
The question of interest here is the following: what is the expectation value of the sample mean?
The sample mean of the sample X_1, …, X_n is given by

X̄_n = (1/n) ∑_{i=1}^{n} X_i.   (17.113)
Here, we emphasize the dependence on n by the subscript of the mean value. From this, we can obtain the expectation value of X̄_n by applying the rules for expectation values discussed in Section →17.8:

E[X̄_n] = (1/n) ∑_{i=1}^{n} E[X_i] = μ.   (17.114)

Similarly, we can obtain the variance of the sample mean, i. e., Var(X̄_n), by
Var(X̄_n) = Var((1/n) ∑_{i=1}^{n} X_i)   (17.115)
= (1/n²) Var(∑_{i=1}^{n} X_i)   (17.116)
= (1/n²) ∑_{i=1}^{n} Var(X_i)   (17.117)
= (1/n²) n σ² = σ²/n.   (17.118)
These results are interesting, because they demonstrate that the expectation value of the sample mean is identical to the mean of the distribution, but its variance is reduced by a factor of 1/n compared to the variance of the distribution. Hence, the sampling distribution of X̄_n becomes more and more peaked around μ with increasing values of n, while having a smaller variance than the distribution of X for all n > 1.
Applying the Chebyshev inequality to the sample mean, i. e., setting X = X̄_n and using E[X̄_n] = μ, gives

P(|X̄_n − E[X̄_n]| ≥ t) = P(|X̄_n − μ| ≥ t) ≤ Var(X̄_n)/t²   (17.119)
= σ²/(n t²).   (17.120)

This is a precise probabilistic relationship between the distance of the sample mean X̄_n from the mean μ as a function of the sample size n. Hence, this relationship can be used to get an estimate for the number of samples required in order for the sample mean to be "close" to the population mean μ.

We are now in a position to finally present the result known as the law of large numbers, which adds a further component to the above considerations for the sample mean. Specifically, so far, we know that the expectation of the sample mean is the mean of the distribution (see equation (→17.114)), and that the probability of the minimal distance between X̄_n and μ, given by t, decreases systematically for increasing n (see equation (→17.120)). However, so far, we did not assess the opposite behavior of equation (→17.120), namely: what is P(|X̄_n − μ| < t)? Taking the complement of equation (→17.120) gives

P(|X̄_n − μ| < t) ≥ 1 − σ²/(n t²).   (17.121)

Taking the limit n → ∞ gives

lim_{n→∞} P(|X̄_n − μ| < t) = 1.   (17.122)

This last expression is the result of the law of large numbers. That means, the law of large numbers provides evidence that the distance between X̄_n and μ stays with certainty, i. e., with a probability of 1, below any arbitrarily small value of t > 0.

Formally, in statistics, there is a special symbol reserved for the type of convergence presented in equation (→17.122), which is written as

X̄_n →^p μ.   (17.123)

The "p" over the arrow means that the sample mean converges in probability to μ.

Suppose that we have an iid sample of size n, X_1, …, X_n, where each X_i is drawn from the same distribution with mean μ and variance σ². Then, the sample mean X̄_n converges in probability to μ,

X̄_n →^p μ.   (17.124)
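The convergence of the sample mean can be observed in a short simulation; the exponential distribution below is an arbitrary choice with known mean μ = 1:

```r
# Law of large numbers: the running sample mean approaches mu = 1.
set.seed(1)
mu <- 1
x <- rexp(100000, rate = 1)            # iid sample with mean 1, variance 1
running_mean <- cumsum(x) / seq_along(x)
running_mean[c(10, 100, 10000, 100000)]  # approaches mu = 1 for large n
```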
In the previous section, we saw that the expected sample mean and the variance of a random sample are μ and σ²/n, respectively, if the distribution from which the samples are drawn has a mean of μ and a variance of σ². What we did not discuss, so far, is the distributional form of this random sample. This is the topic addressed by the central limit theorem.
Theorem 17.16.1 (Central limit theorem).
Let X_1, …, X_n be an iid sample from a distribution with mean μ and variance σ². Then,

lim_{n→∞} Pr((X̄_n − μ)/√(σ²/n) ≤ x) = F(x).   (17.125)
Here, F is the cumulative distribution function of the standard normal distribution, and x is a fixed real
number.
To understand the importance of the central limit theorem, we would like to emphasize that equation (→17.125) holds for a large sample from any distribution, whether discrete or continuous. In this case, (X̄_n − μ)/(σ/n^(1/2)) can be approximated by a standard normal distribution. This implies that X̄_n can be approximated by a normal distribution with mean μ and variance σ²/n.
The central limit theorem is one of the reasons why the normal distribution plays such a prominent role in statistics, machine learning, and data science. Even when individual random variables do not come from a normal distribution (i. e., they are not sampled from a normal distribution), their sum is approximately normally distributed.
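The central limit theorem can be illustrated by standardizing sample means of a skewed distribution; the exponential distribution and the sample size n = 50 below are illustrative choices:

```r
# CLT: standardized sample means of an exponential distribution
# are approximately standard normal.
set.seed(1)
n <- 50                          # sample size
z <- replicate(10000, {
  x <- rexp(n, rate = 1)         # skewed distribution with mean 1, variance 1
  (mean(x) - 1) / sqrt(1 / n)    # standardized sample mean, as in eq. (17.125)
})
mean(z); sd(z)                   # approximately 0 and 1
# hist(z, breaks = 50, freq = FALSE)  # approximately standard normal
```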
Let X_1, …, X_n be independent random variables with X_i ∈ [a_i, b_i], and let X̄ = (1/n) ∑_{i=1}^{n} X_i with μ = E[X̄]. Then, for any ϵ > 0, the following inequalities hold:

P(X̄ − μ ≥ ϵ) ≤ exp(−2n²ϵ²/∑_{i=1}^{n}(b_i − a_i)²),   (17.126)

P(|X̄ − μ| ≥ ϵ) ≤ 2 exp(−2n²ϵ²/∑_{i=1}^{n}(b_i − a_i)²).   (17.127)

By setting ϵ′ = nϵ, one obtains inequalities for S = ∑_{i=1}^{n} X_i:

P(S − E[S] ≥ ϵ′) ≤ exp(−2ϵ′²/∑_{i=1}^{n}(b_i − a_i)²),   (17.128)

P(|S − E[S]| ≥ ϵ′) ≤ 2 exp(−2ϵ′²/∑_{i=1}^{n}(b_i − a_i)²).   (17.129)
The Hoeffding’s inequality finds its applications in statistical learning theory [→192].
Specifically, it can be used to estimate a bound for the difference between the in-sample error E in
and the out-of-sample error E . More generally, it is used for deriving learning bounds for
out
models [→138].
Then, we obtain the following probabilistic version of the Cauchy–Schwarz inequality for expectation values:

E[XY]² ≤ E[X²]E[Y²].   (17.132)

Using the Cauchy–Schwarz inequality, we can show that the correlation between two linearly dependent random variables X and Y has absolute value 1, i. e.,

|ρ(X, Y)| = 1 if Y = aX + b, with a, b ∈ R and a ≠ 0.   (17.133)
Here, E[exp(tX)] is the moment-generating function of X. There are many different Chernoff bounds for different probability distributions and different values of the parameter t. Here, we provide a bound for Poisson trials, i. e., a sum of independent Bernoulli random variables that are allowed to have different expectation values, P(X_i = 1) = p_i.
Theorem 17.17.2.
Let X_1, …, X_n be independent Bernoulli random variables with P(X_i = 1) = p_i, and let X = ∑_{i=1}^{n} X_i be their sum with μ = E[X]. Then, for 0 < δ < 1,

Pr(X ≤ (1 − δ)μ) < (exp(−δ)/(1 − δ)^(1−δ))^μ.   (17.136)
Example 17.17.2.
As an example, we use this bound to estimate the probability, when tossing a fair coin n = 100 times, to observe m = 40 or fewer heads. For this, μ = 50, and from (1 − δ)μ = 40 follows δ = 0.2. This gives Pr(X ≤ m) < 0.34.
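The bound of equation (→17.136) can be evaluated directly in R, and compared with the exact Binomial probability, which shows that the bound is valid but not tight here:

```r
# Chernoff bound of eq. (17.136) for the coin example (n = 100, m = 40).
mu <- 50; delta <- 0.2
bound <- (exp(-delta) / (1 - delta)^(1 - delta))^mu
bound                               # approx. 0.34
pbinom(40, size = 100, prob = 0.5)  # exact P(X <= 40), much smaller
```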
17.19 Summary
Probability theory plays a pivotal role when dealing with data, because essentially every measurement contains errors. Hence, there is an accompanying uncertainty that needs to be quantified probabilistically when dealing with data. In this sense, probability theory is an
important extension of deterministic mathematical fields, e. g., linear algebra, graph theory and
analysis, which cannot account for such uncertainties. Unfortunately, such methods are usually
more difficult to understand and require, for this reason, much more practice. However, once
mastered, they add considerably to the analysis and the understanding of real-world problems,
which is essential for any method in data science.
17.20 Exercises
1. In Section →17.11.2, we discussed that under certain conditions a Binomial
distribution can be approximated by a Poisson distribution. Show this result
numerically, using R. Use different approximation conditions and evaluate these.
How can this be quantified? Hint: See Section →17.14.2 about the Kullback–Leibler
divergence.
2. Calculate the mutual information for the discrete joint distribution P(X, Y) given in →Table 17.2.
3. Use R to calculate the mutual information for the discrete joint distribution P(X, Y) given in →Table 17.3, for z ∈ S = [0, 0.5), and plot the mutual information as a function of z. Table 17.3 has the following entries:

         y_1    y_2      y_3
X  x_1   z      0.5 − z  0.1
   x_2   0.1    0.1      0.2
18.1 Introduction
In general, an optimization problem is characterized by the following:
a set of alternative choices called decision variables;
a set of parameters called uncontrollable variables;
a set of requirements to be satisfied by both decision and uncontrollable variables, called
constraints;
some measure(s) of effectiveness, expressed in terms of both decision and uncontrollable variables, called objective function(s).
Definition 18.1.1.
A set of decision variables that satisfy the constraints is called a solution to the problem.
The aim of an optimization problem is to find, among all solutions to the problem, a solution that
corresponds to either
the maximal value of the objective function, in which case the problem is referred to as a
maximization problem, e. g. maximizing the profit;
the minimal value of the objective-function, in which case the problem is referred to as a
minimization problem, e. g. minimizing the cost; or
a trade-off value of many and generally conflicting objective-functions, in which case the
problem is referred to as a multicriteria optimization problem.
Optimization problems are widespread in every activity, where numerical information is
processed, e. g. mathematics, physics, engineering, economics, systems biology, etc. For instance,
typical examples of optimization applications in systems biology include therapy treatment
planning and scheduling, probe design and selection, genomics analysis, etc.
optimize: f(x),   (18.1)
subject to: x ∈ S ⊆ R^n,
Definition 18.2.1.
A solution x̄ to the problem (→18.1) is called a local optimum if
f (x̄) ≤ f (x) for all x in a neighborhood of x̄, for a minimization-type problem, or
f (x̄) ≥ f (x) for all x in a neighborhood of x̄, for a maximization-type problem.
Definition 18.2.2.
A solution x* to the problem (→18.1) is called a global optimum if
f (x*) ≤ f (x) for all x ∈ S, for a minimization-type problem, or
f (x*) ≥ f (x) for all x ∈ S, for a maximization-type problem.
If S = R^n, then a global optimum satisfies f (x*) ≤ f (x) (respectively, f (x*) ≥ f (x) for a
maximization-type problem) for all x ∈ R^n. In this case, the problem (→18.1) is termed an
unconstrained optimization problem. On the other hand, if S ⊂ R^n, then the problem (→18.1) is
termed a constrained optimization problem. If the decision variables are restricted to discrete values,
then the problem is called a discrete optimization problem; otherwise, it is called a continuous
optimization problem. When there is a combination of discrete and continuous variables, the
problem is called a mixed optimization problem.
Remark 18.2.2.
Any minimization problem can be rewritten as a maximization problem, and vice versa, by
substituting the objective function f (x) with z(x) = −f (x).
Therefore, from now on, we will focus exclusively on minimization-type optimization problems.
A solution x* is a local minimum of an unconstrained problem if
1. the gradient of f vanishes at x*, i.e., ∂f (x*)/∂x_j = 0 for all j = 1, …, n, and
2. the Hessian matrix of f at the point x*, ∇²f (x*), is positive definite, i.e.,
d^T ∇²f (x*) d > 0 for all d ≠ 0.
The general principle of the gradient-based algorithms can be summarized by the following steps:
Step 1: Set k = 0, choose an initial point x^(k) = x^(0), and choose some convergence criteria;
Step 2: Test for convergence: if the conditions for convergence are satisfied, then we can
stop, and x^(k) is the solution. Otherwise, go to Step 3;
Step 3: Computation of a search direction (also termed a descent direction): find a vector
d_k ≠ 0 that defines a suitable direction that, if followed, will bring us as close as possible to
the solution x*;
Step 4: Computation of a step size α_k > 0 such that
f (x^(k) + α_k d_k) < f (x^(k));
Step 5: Update x^(k+1) = x^(k) + α_k d_k, set k ← k + 1, and go to Step 2.
The steepest descent method, also called the gradient descent method, uses the negative of the
gradient vector at each point as the search direction for each iteration; thus, Steps 3 and 4 are
performed as follows:
Step 3: d_k = −∇f (x^(k));
Step 4: α_k = argmin_{α>0} f (x^(k) + α d_k).
As an illustration, consider the problem

min_{(x1,x2)∈R^2} f (x1, x2) = x1^2 + x2^2.   (18.3)
The contour plot of the functions f (x 1, x2 ) , depicted in →Figure 18.1 (left), is obtained using the
following script:
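The plotting script is not reproduced above; a minimal base-R sketch that produces a comparable contour plot of f (the grid range is an illustrative choice) is:

```r
# Contour plot of f(x1, x2) = x1^2 + x2^2 on a regular grid.
f <- function(x1, x2) x1^2 + x2^2

x1 <- seq(-2, 2, length.out = 100)
x2 <- seq(-2, 2, length.out = 100)
z  <- outer(x1, x2, f)                       # function values on the grid

pdf(file.path(tempdir(), "contour_f.pdf"))   # headless-safe graphics device
contour(x1, x2, z, nlevels = 15,
        xlab = expression(x[1]), ylab = expression(x[2]))
dev.off()
```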
Using the steepest descent method, the problem (→18.3) can be solved in R as follows:
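The book's listing is not reproduced here; a minimal base-R sketch of steepest descent for (18.3), with the exact line search replaced by simple backtracking, is:

```r
# Steepest descent for f(x1, x2) = x1^2 + x2^2 (problem (18.3)).
f  <- function(x) x[1]^2 + x[2]^2
gr <- function(x) c(2 * x[1], 2 * x[2])        # analytic gradient of f

steepest_descent <- function(f, gr, x0, tol = 1e-8, maxit = 1000) {
  x <- x0
  for (k in seq_len(maxit)) {
    d <- -gr(x)                                # Step 3: search direction
    if (sqrt(sum(d^2)) < tol) break            # Step 2: convergence test
    alpha <- 1                                 # Step 4: backtracking step size
    while (f(x + alpha * d) >= f(x) && alpha > 1e-12) alpha <- alpha / 2
    x <- x + alpha * d                         # Step 5: update the iterate
  }
  x
}

sol <- steepest_descent(f, gr, x0 = c(3, -2))
sol  # close to the global minimum (0, 0)
```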
The contour plot of the functions g(x 1, x2 ) , depicted in →Figure 18.1 (right), is obtained using the
following script:
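A base-R sketch of the corresponding contour script, assuming that g is the function minimized in (→18.5), is:

```r
# Contour plot of g(x1, x2) = -exp(x1 - 2*x1^2 - x2^2) * sin(6*(x1 + x2 + x1*x2))^2,
# i.e., the objective of problem (18.5); the grid range is an illustrative choice.
g <- function(x1, x2) -exp(x1 - 2 * x1^2 - x2^2) * sin(6 * (x1 + x2 + x1 * x2))^2

x1 <- seq(-1.5, 2.5, length.out = 200)
x2 <- seq(-1.5, 2.5, length.out = 200)
z  <- outer(x1, x2, g)

pdf(file.path(tempdir(), "contour_g.pdf"))   # headless-safe graphics device
contour(x1, x2, z, nlevels = 20,
        xlab = expression(x[1]), ylab = expression(x[2]))
dev.off()
```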
Figure 18.1 Left: contour plot of the function f (x1, x2) in (→18.3) in the (x1, x2) plane; right:
contour plot of the function g(x1, x2) in (→18.4) in the (x1, x2) plane.
Most of the optimization methods available in R, including the steepest descent, are implemented
for minimization problems. Since the solution that maximizes a function h(x) minimizes the
function −h(x) , we can solve the problem (→18.5) to find the solution to (→18.4), and then
multiply the value of the objective-function of (→18.5) by −1 to recover the value of the objective-
function of (→18.4).
min_{x1,x2} g(x1, x2) = −e^(x1 − 2x1^2 − x2^2) sin^2(6(x1 + x2 + x1x2)).   (18.5)
Using the steepest descent method, implemented in R, the problem (→18.5) can be solved as
follows:
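The listing is not shown here; a base-R sketch, using a finite-difference gradient instead of an analytic one, illustrates that the result depends on the starting point (the two starting points below are illustrative):

```r
# Steepest descent on g of problem (18.5), with a central-difference gradient.
g <- function(x) -exp(x[1] - 2 * x[1]^2 - x[2]^2) * sin(6 * (x[1] + x[2] + x[1] * x[2]))^2

num_grad <- function(f, x, h = 1e-6) {         # central finite differences
  sapply(seq_along(x), function(i) {
    e <- replace(numeric(length(x)), i, h)
    (f(x + e) - f(x - e)) / (2 * h)
  })
}

descend <- function(f, x0, tol = 1e-8, maxit = 5000) {
  x <- x0
  for (k in seq_len(maxit)) {
    d <- -num_grad(f, x)                       # descent direction
    if (sqrt(sum(d^2)) < tol) break            # convergence test
    alpha <- 1                                 # backtracking step size
    while (f(x + alpha * d) >= f(x) && alpha > 1e-12) alpha <- alpha / 2
    x <- x + alpha * d
  }
  x
}

# Different starts can end in different local minima of the multimodal g:
s1 <- descend(g, c(0.3, -0.3))
s2 <- descend(g, c(1.5, 2.0))
```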
Note that the convergence and solution given by the steepest descent method depend on
both the form of the function to be minimized and the initial solution.
The conjugate gradient method is a modification of the steepest descent method, which takes
into account the history of the gradients to move more directly towards the optimum. The
computation of the descent direction (Step 3) and the step size (Step 4) is performed as follows:
Step 3: the descent direction is given by

d_k = −∇f (x^(k)) if k = 0, and d_k = −∇f (x^(k)) + β_k d_{k−1} if k ≥ 1,

where several types of formulas for β_k have been proposed. The best-known formulas are

β_k^{PRP} = (∇f (x^(k)))^T y_{k−1} / ‖∇f (x^(k−1))‖^2,   (18.7)

β_k^{HS} = (∇f (x^(k)))^T y_{k−1} / (d_{k−1}^T y_{k−1}),   (18.8)

with y_{k−1} = ∇f (x^(k)) − ∇f (x^(k−1)).

Step 4: the step size α_k is chosen to satisfy a line-search condition of the form

(∇f (x^(k) + α_k d_k))^T d_k ≤ −σ (∇f (x^(k)))^T d_k,   (18.10)

where σ ∈ (0, 1).
Applying the conjugate gradient method to the problem (→18.5) yields the solution
x̄ = (1.5112, 2.016) and f (x̄) = −0.0008079518, which is a local minimum. However, in contrast
to the steepest descent method, the conjugate gradient method converges from the initial
solution x^(0) = (1, 1).
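A base-R sketch of this computation uses optim with method = "CG" (base R's conjugate gradient routine; Fletcher–Reeves by default, Polak–Ribière via control = list(type = 2)), assuming g as in equation (→18.5):

```r
# Conjugate gradients on problem (18.5) from the starting point (1, 1).
g <- function(x) -exp(x[1] - 2 * x[1]^2 - x[2]^2) * sin(6 * (x[1] + x[2] + x[1] * x[2]))^2

sol <- optim(par = c(1, 1), fn = g, method = "CG",
             control = list(type = 2, maxit = 1000))
sol$par    # a local minimizer of g
sol$value  # the corresponding (negative) objective value
```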
In contrast with the steepest descent and conjugate gradient methods, which only use first-order
information, i. e., the first derivative (or the gradient) term, Newton’s method requires a second-
order derivative (or the Hessian) to estimate the descent direction. Steps 3 and 4 are performed
as follows:
Step 3: d_k = −[∇²f (x^(k))]^(−1) ∇f (x^(k)) is the descent direction, where ∇²f (x) is the
Hessian of f at the point x;
Step 4: α_k = argmin_α f (x^(k) + α d_k).
Since the computation of the Hessian matrix is generally expensive, several modifications of
Newton’s method have been suggested in order to improve its computational efficiency. One
variant of Newton’s method is the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method, which
uses the gradient to iteratively approximate the inverse of the Hessian matrix
H_k^(−1) ≈ [∇²f (x^(k))]^(−1), as follows:

H_k^(−1) = (I − s_k y_k^T / (s_k^T y_k)) H_{k−1}^(−1) (I − y_k s_k^T / (s_k^T y_k)) + s_k s_k^T / (s_k^T y_k),

where s_k = x^(k) − x^(k−1) and y_k = ∇f (x^(k)) − ∇f (x^(k−1)).
In R, the implementation of the BFGS variant of Newton’s method can be found in the general
multipurpose package optimx. This implementation of the BFGS method can be used to solve the
problem (→18.3), as follows:
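The optimx listing is not reproduced here; as a sketch, base R's optim (which optimx wraps for this method) solves (18.3) with BFGS as follows:

```r
# BFGS on f(x1, x2) = x1^2 + x2^2 (problem (18.3)) with an analytic gradient.
f  <- function(x) x[1]^2 + x[2]^2
gr <- function(x) c(2 * x[1], 2 * x[2])

sol <- optim(par = c(3, -2), fn = f, gr = gr, method = "BFGS")
sol$par    # close to the global minimum (0, 0)
sol$value  # close to 0
```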
Now, let us use the BFGS method, implemented in the package optimx, to solve the problem
(→18.4).
18.3.2 Derivative-free methods
Gradient-based methods rely upon information about at least the gradient of the objective-
function to estimate the direction of search and the step size. Therefore, if the derivative of the
function cannot be computed, because, for example, the objective-function is discontinuous,
these methods often fail. Furthermore, although these methods can perform well on functions
with only one extremum (unimodal functions), such as (→18.3), their efficiency in solving problems
with multimodal functions depends upon how far the initial solution is from the global minimum,
i.e., gradient-based methods are efficient in finding the global minimum only if they
start from an initial solution sufficiently close to it. Therefore, the solution obtained using these
methods may be one of several local minima, and we often cannot be sure that the solution is the
global minimum. In this section, we will present some commonly used derivative-free methods,
which aim to reduce the limitations of the gradient-based methods by providing an alternative to
the computation of the derivatives of the objective-functions. These methods can be very efficient
in handling complex problems, where the functions are either discontinuous or improperly
defined.
The Nelder–Mead method is an effective and computationally compact simplex algorithm for
finding a local minimum of a function of several variables. Hence, it can be used to solve
unconstrained optimization problems of the form:
min_x f (x),   x ∈ R^n.   (18.11)
Definition 18.3.1.
Let x_1, x_2, …, x_{n+1} denote the n + 1 vertices of the simplex, sorted such that

f (x_1) ≤ f (x_2) ≤ ⋯ ≤ f (x_{n+1}).   (18.12)

Thus, x_1 and x_{n+1} correspond to the best and worst vertices, respectively.
At each iteration, the Nelder–Mead method consists of four possible operations: reflection,
expansion, contraction, and shrinking. Each of these operations has a scalar parameter associated
with it. Let us denote by α, β, γ, and δ the parameters associated with the aforementioned
operations, respectively. These parameters are chosen such that α > 0 , β > 1 , 0 < γ < 1 , and
0 < δ < 1.
Then, the Nelder–Mead simplex algorithm, as described in Lagarias et al. [→207], can be
summarized as follows:
Step 0: Generate a simplex with n + 1 vertices, and choose a convergence criterion;
Step 1: Sort the n + 1 vertices according to their objective-function values, i.e., so that
(→18.12) holds. Then evaluate the centroid of the points in the simplex, excluding x_{n+1},
given by x̄ = (1/n) ∑_{i=1}^n x_i;
Step 2: Calculate the reflection point x_r = x̄ + α(x̄ − x_{n+1});
Step 3: If f (x_r) < f (x_1), then calculate the expansion point x_e = x̄ + β(x_r − x̄);
Step 4: If f (x_n) ≤ f (x_r) < f (x_{n+1}), then calculate the outside contraction point
x_oc = x̄ + γ(x_r − x̄);
Step 5: If f (x_r) ≥ f (x_{n+1}), then calculate the inside contraction point
x_ic = x̄ − γ(x_r − x̄);
if f (x_ic) < f (x_{n+1}), then perform an inside contraction by replacing x_{n+1} with x_ic;
otherwise, shrink the simplex towards the best vertex by replacing each vertex x_i, i = 2, …, n + 1, with
x_i = x_1 + δ(x_i − x_1).
Now, let us use the Nelder–Mead method, implemented in the package optimx, to solve the
problem (→18.4).
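As a sketch, base R's optim (whose default method is Nelder–Mead; the book uses optimx instead) can minimize g, assuming g as in equation (→18.5):

```r
# Nelder--Mead on g of problem (18.5), starting from (1, 1).
g <- function(x) -exp(x[1] - 2 * x[1]^2 - x[2]^2) * sin(6 * (x[1] + x[2] + x[1] * x[2]))^2

sol <- optim(par = c(1, 1), fn = g, method = "Nelder-Mead")
sol$par  # a local minimizer; which one depends on the starting simplex
```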
18.3.2.2 Simulated annealing
The efficiency of the optimization methods discussed previously depends on the proximity of the
initial point, from which they start, to the optimum. Therefore, they cannot always guarantee a
global minimum, since they may be trapped in one of several local minima. Simulated annealing is
based on a neighborhood search strategy, derived from the physical analogy of cooling material
in a heat bath, which occasionally allows uphill moves.
Simulated annealing is based on the Metropolis algorithm [→133], which simulates the
change in energy within a system when subjected to the cooling process; eventually, the system
converges to a final “frozen” state of a certain energy.
Let us consider a system with a state described by an n-dimensional vector x, for which the
function to be minimized is f (x) . This is equivalent to an unconstrained minimization problem.
Let T, denoting the generalized temperature, be a scalar quantity, which has the same dimensions
as f. Then, the Metropolis algorithm description, for a nonatomic system, can be summarized as
follows:
Step 0: Construct an initial solution x_0; set x = x_0;
Step 1: Generate a candidate solution x′ in the neighborhood of x;
Step 2: Compute the change in energy Δf = f (x′) − f (x);
Step 3: If Δf < 0, accept the move (x ⟵ x′); otherwise, accept it with probability exp(−Δf/T);
Step 4: Update the temperature value as follows: T ⟵ T − ε_T, where ε_T ≪ T is a specified
positive real value. Update the number of Monte Carlo steps: N_MC ⟵ N_MC + 1;
Step 5: If T ≤ 0, then stop, and return x;
otherwise (i.e., T > 0), go to Step 1.
In R, an implementation of the simulated annealing method can be found in the package GenSA,
and it can be used to solve the problem (→18.3) as follows:
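The GenSA listing is not reproduced here; as a base-R sketch, optim offers a simulated-annealing variant (method = "SANN", a Metropolis-type algorithm with a logarithmic cooling schedule), which can be applied to (18.3):

```r
# Simulated annealing on f(x1, x2) = x1^2 + x2^2 (problem (18.3)).
f <- function(x) x[1]^2 + x[2]^2

set.seed(1)  # SANN is stochastic; fix the seed for reproducibility
sol <- optim(par = c(3, -2), fn = f, method = "SANN",
             control = list(maxit = 20000, temp = 10))
sol$par  # near (0, 0), up to stochastic error
```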
Now, let us use the simulated annealing method, implemented in the package GenSA, to solve
the problem (→18.4).
18.4 Constrained optimization problems
Constrained optimization problems describe most real-world optimization problems. Their
complexity depends on the properties of the functional relationships between the decision
variables in both the objective function and the constraints.
A linear programming problem has the general form

Optimize f (x) = ∑_{j=1}^n c_j x_j
subject to: ∑_{j=1}^n a_ij x_j { ≤, =, ≥ } b_i, i = 1, …, m,
l_j ≤ x_j ≤ u_j, j = 1, …, n.
(P1) subject to: x1 + x2 ≤ 14
2x1 − x2 ≤ 12
x1, x2 ≥ 0
(P2) Minimize f (x1, x2) = x2 − x1
subject to: 2x1 − x2 ≥ −2
x1 − x2 ≤ 2
x1 + x2 ≤ 5
x1, x2 ≥ 0
(P3) subject to: x1 + x2 ≥ 6
x1 ≥ 4
x2 ≤ 3
x1, x2 ≥ 0
(P4) Minimize f (x1, x2) = x2 − x1
subject to: 2x1 − x2 ≥ −2
x1 − 2x2 ≤ −8
x1 + x2 ≤ 5
x1, x2 ≥ 0
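Problem (P2) above is fully specified, so it can serve as a base-R cross-check of the lpSolveAPI computations used in this chapter: an optimum of a two-variable LP is attained at a vertex of the feasible region, which can be found by enumerating constraint intersections.

```r
# All constraints of (P2) rewritten in the form a1*x1 + a2*x2 <= b:
A <- rbind(c(-2,  1),   # 2*x1 - x2 >= -2  <=>  -2*x1 + x2 <= 2
           c( 1, -1),   # x1 - x2 <= 2
           c( 1,  1),   # x1 + x2 <= 5
           c(-1,  0),   # x1 >= 0  <=>  -x1 <= 0
           c( 0, -1))   # x2 >= 0  <=>  -x2 <= 0
b <- c(2, 2, 5, 0, 0)
obj <- function(x) x[2] - x[1]           # minimize f(x1, x2) = x2 - x1

vertices <- list()
for (i in 1:4) for (j in (i + 1):5) {
  M <- A[c(i, j), ]
  if (abs(det(M)) > 1e-12) {
    v <- solve(M, b[c(i, j)])            # intersection of constraints i and j
    if (all(A %*% v <= b + 1e-9))        # keep only feasible vertices
      vertices <- c(vertices, list(v))
  }
}
vals <- sapply(vertices, obj)
min(vals)                 # the optimal objective value of (P2)
vertices[[which.min(vals)]]  # an optimal vertex (the optimum lies along an edge)
```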
Since most of the optimization methods available in R are implemented for minimization-type
problems, and the solution that maximizes a function f (x) minimizes the function −f (x), it is
necessary to multiply the objective functions of problems (P1) and (P3) by −1 and to solve the
resulting minimization problems.
Suppose that, in the problem (P1), x1 is a binary variable (i.e., it takes only the value 0 or 1)
and x2 is an integer variable; then it is necessary to set them to the appropriate type before
solving the problem.
x ≤ u,
x ≥ l,
where
Optimize = Minimize or Maximize;
f : R^n ⟶ R; g_i : R^n ⟶ R, ∀ i ∈ I; h_r : R^n ⟶ R, ∀ r ∈ J ∪ K, with at least one of
these functions being nonlinear; and
l, u ∈ (R ∪ {±∞})^n.
The solution to constrained nonlinear optimization problems, in the form of (→18.13), can be
obtained using the Lagrange multiplier method.
h_j(x) ≤ d_j, j = 1, …, m − p.
The associated Lagrangian is

L(x, λ, μ) = f (x) − ∑_{i=1}^p λ_i (g_i(x) − b_i) − ∑_{j=1}^{m−p} μ_j (h_j(x) − d_j),   (18.15)

where λ_i and μ_j are the Lagrangian multipliers associated with the constraints g_i(x) = b_i and
h_j(x) ≤ d_j, respectively.
The fundamental result behind the Lagrangian formulation (→18.15) can be summarized as
follows: suppose that a solution x* = (x*_1, x*_2, …, x*_n) minimizes the function f (x) subject to
the constraints of the problem (→18.14); then there exist multipliers λ* and μ* such that

∇f (x*) − ∑_{i=1}^p λ*_i ∇g_i(x*) − ∑_{j=1}^{m−p} μ*_j ∇h_j(x*) = 0;   (18.16)

μ*_j (h_j(x*) − d_j) = 0, j = 1, …, m − p;   (18.17)

μ*_j ≥ 0, j = 1, …, m − p.   (18.18)

For an optimal solution x*, some of the inequality constraints will be satisfied at equality, and
others will not. The latter can be ignored, whereas the former form the second condition above.
Thus, the constraints μ*_j (h_j(x*) − d_j) = 0 mean that either an inequality constraint is active at
x*, i.e., h_j(x*) = d_j, or the corresponding multiplier μ*_j vanishes.
Note that the KKT conditions (→18.16)–(→18.18) represent the stationarity, the
complementary slackness, and the dual feasibility conditions, respectively. Other supplementary
KKT conditions are the primal feasibility conditions, defined by the constraints of the problem (→18.14).
In R, an implementation of the Lagrange multiplier method, for solving nonlinear constrained
optimization problems, can be found in the package Rsolnp.
Let us use the function solnp from the R package Rsolnp to solve the following constrained
nonlinear minimization problem:
(P) Minimize f (x1, x2) = e^(x1) (4x1^2 + 2x2^2 + 4x1x2 + 2x2 + 1)
subject to: x1 + x2 = 1,
x1 x2 ≥ −10.
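The solnp listing itself is not reproduced here; as a base-R sketch, the equality constraint of (P) can be eliminated by substituting x2 = 1 − x1, which reduces (P) to a one-dimensional problem (the commented solnp call assumes the standard Rsolnp interface):

```r
# Objective of problem (P).
f   <- function(x1, x2) exp(x1) * (4 * x1^2 + 2 * x2^2 + 4 * x1 * x2 + 2 * x2 + 1)
phi <- function(x1) f(x1, 1 - x1)        # objective along x1 + x2 = 1

# The inequality x1*x2 >= -10 becomes x1*(1 - x1) >= -10, i.e.,
# x1^2 - x1 - 10 <= 0, so x1 lies between the roots of x^2 - x - 10 = 0:
lo <- (1 - sqrt(41)) / 2                 # approx. -2.7016
hi <- (1 + sqrt(41)) / 2                 # approx.  3.7016

sol <- optimize(phi, interval = c(lo, hi))
c(sol$minimum, 1 - sol$minimum)          # (x1, x2); the bound x1*x2 = -10 is active

# With Rsolnp one would instead call something along these lines:
# library(Rsolnp)
# solnp(pars = c(-1, 2), fun = function(x) f(x[1], x[2]),
#       eqfun = function(x) x[1] + x[2], eqB = 1,
#       ineqfun = function(x) x[1] * x[2], ineqLB = -10, ineqUB = 1e6)
```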
18.5 Some applications in statistical machine learning
Much of statistical theory, including statistical machine learning, is concerned with the efficient
use of collected data to estimate the unknown parameters of a model that answers the
questions of interest.
where κ is a constant independent of ω, P (D, ω) is the probability of the observed data set, and Ω
is the feasible set of ω.
When the data set D consists of a complete random sample x_1, x_2, …, x_n from a discrete
probability distribution with probability function p(x|ω), the probability of the observed dataset
is given by

P (D, ω) = P (X_1 = x_1, X_2 = x_2, …, X_n = x_n | ω)   (18.20)
= ∏_{i=1}^n P (X_i = x_i) = ∏_{i=1}^n p(x_i | ω).   (18.21)

When the data set D consists of a complete random sample x_1, x_2, …, x_n from a continuous
probability distribution with density function f (x|ω), then x_i ∈ R, and the probability that an
observation falls within a small interval around x_i is proportional to f (x_i|ω).
Definition 18.5.1.
Without loss of generality, the likelihood function of a sample is proportional to the product of the
conditional probability of the data sample, given the parameter of interest, i.e.,

L(ω) ∝ ∏_{i=1}^n f (x_i | ω), ω ∈ Ω.   (18.23)
Definition 18.5.2.
The value of the parameter ω, which maximizes the likelihood L (ω) , hence the probability of the
observed dataset P (D , ω) , is known as the maximum likelihood estimator (MLE) of ω and is
denoted ω̂ .
Note that the MLE ω̂ is a function of the data sample x_1, x_2, …, x_n. The likelihood function
(→18.23) is often complex to manipulate and, in practice, it is more convenient to work with the
logarithm of L(ω), i.e., log L(ω), which yields the same optimal parameter ω̂.
The MLE problem can then be formulated as the following optimization problem, which can be
solved using the numerical methods, implemented in R, presented in the previous sections.
Maximize_{ω∈Ω} log L(ω).   (18.24)
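As a sketch of how (18.24) is solved with the numerical methods above, the following base-R example estimates the rate of an exponential distribution by maximizing the log-likelihood (the data vector is illustrative):

```r
# Numerical MLE of an exponential rate via optim.
x <- c(0.8, 1.7, 0.3, 2.4, 1.1, 0.6, 1.9, 0.9)   # an illustrative sample

neg_loglik <- function(omega) {
  if (omega <= 0) return(Inf)                    # keep omega in Omega = (0, Inf)
  -sum(dexp(x, rate = omega, log = TRUE))        # minimize -log L(omega)
}

sol <- optim(par = 1, fn = neg_loglik, method = "Brent", lower = 1e-6, upper = 100)
sol$par             # numerical MLE of the rate
length(x) / sum(x)  # analytic MLE n / sum(x_i), for comparison
```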
y_i ∈ {−1, +1}. The fundamental idea behind the concept of support vector machine (SVM)
classification [→191] is to find a pair (w, b) ∈ R^m × R such that the hyperplane defined by
⟨w, x⟩ + b = 0 separates the data points labeled y_i = +1 from those labeled y_i = −1, and
maximizes the distance to the closest points from either class. If the points (x_1, y_1) and (x_2, y_2)
lie on the hyperplanes ⟨w, x⟩ + b = +1 and ⟨w, x⟩ + b = −1, respectively, then
⟨w/‖w‖, (x_1 − x_2)⟩ = 2/‖w‖. Hence, for the distance between the points (x_1, y_1) and (x_2, y_2)
to be maximal, we need the ratio 2/‖w‖ to be as large as possible or, equivalently, the ratio
‖w‖/2 to be as small as possible, i.e.,

minimize_{w∈R^m} (1/2) ‖w‖^2.   (18.26)
Thus, to construct such an optimal hyperplane, it is necessary to solve the following problem:

minimize_{w∈R^m, b∈R} z(w) = (1/2) ‖w‖^2   (18.28)
subject to: y_i (⟨w, x_i⟩ + b) ≥ 1, for i = 1, …, m.
The above problem is a constrained optimization problem with a nonlinear (quadratic) objective
function and linear constraints, which can be solved using the Lagrange multiplier method.
The Lagrangian associated with the problem (→18.28) can be defined as follows:
L(w, b, λ) = (1/2) ‖w‖^2 − ∑_{i=1}^m λ_i (y_i (⟨w, x_i⟩ + b) − 1),   (18.29)
The Lagrangian (→18.29) must be minimized with respect to w and b, and maximized with
respect to λ.
Solving

∂L(w, b, λ)/∂b = 0 and ∂L(w, b, λ)/∂w = 0   (18.30)

yields

∑_{i=1}^m λ_i y_i = 0,   (18.31)

w = ∑_{i=1}^m λ_i y_i x_i.   (18.32)
Substituting w into the Lagrangian (→18.29) leads to the following optimization problem, also
known as the dual formulation of support vector classifier:
maximize_{λ∈R^m} Z(λ) = ∑_{i=1}^m λ_i − (1/2) ∑_{i=1}^m ∑_{j=1}^m λ_i λ_j y_i y_j ⟨x_i, x_j⟩   (18.33)
subject to: ∑_{i=1}^m λ_i y_i = 0,
λ_i ≥ 0, for i = 1, …, m.
Both problems (→18.28) and (→18.33) can be solved in R using the package Rsolnp, as illustrated in
Listing (18.18). However, since the constraints of (→18.28) are relatively complex, it is
computationally easier to solve the problem (→18.33) and then recover the vector w through
(→18.32).
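The Rsolnp listing (18.18) is not reproduced here; as a base-R sketch, the dual (18.33) can be checked on a toy two-point dataset x_1 = (1, 1), y_1 = +1 and x_2 = (−1, −1), y_2 = −1, where the constraint ∑_i λ_i y_i = 0 forces λ_1 = λ_2, reducing the dual to a one-dimensional maximization:

```r
# Toy check of the SVM dual (18.33) and the recovery of w via (18.32).
X <- rbind(c(1, 1), c(-1, -1))
y <- c(1, -1)
K <- X %*% t(X)                          # Gram matrix of inner products

Z <- function(lambda) {                  # dual objective with lambda1 = lambda2
  l <- c(lambda, lambda)
  sum(l) - 0.5 * sum((l %*% t(l)) * (y %*% t(y)) * K)
}

sol    <- optimize(Z, interval = c(0, 10), maximum = TRUE)
lambda <- sol$maximum                    # optimal multiplier (1/4 for this data)
w <- colSums(c(lambda, lambda) * y * X)  # recover w via (18.32)
w                                        # approximately (0.5, 0.5)
```

With w = (0.5, 0.5), the margin 2/‖w‖ equals 2√2, which is indeed the distance between the two points.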
18.7 Summary
Optimization is a broad and complex topic. One of the major challenges in optimization is the
determination of global optima for nonlinear and high-dimensional problems. Generally,
optimization methods find applications in attempts to optimize a parametric decision-making
process, such as classification, clustering, or regression of data. The corresponding optimization
problems either involve complex nonlinear functions or are based on data points, i. e., the
problems include discontinuities. Knowledge about optimization methods can be helpful in
designing analysis methods, since these usually involve difficult optimization problems. Hence, a
parsimonious approach to designing such analysis methods will also help to keep the resulting
optimization problems tractable.
18.8 Exercises
1. Consider the following unconstrained problem:
max_{(x1,x2)∈R^2} f (x1, x2) = −(x1 − 5)^2 − (x2 − 3)^2.   (18.34)

Using R, provide the contour plot of the function f (x1, x2) and solve the problem
(→18.34) using
the steepest descent method;
the conjugate gradient method;
Newton’s method;
the Nelder–Mead method;
simulated annealing.
2. Consider the following unconstrained problem:
min_{(x1,x2)∈R^2} z(x1, x2) = −2x1 − 3x2 + (1/5)x1^2 + 2x2^2 − 3x1x2.   (18.35)

Using R, provide the contour plot of the function z(x1, x2) and solve the problem
(→18.35) using
the steepest descent method;
the conjugate gradient method;
Newton’s method;
the Nelder–Mead method;
simulated annealing.
3. Using the R package lpSolveAPI, solve the following linear programming problems:
(A) Minimize f (x1, x2) = 2x1 + 3x2
subject to: (1/4)x1 + (1/2)x2 ≤ 4
x1 + 3x2 ≥ 20
x1 + x2 = 10
x1, x2 ≥ 0
(B) subject to: x1 + x2 ≥ 3
2x1 + x2 ≤ 4
x1 + x2 = 3
x1, x2 ≥ 0
4. Using the function solnp from the R package Rsolnp, solve the following nonlinear
constrained optimization problems:
(A) Minimize f (x1, x2) = (x1 − 1)^2 + (x2 − 2)^2
subject to: −x1 + x2 = 1
x1 + x2 ≤ 3
(B) Minimize f (x1, x2) = −x1^2 − x2^2 + 3x1 + 5x2
subject to:
x1 + x2 ≤ 7
x1 ≤ 5
x2 ≤ 6
Bibliography
[1] H. Abelson, G. J. Sussman, and J. Sussman. Structure and Interpretation of Computer Programs. MIT Press, 2nd edition, 1996.
[2] L. Adamic and B. Huberman. Power-law distribution of the world wide web. Science, 287:2115a, 2000.
[3] W. A. Adkins and M. G. Davidson. Ordinary Differential Equations. Undergraduate Texts in Mathematics. Springer, New York, 2012.
[4] R. Albert and A. L. Barabási. Statistical mechanics of complex networks. Rev. Mod. Phys., 74:47–97, 2002.
[5] G. R. Andrews. Foundations of Multithreaded, Parallel, and Distributed Programming. Addison-Wesley, 1999.
[6] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25(1):25–29, May 2000.
[7] K. Baltakys, J. Kanniainen, and F. Emmert-Streib. Multilayer aggregation of investor trading networks. Sci. Rep., 1:8198, 2018.
[8] A. L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.
[9] A. L. Barabási and Z. N. Oltvai. Network biology: understanding the cell's functional organization. Nat. Rev. Genet., 5:101–113, 2004.
[10] A.-L. Barabási. Network science. Philos. Trans. R. Soc. Lond. A, 371(1987):20120375, 2013.
[11] M. Barnsley. Fractals Everywhere. Morgan Kaufmann, 2000.
[12] R. G. Bartle and D. R. Sherbert. Introduction to Real Analysis. Wiley Publishing, 1999.
[13] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty. Nonlinear Programming: Theory and Algorithms. John Wiley & Sons, 2013.
[14] R. A. Becker and J. M. Chambers. An Interactive Environment for Data Analysis and Graphics. Wadsworth & Brooks/Cole, Pacific Grove, CA, USA, 1984.
[15] M. Behzad, G. Chartrand, and L. Lesniak-Foster. Graphs & Digraphs. International Series. Prindle, Weber & Schmidt, 1979.
[16] D. P. Bertsekas. Nonlinear Programming. Athena Scientific Optimization and Computation Series. Athena Scientific, 2016.
[17] D. P. Bertsekas and J. N. Tsitsiklis. Introduction to Probability, volume 1. Athena Scientific, 2002.
[18] J. K. Blitzstein and J. Hwang. Introduction to Probability. Chapman and Hall/CRC, 2014.
[19] D. Bonchev. Information Theoretic Indices for Characterization of Chemical Structures. Research Studies Press, Chichester, 1983.
[20] D. Bonchev and D. H. Rouvray. Chemical Graph Theory: Introduction and Fundamentals. Mathematical Chemistry. Abacus Press, 1991.
[21] D. Bonchev and D. H. Rouvray. Complexity in Chemistry, Biology, and Ecology. Mathematical and Computational Chemistry. Springer, New York, NY, USA, 2005.
[22] G. S. Boolos, J. P. Burgess, and R. C. Jeffrey. Computability and Logic. Cambridge University Press, 5th edition, 2007.
[23] S. Bornholdt and H. G. Schuster. Handbook of Graphs and Networks: From the Genome to the Internet. John Wiley & Sons, Inc., New York, NY, USA, 2003.
[24] U. Brandes and T. Erlebach. Network Analysis. Lecture Notes in Computer Science. Springer, Berlin Heidelberg New York, 2005.
[25] A. Brandstädt, V. B. Le, and J. P. Sprinrad. Graph Classes. A Survey. SIAM Monographs on Discrete Mathematics and Applications, 1999.
[26] L. Breiman. Random forests. Mach. Learn., 45:5–32, 2001.
[27] O. Bretscher. Linear Algebra with Applications. Prentice Hall, 3rd edition, 2004.
[28] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst., 30(1–7):107–117, 1998.
[29] M. Brinkmeier and T. Schank. Network statistics. In U. Brandes and T. Erlebach, editors, Network Analysis, Lecture Notes in Computer Science, pages 293–317. Springer, 2005.
[30] I. A. Bronstein, A. Semendjajew, G. Musiol, and H. Mühlig. Taschenbuch der Mathematik. Harri Deutsch Verlag, 1993.
[31] F. Buckley and F. Harary. Distance in Graphs. Addison Wesley Publishing Company, 1990.
[32] P. E. Ceruzzi. A History of Modern Computing. MIT Press, 2nd edition, 2003.
[33] S. Chiaretti, X. Li, R. Gentleman, A. Vitale, M. Vignetti, F. Mandelli, J. Ritz, and R. Foa. Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood, 103(7):2771–2778, 2003.
[34] W. F. Clocksin and C. S. Mellish. Programming in Prolog: Using the ISO Standard. Springer, 2002.
[35] S. Cole-Kleene. Mathematical Logic. Dover Books on Mathematics. Dover Publications, 2002.
[36] B. J. Copeland, C. J. Posy, and O. Shagrir. Computability: Turing, Gödel, Church, and Beyond. The MIT Press, 2013.
[37] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1990.
[38] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2001.
[39] T. M. Cover and J. A. Thomas. Information Theory. John Wiley & Sons, Inc., 1991.
[40] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications and Signal Processing. Wiley & Sons, 2006.
[41] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.
[42] G. Csardi and T. Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems:1695, 2006, http://igraph.sf.net.
[43] L. da F. Costa, F. Rodrigues, and G. Travieso. Characterization of complex networks: a survey of measurements. Adv. Phys., 56:167–242, 2007.
[44] R. de Matos Simoes and F. Emmert-Streib. Influence of statistical estimators of mutual information and data heterogeneity on the inference of gene regulatory networks. PLoS ONE, 6(12):e29279, 2011.
[45] R. de Matos Simoes and F. Emmert-Streib. Bagging statistical network inference from large-scale gene expression data. PLoS ONE, 7(3):e33624, 2012.
[46] P. Lafaye de Micheaux, R. Drouilhet, and B. Liquet. The R Software. Springer, 2013.
[47] J. Debasish. C++ and Object-Oriented Programming Paradigm. PHI Learning Pvt. Ltd., 2005.
[48] M. H. DeGroot and M. J. Schervish. Probability and Statistics. Pearson Education, 2012.
[49] M. Dehmer. Die analytische Theorie der Polynome. Nullstellenschranken für komplexwertige Polynome. Weissensee-Verlag, Berlin, Germany, 2004.
[50] M. Dehmer. On the location of zeros of complex polynomials. J. Inequal. Pure Appl. Math., 7(1), 2006.
[51] M. Dehmer. Strukturelle Analyse web-basierter Dokumente. Multimedia und Telekooperation. Deutscher Universitäts Verlag, Wiesbaden, 2006.
[52] M. Dehmer, editor. Structural Analysis of Complex Networks. Birkhäuser Publishing, 2010.
[53] M. Dehmer and F. Emmert-Streib, editors. Analysis of Complex Networks: From Biology to Linguistics. Wiley-VCH, Weinheim, 2009.
[54] M. Dehmer, K. Varmuza, S. Borgert, and F. Emmert-Streib. On entropy-based molecular descriptors: statistical analysis of real and synthetic chemical structures. J. Chem. Inf. Model., 49:1655–1663, 2009.
[55] M. Dehmer, K. Varmuza, S. Borgert, and F. Emmert-Streib. On entropy-based molecular descriptors: statistical analysis of real and synthetic chemical structures. J. Chem. Inf. Model., 49(7):1655–1663, 2009.
[56] R. Devaney and M. W. Hirsch. Differential Equations, Dynamical Systems, and an Introduction to Chaos. Academic Press, 2004.
[57] J. Devillers and A. T. Balaban. Topological Indices and Related Descriptors in QSAR and QSPR. Gordon and Breach Science Publishers, Amsterdam, The Netherlands, 1999.
[58] E. W. Dijkstra. A note on two problems in connexion with graphs. Numer. Math., 1:269–271, 1959.
[59] S. N. Dorogovtsev and J. F. F. Mendes. Evolution of Networks. From Biological Networks to the Internet and WWW. Oxford University Press, 2003.
[60] J. Duckett. Beginning HTML, XHTML, CSS, and JavaScript. Wrox, 2009.
[61] F. Emmert-Streib and M. Dehmer. Defining data science by a data-driven quantification of the community. Mach. Learn. Knowl. Extr., 1(1):235–251, 2019.
[62] F. Emmert-Streib. Exploratory analysis of spatiotemporal patterns of cellular automata by clustering compressibility. Phys. Rev. E, 81(2):026103, 2010.
[63] F. Emmert-Streib and M. Dehmer. Topological mappings between graphs, trees and generalized trees. Appl. Math. Comput., 186(2):1326–1333, 2007.
[64] F. Emmert-Streib and M. Dehmer, editors. Analysis of Microarray Data: A Network-based Approach. Wiley-VCH Publishing, 2010.
[65] F. Emmert-Streib and M. Dehmer. Identifying critical financial networks of the DJIA: towards a network-based index. Complexity, 16(1), 2010.
[66] F. Emmert-Streib and M. Dehmer. Influence of the time scale on the construction of financial networks. PLoS ONE, 5(9):e12884, 2010.
[67] F. Emmert-Streib and M. Dehmer. Networks for systems biology: conceptual connection of data and function. IET Syst. Biol., 5:185–207, 2011.
[68] F. Emmert-Streib and M. Dehmer. Evaluation of regression models: model assessment, model selection and generalization error. Mach. Learn. Knowl. Extr., 1(1):521–551, 2019.
[69] F. Emmert-Streib, M. Dehmer, and B. Haibe-Kains. Untangling statistical and biological models to understand network inference: the need for a genomics network ontology. Front. Genet., 5:299, 2014.
[70] F. Emmert-Streib, M. Dehmer, and O. Yli-Harja. Against Dataism and for data sharing of big biomedical and clinical data with research parasites. Front. Genet., 7:154, 2016.
[71] F. Emmert-Streib and G. V. Glazko. Network biology: a direct approach to study biological function. Wiley Interdiscip. Rev., Syst. Biol. Med., 3(4):379–391, 2011.
[72] F. Emmert-Streib, G. V. Glazko, G. Altay, and R. de Matos Simoes. Statistical inference and reverse engineering of gene regulatory networks from observational expression data. Front. Genet., 3:8, 2012.
[73] F. Emmert-Streib, S. Moutari, and M. Dehmer. The process of analyzing data is the emergent feature of data science. Front. Genet., 7:12, 2016.
[74] F. Emmert-Streib, S. Tripathi, O. Yli-Harja, and M. Dehmer. Understanding the world economy in terms of networks: a survey of data-based network science approaches on economic networks. Front. Appl. Math. Stat., 4:37, 2018.
[75] F. Emmert-Streib and M. Dehmer. Network science: from chemistry to digital society. Front. Young Minds, 2019.
[76] P. Erdös and A. Rényi. On random graphs. I. Publ. Math., 6:290–297, 1959.
[77] P. Erdös and A. Rényi. On random graphs. Publ. Math. Inst. Hung. Acad. Sci., 5:17, 1960.
[78] G. Fichtenholz. Differentialrechnung und Integralrechnung. Verlag Harri Deutsch, 1997.
[79] R. W. Floyd. The paradigms of programming. Commun. ACM, 22(8):455–460, 1979.
[80] L. C. Freeman. A set of measures of centrality based on betweenness. Sociometry, 40, 1977.
[81] L. C. Freeman. Centrality in social networks: conceptual clarification. Soc. Netw., 1:215–239, 1979.
[82] Thomas M. J. Fruchterman and Edward M. Reingold. Graph
drawing by force-directed placement. Softw. Pract. Exp.,
21(11):1129–1164, 1991. →
[83] R. G. Gallager. Information Theory and Reliable
Communication. Wiley, 1968. →
[84] A Gelman, J B Carlin, H S Stern, and D B Rubin. Bayesian Data
Analysis. Chapman & Hall/CRC, 2003. →
[85] C. Gershenson. Classification of random boolean networks.
In R. K. Standish, M. A. Bedau, and H. A. Abbass, editors, Artificial
Life VIII, pages 1–8. MIT Press, Cambridge, 2003. →
[86] G. H. Golub and C. F. Van Loan. Matrix Computation. The
Johns Hopkins University, 2012. →
[87] Geoffrey Grimmett, Geoffrey R Grimmett, and David
Stirzaker. Probability and Random Processes. Oxford University
Press, 2001. →
[88] Jonathan L Gross and Jay Yellen. Graph Theory and Its
Applications. CRC Press, 2005. →
[89] Grundlagen der Informatik für Ingenieure, 2008. Course
materials, School of Computer Science, Otto-von-Guericke-
University Magdeburg, Germany. a, b, c, d
[90] I. Gutman. The energy of a graph: old and new results. In A.
Betten, A. Kohnert, R. Laue, and A. Wassermann, editors,
Algebraic Combinatorics and Applications, pages 196–211. Springer
Verlag, Berlin, 2001. →
[91] Ian Hacking. The Emergence of Probability: A Philosophical
Study of Early Ideas About Probability, Induction and Statistical
Inference. Cambridge University Press, 2006. a, b
[92] P. Hage and F. Harary. Eccentricity and centrality in
networks. Soc. Netw., 17:57–63, 1995. →
[93] R. Halin. Graphentheorie. Akademie Verlag, Berlin, Germany,
1989. →
[94] F. Harary. Graph Theory. Addison Wesley Publishing
Company, Reading, MA, USA, 1969. a, b, c, d, e, f, g, h, i, j, k
[95] R. Harrison, L. G. Smaraweera, M. R. Dobie, and P. H. Lewis.
Comparing programming paradigms: an evaluation of functional
and object-oriented programs. Softw. Eng. J., 11(4):247–254,
1996. a, b
[96] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of
Statistical Learning. Springer, Berlin, New York, 2001. →
[97] D. C. Hoaglin, F. Mosteller, and J. W. Tukey. Understanding
Robust and Exploratory Data Analysis. Wiley, New York, 1983. →
[98] R. E. Hodel. An Introduction to Mathematical Logic. Dover
Publications, 2013. a, b, c, d, e, f, g, h, i
[99] A. S. Householder. The Numerical Treatment of a Single
Nonlinear Equation. McGraw-Hill, New York, NY, USA, 1970. →
[100] T. Ihringer. Diskrete Mathematik. Teubner, Stuttgart, 1994.
a, b
[101] Edwin T. Jaynes. Probability Theory: The Logic of Science.
Cambridge University Press, 2003. →
[102] J. Jost. Partial Differential Equations. Springer, New York, NY,
USA, 2007. a, b
[103] G. Julia. Mémoire sur l’itération des fonctions rationnelles.
J. Math. Pures Appl., 8:47–245, 1918. →
[104] B. Junker, D. Koschützki, and F. Schreiber. Exploration of
biological network centralities with centibin. BMC Bioinform.,
7(1):219, 2006. a, b
[105] Joseph B. Kadane. Principles of Uncertainty. Chapman and
Hall/CRC, 2011. →
[106] Tomihisa Kamada and Satoru Kawai. An algorithm for
drawing general undirected graphs. Inf. Process. Lett., 31(1):7–15,
1989. →
[107] M. Kanehisa and S. Goto. KEGG: Kyoto Encyclopedia of
Genes and Genomes. Nucleic Acids Res., 28:27–30, 2000. →
[108] Daniel Kaplan and Leon Glass. Understanding Nonlinear
Dynamics. Springer Science & Business Media, 2012. →
[109] S. A. Kauffman. The Origin of Order: Self Organization and
Selection in Evolution. Oxford University Press, USA, 1993. a, b, c
[110] S. V. Kedar. Programming Paradigms and Methodology.
Technical Publications, 2008. a, b, c, d
[111] U. Kirch-Prinz and P. Prinz. C++. Lernen und professionell
anwenden. mitp Verlag, 2005. a, b, c, d, e
[112] D. G. Kleinbaum and M. Klein. Survival Analysis: A Self-
Learning Text. Statistics for Biology and Health. Springer, 2005.
→
[113] V. Kontorovich, L. A. Beltrán, J. Aguilar, Z. Lovtchikova, and K.
R. Tinsley. Cumulant analysis of Rössler attractor and its
applications. Open Cybern. Syst. J., 3:29–39, 2009. →
[114] Kevin B. Korb and Ann E. Nicholson. Bayesian Artificial
Intelligence. CRC Press, 2010. a, b
[115] R. C. Laubenbacher. Modeling and Simulation of Biological
Networks. Proceedings of Symposia in Applied Mathematics.
American Mathematical Society, 2007. →
[116] S. L. Lauritzen. Graphical Models. Oxford Statistical Science
Series. Oxford University Press, 1996. →
[117] M. Z. Li, M. S. Ryerson, and H. Balakrishnan. Topological
data analysis for aviation applications. Transp. Res., Part E, Logist.
Transp. Rev., 128:149–174, 2019. →
[118] Dennis V. Lindley. Understanding Uncertainty. John Wiley &
Sons, 2013. →
[119] E. N. Lorenz. Deterministic nonperiodic flow. J. Atmos. Sci.,
20:130–141, 1963. →
[120] A. J. Lotka. Elements of Physical Biology. Williams and
Wilkins, 1925. →
[121] K. C. Louden. Compiler Construction: Principles and Practice.
Course Technology, 1997. a, b
[122] K. C. Louden and K. A. Lambert. Programming Languages:
Principles and Practice. Advanced Topics Series. Cengage
Learning, 2011. a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v,
w, x, y, z, aa, ab, ac, ad, ae, af, ag, ah, ai, aj, ak, al, am
[123] D. J. C. MacKay. Information Theory, Inference and Learning
Algorithms. Cambridge University Press, 2003. →
[124] D. Maier. Theory of Relational Databases. Computer Science
Press; 1st edition, 1983. →
[125] B. B. Mandelbrot. The Fractal Geometry of Nature. W. H.
Freeman and Company, San Francisco, 1983. a, b
[126] E. G. Manes and A. A. Arbib. Algebraic Approaches to
Program Semantics. Monographs in Computer Science. Springer,
1986. a, b, c, d, e, f, g
[127] M. Marden. Geometry of Polynomials. Mathematical Surveys
of the American Mathematical Society, Vol. 3. Rhode Island, USA,
1966. →
[128] O. Mason and M. Verwoerd. Graph theory and networks in
biology. IET Syst. Biol., 1(2):89–119, 2007. a, b
[129] N. Matloff. The Art of R Programming: A Tour of Statistical
Software Design. No Starch Press, 2011. →
[130] B. D. McKay. Practical graph isomorphism. Congr. Numer.,
30:45–87, 1981. →
[131] J. M. McNamee. Numerical Methods for Roots of Polynomials.
Part I. Elsevier, 2007. →
[132] A. Mehler, M. Dehmer, and R. Gleim. Towards logical
hypertext structure. A graph-theoretic perspective. In
Proceedings of I2CS’04, Lecture Notes, pages 136–150. Springer,
Berlin–New York, 2005. →
[133] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, and A.
H. Teller. Equations of state calculations by fast computing
machines. J. Chem. Phys., 21(6):1087–1092, 1953. →
[134] C. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM,
2000. →
[135] M. Mignotte and D. Stefanescu. Polynomials: An Algorithmic
Approach. Discrete Mathematics and Theoretical Computer
Science. Springer, Singapore, 1999. a, b, c, d, e, f
[136] J. C. Mitchell. Concepts in Programming Languages.
Cambridge University Press, 2003. a, b, c
[137] Michael Mitzenmacher and Eli Upfal. Probability and
Computing: Randomization and Probabilistic Techniques in
Algorithms and Data Analysis. Cambridge University Press, 2017.
→
[138] Mehryar Mohri, Afshin Rostamizadeh, and Ameet
Talwalkar. Foundations of Machine Learning. MIT Press, 2018. →
[139] D. Moore, R. de Matos Simoes, M. Dehmer, and F. Emmert-
Streib. Prostate cancer gene regulatory network inferred from
RNA-Seq data. Curr. Genomics, 20(1):38–48, 2019. →
[140] C. Müssel, M. Hopfensitz, and H. A. Kestler. BoolNet—an R
package for generation, reconstruction and analysis of Boolean
networks. Bioinformatics, 26(10):1378–1380, 2010. →
[141] M. Newman. Networks: An Introduction. Oxford University
Press, Oxford, 2010. →
[142] M. E. J. Newman. The structure and function of complex
networks. SIAM Rev., 45:167–256, 2003. a, b
[143] M. E. J. Newman, A. L. Barabási, and D. J. Watts. The
Structure and Dynamics of Networks. Princeton Studies in
Complexity. Princeton University Press, 2006. a, b
[144] Jorge Nocedal and Stephen Wright. Numerical Optimization.
Springer Science & Business Media, 2006. →
[145] Peter Olofsson. Probabilities: The Little Numbers That Rule
Our Lives. John Wiley & Sons, 2015. →
[146] G. O’Regan. Mathematics in Computing: An Accessible Guide
to Historical, Foundational and Application Contexts. Springer,
2012. →
[147] A. Papoulis. Probability, Random Variables, and Stochastic
Processes. McGraw-Hill, 1991. →
[148] Lothar Papula. Mathematik für Ingenieure und
Naturwissenschaftler Band 1: Ein Lehr-und Arbeitsbuch für das
Grundstudium. Springer-Verlag, 2018. →
[149] J. Pearl. Probabilistic Reasoning in Intelligent Systems.
Morgan-Kaufmann, 1988. →
[150] J. Pitman. Probability. Springer Texts in Statistics. Springer
New York, 1999. →
[151] V. V. Prasolov. Polynomials. Springer, 2004. →
[152] T. W. Pratt, M. V. Zelkowitz, and T. V. Gopal. Programming
Languages: Design and Implementation, volume 4. Prentice-Hall,
2000. →
[153] R: a language and environment for statistical
computing. www.r-project.org, 2018. R Development Core Team, R
Foundation for Statistical Computing, Vienna, Austria. →
[154] R Development Core Team. R: A Language and Environment
for Statistical Computing. R Foundation for Statistical Computing,
Vienna, Austria, 2008. ISBN 3-900051-07-0. →
[155] Q. I. Rahman and G. Schmeisser. Analytic Theory of
Polynomials. Critical Points, Zeros and Extremal Properties.
Clarendon Press, Oxford, UK, 2002. a, b, c
[156] J.-P. Rodrigue, C. Comtois, and B. Slack. The Geography of
Transport Systems. Taylor & Francis, 2013. a, b
[157] O. E. Rössler. An equation for hyperchaos. Phys. Lett.,
71A:155–157, 1979. →
[158] W. Rudin. Real and Complex Analysis. McGraw-Hill, 3rd
edition, 1986. a, b, c, d, e, f, g, h, i
[159] G. Sabidussi. The centrality index of a graph. Psychometrika,
31:581–603, 1966. →
[160] A. Salomaa. Formal Languages. Academic Press, 1973. a, b
[161] Leonard J. Savage. The Foundations of Statistics. Courier
Corporation, 1972. →
[162] J. Schneider and S. Kirkpatrick. Stochastic Optimization.
Scientific Computation. Springer Berlin Heidelberg, 2007. →
[163] Uwe Schöning. Algorithmen—kurz gefasst. Spektrum
Akademischer Verlag, 1997. →
[164] Uwe Schöning. Theoretische Informatik—kurz gefasst.
Spektrum Akademischer Verlag, 2001. a, b, c
[165] H. G. Schuster. Deterministic Chaos. Wiley VCH Publisher,
1988. →
[166] K. Scott. The SQL Programming Language. Jones & Bartlett
Publishers, 2009. a, b
[167] M. L. Scott. Programming Language Pragmatics. Morgan
Kaufmann, 2009. a, b
[168] R. W. Sebesta. Concepts of Programming Languages, volume
9. Addison-Wesley Reading, 2009. a, b, c, d, e, f
[169] C. E. Shannon and W. Weaver. The Mathematical Theory of
Communication. University of Illinois Press, 1949. a, b
[170] L. Shapiro. Organization of relational models. In
Proceedings of Intern. Conf. on Pattern Recognition, pages 360–365,
1982. →
[171] D. J. Sheskin. Handbook of Parametric and Nonparametric
Statistical Procedures. CRC Press, Boca Raton, FL; 3rd edition,
2004. a, b
[172] W. Sierpiński. On curves which contain the image of any
given curve. Mat. Sbornik, 30:267–287, 1916. In Russian; French
translation in Oeuvres Choisies II. →
[173] Devinderjit Sivia and John Skilling. Data Analysis: A Bayesian
Tutorial. OUP Oxford, 2006. →
[174] V. A. Skorobogatov and A. A. Dobrynin. Metrical analysis of
graphs. MATCH Commun. Math. Comput. Chem., 23:105–155,
1988. →
[175] P. Smith. An Introduction to Formal Logic. Cambridge
University Press, 2003. a, b, c, d, e, f
[176] K. Soetaert, J. Cash, and F. Mazzia. Solving Differential
Equations in R. Springer-Verlag, New York, 2012. →
[177] K. Soetaert and P. M. J. Herman. A Practical Guide to
Ecological Modelling. Using R as a Simulation Platform. Springer-
Verlag, New York, 2009. →
[178] D. Ştefănescu. Bounds for real roots and applications to
orthogonal polynomials. In Computer Algebra in Scientific
Computing, 10th International Workshop, CASC 2007, Bonn,
Germany, pages 377–391, 2007. →
[179] S. Sternberg. Dynamical Systems. Dover Publications, New
York, NY, USA, 2010. →
[180] James V. Stone. Bayes’ Rule: A Tutorial Introduction to
Bayesian Analysis. Sebtel Press, 2013. →
[181] S. H. Strogatz. Nonlinear Dynamics and Chaos: With
Applications to Physics, Biology, Chemistry, and Engineering.
Addison-Wesley, Reading, 1994. →
[182] K. Sydsaeter, P. Hammond, and A. Strom. Essential
Mathematics for Economic Analysis. Pearson; 4th edition, 2012. a,
b, c, d, e
[183] S. Thurner. Statistical mechanics of complex networks. In
M. Dehmer and F. Emmert-Streib, editors, Analysis of Complex
Networks: From Biology to Linguistics, pages 23–45. Wiley-VCH,
2009. →
[184] J. P. Tignol. Galois’ Theory of Algebraic Equations. World
Scientific Publishing Company, 2016. →
[185] Mary Tiles. Mathematics: the language of science? Monist,
67(1):3–17, 1984. →
[186] N. Trinajstić. Chemical Graph Theory. CRC Press, Boca Raton,
FL, USA, 1992. a, b, c
[187] S. Tripathi, M. Dehmer, and F. Emmert-Streib. NetBioV: an
R package for visualizing large-scale data in network biology.
Bioinformatics, 384, 2014. a, b
[188] S. B. Trust. Role of Mathematics in the Rise of Science.
Princeton Legacy Library. Princeton University Press, 2014. →
[189] J. W. Tukey. Exploratory Data Analysis. Addison-Wesley, New
York, 1977. →
[190] V. van Noort, B. Snel, and M. A. Huynen. The yeast
coexpression network has a small-world, scale-free architecture
and can be explained by a simple model. EMBO Rep., 5(3):280–
284, 2004. →
[191] V. Vapnik. Statistical Learning Theory. J. Willey, 1998. →
[192] Vladimir Naumovich Vapnik. The Nature of Statistical
Learning Theory. Springer, 1995. →
[193] V. Volterra. Variations and fluctuations of the number of
individuals in animal species living together. In R. N. Chapman,
editor, Animal Ecology, McGraw–Hill, 1931. →
[194] J. von Neumann. The Theory of Self-Reproducing Automata.
University of Illinois Press, Urbana, 1966. →
[195] Andreas Wagner and David A. Fell. The small world inside
large metabolic networks. Proc. R. Soc. Lond. B, Biol. Sci.,
268(1478):1803–1810, 2001. →
[196] J. Wang and G. Provan. Characterizing the structural
complexity of real-world complex networks. In J. Zhou, editor,
Complex Sciences, volume 4 of Lecture Notes of the Institute for
Computer Sciences, Social Informatics and Telecommunications
Engineering, pages 1178–1189. Springer, Berlin/Heidelberg,
Germany, 2009. →
[197] S. Wasserman and K. Faust. Social Network Analysis:
Methods and Applications. Structural Analysis in the Social
Sciences. Cambridge University Press, 1994. a, b, c, d, e, f, g
[198] D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-
world’ networks. Nature, 393:440–442, 1998. a, b, c, d
[199] A. Weil. Basic Number Theory. Springer, 2005. →
[200] Hadley Wickham. ggplot2: Elegant Graphics for Data
Analysis. Springer, 2016. →
[201] Hadley Wickham. Advanced R. Chapman and Hall/CRC; 2nd
edition, 2019. →
[202] R. Wilhelm and D. Maurer. Übersetzerbau: Theorie,
Konstruktion, Generierung. Springer, 1997. a, b, c, d
[203] Thomas Wilhelm, Heinz-Peter Nasheuer, and Sui Huang.
Physical and functional modularity of the protein network in
yeast. Mol. Cell. Proteomics, 2(5):292–298, 2003. →
[204] Leland Wilkinson. The grammar of graphics. In Handbook
of Computational Statistics, pages 375–414. Springer, 2012. →
[205] S. Wolfram. Statistical mechanics of cellular automata.
Rev. Mod. Phys., 55(3):601–644, 1983. →
[206] S. Wolfram. A New Kind of Science. Wolfram Media, 2002. →
[207] J. C. Lagarias, J. A. Reeds, M. H. Wright, and P. E. Wright.
Convergence properties of the Nelder–Mead simplex algorithm
in low dimensions. SIAM J. Optim., 9:112–147, 1998. →
Subject Index
A
adjacency matrix 1
algorithm 1
analysis 1
antiderivative 1
Asynchronous Random Boolean Networks 1
attractor 1
attractors 1
aviation network 1
B
bar plot 1
basic programming 1
basin of the attractor 1
Bayesian networks 1
Bayes’ theorem 1
Bernoulli distribution 1
Beta distribution 1
betweenness centrality 1
bifurcation 1
bifurcation point 1
binomial coefficient 1
Binomial distribution 1
bivariate distribution 1
Boolean functions 1
Boolean logic 1
Boolean value 1
Boundary Value ODE 1
Boundary Value ODE problem 1
breadth-first search 1
byte code compilation 1
C
Cartesian space 1
Cauchy–Schwarz inequality 1
cellular automata 1
centrality 1
central limit theorem 1, 2
chaotic behavior 1
character string 1
Chebyshev inequality 1
Chernoff bounds 1
Chi-square distribution 1
Cholesky factorization 1
Classical Random Boolean Networks 1
closeness centrality 1
clustering coefficient 1
cobweb graph 1
codomain of a function 1
complexity 1
complex number 1
computability 1
concentration inequalities 1
conditional entropy 1
conditional probability 1, 2
conjugate gradient 1
constrained optimization 1
constraints 1
continuous distributions 1
contour plot 1
coordinates systems 1
correlation 1
covariance 1
Cramer’s method 1
critical point 1
cross product 1
cumulative distribution function 1
curvature 1
D
data science 1
data structures 1
decision variables 1
definite integral 1
degree 1
degree centrality 1
degree distribution 1
De Morgan’s laws 1
density plot 1
dependency structure 1
depth-first search 1
derivative 1
derivative-free methods 1
determinant 1
Deterministic Asynchronous Random Boolean Networks 1
Deterministic Generalized Asynchronous Random Boolean
Networks 1
diameter 1
differentiable 1
differential equations 1
differentiation 1
directed acyclic graph 1
directed network 1
Dirichlet conditions 1
discrete distributions 1
distance 1
distance matrix 1
distribution function 1
domain of a function 1
dot plot 1
dot product 1
dynamical system 1
dynamical systems 1
E
eccentricity 1
economic cost function 1
edge 1
eigenvalues 1
eigenvectors 1
elliptic PDE 1, 2
entropy 1
error handling 1
Euclidean norm 1
Euclidean space 1
Euler equations 1
exception handling 1
expectation value 1
exponential distribution 1
extrema 1
F
first fundamental theorem of calculus 1
first-order PDE 1
fixed point 1
Fletcher–Reeves 1
fractal 1
functional programming 1
G
Gamma distribution 1
Generalized Asynchronous Random Boolean Networks 1
generalized tree 1
global maximum 1
global minimum 1
global network layout 1
gradient 1, 2
gradient-based algorithms 1
graph 1
graph algorithms 1
graphical models 1
graph measure 1
H
Hadamard 1
heat equation 1
Hessian 1, 2
Hestenes–Stiefel 1
histogram 1
Hoeffding’s inequality 1
hyperbolic PDE 1, 2
I
image plot 1
imperative programming 1
indefinite integral 1
information flow 1
information theory 1
Initial Value ODE 1
Initial Value ODE problem 1
integral 1
Intermediate value theorem 1
J
Jacobian 1
joint probability 1
K
Kolmogorov 1
Kullback–Leibler divergence 1
L
Lagrange multiplier 1
Lagrange polynomial 1, 2
law of large numbers 1
law of total probability 1
layered network layout 1
likelihood 1
likelihood function 1
limit 1
limiting value 1, 2
linear algebra 1
linear optimization 1
Linux 1
local maximum 1
local minimum 1
logical statement 1
logic programming 1
logistic map 1
Log-normal distribution 1
Lotka–Volterra equations 1
LU factorization 1
M
Maclaurin series 1
Markov inequality 1
matrices 1
matrix factorization 1
matrix norms 1
maximization 1
maximum likelihood estimation 1
minimization 1
mixed product 1
modular network layout 1
moment 1
multi-valued function 1
multivariate distribution 1
mutual information 1
N
Negative binomial distribution 1
Nelder–Mead method 1
NetBioV 1
network 1
network visualization 1
Neumann conditions 1
Newton’s method 1
node 1
non-linear constrained optimization 1
normal distribution 1
numerical integration 1
O
objective function 1
object-oriented programming 1
operations with matrices 1
optimization 1
orbit 1
ordinary differential equations – ODE 1
orthogonal unit vectors 1
over-determined linear system 1
P
package 1
parabolic PDE 1, 2
partial derivative 1, 2
partial differential equations – PDE 1
path 1
Pearson’s correlation coefficient 1
periodic behavior 1
periodic point 1
pie chart 1
plot 1
p-norm 1
Poisson distribution 1
Poisson’s equation 1
Polak–Ribière–Polyak 1
polynomial interpolation 1
posterior 1
predator–prey system 1
prior 1
probability 1
programming languages 1
programming paradigm 1
Q
QR factorization 1
qualitative techniques 1
quantitative method 1
R
radius 1
Random Boolean network 1
random networks 1
random variables 1
rank of a matrix 1
reading data 1
real sequences 1
repositories 1
Riemann sum 1
Robin conditions 1
root finding 1
rug plot 1
Rules of de Morgan 1
S
sample space 1
scalar product 1
scale-free networks 1
scatterplot 1
scope of variables 1
search direction 1
second fundamental theorem of calculus 1
second-order PDE 1
self-similarity 1
sequence 1
set operations 1, 2
sets 1
Sherman–Morrison–Woodbury formula 1
shortest path 1
Sierpiński’s carpet 1
Simulated Annealing 1
singular value decomposition – SVD 1
small-world networks 1
sorting vectors 1
spanning trees 1
special matrices 1
stable fixed point 1
stable point 1
standard deviation 1
standard error 1
stationary point 1
statistical machine learning 1
steepest descent 1
strip plot 1
Student’s t-distribution 1
Support Vector Machine 1
systems of linear equations 1
T
Taylor series expansion 1
trace 1
transportation networks 1
tree 1
triangular linear system 1
Turing completeness 1
Turing machines 1
U
Ubuntu 1
unconstrained optimization 1
uncontrollable variables 1
under-determined linear system 1
undirected network 1
uniform distribution 1
useful commands 1
V
variance 1
vector decomposition 1
vector projection 1
vector reflection 1
vector rotation 1
vectors 1
vector scaling 1
vector sum 1
vector translation 1
Venn diagram 1
W
walk 1
wave equation 1
Weibull distribution 1
weighted network 1
well-determined linear system 1
writing data 1
writing functions 1
Notes
1 Vi is a very simple yet powerful and fast editor used on
Unix or Linux computers.
1 Note that the direction of the vector u for the cross
product V × W is determined by the right-hand rule,
i.e., it is given by the direction of the right-hand thumb
when the other four fingers are rotated from V to W.
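The right-hand-rule orientation of the cross product can also be checked numerically in R. A minimal sketch, computing the cross product component-wise (the helper name cross3 is hypothetical, not from the book):

```r
# Cross product of two 3-dimensional vectors, computed component-wise.
cross3 <- function(V, W) {
  c(V[2] * W[3] - V[3] * W[2],
    V[3] * W[1] - V[1] * W[3],
    V[1] * W[2] - V[2] * W[1])
}

V <- c(1, 0, 0)  # unit vector along the x-axis
W <- c(0, 1, 0)  # unit vector along the y-axis
u <- cross3(V, W)
u  # c(0, 0, 1): points along +z, as the right-hand rule predicts
```

Rotating the right hand's fingers from the x-axis toward the y-axis indeed leaves the thumb pointing along +z; swapping the arguments flips the sign of the result.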
1 When we speak about a random sample, we mean an i.i.d.
(independent and identically distributed) sample.