MEAP Edition
Manning Early Access Program
Deep Learning for Natural Language Processing
Version 7

Copyright 2020 Manning Publications

For more information on this and other Manning titles go to


www.manning.com/



Welcome
Thank you for purchasing the MEAP for Deep Learning for Natural Language
Processing. This book will teach you to apply deep learning to one of the most vibrant
applications of current AI: the analysis of natural language. The book is addressed to
anyone who needs a practical guide to current deep learning techniques, from developers
to students and researchers of computational linguistics.
While rooted in linguistics, the field of computational linguistics, or Natural Language
Processing (NLP, for short), borrows many tools and techniques from neighboring fields
such as computer science, logic, statistics and machine learning. Much of the current
research in NLP now uses deep learning. It is quite hard to keep track of the many papers
and software repositories that are being produced in this fast-paced field. You may even feel
you're missing out on the cutting-edge developments. This is exactly what this book attempts to remedy: it will open up the toolbox of deep learning for NLP practitioners, outlining the pros and cons of the various current techniques and architectures. It speaks the language of NLP
and is actually written by a computational linguist.
The book is written to be self-contained, but cannot cover all details of deep learning and
NLP. Two excellent companion books, which can either be used as reference or primer,
would be Manning’s Deep Learning with Python, and Natural Language Processing in
Action. All examples in this book are written in Python and the elegant Keras library. This is
a code-intensive book, with lots of examples of deep learning solutions for well-known NLP
problems. The book contains references to recent literature and some of our code draws
inspiration from great public domain software repositories, including the source code
examples in the Keras distribution.
Deep learning is under continuous development and may seem like a volatile moving
target. I hope this practical and hands-on book will help you become proficient in the current
types of deep learning, and will provide you with a solid basis for keeping up with new
developments.
Writing an ambitious book like this one is a daunting task, and is prone to imperfection.
Your feedback is more than welcome. If you have any questions, comments, or suggestions,
please share them in Manning's liveBook Discussion Forum for this book.

—Stephan Raaijmakers



brief contents
PART 1: INTRODUCTION
1 Deep learning for NLP
2 Deep Learning and language: the basics
PART 2: DEEP NLP
3 Text embeddings
4 Textual similarity
5 Sequential NLP and memory
PART 3: ADVANCED TOPICS
6 Episodic memory and NLP
7 Attention
8 Multitask learning
PART 4: TRANSFORMERS
9 Transformers and Embeddings
10 Applications of Transformers

Appendix: A random walk through NLP



1 Deep learning for NLP

Language comes naturally to humans, but is traditionally hard to grasp for computers. This book
addresses the application of recent and cutting-edge deep learning techniques to automated
language analysis. Deep learning has emerged in the last decade as the vehicle of the latest wave in AI. Results have consistently redefined the state-of-the-art for a plethora of data analysis tasks
in a variety of domains. For an increasing amount of deep learning algorithms,
better-than-human (human-parity or superhuman) performance has been reported: for instance,
speech recognition in noisy conditions, and medical diagnosis based on images. Current deep
learning-based natural language processing (NLP) outperforms all pre-existing approaches by a large margin. What exactly makes deep learning so well suited to these intricate analysis tasks, in
particular language processing? This chapter presents some of the background necessary for
answering this question. We will guide you through a selection of important topics in machine
learning for NLP.

1.1 Overview of the book


The intended audience for this book consists of anyone working in natural language processing:
computational linguists, software engineers, and machine learning or NLP students. The field of
natural language processing is vast, and comprises a daunting number of formalisms and approaches. With deep learning entering the stage, many are eager to get their feet wet, but are put off by the highly technical nature of deep learning and the fast pace of this field, with
new approaches, software and papers emerging on a daily basis.

What you will get from this book is a thorough introduction to deep learning applied to a variety
of language analysis tasks, supported by actual hands-on code. Explicitly linking the evergreens
of computational linguistics, such as part-of-speech tagging or query answering, to deep learning will help you become a proficient deep learning NLP expert. Beyond this, the book covers state-of-the-art approaches to challenging new problems, such as explainable AI and the
role of semantics.


This first chapter covers

A short roadtrip through machine learning applied to NLP.


A brief historical overview of deep learning.
An introduction to vector-based representations of language.

Figure 1.1 Chapter organization.

You will find a succinct overview of NLP in Appendix 1. Let's start off by discussing a few well-known machine learning-based NLP algorithms in some detail, illustrated with a handful of practical examples to whet your appetite. After that, we motivate the case for deep learning-based
NLP.


1.2 A selection of machine learning methods for NLP

Figure 1.2 Machine learning for NLP.

Current natural language processing heavily relies on machine learning. Machine learning has its
roots in statistics, building among others on the seminal work by Thomas Bayes and Laplace in the 18th and 19th centuries (Bayes, Laplace) and the least squares method for curve approximation by Legendre in 1812 (legendre). The field of neural computing started with the work of McCulloch and Pitts in 1943, who put forward a formal theory (and logical calculus) of neural networks (mccullochpits43). It would take until 1950 before learning machines were proposed by Alan Turing (Turing).

All machine learning algorithms that perform classification (labeling) share a single goal: to
arrive at linear separability of data that is labeled with classes, that is, labels that indicate a (usually
exclusive) category a data point belongs to. Data points presented to a machine learning
algorithm typically consist of vector representations of descriptive traits. These representations
constitute a so-called input space. The subsequent processing, manipulation and abstraction of
the input space during the learning stage of a self-learning algorithm yields a feature space.
Some of this processing can be done external to the algorithm: raw data can already be converted
to features as part of a pre-processing stage, which technically creates an input space consisting
of features. The output space consists of class labels that separate the various data points in a
dataset, based on the class boundaries. The essence of deep learning, as we will see, is to learn
abstract representations in the feature space.


Figure 1.3 From input space to output space (labels).

Training a machine learning component involves learning boundaries between classes, which
may depend on quite complex functions. The burden of learning class separability can be
alleviated by smart feature pre-processing. Learning the class boundaries occurs by performing
implicit or explicit transformations on linearly inseparable input spaces. The following figure
shows a non-linear class boundary: a line separating objects in two classes that cannot be
modeled by a linear function f(x)=ax+b. The function corresponding to this line is a non-linear
classifier. A real-world example would be a bowl of multi-colored marbles, mixed in such a way
that they cannot be separated from each other by means of a straight plate (like a scoop).

Figure 1.4 Non-linear classifier.

A linear function that separates classes with a straight line is a linear classifier and would
produce a picture like:


Figure 1.5 Linear classifier

We will now briefly address three types of machine learning approaches that have had major
uptake in NLP:

The single-layer perceptron and its generalization to the multilayer perceptron


Support vector machines, and
Memory-based learning.

While there is a lot more to the story, these three types respectively embody the neural (or cognitive), eager, and lazy types of machine learning. All of these approaches relate naturally to
the deep learning approach to natural language analysis, which is the main topic of this book.

1.2.1 The perceptron


In 1957, the first implementation of a biologically inspired machine learning component was
realized: Rosenblatt’s perceptron (Rosenblatt). This device, implemented on physical hardware,
allowed processing of visual stimuli represented by a square array of 400 (20 by 20) photosensitive cells. The weights of this network were set by electromotors driving potentiometers. The learning part of this perceptron was based on a simple one-layer neural network, which effectively became the archetype of neural networks (see Figure 1.6).


Figure 1.6 Rosenblatt’s perceptron

Suppose you have a vector of features that describe aspects of a certain object of interest, like the
words in a document, and you want to create a function from these features to a binary label (for
instance, you want to decide if the document conveys a positive or a negative sentiment). The
single-layer perceptron is capable of doing this. It produces a binary output y (0 or 1) from a
weighted combination of input values x1…​xn, based on a threshold θ and a bias b:

Figure 1.7 Perceptron decision function

The weights w1,…​wn are learned from annotated training data, consisting of input vectors labeled
with output labels. The thresholded unit is called a neuron. It receives the summed and weighted
input v. So, assume we have a set of weights and associated inputs:

Table 1.1 Weighted input.

Weight 1: 3
Weight 2: 5
Weight 3: 7
Input 1: 10
Input 2: 20
Input 3: 30

Then their summed and weighted output would be

3*10+5*20+7*30=310
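As a quick sketch (not code from the book), the decision rule of Figure 1.7 can be written in a few lines of plain Python; the bias and threshold values of zero below are assumptions for illustration only:

def perceptron_output(inputs, weights, bias=0.0, threshold=0.0):
    # Summed and weighted input v feeding into the neuron.
    v = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Thresholded binary output of the neuron.
    return 1 if v > threshold else 0

# The example from Table 1.1: the weighted sum is 3*10 + 5*20 + 7*30 = 310.
print(perceptron_output([10, 20, 30], [3, 5, 7]))   # prints 1, since 310 > 0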


Figure 1.8 Summed and weighted perceptron input

This simplistic network is able to learn a specific set of functions that address the class of
linearly separable problems: problems that are separable in input space with a linear function.
Usually, these are the easier problems in classification. It is quite common for data to be heavily entangled. Consider undoing a knot in two separate ropes. Some knots are easy and can be undone in one step. Other knots need many more steps. This is the business of machine learning
algorithms: undoing the intertwining of data objects living in different classes.

For NLP, the single-layer perceptron nowadays plays a marginal role, but it underlies several
derived algorithms that strive for simplicity, such as online learning (ONLINE).

A practical example of a perceptron classifier is the following. We set out to build a document
classifier that categorizes raw texts as being broadly about either atheism or medical topics. The
popular 20 newsgroups dataset (20newsgroups), one of the most widely used datasets for
building and evaluating document classifiers, consists of newsgroup (usenet) texts distributed
over 20 hand-assigned topics. Here is what we do:

We make a subselection for two newsgroups of interest: alt.atheism and sci.med.


We train a simple perceptron on a vector representation of the documents in these two
classes. A vector is nothing more than a container (an ordered list of a finite dimension)
for numerical values.
The vector representation is based on a statistical representation of words called TF.IDF,
which we will discuss in the section on representations below. For now, just assume TF.IDF
is one of those magic tricks that turn documents into vectors that can be fed to a machine
learning algorithm.

Don’t worry if you don’t completely understand this code right now. It’s here to give you an idea
of what the code looks like for a basic perceptron.


Listing 1.1 A simple perceptron-based document classifier.


from sklearn.linear_model import Perceptron
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'sci.med']

train = fetch_20newsgroups(subset='train',categories=categories, shuffle=True)

perceptron = Perceptron(max_iter=100)

from sklearn.feature_extraction.text import CountVectorizer


cv = CountVectorizer()
X_train_counts = cv.fit_transform(train.data)

from sklearn.feature_extraction.text import TfidfTransformer


tfidf_tf = TfidfTransformer()
X_train_tfidf = tfidf_tf.fit_transform(X_train_counts)

perceptron.fit(X_train_tfidf,train.target)

test_docs = ['Religion is widespread, even in modern times',
             'His kidney failed','The pope is a controversial leader',
             'White blood cells fight off infections',
             'The reverend had a heart attack in church']

X_test_counts = cv.transform(test_docs)
X_test_tfidf = tfidf_tf.transform(X_test_counts)

pred = perceptron.predict(X_test_tfidf)

for doc, category in zip(test_docs, pred):
    print('%r => %s' % (doc, train.target_names[category]))

We import a basic perceptron classifier from sklearn.


We import a routine for fetching the 20 newsgroups dataset from sklearn.
We limit the categories of the dataset.
We obtain documents for our category selection.
Our perceptron is defined. It will be trained for 100 iterations.
The familiar CountVectorizer is fit on our training data.
Now, we load, fit and deploy a TF.IDF transformer from sklearn. It computes
TF.IDF representations of our count vectors.
The perceptron is trained on the TF.IDF vectors.
Our test data.
The test data is vectorized, first to count vectors, then to TF.IDF vectors.
The perceptron is applied to the test documents.
Results are printed.

This produces the following results:

Religion is widespread, even in modern times => alt.atheism


His kidney failed => sci.med

The pope is a controversial leader => alt.atheism

White blood cells fight off infections => sci.med

The reverend had a heart attack in church => sci.med

Apparently, these few short texts can be linearly separated by a simple, weight-based algorithm.
This example is a huge simplification: the topics chosen are quite distinct. In real life, linear
algorithms fall short in separating topics that overlap and share similar vocabulary.

The multilayer perceptron generalizes the single-layer model of the original perceptron to a
multilayer model, with at least 3 layers: an input layer, one or more hidden representational
layers, and an output layer:

Figure 1.9 A multilayer perceptron with one hidden layer (h1…hn).
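As a minimal illustration (not code from the book), such an architecture can be expressed in Keras in a few lines; the dimensions below (100 input features, 64 hidden units, 2 output classes) are arbitrary assumptions:

from keras.models import Sequential
from keras.layers import Dense

# Input layer (implicit, 100 features), one hidden layer, and an output layer.
mlp = Sequential()
mlp.add(Dense(64, activation='relu', input_dim=100))
mlp.add(Dense(2, activation='softmax'))
mlp.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])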

1.2.2 Support Vector Machines


As mentioned, machine learning algorithms that perform classification (the labeling of objects
with classes) attempt to arrive at linear boundaries between data points. Recall our
linear classifier (Figure 1.5) for such a linear boundary. Imagine you are seeing two objects (like an orange
and an apple) on a plate, one of which is partially obscuring the other. If you close one eye, you
will lose 3D stereovision, and you cannot separate the two objects in 2D. If you open both eyes,
you will be able to separate the two objects. Support Vector Machines routinely perform this
migration to a higher dimension, separating objects in this space. Their secret weapon: kernels.


Figure 1.10 From 2D to 3D with a kernel.

A support vector machine is a binary classifier that implicitly maps data in feature space to
higher dimensions in which data becomes separable by a linear plane, called a hyperplane. This
mapping is implicit, and is carried out by a kernel function. This is a function that transforms the
original input space to an alternative representation that implicitly has a higher dimensionality,
with the aim of disentangling the data and making it linearly separable.

But the migration is implicit in the sense that it takes the form of a similarity function (φ in the
picture above) applied to two feature vectors, just computing their distance. This is commonly called the kernel trick. It sounds like sheer magic, but it is actually quite simple. Let's take a
look.

You should already be familiar with the dot product of two vectors. If not, please see Appendix 2
for a refresher. To recap, the standard dot product of two vectors a and b is the sum of the
pairwise products of their elements:

Listing 1.2 Dot product in Python.


def dot_product(a,b):
    return sum( [a[i]*b[i] for i in range(len(a))] )

So, a dot product is just a multiplicative operation on two vectors that produces a single number.
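For example, the dot product of (1,2,3) and (4,5,6) is 1*4 + 2*5 + 3*6 = 32:

print(dot_product([1, 2, 3], [4, 5, 6]))   # 32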

Kernels are generalizations of this dot product between vectors: they compute the dot product
between altered versions of these vectors. The nature of the alteration is specified by a kernel function φ. Generally speaking, a kernel function takes two vectors, mixes in a constant (a kernel
parameter) and adds some kernel-specific ingredients to produce a specific form of a dot product
of the two vectors.

Let’s return to our orange and apple. The objects are described by pairs of coordinates (x,y),
since the table they're lying on is a flat XY-plane. Like other types of kernels, the so-called polynomial kernel maps lower-dimensional spaces to higher-dimensional ones. You may recall from high school math that a polynomial function produces a value using addition, subtraction, multiplication or positive exponentiation only, like y=4x^2+10. Polynomial kernels work on two input values (vectors of numeric values) and usually a constant. They compute a result using a polynomial.

For instance, a simple quadratic kernel K, using a constant c, and addressing the two two-dimensional XY-vectors x=(x1,x2) and y=(y1,y2) that describe our orange and apple, looks like:

Figure 1.11 Quadratic kernel.

Notice the superscripted T? That is vector transposition (see Appendix B), necessary for vector
multiplication. What does the kernel do? It computes a product between two vectors. This
product is a number expressing a relation between the two input vectors. But the tedious
expansion of this kernel shows that we actually operate in a six- (and not even a three-!)
dimensional space: just count the factors separated by the plus sign in the result; we have six
such factors. In this case, the kernel function K implicitly computes its dot product between the
following vectors:

since the result is the product of these two vectors. But it never explicitly created these vectors.
The whole point of this kernel trick is that, hopefully, in the higher-dimensional space that is the
playground of the kernel, things become more easily separable than in the entangled input space.
Kernels do not explicitly represent this space, they implicitly work in it. You can imagine that,
for long vectors and large exponents of the polynomial, this kernel trick quickly becomes a
practical advantage!
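To make the kernel trick concrete, here is a small numerical sketch (not from the book, and reusing the dot_product function from Listing 1.2) that checks that the quadratic kernel K(x,y) = (x·y + c)^2, computed directly on the two-dimensional input vectors, gives the same number as an explicit dot product between the corresponding six-dimensional vectors:

from math import sqrt

def quadratic_kernel(x, y, c=1.0):
    # (x . y + c)^2, computed directly on the original 2D vectors.
    return (x[0]*y[0] + x[1]*y[1] + c) ** 2

def phi(x, c=1.0):
    # The implicit six-dimensional feature map this kernel corresponds to.
    return [x[0]*x[0], x[1]*x[1], sqrt(2)*x[0]*x[1],
            sqrt(2*c)*x[0], sqrt(2*c)*x[1], c]

x, y = (1.0, 2.0), (3.0, 4.0)
print(quadratic_kernel(x, y))        # 144.0
print(dot_product(phi(x), phi(y)))   # 144.0: the same value, without ever building phi explicitly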


In the transformed space created by the kernel trick, two classes are ideally separated with
maximally wide boundaries (called maximum margins). The data points determining the slope of
these boundaries are called support vectors.

Figure 1.12 Maximum margins of an SVM. The support vectors are the points on the
dashed lines.

Learning weights that optimize the margins with the least error (as measured on some held-out
test data) is the job an SVM has to solve during training. After training, the support vectors and
the various weights plus biases constitute the model. New input is projected onto the support
vectors, and depending on which side of the decision boundary it lands on, it receives a positive or negative label (recall that SVMs are binary classifiers). So, SVMs throw away a lot of their training data, only keeping some of it: the support vectors. They can be called eager forms of machine learning.
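For comparison with the perceptron of Listing 1.1, an SVM with a polynomial kernel is available in sklearn. The sketch below is not from the book; it assumes the variables X_train_tfidf, train, test_docs and X_test_tfidf from Listing 1.1 are still in scope:

from sklearn.svm import SVC

# A binary SVM with a quadratic (degree-2) polynomial kernel.
svm = SVC(kernel='poly', degree=2, coef0=1.0)
svm.fit(X_train_tfidf, train.target)

pred = svm.predict(X_test_tfidf)
for doc, category in zip(test_docs, pred):
    print('%r => %s' % (doc, train.target_names[category]))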

The connection between kernel-based machine learning and neural network-based learning is
briefly discussed in the Deep Learning with Python book. It is based on the intuition that hidden
layers in neural networks act like kernels in disentangling linearly inseparable input data
layer-by-layer, step-by-step.


1.2.3 Memory-based learning


Unlike the eager types of machine learning that build compact and representative models of their
training data, memory-based learning (MBL, Daelemans et al) is a form of lazy learning: it does
not compress training data into generalizations, but instead keeps all training data available in
memory. During classification, the actual processing of the training data takes place: input data is
matched with training data by the application of similarity or distance measures. Similar to
Support Vector Machines, distance functions between vectors compute similarities. But here, we
work on explicit vectors, and we do not perform any dimensionality tricks.

A well-known distance function is the IB1 metric, a simplified version of which is:

Listing 1.3 IB1 distance metric.

def IB1(a,b):
    return sum( [delta(a[i],b[i]) for i in range(len(a))] )

def delta(x,y):
    if x==y:
        return 0
    if x!=y:
        return 1
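As a small usage sketch (the feature vectors below are invented for illustration), the training item with the smallest IB1 distance to a test item provides its label in the simplest (k=1) case:

train_items = [(('a', 'b', 'c'), 'class1'), (('x', 'y', 'c'), 'class2')]
test_item = ('a', 'b', 'z')

# Pick the training item with the smallest IB1 distance to the test item.
nearest = min(train_items, key=lambda item: IB1(item[0], test_item))
print(nearest[1])   # 'class1': one mismatching feature, versus three for 'class2'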

This metric computes the distance between two feature vectors on the basis of feature value
overlap: exact similarity for symbolic (non-numerical) values. Most MBL algorithms extend
these distance metrics with feature weighting (e.g. information-gain based weighting) or
exemplar weighting (REF). They partition the search space for matching in sets consisting of
training items with the same distance to the current test item. For instance, sets of distances d1, d2, … can be found first, after which the algorithm computes the most frequent class in those sets. It then votes over all classes to determine the most probable label for the test item. The k-parameter specifies the number of distance sets to take into account, which is why MBL is often k-nearest distances classification rather than k-nearest neighbor classification. MBL has
interesting advantages for NLP. Keeping all original training data available for classification
allows handling exceptions in language. For instance, in certain languages, morphological
operations on words, such as diminutive formation, can be arranged in pockets of exceptions:
small families of subregularities. In Dutch, for instance, we encounter

gat —> gaatje ('small hole')


pad —> paadje ('small path')
blad —> blaadje ('small leaf')

Notice the extra vowel in the diminutive form, and the diminutive suffix -je. We also have a
subfamily of patterns like

kat —> katje ('small cat')


rat —> ratje ('small rat')

schat —> schatje ('(my) dear')


schot —> schotje ('small fence')
schip —> scheepje ('small ship')
schaap —> schaapje ('small sheep')
guit —> guitje ('little maverick')

While schat is phonetically quite similar to gat (-ch- is pronounced similarly to -g-), it is inflected analogously to other words prefixed with sch-. But schip does not follow this pattern.

The benefit of using a memory-based learner for highly exceptional data is that exceptions can, in principle, always be retrieved for labeling a similar exceptional input case. Memory-based learners have perfect memory: they store everything. But eager machine learning models tend to 'compile away' these exceptions. They are after strong regularities rather than exceptions. Editing away these exceptions from training data has been found detrimental to the generalization
accuracy (the ability to handle new, unseen cases outside of the training data) of the resulting
classifier (DaelemansForgetting). Apparently, the subregularities of pockets of exceptions are
beneficial to keep. One of the challenges for deep learning, an 'eager' form of machine learning,
will be to handle these subregularities in a similar vein.

1.3 Deep Learning

Figure 1.13 Deep learning.

Deep learning is one of the most vibrant buzz phrases of the past 5 years. Deep learning by itself
is nothing new: it is, in the strictest sense, a neural network with lots of internal or hidden layers
and specific filtering operations. Deep learning deploys constructive ways of dealing with large
quantities of information, organized in many layers of representations. While deep learning in its
essential form was invented in the 1960s, it took three decades before it
was finally ready to use. In this section, we will shed some light on why this happened.

But first: which problem does deep learning actually solve for NLP? Deep learning can be seen
as a very effective statistical technique for working with (very) many parameters. Deep learning
is able to handle millions of parameters effectively, each one encoding an aspect of the input data. Layers in deep learning act as transformations that, step by step, align the input data with the labels we assign to that data; they disentangle the spaghetti of input data in such a way that labels can be assigned more easily. The fact that we can stack many such decluttering steps on top
of each other is a major forte of deep learning. For language, deep learning provides two types of
advantages:

The repeated application of data decluttering steps is beneficial for NLP, but this is not
specific to language; it applies to virtually every modality deep learning is applied to.
Deep learning has facilities for handling sequential information, with memory operators
and buffers. This is quite important for language and, in this respect, deep learning is a
form of stateful machine learning, as opposed to other, usually stateless, types of
machine learning models. These models also usually perform just a single
disentanglement step (like Support Vector Machines).

Figure 1.14 A timeline of neural learning. As we will point out, the gist of Deep Learning
was actually invented in the 1960s.

(Image credits unclear, maybe


www.slideshare.net/deview/251-implementing-deep-learning-using-cu-dnn)

Central to deep learning is the learning of hierarchical representations of data. Under a vertical
interpretation of a multi-layer neural network, every 'lower' layer feeds into a 'higher' layer.
Layers can be seen as complex functions processing a set of inputs and weights. These weights
encode the importance of the information stored in the network. Networks receive their inputs in
a dedicated input layer, and process that input layer-by-layer, sending it 'upwards' into the
network.

Output layers, finally, produce an outcome: the label the model assigns to its input. All layers
except input and output layers are 'hidden layers' as they cannot be readily observed. As mentioned, hidden layers in neural networks disentangle linearly inseparable input data
layer-by-layer, step-by-step.

During training, weights are estimated and fine-tuned between 'neurons', which are the basic
processing units of a neural network. Every layer keeps a record of the weights for the neurons
that feed into that layer. Estimating weights is the essential business of neural networks.

Since the layers in a neural network are hierarchically organized (they are stacked), the
representations they produce can be interpreted as hierarchical representations as well: going
from specific (close to the input layer) to more abstract (close to the output layer). In general, it
is hard to come up with human-understandable interpretations of these representations.

Figure 1.15 Facial data analysis with deep learning. Notice how interesting facial concepts
arise from pixels in the higher layers (honglaklee-diss-2010). Picture from
www.quora.com/What-is-deep-learning-Why-is-this-a-growing-trend-in-machine-learning-Why-not-use-SVM

Yet, in the field of image analysis, the weights associated with the layers of deep networks have
been shown to encode interpretable concepts. They can be visualized, and the lower layers
appear to encode pixels, whereas higher layers represent edges, corners and finally even concepts like facial objects (see Figure 1.15).

In 1965, presumably the first reference to such hierarchical representation learning was published
by Ivakhnenko and Lapa (ivaklapa65). Their paper describes a group method of data handling
(GMDH): a method for producing outputs by layers feeding into each other, based on the
following formula:
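Written out in plain notation (the formula appears as an image in the book; what follows is the Kolmogorov-Gabor polynomial that GMDH is commonly presented with, matching the description below):

Y(x1,…,xn) = a0 + Σi ai·xi + Σi Σj aij·xi·xj + Σi Σj Σk aijk·xi·xj·xk + …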

This scary-looking formula describes nothing but a function Y that computes a complex sum of weighted combinations (groups) of input values, with every factor a being a weight or coefficient. Notice that we have a one-dimensional weight or coefficient matrix ai, a two-dimensional one aij, a three-dimensional one aijk, and so on. The weight matrices are usually referred to as 'polynomial models'. The coefficients can be learned from training data through simple methods such as least squares (wikipedia:ls). In a picture:

Figure 1.16 A GMDH network.

In 1971, the first "deep" 8-layer network based on GMDH was presented (ivakhnenko1971).

Yet, this memorable fact did not coincide with the official launch of deep learning. The
simplistic methods for weight tuning by GMDH were not scalable to large scale training. It took
a while before more scalable weight tuning methods like backpropagation came into being:
around 1970 (see Section 2.4 of Deep learning with Python for an introduction to
backpropagation).

But even with backpropagation, neural networks suffered from various practical problems,
including the notorious vanishing gradient problem. This problem arises while training a network. The crucial ingredient of backpropagation is the stepwise minimization of the error function of a neural network, by taking partial derivatives of the error function of the network, differentiating with respect to all weights, and moving stepwise towards its minimum. Gradients are a
generalization of the one-variable partial derivative of a function. Setting a partial derivative to
zero finds a local maximum or minimum of a function. If this function is an error function that
computes the error a network makes for predicting certain output based on weights, we can look for weight adjustments that push the error function ever closer to its minimum. This procedure is
called gradient descent, and it’s the driving force behind backpropagation. Backpropagation has
built up an impressive track record, and underlies the majority of neural network results.

However, for deep and complex networks, with millions of weights, weight adjustments can
easily become too tiny to be useful: they just vanish, and gradient descent is no longer effective.
Backpropagation deploys the so-called 'chain rule' from calculus for computing the weight
adaptations per layer. The chain rule is an algorithm for computing derivatives of functions that
are applied to functions. Essentially, this is what happens when we apply activation functions to
the output of layers (which themselves apply activation functions to the output of the layers that
feed into them, and so on).

As mentioned, gradient descent uses composed derivatives by working across layers and their
respective activation functions, and makes the corresponding weight updates that move the network closer to perfection. Now, activation functions with gradients in intervals capped by small numbers (like between 0 and 1) result in weight adaptations that are small by nature, and repeated multiplication of these small numbers with the chain rule leads to thinning and, eventually, evaporation of values. This means that the weight adaptations (which are computed from the top-most layer of the network, just 'under' the output layer) never reach the layers close to the input layer, which consequently do not get trained.
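A quick numerical sketch (not from the book) of this thinning effect: repeatedly multiplying a per-layer gradient factor that is capped at 0.25 (the largest possible slope of the sigmoid function introduced below) shrinks the propagated update toward zero:

# Largest possible slope of the sigmoid activation is 0.25 (at x = 0).
capped_factor = 0.25

gradient = 1.0
for layer in range(10):        # a chain of 10 layers
    gradient *= capped_factor  # chain rule: one capped factor per layer

print(gradient)   # ~9.5e-07: the weight update has effectively vanished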

So, how did deep learning escape from this conundrum? There are at least a couple of solutions
that have alleviated the problem. The most prominent one is the use of a feature selection/data
reconstruction cycle, as put forward by Restricted Boltzmann Machines (RBMs). RBMs are complete networks that learn probability distributions from data. They can be stacked on top of each other, as layers, where every layer is a separate RBM sending its hidden layer data as input to the next layer, and not through inter-layer connections between hidden layers. This set-up allows for layer-wise training of networks, and eliminates much of the vanishing gradient problem, since gradients don't need to travel far down the network: they are confined to separate layers.

In addition to all this, a new type of activation function has become popular in Deep Learning:
the rectified linear unit (ReLU). A ReLU is a very simple non-linear function that computes the maximum of two values, one of which is the input to a neuron.

Specifically, ReLU(x) = max(0, x).

So, ReLU(x) just returns which one is bigger: zero or x. It eliminates all values for x below zero.

Now, if we apply this function to every neuron in our network, only the ones with positive values pass on their values:

ReLU(w1*x1 + w2*x2 + … + wn*xn + b)

This formula expresses the effect of ReLU applied to a sum of inputs multiplied by weights, augmented with a bias term b. Figure 1.17 shows how this works.

Figure 1.17 ReLU operations at work on the final hidden layer output of a neural net.

The ReLU function is differentiable almost everywhere except at exactly 0, and its derivative is quite simple:

ReLU'(x) = 1 if x > 0, and 0 otherwise

which has beneficial effects on the speed and scalability of the network computations during
backpropagation.

A traditional activation function is the sigmoid function:

sigmoid(x) = 1/(1 + e^(-x))

To witness the dramatic effect the choice of an activation function has on the performance of your neural
network, let’s try out an overly deep network on a small snippet of sentiment data.

Scenario: You want to train a deep network on a sentiment labeling task. The task consists
of labeling texts with sentiment labels: 1 for positive sentiment, and 0 for negative. You are
unsure about which activation function you should choose. Can you find out experimentally which option is best?


Our processing pipeline is as follows:

Figure 1.18 Labeling sentiment with a deep multilayer perceptron (MLP).

Our data, taken from PAngLeePlotTokSUBJ, looks like this: a set of sentences, labeled with either a zero (negative sentiment) or a one (positive sentiment):

Review | Sentiment score
smart and alert , thirteen conversations about one thing is a small gem | 1
color , musical bounce and warm seas lapping on island shores and just enough science to send you home thinking | 1
it is not a mass-market entertainment but an uncompromising attempt by one artist to think about another | 1
a love affair with a veterinarian who is a non-practicing jew | 1
initially reluctant to help , daniel's conscience eventually gets the better of him | 0
his plans go awry , however , when his older brother , keith , returns from prison | 0
inspired and motivated , the kids save the day , showing bravery and nobility | 0

In raw format, our data looks like this. It is tab-delimited data, with a header containing the
names of two columns:

"text"<tab>"label"
smart and alert , thirteen conversations about one thing is a small gem <tab>1
...

Our model is a ten-layer deep network, listed below. The model is trained on 90% of our training
data while keeping 10% separate for validation purposes.


Listing 1.4 A 10-layer deep MLP sentiment classifier with sigmoid activation
functions.
from keras.models import Sequential
from keras.utils import np_utils
from keras.preprocessing.text import Tokenizer
from keras.layers.core import Dense, Activation

import pandas as pd
import sys

data = pd.read_csv(sys.argv[1],sep='\t')
docs=data["text"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)

X_train = tokenizer.texts_to_matrix(docs, mode='binary')


y_train=np_utils.to_categorical(data["label"])

input_dim = X_train.shape[1]
nb_classes = y_train.shape[1]

model = Sequential()
model.add(Dense(128, input_dim=input_dim))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])

print("Training...")
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1,
shuffle=False,verbose=2)

The pandas library has many handy functions for processing comma-separated
(csv) and tab-delimited (tsv) data.
Our data is tab-delimited: sentences separated by tabs from their labels. We read
the data into a 'dataframe' (a primitive of pandas), and extract the "text" field (the
column labeled with "text" in our data) as our document set.
The numpy-based np_utils module bundled with Keras contains a function (to_categorical) for converting the integer class labels into one-hot (categorical) label vectors.


Keras has a Tokenizer facility for converting text into numerical vectors, consisting
of unique integers referring to the original words.
We apply the Tokenizer to our documents.
Next, we generate vector representations of our documents. Together, they form a
matrix. The tokenizer builds up a lexicon mapping words to integers, and generates
binary vectors of a fixed dimension, with a '1' for every word of the lexicon that occurs in the input document. Position 0 is reserved, so a '1' on position '3' means that the word with index '3' in the lexicon is in the document. An example is presented below.
We infer the input size (dimension) from our vectorized data, and the number of
classes.
The network contains 10 Dense layers (standard, fully-connected layers), and
deploys sigmoid activation functions that pass the incoming connections into each
neuron through the sigmoid function.
The output layer is a dense layer with as many neurons as the number of classes.
The softmax activation function generates output probabilities.
We compile the model.
The model is fitted on our data, using 10% of the training data as held-out data for
testing.

Here is an example of a Keras Tokenizer vectorizing text data:

>>> docs = ['smart and alert , thirteen conversations about one thing is a small
gem','not very smart movie']
>>> tok=Tokenizer()
>>> tok.fit_on_texts(docs)
>>> tok.texts_to_matrix(docs, mode='binary')
array([[0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0.,
0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1.,
1.]])
>>> tok.word_index
{'and': 1, 'a': 9, 'about': 5, 'very': 15, 'not': 14, 'conversations': 4, 'is': 8,
'one': 6, 'mart': 13, 'thing': 7, 'thirteen': 3, 'movie': 16, 'small': 10, 'alert': 2,
'gem': 11, 'smart': 12}

On our data, the model produces the following output:

Train on 1800 samples, validate on 200 samples


Epoch 1/10
2s - loss: 0.7079 - acc: 0.5078 - val_loss: 0.6937 - val_acc: 0.5200
Epoch 2/10
1s - loss: 0.6983 - acc: 0.5144 - val_loss: 0.6938 - val_acc: 0.5200
Epoch 3/10
1s - loss: 0.6984 - acc: 0.5100 - val_loss: 0.6955 - val_acc: 0.5200
Epoch 4/10
1s - loss: 0.6988 - acc: 0.5000 - val_loss: 0.6979 - val_acc: 0.5200
Epoch 5/10
1s - loss: 0.6994 - acc: 0.4922 - val_loss: 0.6994 - val_acc: 0.5200
Epoch 6/10
1s - loss: 0.6999 - acc: 0.4989 - val_loss: 0.6986 - val_acc: 0.5200
Epoch 7/10
1s - loss: 0.6999 - acc: 0.4978 - val_loss: 0.6966 - val_acc: 0.5200
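With sigmoid activations, validation accuracy stays stuck around chance level (0.52). A minimal way to answer the scenario's question experimentally is to rebuild the same architecture with a different activation function and retrain. The sketch below is not code from the book; it assumes the imports and the variables X_train, y_train, input_dim and nb_classes from Listing 1.4 are in scope:

def build_mlp(activation, input_dim, nb_classes, n_hidden=10, width=128):
    # The architecture of Listing 1.4, with a configurable activation function.
    model = Sequential()
    model.add(Dense(width, input_dim=input_dim))
    model.add(Activation(activation))
    for _ in range(n_hidden - 1):
        model.add(Dense(width))
        model.add(Activation(activation))
    model.add(Dense(nb_classes))
    model.add(Activation('softmax'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

for activation in ('sigmoid', 'relu'):
    print("Training with", activation)
    model = build_mlp(activation, input_dim, nb_classes)
    model.fit(X_train, y_train, epochs=10, batch_size=32,
              validation_split=0.1, shuffle=False, verbose=2)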

Exploring the Variety of Random
Documents with Different Content
the inducement that led the Behunins into Zion Canyon. Later they
introduced domestic varieties.

The cabin seems to have been completed late in the season after the
corn had been harvested. There was no road into Zion Canyon at that
time, but a heavy team had hauled in a plow and other necessities.
An irrigation ditch was dug and the flats cleared of vines and
rosebushes. By the next season several acres were under cultivation,
and fruit trees, cane and garden stuff had been planted.

The Behunins also owned fifteen or twenty head of cattle all broken
to work, including milk cows. They raised pigs on the surplus corn
and did their own butchering and curing. James H. Jennings (born in
1853) tells of watching them slaughter thirteen hogs one day. They
filled a shallow pool with water and heated it by dropping into it hot
rocks from a nearby bonfire. When the water was near boiling, they
dipped the hogs in the pool to scald them and loosen the hair. The
meat was cut up and salted to make old-style home-cured hams,
shoulders and bacon.
The Watchman, overlooking Springdale in Zion Canyon.
Photo by O. J. Grimes.
The Angel Landing in Zion Canyon.
Photo by U. S. National Park Service.
Mrs. Eunice Munk of Manti, who as a girl of 12 or 14 spent 157
more than a year in Springdale, recalled that in the summer of
1864 the Behunins told her that in Zion Canyon the chickens went to
roost soon after sundown, but that the twilight was so long that they
would get tired of waiting for darkness and come out again.

Within a few years, other settlers found their way into the canyon.
William Heap took up a farm across the river from the Behunins at
the mouth of Emerald Pool Canyon and built a log cabin on the west
bank of the river north of the Emerald Pool stream. He planted an
orchard and raised annual crops such as cane, corn, and garden stuff.
John Rolf built his cabins above the Behunins. A polygamist, he
needed separate dwellings for his families; one he located near the
Behunins; the other on the site of the present grotto camp.

Protection was afforded by buck or rip-gut fences from cliff to river at


each end of the farms. These were made of short poles set in the
ground and pointed in various directions so as to effectively prevent
large animals from breaking through.

Hand plows pulled by horses or oxen were used for turning the soil.
Harrows were fashioned of hewn timbers fastened together in a
triangle. Sharpened pegs of oak were fixed in holes bored in the
framework and turned downward so that they scratched the ground
when pulled by a team. The first shovels were short-handled, but
later they had what they called “lazy man’s shovels,” because the
handles were longer and a man did not have to bend his back so
much when working. The rakes were handmade of wood with sharp
pegs for teeth. Scythe and cradle were seldom used since little hay or
grain was raised.

It was about this time that the canyon received its name. The three
settlers, hardy mountaineers though they were, nevertheless were of
a devout and religious turn of mind. It seems to have been old father
Behunin who proposed the name of Zion, to which the others agreed.
Isaac Behunin had been with the Mormons ever since they left New
York. He had helped build the Temple at Kirtland, Ohio, and had at
one time acted as body-guard to the founder of the Church, Joseph
Smith. He had been through all the “drivings of the Saints” in
Missouri and Illinois and nourished the typically bitter resentment
towards the “enemies” who had been responsible for such
“atrocities.” Here in Zion he felt that at last he had reached a place of
safety where he could rest assured of no more harryings and
persecutions. No wonder he proposed the name Zion, which implies a
resting place. He went even further, maintaining that should the
Saints again be harassed by their enemies, this would become their
place of refuge.

On one of Brigham Young’s visits to Springdale, probably in 158


1870, he was told of Zion. He inquired how it came to be so
named. The explanation, it seems, was not satisfactory to the
Mormon leader after a toilsome journey into the canyon and he
questioned its propriety, saying that “it was not Zion.” Some of his
more literal-minded followers thereafter called it “Not Zion.”

The first settlers made their way into the canyon on horseback, using
the river bed, crossing and recrossing the stream. It soon became
necessary, however, to provide other means of transportation. A
wagon road was no problem through the flats in both the upper and
lower valleys, but the precipitous canyon between was baffling. It is
[59]
related that Hyrum Morris, Shunesburg settler, and a companion
were the first to enter the upper valley by means other than
horseback. They hitched a yoke of oxen to the hind wheels of a
wagon and lashed a plow and supplies on it. When they entered the
canyon, near the present site of the bridge, they climbed the west
bank over the sand bench, down into Birch Creek and thence into the
upper Zion Valley. This did not prove to be a practicable route, and
no one seems to have followed it. Today one can hardly traverse the
route on foot.

The remains of an old cart road which followed the east bank, high
up opposite the sand bench, coming out into the upper valley about
half a mile above the present Union Pacific garage, may still be
traced. This route was used for some years, but was far from
satisfactory. Other settlers from the towns below began to cultivate
tracts in the upper valley and the timber resources of the canyon
made a better road imperative. During the winter of 1864-65, a
[60]
wagon road was built up the river bed, crossing the stream many
times. This is the road which, with minor improvements, served as
the main highway into Zion until the National Park Service built the
road that first made it fully accessible to the public. This road in turn
served until 1930, when the present well-graded highway was
constructed midway between the river and the older road that it
replaced.

It was while constructing this first wagon road on January 9, 1865,


that George Ayers was killed. A short dugway was being graded on
the slope above the river. With no blasting powder, the men were
excavating a large boulder, and George Ayers and Orson Taylor had
stopped to rest in the shade of the huge stone, rolling cigarettes.
Suddenly it began to move. A shout of warning came from Samuel
Wittwer and Heber Ayers. Taylor was able to scramble out of the way
but Ayers was squarely in its path and it fell directly upon him,
crushing and killing him instantly—the first victim of Zion Canyon.

Indian troubles, treated in a later chapter, broke out in the spring of


1866. Martial law was declared and instructions were issued 159
from the military headquarters for the settlers to concentrate
in towns of at least 150 families. It was at first decided to gather all
the settlers of the Upper Virgin River at Rockville and Toquerville, and
[61]
later, at Virgin. James Jepson recalls that his father had just
moved his cabin from Virgin to Rockville when the revised decision
reached him and he moved back to Virgin again. This was the fifth
time the cabin had been moved and his father dryly remarked that it
was so used to the process now that all he had to do was throw the
logs into the yard and they would fit themselves together.
This concentration order meant the abandonment of all smaller
places: Duncan, Grafton, Northrop, Shunesburg, Springdale and Zion.
Those who could not buy or rent a house simply dumped their
belongings in the shade and set up housekeeping under the trees.
Some moved their log houses with them, others made dugouts and
still others built new log houses.

Although the outlying towns had been abandoned, the crops had
been planted and had to be tended. Workers went in armed groups
of ten, twenty, or thirty to the fields, usually remaining during the
week in the more distant places and returning to Rockville on Sunday.
In Zion, headquarters were at the Behunin cabin, where eight or ten
men usually camped while working the crops. In Springdale, they
usually stayed at Albert Petty’s home or nearby. Petty himself refused
to abandon his ranch and stayed there throughout the Indian scare.

This concentration continued through 1867, but with the close of the
“Black Hawk War” and the subsidence of troubles with the Paiutes,
there was a general reoccupation of the villages except Long Valley
and Kanab, abandoned during the Indian troubles. This occasioned
some shifts in the population; some returned to their former homes
and lands, some stayed where they were, and a few moved
elsewhere. Springdale was reoccupied by Albert Petty and several
other families, but it did not regain its former size until about 1874.
Shunesburg and Grafton also seem to have lost slightly in population
in the reshuffling. To Zion, however, the same settlers, Behunin,
Heaps and Rolfs, returned and took up their usual tasks of raising
crops and tending livestock.

The following years were prosperous and the settlements were


greatly strengthened. Markets were established in northern Utah, at
mining camps in southern Nevada, and even in California. Commerce
was restricted because of transportation difficulties but it was an
important factor in helping to balance needs by exchange of livestock
products and cotton for goods the settlers could not produce.
The national financial panic of 1873 gradually worked its 160
paralyzing effects into Utah and spread to the southern Utah
colonies. The repercussions were not marked on the Virgin River, but
Brigham Young was gravely concerned about the more general
conditions in Utah when he came to St. George to spend his second
winter there (November, 1873).

During the spring of 1874 he initiated near St. George a communistic
movement known as the United Order. An experiment of similar
nature had been tried by the Mormons in Missouri more than thirty
years earlier. It was an attempt to combine cooperatively the efforts
of the Saints, so that all might share the benefits in accordance with
their needs. The scheme had been inaugurated in a few communities
with encouraging success and it was now proposed to launch it on a
large scale.[62] In February, St. George, Price City, and Washington
were all organized on this basis. Price City, near St. George, is
reported as the first working farm community in which the combined
farms were managed as a unit and the farmers lived as one large
family. Some of the men were assigned to irrigation, some to raising
hay, corn, cane, garden stuff, or other crops according to the
estimated needs of the community. Fuel was supplied seasonably by
hauling wood from the scrub forests of the hills. The women were
assigned as cooks, dish washers, baby tenders, clothes makers, and
nurses. In Price and a few other places the settlers ate together, but
in most of the communities each family had its own home.

On March 5, Brigham Young visited Virgin and Rockville and
organized the United Order. During March and April, nearly all the
settlements in Dixie fell in line and a confederation of all the
settlements, known as the St. George Stake United Order, was set up
to correlate the activities of the individual settlements.

During late April and early May, Brigham Young moved northward to
Salt Lake City, initiating the novel movement in many of the towns
along the way, including Cedar, Parowan, Beaver, Fillmore and Nephi.
Upon reaching Salt Lake City, a general agency to correlate the stake
activities, known as the “United Order for all the World,” was
established.

The movement enjoyed only a brief period of prosperity. The utopian
ideal encountered difficulties when it came to the division of benefits.
Wages were assigned to each person and the benefits drawn were
charged against each family. It was soon alleged that some were
drawing more than their wages entitled them to, whereas others
were not getting all that they felt they should.

In a few favored communities having access to large natural
resources, such as abundant pasturage for livestock, either dairy or
range, the produce was more than sufficient to pay the wages
assigned and to build up a surplus capital. In most, however,
where agriculture was the chief dependence and products of
the farms were insufficient to pay wages, stinting was inevitable.
Under such conditions, there was general dissatisfaction with the
cooperative scheme and more progressive individuals sought to
withdraw.

Many settlements abandoned the experiment at the end of the first
season. Such was the case at Rockville, but Shunesburg and
Springdale held on for another year, through 1875. A new ditch on
higher ground was dug at Springdale at this time and the town was
moved to its present location, half a mile north of its old site. This
offered opportunity for expansion. The Gifford family from
Shunesburg and several families from Rockville came to join the
United Order in 1875.

It seems to have been the United Order movement which
depopulated Zion Canyon. In 1872, Isaac Behunin, getting old, sold
out to William Heap for two hundred bushels of corn, and moved to
Mt. Carmel, then beginning to be resettled, where he later died. With
the establishment of the United Order in Rockville in 1874, Heap and
Rolf joined, turning over their property in Zion to the corporation.
With the collapse of the Order in the fall, they withdrew their share of
the proceeds and moved to Bear Lake and later to Star Valley,
Wyoming.

These families were the last to live regularly in Zion Canyon. Farming,
however, was still carried on by settlers living in the village below.
Such names as Dennett, Gifford, Petty, Russell, Terry, Dalton,
Crawford, Stout and Flanigan, recur as farmers in Zion in the
following years. Oliver D. Gifford, long-time bishop of Springdale,
related that about 1880 he was farming the land south of the river
and west of the Great White Throne at the site of the grotto camp
when the Great Red Arch fell out, the rock pulverizing and covering
up a spring and large pine trees.

The Cable
Zion, since its discovery, had been regarded as a blind canyon. Even
the Indian name, I-oo-goon (canyon like an arrow quiver), reflects
this idea. Lee, Smith, Steele and their companions, who left Parowan
on June 12, 1852, and explored the head-waters of the Sevier and
Virgin, were balked by the Vermillion cliffs nearby, if not in Zion, in
contemplating the possibilities of a road from the mountains to the
lower valleys.

The early settlers of Rockville, needing timber from the mountains,
had explored the probabilities of a road through the canyon without
success. James H. Jennings recalls hearing Elijah Newman, an early
settler of Rockville, tell that a team could be driven from the head of
Parowan Canyon over Cedar Mountain to the rim of Zion Canyon and
that he believed that some day a way would be found through
the cliffs so that timber could be hauled down from the
mountain.

Brigham Young himself had encouraged the idea on one of his trips
to the upper Virgin, probably in 1863. It is related that when the
settlers were bewailing the lack of timber for flooring their cabins, he
stated in public meeting that the day would come when hundreds of
thousands of feet of lumber would be brought down the canyon. The
wise ones shook their heads and remarked that their leader had
missed it this time.

During the early seventies the mail from St. George to Kanab was
routed via Shunesburg. It was lifted over the 1,500 foot cliff at the
head of Shunesburg canyon[63] on wires arranged in such a way that
the man at the top bringing the mail from Kanab could exchange with
the man below, carrying it down the river. This short-cut saved a
day’s travel over the road via Pipe Springs and the Arizona Strip.

It was not until the new century had dawned, however, that young
David Flanigan, who was but a small boy in the days of the
Shunesburg wire-pulley apparatus, conceived the idea of lowering
lumber over the cliffs by means of cables. As a lad of 15, in the
spring of 1888, he and three other boys hunting on the East Rim of
Zion had seen a large grove of yellow pine sawtimber and had stood
at the top of the cliff later known as Cable Mountain, where the
precipice appeared to reach almost to the floor of the canyon. The
problem of lumber remained as acute then as it was in the days of
his parents.

Ten years later, needing lumber for himself, he was forced to go to
the Trumbull or Kaibab mountains, a trip requiring a full week. The
advantage of obtaining lumber nearer home on the rim of Zion thus
impressed itself upon him. He advocated the idea of lowering it on a
cable but found no supporters. Convinced of the practical wisdom of
the idea, he undertook the work alone in 1900. He bought 50,000
feet of wire and stretched it around pulleys and drums at top and
bottom making a circular series, half of which had three wires, and
half five. The five-ply half was to run around the drum where the
weight would be greatest.
After two or three years of experimenting with pulleys and wire
tension, he finally perfected his device. To cut the timber, he bought
an old saw-mill in the summer or fall of 1904. Thus Flanigan’s
ingenuity made it possible for vast quantities of lumber to go down
Zion Canyon. By Christmas, 1906, two hundred thousand feet of
sawed lumber had actually been lowered on his cable.

It was during the summer of 1906 that the writer first visited Zion
Canyon. From Orderville he faced a trip by horse to his home in St.
George. Rather than retrace his steps by way of Kanab and
Pipe Springs, he inquired about the short-cut via Zion and was
directed cross-country to the head of the old East Rim trail, which
John Winder had recently made barely passable. Belated, he reached
the top of the trail at sundown, and familiar as he was with
mountaineering, started unhesitatingly on the 3,000 foot descent.
Darkness found him well along the trail which grew increasingly hard
to follow. About half way down, he met Winder and some cowboys
coming up. They described the trail and said they had just lost a pack
horse which had rolled over the cliffs below and they had been
delayed in retrieving the pack.

A full moon was just coming up over the cliffs at his back, throwing
light into the dark recesses along the way. With its help the steep,
dangerous places on the trail were negotiated without difficulty. The
dead pack horse was found beside the trail and the bottom of the
canyon was reached about 11 p.m., when the full moon was shining
into the depths of the canyon and towers and temples were illumined
with snowy brightness. Camping there, he slept out the night in his
saddle blankets under the enchanting witchery of the starry heavens
between the brilliant walls that seemed to reach the sky.

In 1906 Flanigan sold out to Alfred P. Stout and O. D. Gifford.[64]
They replaced his wire cable with a heavier twisted rope cable, which
served for many years and over which millions of feet of lumber were
lowered into the canyon and hauled to the settlements farther
downstream. Stout established a shingle mill in Zion Canyon about a
quarter of a mile below the cable and near the foot of the northeast
corner of the Great White Throne. At first, large cottonwood logs
were cut for shingles, but as these proved of inferior quality, yellow
pine logs were supplied via cable. The shingle mill was washed away
by floods two years later.

A sad accident occurred at the top of the cable on July 28, 1908. A
party of young people vacationing on the east rim went over to see
the cable operate from the top. Three of them were standing in the
box at the edge, directly under the cable, looking into the depths of
the canyon, when a bolt of lightning struck the cable, killing Thornton
Hepworth, Jr., and stunning Clarinda Langston and Lionel Stout. Miss
Langston fell limp on the edge of the box where she was in imminent
danger of plummeting down the cliff. Miss Elza Stout, uninjured
nearby, rescued her from the precarious position, but before
assistance could be rendered him, a second bolt struck the wire and
killed Lionel Stout. Miss Langston recovered, but the bodies of the
two boys were lowered into the canyon over the cable.

It was more than a year later when people started to “ride the cable.”
About the middle of September, 1910, soon after Zion had
been proclaimed a national monument, some members of
Scott P. Stewart’s surveying party visited the top of the cable. They
were told that a dog had been sent up from below and that he was
nearly crazy when he reached the top. Quinby Stewart, a fearless
youth, told them that if they would bring some watermelons up to
the foot, he would go down on the cable and help eat them. True to
his word, when the melons arrived he climbed on a load of lumber
ready to be lowered, and holding to the cable, rode safely to the
bottom. It was a swift flight of two minutes, and to a young man of
his disposition a rousing thrill. Others followed suit, and after eating
the melons, rode back to the top in the empty cage.

Riding the cable proved an attraction for those gifted with strong
nerves. At a later date, Frank Petty came to operate the sawmill at
the top of the cliff. He was a large man, weighing nearly 300 pounds,
too heavy to travel comfortably up and down the trails, and the road
around the Arizona strip to his home in Rockville being too long for
convenience, he took to riding the cable. On one occasion, as he
started down, the lumber on which he was riding struck the top of
the cliff and loosened the chain holding one end of the load. His son,
Frank, operating the brakes at the top, seeing the mishap, applied
the brakes just in time to prevent his father from falling down the
face of the 1,800 foot declivity. With a few inches of the lumber still
clinging to the edge and his father paralyzed with fear and afraid to
move for fear of jarring it loose, Frank climbed underneath and re-
fastened the chain. With a sigh of relief, he then lowered his father in
safety to the bottom.

Kane County and Arizona
Just as the settlement of Iron County had provided a stepping stone
to the exploration and settlement of the Virgin River Valley, so in
turn, the latter served in like stead in opening up Kane County and
the Kaibab National Forest of northern Arizona.

In the fall of 1858, after Albert Sydney Johnston’s army had entered
Utah, Brigham Young, still doubtful about the future, instructed
missionaries under Jacob Hamblin’s leadership to cross the Colorado
River to the southeast and visit the Moquis or Town Indians with the
object of exploring the possibilities of retreating with his people to
this region should the difficulties with the army become unbearable.[65]

Accordingly, Jacob Hamblin, one of the leading figures in Utah’s
southern frontier, left the Santa Clara on October 28, 1858, with a
party of twelve, including an Indian guide, a Spanish interpreter, and
a Welsh interpreter, the last because of wildly erroneous reports that
the Moquis spoke a variant of that tongue. The Indian led them
through the Arizona strip via Pipe Springs and Kaibab to the old Ute
ford where Escalante had crossed eighty-two years earlier. The visit
to the Moquis was brief, some of the men returning in November, the
others later in the winter. This expedition revealed the general
topography between the Virgin and Colorado.

[Photograph captions:
Zion Lodge in 1925; Angel Landing in the distance.
Riding the cable in Zion Canyon in 1912. Courtesy D. D. Rust.
Railroader J. H. Manderfield, Warden Arthur Pratt and Engineer W. D. Beers, about 1916 on a business trip to see about road development, stop to take a cooling drink from Pipe Spring, where Gunlock Bill Hamblin shot the bottom out of the bowl of a tobacco pipe. Once the center of a sea of grass and a strategic fort against the Indian menace; now an historical monument. Photo by courtesy of D. D. Rust.
A Salt Lake Chamber of Commerce scouting party of 1915 stopped at Kanab. Rust piloted them from Hurricane to Kanab in one long day. Photo by courtesy of D. D. Rust.]

A second missionary expedition to the Moquis was undertaken
in the fall of 1859, leaving the Santa Clara headquarters on
October 20 and reaching the Moquis on November 6. Hamblin
appointed Marion J. Shelton and Thales Haskell to remain there for
the winter and returned home with the rest of the party.[66]
Friendship with the Moquis was cultivated by the missionaries, but
this seems to have led to difficulties with the Navajo.[67]

When Jacob Hamblin led a third expedition across the Colorado River
to reach the Moquis in the fall of 1860, he was met by a band of
unfriendly Navajos who would not let the missionaries proceed and
debated whether to kill them or let them go home. With the Mormons
were several Indians, including two squaws. The Navajos offered to
let the party go in peace if they would leave the squaws. This
Hamblin refused to do, and an agreement was finally reached
whereby the missionaries were allowed to return home in exchange
for goods and ammunition.[68]

They camped that night on a table-rock mesa where there was only a
narrow passageway which was carefully guarded. Next morning,
November 2, 1860, while some were exchanging goods with the
Navajos, others took the horses down to water. As they were
returning, the saddle horse of George A. Smith, Jr., started off on a
side trail and he went after it alone. He found two Indians leading his
horse away. The horse was readily turned over and Smith started
back to camp. One of the Indians rode up alongside Smith and asked
to see his revolver. Suspecting nothing, Smith handed it over. The
Indian, after examining it, passed it back to the other Indian a few
paces behind, who shot Smith three times. As he fell from his horse,
the Indians dismounted and shot three arrows into his back.

The Indians then blockaded the trail to the Moquis towns, forcing the
Mormons to retreat. The retreating party placed the dying man on a mule and started
homeward with the Navajos in hot pursuit. Traveling thus, it was
nearly dark before Smith died. His body had to be abandoned as the
Navajos seemed unwilling to give up the chase until they had taken
his scalp. The balance of the party returned home safely.

Reporting the loss of George A. Smith, Jr., was a sorrowful
duty for Hamblin. The young man’s father was deeply
shocked, but like a good Saint, consoled himself with the thought
that the Lord wished his son taken that way. Brigham Young sent
instructions for a company of twenty men to retrieve the remains.
Despite the hardships of mid-winter, they gathered up the few bones
that were left of Smith’s body and returned with them for interment.

Several other trips to the Moquis by different routes resulted in
detailed knowledge of northern Arizona and southern Utah. Crossings
of the Colorado were explored thoroughly and ferries were
established at the south of the Virgin, at the mouth of the Grand
Wash (1862) and at the foot of Grand Wash Cliffs about five miles
upstream (Pearce’s Ferry, 1863). These supplemented the old Ute
ford in Glen Canyon. Further exploration did not reveal a more direct
route until 1869, when the crossing later known as Lee’s Ferry was
discovered. These routes were so well explored that no better ones
have been discovered since.

Stockmen began to graze their herds of cattle and sheep on the
plains of the Arizona strip. Some time prior to 1863, W. B. Maxwell
established a ranch at Short Creek; not long after, James M.
Whitmore located ranches at Pipe Springs and Moccasin, and Ezra
Strong of Rockville settled on Kanab Creek. In the spring of 1864,
several ranches were established in the mountains and two
settlements were started, one at the present site of Kanab, where a
small fort was built, and another housing eight families at Berryville
(later Glendale) in the north end of Long Valley. In the fall, Priddy
Meeks located in the south end of the valley. He was joined the next
spring (1865) by several settlers from the Virgin River,[69] who
brought livestock for the range and nursery stock for orchards. The
new settlement was called Winsor (later Mt. Carmel).

In the autumn, with indications of an impending rift between whites
and Indians, the Winsor settlers moved to Berryville and helped build
a stockade for protection during the winter. In the spring they
returned and planted crops, but during the summer settlement was
again interrupted by Indian difficulties and had to be abandoned.

Indian Troubles
The period following early settlement was marked by Indian troubles
with both Paiutes and Navajos. These are sometimes called the
Navajo raids, and in part were an outgrowth of the “Black Hawk War”
which broke out in Sevier Valley, central Utah, in 1865. The whites
had brought with them their livestock, which they grazed upon the
public domain, turning the cattle and horses loose and herding the
sheep. These animals multiplied rapidly and quickly depleted the
edible fruits and seeds upon which the Indians subsisted. Indian
resentment not unnaturally was inflamed, and with starvation staring
them in the face, there was little left for them to do but beg or steal.

The Indians had claimed the lands, the vegetation and the wild
game, and although they had given the first white men permission to
come, yet so many others had followed, like the proverbial camel’s
nose, that they were destroying the means of subsistence of the
Indians. Not only were seeds and fruits being eaten by the livestock,
but game also was getting scarce and hard to find, due largely to
encroachment of cattle and sheep which were taking the place of
deer upon the range. The white man hunted the Indians’ deer so why
should not the Indian hunt the white man’s cattle? There was some
compensation to the Indians, however; they could glean in the grain
fields of the settlers and gather waste grain as easily as they could
seeds, and pine nut crops were uninjured by the whites.

Gradually, friendly feelings of the Indians for the settlers began to
deteriorate. Begging in the settlements and the depredations on the
range increased. The Paiutes in some instances aided and abetted
the raiding Navajos, but the majority sided with the whites. The
Navajos were wont to cross the Colorado, scatter into small bands,
make swift raids on the Mormon settlements, gather up horses, cattle
and sheep, and flee back across the river before they could be
overtaken.

From the beginning a military force had been held in readiness
against any emergency. As the southern Utah settlements expanded,
improvements in this organization became advisable. In May, 1864,
the Iron Military District was recast to include Beaver, Iron,
Washington and Kane counties and William H. Dame of Parowan was
named adjutant. Nearly all the eligible men were enrolled and
companies of fifty were organized in towns wherever that many were
available. Companies consisted of five platoons of ten men each, the
first platoon of each company often being cavalry, the balance
infantry. Three companies made a battalion and about seven
battalions made a brigade. The men were occasionally called
together for inspection and drill and sometimes these included
battalion or brigade reviews. Training was emphasized during
the Indian troubles between 1865 and 1869. On February 17,
1866, Erastus Snow, the Mormon leader at St. George, was elected
Brigadier General and brigade headquarters were transferred from
Parowan to St. George.

The Black Hawk War broke out in 1865 and was not settled until
1868. Nearly 3,000 men were enlisted and the cost was over a million
dollars and at least seventy lives. This Ute unrest was contagious,
and the Paiutes in turn were stirred into sporadic resistance.

Hostilities in the south began late in 1865, when, on December 18, a
number of Paiutes raided Kanab and made away with some horses.
During that winter Dr. James M. Whitmore and his son-in-law, Robert
McIntyre, were herding sheep in the vicinity of Pipe Springs. Soon
after the first of the new year, a band of Navajos and Paiutes stole a
herd of Whitmore’s sheep. The next day the two men went in pursuit
and failed to return. This was reported to St. George and a cavalry
detachment was organized under Captain David H. Cannon. As his
force appeared inadequate, he sent an appeal from Pipe Springs for
additional support. D. D. McArthur came from St. George to take