MEAP Edition
Manning Early Access Program
Deep Learning for Natural Language Processing
Version 7
—Stephan Raaijmakers
Language comes naturally to humans, but is traditionally hard to grasp for computers. This book
addresses the application of recent and cutting-edge deep learning techniques to automated
language analysis. Deep learning has emerged in the last decade as the vehicle of the latest wave
in AI. Its results have consistently redefined the state of the art for a plethora of data analysis tasks
in a variety of domains. For an increasing number of deep learning algorithms,
better-than-human (human-parity or superhuman) performance has been reported: for instance,
speech recognition in noisy conditions, and medical diagnosis based on images. Current deep
learning-based natural language processing (NLP) outperforms all pre-existing approaches by a
large margin. What exactly makes deep learning so well suited to these intricate analysis tasks, in
particular language processing? This chapter presents some of the background necessary for
answering this question. We will guide you through a selection of important topics in machine
learning for NLP.
What you will get from this book is a thorough introduction to deep learning applied to a variety
of language analysis tasks, supported by hands-on code. Explicitly linking the evergreens
of computational linguistics, such as part-of-speech tagging or question answering, to deep
learning will help you become a proficient deep learning NLP expert. Beyond this, the book
covers state-of-the-art approaches to challenging new problems, such as explainable AI and the
role of semantics.
You will find a succinct overview of NLP in Appendix 1. Let's start off by discussing a few
well-known machine learning-based NLP algorithms in some detail, illustrated with a handful of
practical examples to whet your appetite. After that, we motivate the case for deep learning-based
NLP.
Current natural language processing relies heavily on machine learning. Machine learning has its
roots in statistics, building among others on the seminal work by Thomas Bayes and Laplace in
the 18th and 19th centuries (Bayes, Laplace) and the least squares method for curve
approximation by Legendre in 1812 (legendre). The field of neural computing started with the
work of McCulloch and Pitts in 1943, who put forward a formal theory (and logical calculus) of
neural networks (mccullochpits43). It would take until 1950 before learning machines were
proposed by Alan Turing (Turing).
All machine learning algorithms that perform classification (labeling) share a single goal: to
arrive at linear separability of data that is labeled with classes: labels that indicate a (usually
exclusive) category a data point belongs to. Data points presented to a machine learning
algorithm typically consist of vector representations of descriptive traits. These representations
constitute a so-called input space. The subsequent processing, manipulation and abstraction of
the input space during the learning stage of a self-learning algorithm yields a feature space.
Some of this processing can be done external to the algorithm: raw data can already be converted
to features as part of a pre-processing stage, which technically creates an input space consisting
of features. The output space consists of class labels that separate the various data points in a
dataset, based on the class boundaries. The essence of deep learning, as we will see, is to learn
abstract representations in the feature space.
Training a machine learning component involves learning boundaries between classes, which
may depend on quite complex functions. The burden of learning class separability can be
alleviated by smart feature pre-processing. Learning the class boundaries occurs by performing
implicit or explicit transformations on linearly inseparable input spaces. The following figure
shows a non-linear class boundary: a line separating objects in two classes that cannot be
modeled by a linear function f(x) = ax + b. The function corresponding to this line is a non-linear
classifier. A real-world example would be a bowl of multi-colored marbles, mixed in such a way
that they cannot be separated from each other by means of a straight plate (like a scoop).
A function that separates the classes with a straight line, on the other hand, is a linear classifier,
and would produce a picture in which the class boundary is straight.
We will now briefly address three types of machine learning approaches that have had major
uptake in NLP: the perceptron, support vector machines, and memory-based learning.
While there is a lot more to the story, these three types embody, respectively, the neural (or
cognitive), eager, and lazy types of machine learning. All of these approaches relate naturally to
the deep learning approach to natural language analysis, which is the main topic of this book.
Suppose you have a vector of features that describe aspects of a certain object of interest, like the
words in a document, and you want to create a function from these features to a binary label (for
instance, you want to decide if the document conveys a positive or a negative sentiment). The
single-layer perceptron is capable of doing this. It produces a binary output y (0 or 1) from a
weighted combination of input values x1…xn, based on a threshold θ and a bias b:

y = 1 if Σi wi·xi + b > θ, and y = 0 otherwise.

The weights w1,…,wn are learned from annotated training data, consisting of input vectors labeled
with output labels. The thresholded unit is called a neuron; it receives the summed and weighted
input v. So, assume we have weights (3, 5, 7) and associated inputs (10, 20, 30). The summed
and weighted input then becomes

v = 3·10 + 5·20 + 7·30 = 340
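As a minimal sketch of this computation (the weights, inputs, bias and threshold below are arbitrary illustration values, not taken from the book), a single perceptron unit can be written as:

import numpy as np

def perceptron_output(x, w, b, theta):
    # Return 1 if the summed, weighted and biased input exceeds the threshold, else 0.
    v = np.dot(w, x) + b
    return 1 if v > theta else 0

w = np.array([3.0, 5.0, 7.0])      # example weights
x = np.array([10.0, 20.0, 30.0])   # example inputs
print(np.dot(w, x))                                   # 3*10 + 5*20 + 7*30 = 340.0
print(perceptron_output(x, w, b=0.0, theta=100.0))    # 1, since 340 > 100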
This simplistic network is able to learn a specific set of functions that address the class of
linearly separable problems: problems that are separable in input space with a linear function.
Usually, these are the easier problems in classification. It is quite common for data to be heavily
entangled. Consider undoing a knot in two separate ropes. Some knots are easy and can be
undone in one step. Other knots need many more steps. This is the business of machine learning
algorithms: undoing the intertwining of data objects living in different classes.
For NLP, the single-layer perceptron nowadays plays a marginal role, but it underlies several
derived algorithms that strive for simplicity, such as online learning (ONLINE).
A practical example of a perceptron classifier is the following. We set out to build a document
classifier that categorizes raw texts as being broadly about either atheism or medical topics. The
popular 20 newsgroups dataset (20newsgroups), one of the most widely used datasets for
building and evaluating document classifiers, consists of newsgroup (usenet) texts distributed
over 20 hand-assigned topics. Here is what we do: we vectorize the raw texts into word-count
vectors, re-weight those counts with tf-idf, train a perceptron on the training portion, and apply it
to the held-out test documents. Don't worry if you don't completely understand the code right
now; it's here to give you an idea of what the code for a basic perceptron looks like.
perceptron = perceptron.Perceptron(max_iter=100)   # assumes the perceptron module was imported from sklearn.linear_model
perceptron.fit(X_train_tfidf, train.target)        # X_train_tfidf: tf-idf vectors of the training documents
X_test_counts = cv.transform(test_docs)            # cv: the count vectorizer fitted on the training documents
X_test_tfidf = tfidf_tf.transform(X_test_counts)   # tfidf_tf: the fitted tf-idf transformer
pred = perceptron.predict(X_test_tfidf)            # predicted topic labels for the test documents
Apparently, these few short texts can be linearly separated by a simple, weight-based algorithm.
This example is a huge simplification: the topics chosen are quite distinct. In real life, linear
algorithms fall short in separating topics that overlap and share similar vocabulary.
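For completeness, here is a hedged, self-contained sketch of the full pipeline behind the snippet above, using scikit-learn's 20 newsgroups loader; the alt.atheism and sci.med categories stand in for the two topics, and the exact preprocessing in the book may differ.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

categories = ['alt.atheism', 'sci.med']            # assumed stand-ins for the two topics
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

cv = CountVectorizer()                             # raw term counts
X_train_counts = cv.fit_transform(train.data)
tfidf_tf = TfidfTransformer()                      # re-weight counts with tf-idf
X_train_tfidf = tfidf_tf.fit_transform(X_train_counts)

perceptron = Perceptron(max_iter=100)
perceptron.fit(X_train_tfidf, train.target)

X_test_counts = cv.transform(test.data)
X_test_tfidf = tfidf_tf.transform(X_test_counts)
pred = perceptron.predict(X_test_tfidf)
print(accuracy_score(test.target, pred))           # fraction of correctly labeled test documents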
The multilayer perceptron generalizes the single-layer model of the original perceptron to a
multilayer model, with at least three layers: an input layer, one or more hidden representational
layers, and an output layer.
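A minimal sketch of such a network in Keras (the layer sizes and input dimension here are arbitrary, not taken from the book) could look like:

from keras.models import Sequential
from keras.layers.core import Dense, Activation

model = Sequential()
model.add(Dense(32, input_dim=100))   # one hidden representational layer (sizes are arbitrary)
model.add(Activation('relu'))
model.add(Dense(1))                   # output layer for a binary decision
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])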
A support vector machine is a binary classifier that implicitly maps data in feature space to
higher dimensions in which data becomes separable by a linear plane, called a hyperplane. This
mapping is implicit, and is carried out by a kernel function. This is a function that transforms the
original input space to an alternative representation that implicitly has a higher dimensionality,
with the aim of disentangling the data and making it linearly separable.
But the mapping is implicit in the sense that it takes the form of a similarity function (φ in the
picture above) applied to two feature vectors, effectively just computing their similarity. This is
commonly called the kernel trick. It sounds like sheer magic, but it is actually quite simple. Let's
take a look.
You should already be familiar with the dot product of two vectors. If not, please see Appendix 2
for a refresher. To recap, the standard dot product of two vectors a and b is the sum of the
element-wise products of the two vectors:

a · b = a1·b1 + a2·b2 + … + an·bn

So, a dot product is just a multiplicative operation on two vectors that produces a single number.
Kernels are generalizations of this dot product between vectors: they compute the dot product
between altered versions of these vectors. The nature of alteration is specified by a kernel
function φ . Generally speaking, a kernel function takes two vectors, mixes in a constant (a kernel
parameter) and adds some kernel-specific ingredients to produce a specific form of a dot product
of the two vectors.
Let's return to our orange and apple. The objects are described by pairs of coordinates (x, y),
since the table they're lying on is a flat XY-plane. Like other types of kernels, the so-called
polynomial kernel maps lower-dimensional spaces to higher-dimensional ones. You may recall
from high school math that a polynomial function produces a value using addition, subtraction,
multiplication or positive exponentiation only, like y = 4x^2 + 10. Polynomial kernels work on two
input values (vectors of numeric values) and usually a constant. They compute a result using a
polynomial.
For instance, a simple quadratic kernel K, using a constant c, and addressing the two
two-dimensional XY-vectors describing our orange and apple,

x = (x1, x2)

and

y = (y1, y2)

looks like:

K(x, y) = (x^T·y + c)^2 = (x1·y1 + x2·y2 + c)^2
        = x1^2·y1^2 + x2^2·y2^2 + 2·x1·x2·y1·y2 + 2c·x1·y1 + 2c·x2·y2 + c^2
Notice the superscript T? That is vector transposition (see Appendix B), necessary for vector
multiplication. What does the kernel do? It computes a product between two vectors. This
product is a number expressing a relation between the two input vectors. But the tedious
expansion of this kernel shows that we actually operate in a six- (and not even a three-!)
dimensional space: just count the terms separated by the plus signs in the result; we have six
such terms. In this case, the kernel function K implicitly computes its dot product between the
following vectors:

φ(x) = (x1^2, x2^2, √2·x1·x2, √(2c)·x1, √(2c)·x2, c)
φ(y) = (y1^2, y2^2, √2·y1·y2, √(2c)·y1, √(2c)·y2, c)

since the result is exactly the dot product of these two vectors. But it never explicitly created these vectors.
The whole point of this kernel trick is that, hopefully, in the higher-dimensional space that is the
playground of the kernel, things become more easily separable than in the entangled input space.
Kernels do not explicitly represent this space; they implicitly work in it. You can imagine that,
for long vectors and large exponents of the polynomial, this kernel trick quickly becomes a
practical advantage!
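To see that nothing magical is going on, here is a small numeric check (with arbitrarily chosen vectors and constant) that the quadratic kernel from above equals an ordinary dot product in the six-dimensional space it implicitly works in:

import numpy as np

def phi(v, c):
    # Explicit six-dimensional feature map matching the quadratic kernel.
    v1, v2 = v
    return np.array([v1**2, v2**2,
                     np.sqrt(2) * v1 * v2,
                     np.sqrt(2 * c) * v1,
                     np.sqrt(2 * c) * v2,
                     c])

def quadratic_kernel(x, y, c):
    # Implicit computation: the six-dimensional vectors are never built.
    return (np.dot(x, y) + c) ** 2

x = np.array([1.0, 2.0])    # arbitrary two-dimensional points
y = np.array([3.0, 4.0])
c = 1.0
print(quadratic_kernel(x, y, c))      # 144.0
print(np.dot(phi(x, c), phi(y, c)))   # 144.0 as well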
In the transformed space created by the kernel trick, the two classes are ideally separated with
maximally wide boundaries (called maximum margins). The data points that determine these
boundaries are called support vectors.
Figure 1.12 Maximum margins of an SVM. The support vectors are the points on the
dashed lines.
Learning weights that optimize the margins with the least error (as measured on some held-out
test data) is the job an SVM has to solve during training. After training, the support vectors and
the various weights plus biases constitute the model. New input is projected onto the support
vectors, and depending on which side of the decision boundary it lands, it receives a positive or
negative label (recall that SVMs are binary classifiers). So, SVMs throw away a lot of their
training data, keeping only some of it: the support vectors. They can be called eager forms of
machine learning.
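In practice you rarely implement any of this yourself. A hedged sketch using scikit-learn's SVC with a polynomial kernel on toy two-dimensional data (the data and parameters are invented for illustration) looks like:

import numpy as np
from sklearn.svm import SVC

# Toy, linearly inseparable 2-D data: class 1 sits inside a ring of class 0 points.
X = np.array([[0, 0], [0.5, 0.5], [-0.5, 0.5],
              [3, 0], [0, 3], [-3, 0], [0, -3], [2, 2]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])

svm = SVC(kernel='poly', degree=2, coef0=1.0, gamma=1.0)   # quadratic kernel with constant c = 1
svm.fit(X, y)

print(svm.support_vectors_)                    # the training points kept as support vectors
print(svm.predict([[0.2, -0.1], [4, 4]]))      # e.g. [1 0]: an inner point and a far-away point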
The connection between kernel-based machine learning and neural network-based learning is
briefly discussed in the Deep Learning with Python book. It is based on the intuition that hidden
layers in neural networks act like kernels in disentangling linearly inseparable input data
layer-by-layer, step-by-step.
A well-known distance function is the IB1 metric, a simplified version of which is:
def delta(x, y):
    # 0 if the two (symbolic) feature values are identical, 1 otherwise
    if x == y:
        return 0
    else:
        return 1
This metric computes the distance between two feature vectors on the basis of feature value
overlap: exact similarity for symbolic (non-numerical) values. Most MBL algorithms extend such
distance metrics with feature weighting (e.g. information gain-based weighting) or exemplar
weighting (REF). They partition the search space for matching into sets consisting of training
items with the same distance to the current test item. For instance, sets of distances d1, d2, … can
be found first, after which the algorithm computes the most frequent class in those sets. It then
votes over all classes to determine the most probable label for the test item. The k-parameter
specifies the number of distance sets to take into account, which is why MBL is often k-nearest
distances classification rather than k-nearest neighbor classification.

MBL has interesting advantages for NLP. Keeping all original training data available for
classification allows handling exceptions in language. For instance, in certain languages,
morphological operations on words, such as diminutive formation, can be arranged in pockets of
exceptions: small families of subregularities. In Dutch, for instance, we encounter gat → gaatje
('hole').
Notice the extra vowel in the diminutive form, and the diminutive suffix -je. We also have a
subfamily of patterns like schat → schatje ('treasure'). While schat is phonetically quite similar
to gat (-ch- is pronounced similarly to -g-), it is inflected analogously to other words prefixed
with sch-. But schip ('ship', with diminutive scheepje) does not follow this pattern.
The benefit of using a memory-based learner for highly exceptional data is that exceptions can,
in principle, always be retrieved for labeling a similar exceptional input case. Memory-based
learners have perfect memory: they store everything. Eager machine learning models, in contrast,
tend to 'compile away' these exceptions: they are after regularities rather than exceptions. Editing
away such exceptions from the training data has been found detrimental to the generalization
accuracy (the ability to handle new, unseen cases outside of the training data) of the resulting
classifier (DaelemansForgetting). Apparently, the subregularities in these pockets of exceptions
are beneficial to keep. One of the challenges for deep learning, an 'eager' form of machine
learning, will be to handle these subregularities in a similar vein.
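As an illustration of the k-nearest-distances procedure described above, here is a minimal, hedged sketch (without feature or exemplar weighting) that classifies a symbolic feature vector by voting over all stored items found at the k smallest distances:

from collections import Counter

def delta(x, y):
    # Simplified IB1 overlap: 0 for identical feature values, 1 otherwise.
    return 0 if x == y else 1

def distance(a, b):
    # Overlap distance between two symbolic feature vectors.
    return sum(delta(x, y) for x, y in zip(a, b))

def knn_distances_classify(memory, item, k=1):
    # Vote over all training items that fall into the k nearest distance sets.
    dists = sorted({distance(features, item) for features, _ in memory})
    nearest = set(dists[:k])
    votes = Counter(label for features, label in memory
                    if distance(features, item) in nearest)
    return votes.most_common(1)[0][0]

# Toy memory of (feature vector, label) pairs; the symbols are arbitrary.
memory = [(('a', 'b', 'c'), 'X'), (('a', 'b', 'd'), 'X'), (('e', 'f', 'g'), 'Y')]
print(knn_distances_classify(memory, ('a', 'b', 'z'), k=1))   # 'X'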
Deep learning is one of the most vibrant buzz phrases of the past 5 years. Deep learning by itself
is nothing new: it is, in the strictest sense, a neural network with lots of internal or hidden layers
and specific filtering operations. Deep learning deploys constructive ways of dealing with large
quantities of information, organized in many layers of representations. While deep learning in its
essential form was invented in the 1960s, it took three decades before it was finally ready for use.
In this section, we will shed some light on why this happened.
But first: which problem does deep learning actually solve for NLP? Deep learning can be seen
as a very effective statistical technique for working with (very) many parameters. Deep learning
is able to handle millions of parameters effectively, each one encoding an aspect of the input data.
Layers in deep learning act as transformations that, step by step, align the input data with the
labels we assign to that data; they disentangle the spaghetti of input data in such a way that
labels can be assigned more easily. The fact that we can stack many such decluttering steps on top
of each other is a major forte of deep learning. For language, deep learning provides two types of
advantages:

The repeated application of data decluttering steps proves good for NLP, but this is not
specific to language; it applies to virtually every modality deep learning is applied to.

Deep learning has facilities for handling sequential information, with memory operators
and buffers. This is quite important for language, and in this respect deep learning is a
form of stateful machine learning, as opposed to the usually stateless other types of
machine learning models, which also usually perform just a single disentanglement step
(like support vector machines).
Figure 1.14 A timeline of neural learning. As we will point out, the gist of Deep Learning
was actually invented in the 1960s.
Central to deep learning is the learning of hierarchical representations of data. Under a vertical
interpretation of a multi-layer neural network, every 'lower' layer feeds into a 'higher' layer.
Layers can be seen as complex functions processing a set of inputs and weights. These weights
encode the importance of the information stored in the network. Networks receive their inputs in
a dedicated input layer, and process that input layer-by-layer, sending it 'upwards' into the
network.
Output layers, finally, produce an outcome: the label the model assigns to its input. All layers
except input and output layers are 'hidden layers' as they cannot be readily observed. As
mentioned, hidden layers in neural networks disentangle linearly inseparable input data
layer-by-layer, step-by-step.
During training, weights are estimated and fine-tuned between 'neurons', which are the basic
processing units of a neural network. Every layer keeps a record of the weights for the neurons
that feed into that layer. Estimating weights is the essential business of neural networks.
Since the layers in a neural network are hierarchically organized (they are stacked), the
representations they produce can be interpreted as hierarchical representations as well: going
from specific (close to the input layer) to more abstract (close to the output layer). In general, it
is hard to come up with human-understandable interpretations of these representations.
Figure 1.15 Facial data analysis with deep learning. Notice how interesting facial concepts
arise from pixels in the higher layers (honglaklee-diss-2010). Picture from
www.quora.com/What-is-deep-learning-Why-is-this-a-growing-trend-in-machine-learning-Why-not-use-SVM

Yet, in the field of image analysis, the weights associated with the layers of deep networks have
been shown to encode interpretable concepts. They can be visualized: the lower layers appear to
encode pixels, while higher layers represent edges, corners and finally even concepts like facial
objects (see Figure 1.15).
In 1965, presumably the first reference to such hierarchical representation learning was published
by Ivakhnenko and Lapa (ivaklapa65). Their paper describes a group method of data handling
(GMDH): a method for producing outputs by layers feeding into each other, based on the
following formula (the Kolmogorov-Gabor polynomial):

Y(x1, …, xn) = a0 + Σi ai·xi + Σi Σj aij·xi·xj + Σi Σj Σk aijk·xi·xj·xk + …

This scary-looking formula describes nothing but a function Y that computes a complex sum of
weighted combinations (groups) of input values, with every factor a being a weight or coefficient.
Notice that we have a one-dimensional weight or coefficient matrix ai, a two-dimensional one aij,
a three-dimensional one aijk, and so on. The weight matrices are usually referred to as
'polynomial models'. The coefficients can be learned from training data through simple methods
such as least squares (wikipedia:ls).
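To make the formula concrete, here is a small, hedged sketch (arbitrary toy coefficients) that evaluates the polynomial truncated after its second-order terms:

import numpy as np

def gmdh_polynomial(x, a0, a1, a2):
    # Y = a0 + sum_i a1[i]*x_i + sum_i sum_j a2[i,j]*x_i*x_j (higher orders omitted).
    first_order = np.dot(a1, x)
    second_order = x @ a2 @ x
    return a0 + first_order + second_order

x = np.array([1.0, 2.0])                 # toy input
a0 = 0.5                                 # arbitrary coefficients
a1 = np.array([0.1, 0.2])
a2 = np.array([[0.0, 0.3],
               [0.3, 0.0]])
print(gmdh_polynomial(x, a0, a1, a2))    # 0.5 + 0.5 + 1.2 = 2.2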
In 1971, the first "deep" 8-layer network based on GMDH was presented (ivakhnenko1971).
Yet, this memorable fact did not coincide with the official launch of deep learning. The
simplistic methods for weight tuning used by GMDH were not scalable to large-scale training. It
took a while before more scalable weight tuning methods like backpropagation came into being,
around 1970 (see Section 2.4 of Deep Learning with Python for an introduction to
backpropagation).
But even with backpropagation, neural networks suffered from various practical problems,
including the notorious vanishing gradient problem. This problem arises during the training of a
network. The crucial ingredient of backpropagation is the stepwise minimization of the error
function of a neural network: taking partial derivatives of the error function with respect to all
weights, and moving stepwise towards its minimum. Gradients generalize the one-variable
derivative of a function: the gradient is the vector of partial derivatives with respect to all
variables (here, the weights). Setting a derivative to zero finds a local maximum or minimum of a
function. If this function is an error function that computes the error a network makes when
predicting output from its weights, we can look for weight adjustments that push the error
function ever closer to its minimum. This procedure is called gradient descent, and it is the
driving force behind backpropagation. Backpropagation has built up an impressive track record,
and underlies the majority of neural network results. However, for deep and complex networks,
with millions of weights, weight adjustments can easily become too tiny to be useful: they just
vanish, and gradient descent is no longer effective.
Backpropagation deploys the so-called 'chain rule' from calculus for computing the weight
adaptations per layer. The chain rule is a rule for computing derivatives of functions that are
applied to functions (compositions). Essentially, this is what happens when we apply activation
functions to the output of layers (which themselves apply activation functions to the output of
the layers that feed into them, and so on).

As mentioned, gradient descent uses these composed derivatives by working across layers and
their respective activation functions, and makes corresponding weight updates that move the
network closer to perfection. Now, activation functions whose derivatives are capped by small
numbers (like values between 0 and 1) result in weight adaptations that are small by nature, and
repeated multiplication of these small numbers through the chain rule leads to thinning and,
eventually, evaporation of values. This means that the weight adaptations (which are computed
from the top-most layer of the network, just 'under' the output layer) never reach the layers close
to the input layer, which consequently do not get 'trained'.
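A quick numeric illustration of this thinning effect (a toy calculation, not a full backpropagation): the derivative of the sigmoid activation never exceeds 0.25, so chaining it across many layers shrinks the propagated gradient towards zero.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximal value 0.25, reached at x = 0

print(sigmoid_derivative(0.0))    # 0.25
print(0.25 ** 10)                 # ~9.5e-07: after ten layers, almost nothing is left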
So, how did deep learning escape from this conundrum? There are at least a couple of solutions
that have alleviated the problem. The most prominent one is the use of a feature selection/data
reconstruction cycle, as put forward by Restricted Boltzmann Machines (RBMs). RBMs are
complete networks that learn probability distributions from data. They can be stacked on top of
each other, as layers, where every layer is a separate RBM sending its hidden-layer data as input
to the next layer, and not through inter-layer connections between hidden layers. This set-up
allows for layer-wise training of networks, and eliminates much of the vanishing gradient
problem, since gradients don't need to travel far down the network: they are confined to separate
layers.
In addition to all this, a new type of activation function has become popular in deep learning:
the rectified linear unit (ReLU). A ReLU is a very simple non-linear function that computes the
maximum of two values, one of which is the input to a neuron. Specifically,

ReLU(x) = max(0, x)

So, ReLU(x) just returns whichever is bigger: zero or x. It eliminates all values of x below zero.
Now, if we apply this function to every neuron in our network, only the neurons with a positive
summed input pass a non-zero value on to the next layer:

ReLU(Σi wi·xi + b)
This formula expresses the effect of ReLU applied to a sum (Σ) of inputs multiplied by
weights, augmented with a bias term. Figure 1.17 shows how this works.
Figure 1.17 ReLU operations at work on the final hidden layer output of a neural net.
The ReLU function is differentiable everywhere except at exactly 0, and its derivative is
quite simple:

ReLU'(x) = 1 for x > 0, and 0 for x < 0

which has beneficial effects on the speed and scalability of the network computations during
backpropagation.
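In code, ReLU and its (almost everywhere defined) derivative are one-liners:

import numpy as np

def relu(x):
    # ReLU(x) = max(0, x): negative inputs are zeroed out.
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 for positive inputs, 0 for negative ones (undefined at exactly 0).
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))               # [0. 0. 0. 0.5 2.]
print(relu_derivative(x))    # [0. 0. 0. 1. 1.]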
Compare this with the sigmoid activation function, sigmoid(x) = 1 / (1 + e^-x), which is the
activation we will use in the experiment below.
To witness the dramatic effect the choice of activation function has on the performance of your
neural network, let's try out an overly deep network on a small snippet of sentiment data.

Scenario: You want to train a deep network on a sentiment labeling task. The task consists
of labeling texts with sentiment labels: 1 for positive sentiment, and 0 for negative. You are
unsure which activation function you should choose. Can you find out the best option
experimentally?

Our data, taken from (PAngLeePlotTokSUBJ), looks like this: a set of sentences, labeled with
either a zero (negative sentiment) or a one (positive sentiment):
Review | Sentiment score
smart and alert , thirteen conversations about one thing is a small gem | 1
color , musical bounce and warm seas lapping on island shores and just enough science to send you home thinking | 1
it is not a mass-market entertainment but an uncompromising attempt by one artist to think about another | 1
a love affair with a veterinarian who is a non-practicing jew | 1
initially reluctant to help , daniel’s conscience eventually gets the better of him | 0
his plans go awry , however , when his older brother , keith , returns from prison | 0
inspired and motivated , the kids save the day , showing bravery and nobility | 0
In raw format, our data looks like this. It is tab-delimited data, with a header containing the
names of two columns:
"text"<tab>"label"
smart and alert , thirteen conversations about one thing is a small gem <tab>1
...
Our model is a ten-layer deep network, listed below. The model is trained on 90% of our training
data while keeping 10% separate for validation purposes.
Listing 1.4 A 10-layer deep MLP sentiment classifier with sigmoid activation
functions.
from keras.models import Sequential
from keras.utils import np_utils
from keras.preprocessing.text import Tokenizer
from keras.layers.core import Dense, Activation
import pandas as pd
import sys

data = pd.read_csv(sys.argv[1], sep='\t')
docs = data["text"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
# Vectorize documents and labels (assumed reconstruction, matching the annotations
# below: binary bag-of-words document vectors and one-hot label vectors).
X_train = tokenizer.texts_to_matrix(docs, mode='binary')
y_train = np_utils.to_categorical(data["label"])

input_dim = X_train.shape[1]
nb_classes = y_train.shape[1]

model = Sequential()
model.add(Dense(128, input_dim=input_dim))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print("Training...")
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1,
          shuffle=False, verbose=2)
The pandas library has many handy functions for processing comma-separated
(csv) and tab-delimited (tsv) data.
Our data is tab-delimited: sentences separated by tabs from their labels. We read
the data into a 'dataframe' (a primitive of pandas), and extract the "text" field (the
column labeled "text" in our data) as our document set.
The np_utils module contains a function for converting class labels into one-hot
('categorical') vectors.
Keras has a Tokenizer facility for converting text into numerical vectors, consisting
of unique integers referring to the original words.
We apply the Tokenizer to our documents.
Next, we generate vector representations of our documents. Together, they form a
matrix. The tokenizer builds up a lexicon mapping words to integers, and generates
binary vectors of a fixed dimension, with a '1' at position i whenever the word with
index i in the lexicon occurs in the document (position 0 is reserved and stays 0).
So, a '1' at position 3 means that the word with index 3 in the lexicon is in the
document. An example is presented below.
We infer the input size (dimension) from our vectorized data, and the number of
classes.
The network contains 10 Dense layers (standard, fully-connected layers), and
deploys sigmoid activation functions that pass the incoming connections into each
neuron through the sigmoid function.
The output layer is a dense layer with as many neurons as the number of classes.
The softmax activation function generates output probabilities.
We compile the model.
The model is fitted on our data, using 10% of the training data as held-out data for
validation.
>>> docs = ['smart and alert , thirteen conversations about one thing is a small
gem','not very smart movie']
>>> tok=Tokenizer()
>>> tok.fit_on_texts(docs)
>>> tok.texts_to_matrix(docs, mode='binary')
array([[0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0.,
0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1.,
1.]])
>>> tok.word_index
{'and': 1, 'a': 9, 'about': 5, 'very': 15, 'not': 14, 'conversations': 4, 'is': 8,
'one': 6, 'mart': 13, 'thing': 7, 'thirteen': 3, 'movie': 16, 'small': 10, 'alert': 2,
'gem': 11, 'smart': 12}
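To answer the scenario question experimentally, you can simply swap the activation functions in Listing 1.4 and compare the validation accuracies. A hedged sketch of the ReLU variant (reusing the data preparation, input_dim, nb_classes, X_train and y_train from Listing 1.4, everything else unchanged) is:

from keras.models import Sequential
from keras.layers.core import Dense, Activation

model = Sequential()
model.add(Dense(128, input_dim=input_dim))
model.add(Activation('relu'))
for _ in range(9):                    # the remaining nine hidden layers of 128 units
    model.add(Dense(128))
    model.add(Activation('relu'))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1,
          shuffle=False, verbose=2)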