
Deep Learning Basics: Recurrent Neural Networks

WHAT ARE RNNS?


Basic idea:
Assumption: Text is written sequentially, so our model should read it sequentially
“RNN”: class of neural network that processes text sequentially (left-to-right or right-to-left)
Generally speaking:

- Internal “state” h^(j)
- The RNN consumes one input x^(j) per time step j
- Update function: h^(j) = f_theta(h^(j-1), x^(j)), where the parameters theta are shared across all time steps
VANILLA RNN
Notation: h^(j) = tanh(W x^(j) + V h^(j-1) + b), with input matrix W, recurrent matrix V, bias b, and an initial state h^(0) (e.g., all zeros)
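As a concrete illustration, here is a minimal numpy sketch of this forward pass; the toy dimensions and random initialization are arbitrary choices for the example.

```python
import numpy as np

def vanilla_rnn_forward(X, W, V, b):
    """Run a Vanilla RNN left-to-right over a sequence of input vectors.

    X: (J, d_in) -- one input vector x^(j) per time step
    W: (d_h, d_in), V: (d_h, d_h), b: (d_h,) -- shared across all time steps
    Returns all hidden states h^(1) ... h^(J) as a (J, d_h) array.
    """
    h = np.zeros(b.shape[0])              # initial state h^(0)
    states = []
    for x in X:                           # one time step per input
        h = np.tanh(W @ x + V @ h + b)    # the Vanilla RNN update
        states.append(h)
    return np.stack(states)

# Toy usage: 5 time steps, 4-dimensional inputs, 8-dimensional hidden state
rng = np.random.default_rng(0)
H = vanilla_rnn_forward(rng.normal(size=(5, 4)),
                        rng.normal(scale=0.1, size=(8, 4)),
                        rng.normal(scale=0.1, size=(8, 8)),
                        np.zeros(8))
print(H.shape)  # (5, 8)
```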

VANISHING GRADIENTS
Vanishing gradients mean that the impact an input has on the gradient of the loss becomes smaller the further
away it is from the loss in the computation graph.
Backpropagate from the loss to h^(j) via the chain rule:

dL/dh^(j) = dL/dh^(J) * dh^(J)/dh^(J-1) * ... * dh^(j+1)/dh^(j)

What happens to this derivative when the distance J-j grows?

Each factor dh^(k)/dh^(k-1) is the Jacobian diag(tanh'(a^(k))) * V, where a^(k) = W x^(k) + V h^(k-1) + b is the
pre-activation at step k.
Remember that tanh is applied elementwise, and that the derivative of tanh is between 0 and 1. So the first part,
diag(tanh'(a^(k))), is a diagonal matrix with entries between 0 and 1. The product of many such matrices approaches zero.
Furthermore, the second part of each Jacobian is just V. When V is initialized with small enough values, the product
of many copies of V will approach zero as well.

As a result, the derivative above approaches zero – it vanishes.


What does this mean?

Since the “dummy parameter gradients” of step j are computed from dL/dh^(j), they approach zero too, i.e., their
contribution to the “dummy gradient sum” over all time steps is negligible.
This means that if the words your RNN should be paying attention to are far from the loss, the network will adjust
its weights towards those words only slowly, or not at all.

EXPLODING GRADIENTS

So why don’t we just use a nonlinearity with a derivative larger than 1, or initialize differently?

Then the product of Jacobians above would become very large (“explode”). This is even worse than vanishing
gradients, because it leads to non-convergence of gradient descent.
So vanishing gradients are the lesser of two evils.
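A tiny numeric illustration of both regimes, assuming the Jacobian factorization diag(tanh') * V sketched above; the dimensions, initialization scales, and number of steps are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps = 8, 50
grad = np.ones(d)   # stand-in for the gradient arriving at the last hidden state

for scale, label in [(0.1, "small init -> vanishing"), (2.0, "large init -> exploding")]:
    V = rng.normal(scale=scale, size=(d, d))
    g = grad.copy()
    for _ in range(steps):
        a = rng.normal(size=d)                       # pretend pre-activation at this step
        jacobian = np.diag(1 - np.tanh(a) ** 2) @ V  # dh^(k)/dh^(k-1) = diag(tanh') * V
        g = jacobian.T @ g                           # backpropagate one step further
    print(label, np.linalg.norm(g))                  # ~0 in the first case, huge in the second
```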

LONG SHORT-TERM MEMORY NETWORK (LSTM)


Became popular around 2010 for handwriting recognition, speech recognition, and many NLP problems
Addresses vanishing gradients by changing the architecture of the RNN cell

1) Two states: h^(j) (“short-term memory”) and c^(j) (“long-term memory”)

2) Candidate state c~^(j) corresponds to h^(j) in the Vanilla RNN

3) Interactions are mediated by “gates” in (0, 1)^d, which are applied elementwise:

- Forget gate f^(j) decides what information from c^(j-1) should be forgotten

- Input gate i^(j) decides what information from the candidate state c~^(j) should be added to c^(j)

- Output gate o^(j) decides what information from c^(j) should be exposed to h^(j)

4) Each gate and the candidate state have their own parameters

5) “Gradient highway” from c^(j) back to c^(j-1), with no non-linearities or matrix multiplications in between

LSTM DEFINITION
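A single-step numpy sketch of the standard LSTM cell equations; the parameter names (Wf, Vf, bf, ...) are an assumed convention and may differ from the slides' notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM time step. P is a dict with per-gate parameters (Wf, Vf, bf, ...)."""
    f = sigmoid(P["Wf"] @ x + P["Vf"] @ h_prev + P["bf"])        # forget gate
    i = sigmoid(P["Wi"] @ x + P["Vi"] @ h_prev + P["bi"])        # input gate
    o = sigmoid(P["Wo"] @ x + P["Vo"] @ h_prev + P["bo"])        # output gate
    c_tilde = np.tanh(P["Wc"] @ x + P["Vc"] @ h_prev + P["bc"])  # candidate state
    c = f * c_prev + i * c_tilde    # long-term memory: elementwise only -- the "gradient highway"
    h = o * np.tanh(c)              # short-term memory, exposed to the output / next layer
    return h, c
```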
TAGGING (EXAMPLE: PART-OF-SPEECH)

AUTOREGRESSIVE LANGUAGE MODELING

SEQ-TO-SEQ (MACHINE TRANSLATION)

MULTI-LAYER RNNS
- Stack of several RNNs (Vanilla RNNs, LSTMs, GRUs, etc.)
- Each RNN in the stack has its own parameters
- The input vectors of the l’th RNN are the hidden states of the l-1’th RNN
- The input vectors to the first RNN are the word embeddings, as usual
- We can output the hidden states of the last RNN, or a combination (concatenation, average, etc.) of the
states of all RNNs.
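A minimal sketch of such a stack, assuming PyTorch; the vocabulary, embedding, and hidden sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Two stacked LSTMs: the hidden states of layer 1 are the input vectors of layer 2.
emb = nn.Embedding(num_embeddings=10_000, embedding_dim=128)   # word embeddings
rnn = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, batch_first=True)

tokens = torch.randint(0, 10_000, (4, 20))   # batch of 4 sequences, 20 tokens each
hidden_states, _ = rnn(emb(tokens))          # hidden states of the last (top) layer
print(hidden_states.shape)                   # torch.Size([4, 20, 256])
```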

BIDIRECTIONAL RNNS

Two RNNs with separate parameters


Forward RNN runs left-to-right over the input
Backward RNN runs right-to-left over the input

The bidirectional RNN yields two sequences of hidden states: one from the forward RNN and one from the backward RNN.


Question: If we are dealing with a sentence classification task, which states should we use to represent the
sentence?

Concatenate the final state of the forward RNN (at the last word) and the final state of the backward RNN (at the first word), because together they have “seen” the entire sentence.

For a tagging task, represent the j’th word as the concatenation of the forward and backward hidden states at position j.


Question: Can we use a bidirectional RNN for autoregressive language modeling?
- No. In autoregressive language modeling, future inputs must be unknown to the model (since we want to
learn to predict them).
- We could train two separate autoregressive RNNs (one per direction), but we cannot combine their hidden
states before making a prediction
In sequence-to-sequence (e.g., Machine Translation), the encoder can be bidirectional, but the decoder cannot
(same reason)
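A sketch of a bidirectional encoder, again assuming PyTorch; the slicing shows where the forward and backward states live in the concatenated output (sizes are arbitrary).

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10_000, 128)
birnn = nn.LSTM(input_size=128, hidden_size=256, batch_first=True, bidirectional=True)

tokens = torch.randint(0, 10_000, (4, 20))
states, _ = birnn(emb(tokens))   # per position: [forward state ; backward state]
print(states.shape)              # torch.Size([4, 20, 512])

# Tagging: states[:, j, :] already is the concatenation for the j'th word.
# Sentence classification: concatenate the forward state at the last word and the
# backward state at the first word -- together they have "seen" the whole sentence.
sentence_repr = torch.cat([states[:, -1, :256], states[:, 0, 256:]], dim=-1)
print(sentence_repr.shape)       # torch.Size([4, 512])
```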

Attention
ENCODER-DECODER WITH RNNS
LIMITATIONS OF RNNS
In an RNN, at a given point in time j, the information about all past inputs x(1) … x(j) is “crammed” into the state
vector h^(j) (or the pair (h^(j), c^(j)) for an LSTM).


So for long sequences, the state becomes a bottleneck
Especially problematic in encoder-decoder models (e.g., for Machine Translation)
Solution: Attention (Bahdanau et al., 2015) – an architectural modification of the RNN encoder-decoder that allows
the model to “attend to” past encoder states
BAHDANAU ATTENTION, ATTENTION: THE BASIC RECIPE
Ingredients:

- One query vector q
- J key vectors K
- J value vectors V
- A scoring function a that maps a query-key pair to a scalar (“score”); a may be parametrized by parameters theta_a

MODELING LONG-RANGE DEPENDENCIES

RNNenc (classical encoder-decoder) vs. RNNsearch (encoder-decoder with attention)

BLEU score: measure of translation quality (higher is better)
In this comparison, RNNsearch's BLEU score degrades far less than RNNenc's as input sentences get longer.
ATTENTION: THE BASIC RECIPE

Step 1: Apply a to q and all keys to get J scores (one per key): e_j = a(q, k^(j))

Step 2: Turn e into a probability distribution alpha with the softmax function: alpha_j = exp(e_j) / sum_j' exp(e_j')

Step 3: The alpha-weighted sum over the values yields one d_v-dimensional output vector: o = sum_j alpha_j v^(j)

Intuition: alpha_j is how much “attention” the model pays to the value v^(j) when computing the output o.
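The three steps as a small numpy sketch; using the dot product as the scoring function a is just one possible (unparametrized) choice for this example.

```python
import numpy as np

def softmax(e):
    e = e - e.max()                         # numerical stability
    return np.exp(e) / np.exp(e).sum()

def attention(q, K, V, score):
    """q: (d_q,), K: (J, d_k), V: (J, d_v), score: maps (query, key) to a scalar."""
    e = np.array([score(q, k) for k in K])  # Step 1: one score per key
    alpha = softmax(e)                      # Step 2: scores -> probability distribution
    return alpha @ V, alpha                 # Step 3: alpha-weighted sum of the values

dot = lambda q, k: q @ k                    # a simple, parameter-free scoring function

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(6, 4)), rng.normal(size=(6, 3))
out, alpha = attention(q, K, V, dot)
print(out.shape, alpha.sum().round(3))      # (3,) 1.0
```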


ATTENTION: AN ANALOGY
- We have J weather stations on a map

- The keys k^(j) are their geolocations (x, y coordinates)

- The values v^(j) are their current weather conditions (temperature, humidity, etc.)

- The query q is a new geolocation for which we want to estimate the weather conditions

- e_j is the relevance of the j’th station to the query location, and alpha_j is e_j turned into a probability

- The output o is a weighted sum of all known weather conditions, where stations that have a small distance to q
(high alpha_j) get a higher weight
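The analogy in a few lines of numpy; using the negative squared distance as the relevance score e_j is an assumption made for this toy example.

```python
import numpy as np

stations_xy      = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])    # keys: geolocations
stations_weather = np.array([[20.0, 0.6], [22.0, 0.5], [10.0, 0.9]]) # values: temp., humidity
query_xy         = np.array([0.5, 0.0])                              # query: new location

e = -np.sum((stations_xy - query_xy) ** 2, axis=1)        # relevance: nearby stations score higher
alpha = np.exp(e - e.max()) / np.exp(e - e.max()).sum()   # e as a probability (softmax)
estimate = alpha @ stations_weather                       # weighted sum of known weather conditions
print(alpha.round(3), estimate.round(2))                  # the far-away station barely contributes
```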

ATTENTION IN NEURAL NETWORKS

Contrary to our geolocation example, the query, key, and value vectors of a neural network are functions of the
input and of trainable parameters.
So the model learns which keys are relevant for which queries, based on the training data and the loss function.

A PRIMER ON THE TRANSFORMER


The Bahdanau model is still an RNN, just with attention on top.
Another architecture that consists of attention only:
- Transformer: “Attention is all you need”
- No recurrence, just Attention (as the name suggests)
- Better parallelizable than RNNs
No (or few) assumptions are baked into the architecture (no notion of which words are neighbors in the sentence,
sequentiality, etc.)
The lack of prior knowledge often means that the Transformer requires more training data than an RNN to reach a
given level of performance.
But when presented with sufficient data, it usually outperforms RNNs.

Revisiting words: Tokenization


PROCESS OF TEXT TOKENIZATION
Breaking text into smaller units called tokens
- Tokens are discrete text units (letters, words, etc.)
- They are the building blocks of natural language
Encoding each token with unique IDs (numbers)
Performed on the entire corpus of documents: a corpus vocabulary of unique tokens is obtained
Mandatory preprocessing step for most NLP tasks

WHY TOKENIZE?
Computers must understand text: Text encoding is necessary, Encode small rather than large units
Corpus documents can be large and hard to interpret: Working with tokens is easier, Building meaning in bottom-up
fashion
Text may contain extra whitespaces: Tokenization removes them

WORD TOKENIZATION
Most popular type of tokenization: Applied as preprocessing step in most NLP tasks
Considers dictionary words and several delimiters
- Accuracy depends on dictionary used for training
- Tradeoff between accuracy and efficiency
Whitespaces and punctuation symbols are used: They determine word boundaries
Available in many NLP libraries
Example: What is the tallest building? => ‘What’, ‘is’, ‘the’, ‘tallest’, ‘building’, ‘?’
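A minimal regex-based sketch of such a tokenizer; real tokenizers in NLP libraries handle many more cases (abbreviations, numbers, clitics, etc.).

```python
import re

def word_tokenize(text):
    """Split on whitespace, keeping punctuation symbols as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("What is the tallest building?"))
# ['What', 'is', 'the', 'tallest', 'building', '?']
```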

SUBWORD TOKENIZATION
Finer grained than word tokenization: Breaks text into words, Breaks words into smaller units (root, prefix, suffix,
etc.)
More important for highly inflected languages
- Words have many forms
- Prefixes and suffixes are added
- Word meaning and function changes
Helps to reduce out-of-vocabulary words
Example: What is the tallest building? => ‘What’, ‘is’, ‘the’, ‘tall’, ‘est’, ‘build’, ‘ing’, ‘?’

BYTEPAIR ENCODING (BPE)


Originally a data compression algorithm
Considers data at the byte level
Looks at pairs of bytes:
1 Count the occurrences of all byte pairs
2 Find the most frequent byte pair
3 Replace it with an unused byte
Repeat this process until no further compression is possible
Open-vocabulary neural machine translation
Instead of looking at bytes, look at characters
Motivation: Translation is an open-vocabulary problem
Word-level NMT models:
- Handle out-of-vocabulary words by using back-off dictionaries
- Are unable to translate or generate previously unseen words
Using BPE effectively solves this problem
Adapt BPE for word segmentation
Goal: Represent an open vocabulary by a vocabulary of fixed size: use variable-length character sequences
Looking at pairs of characters:
1 Initialize the vocabulary with all characters plus an end-of-word token
2 Count occurrences and find the most frequent character pair, e.g. "A" and "B" (!!! word boundaries are not
crossed) [Side effect: can be run on a dictionary with frequency counts]
3 Replace it with the new token "AB"
Repeat this process until the given vocabulary size |V| is reached
Only one hyperparameter: vocabulary size (initial vocabulary + specified number of merge operations)
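A compact sketch of this learning loop, run on a dictionary with frequency counts as noted above; the toy corpus and the '</w>' end-of-word marker are illustrative choices, and the original subword-nmt implementation differs in its details.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary.

    Each word is a tuple of symbols ending in '</w>'; pairs are only counted
    inside words, so word boundaries are never crossed.
    """
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():               # 1) count all symbol pairs
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                   # 2) most frequent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():                # 3) replace it with the new token "AB"
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab

merges, segmented = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=10)
print(merges[:4])   # the first few learned merge operations
```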
SETUP AND MERGING (EXAMPLE)
WORDPIECE
Voice Search for Japanese and Korean
Specific Problems:
- Asian languages have larger basic character inventories compared to Western languages
- Concept of spaces between words does (partly) not exist
- Many different pronunciations for each character
WordPiece model: data-dependent and does not produce OOVs
1) Initialize the vocabulary with basic Unicode characters (22k for Japanese, 11k for Korean)
!! Spaces are indicated by an underscore attached before (or after) the respective basic unit or word (this increases
the initial |V| by up to a factor of 4)
2) Build a language model using this vocabulary
3) Merge word units that increase the likelihood on the training data the most, when added to the model
Two possible stopping criteria:
Vocabulary size or incremental increase of the likelihood
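A rough sketch of how the next merge could be picked under a unigram approximation of this likelihood criterion; the original WordPiece model builds a full language model and measures the actual likelihood increase, so the scoring below is a simplification. Following the slide, '_' marks a word-initial unit.

```python
from collections import Counter
from math import log

def best_wordpiece_merge(corpus_words):
    """Pick the pair of adjacent units whose merge most increases a unigram
    approximation of the training-data likelihood (a simplified criterion)."""
    unit_counts, pair_counts = Counter(), Counter()
    for units in corpus_words:
        unit_counts.update(units)
        pair_counts.update(zip(units, units[1:]))
    total = sum(unit_counts.values())

    def likelihood_gain(pair):
        a, b = pair
        c = pair_counts[pair]
        # ~ count(ab) * log( P(ab) / (P(a) * P(b)) ): merge units that co-occur far more than chance
        return c * (log(c / total) - log(unit_counts[a] / total) - log(unit_counts[b] / total))

    return max(pair_counts, key=likelihood_gain)

corpus = [list("_hugging"), list("_face"), list("_hug"), list("_hugs")]
print(best_wordpiece_merge(corpus))   # e.g. ('h', 'u')
```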

ABSTRACTS:
1) A Neural Probabilistic Language Model

A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language.
This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested
is likely to be different from all the word sequences seen during training. Traditional but very successful approaches
based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set.
We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each
training sentence to inform the model about an exponential number of semantically neighboring sentences. The
model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for
word sequences, expressed in terms of these representations.

Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is
made of words that are similar (in the sense of having a nearby representation) to words forming an already seen
sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant
challenge. We report on experiments using neural networks for the probability function, showing on two text corpora
that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed
approach allows to take advantage of longer contexts.

Fighting the Curse of Dimensionality with Distributed Representations

In a nutshell, the idea of the proposed approach can be summarized as follows:

1. associate with each word in the vocabulary a distributed word feature vector (a real-valued vector in R^m),

2. express the joint probability function of word sequences in terms of the feature vectors of these words in the
sequence, and

3. learn simultaneously the word feature vectors and the parameters of that probability function.
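A minimal PyTorch sketch of these three steps; the layer sizes are arbitrary, and the original model additionally has direct connections from the input features to the output layer.

```python
import torch
import torch.nn as nn

class NeuralProbabilisticLM(nn.Module):
    """Embed the n-1 context words, concatenate their feature vectors, and map
    them to a probability distribution over the next word."""
    def __init__(self, vocab_size, m=64, context=4, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, m)          # 1) word feature vectors in R^m
        self.mlp = nn.Sequential(                       # 2) probability function for word sequences
            nn.Linear(context * m, hidden), nn.Tanh(),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, context_ids):                     # context_ids: (batch, context)
        x = self.emb(context_ids).flatten(start_dim=1)  # concatenate the feature vectors
        return self.mlp(x).log_softmax(dim=-1)          # log P(next word | context)

# 3) The feature vectors and the probability function are trained jointly,
#    e.g. by minimizing nn.NLLLoss on the log-probabilities above.
lm = NeuralProbabilisticLM(vocab_size=10_000)
print(lm(torch.randint(0, 10_000, (8, 4))).shape)       # torch.Size([8, 10000])
```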

2) Efficient Estimation of Word Representations in Vector Space

We propose two novel model architectures for computing continuous vector representations of words from very
large data sets. The quality of these representations is measured in a word similarity task, and the results are
compared to the previously best performing techniques based on different types of neural networks. We observe
large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality
word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art
performance on our test set for measuring syntactic and semantic word similarities.

Many current NLP systems and techniques treat words as atomic units - there is no notion of similarity between
words, as these are represented as indices in a vocabulary. This choice has several good reasons - simplicity,
robustness and the observation that simple models trained on huge amounts of data outperform complex systems
trained on less data. An example is the popular N-gram model used for statistical language modeling - today, it is
possible to train N-grams on virtually all available data (trillions of words [3]).

However, the simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain
data for automatic speech recognition is limited - the performance is usually dominated by the size of high quality
transcribed speech data (often just millions of words). In machine translation, the existing corpora for many
languages contain only a few billions of words or less. Thus, there are situations where simple scaling up of the basic
techniques will not result in any significant progress, and we have to focus on more advanced techniques.

With progress of machine learning techniques in recent years, it has become possible to train more complex models
on much larger data set, and they typically outperform the simple models. Probably the most successful concept is
to use distributed representations of words [10]. For example, neural network based language models significantly
outperform N-gram models.

3) Enriching Word Vectors with Subword Information

Continuous word representations, trained on large unlabeled corpora are useful for many natural language
processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a
distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare
words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as
a bag of character n-grams. A vector representation is associated to each character n-gram; words being
represented as the sum of these representations. Our method is fast, allowing to train models on large corpora
quickly and allows us to compute word representations for words that did not appear in the training data. We
evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By
comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-
art performance on these tasks.

In this paper, we propose to learn representations for character n-grams, and to represent words as the sum of the
n-gram vectors. Our main contribution is to introduce an extension of the continuous skipgram model (Mikolov et al.,
2013b), which takes into account subword information. We evaluate this model on nine languages exhibiting
different morphologies, showing the benefit of our approach.

Subword model: By using a distinct vector representation for each word, the skipgram model ignores the internal
structure of words. In this section, we propose a different scoring function s, in order to take into account this
information. Each word w is represented as a bag of character n-gram. We add special boundary symbols < and > at
the beginning and end of words, allowing to distinguish prefixes and suffixes from other character sequences. We
also include the word w itself in the set of its n-grams, to learn a representation for each word (in addition to
character n-grams). Taking the word where and n = 3 as an example, it will be represented by the character n-
grams: <wh, whe, her, ere, re> and the special sequence <where>. Note that the sequence <her>, corresponding
to the word her is different from the tri-gram her from the word where. In practice, we extract all the n-grams for n
greater or equal to 3 and smaller or equal to 6. This is a very simple approach, and different sets of n-grams could
be considered, for example taking all prefixes and suffixes.
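A small sketch of this n-gram extraction; the function name and the use of a set are illustrative choices.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary symbols, plus the whole word as a special sequence."""
    w = "<" + word + ">"
    grams = {w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)   # the special sequence for the word itself
    return grams

print(sorted(g for g in char_ngrams("where") if len(g) == 3))
# ['<wh', 'ere', 'her', 're>', 'whe']
```

The word's vector is then the sum of the vectors of these character n-grams plus the special whole-word sequence, as described above.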

4) NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE

Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical
machine translation, the neural machine translation aims at building a single neural network that can be jointly
tuned to maximize the translation performance. The models proposed recently for neural machine translation often
belong to a family of encoder–decoders and encode a source sentence into a fixed-length vector from which a
decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in
improving the performance of this basic encoder–decoder architecture, and propose to extend this by allowing a
model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word,
without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation
performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French
translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with
our intuition.
Most of the proposed neural machine translation models belong to a family of encoder– decoders (Sutskever et al.,
2014; Cho et al., 2014a), with an encoder and a decoder for each language, or involve a language-specific encoder
applied to each sentence whose outputs are then compared (Hermann and Blunsom, 2014). An encoder neural
network reads and encodes a source sentence into a fixed-length vector. A decoder then outputs a translation from
the encoded vector. The whole encoder–decoder system, which consists of the encoder and the decoder for a
language pair, is jointly trained to maximize the probability of a correct translation given a source sentence. A
potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the
necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural
network to cope with long sentences, especially those that are longer than the sentences in the training corpus. Cho
et al. (2014b) showed that indeed the performance of a basic encoder–decoder deteriorates rapidly as the length of
an input sentence increases. In order to address this issue, we introduce an extension to the encoder–decoder model
which learns to align and translate jointly. Each time the proposed model generates a word in a translation, it
(soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The
model then predicts a target word based on the context vectors associated with these source positions and all the
previous generated target words.

The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not
attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence
into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This
frees a neural translation model from having to squash all the information of a source sentence, regardless of its
length, into a fixed-length vector. We show this allows a model to cope better with long sentences. In this paper, we
show that the proposed approach of jointly learning to align and translate achieves significantly improved
translation performance over the basic encoder–decoder approach. The improvement is more apparent with longer
sentences, but can be observed with sentences of any length. On the task of English-to-French translation, the
proposed approach achieves, with a single model, a translation performance comparable, or close, to the
conventional phrase-based system. Furthermore, qualitative analysis reveals that the proposed model finds a
linguistically plausible (soft-)alignment between a source sentence and the corresponding target sentence.
