
Deep Learning Basics: Recurrent Neural Networks

WHAT ARE RNNS?


Basic idea:
Assumption: Text is written sequentially, so our model should read it sequentially
“RNN”: class of neural network that processes text sequentially (left-to-right or right-to-left)
Generally speaking:

- Internal “state” h^(j)
- The RNN consumes one input x^(j) per time step j
- Update function: h^(j) = f_theta(h^(j-1), x^(j)), where the parameters theta are shared across all time steps
VANILLA RNN
Notation: h^(j) = tanh(W x^(j) + V h^(j-1) + b), with input matrix W, recurrent matrix V, bias b, and an initial state h^(0) (e.g., all zeros)
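As a concrete illustration, here is a minimal numpy sketch of this forward pass; the toy dimensions and random initialization are arbitrary choices for the example.

```python
import numpy as np

def vanilla_rnn_forward(X, W, V, b):
    """Run a Vanilla RNN left-to-right over a sequence of input vectors.

    X: (J, d_in) -- one input vector x^(j) per time step
    W: (d_h, d_in), V: (d_h, d_h), b: (d_h,) -- shared across all time steps
    Returns all hidden states h^(1) ... h^(J) as a (J, d_h) array.
    """
    h = np.zeros(b.shape[0])              # initial state h^(0)
    states = []
    for x in X:                           # one time step per input
        h = np.tanh(W @ x + V @ h + b)    # the Vanilla RNN update
        states.append(h)
    return np.stack(states)

# Toy usage: 5 time steps, 4-dimensional inputs, 8-dimensional hidden state
rng = np.random.default_rng(0)
H = vanilla_rnn_forward(rng.normal(size=(5, 4)),
                        rng.normal(scale=0.1, size=(8, 4)),
                        rng.normal(scale=0.1, size=(8, 8)),
                        np.zeros(8))
print(H.shape)  # (5, 8)
```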

VANISHING GRADIENTS
Vanishing gradients mean that the impact an input has on the gradient of the loss becomes smaller the further
away it is from the loss in the computation graph.
Backpropagate from the loss to h^(j) via the chain rule:

dL/dh^(j) = dL/dh^(J) * dh^(J)/dh^(J-1) * ... * dh^(j+1)/dh^(j)

What happens to this derivative when the distance J-j grows?

Each factor dh^(k)/dh^(k-1) is the Jacobian diag(tanh'(a^(k))) * V, where a^(k) = W x^(k) + V h^(k-1) + b is the
pre-activation at step k.
Remember that tanh is applied elementwise, and that the derivative of tanh is between 0 and 1. So the first part,
diag(tanh'(a^(k))), is a diagonal matrix with entries between 0 and 1. The product of many such matrices approaches zero.
Furthermore, the second part of each Jacobian is just V. When V is initialized with small enough values, the product
of many copies of V will approach zero as well.

As a result, the derivative above approaches zero – it vanishes.


What does this mean?

Since the “dummy parameter gradients” of step j are computed from dL/dh^(j), they approach zero too, i.e., their
contribution to the “dummy gradient sum” over all time steps is negligible.
This means that if the words your RNN should be paying attention to are far from the loss, the network will adjust
its weights towards those words only slowly, or not at all.

EXPLODING GRADIENTS

So why don’t we just use a nonlinearity with a derivative larger than 1, or initialize differently?

Then the product of Jacobians above would become very large (“explode”). This is even worse than vanishing
gradients, because it leads to non-convergence of gradient descent.
So vanishing gradients are the lesser of two evils.
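A tiny numeric illustration of both regimes, assuming the Jacobian factorization diag(tanh') * V sketched above; the dimensions, initialization scales, and number of steps are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps = 8, 50
grad = np.ones(d)   # stand-in for the gradient arriving at the last hidden state

for scale, label in [(0.1, "small init -> vanishing"), (2.0, "large init -> exploding")]:
    V = rng.normal(scale=scale, size=(d, d))
    g = grad.copy()
    for _ in range(steps):
        a = rng.normal(size=d)                       # pretend pre-activation at this step
        jacobian = np.diag(1 - np.tanh(a) ** 2) @ V  # dh^(k)/dh^(k-1) = diag(tanh') * V
        g = jacobian.T @ g                           # backpropagate one step further
    print(label, np.linalg.norm(g))                  # ~0 in the first case, huge in the second
```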

LONG SHORT-TERM MEMORY NETWORK (LSTM)


Became popular around 2010 for handwriting recognition, speech recognition, and many NLP problems
Addresses vanishing gradients by changing the architecture of the RNN cell

1) Two states: h^(j) (“short-term memory”) and c^(j) (“long-term memory”)

2) Candidate state c~^(j) corresponds to h^(j) in the Vanilla RNN

3) Interactions are mediated by “gates” in (0, 1)^d, which are applied elementwise:

- Forget gate f^(j) decides what information from c^(j-1) should be forgotten

- Input gate i^(j) decides what information from the candidate state c~^(j) should be added to c^(j)

- Output gate o^(j) decides what information from c^(j) should be exposed to h^(j)

4) Each gate and the candidate state have their own parameters

5) “Gradient highway” from c^(j) back to c^(j-1), with no non-linearities or matrix multiplications in between

LSTM DEFINITION
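A single-step numpy sketch of the standard LSTM cell equations; the parameter names (Wf, Vf, bf, ...) are an assumed convention and may differ from the slides' notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM time step. P is a dict with per-gate parameters (Wf, Vf, bf, ...)."""
    f = sigmoid(P["Wf"] @ x + P["Vf"] @ h_prev + P["bf"])        # forget gate
    i = sigmoid(P["Wi"] @ x + P["Vi"] @ h_prev + P["bi"])        # input gate
    o = sigmoid(P["Wo"] @ x + P["Vo"] @ h_prev + P["bo"])        # output gate
    c_tilde = np.tanh(P["Wc"] @ x + P["Vc"] @ h_prev + P["bc"])  # candidate state
    c = f * c_prev + i * c_tilde    # long-term memory: elementwise only -- the "gradient highway"
    h = o * np.tanh(c)              # short-term memory, exposed to the output / next layer
    return h, c
```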
TAGGING (EXAMPLE: PART-OF-SPEECH)

AUTOREGRESSIVE LANGUAGE MODELING

SEQ-TO-SEQ (MACHINE TRANSLATION)

MULTI-LAYER RNNS
- Stack of several RNNs (Vanilla RNNs, LSTMs, GRUs, etc.)
- Each RNN in the stack has its own parameters
- The input vectors of the l’th RNN are the hidden states of the l-1’th RNN
- The input vectors to the first RNN are the word embeddings, as usual
- We can output the hidden states of the last RNN, or a combination (concatenation, average, etc.) of the
states of all RNNs.
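A minimal sketch of such a stack, assuming PyTorch; the vocabulary, embedding, and hidden sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Two stacked LSTMs: the hidden states of layer 1 are the input vectors of layer 2.
emb = nn.Embedding(num_embeddings=10_000, embedding_dim=128)   # word embeddings
rnn = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, batch_first=True)

tokens = torch.randint(0, 10_000, (4, 20))   # batch of 4 sequences, 20 tokens each
hidden_states, _ = rnn(emb(tokens))          # hidden states of the last (top) layer
print(hidden_states.shape)                   # torch.Size([4, 20, 256])
```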

BIDIRECTIONAL RNNS

Two RNNs with separate parameters


Forward RNN runs left-to-right over the input
Backward RNN runs right-to-left over the input

The bidirectional RNN yields two sequences of hidden states: one from the forward RNN and one from the backward RNN.


Question: If we are dealing with a sentence classification task, which states should we use to represent the
sentence?

Concatenate the final state of the forward RNN (at the last word) and the final state of the backward RNN (at the first word), because together they have “seen” the entire sentence.

For a tagging task, represent the j’th word as the concatenation of the forward and backward hidden states at position j.


Question: Can we use a bidirectional RNN for autoregressive language modeling?
- No. In autoregressive language modeling, future inputs must be unknown to the model (since we want to
learn to predict them).
- We could train two separate autoregressive RNNs (one per direction), but we cannot combine their hidden
states before making a prediction
In sequence-to-sequence (e.g., Machine Translation), the encoder can be bidirectional, but the decoder cannot
(same reason)
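A sketch of a bidirectional encoder, again assuming PyTorch; the slicing shows where the forward and backward states live in the concatenated output (sizes are arbitrary).

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10_000, 128)
birnn = nn.LSTM(input_size=128, hidden_size=256, batch_first=True, bidirectional=True)

tokens = torch.randint(0, 10_000, (4, 20))
states, _ = birnn(emb(tokens))   # per position: [forward state ; backward state]
print(states.shape)              # torch.Size([4, 20, 512])

# Tagging: states[:, j, :] already is the concatenation for the j'th word.
# Sentence classification: concatenate the forward state at the last word and the
# backward state at the first word -- together they have "seen" the whole sentence.
sentence_repr = torch.cat([states[:, -1, :256], states[:, 0, 256:]], dim=-1)
print(sentence_repr.shape)       # torch.Size([4, 512])
```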

Attention
ENCODER-DECODER WITH RNNS
LIMITATIONS OF RNNS
In an RNN, at a given point in time j, the information about all past inputs x(1) … x(j) is “crammed” into the state
vector h^(j) (or the pair (h^(j), c^(j)) for an LSTM).


So for long sequences, the state becomes a bottleneck
Especially problematic in encoder-decoder models (e.g., for Machine Translation)
Solution: Attention (Bahdanau et al., 2015) – an architectural modification of the RNN encoder-decoder that allows
the model to “attend to” past encoder states
BAHDANAU ATTENTION, ATTENTION: THE BASIC RECIPE
Ingredients:

- One query vector q
- J key vectors K
- J value vectors V
- A scoring function a that maps a query-key pair to a scalar (“score”); a may be parametrized by parameters theta_a

MODELING LONG-RANGE DEPENDENCIES

RNNenc (classical encoder-decoder) vs. RNNsearch (encoder-decoder with attention)

BLEU score: measure of translation quality (higher is better)
In this comparison, RNNsearch's BLEU score degrades far less than RNNenc's as input sentences get longer.
ATTENTION: THE BASIC RECIPE

Step 1: Apply a to q and all keys to get J scores (one per key): e_j = a(q, k^(j))

Step 2: Turn e into a probability distribution alpha with the softmax function: alpha_j = exp(e_j) / sum_j' exp(e_j')

Step 3: The alpha-weighted sum over the values yields one d_v-dimensional output vector: o = sum_j alpha_j v^(j)

Intuition: alpha_j is how much “attention” the model pays to the value v^(j) when computing the output o.
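The three steps as a small numpy sketch; using the dot product as the scoring function a is just one possible (unparametrized) choice for this example.

```python
import numpy as np

def softmax(e):
    e = e - e.max()                         # numerical stability
    return np.exp(e) / np.exp(e).sum()

def attention(q, K, V, score):
    """q: (d_q,), K: (J, d_k), V: (J, d_v), score: maps (query, key) to a scalar."""
    e = np.array([score(q, k) for k in K])  # Step 1: one score per key
    alpha = softmax(e)                      # Step 2: scores -> probability distribution
    return alpha @ V, alpha                 # Step 3: alpha-weighted sum of the values

dot = lambda q, k: q @ k                    # a simple, parameter-free scoring function

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(6, 4)), rng.normal(size=(6, 3))
out, alpha = attention(q, K, V, dot)
print(out.shape, alpha.sum().round(3))      # (3,) 1.0
```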


ATTENTION: AN ANALOGY
- We have J weather stations on a map

- The keys k^(j) are their geolocations (x, y coordinates)

- The values v^(j) are their current weather conditions (temperature, humidity, etc.)

- The query q is a new geolocation for which we want to estimate the weather conditions

- e_j is the relevance of the j’th station to the query location, and alpha_j is e_j turned into a probability

- The output o is a weighted sum of all known weather conditions, where stations that have a small distance to q
(high alpha_j) get a higher weight
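The analogy in a few lines of numpy; using the negative squared distance as the relevance score e_j is an assumption made for this toy example.

```python
import numpy as np

stations_xy      = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])    # keys: geolocations
stations_weather = np.array([[20.0, 0.6], [22.0, 0.5], [10.0, 0.9]]) # values: temp., humidity
query_xy         = np.array([0.5, 0.0])                              # query: new location

e = -np.sum((stations_xy - query_xy) ** 2, axis=1)        # relevance: nearby stations score higher
alpha = np.exp(e - e.max()) / np.exp(e - e.max()).sum()   # e as a probability (softmax)
estimate = alpha @ stations_weather                       # weighted sum of known weather conditions
print(alpha.round(3), estimate.round(2))                  # the far-away station barely contributes
```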

ATTENTION IN NEURAL NETWORKS

Contrary to our geolocation example, the query, key, and value vectors of a neural network are functions of the
input and of trainable parameters.
So the model learns which keys are relevant for which queries, based on the training data and the loss function.

A PRIMER ON THE TRANSFORMER


The Bahdanau model is still an RNN, just with attention on top.
Another architecture that consists of attention only:
- Transformer: “Attention is all you need”
- No recurrence, just Attention (as the name suggests)
- Better parallelizable than RNNs
No (or few) assumptions are baked into the architecture (no notion of which words are neighbors in the sentence,
sequentiality, etc.)
The lack of prior knowledge often means that the Transformer requires more training data than an RNN to reach a
given level of performance.
But when presented with sufficient data, it usually outperforms RNNs.

Revisiting words: Tokenization


PROCESS OF TEXT TOKENIZATION
Breaking text into smaller units called tokens
- Tokens are discrete text units (letters, words, etc.)
- They are the building blocks of natural language
Encoding each token with unique IDs (numbers)
Performed on the entire corpus of documents: a corpus vocabulary of unique tokens is obtained
Mandatory preprocessing step for most NLP tasks

WHY TOKENIZE?
Computers must understand text: Text encoding is necessary, Encode small rather than large units
Corpus documents can be large and hard to interpret: Working with tokens is easier, Building meaning in bottom-up
fashion
Text may contain extra whitespaces: Tokenization removes them

WORD TOKENIZATION
Most popular type of tokenization: Applied as preprocessing step in most NLP tasks
Considers dictionary words and several delimiters
- Accuracy depends on dictionary used for training
- Tradeoff between accuracy and efficiency
Whitespaces and punctuation symbols are used: They determine word boundaries
Available in many NLP libraries
Example: What is the tallest building? => ‘What’, ‘is’, ‘the’, ‘tallest’, ‘building’, ‘?’
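A minimal regex-based sketch of such a tokenizer; real tokenizers in NLP libraries handle many more cases (abbreviations, numbers, clitics, etc.).

```python
import re

def word_tokenize(text):
    """Split on whitespace, keeping punctuation symbols as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("What is the tallest building?"))
# ['What', 'is', 'the', 'tallest', 'building', '?']
```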

SUBWORD TOKENIZATION
Finer grained than word tokenization: Breaks text into words, Breaks words into smaller units (root, prefix, suffix,
etc.)
More important for highly inflected languages
- Words have many forms
- Prefixes and suffixes are added
- Word meaning and function changes
Helps to reduce out-of-vocabulary words
Example: What is the tallest building? => ‘What’, ‘is’, ‘the’, ‘tall’, ‘est’, ‘build’, ‘ing’, ‘?’

BYTEPAIR ENCODING (BPE)


Originally a data compression algorithm
Considers data at the byte level
Looks at pairs of bytes:
1 Count the occurrences of all byte pairs
2 Find the most frequent byte pair
3 Replace it with an unused byte
Repeat this process until no further compression is possible
Open-vocabulary neural machine translation
Instead of looking at bytes, look at characters
Motivation: Translation is an open-vocabulary problem
Word-level NMT models:
- Handle out-of-vocabulary words by using back-off dictionaries
- Are unable to translate or generate previously unseen words
Using BPE effectively solves this problem
Adapt BPE for word segmentation
Goal: Represent an open vocabulary by a vocabulary of fixed size: use variable-length character sequences
Looking at pairs of characters:
1 Initialize the vocabulary with all characters plus an end-of-word token
2 Count occurrences and find the most frequent character pair, e.g. "A" and "B" (!!! word boundaries are not
crossed) [Side effect: can be run on a dictionary with frequency counts]
3 Replace it with the new token "AB"
Repeat this process until the given vocabulary size |V| is reached
Only one hyperparameter: vocabulary size (initial vocabulary + specified number of merge operations)
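A compact sketch of this learning loop, run on a dictionary with frequency counts as noted above; the toy corpus and the '</w>' end-of-word marker are illustrative choices, and the original subword-nmt implementation differs in its details.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary.

    Each word is a tuple of symbols ending in '</w>'; pairs are only counted
    inside words, so word boundaries are never crossed.
    """
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():               # 1) count all symbol pairs
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                   # 2) most frequent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():                # 3) replace it with the new token "AB"
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab

merges, segmented = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=10)
print(merges[:4])   # the first few learned merge operations
```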
SETUP AND MERGING (EXAMPLE)
WORDPIECE
Voice Search for Japanese and Korean
Specific Problems:
- Asian languages have larger basic character inventories compared to Western languages
- Concept of spaces between words does (partly) not exist
- Many different pronunciations for each character
WordPiece model: data-dependent and does not produce OOVs
1) Initialize the vocabulary with basic Unicode characters (22k for Japanese, 11k for Korean)
!! Spaces are indicated by an underscore attached before (or after) the respective basic unit or word (this increases
the initial |V| by up to a factor of 4)
2) Build a language model using this vocabulary
3) Merge word units that increase the likelihood on the training data the most, when added to the model
Two possible stopping criteria:
Vocabulary size or incremental increase of the likelihood
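A rough sketch of how the next merge could be picked under a unigram approximation of this likelihood criterion; the original WordPiece model builds a full language model and measures the actual likelihood increase, so the scoring below is a simplification. Following the slide, '_' marks a word-initial unit.

```python
from collections import Counter
from math import log

def best_wordpiece_merge(corpus_words):
    """Pick the pair of adjacent units whose merge most increases a unigram
    approximation of the training-data likelihood (a simplified criterion)."""
    unit_counts, pair_counts = Counter(), Counter()
    for units in corpus_words:
        unit_counts.update(units)
        pair_counts.update(zip(units, units[1:]))
    total = sum(unit_counts.values())

    def likelihood_gain(pair):
        a, b = pair
        c = pair_counts[pair]
        # ~ count(ab) * log( P(ab) / (P(a) * P(b)) ): merge units that co-occur far more than chance
        return c * (log(c / total) - log(unit_counts[a] / total) - log(unit_counts[b] / total))

    return max(pair_counts, key=likelihood_gain)

corpus = [list("_hugging"), list("_face"), list("_hug"), list("_hugs")]
print(best_wordpiece_merge(corpus))   # e.g. ('h', 'u')
```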

ABSTRACTS:
1) A Neural Probabilistic Language Model

A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language.
This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested
is likely to be different from all the word sequences seen during training. Traditional but very successful approaches
based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set.
We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each
training sentence to inform the model about an exponential number of semantically neighboring sentences. The
model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for
word sequences, expressed in terms of these representations.

Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is
made of words that are similar (in the sense of having a nearby representation) to words forming an already seen
sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant
challenge. We report on experiments using neural networks for the probability function, showing on two text corpora
that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed
approach allows to take advantage of longer contexts.

Fighting the Curse of Dimensionality with Distributed Representations

In a nutshell, the idea of the proposed approach can be summarized as follows:

1. associate with each word in the vocabulary a distributed word feature vector (a real-valued vector in R^m),

2. express the joint probability function of word sequences in terms of the feature vectors of these words in the
sequence, and

3. learn simultaneously the word feature vectors and the parameters of that probability function.
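A minimal PyTorch sketch of these three steps; the layer sizes are arbitrary, and the original model additionally has direct connections from the input features to the output layer.

```python
import torch
import torch.nn as nn

class NeuralProbabilisticLM(nn.Module):
    """Embed the n-1 context words, concatenate their feature vectors, and map
    them to a probability distribution over the next word."""
    def __init__(self, vocab_size, m=64, context=4, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, m)          # 1) word feature vectors in R^m
        self.mlp = nn.Sequential(                       # 2) probability function for word sequences
            nn.Linear(context * m, hidden), nn.Tanh(),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, context_ids):                     # context_ids: (batch, context)
        x = self.emb(context_ids).flatten(start_dim=1)  # concatenate the feature vectors
        return self.mlp(x).log_softmax(dim=-1)          # log P(next word | context)

# 3) The feature vectors and the probability function are trained jointly,
#    e.g. by minimizing nn.NLLLoss on the log-probabilities above.
lm = NeuralProbabilisticLM(vocab_size=10_000)
print(lm(torch.randint(0, 10_000, (8, 4))).shape)       # torch.Size([8, 10000])
```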

2) Efficient Estimation of Word Representations in Vector Space

We propose two novel model architectures for computing continuous vector representations of words from very
large data sets. The quality of these representations is measured in a word similarity task, and the results are
compared to the previously best performing techniques based on different types of neural networks. We observe
large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality
word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art
performance on our test set for measuring syntactic and semantic word similarities.

Many current NLP systems and techniques treat words as atomic units - there is no notion of similarity between
words, as these are represented as indices in a vocabulary. This choice has several good reasons - simplicity,
robustness and the observation that simple models trained on huge amounts of data outperform complex systems
trained on less data. An example is the popular N-gram model used for statistical language modeling - today, it is
possible to train N-grams on virtually all available data (trillions of words [3]).

However, the simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain
data for automatic speech recognition is limited - the performance is usually dominated by the size of high quality
transcribed speech data (often just millions of words). In machine translation, the existing corpora for many
languages contain only a few billions of words or less. Thus, there are situations where simple scaling up of the basic
techniques will not result in any significant progress, and we have to focus on more advanced techniques.

With progress of machine learning techniques in recent years, it has become possible to train more complex models
on much larger data set, and they typically outperform the simple models. Probably the most successful concept is
to use distributed representations of words [10]. For example, neural network based language models significantly
outperform N-gram models.

3) Enriching Word Vectors with Subword Information

Continuous word representations, trained on large unlabeled corpora are useful for many natural language
processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a
distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare
words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as
a bag of character n-grams. A vector representation is associated to each character n-gram; words being
represented as the sum of these representations. Our method is fast, allowing to train models on large corpora
quickly and allows us to compute word representations for words that did not appear in the training data. We
evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By
comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-
art performance on these tasks.

In this paper, we propose to learn representations for character n-grams, and to represent words as the sum of the
n-gram vectors. Our main contribution is to introduce an extension of the continuous skipgram model (Mikolov et al.,
2013b), which takes into account subword information. We evaluate this model on nine languages exhibiting
different morphologies, showing the benefit of our approach.

Subword model: By using a distinct vector representation for each word, the skipgram model ignores the internal
structure of words. In this section, we propose a different scoring function s, in order to take into account this
information. Each word w is represented as a bag of character n-gram. We add special boundary symbols < and > at
the beginning and end of words, allowing to distinguish prefixes and suffixes from other character sequences. We
also include the word w itself in the set of its n-grams, to learn a representation for each word (in addition to
character n-grams). Taking the word where and n = 3 as an example, it will be represented by the character n-
grams: <wh, whe, her, ere, re> and the special sequence <where>. Note that the sequence <her>, corresponding
to the word her is different from the tri-gram her from the word where. In practice, we extract all the n-grams for n
greater or equal to 3 and smaller or equal to 6. This is a very simple approach, and different sets of n-grams could
be considered, for example taking all prefixes and suffixes.
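A small sketch of this n-gram extraction; the function name and the use of a set are illustrative choices.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary symbols, plus the whole word as a special sequence."""
    w = "<" + word + ">"
    grams = {w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)   # the special sequence for the word itself
    return grams

print(sorted(g for g in char_ngrams("where") if len(g) == 3))
# ['<wh', 'ere', 'her', 're>', 'whe']
```

The word's vector is then the sum of the vectors of these character n-grams plus the special whole-word sequence, as described above.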

4) NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE

Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical
machine translation, the neural machine translation aims at building a single neural network that can be jointly
tuned to maximize the translation performance. The models proposed recently for neural machine translation often
belong to a family of encoder–decoders and encode a source sentence into a fixed-length vector from which a
decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in
improving the performance of this basic encoder–decoder architecture, and propose to extend this by allowing a
model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word,
without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation
performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French
translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with
our intuition.
Most of the proposed neural machine translation models belong to a family of encoder– decoders (Sutskever et al.,
2014; Cho et al., 2014a), with an encoder and a decoder for each language, or involve a language-specific encoder
applied to each sentence whose outputs are then compared (Hermann and Blunsom, 2014). An encoder neural
network reads and encodes a source sentence into a fixed-length vector. A decoder then outputs a translation from
the encoded vector. The whole encoder–decoder system, which consists of the encoder and the decoder for a
language pair, is jointly trained to maximize the probability of a correct translation given a source sentence. A
potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the
necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural
network to cope with long sentences, especially those that are longer than the sentences in the training corpus. Cho
et al. (2014b) showed that indeed the performance of a basic encoder–decoder deteriorates rapidly as the length of
an input sentence increases. In order to address this issue, we introduce an extension to the encoder–decoder model
which learns to align and translate jointly. Each time the proposed model generates a word in a translation, it
(soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The
model then predicts a target word based on the context vectors associated with these source positions and all the
previous generated target words.

The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not
attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence
into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This
frees a neural translation model from having to squash all the information of a source sentence, regardless of its
length, into a fixed-length vector. We show this allows a model to cope better with long sentences. In this paper, we
show that the proposed approach of jointly learning to align and translate achieves significantly improved
translation performance over the basic encoder–decoder approach. The improvement is more apparent with longer
sentences, but can be observed with sentences of any length. On the task of English-to-French translation, the
proposed approach achieves, with a single model, a translation performance comparable, or close, to the
conventional phrase-based system. Furthermore, qualitative analysis reveals that the proposed model finds a
linguistically plausible (soft-)alignment between a source sentence and the corresponding target sentence.
