Transformers in Machine Learning
• The Transformer is a neural network architecture for machine learning tasks, particularly in natural language processing (NLP) and computer vision (CV).
• In 2017, Vaswani et al. introduced the Transformer architecture in the paper “Attention Is All You Need”. This section explores the architecture, workings and applications of Transformers.
• The Transformer architecture uses self-attention to process an entire sentence at once instead of word by word. This parallel processing helps overcome the challenges seen in sequential models like RNNs and LSTMs.
• NB: RNNs suffer from the vanishing gradient problem, which leads to long-term memory loss. RNNs process text sequentially, meaning they analyze words one at a time, as the sketch below illustrates.
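To make this sequential bottleneck concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass; the weight names, sizes and random inputs are illustrative assumptions, not from any particular library:

```python
import numpy as np

# Minimal sketch of why an RNN is sequential: each hidden state depends
# on the previous one, so tokens cannot be processed in parallel.
rng = np.random.default_rng(0)
d_in, d_hidden = 8, 16                       # illustrative sizes
W_xh = rng.normal(size=(d_in, d_hidden)) * 0.1
W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.1

def rnn_forward(tokens):
    h = np.zeros(d_hidden)
    for x in tokens:                         # one word at a time: the bottleneck
        h = np.tanh(x @ W_xh + h @ W_hh)     # early-word info must survive every step
    return h

sentence = rng.normal(size=(20, d_in))       # 20 toy "word embeddings"
print(rnn_forward(sentence).shape)           # (16,)
```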
Need for the Transformer Model in Machine Learning
For example, in the sentence “XYZ went to France in 2019 when there were no cases of COVID and there he met the
president of that country”, the phrase “that country” refers to “France”.
However, an RNN would struggle to link “that country” to “France”, since processing each word in sequence causes it to lose
context over long sentences. This limitation prevents RNNs from understanding the full meaning of the sentence.
While the added memory cells in LSTMs (Long Short-Term Memory networks) helped address the vanishing gradient issue,
they still process words one by one. This sequential processing means LSTMs can’t analyze an entire sentence at once.
For instance, the word “point” has different meanings in these two sentences:
•“The needle has a sharp point.” (Point = Tip)
•“It is not polite to point at people.” (Point = Gesture)
Traditional models struggle with this context dependence, whereas the Transformer model,
through its self-attention mechanism, processes the entire sentence in parallel, addressing
these issues and making it significantly more effective at understanding context.
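As a rough illustration of that parallelism, the following NumPy sketch computes scaled dot-product self-attention over a whole toy sentence in one matrix product; the dimensions and random weights are assumptions made up for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Every token attends to every other token at once, so a word like
    # "point" can draw context from the rest of the sentence in parallel.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq, seq) pairwise relevance
    return softmax(scores) @ V                # context-aware token vectors

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16                      # e.g. six tokens of a toy sentence
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (6, 16)
```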
Architecture
• Let’s consider an example of machine translation. Imagine that we’re translating a French
sentence (‘Je suis étudiant’) into English (‘I am a student’). Let’s begin our exploration by
treating the model as a black box.
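To see this black-box view in code, here is a hedged usage sketch with the Hugging Face transformers library; the checkpoint name Helsinki-NLP/opus-mt-fr-en is an assumption, and any French-to-English translation model would serve:

```python
from transformers import pipeline

# Treat the whole Transformer as a black box: French in, English out.
# The model checkpoint below is an assumption chosen for illustration.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
print(translator("Je suis étudiant")[0]["translation_text"])
# expected output, roughly: "I am a student"
```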
Decoder
• The primary function of the decoder is to take the encoded input and
generate the output tokens one at a time, in an iterative process.
• The decoding process begins with the first step, where the decoder
receives the encoder’s output along with a special start token (like
<start>). Since no output tokens have been generated yet, this first step
relies solely on the input from the encoder.
• As the model moves through later steps, it generates the next token by
using both the encoder’s output and the sequence of previously generated
tokens. The decoder uses a masked self-attention mechanism, ensuring
that each token can only attend to tokens that came before it, preserving
the sequential nature of the output.
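A minimal sketch of the causal (look-ahead) mask behind masked self-attention, assuming the mask is added to the attention scores before the softmax:

```python
import numpy as np

def causal_mask(seq_len):
    # -inf above the diagonal blocks attention to future positions;
    # after the softmax, those positions receive exactly zero weight.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.zeros((4, 4)) + causal_mask(4)   # pretend attention scores
print(scores)                                # row i only "sees" columns j <= i
```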
Through this process, the model generates text in a step-by-step manner, ensuring each word is
contextually relevant based on the input and previously generated tokens.
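Putting the steps together, here is a minimal greedy-decoding sketch; model is a hypothetical callable standing in for the full encoder-decoder, not a real library API:

```python
def greedy_decode(model, encoder_output, start_token, end_token, max_len=50):
    generated = [start_token]
    for _ in range(max_len):
        # the decoder sees the encoder output plus everything generated so far
        scores = model(encoder_output, generated)
        next_token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if next_token == end_token:
            break
        generated.append(next_token)
    return generated[1:]                      # drop the <start> token

# Toy stand-in model: favors token 3 twice, then the end token (0).
toy = lambda enc, prefix: [0.9, 0.1, 0.1, 0.2] if len(prefix) >= 3 else [0.1, 0.2, 0.3, 0.9]
print(greedy_decode(toy, encoder_output=None, start_token=1, end_token=0))  # [3, 3]
```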