Introduction to Natural Language Processing (NLP)
Abebe Zerihun Jabessa
Outline
• Introduction
• Linguistic related issues in NLP
• Classical/Statistical Machine Learning
[Timeline figure: 1940s–1950s: Foundational Insights · the symbolic approach focused on formal language theory, AI, and early NLP systems · 1970–1983: Paradigms Multiply · 1983–1993: Empiricism and Finite-State Models · 2000–2008: Rise of Machine Learning]
• Applications:
• Sentiment analysis (e.g., classifying reviews as
positive/negative)
• Spam filtering.
• Logistic Regression: Predicts probabilities for binary/multi-class text classification (a minimal classification sketch follows this list).
• Text classification (e.g., topic labeling).
• Sentiment analysis.
• Support Vector Machines: Finds a hyperplane that best
separates classes; works well with high-dimensional text
data.
• Named Entity Recognition (NER).
• Document classification.
• Random Forests: Ensemble of decision trees, reducing
overfitting and improving accuracy.
• Text classification.
• Sentiment analysis.
• k-Nearest Neighbors (k-NN): Classifies text by finding the
majority label among the k most similar instances in the
dataset.
• Document classification.
• Similarity-based search.
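As referenced above, here is a minimal, hedged sketch of supervised text classification with scikit-learn (TF-IDF features plus logistic regression); the toy reviews and labels are illustrative assumptions, not part of the slides:

```python
# Minimal supervised text-classification sketch (assumes scikit-learn is
# installed; the toy reviews and labels are illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["great movie, loved it", "terrible plot and bad acting",
               "what a wonderful film", "awful, a waste of time"]
train_labels = ["positive", "negative", "positive", "negative"]

# TF-IDF turns raw text into feature vectors; logistic regression then
# predicts class probabilities for each document.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

print(clf.predict(["an absolutely wonderful experience"]))  # expected: ['positive']
```

The same pipeline shape works for the other supervised algorithms listed: swapping LogisticRegression for an SVM, random forest, or k-NN classifier only changes the final step.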
Unsupervised ML Algorithms
• k-Means Clustering: Groups text into clusters based on similarity in vector space (see the clustering sketch after this list).
• Document clustering (e.g., grouping news articles by topics).
• Hierarchical Clustering: Builds a hierarchy of clusters,
useful for visualizing text groupings (e.g., dendrograms).
• Document clustering.
• Grouping similar sentences.
• Word2Vec (Skip-Gram): Learns vector representations of words by predicting a word's context from the word (Skip-Gram) or the word from its context (CBOW).
• Learning word embeddings.
• Semantic analysis.
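As referenced above, a minimal, hedged k-Means document-clustering sketch with scikit-learn; the toy headlines are illustrative assumptions:

```python
# Minimal k-Means document-clustering sketch (assumes scikit-learn;
# the toy headlines are illustrative).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stock markets fall as interest rates rise",
        "central bank raises interest rates again",
        "the home team wins the championship final",
        "the striker scores twice in the final"]

X = TfidfVectorizer().fit_transform(docs)                # documents as TF-IDF vectors
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                                     # e.g., [0 0 1 1]: finance vs. sports
```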
Reinforcement Learning (RL)
• Q-Learning: An RL algorithm where an agent learns optimal actions by interacting with an environment (a toy sketch follows this list).
• Dialogue systems (e.g., learning optimal responses).
• Policy Gradient Methods: Train a model to maximize rewards for generating high-quality responses or text sequences.
• Text generation (e.g., optimizing fluency and coherence).
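As referenced above, a toy tabular Q-learning sketch; the two-state environment is a hypothetical stand-in for the much richer state and action spaces of real dialogue systems:

```python
# Tabular Q-learning on a hypothetical 2-state, 2-action toy problem
# (illustrative only; real dialogue systems use far larger state/action spaces).
import random

n_states, n_actions = 2, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount factor, exploration rate

def step(state, action):
    # Hypothetical environment: action 1 yields reward 1 and moves to the other state.
    reward = 1.0 if action == 1 else 0.0
    return 1 - state, reward

state = 0
for _ in range(500):
    # Epsilon-greedy action selection: explore sometimes, otherwise exploit Q.
    if random.random() < epsilon:
        action = random.randrange(n_actions)
    else:
        action = max(range(n_actions), key=lambda a: Q[state][a])
    next_state, reward = step(state, action)
    # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = next_state

print(Q)  # action 1 should end up with the higher value in both states
```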
Deep Learning / Generative
Learning
• Deep Learning (DL) has revolutionized Natural Language Processing (NLP) by enabling machines to understand, analyze, and generate human language.
• Deep learning models can grasp the nuances of language, like idioms, context, and tone, by learning patterns directly from the data.
• Benefits:
• Automatic feature extraction.
• Scalability: DL models improve as they are fed more data.
• Understanding context: They capture word meanings based on context (e.g., "bank" in a financial or river setting).
Deep Learning Architectures
• Recurrent Neural Networks (RNNs): Capture temporal dependencies in sequences.
• Struggle with long sequences due to "vanishing gradients".
• Use case: text generation.
• Vanishing gradients?
• A phenomenon that occurs during the training of deep neural networks, where the gradients of the loss function with respect to the weights become very small (approach zero).
• This makes it difficult for the model to update its weights effectively, slowing
down or even halting learning in the earlier layers of the network.
• Long Short-Term Memory Networks (LSTMs): A special type of RNN with memory gates.
• Useful for tasks like language modeling or text generation (a minimal PyTorch sketch follows this slide).
• Gated Recurrent Units (GRUs): A simplified version of LSTMs with fewer gates, making them computationally more efficient while still susceptible to the vanishing-gradient problem on very long sequences.
• Use case: Predicting stock prices based on historical
data.
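The bullets above describe recurrent architectures in words; here is a minimal, hedged LSTM text-classifier sketch in PyTorch (the vocabulary size, dimensions, and random batch are illustrative assumptions, not from the slides):

```python
# Minimal LSTM text-classifier sketch in PyTorch (assumes torch is installed;
# vocabulary size, dimensions, and the random batch are illustrative).
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # token ids -> vectors
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)       # last hidden state -> class scores

    def forward(self, token_ids):
        embedded = self.embed(token_ids)                   # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)                  # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])                            # (batch, num_classes)

model = LSTMClassifier()
dummy_batch = torch.randint(0, 1000, (8, 20))              # 8 sequences of 20 token ids
print(model(dummy_batch).shape)                            # torch.Size([8, 2])
```

Replacing nn.LSTM with nn.GRU gives the GRU variant described above with no other code changes.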
Do LSTMs Solve the Vanishing Gradient Problem?
• LSTMs Address Vanishing Gradients
• Designed to handle long-term dependencies in sequential
data.
• Use memory cells and gates (input, forget, output) to regulate
gradient flow.
• Gradient Flow Mechanism:
• Cell state allows gradients to flow backward without
significant diminishment.
• Not Completely Immune:
• LSTM can still face exploding gradients.
Encoder-Decoder Architecture
• A general framework for sequence-to-sequence (seq2seq) tasks.
• Widely used with Recurrent Neural Networks (RNNs), LSTMs, and GRUs for tasks like machine translation.
• Encoder: Processes the input sequence and encodes it into a fixed-length vector.
• Captures the important features of the input sequence.
• Decoder: Takes the encoded vector (or vectors) and generates the output sequence.
• Encoder: RNN/LSTM/GRU
• Decoder: RNN/LSTM/GRU (a GRU-based sketch follows this slide)
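As a hedged illustration of the encoder-decoder idea, the following PyTorch sketch wires a GRU encoder to a GRU decoder; the dimensions and random token batches are assumptions, and a real seq2seq model would add attention, teacher forcing, and training code:

```python
# Minimal GRU encoder-decoder sketch in PyTorch (illustrative dimensions only).
import torch
import torch.nn as nn

hidden_dim, src_vocab, tgt_vocab = 128, 1000, 1200

encoder_embed = nn.Embedding(src_vocab, hidden_dim)
encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

decoder_embed = nn.Embedding(tgt_vocab, hidden_dim)
decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
output_proj = nn.Linear(hidden_dim, tgt_vocab)

src = torch.randint(0, src_vocab, (4, 12))         # 4 source sentences of 12 tokens
tgt = torch.randint(0, tgt_vocab, (4, 9))          # 4 target sentences of 9 tokens

# Encoder compresses the source into a fixed-length context (its final hidden state).
_, context = encoder(encoder_embed(src))           # context: (1, 4, hidden_dim)

# Decoder is initialized with the context and unrolled over the target sequence.
dec_out, _ = decoder(decoder_embed(tgt), context)  # (4, 9, hidden_dim)
logits = output_proj(dec_out)                      # (4, 9, tgt_vocab)
print(logits.shape)
```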
• Transformers: The backbone of modern NLP. Unlike RNNs, they process all words simultaneously.
• Key concept: Self-attention.
• Each word looks at all other words to decide which ones are
most important.
• The Transformer is based on an encoder-decoder structure.
• No recurrent layers: parallel processing (a self-attention sketch follows this slide).
• Positional encoding.
• Use case: Machine translation, text summarization,
question answering and more.
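The self-attention idea can be sketched directly. This is single-head, scaled dot-product attention computed with PyTorch; the dimensions are illustrative assumptions, and multi-head projection and masking are omitted for clarity:

```python
# Scaled dot-product self-attention sketch (the core operation inside a Transformer).
import torch
import torch.nn.functional as F
from torch import nn

batch, seq_len, d_model = 2, 5, 16
x = torch.randn(batch, seq_len, d_model)            # word representations

W_q, W_k, W_v = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

Q, K, V = W_q(x), W_k(x), W_v(x)
scores = Q @ K.transpose(-2, -1) / d_model ** 0.5    # how strongly each word attends to the others
weights = F.softmax(scores, dim=-1)                  # (batch, seq_len, seq_len)
attended = weights @ V                               # context-aware word representations
print(attended.shape)                                # torch.Size([2, 5, 16])
```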
Transformer-based Pre-trained Models
• BERT (Bidirectional Encoder Representations from Transformers): Uses masked language modelling (a usage sketch for these models follows the list).
• Text classification
• Question answering
• Text summarization
• Translation.
• GPT (Generative Pre-trained Transformer)
• Text Generation
• Text completion
• T5 (Text-to-Text Transfer Transformer)
• Translation
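As a hedged usage sketch, the Hugging Face transformers library exposes these families of pre-trained models through its pipeline API; the model names below are common public checkpoints chosen for illustration (not taken from the slides) and are downloaded on first use:

```python
# Hedged sketch of using pre-trained Transformer models via Hugging Face transformers.
from transformers import pipeline

# BERT-style encoder fine-tuned for sentiment classification.
classifier = pipeline("sentiment-analysis")
print(classifier("This lecture on NLP was very clear."))

# GPT-2 for open-ended text generation / completion.
generator = pipeline("text-generation", model="gpt2")
print(generator("Natural language processing is", max_new_tokens=15)[0]["generated_text"])

# T5 treats every task, including translation, as text-to-text.
translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("Machine learning is fun."))
```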
What is Masked LM?
• A type of language model used in natural language processing (NLP) tasks, particularly for pre-training deep learning models like BERT.
• In a Masked LM, some percentage of the input tokens (words) are randomly replaced with a special token (usually denoted as [MASK]).
• The model is trained to predict the original token that was masked.
• Example: Input sentence
• "The quick brown fox jumps over the lazy dog."
• A portion of this might be masked, such as:
• "The quick brown [MASK] jumps over the lazy dog."
• The model's task is to predict that the masked token is "fox" based on the surrounding context, as in the sketch below.
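A hedged fill-mask sketch of the example above, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint:

```python
# Fill-mask sketch with a BERT checkpoint (downloaded on first use).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The quick brown [MASK] jumps over the lazy dog."):
    print(pred["token_str"], round(pred["score"], 3))   # "fox" should rank highly
```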
Activation Functions
• Activation functions introduce non-linearity, enabling the network to
learn complex patterns.
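A minimal NumPy sketch of a few standard activation functions (the specific functions shown are common examples, not an exhaustive list from the slides):

```python
# Common activation functions written directly with NumPy for illustration.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                   # squashes values into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)           # zero for negatives, identity for positives

def softmax(x):
    e = np.exp(x - np.max(x))           # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x), softmax(x))
```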
Normalization
• Normalization makes training more stable and efficient
by adjusting inputs or intermediate values to have
specific properties.
• Batch Normalization (BatchNorm):
• Normalizes the input to each layer using the mean and
variance of the mini-batch.
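A minimal NumPy sketch of the BatchNorm computation described above; gamma and beta stand in for the learnable scale and shift parameters, which in a real network are trained:

```python
# Batch normalization of a mini-batch, written out with NumPy for illustration.
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mean = x.mean(axis=0)               # per-feature mean over the mini-batch
    var = x.var(axis=0)                 # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta         # learnable scale and shift

batch = np.random.randn(32, 4) * 10 + 5  # 32 examples, 4 features, far from zero mean
normalized = batch_norm(batch)
print(normalized.mean(axis=0).round(3), normalized.std(axis=0).round(3))  # ~0 and ~1
```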
Optimization
• Optimization is the process of tweaking the network's
parameters (weights, biases) to minimize the loss
function.
• The loss function measures how far the model's predictions are from the true values.
• Optimizers
• Gradient descent: SGD, mini-batch SGD, SGD with momentum (a plain gradient-descent sketch follows this list).
• RMSProp
• Adam
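To make the update rule concrete, here is plain gradient descent on a one-parameter toy loss; SGD, RMSProp, and Adam all build on this same w ← w − learning_rate · dL/dw step (the loss function is an illustrative assumption):

```python
# Plain gradient descent on the toy loss L(w) = (w - 3)^2.
learning_rate = 0.1
w = 5.0                                 # arbitrary starting weight

for step in range(50):
    grad = 2 * (w - 3.0)                # gradient of L(w) with respect to w
    w -= learning_rate * grad           # move against the gradient

print(round(w, 4))                      # converges toward the minimum at w = 3
```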
Hyper-parameters
• Hyper-parameters are the settings you configure before
training the model.
• Tuned via grid search or random search (a grid-search sketch follows this list).
• Examples:
• Batch size
• Learning rate
• Dropout rate
• Number of layers
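A hedged grid-search sketch over two hyper-parameters of a text classifier, assuming scikit-learn; the toy corpus and the parameter grid are illustrative assumptions:

```python
# Grid search over illustrative hyper-parameters of a TF-IDF + logistic regression pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = ["good", "great film", "excellent acting", "loved it",
         "bad", "terrible film", "awful acting", "hated it"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],   # candidate n-gram settings
    "clf__C": [0.1, 1.0, 10.0],               # candidate regularization strengths
}

search = GridSearchCV(pipe, param_grid, cv=2)  # small cv because the corpus is tiny
search.fit(texts, labels)
print(search.best_params_)
```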
Super Parameters
• "Super-parameters" is an informal term sometimes used to refer to parameters that control the search or optimization of hyper-parameters.
• Super-parameters might govern how hyper-parameters are chosen or explored in a given model search.