State of Multilingual and Multimodal NLP
Maha Elbayad
Research Scientist, FAIR (Meta AI)
Language Models 101
1. How to represent text?
2. What is a language model?
3. What is a conditional language model?
A basic setup
Computer vision: Binary image classification.
[Figure: an image is flattened into a vector x of pixel values and fed to a model fθ that outputs 0 or 1 (dog or not).]
https://ai.stanford.edu/~syyeung/cvweb/tutorial1.html
Text representation
Given a vocabulary 𝒱={there, bad, dull, moment, good, boring, awesome, actors, classic,
story, fights, …} of size V=|𝒱|, we will represent a word with a one-hot vector in ℝ^V.
Bag-of-words representation:
[Figure: the review "There is never a dull moment in this movie. Wonderful visuals and good actors." is turned into a vector x of word counts over 𝒱 and fed to a model fθ that outputs 0 or 1 (positive review or not).]
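To make this concrete, here is a minimal sketch (not from the slides) of turning the review above into a bag-of-words count vector over a toy vocabulary; the vocabulary and tokenization are simplified for illustration.

```python
from collections import Counter

# Toy vocabulary and the example review from the slide (lowercased, no punctuation).
vocab = ["there", "bad", "dull", "moment", "good", "boring", "awesome", "actors"]
review = "there is never a dull moment in this movie wonderful visuals and good actors"

counts = Counter(review.split())
bow = [counts[w] for w in vocab]      # one dimension per vocabulary word; word order is lost
print(dict(zip(vocab, bow)))          # e.g. {'there': 1, 'bad': 0, 'dull': 1, ...}
```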
Text representation
Bag-of-words representation:
The issues:
1. Large vocabularies mean large sparse vectors (V ≫ 10^5).
2. Loss of word order information:
   vector("I like tagine but hate couscous") = vector("I hate tagine but like couscous")
3. There is no notion of similarity (see the sketch below):
   dot(W("bad"), W("boring")) = dot(W("bad"), W("awesome")) = 0,
   where dot(x, y) = xᵀy = ||x|| ||y|| cos(θ) is the dot product of vectors x & y
   (xᵀy is high if x & y point in the same direction).
We want our vectors to capture semantic information, i.e. words with the same meaning should get the same vector.
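A quick sketch of issue 3, assuming one-hot vectors over a toy vocabulary: the dot product between any two different words is always zero, so "bad" is no closer to "boring" than to "awesome".

```python
import numpy as np

vocab = ["there", "bad", "dull", "moment", "good", "boring", "awesome"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("bad") @ one_hot("boring"))    # 0.0
print(one_hot("bad") @ one_hot("awesome"))   # 0.0
print(one_hot("bad") @ one_hot("bad"))       # 1.0 (only identical words match)
```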
Text representation
The issues:
1. Large vocabularies mean large sparse vectors.
   ➢ We will use dense embeddings in ℝ^d (d ≪ V).
2. No notion of similarity.
   We want to capture semantic information.
[Figure: a table of d-dimensional dense embedding vectors for the words there, bad, dull, moment, good, boring.]
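A small illustrative sketch of the intended behavior with dense embeddings (the vectors below are made up, not learned): similarities are no longer all zero, so related words can end up closer than unrelated ones.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration.
emb = {
    "bad":     np.array([0.9, -0.8, 0.2]),
    "boring":  np.array([0.8, -0.7, 0.1]),
    "awesome": np.array([-0.9, 0.7, 0.3]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(emb["bad"], emb["boring"]))    # close to 1: similar words
print(cosine(emb["bad"], emb["awesome"]))   # negative: dissimilar words
```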
Text representation
Contextualized word vectors with skip-grams [Mikolov et al., 2013] (a simplified version for illustration purposes)
Since similar words appear in similar contexts, we will represent the word "UM6P" by its
contexts in the training data.
Located in the “Mohammed VI Green City” in Benguerir, near Marrakech, UM6P applies a “learning by doing” approach
The project will leverage the expertise of INNOV’X, an innovation engine launched by UM6P in 2022 dedicated to building innovative and sustainable businesses and ecosystems
The 13th edition of the Roundtables of Arbois and the Mediterranean saw Morocco’s OCP Group and Mohammed VI Polytechnic University (UM6P) showcase progress on green hydrogen
technologies, as well as the importance of such technologies for the institutions.
Morocco’s UM6P Bags Gold Medal at International Exhibition of Inventions in Geneva
UM6P’s Green Energy Park won an innovation award for its contributions to renewable energy research and development.
The UNITY team represented UM6P among ten schools from the African continent that participated in this international event.
This immersive visit at UM6P Campus is part of an "engagement course" for the students, to familiarize them with topics related to talent development and the needs for technology and
innovation in the country.
The designation of UM6P by the members of the steering committee as the winner of the "Coup de Coeur" was motivated by the initiatives taken and carried out to develop professional
equality in the workplace. UM6P was also congratulated and appreciated by the jury for its good practices.
We will maximize the likelihood of observing all the words surrounding the word "UM6P".
Text representation
Contextualized word vectors with skip-grams [Mikolov et al., 2013] (a simplified version for illustration purposes)
[Figure: skip-gram training — the dense embedding of "UM6P" is learned by predicting its surrounding context words.]
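For illustration, here is a from-scratch sketch of simplified skip-gram with negative sampling on a toy corpus built from the example sentences above; the dimensions, learning rate and number of negative samples are arbitrary choices for this sketch, not those of Mikolov et al. (2013).

```python
import numpy as np

corpus = [
    "um6p applies a learning by doing approach".split(),
    "um6p showcases progress on green hydrogen technologies".split(),
    "um6p won an innovation award for renewable energy research".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
w2i = {w: i for i, w in enumerate(vocab)}
V, d, window, lr = len(vocab), 16, 2, 0.05

rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.1, (V, d))    # center-word embeddings
W_out = rng.normal(0, 0.1, (V, d))   # context-word embeddings
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

for _ in range(200):                                   # epochs
    for sent in corpus:
        for t, center in enumerate(sent):
            c = w2i[center]
            for j in range(max(0, t - window), min(len(sent), t + window + 1)):
                if j == t:
                    continue
                o = w2i[sent[j]]
                v_c, v_o = W_in[c].copy(), W_out[o].copy()
                g_pos = sigmoid(v_c @ v_o) - 1.0       # push center & true context together
                W_out[o] -= lr * g_pos * v_c
                W_in[c] -= lr * g_pos * v_o
                for n in rng.integers(0, V, size=2):   # 2 random negative samples
                    g_neg = sigmoid(W_in[c] @ W_out[n])  # push center away from random words
                    v_n = W_out[n].copy()
                    W_out[n] -= lr * g_neg * W_in[c]
                    W_in[c] -= lr * g_neg * v_n

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
# Words sharing contexts with "um6p" should end up comparatively close to it.
print(cos(W_in[w2i["um6p"]], W_in[w2i["innovation"]]))
```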
Text representation
Sequence models
Motivation: How do we combine the word embeddings of a sentence (of arbitrary length) into a
meaningful vector representation?
[Figure: skip-gram word vectors are fed to a sequence model, which outputs context vectors; each context vector represents the current word and all the previous words (the context).]
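As an illustration of what a sequence model does, here is a minimal vanilla-RNN sketch that folds one word embedding at a time into a running context vector; the weights and embeddings are random stand-ins (in practice the inputs could be skip-gram vectors, and real systems use LSTMs or Transformers).

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid = 8, 16
W_xh = rng.normal(0, 0.1, (d_emb, d_hid))
W_hh = rng.normal(0, 0.1, (d_hid, d_hid))

def rnn_context_vectors(word_embeddings):
    h = np.zeros(d_hid)                       # empty context
    contexts = []
    for x in word_embeddings:                 # one word at a time
        h = np.tanh(x @ W_xh + h @ W_hh)      # fold the new word into the context
        contexts.append(h)
    return contexts                           # contexts[t] summarizes words 1..t

sentence = rng.normal(size=(5, d_emb))        # 5 (fake) word embeddings
print(len(rnn_context_vectors(sentence)))     # 5 context vectors
```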
Language models
We compute the probability distribution of the next word xt: p(xt | x1, x2, …, xt-1), and either sample
from it or pick the token with the highest probability (greedy decoding).
Think of it as auto-complete, e.g. a next-word distribution like {working: 0.4, coding: 0.3, sleeping: 0.05, …}.
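A toy sketch of the two decoding choices above, using the made-up auto-complete distribution from the slide (the remaining probability mass is lumped into an "<other>" bucket).

```python
import random

p_next = {"working": 0.4, "coding": 0.3, "sleeping": 0.05, "<other>": 0.25}

greedy = max(p_next, key=p_next.get)                                      # greedy decoding
sampled = random.choices(list(p_next), weights=p_next.values(), k=1)[0]   # sampling
print("greedy:", greedy, "| sampled:", sampled)
```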
Language models
How are they related to sequence models?
We want to model xt | x1, x2, …, xt-1.
If we have a vector that summarizes x1, x2, …, xt-1, then we can use it to predict what xt
could be. This vector is exactly the output of a sequence model (encoder).
[Figure: a sequence model reads the prefix and, at each position, its context vector is used to predict the next word, e.g. "participants" or "working".]
Conditional Language models
Similar to language models, we decompose this probability with the chain rule:
p(x1, …, xT | c) = ∏t p(xt | x1, …, xt-1, c)
i.e. what is the probability of the next word, given the history of previously generated
words AND a conditioning context c?
Similar to LMs, except for the additional context, usually processed with an
encoder. This encoder + the LM decoder form what we call
sequence-to-sequence models.
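A minimal sketch of this chain-rule scoring: log p(x | c) is the sum of per-token log-probabilities. `next_token_probs` is a hypothetical stand-in for the distribution an encoder-decoder model would produce at each step.

```python
import math

def sequence_log_prob(target_tokens, context, next_token_probs):
    """Score a target sequence given a conditioning context c, via the chain rule."""
    logp, prefix = 0.0, []
    for tok in target_tokens:
        probs = next_token_probs(prefix, context)   # p(. | x_<t, c)
        logp += math.log(probs[tok])
        prefix.append(tok)
    return logp

# Toy stand-in model: uniform distribution over a 4-word vocabulary.
vocab = ["the", "dog", "is", "brown"]
uniform = lambda prefix, context: {w: 1.0 / len(vocab) for w in vocab}
print(sequence_log_prob(["the", "dog", "is", "brown"], "le chien est brun", uniform))
```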
Conditional Language models
What NLP tasks do we use conditional LMs for?
● Machine Translation (text-to-text)
● Speech Translation (speech-to-text or speech-to-speech)
● Summarization
NLLB (No Language Left Behind)
North star goal: Develop a general-purpose universal machine translation model
capable of translating between any two languages in various domains.
Google Translate supports 134 languages and Microsoft Translator supports 110; however,
there are more than 3000 written languages in the world. How can we break the
200 barrier?
Our work towards NLLB-200 was structured around 3 axes:
● Mining
● Mixture of Experts
● …
NLLB (No Language Left Behind)
Problem: How can we collect enough training data for low-resource languages?
Bitext data (pairs of source–target sentences) available to us per language
Two techniques to augment our data: (1) Back-translation, (2) Bitext mining.
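As a rough illustration of the first technique, back-translation creates synthetic pairs by translating monolingual target-side text back into the source language with a reverse model; `reverse_model` and its `translate()` method below are hypothetical placeholders, not the NLLB pipeline itself.

```python
def back_translate(monolingual_target_sentences, reverse_model):
    """Build synthetic (source, target) pairs from monolingual target-side text."""
    synthetic_bitext = []
    for tgt in monolingual_target_sentences:
        synthetic_src = reverse_model.translate(tgt)   # target -> source direction
        synthetic_bitext.append((synthetic_src, tgt))  # train the forward model on these
    return synthetic_bitext
```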
NLLB (No Language Left Behind)
Bitext mining:
We use multilingual sentence encoders to embed sentences and find semantically similar ones across different languages –
see LASER (Artetxe and Schwenk, 2019).
[Figure: monolingual vs. multilingual embedding spaces.
Monolingual: sentences with similar meaning are close ("The dog is brown.", "I love eating.", "I enjoy food a lot.").
Multilingual: sentences with similar meaning are close independently of their language ("The dog is brown." / "Le chien est brun", "I love eating." / "J'aime manger", "I enjoy food a lot.").]
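A minimal sketch of how such a multilingual space enables mining: embed both monolingual corpora and pair each source sentence with its nearest target neighbor by cosine similarity. The real NLLB pipeline (stopes) uses LASER3 encoders, FAISS indexing and a margin-based score; `encode` below is a hypothetical stand-in returning one vector per sentence.

```python
import numpy as np

def mine_bitext(src_sentences, tgt_sentences, encode, threshold=0.8):
    """Pair each source sentence with its most similar target sentence."""
    X = encode(src_sentences)                          # (n_src, d)
    Y = encode(tgt_sentences)                          # (n_tgt, d)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # normalize so that
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)   # dot product = cosine
    sims = X @ Y.T
    pairs = []
    for i, j in enumerate(sims.argmax(axis=1)):        # nearest neighbor per source sentence
        if sims[i, j] >= threshold:
            pairs.append((src_sentences[i], tgt_sentences[j], float(sims[i, j])))
    return pairs
```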
NLLB (No Language Left Behind)
Bitext mining:
What if we have a new language and we want to encode it in the same multilingual space?
NLLB (No Language Left Behind)
Problem: How can we collect enough training data for low-resource languages?
With the addition of back-translated and mined data, most of the
low-resource languages cross the threshold of 1M samples.
NLLB (No Language Left Behind)
The strength of multilingual MT is in leveraging knowledge transfer between languages.
However, it also comes with interference between languages.
The MoE solution: a technique that allows for more parameters at an equivalent
computational cost, with sparsely activated weights that can specialize in certain languages (see the sketch below).
[Figure: a dense feed-forward layer vs. a Sparsely Gated Mixture of Experts (MoE) layer, where a gate routes each input (token representation) to experts with weights g1, g2.]
[Figure: improvements of +37% and +44% on Flores-200.]
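A minimal numpy sketch of a sparsely gated MoE feed-forward layer with top-2 routing; in NLLB-200 such layers sit inside Transformer blocks and are trained with load-balancing losses, all of which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 8, 16, 4, 2

# Each expert is a small 2-layer feed-forward network.
W1 = rng.normal(0, 0.1, (n_experts, d_model, d_ff))
W2 = rng.normal(0, 0.1, (n_experts, d_ff, d_model))
W_gate = rng.normal(0, 0.1, (d_model, n_experts))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(token):                        # token: (d_model,) representation
    probs = softmax(token @ W_gate)          # one gate score per expert
    top = np.argsort(probs)[-top_k:]         # route to the top-k experts only
    weights = probs[top] / probs[top].sum()  # renormalized gate values (g1, g2)
    out = np.zeros(d_model)
    for g, e in zip(weights, top):
        h = np.maximum(token @ W1[e], 0.0)   # expert FFN with ReLU
        out += g * (h @ W2[e])
    return out

print(moe_layer(rng.normal(size=d_model)).shape)   # (8,)
```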
NLLB (No Language Left Behind)
Will LLMs replace supervised MMT models?
NLLB (No Language Left Behind)
Open-source!
Codebases
● Modeling: https://github.com/facebookresearch/fairseq/tree/nllb
● LASER3: https://github.com/facebookresearch/LASER/blob/main/nllb
● Stopes (data and mining pipelines): https://github.com/facebookresearch/stopes/
Models
● Final NMT models:
https://github.com/facebookresearch/fairseq/tree/nllb#multilingual-translation-models
● LASER3 encoders: https://github.com/facebookresearch/LASER/blob/main/nllb
Data
● Flores-200, NLLB-Seed, NLLB-MD, Toxicity-200: https://github.com/facebookresearch/flores
● Mined bitexts: https://huggingface.co/datasets/allenai/nllb
Ongoing work on multimodality (speech + text)
We extended LASER sentence embeddings to the speech modality with
SpeechMatrix and T-modules (Duquenne et al. 2022).
Ongoing work on multimodality (speech + text)
Training end-to-end multimodal models (MT, ASR, S2T, S2ST) that are multilingual on
both the source and target sides (e.g. Whisper's speech translation only translates into English).
We can already generate text and speech (units) with a two-pass decoder in UnitY
(Inaguma et al. 2022), as sketched below.
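A schematic sketch of the two-pass idea (all components are hypothetical stand-ins, not the UnitY implementation): a first decoder generates target text from the speech encoder output, and a second decoder generates discrete speech units conditioned on the first pass.

```python
def two_pass_decode(speech, speech_encoder, text_decoder, unit_decoder, vocoder):
    """Schematic two-pass decoding: speech -> text (1st pass) -> speech units (2nd pass)."""
    enc = speech_encoder(speech)               # encode the source speech
    text, text_states = text_decoder(enc)      # 1st pass: target text
    units = unit_decoder(enc, text_states)     # 2nd pass: discrete speech units
    return text, vocoder(units)                # units -> target waveform
```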
Conclusion
There is more to NLP than LLMs.
The underlying modeling basics for these different tasks are the same.
So know your basics! The algorithmic basics.