State of Multilingual and Multimodal NLP
Maha Elbayad
Research Scientist, FAIR (Meta AI)
Language Models 101
1. How to represent text?
2. What is a language model?
3. What is a conditional language model?
A basic setup
Computer vision: Binary image classification.
[Figure: an image is flattened into a vector x of pixel values and fed to a model fθ that outputs 0 or 1 (dog or not).]
https://ai.stanford.edu/~syyeung/cvweb/tutorial1.html
Text representation
Given a vocabulary 𝒱={there, bad, dull, moment, good, boring, awesome, actors, classic,
story, fights, …} of size V=|𝒱|, we will represent a word with a one-hot vector in ℝ^V.
Bag-of-words representation:
[Figure: the review "There is never a dull moment in this movie. Wonderful visuals and good actors." is turned into a vector x of word counts over 𝒱 and fed to a model fθ that outputs 0 or 1 (positive review or not).]
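To make this concrete, here is a minimal sketch (not from the slides) of turning the review above into a bag-of-words count vector over a toy vocabulary; the vocabulary and tokenization are simplified for illustration.

```python
from collections import Counter

# Toy vocabulary and the example review from the slide (lowercased, no punctuation).
vocab = ["there", "bad", "dull", "moment", "good", "boring", "awesome", "actors"]
review = "there is never a dull moment in this movie wonderful visuals and good actors"

counts = Counter(review.split())
bow = [counts[w] for w in vocab]      # one dimension per vocabulary word; word order is lost
print(dict(zip(vocab, bow)))          # e.g. {'there': 1, 'bad': 0, 'dull': 1, ...}
```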
Text representation
Bag-of-words representation:
The issues:
1. Large vocabularies mean large sparse vectors (V ≫ 10^5).
2. Loss of word order information:
   vector("I like tagine but hate couscous") = vector("I hate tagine but like couscous")
3. There is no notion of similarity (see the sketch below):
   dot(W("bad"), W("boring")) = dot(W("bad"), W("awesome")) = 0,
   where dot(x, y) = xᵀy = ||x|| ||y|| cos(θ) is the dot product of vectors x & y
   (xᵀy is high if x & y point in the same direction).
We want our vectors to capture semantic information, i.e. words with the same meaning should get the same vector.
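A quick sketch of issue 3, assuming one-hot vectors over a toy vocabulary: the dot product between any two different words is always zero, so "bad" is no closer to "boring" than to "awesome".

```python
import numpy as np

vocab = ["there", "bad", "dull", "moment", "good", "boring", "awesome"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("bad") @ one_hot("boring"))    # 0.0
print(one_hot("bad") @ one_hot("awesome"))   # 0.0
print(one_hot("bad") @ one_hot("bad"))       # 1.0 (only identical words match)
```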
Text representation
The issues:
1. Large vocabularies mean large sparse vectors.
   ➢ We will use dense embeddings in ℝ^d (d ≪ V).
2. No notion of similarity.
   We want to capture semantic information.
[Figure: a table of d-dimensional dense embedding vectors for the words there, bad, dull, moment, good, boring.]
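A small illustrative sketch of the intended behavior with dense embeddings (the vectors below are made up, not learned): similarities are no longer all zero, so related words can end up closer than unrelated ones.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration.
emb = {
    "bad":     np.array([0.9, -0.8, 0.2]),
    "boring":  np.array([0.8, -0.7, 0.1]),
    "awesome": np.array([-0.9, 0.7, 0.3]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(emb["bad"], emb["boring"]))    # close to 1: similar words
print(cosine(emb["bad"], emb["awesome"]))   # negative: dissimilar words
```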
Text representation
Contextualized word vectors with skip-grams [Mikolov et al., 2013] (a simplified version for illustration purposes)
Since similar words appear in similar contexts, we will represent the word "UM6P" by its
contexts in the training data.
Located in the “Mohammed VI Green City” in Benguerir, near Marrakech, UM6P applies a “learning by doing” approach
The project will leverage the expertise of INNOV’X, an innovation engine launched by UM6P in 2022 dedicated to building innovative and sustainable businesses and ecosystems
The 13th edition of the Roundtables of Arbois and the Mediterranean saw Morocco’s OCP Group and Mohammed VI Polytechnic University (UM6P) showcase progress on green hydrogen
technologies, as well as the importance of such technologies for the institutions.
Morocco’s UM6P Bags Gold Medal at International Exhibition of Inventions in Geneva
UM6P’s Green Energy Park won an innovation award for its contributions to renewable energy research and development.
The UNITY team represented UM6P among ten schools from the African continent that participated in this international event.
This immersive visit at UM6P Campus is part of an "engagement course" for the students, to familiarize them with topics related to talent development and the needs for technology and
innovation in the country.
The designation of UM6P by the members of the steering committee as the winner of the "Coup de Coeur" was motivated by the initiatives taken and carried out to develop professional
equality in the workplace. UM6P was also congratulated and appreciated by the jury for its good practices.
We will maximize the likelihood of observing all the words surrounding the word "UM6P".
Text representation
Contextualized word vectors with skip-grams [Mikolov et al., 2013] (a simplified version for illustration purposes)
[Figure: skip-gram training — the dense embedding of "UM6P" is learned by predicting its surrounding context words.]
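For illustration, here is a from-scratch sketch of simplified skip-gram with negative sampling on a toy corpus built from the example sentences above; the dimensions, learning rate and number of negative samples are arbitrary choices for this sketch, not those of Mikolov et al. (2013).

```python
import numpy as np

corpus = [
    "um6p applies a learning by doing approach".split(),
    "um6p showcases progress on green hydrogen technologies".split(),
    "um6p won an innovation award for renewable energy research".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
w2i = {w: i for i, w in enumerate(vocab)}
V, d, window, lr = len(vocab), 16, 2, 0.05

rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.1, (V, d))    # center-word embeddings
W_out = rng.normal(0, 0.1, (V, d))   # context-word embeddings
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

for _ in range(200):                                   # epochs
    for sent in corpus:
        for t, center in enumerate(sent):
            c = w2i[center]
            for j in range(max(0, t - window), min(len(sent), t + window + 1)):
                if j == t:
                    continue
                o = w2i[sent[j]]
                v_c, v_o = W_in[c].copy(), W_out[o].copy()
                g_pos = sigmoid(v_c @ v_o) - 1.0       # push center & true context together
                W_out[o] -= lr * g_pos * v_c
                W_in[c] -= lr * g_pos * v_o
                for n in rng.integers(0, V, size=2):   # 2 random negative samples
                    g_neg = sigmoid(W_in[c] @ W_out[n])  # push center away from random words
                    v_n = W_out[n].copy()
                    W_out[n] -= lr * g_neg * W_in[c]
                    W_in[c] -= lr * g_neg * v_n

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
# Words sharing contexts with "um6p" should end up comparatively close to it.
print(cos(W_in[w2i["um6p"]], W_in[w2i["innovation"]]))
```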
Text representation
Sequence models
Motivation: How do we combine the word embeddings of a sentence (of arbitrary length) into a
meaningful vector representation?
[Figure: skip-gram word vectors are fed to a sequence model, which outputs context vectors; each context vector represents the current word and all the previous words (the context).]
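As an illustration of what a sequence model does, here is a minimal vanilla-RNN sketch that folds one word embedding at a time into a running context vector; the weights and embeddings are random stand-ins (in practice the inputs could be skip-gram vectors, and real systems use LSTMs or Transformers).

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid = 8, 16
W_xh = rng.normal(0, 0.1, (d_emb, d_hid))
W_hh = rng.normal(0, 0.1, (d_hid, d_hid))

def rnn_context_vectors(word_embeddings):
    h = np.zeros(d_hid)                       # empty context
    contexts = []
    for x in word_embeddings:                 # one word at a time
        h = np.tanh(x @ W_xh + h @ W_hh)      # fold the new word into the context
        contexts.append(h)
    return contexts                           # contexts[t] summarizes words 1..t

sentence = rng.normal(size=(5, d_emb))        # 5 (fake) word embeddings
print(len(rnn_context_vectors(sentence)))     # 5 context vectors
```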
Language models
We compute the probability distribution of the next word xt: p(xt | x1, x2, …, xt-1), and either sample
from it or pick the token with the highest probability (greedy decoding).
Think of it as auto-complete, e.g. a next-word distribution like {working: 0.4, coding: 0.3, sleeping: 0.05, …}.
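A toy sketch of the two decoding choices above, using the made-up auto-complete distribution from the slide (the remaining probability mass is lumped into an "<other>" bucket).

```python
import random

p_next = {"working": 0.4, "coding": 0.3, "sleeping": 0.05, "<other>": 0.25}

greedy = max(p_next, key=p_next.get)                                      # greedy decoding
sampled = random.choices(list(p_next), weights=p_next.values(), k=1)[0]   # sampling
print("greedy:", greedy, "| sampled:", sampled)
```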
Language models
How are they related to sequence models?
We want to model xt | x1, x2, …, xt-1.
If we have a vector that summarizes x1, x2, …, xt-1, then we can use it to predict what xt
could be. This vector is exactly the output of a sequence model (encoder).
[Figure: a sequence model reads the prefix and, at each position, its context vector is used to predict the next word, e.g. "participants" or "working".]
Conditional Language models
Similar to language models, we decompose this probability with the chain rule:
p(x1, …, xT | c) = ∏t p(xt | x1, …, xt-1, c)
i.e. what is the probability of the next word, given the history of previously generated
words AND a conditioning context c?
Similar to LMs, except for the additional context, usually processed with an
encoder. This encoder + the LM decoder form what we call
sequence-to-sequence models.
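A minimal sketch of this chain-rule scoring: log p(x | c) is the sum of per-token log-probabilities. `next_token_probs` is a hypothetical stand-in for the distribution an encoder-decoder model would produce at each step.

```python
import math

def sequence_log_prob(target_tokens, context, next_token_probs):
    """Score a target sequence given a conditioning context c, via the chain rule."""
    logp, prefix = 0.0, []
    for tok in target_tokens:
        probs = next_token_probs(prefix, context)   # p(. | x_<t, c)
        logp += math.log(probs[tok])
        prefix.append(tok)
    return logp

# Toy stand-in model: uniform distribution over a 4-word vocabulary.
vocab = ["the", "dog", "is", "brown"]
uniform = lambda prefix, context: {w: 1.0 / len(vocab) for w in vocab}
print(sequence_log_prob(["the", "dog", "is", "brown"], "le chien est brun", uniform))
```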
Conditional Language models
What NLP tasks do we use conditional LMs for?
● Machine Translation (text-to-text)
● Speech Translation (speech-to-text or speech-to-speech)
● Summarization
NLLB (No Language Left Behind)
North star goal: Develop a general-purpose universal machine translation model
capable of translating between any two languages in various domains.
Google Translate supports 134 languages and Microsoft Translator supports 110; however,
there are more than 3000 written languages in the world. How can we break the
200 barrier?
Our work towards NLLB-200 was structured around 3 axes:
● Mining
● Mixture of Experts
● …
NLLB (No Language Left Behind)
Problem: How can we collect enough training data for low-resource languages?
Bitext data (pairs of source–target sentences) available to us per language
Two techniques to augment our data: (1) Back-translation, (2) Bitext mining.
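As a rough illustration of the first technique, back-translation creates synthetic pairs by translating monolingual target-side text back into the source language with a reverse model; `reverse_model` and its `translate()` method below are hypothetical placeholders, not the NLLB pipeline itself.

```python
def back_translate(monolingual_target_sentences, reverse_model):
    """Build synthetic (source, target) pairs from monolingual target-side text."""
    synthetic_bitext = []
    for tgt in monolingual_target_sentences:
        synthetic_src = reverse_model.translate(tgt)   # target -> source direction
        synthetic_bitext.append((synthetic_src, tgt))  # train the forward model on these
    return synthetic_bitext
```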
NLLB (No Language Left Behind)
Bitext mining:
We use multilingual sentence encoders to embed sentences and find semantically similar ones across different languages –
see LASER (Artetxe and Schwenk, 2019).
[Figure: monolingual vs. multilingual embedding spaces.
Monolingual: sentences with similar meaning are close ("The dog is brown.", "I love eating.", "I enjoy food a lot.").
Multilingual: sentences with similar meaning are close independently of their language ("The dog is brown." / "Le chien est brun", "I love eating." / "J'aime manger", "I enjoy food a lot.").]
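A minimal sketch of how such a multilingual space enables mining: embed both monolingual corpora and pair each source sentence with its nearest target neighbor by cosine similarity. The real NLLB pipeline (stopes) uses LASER3 encoders, FAISS indexing and a margin-based score; `encode` below is a hypothetical stand-in returning one vector per sentence.

```python
import numpy as np

def mine_bitext(src_sentences, tgt_sentences, encode, threshold=0.8):
    """Pair each source sentence with its most similar target sentence."""
    X = encode(src_sentences)                          # (n_src, d)
    Y = encode(tgt_sentences)                          # (n_tgt, d)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # normalize so that
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)   # dot product = cosine
    sims = X @ Y.T
    pairs = []
    for i, j in enumerate(sims.argmax(axis=1)):        # nearest neighbor per source sentence
        if sims[i, j] >= threshold:
            pairs.append((src_sentences[i], tgt_sentences[j], float(sims[i, j])))
    return pairs
```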
NLLB (No Language Left Behind)
Bitext mining:
What if we have a new language and we want to encode it in the same multilingual space?
NLLB (No Language Left Behind)
Problem: How can we collect enough training data for low-resource languages?
With the addition of back-translated and mined data, most of the
low-resource languages cross the threshold of 1M samples.
NLLB (No Language Left Behind)
The strength of multilingual MT is in leveraging knowledge transfer between languages.
However, it also comes with interference between languages.
The MoE solution: a technique that allows for more parameters at an equivalent
computational cost, with sparsely activated weights that can specialize in certain languages (see the sketch below).
[Figure: a dense feed-forward layer vs. a Sparsely Gated Mixture of Experts (MoE) layer, where a gate routes each input (token representation) to experts with weights g1, g2.]
[Figure: improvements of +37% and +44% on Flores-200.]
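A minimal numpy sketch of a sparsely gated MoE feed-forward layer with top-2 routing; in NLLB-200 such layers sit inside Transformer blocks and are trained with load-balancing losses, all of which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 8, 16, 4, 2

# Each expert is a small 2-layer feed-forward network.
W1 = rng.normal(0, 0.1, (n_experts, d_model, d_ff))
W2 = rng.normal(0, 0.1, (n_experts, d_ff, d_model))
W_gate = rng.normal(0, 0.1, (d_model, n_experts))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(token):                        # token: (d_model,) representation
    probs = softmax(token @ W_gate)          # one gate score per expert
    top = np.argsort(probs)[-top_k:]         # route to the top-k experts only
    weights = probs[top] / probs[top].sum()  # renormalized gate values (g1, g2)
    out = np.zeros(d_model)
    for g, e in zip(weights, top):
        h = np.maximum(token @ W1[e], 0.0)   # expert FFN with ReLU
        out += g * (h @ W2[e])
    return out

print(moe_layer(rng.normal(size=d_model)).shape)   # (8,)
```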
NLLB (No Language Left Behind)
Will LLMs replace supervised MMT models?
NLLB (No Language Left Behind)
Open-source!
Codebases
● Modeling: https://github.com/facebookresearch/fairseq/tree/nllb
● LASER3: https://github.com/facebookresearch/LASER/blob/main/nllb
● Stopes (data and mining pipelines): https://github.com/facebookresearch/stopes/
Models
● Final NMT models:
https://github.com/facebookresearch/fairseq/tree/nllb#multilingual-translation-models
● LASER3 encoders: https://github.com/facebookresearch/LASER/blob/main/nllb
Data
● Flores-200, NLLB-Seed, NLLB-MD, Toxicity-200: https://github.com/facebookresearch/flores
● Mined bitexts: https://huggingface.co/datasets/allenai/nllb
Ongoing work on multimodality (speech + text)
We extended LASER sentence embeddings to the speech modality with
SpeechMatrix and T-modules (Duquenne et al. 2022).
Ongoing work on multimodality (speech + text)
Training end-to-end multimodal models (MT, ASR, S2T, S2ST) that are multilingual on
both the source and target sides (e.g. Whisper's speech translation only translates into English).
We can already generate text and speech (units) with a two-pass decoder in UnitY
(Inaguma et al. 2022), as sketched below.
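A schematic sketch of the two-pass idea (all components are hypothetical stand-ins, not the UnitY implementation): a first decoder generates target text from the speech encoder output, and a second decoder generates discrete speech units conditioned on the first pass.

```python
def two_pass_decode(speech, speech_encoder, text_decoder, unit_decoder, vocoder):
    """Schematic two-pass decoding: speech -> text (1st pass) -> speech units (2nd pass)."""
    enc = speech_encoder(speech)               # encode the source speech
    text, text_states = text_decoder(enc)      # 1st pass: target text
    units = unit_decoder(enc, text_states)     # 2nd pass: discrete speech units
    return text, vocoder(units)                # units -> target waveform
```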
Conclusion
There is more to NLP than LLMs.
The underlying modeling basics for these different tasks are the same.
So know your basics! The algorithmic basics.