
The State of Multilingual and Multimodal NLP

Maha Elbayad
Research Scientist, FAIR (Meta AI)

ThinkAI, May 7th 2023


LLMs

Go back to the roots!

Language Models 101
1. How to represent text?
2. What is a Language model?
3. What is a Conditional Language Model?

A basic setup
Computer vision: binary image classification. A vector of pixel values x is fed to a model fθ, which predicts "dog?" ∈ {0, 1}.
(Figure from https://ai.stanford.edu/~syyeung/cvweb/tutorial1.html)

NLP: sentiment prediction of movie reviews. The review "There is never a dull moment in this movie. Wonderful visuals and good actors." is fed as x to a model fθ, which predicts "positive review?" ∈ {0, 1}.
How to represent this input?

The focus in this 101 will be on representation learning. We assume that:
1. we can evaluate a loss function that measures the error of our model on some training data;
2. we know how to optimize this loss function wrt the model parameters θ.
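To make these two assumptions concrete, here is a minimal Python sketch (not from the slides; the toy data, the logistic-regression model and the learning rate are purely illustrative) of evaluating a loss on training data and optimizing it with gradient descent:

import numpy as np

# Toy training data: x is a feature vector, y is a binary label in {0, 1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))          # 8 training examples, 5 features each
y = (X[:, 0] > 0).astype(float)      # illustrative labels

theta = np.zeros(5)                  # model parameters θ

def f(x, theta):
    # A simple model f_theta: logistic regression.
    return 1.0 / (1.0 + np.exp(-x @ theta))

def loss(theta):
    # Assumption 1: a loss we can evaluate on training data (cross-entropy).
    p = f(X, theta)
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

# Assumption 2: we know how to optimize the loss w.r.t. theta (here, gradient descent).
for step in range(200):
    grad = X.T @ (f(X, theta) - y) / len(y)   # gradient of the cross-entropy loss
    theta -= 0.5 * grad

print(loss(theta))   # the training loss decreases as θ is optimized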
Text representation
Given a vocabulary 𝒱 = {there, bad, dull, moment, good, boring, awesome, actors, classic, story, fights, …} of size V = |𝒱|, we will represent a word with a one-hot vector of size V.

The vector for “bad” = (0, 1, 0, …, 0, 0, 0)
The vector for “good” = (0, 0, 0, 0, 1, …, 0)
All positions except one are zeros.

Bag-of-words representation: the frequency of each word of the vocabulary in the input.
The vector of the sentence “There is never a dull moment in this movie. Wonderful visuals and good actors.” = (1, 0, 2, 0, …, 0). This vector is the x fed to the model fθ, which predicts “positive review?” ∈ {0, 1}.
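A minimal Python sketch of these two representations (the short vocabulary and the example sentence are illustrative):

import numpy as np

vocab = ["there", "bad", "dull", "moment", "good", "boring", "awesome", "actors"]
index = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

def one_hot(word):
    # One-hot vector: all positions are zero except the position of the word.
    v = np.zeros(V)
    v[index[word]] = 1.0
    return v

def bag_of_words(sentence):
    # Bag-of-words: frequency of each vocabulary word in the sentence.
    v = np.zeros(V)
    for w in sentence.lower().split():
        if w in index:
            v[index[w]] += 1.0
    return v

print(one_hot("bad"))                                               # (0, 1, 0, 0, 0, 0, 0, 0)
print(bag_of_words("there was a dull scene then a dull ending"))    # "dull" counted twice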
Text representation
Bag-of-words representation: the frequency of each word in the vocabulary.
The vector of the sentence = (1, 0, 2, 0, …, 0) ∈ ℕ^V

The issues:
1. Large vocabularies mean large sparse vectors (V ≫ 10^5).
2. Loss of word-order information:
vector(“I like tagine but hate couscous”) = vector(“I hate tagine but like couscous”)
3. There is no notion of similarity:
dot(W(‘bad’), W(‘boring’)) = dot(W(‘bad’), W(‘awesome’)) = 0,
where dot(x, y) = xᵀy = ||x|| ||y|| cos(θ) is the dot product of vectors x & y; xᵀy is high if x & y point in the same direction.

We want our vectors to capture semantic information, i.e. words with the same meaning should have the same (or similar) vector.
Text representation
The issues:
1. Large vocabularies mean large sparse vectors (V ≫ 10^5).
➢ We will use dense embeddings of dimension d, with d ≪ V.
(Figure: a table of dense d-dimensional embedding vectors for the words there, bad, dull, moment, good, boring.)
2. No notion of similarity.
We want to capture semantic information.

The Distributional Hypothesis:
Words that occur in the same contexts tend to have similar meanings (Harris, 1954).

Solving 1+2 gave us contextualized word vectors (or contextual embeddings).
How: the skip-gram model [Mikolov et al., 2013]
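A small Python sketch contrasting one-hot vectors with dense embeddings (the embedding values below are random placeholders standing in for trained vectors):

import numpy as np

rng = np.random.default_rng(0)
vocab = ["there", "bad", "dull", "moment", "good", "boring"]
V, d = len(vocab), 4

one_hot = np.eye(V)                      # sparse V-dimensional one-hot vectors
dense = rng.normal(size=(V, d))          # dense d-dimensional embeddings (d << V)

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

bad, boring = vocab.index("bad"), vocab.index("boring")
# One-hot vectors of two different words are always orthogonal: similarity 0.
print(cosine(one_hot[bad], one_hot[boring]))    # 0.0
# Dense embeddings can express similarity; after training (e.g. with skip-gram),
# words like "bad" and "boring" end up with high cosine similarity.
print(cosine(dense[bad], dense[boring]))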
Text representation
Contextualized word vectors with skip-grams [Mikolov et al., 2013] (a simplified version for illustration purposes)
Since similar words appear in similar contexts, we will represent the word “UM6P” by these contexts from a training corpus:

Located in the “Mohammed VI Green City” in Benguerir, near Marrakech, UM6P applies a “learning by doing” approach.
The project will leverage the expertise of INNOV’X, an innovation engine launched by UM6P in 2022 dedicated to building innovative and sustainable businesses and ecosystems.
The 13th edition of the Roundtables of Arbois and the Mediterranean saw Morocco’s OCP Group and Mohammed VI Polytechnic University (UM6P) showcase progress on green hydrogen technologies, as well as the importance of such technologies for the institutions.
Morocco’s UM6P Bags Gold Medal at International Exhibition of Inventions in Geneva.
UM6P’s Green Energy Park won an innovation award for its contributions to renewable energy research and development.
The UNITY team represented UM6P among ten schools from the African continent that participated in this international event.
This immersive visit at UM6P Campus is part of an “engagement course” for the students, to familiarize them with topics related to talent development and the needs for technology and innovation in the country.
The designation of UM6P by the members of the steering committee as the winner of the "Coup de Coeur" was motivated by the initiatives taken and carried out to develop professional equality in the workplace. UM6P was also congratulated and appreciated by the jury for its good practices.

We will maximize the likelihood of observing all the words surrounding the word UM6P.
(Figure: the table of dense embedding vectors being learned.)
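A heavily simplified skip-gram-with-negative-sampling sketch in Python (the toy corpus, window size and hyperparameters are illustrative; a real implementation follows Mikolov et al., 2013):

import numpy as np

rng = np.random.default_rng(0)
corpus = "um6p applies a learning by doing approach in benguerir near marrakech".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, d, window, lr = len(vocab), 16, 2, 0.05

W_in = rng.normal(scale=0.1, size=(V, d))    # embeddings of center words
W_out = rng.normal(scale=0.1, size=(V, d))   # embeddings of context words

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(50):
    for t, center in enumerate(corpus):
        c = idx[center]
        for j in range(max(0, t - window), min(len(corpus), t + window + 1)):
            if j == t:
                continue
            o = idx[corpus[j]]
            # Positive pair: maximize log sigmoid(u_o . v_c) for the observed context word.
            v_c, u_o = W_in[c].copy(), W_out[o].copy()
            g = sigmoid(u_o @ v_c) - 1.0
            W_out[o] -= lr * g * v_c
            W_in[c] -= lr * g * u_o
            # One negative sample: push a random word away from the center word
            # (collisions with the true context word are ignored for simplicity).
            n = int(rng.integers(V))
            u_n = W_out[n].copy()
            g_neg = sigmoid(u_n @ W_in[c])
            W_out[n] -= lr * g_neg * W_in[c]
            W_in[c] -= lr * g_neg * u_n

print(W_in[idx["um6p"]][:4])   # first dimensions of the learned dense vector for "um6p"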
Text representation
Contextualized word vectors with skip-grams [Mikolov et al., 2013] (a simplified version for illustration purposes)
(Figure: the table of dense embedding vectors.)

The sentence “There is never a dull moment in this movie. Wonderful visuals and good actors.” is mapped to x by an aggregation (mean, max, …) of its word embeddings, then fed to the model fθ, which predicts “positive review?” ∈ {0, 1}.

But we still haven’t dealt with the loss of word-order information!
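A sketch of the aggregation step in Python, assuming we already have a word-to-embedding table (random values stand in for trained skip-gram vectors):

import numpy as np

rng = np.random.default_rng(0)
d = 8
words = ["there", "is", "never", "a", "dull", "moment", "in", "this", "movie"]
embeddings = {w: rng.normal(size=d) for w in words}

def sentence_vector(sentence, pooling="mean"):
    # Aggregate the word embeddings of a sentence into one fixed-size vector.
    vecs = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    stacked = np.stack(vecs)
    return stacked.mean(axis=0) if pooling == "mean" else stacked.max(axis=0)

x = sentence_vector("There is never a dull moment in this movie")
print(x.shape)   # (8,): same size for any sentence length, ready to feed to f_theta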
Text representation
Sequence models
Motivation: how to combine the word embeddings of a sentence (of arbitrary length) into a meaningful vector representation?

A sequence model maps the skip-gram vectors of the words to context vectors: each vector represents the current word and all the previous words (the context). Some models include the future words as well.

How: recurrent networks, 1D-convolutional models, attention models. For an introductory course check Lena Voita’s “NLP Course for You” - https://lena-voita.github.io/nlp_course.html
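As an illustration of the first option, here is a minimal (untrained) recurrent network in Python, showing how a sequence model turns word vectors into one context vector per position:

import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 16                               # word-embedding size, context-vector size
W_xh = rng.normal(scale=0.1, size=(d, h))  # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(h, h))  # hidden-to-hidden weights

def rnn_encode(word_vectors):
    # Return one context vector per position: h_t summarizes x_1 ... x_t.
    h_t = np.zeros(h)
    contexts = []
    for x_t in word_vectors:
        h_t = np.tanh(x_t @ W_xh + h_t @ W_hh)
        contexts.append(h_t)
    return contexts

sentence = [rng.normal(size=d) for _ in range(7)]   # skip-gram vectors of a 7-word sentence
contexts = rnn_encode(sentence)
print(len(contexts), contexts[-1].shape)            # 7 context vectors of size 16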
Language models
Given a sequence x, the role of a language model is to estimate the joint probability p(x), i.e. to assess the plausibility or fluency of x.
With the chain rule we rewrite the joint probability as:

p(x) = ∏t p(xt | x1, x2, …, xt-1)

We compute the probability distribution of the next word xt: p(xt | x1, x2, …, xt-1) and sample from it, or pick the token with the highest probability (if greedy decoding).

Think of it as auto-complete:
“The hackathon participants spent the whole weekend __” (x1 x2 x3 x4 x5 x6 x7)
→ working 0.4, coding 0.3, playing 0.2, sleeping 0.05, …
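A toy Python sketch of the chain rule and of greedy decoding, using a tiny bigram count model as a stand-in for a real language model (the corpus and the truncated history are illustrative):

from collections import Counter, defaultdict

corpus = "the hackathon participants spent the whole weekend working on the project".split()

# A tiny bigram "language model": p(x_t | x_{t-1}) estimated from counts.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_distribution(prev):
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

def sequence_probability(words):
    # Chain rule: p(x) = prod_t p(x_t | x_1, ..., x_{t-1}); here the history is truncated to one word.
    p = 1.0
    for prev, nxt in zip(words, words[1:]):
        p *= next_word_distribution(prev).get(nxt, 1e-6)
    return p

def greedy_complete(prefix, steps=3):
    # Greedy decoding: always pick the most probable next word.
    words = prefix.split()
    for _ in range(steps):
        dist = next_word_distribution(words[-1])
        words.append(max(dist, key=dist.get))
    return " ".join(words)

print(sequence_probability("the whole weekend".split()))
print(greedy_complete("the hackathon participants spent the whole weekend"))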
Language models
How are they related to sequence models?
We want to model p(xt | x1, x2, …, xt-1).
If we have a vector that summarizes x1, x2, …, xt-1, then we can use it to predict what xt could be. This vector is exactly what the output of a sequence model (encoder) is.

(Figure: a sequence model reads “The hackathon participants spent the whole weekend”; from the context vector after “The hackathon” we predict “participants”, and from the last context vector we predict “working”.)


Conditional Language models
We want to model the conditional joint probability of y | x, where y is a sequence of tokens and x is some conditioning context (potentially a sequence itself).

Similar to language models, we decompose this probability with the chain rule:

p(y | x) = ∏t p(yt | y1, …, yt-1, x)

i.e. the probability of the next word, given the history of previously generated words AND the conditioning context x.

Similar to LMs, except for the additional context, usually processed with an encoder. This module + the LM decoder build what we call sequence-to-sequence models.
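A minimal (untrained) Python sketch of a sequence-to-sequence model: an encoder summarizes the conditioning context x into a vector, and a decoder greedily generates y token by token. The tiny target vocabulary and the random weights are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
d, h, V_out = 8, 16, 5
tgt_vocab = ["<bos>", "a", "dog", "standing", "<eos>"]

W_enc = rng.normal(scale=0.1, size=(d, h))        # encoder projection
W_dec = rng.normal(scale=0.1, size=(h + d, h))    # decoder recurrence
W_out = rng.normal(scale=0.1, size=(h, V_out))    # output projection to the vocabulary
emb_tgt = rng.normal(scale=0.1, size=(V_out, d))  # target-token embeddings

def encode(source_vectors):
    # Encoder: summarize the conditioning context x into one vector (mean of projections).
    return np.tanh(np.stack(source_vectors) @ W_enc).mean(axis=0)

def decode_greedy(context, max_len=5):
    # Decoder: model p(y_t | y_1 ... y_{t-1}, x) with a recurrent state initialized from x.
    state, token, output = context, 0, []          # start from <bos>
    for _ in range(max_len):
        state = np.tanh(np.concatenate([state, emb_tgt[token]]) @ W_dec)
        logits = state @ W_out
        probs = np.exp(logits) / np.exp(logits).sum()
        token = int(probs.argmax())                # greedy: most probable next token
        if tgt_vocab[token] == "<eos>":
            break
        output.append(tgt_vocab[token])
    return output

source = [rng.normal(size=d) for _ in range(6)]    # embeddings of the source sequence
print(decode_greedy(encode(source)))               # gibberish until the model is trained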
Conditional Language models
What NLP tasks do we use conditional LMs for?

● Machine translation (text-to-text)
● Speech translation (speech-to-text or speech-to-speech)
● Automatic speech recognition (ASR)
● Summarization
● Image captioning (e.g. an image of a dog → “A dog standing in the ocean.”)
NLP:
It is the field of automatic (or semi-automatic) processing of human languages. It is focused on understanding human language at different granularities (characters, words, sentences, documents, etc.) and in different modalities (text, speech, visual, etc.).

It’s not just LLMs!

Next: some of our recent work in multilingual and multimodal NLP (speech + text).
NLLB (No Language Left Behind)
North star goal: develop a general-purpose universal machine translation model capable of translating between any two languages in various domains.

Google Translate supports 134 languages and Microsoft Translator supports 110; however, there are more than 3000 written languages in the world. How can we break the 200 barrier?

Our work towards NLLB-200 was structured around 3 axes:
● Data (mining)
● Modeling (Mixture of Experts)
● Evaluation
NLLB (No Language Left Behind)
Problem: how can we collect enough training data for low-resource languages?
(Figure: the amount of bitext data, i.e. pairs of source-target sentences, available to us per language.)

Two techniques to augment our data: (1) back-translation (see the sketch below), (2) bitext mining.
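A hedged back-translation sketch; the translate function below is a hypothetical placeholder for any existing target-to-source translation model, not an NLLB or fairseq API:

# Back-translation: turn monolingual target-side data into synthetic bitext by
# translating it back into the source language with an existing reverse model.

def translate(sentence, src, tgt):
    # Hypothetical placeholder for a trained tgt -> src translation model.
    raise NotImplementedError("plug in any reverse MT model here")

def back_translate(monolingual_target, src_lang, tgt_lang):
    # Create synthetic (source, target) pairs from target-language monolingual text.
    synthetic_bitext = []
    for tgt_sentence in monolingual_target:
        synthetic_src = translate(tgt_sentence, src=tgt_lang, tgt=src_lang)
        synthetic_bitext.append((synthetic_src, tgt_sentence))  # noisy source, clean target
    return synthetic_bitext

# The synthetic pairs are then mixed with the real bitext to train the src -> tgt model.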
NLLB (No Language Left Behind)
Bitext mining:
Multilingual sentence encoders embed sentences and find semantically similar ones in different languages - see LASER (Artetxe and Schwenk, 2019).

In a monolingual space, sentences with similar meaning are close: “I love eating.” sits near “I enjoy food a lot.”
In a multilingual space, sentences with similar meaning are close independently of their language: “The dog is brown.” sits near “Le chien est brun”, “I love eating.” near “J’aime manger”, and “I want to call you.” near “أريد الاتصال بك”.
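A simplified mining sketch in Python, assuming we already have a multilingual sentence encoder: the encode argument is a hypothetical function returning L2-normalized vectors in a shared space, and real pipelines (e.g. stopes with LASER3 encoders) use margin-based scoring and FAISS indexes rather than this brute-force cosine threshold:

import numpy as np

def mine_bitext(src_sentences, tgt_sentences, encode, threshold=0.8):
    # Pair up sentences from two monolingual corpora whose embeddings are close.
    # `encode` is assumed to map a list of sentences (in any language) to unit-norm
    # vectors in a shared multilingual space, as LASER-style encoders do.
    src_emb = encode(src_sentences)            # shape (n_src, d)
    tgt_emb = encode(tgt_sentences)            # shape (n_tgt, d)
    scores = src_emb @ tgt_emb.T               # cosine similarities (vectors are unit-norm)
    pairs = []
    for i, row in enumerate(scores):
        j = int(row.argmax())                  # best target candidate for each source sentence
        if row[j] >= threshold:
            pairs.append((src_sentences[i], tgt_sentences[j], float(row[j])))
    return pairs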
NLLB (No Language Left Behind)
Bitext mining:
What if we have a new language and we want to encode it in the same multilingual space?

NLLB (No Language Left Behind)
Problem: How can we collect enough training data for low-resource languages?
With the addition of back-translated and mined data, most of the
low-resource languages cross the threshold of 1M samples.

NLLB (No Language Left Behind)
The strength of multilingual MT is in leveraging knowledge transfer between languages.
However, it also comes with interference between languages.

The MoE solution: a technique that allows for more parameters at an equivalent computational cost, and for sparsely activated weights to be specialized in some languages.

Sparsely Gated Mixture of Experts (MoE): instead of a single dense FFN, the layer holds several expert FFNs (FFN1, FFN2, FFN3, …, FFNE) and a gate that assigns each input token representation to a few experts with weights g1, g2, …. We replace every other FFN in the Transformer encoder-decoder with such an MoE FFN layer. The source sentence is prefixed with <source_language> and the target sentence with <target_language>.
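A minimal top-2 sparsely gated MoE FFN layer in Python (random untrained weights; the actual NLLB-200 MoE layers sit inside a Transformer and include details such as load balancing that are omitted here):

import numpy as np

rng = np.random.default_rng(0)
d, E, k = 16, 4, 2                                    # model dim, number of experts, top-k

experts = [(rng.normal(scale=0.1, size=(d, 4 * d)),   # per-expert FFN weights (W1, W2)
            rng.normal(scale=0.1, size=(4 * d, d))) for _ in range(E)]
W_gate = rng.normal(scale=0.1, size=(d, E))

def moe_ffn(x):
    # Route a token representation x to its top-k experts and mix their outputs.
    logits = x @ W_gate
    top = np.argsort(logits)[-k:]                     # indices of the k highest-scoring experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()   # normalized gate weights g1, g2
    out = np.zeros(d)
    for g, e in zip(gates, top):
        W1, W2 = experts[e]
        out += g * (np.maximum(x @ W1, 0.0) @ W2)     # expert FFN: ReLU(x W1) W2
    return out

token = rng.normal(size=d)
print(moe_ffn(token).shape)   # (16,): only 2 of the 4 experts are computed for this token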
NLLB (No Language Left Behind)
Results
Our final model significantly outperforms the previous SOTA.
(Figure: results on Flores-101 and Flores-200, with relative improvements of +37% and +44%.)
NLLB (No Language Left Behind)
Will LLMs replace supervised MMT models?

Recent studies (work in progress) have shown that knowledge transfers poorly across languages in LLMs: being correct on a specific question in English does not necessarily imply the LLM will also be correct on the same question in other languages.
NLLB (No Language Left Behind)
Open-source!

● Project webpage: https://ai.facebook.com/research/no-language-left-behind/


● The Paper: https://arxiv.org/abs/2207.04672

Codebases
● Modeling: https://github.com/facebookresearch/fairseq/tree/nllb
● LASER3: https://github.com/facebookresearch/LASER/blob/main/nllb
● Stopes (data and mining pipelines): https://github.com/facebookresearch/stopes/

Models
● Final NMT models:
https://github.com/facebookresearch/fairseq/tree/nllb#multilingual-translation-models
● LASER3 encoders: https://github.com/facebookresearch/LASER/blob/main/nllb

Data
● Flores-200, NLLB-Seed, NLLB-MD, Toxicity-200: https://github.com/facebookresearch/flores
● Mined bitexts: https://huggingface.co/datasets/allenai/nllb

Ongoing work on multimodality (speech + text)
We extended LASER sentence embeddings to the speech modality with
SpeechMatrix and T-modules (Duquenne et al. 2022).

Ongoing work on multimodality (speech + text)
Training end-to-end multimodal models (MT, ASR, S2T, S2ST) that are multilingual on both source and target sides (e.g. Whisper S2T only translates into English).
We can already generate text and speech (units) with a two-pass decoder in UnitY (Inaguma et al. 2022).
Conclusion
There is more to NLP than LLMs.

The underlying modeling basics for these different tasks are the same.
So know your basics! The algorithmic basics.

There are a multitude of solutions to the same problem. As a researcher/engineer you hypothesise, then you use data to prove (or disprove) your hypothesis. And you iterate!
