
Probing Knowledge and Structure in Transformers

This document summarizes common methods for analyzing what information is stored in Transformer models, including attention visualization, probing classifiers, and mechanistic analyses. Probing classifiers involve training additional models to predict properties like part-of-speech tags based on representations from a Transformer. Studies show Transformers capture linguistic information like syntax but the representations are not entirely interpretable.


Probing Knowledge and Structure in Transformers
AAAI Tutorial: Transformers

Ellie Pavlick, Brown University, February 2023


Goals

• Methods for answering "What does my Transformer know? And where is that information stored?"
• Survey several common methods for analyzing Transformers
• Summarize some key findings about how Transformers work
Limitations

• My expertise is in language, so my examples will be mostly drawn from NLP. Similar trends likely hold in vision, RL, etc.
• There is a lot of very active work on analyzing large Transformer language models; I am only covering the basics
• There is much more that we do not know than that we do know :)
Outline

• Attention Visualization
• Probing Classifiers & Interventions
• “Mechanistic” Analyses
Attention Visualization
What is attention visualization?

• When trying to determine what your Transformer is doing, analyzing attention is usually the first stop!
• Works like "feature heatmaps" in vision
• Widely agreed to not be decisive evidence of anything causal, but still commonly used (you will see it in many papers)
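To make the object being visualized concrete, here is a minimal numpy sketch: attention weights are softmax-normalized query-key dot products, so each row of the attention matrix is a probability distribution over which tokens a position attends to. The queries and keys below are random stand-ins; a real visualization would extract them from a trained model.

```python
import numpy as np

def attention_weights(Q, K):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)), row-wise."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

tokens = ["<s>", "the", "car", "is", "not", "blue"]
rng = np.random.default_rng(0)
Q = rng.normal(size=(len(tokens), 8))  # stand-in queries (one per token)
K = rng.normal(size=(len(tokens), 8))  # stand-in keys

A = attention_weights(Q, K)
# Each row of A sums to 1; plotting A as a tokens x tokens heatmap is
# exactly the "attention visualization" shown in papers.
for tok, row in zip(tokens, A):
    print(f"{tok:>5}", np.round(row, 2))
```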
Outline

• Attention Visualization
• Probing Classifiers and Interventions
• “Mechanistic” Analyses
Probing Classifiers
What is a probing classifier?

[Diagram, built up over several slides: a frozen Transformer language model maps the input "<s> the car is not blue" to "the car is not blue </s>". A small classifier (e.g., an MLP) is trained on top of the frozen representations to predict properties such as:
• Noun? (e.g., linguistic knowledge)
• Modifier? (e.g., linguistic knowledge)
• Is it smaller than a breadbox? (e.g., commonsense knowledge)]
Probing Classifiers
What is a probing classifier?

• Training a probe requires auxiliary data that captures the phenomenon you are interested in
  • And requires knowing what you are looking for in advance!
• Prone to false positives, but there are methods to minimize this
  • Typically people choose a "small" classifier, but this is quite informal
  • Control Tasks (Hewitt and Liang, 2019)
  • Information-Theoretic Probing (Voita et al., 2020)
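The recipe can be sketched end-to-end in a few lines. Here the "frozen representations" are synthetic numpy vectors with a noun-like feature planted along one direction, standing in for real hidden states, and the probe is a single logistic-regression layer trained on top:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen Transformer representations: a binary feature
# (e.g., "is this token a noun?") is noisily encoded along one direction.
n, d = 500, 32
w_true = rng.normal(size=d)
H = rng.normal(size=(n, d))                       # frozen hidden states
y = (H @ w_true + 0.1 * rng.normal(size=n) > 0)   # auxiliary labels

# The probe: one logistic-regression layer trained on top of frozen H.
w = np.zeros(d)
for _ in range(200):
    p = 1 / (1 + np.exp(-(H @ w)))
    w -= 0.5 * H.T @ (p - y) / n   # gradient step on cross-entropy loss

acc = ((H @ w > 0) == y).mean()
print(f"probe accuracy: {acc:.2f}")  # high accuracy -> feature is linearly decodable
```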
Probing Classifiers
Minimizing False Positives

Designing and Interpreting Probes with Control Tasks. Hewitt and Liang (EMNLP 2019)
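The control-task idea can be sketched as follows: train the same probe twice, once on the real labels and once on labels with no connection to the representations, and report the gap ("selectivity"). This toy version uses synthetic representations and per-example random labels; Hewitt and Liang's actual control tasks assign random labels per word type.

```python
import numpy as np

rng = np.random.default_rng(1)

def probe_accuracy(H, y, steps=200, lr=0.5):
    """Fit a linear logistic probe and return its training accuracy."""
    w = np.zeros(H.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(H @ w)))
        w -= lr * H.T @ (p - y) / len(y)
    return ((H @ w > 0) == y).mean()

n, d = 500, 32
w_true = rng.normal(size=d)
H = rng.normal(size=(n, d))          # stand-in for frozen representations
y_real = (H @ w_true > 0)            # a genuinely encoded feature
y_control = rng.random(n) > 0.5      # control task: labels assigned at random

selectivity = probe_accuracy(H, y_real) - probe_accuracy(H, y_control)
print(f"selectivity: {selectivity:.2f}")
# High selectivity -> the probe reads real structure rather than memorizing.
```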
Probing Classifiers & Interventions
Results on Transformer Language Models

"The important thing about Disney is that it is a global brand."
Probe predictions: Noun / Verb / Noun; Entity / Not Ent.

Probing Sentence Structure in Contextualized Word Representations. Tenney et al. (ICLR 2019)
Probing Classifiers & Interventions
Results on Transformer Language Models

[Bar chart: probe accuracy per task for a word-prior baseline vs. the full model, across tasks such as POS, constituents, dependencies, entities, SRL, coreference, SPR1, and SPR2. The full model scores higher than the baseline across tasks.]

Probing Sentence Structure in Contextualized Word Representations. Tenney et al. (ICLR 2019)
Probing Classifiers & Interventions
Results on Transformer Language Models

Large amounts of non-trivial linguistic information encoded.

[Bar chart: change in probe accuracy relative to the word-prior baseline, per task.]

Probing Sentence Structure in Contextualized Word Representations. Tenney et al. (ICLR 2019)
Probing Classifiers & Interventions
Results on Transformer Language Models

[Figure 2 from the paper: layer-wise metrics on BERT-large. Solid (blue) bars are mixing weights; outlined (purple) bars are differential scores, normalized for each task. Horizontal axis: layer in the network; bar height indicates the importance of the layer in the decision.]

Probing Sentence Structure in Contextualized Word Representations. Tenney et al. (ICLR 2019)
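The "mixing weights" in that figure come from the paper's scalar-mix probe: the probe learns one scalar per layer, softmaxes them, and pools the layers with those weights, so the learned weights indicate which layers a task draws on. A rough numpy sketch, with random activations standing in for real per-layer BERT states:

```python
import numpy as np

def scalar_mix(layer_states, a, gamma=1.0):
    """Pool per-layer states with softmax(a); the resulting weights are
    the 'mixing weights' plotted per task in the figure."""
    w = np.exp(a - a.max())
    w /= w.sum()                         # one weight per layer, sums to 1
    return gamma * np.tensordot(w, layer_states, axes=1), w

rng = np.random.default_rng(0)
L, T, d = 12, 6, 16                      # layers, tokens, hidden size
layer_states = rng.normal(size=(L, T, d))
a = rng.normal(size=L)                   # learned scalars, one per layer

mixed, w = scalar_mix(layer_states, a)
print(mixed.shape)      # (6, 16): one pooled vector per token
print(np.round(w, 2))   # the layer-importance profile
```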
Probing Classifiers & Interventions
Results on Transformer Language Models

Roughly: higher-level information gets encoded later in the network.

[Figure 3 from the paper: probing classifier predictions across layers of BERT-base for "(b) china today blacked out a cnn interview that was ...". Blue is the correct label; orange is the incorrect label with the highest average score over layers. Bar heights are normalized probabilities; only select tasks shown. For example, the model initially tags "today" as a common noun/date/temporal modifier (ARGM-TMP) before resolving it in later layers.]

Probing Sentence Structure in Contextualized Word Representations. Tenney et al. (ICLR 2019)
Probing Classifiers & Interventions
Results on Transformer Language Models

What Happens to BERT Embeddings During Finetuning? Merchant et al. (BlackboxNLP Workshop 2020)
Probing Interventions
What are probing interventions?

• Probing classifiers show you that a feature is represented
• It is purely correlative, so doesn't speak to whether the full model uses that feature
• Newer efforts have used the probe to run interventions
• Note: the word "causal" is debated in relation to these kinds of studies
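The simplest such intervention can be sketched as one projection step in the style of amnesic probing / INLP (real uses iterate over several probe directions): take the direction a trained probe found, project the representations onto its nullspace, and re-run the model to see whether behavior changes. Here `w_probe` is a random stand-in for a learned probe direction:

```python
import numpy as np

def remove_direction(H, w):
    """Project representations onto the nullspace of direction w,
    erasing the feature that is linearly decodable along it."""
    u = w / np.linalg.norm(w)
    return H - np.outer(H @ u, u)

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 16))        # hidden states (stand-ins)
w_probe = rng.normal(size=16)         # direction found by a trained probe

H_amnesic = remove_direction(H, w_probe)

# The feature is no longer linearly readable along w_probe ...
print(np.abs(H_amnesic @ w_probe).max())  # ~0 (up to float error)
# ... and feeding H_amnesic through the rest of the model tests whether
# downstream behavior actually used that feature.
```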
Probing Interventions
What are probing interventions?

Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals. Elazar et al. (TACL 2021)

Probing Interventions
Results on Transformer Language Models

What if This Modified That? Syntactic Interventions via Counterfactual Embeddings. Tucker et al. (Findings of ACL 2021)
Probing Classifiers & Interventions
Summary

• Used to detect if a feature is "represented" and where
• Primary Limitations:
  • Probes alone are purely correlative and prone to false positives (though there are controls)
  • Methods for "causal"* intervention are under development but not yet standard
  • Requires data specifying what to look for upfront
• Key Findings:
  • Transformer language models represent a significant amount of linguistic information
  • This information is often organized across layers in an interpretable way
  • Ablating the information often leads to expected changes in downstream behavior
Outline

• Attention Visualization
• Probing Classifiers & Interventions
• “Mechanistic” Analyses
Mechanistic Analyses
What are "mechanistic" analyses?

• Work by Chris Olah and others at Anthropic popularized this term
• Typically requires very manual and architecture-specific analysis of weights in the network
• But results are very low-level interpretations of what the model does and where
Mechanistic Analyses
Results on Transformers

"Induction heads," which copy the next token from previous points in the context.

A Mathematical Framework for Transformer Circuits. Elhage et al. (Transformer Circuits Thread, 2021)
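The copying behavior can be written out as a plain function. This is a toy version of the algorithm the induction-head circuit implements, where one attention head matches the previous occurrence of the current token and another copies what followed it:

```python
def induction_head(seq):
    """Toy induction-head behavior: to guess the next token, find the most
    recent earlier occurrence of the current (last) token and copy what
    followed it.  Pattern: [A][B] ... [A] -> predict [B]."""
    cur = seq[-1]
    for i in range(len(seq) - 2, -1, -1):   # scan the context backwards
        if seq[i] == cur:
            return seq[i + 1]               # copy the token that followed
    return None                             # no earlier occurrence

print(induction_head(["Harry", "Potter", "said", "Harry"]))  # -> Potter
print(induction_head(["the", "cat", "sat"]))                 # -> None
```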


Mechanistic Analyses
Results on Transformers

Individual neurons respond to specific key patterns, to produce a distribution over words.

Transformer Feed-Forward Layers Are Key-Value Memories. Geva et al. (EMNLP 2021)
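The key-value reading of a feed-forward layer can be sketched directly: the first weight matrix holds "keys" (patterns neurons detect), the second holds "values" (distributions over output words to add), and the ReLU activations act as memory coefficients. Orthogonal keys make this toy example exact:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_keys, vocab = 8, 4, 5

# Toy memory with orthogonal keys so the recall is exact.
K = np.eye(n_keys, d)                   # each row: a pattern a neuron detects
V = rng.normal(size=(n_keys, vocab))    # each row: that neuron's output-vocab shift

def ffn(x):
    """Feed-forward layer viewed as a key-value memory (Geva et al.):
    neuron activations measure how well the input matches each key;
    the output is the activation-weighted sum of the value vectors."""
    m = np.maximum(x @ K.T, 0.0)        # ReLU "memory coefficients"
    return m @ V

x = 2.0 * K[2]                          # an input that matches key 2
out = ffn(x)
print(np.allclose(out, 2.0 * V[2]))     # the layer recalls value 2
```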
Mechanistic Analyses
Results on Transformers

Manipulating these MLPs can enable counterfactual interventions.

Locating and Editing Factual Associations in GPT. Meng et al. (NeurIPS 2022)
Mechanistic Analyses
Summary

• A promising new line of work looks at the low-level algorithms in play within the Transformer
• These studies have revealed, e.g., that:
  • Attention heads can act as generic copy functions which move information around in the network
  • Feedforward layers can act as key-value stores which recall a distribution over output words for a given input concept
• Main limitation is that this type of work tends to be very manual and architecture-specific
General Discussion
Summary

• Work on analyzing neural networks (Transformers specifically) has become increasingly less "black box"
• There is a lot we don't know about how the models work, but there is also a lot we do know!
• Transformer LMs capture a lot of conceptual and linguistic knowledge and organize it in meaningful ways
• Their attention+MLP mechanisms appear to work together to copy information and perform abstract lookups from memory
• We can sometimes intervene on individual concepts or components to manipulate the Transformer's behavior
General Discussion
Future Directions

• Still a lot of work needed to better understand mechanisms, and to understand the behavior of large models "in the wild"
• Disagreement remains on when it is appropriate to attribute "causality" to some/all of the model's parameters
• Most work is in NLP, but Transformers are being applied to many domains!
• Converging (or diverging!) evidence across domains would significantly move the field(s) forward
