
Small Models are Valuable Plug-ins for Large Language Models

Canwen Xu¹*, Yichong Xu², Shuohang Wang², Yang Liu², Chenguang Zhu², Julian McAuley¹
¹University of California, San Diego   ²Microsoft
¹{cxu,jmcauley}@ucsd.edu   ²{yicxu, shuowa, yaliu10, chezhu}@microsoft.com

Abstract

Large language models (LLMs) such as GPT-3 and GPT-4 are powerful but their weights are often publicly unavailable and their immense sizes make the models difficult to be tuned with common hardware. As a result, effectively tuning these models with large-scale supervised data can be challenging. As an alternative, In-Context Learning (ICL) can only use a small number of supervised examples due to context length limits. In this paper, we propose Super In-Context Learning (SuperICL) which allows black-box LLMs to work with locally fine-tuned smaller models, resulting in superior performance on supervised tasks. Our experiments demonstrate that SuperICL can improve performance beyond state-of-the-art fine-tuned models while addressing the instability problem of in-context learning. Furthermore, SuperICL can enhance the capabilities of smaller models, such as multilinguality and interpretability.¹

1 Introduction

Large-scale pre-trained language models, such as GPT-3 (Brown et al., 2020) and GPT-4 (OpenAI, 2023), have demonstrated remarkable capabilities in a wide range of NLP tasks. Despite the impressive performance of these recently released models, their size and limited accessibility of model weights can lead to difficulties in fine-tuning these models with supervised data, which is an effective way to adapt the models to specific tasks (Liu et al., 2019).

An alternative approach, In-Context Learning (ICL, Brown et al., 2020), involves concatenating a few labeled examples with the test input, enabling the model to learn from the context. However, ICL is limited by the maximum context length of the LLM, restricting the number of examples it can utilize. Consequently, while ICL can usually perform few-shot learning with 16 or 32 examples, it cannot fully exploit supervised data when there are hundreds or thousands of examples.

To address these limitations, we propose Super In-Context Learning (SuperICL), a novel approach that enables black-box language models (e.g., GPT-3.5) to work with locally fine-tuned smaller models (e.g., RoBERTa, Liu et al., 2019), resulting in improved performance on supervised tasks. SuperICL is designed to overcome the challenges of poor performance and instability of ICL.

SuperICL builds on the strengths of ICL while mitigating its limitations. As shown in Figure 1, SuperICL leverages a combination of an LLM with smaller models, which act as plug-ins, to perform supervised tasks efficiently. Specifically, we use the plug-in model to predict labels with confidence for in-context examples and concatenate them with the input text and ground-truth labels as context. For test examples, we also add the plug-in model's prediction and confidence to the test input and let the LLM predict the final label and an explanation. As these plug-in models have been fine-tuned on the task-specific data, they serve as a bridge between the large pre-trained model and the task-specific data, allowing for effective knowledge transfer and improved performance.

We conduct extensive experiments to evaluate the effectiveness of SuperICL on GLUE (Wang et al., 2019), a standard benchmark for natural language understanding. Our results show that SuperICL: (1) achieves superior performance compared to state-of-the-art fine-tuned models and LLMs; (2) addresses the instability problem of ICL by allowing the plug-in models to absorb task-specific information while leaving the LLMs to focus on more general language understanding; (3) enhances the capabilities of plug-in models such as extending their multilinguality to cover a wider range of languages; (4) provides interpretability via the LLM by providing explanations for why it overrides predictions made by plug-in models.

* Work done during internship at Microsoft.
¹ Code available at https://aka.ms/SuperICL.
[Figure 1: workflow diagrams of (a) ICL and (b) SuperICL; diagrams omitted, caption below.]

Figure 1: The workflow of ICL and SuperICL. There are three steps in SuperICL: (1) A context is constructed by
randomly sampling from the training data and incorporating the plug-in model’s predictions, including predicted
labels and their corresponding confidence scores. (2) The test input is concatenated after the context, with the
plug-in model’s prediction attached. (3) Finally, a language model generates the final prediction along with an
optional explanation.

We then conduct a thorough analysis of how each component contributes to the final performance of SuperICL, as well as the impact of the number of in-context examples. We also explore the effects of adversarial attacks on plug-in models and how they affect SuperICL's performance. Our findings demonstrate the potential of combining large and small, cloud and local models, shedding light on a promising new paradigm for supervised learning in the era of large language models.

2 Related Work

In-Context Learning  Originally proposed in the GPT-3 paper (Brown et al., 2020), In-Context Learning (ICL) is considered a new paradigm that exploits LLMs on new tasks without updating the parameters of the model. It prepends few-shot training examples before the test input as a prompt, to enable large language models to find patterns and "learn" to predict. There have been successful applications of ICL in downstream tasks, such as machine translation (Lin et al., 2021; Agrawal et al., 2022) and data generation (Ye et al., 2022). Despite its success in few-shot learning, a major drawback of ICL is instability. The performance of ICL is sensitive to the selected in-context examples (Zhao et al., 2021) and even their order (Lu et al., 2022). Based on these discoveries, there is a line of studies focused on constructing the context. LM-BFF (Gao et al., 2021) and KATE (Liu et al., 2022) select training examples that are semantically similar to the test example. Another line of work (Su et al., 2022; Levy et al., 2022; Ye et al., 2023) focuses on mining diverse and representative examples from a training set. Zhang et al. (2022) utilize active learning and reinforcement learning to select examples for ICL. Self-adaptive ICL (Wu et al., 2022, 2023b) proposes a two-stage search framework to obtain the optimal in-context examples for each test input without using a separate validation set. Different from these works, SuperICL demonstrates that smaller models can be integrated into large language models for supervised tasks. Although it is orthogonal to these prior works, by fine-tuning the plug-in model with the entire training set, SuperICL reduces the necessity of selecting the optimal examples from the training set.

Moreover, prior studies also investigate how to prepare language models for ICL. Zhao et al. (2021) propose calibration with an empty test input to reduce the influence of the label distribution and ordering. MetaICL (Min et al., 2022a) meta-trains the language model to generalize to unseen tasks for better ICL performance. Chen et al. (2022) propose four self-supervised objectives as intermediate tasks to improve the performance of language models on ICL.
Algorithm 1 Super In-Context Learning (SuperICL)
Require: Training set D = {(x_1, y_1), ..., (x_n, y_n)}, LLM M, a small pre-trained language model P
Ensure: Predicted label y_t and optional explanation e_t
1: Fine-tune P on D to obtain the fine-tuned plug-in model P'
2: Randomly sample a set of examples (x_i, y_i) from D to be the set of in-context examples D'
3: for each example (x_i, y_i) in D' do
4:   Predict y_i' and c_i with P', where y_i' is the predicted label and c_i is the confidence score
5: end for
6: Construct the context C by concatenating all (x_i, y_i', c_i, y_i)
7: for each test example x_t do
8:   Predict y_t' and c_t with P' for the test example
9:   Formulate the complete input I = C ⊕ (x_t, y_t', c_t), where ⊕ denotes concatenation
10:  Use M to predict y_t from I
11:  (Optional) If y_t ≠ y_t', ask M to generate an explanation e_t for overriding the prediction of P'
12: end for
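To make the two loops of Algorithm 1 concrete, a minimal Python sketch is shown below. It is illustrative only: the prompt wording, the plug-in interface (a Hugging Face text-classification pipeline), the checkpoint path, and the legacy OpenAI Completion call are assumptions on our part, not the exact implementation released at https://aka.ms/SuperICL.

import random
import openai
from transformers import pipeline

# Plug-in model P': a locally fine-tuned classifier (Algorithm 1, line 1 is done offline).
plugin = pipeline("text-classification", model="path/to/finetuned-roberta-large")  # hypothetical checkpoint

def plugin_predict(text):
    """Return the plug-in model's predicted label y' and confidence c."""
    out = plugin(text)[0]
    return out["label"], round(out["score"], 2)

def format_example(x, y_pred, conf, y_gold=None):
    """One block of the context: input, plug-in prediction with confidence,
    and (for in-context examples only) the ground-truth label."""
    block = f"Input: {x}\nRoBERTa-Large Prediction: {y_pred} (Confidence: {conf})\n"
    if y_gold is not None:
        block += f"Label: {y_gold}\n"
    return block

def super_icl(train_set, x_test, n_examples=32):
    # Lines 2-6: sample in-context examples and build the context C.
    demos = random.sample(train_set, n_examples)
    context = "\n".join(format_example(x, *plugin_predict(x), y_gold=y) for x, y in demos)
    # Lines 8-9: attach the plug-in prediction for the test input.
    y_plug, c_plug = plugin_predict(x_test)
    prompt = context + "\n" + format_example(x_test, y_plug, c_plug) + "Label:"
    # Line 10: the LLM M produces the final label (legacy openai<1.0 Completion API).
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=10, temperature=0
    )
    return resp["choices"][0]["text"].strip(), y_plug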

Notably, both Min et al. (2022a) and Chen et al. (2022) require updating the weights, and thus are not applicable to larger black-box models like GPT-3/4.

Besides studies aiming to improve ICL's performance, some studies have analyzed the underlying mechanism of ICL. Min et al. (2022b) find that the label space, the distribution of the input text, and the overall format of the sequence are the key factors for ICL's performance. They also claim that the ground-truth labels are not significant to the performance of ICL, but this conclusion is contradicted by a later study (Yoo et al., 2022). Additionally, prior studies suggest ICL could be implicitly performing Bayesian inference (Xie et al., 2022) or gradient descent (Akyürek et al., 2022; von Oswald et al., 2022; Dai et al., 2022).

Language Model Plug-ins  Large language models can exploit external tools to improve their capabilities. Toolformer (Schick et al., 2023) introduces special symbols that allow large language models to call external APIs to complete tasks. Visual ChatGPT (Wu et al., 2023a) plugs vision models into ChatGPT, allowing for multimodal generation. HuggingGPT (Shen et al., 2023) uses ChatGPT to conduct task planning, select models according to their function descriptions available on Hugging Face, execute each subtask with the selected AI model, and summarize the response according to the execution results. Different from these works, our work is set in a classic supervised learning scenario and demonstrates that even a task like text classification, which is sometimes considered "solved" by smaller language models, can still benefit from combination with a large language model.

3 Super In-Context Learning

Super In-Context Learning (SuperICL) combines LLMs with locally fine-tuned smaller models, allowing them to work together to improve performance on supervised tasks. The smaller models act as plug-ins, providing task-specific knowledge and predictions, while the large pre-trained models focus on general language understanding. The overall workflow of SuperICL is shown in Figure 1 and the complete algorithm is depicted in Algorithm 1.

Plug-in Model Fine-tuning  The first step in the SuperICL process is fine-tuning a small NLP model, e.g., RoBERTa (Liu et al., 2019), on task-specific labeled data. Fine-tuning on the entire training data is made possible by the smaller size of the model and its local accessibility. This is in contrast to ICL, whose usage of labeled data is severely limited by the LLM's context length. The fine-tuned small model is then integrated as a plug-in for the LLM in the subsequent steps.

Context Construction  Next, a context is constructed for the LLM to utilize the task-specific knowledge provided by the smaller model. This context consists of a set of examples randomly sampled from the training data, along with their corresponding predictions by the smaller plug-in model. The predictions include both the predicted labels and their associated confidence scores. An example is shown in Table 1.
(a) Context Sentence 1: Federal agent Bill Polychronopoulos said it was not known if the man, 30, would be
charged.
Sentence 2: Federal Agent Bill Polychronopoulos said last night the man involved in the Melbourne
incident had been unarmed.
RoBERTa-Large Prediction: equivalent (Confidence: 0.51)
Label: not_equivalent

Sentence 1: Five more human cases of West Nile virus, were reported by the Mesa County Health
Department on Wednesday.
Sentence 2: As of this week, 103 human West Nile cases in 45 counties had been reported to the
health department.
RoBERTa-Large Prediction: not_equivalent (Confidence: 0.98)
Label: not_equivalent
...
(b) Test Input Sentence 1: Cooley said he expects Muhammad will similarly be called as a witness at a pretrial
hearing for Malvo.
Sentence 2: Lee Boyd Malvo will be called as a witness Wednesday in a pretrial hearing for fellow
sniper suspect John Allen Muhammad.
RoBERTa-Large Prediction: equivalent (Confidence: 0.82)
(c) Label Prediction Label: not_equivalent
(d) Explanation Explanation for overriding the prediction: The two sentences are talking about dif-
ferent people, John Allen Muhammad and Lee Boyd Malvo, and thus the prediction should be
not_equivalent.

Table 1: An example of the constructed context and inference procedure from the MRPC dataset. We first construct
the context by sampling from the supervised dataset and attach the plug-in model’s predictions. Then, for each test
example, we ask the large language model to predict the label based on the input and the plug-in model’s prediction.
We use a prompt to ask the model to explain the decision if the label predicted by the plug-in model is overridden.
The text field names (e.g., Sentence 1) are the original field names provided in the dataset.
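For readers who want to reproduce the layout of Table 1, the helper below shows one way to render an MRPC example into that textual format. The field names follow Table 1; everything else (function name, exact spacing) is our own illustrative choice and not necessarily the template used in the released code.

def render_mrpc_example(sent1, sent2, pred_label, confidence, gold_label=None):
    """Render one MRPC example in the Table 1 format.

    In-context examples include the ground-truth label; the test input does not,
    so the LLM is left to produce the final 'Label:' line itself.
    """
    lines = [
        f"Sentence 1: {sent1}",
        f"Sentence 2: {sent2}",
        f"RoBERTa-Large Prediction: {pred_label} (Confidence: {confidence})",
    ]
    if gold_label is not None:          # part (a): context example
        lines.append(f"Label: {gold_label}")
    else:                               # part (b): test input, label left blank
        lines.append("Label:")
    return "\n".join(lines)

# Example usage mirroring Table 1(b):
print(render_mrpc_example(
    "Cooley said he expects Muhammad will similarly be called as a witness at a "
    "pretrial hearing for Malvo.",
    "Lee Boyd Malvo will be called as a witness Wednesday in a pretrial hearing "
    "for fellow sniper suspect John Allen Muhammad.",
    pred_label="equivalent",
    confidence=0.82,
))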

On one hand, by incorporating the labels predicted by the plug-in model, the LLM can better understand the relationship among the input examples, the ground-truth labels, and the plug-in model's expertise. This helps the LLM in the subsequent decision-making process to produce final predictions. On the other hand, confidence scores provide a measure of the plug-in model's uncertainty in its predictions. By incorporating these scores in the context, the LLM can trust predictions where the plug-in model is highly confident and be more cautious when the plug-in model is uncertain. Furthermore, confidence scores can help guide the LLM's attention towards in-context examples that are more challenging, enabling it to learn from these difficult cases and potentially improve its overall performance on the task.

In summary, by considering both the predicted label and the associated confidence from the plug-in model, the LLM decides whether to follow the given predictions or to rely on its own understanding of the task, leading to more accurate predictions overall.

Inference  Once the context has been constructed, the test input (an example is shown in Table 1(b)) is concatenated with the context, forming a complete input for the large language model. The plug-in model's prediction for the test input, including the predicted label and confidence score, is also attached to the input. Thus, the LLM's input includes the context, the test input, and the plug-in model's prediction. The LLM then generates a final prediction for the test input, as shown in Table 1(c). Optionally, as shown in Table 1(d), the LLM can also provide an explanation for its prediction, giving insight into why it chose to override or follow the plug-in model's prediction. This additional interpretability can be valuable for understanding the decision-making process of the combined SuperICL model.

4 Experiments

4.1 Experimental Settings

Benchmarks  We focus on the fully supervised setting, where we have access to the entire training set. We conduct experiments on two widely used benchmarks: the GLUE benchmark (Wang et al., 2019) for natural language understanding tasks and the XNLI benchmark (Conneau et al., 2018) for zero-shot cross-lingual natural language inference, where the models are trained on English and tested on other languages. Our goal is to examine the learning ability of SuperICL on standard benchmarks and whether it can empower smaller models with its multilingual capability.
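The plug-in models used in these experiments are produced by the ordinary supervised fine-tuning step described in Section 3. A minimal sketch for MRPC with RoBERTa-Large is given below; it assumes the Hugging Face transformers and datasets libraries and uses illustrative hyperparameters that the paper does not specify.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Load the MRPC split of GLUE and a RoBERTa-Large backbone.
raw = load_dataset("glue", "mrpc")
tok = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2)

def encode(batch):
    # MRPC is a sentence-pair task; both sentences go into one encoded input.
    return tok(batch["sentence1"], batch["sentence2"], truncation=True, max_length=256)

encoded = raw.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mrpc-plugin",        # where the plug-in checkpoint is saved
        per_device_train_batch_size=16,  # illustrative hyperparameters, not the paper's
        num_train_epochs=3,
        learning_rate=1e-5,
    ),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tok,
)
trainer.train()
trainer.save_model("mrpc-plugin")  # later loaded as the SuperICL plug-in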

Methods MNLI-m MNLI-mm SST-2 QNLI MRPC QQP CoLA RTE Avg.
GPT-3.5 ICL 80.80 82.39 91.39 80.52 60.05 81.64 60.51 86.28 81.32
RoBERTa-Large 88.68 89.47 96.44 94.07 83.09 92.11 64.55 87.00 88.68
SuperICL 89.31 89.61 96.79 94.16 86.03 92.14 64.57 87.73 89.90

Table 2: Experimental results on GLUE (Wang et al., 2019) development set. The metric for CoLA is Matthews
Correlation and all other tasks use accuracy.

Lang.   GPT-3.5 ICL   XLM-V   SuperICL
en      74.03         83.55   83.87
ar      60.15         70.78   72.28
bg      67.64         77.09   77.74
de      71.78         75.23   80.28
el      65.85         72.73   74.29
es      76.79         77.07   81.38
fr      74.99         77.01   77.47
hi      56.29         69.62   70.02
ru      65.39         73.53   76.85
sw      56.13         67.43   68.94
th      57.03         68.90   69.36
tr      66.01         72.34   72.63
ur      51.18         63.57   57.90
vi      62.91         72.91   74.45
zh      67.90         73.75   74.21
Avg.    64.94         73.03   74.11

Table 3: Experimental results on the XNLI (Conneau et al., 2018) test set. The metric is accuracy.

For both ICL and SuperICL, we only consider the prediction to be correct when the generated label matches the predefined label exactly. For analytical experiments, we evaluate the model on a subset of GLUE consisting of three representative tasks, MNLI, SST-2 and MRPC, due to budget constraints.

Large Language Model and Plug-ins  We use OpenAI's text-davinci-003 language model, also known as GPT-3.5. For the GLUE benchmark, we use RoBERTa-Large (Liu et al., 2019) as the plug-in model. For the XNLI benchmark, we use XLM-V (Liang et al., 2023) as the plug-in model. Both models are fine-tuned on their respective tasks to serve as plug-ins for SuperICL. For GLUE tasks, we randomly select 32 examples from the training set. For XNLI, as the input is multilingual, the BPE tokenizer used in GPT-3.5 results in longer token sequences. Thus, we use at most 16 examples for each language. Note that for some languages (e.g., Thai), the in-context examples are fewer than 16, as we fit as many examples as possible into the maximum allowed sequence length of 4,096 tokens of GPT-3.5. For our main experiments and all analysis experiments, we compare the performance of SuperICL with ICL (Brown et al., 2020) on the same selection of in-context examples and with the predictions made by the plug-in models alone.
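The "as many examples as fit" rule above can be implemented as a simple greedy filter over the token budget. The sketch below uses the tiktoken tokenizer for counting; the 3,796-token cap is our own illustrative choice (4,096 minus headroom for the test input and the generated label), not a number taken from the paper.

import tiktoken

enc = tiktoken.encoding_for_model("text-davinci-003")

def pack_context(example_blocks, budget=3796):
    """Greedily keep rendered example blocks until the token budget is exhausted."""
    kept, used = [], 0
    for block in example_blocks:
        n = len(enc.encode(block))
        if used + n > budget:
            break
        kept.append(block)
        used += n
    return "\n".join(kept)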
4.2 Main Results

GLUE  As shown in Table 2, SuperICL outperforms both GPT-3.5 ICL and the plug-in model RoBERTa-Large, with an average advantage of 8.58 and 1.22 points on GLUE, respectively. It is worth noting that SuperICL consistently outperforms the baselines on all tasks, which makes it a reliable choice that does not compromise the performance of the plug-in model.

XNLI  For XNLI, as presented in Table 3, while XLM-V (Liang et al., 2023) is specifically designed for multilingual tasks, combining it with GPT-3.5 can still lead to significant improvements in most languages. However, SuperICL fails to enhance the performance of XLM-V for Urdu. It is worth mentioning that GPT-3.5 ICL also exhibits poor performance for Urdu, implying that GPT-3.5 may lack ability in low-resource languages like Urdu. This is also consistent with recent analyses of the multilinguality of GPT-3.5/ChatGPT (Lai et al., 2023). Additionally, since the BPE tokenizer used in GPT-3.5 yields more tokens for non-Latin languages, the number of in-context examples is limited, adversely affecting the model's performance. We believe that subsequent GPT models that employ a multilingual tokenizer, train on more non-English data, and have a longer maximum context can achieve even better performance for cross-lingual SuperICL.

4.3 Ablation Study

We conduct an ablation study to understand the effect of each component in SuperICL. We investigate three components: (a) the context, which comprises the in-context examples; (b) the confidence scores of the plug-in model for both the in-context examples and the test input; (c) the plug-in model's prediction for the test input. The experimental results are shown in Table 4:
Method          (a) Ctxt.  (b) Conf.  (c) Ref.   MNLI   SST-2  MRPC
GPT-3.5 ICL     –          –          –          80.80  91.39  60.05
RoBERTa-Large   –          –          –          88.68  96.44  83.09
(1)             ✓          ✓          ✗          81.23  92.43  65.69
(2)             ✓          ✗          ✓          88.75  96.67  83.09
(3)             ✗          ✓          ✓          88.89  96.44  83.59
(4)             ✗          ✗          ✓          88.84  96.44  83.58
SuperICL        ✓          ✓          ✓          89.31  96.79  86.03

Table 4: Experimental results of the ablation study. (a) Ctxt. means the in-context examples from the training set; (b) Conf. represents the plug-in model's confidence score; (c) Ref. means whether we use the plug-in model's prediction for the test input.

                       MNLI     SST-2     MRPC
%Overridden            0.22%    0.23%     12.50%
Overridden Accuracy    81.81%   100.00%   64.71%

Table 5: Statistics of overridden predictions. "%Overridden" indicates the percentage of final predictions that differ from the plug-in model's predictions, out of the total number of examples. "Overridden Accuracy" represents the percentage of correct predictions among the overridden ones.

[Figure 2 (histogram): x-axis RoBERTa confidence (0.5-1.0), y-axis number of examples; bars for all examples and for overridden examples.]

Figure 2: Effect of plug-in model confidence on overrides. The figure shows the distribution of RoBERTa confidence for all examples (blue) and for examples with a final prediction overridden by GPT-3.5 (orange) on MRPC.

(1) We first attempt to remove the plug-in model's prediction for the test input. This has a significant negative impact on the performance of ICL, as it creates a mismatch between in-context examples and the test input. Interestingly, even though we remove the plug-in model's prediction for the test input, SuperICL can still outperform ICL. We suspect this is due to an in-context effect similar to knowledge distillation (Hinton et al., 2015), which transfers task knowledge from the fine-tuned RoBERTa to GPT-3.5. (2) We attempt to remove the confidence scores from SuperICL, which results in a decrease in its performance. This is because GPT-3.5 becomes unaware of the uncertainty of RoBERTa and, as a result, is unable to determine when to override the prediction. Also, similar to removing the softmax score from knowledge distillation, removing the confidence score makes knowledge transfer less effective. (3) When removing all in-context examples, SuperICL is essentially doing zero-shot inference for the test input. Although there is a slight improvement over RoBERTa, we can see that adding in-context examples helps SuperICL learn to calibrate the confidence and override RoBERTa's predictions. Also, similar to ICL versus zero-shot inference, adding in-context examples helps GPT-3.5 improve its own task-specific performance. (4) Further removing confidence scores from zero-shot inference also slightly decreases the performance.

4.4 Analysis on Prediction Overrides

We also analyze the statistics of predictions overridden by GPT-3.5, as displayed in Table 5. A significant difference can be observed across the datasets. On both MNLI and SST-2, GPT-3.5 overrides only a minimal portion of examples (approximately 0.2%), but with high accuracy. Conversely, GPT-3.5 overrides a substantial 12.5% of the predictions made by RoBERTa, although with lower accuracy. These findings suggest that the override behavior of SuperICL is heavily reliant on the specific dataset and the performance of the plug-in model.

To gain insight into the decision-making process of the LLM in overriding the predictions of the plug-in model, we examine the distribution of confidence levels exhibited by RoBERTa and the extent to which GPT-3.5 overrides them, as shown in Figure 2. The findings reveal a pattern where GPT-3.5 tends to override predictions when the plug-in model's confidence is low. This behavior supports our motivation and indicates that GPT-3.5 recognizes the uncertainty associated with the plug-in model's predictions via the confidence scores.

4.5 Analysis on Example Selection

We compare ICL and SuperICL by analyzing their sensitivity to different in-context examples. To ensure a fair comparison, we randomly sample five batches of in-context examples for each dataset using different random seeds, ensuring that the same in-context examples are used for both ICL and SuperICL.
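A sketch of this sensitivity protocol is shown below: run each method with the same five seeds and report the variance of accuracy across seeds, which is how the "Var." column of Table 6 can be read. The helper evaluate(method, seed) is hypothetical and stands in for a full ICL or SuperICL run.

import statistics

SEEDS = [42, 0, 1, 2, 3]  # the seeds reported in Table 6

def seed_sensitivity(evaluate, methods=("ICL", "SuperICL")):
    """Accuracy per seed and sample variance per method.

    `evaluate(method, seed)` is a hypothetical callable: it samples the in-context
    examples with `seed`, runs the method, and returns dev-set accuracy. Sample
    variance appears to reproduce the Var. column of Table 6 (e.g., ~0.39 for ICL
    on MNLI), though the paper does not state which estimator it uses.
    """
    report = {}
    for method in methods:
        accs = [evaluate(method, seed) for seed in SEEDS]
        report[method] = {"accuracies": accs,
                          "variance": round(statistics.variance(accs), 2)}
    return report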

[Figure 3: three panels (MNLI, SST-2, MRPC) plotting accuracy against the number of in-context examples (1 to 32) for ICL, RoBERTa-Large, and SuperICL.]

Figure 3: Effect of number of examples on the performance of ICL and SuperICL. The results are averages of
three runs.

                     Random seed
Method      42     0      1      2      3      Var.
MNLI
ICL         80.80  81.26  79.74  81.26  80.79  0.39
RoBERTa     88.68  88.68  88.68  88.68  88.68  –
SuperICL    89.31  88.94  88.79  89.17  88.78  0.06
SST-2
ICL         91.39  94.04  94.38  93.12  93.46  1.35
RoBERTa     96.44  96.44  96.44  96.44  96.44  –
SuperICL    96.79  96.56  96.56  96.56  96.56  0.01
MRPC
ICL         60.05  73.53  73.28  73.28  65.44  37.50
RoBERTa     83.09  83.09  83.09  83.09  83.09  –
SuperICL    86.03  87.99  87.75  84.31  86.52  2.20

Table 6: Accuracy and variance of ICL and SuperICL with example selections randomly sampled with different seeds.

Method                MNLI   SST-2  MRPC
ICL                   80.80  91.39  60.05
RoBERTa-Large         88.68  96.44  83.09
SuperICL + RoBERTa    89.31  96.79  86.03
DeBERTa V3-Large      90.49  96.56  90.44
SuperICL + DeBERTa    90.76  96.79  90.93

Table 7: Experimental results of SuperICL with different plug-in models.

Test Set        R1     R2     R3     All
# Examples      1000   1000   1200   3200
ICL             59.50  52.40  52.58  54.69
RoBERTa-Large   41.60  27.40  24.58  30.78
SuperICL        56.10  42.70  44.17  47.44

Table 8: Zero-shot results on ANLI (Nie et al., 2019). ICL and SuperICL use in-context examples sampled from MNLI. The RoBERTa-Large model is fine-tuned on MNLI. R1, R2 and R3 denote the first, second and third rounds of adversarial attacks, respectively.

Our results, shown in Table 6, indicate that ICL exhibits a larger variance than SuperICL, especially on MRPC, and that its performance is drastically affected by the selection of in-context examples. On the other hand, SuperICL consistently outperforms both ICL and RoBERTa while maintaining a more stable performance.

We argue that SuperICL's improved stability is due to the added reference prediction from the plug-in model, which has been trained on a large supervised training set. This helps the LLM focus on learning to correct RoBERTa's predictions and override them when needed, minimizing the differences caused by different in-context examples.

4.6 Analysis on Number of Examples

We compare the effect of the number of examples on the performance of ICL and SuperICL in Figure 3. As shown, SuperICL can already achieve the best accuracy with merely 2 or 4 examples, while ICL requires more examples to achieve reasonable accuracy. Also, on the more difficult dataset MRPC, SuperICL can improve the performance of RoBERTa by a large margin when more in-context examples are present.

4.7 Analysis on Plug-in Models

We also attempt to replace the plug-in model with a state-of-the-art model, DeBERTa V3-Large (He et al., 2021), fine-tuned on the datasets. Our results, presented in Table 7, demonstrate that SuperICL continues to enhance the performance of state-of-the-art models, although the improvement is smaller compared to RoBERTa. This reduction may be attributed to the smaller capability gap between the small and large models. However, we expect that using an even more advanced large language model in the future will resolve this issue.
(1) MNLI
Premise: Then he sobered.
Hypothesis: He was always sober.
RoBERTa Prediction: neutral (Confidence: 0.7) ✗
Final Prediction: contradiction ✓
Explanation: The premise states that he sobered, implying that he was not sober before, while the hypothesis states that he was always sober, which contradicts the premise.
(2) MNLI
Premise: Without the discount, nobody would buy the stock.
Hypothesis: Nobody would buy the stock if there was a discount.
RoBERTa Prediction: contradiction (Confidence: 0.6) ✓
Final Prediction: entailment ✗
Explanation: The premise and hypothesis are essentially the same statement, but with the premise being a negative statement and the hypothesis being a positive statement. The model interprets this as a contradiction, but it should be an entailment.
(3) SST-2
Sentence: no screen fantasy-adventure in recent memory has the showmanship of clones' last 45 minutes.
RoBERTa Prediction: negative (Confidence: 0.54) ✗
Final Prediction: positive ✓
Explanation: The sentence is actually a positive statement, praising the last 45 minutes of the movie for its showmanship. The prediction was flipped because the sentence also contains the word "no" which can be interpreted as a negative sentiment.

Table 9: Examples of SuperICL's predictions and explanations. The contexts are omitted. The prompts here are modified for clarity. The original prompt template is shown in Table 1. Correct and incorrect predictions are marked with ✓ and ✗, respectively.
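The explanations in Table 9 come from the optional step of Algorithm 1 (line 11): when the final label differs from the plug-in prediction, the LLM is asked to justify the override. The sketch below shows one way to issue that follow-up request; apart from the "Explanation for overriding the prediction:" cue taken from Table 1(d), the wording and function name are our own illustration rather than the exact template used for Table 9.

import openai

def explain_override(context, test_block, final_label):
    """Ask the LLM why it overrode the plug-in model's prediction (Algorithm 1, line 11)."""
    prompt = (
        f"{context}\n{test_block}\n"
        f"Label: {final_label}\n"
        f"Explanation for overriding the prediction:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=100, temperature=0
    )
    return resp["choices"][0]["text"].strip()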

4.8 Analysis on Adversarial Robustness

Additionally, we analyze the adversarial robustness of SuperICL by testing it on ANLI (Nie et al., 2019). ANLI is a dataset for evaluating the robustness and generalization of natural language inference (NLI) models. It consists of 16,000 premise-hypothesis pairs that are categorized into three classes: entailment, contradiction, and neutral. The dataset is constructed in three rounds (R1, R2, and R3) and thus has three splits, with R3 being the most challenging and diverse. ANLI is collected with a human-and-model-in-the-loop training method, where human annotators act as adversaries and attempt to fool the model into misclassifying while still being understandable to other humans. This benchmark is designed to be challenging for language models including RoBERTa, as RoBERTa is attacked in R2 and R3 of the data construction.

As shown in Table 8, GPT-3.5 ICL is rather robust while RoBERTa-Large is vulnerable to adversarial attack. This directly has a negative impact on SuperICL: although SuperICL achieves better performance than RoBERTa-Large, it underperforms ICL. This finding suggests that SuperICL's performance relies on the performance of the incorporated plug-in model, and adversarial attacks on the plug-in model can lead to a drastic performance drop for SuperICL.

5 Case Study

We conduct a case study to better understand the behavior of SuperICL, with three examples presented in Table 9. We find that even without any explicit task instruction, GPT-3.5 demonstrates the ability to comprehend the tasks and explain its own reasoning. In the first example from Table 9, GPT-3.5 effectively grasps the implication in the premise that "he" was not sober. However, in the second example, GPT-3.5 incorrectly flips the prediction, possibly due to confusion caused by negation. This phenomenon has been recognized as a common flaw in LLMs, as noted by Hosseini et al. (2021) and Jang et al. (2023). In the last example, GPT-3.5 not only corrects RoBERTa's prediction successfully but also provides an analysis explaining why RoBERTa made the wrong prediction.

6 Conclusion and Future Work

In this paper, we propose SuperICL, a simple yet effective method for combining a large language model API with a locally fine-tuned plug-in model. For future work, we would like to explore using large language models to plan the fine-tuning of the local plug-in model for an unseen task and automate the entire workflow. Also, a theoretical analysis may be important to further reveal the internal mechanism of SuperICL.
Limitations

Additional Delay and Cost  Since SuperICL involves serialized small and large models, the total inference delay equals the sum of the inference delays of the two models. Also, calling the API of a large language model can be expensive compared to using a locally deployed small model.

Adversarial Vulnerability  As discussed in Section 4.8, the vulnerability of the plug-in model to adversarial attacks can be inherited by SuperICL. Thus, when the plug-in model is under adversarial attack, the entire system could underperform ICL.

Limited Evaluation Tasks  Due to space and budget limits, we only investigate text classification in this paper. However, it would be interesting to also look into generation tasks such as text summarization, question answering, and semantic parsing.

Broader Impact

As a technique that combines large and small language models for improved predictions, SuperICL shares the potential social biases of language models. While our approach is not likely to amplify these biases compared to other methods, it is important to investigate whether SuperICL has any effect on increasing or decreasing them. Furthermore, incorporating small models as plug-ins to the inference of large language models may lead to a slightly higher carbon footprint, resulting in a negative environmental impact. Therefore, practitioners should carefully consider the trade-offs between performance gains and environmental costs when using SuperICL.

Acknowledgements

We would like to thank Junheng Hao, Ziyi Yang, Dan Iter and Daya Guo for discussion.

References

Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke Zettlemoyer, and Marjan Ghazvininejad. 2022. In-context examples selection for machine translation. arXiv preprint arXiv:2212.02437.

Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. 2022. What learning algorithm is in-context learning? Investigations with linear models. arXiv preprint arXiv:2211.15661.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In NeurIPS.

Mingda Chen, Jingfei Du, Ramakanth Pasunuru, Todor Mihaylov, Srini Iyer, Veselin Stoyanov, and Zornitsa Kozareva. 2022. Improving in-context few-shot learning via self-supervised training. In NAACL-HLT, pages 3558–3573. Association for Computational Linguistics.

Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. arXiv preprint arXiv:1809.05053.

Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. 2022. Why can GPT learn in-context? Language models secretly perform gradient descent as meta-optimizers. arXiv preprint arXiv:2212.10559.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In ACL-IJCNLP, pages 3816–3830. Association for Computational Linguistics.

Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Arian Hosseini, Siva Reddy, Dzmitry Bahdanau, R. Devon Hjelm, Alessandro Sordoni, and Aaron C. Courville. 2021. Understanding by understanding not: Modeling negation in language models. In NAACL-HLT, pages 1301–1312. Association for Computational Linguistics.

Joel Jang, Seonghyeon Ye, and Minjoon Seo. 2023. Can large language models truly understand prompts? A case study with negated prompts. In Transfer Learning for Natural Language Processing Workshop, pages 52–62. PMLR.

Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, and Thien Huu Nguyen. 2023. ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning. arXiv preprint arXiv:2304.05613.
Itay Levy, Ben Bogin, and Jonathan Berant. 2022. Diverse demonstrations improve in-context compositional generalization. arXiv preprint arXiv:2212.06800.

Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, and Madian Khabsa. 2023. XLM-V: Overcoming the vocabulary bottleneck in multilingual masked language models. arXiv preprint arXiv:2301.10472.

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, et al. 2021. Few-shot learning with multilingual language models. arXiv preprint arXiv:2112.10668.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2022. What makes good in-context examples for GPT-3? In DeeLIO@ACL, pages 100–114. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In ACL, pages 8086–8098. Association for Computational Linguistics.

Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2022a. MetaICL: Learning to learn in context. In NAACL-HLT, pages 2791–2809. Association for Computational Linguistics.

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022b. Rethinking the role of demonstrations: What makes in-context learning work? In EMNLP, pages 11048–11064. Association for Computational Linguistics.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2019. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.

OpenAI. 2023. GPT-4 technical report.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. arXiv preprint arXiv:2303.17580.

Hongjin Su, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, et al. 2022. Selective annotation makes language models better few-shot learners. arXiv preprint arXiv:2209.01975.

Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. 2022. Transformers learn in-context by gradient descent. arXiv preprint arXiv:2212.07677.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR. OpenReview.net.

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023a. Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671.

Zhenyu Wu, YaoXiang Wang, Jiacheng Ye, Jiangtao Feng, Jingjing Xu, Yu Qiao, and Zhiyong Wu. 2023b. OpenICL: An open-source framework for in-context learning. arXiv preprint arXiv:2303.02913.

Zhiyong Wu, Yaoxiang Wang, Jiacheng Ye, and Lingpeng Kong. 2022. Self-adaptive in-context learning. arXiv preprint arXiv:2212.10375.

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. 2022. An explanation of in-context learning as implicit Bayesian inference. In ICLR. OpenReview.net.

Jiacheng Ye, Jiahui Gao, Zhiyong Wu, Jiangtao Feng, Tao Yu, and Lingpeng Kong. 2022. ProGen: Progressive zero-shot dataset generation via in-context feedback. In EMNLP (Findings), pages 3671–3683. Association for Computational Linguistics.

Jiacheng Ye, Zhiyong Wu, Jiangtao Feng, Tao Yu, and Lingpeng Kong. 2023. Compositional exemplars for in-context learning. arXiv preprint arXiv:2302.05698.

Kang Min Yoo, Junyeob Kim, Hyuhng Joon Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-Woo Lee, Sang-goo Lee, and Taeuk Kim. 2022. Ground-truth labels matter: A deeper look into input-label demonstrations. In EMNLP, pages 2422–2437. Association for Computational Linguistics.

Yiming Zhang, Shi Feng, and Chenhao Tan. 2022. Active example selection for in-context learning. In EMNLP, pages 9134–9148. Association for Computational Linguistics.

Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 12697–12706. PMLR.
