
Increasing Diversity While Maintaining Accuracy: Text Data Generation

with Large Language Models and Human Interventions

John Joon Young Chung Ece Kamar Saleema Amershi


University of Michigan Microsoft Research Microsoft Research
jjyc@umich.edu eckamar@microsoft.com samershi@microsoft.com

arXiv:2306.04140v1 [cs.CL] 7 Jun 2023

Abstract

Large language models (LLMs) can be used to generate text data for training and evaluating other models. However, creating high-quality datasets with LLMs can be challenging. In this work, we explore human-AI partnerships to facilitate high diversity and accuracy in LLM-based text data generation. We first examine two approaches to diversify text generation: 1) logit suppression, which minimizes the generation of language that has already been frequently generated, and 2) temperature sampling, which flattens the token sampling probability. We found that diversification approaches can increase data diversity but often at the cost of data accuracy (i.e., text and labels being appropriate for the target domain). To address this issue, we examined two human interventions, 1) label replacement (LR), correcting misaligned labels, and 2) out-of-scope filtering (OOSF), removing instances that are out of the user's domain of interest or to which no considered label applies. With oracle studies, we found that LR increases the absolute accuracy of models trained with diversified datasets by 14.4%. Moreover, we found that some models trained with data generated with LR interventions outperformed LLM-based few-shot classification. In contrast, OOSF was not effective in increasing model accuracy, implying the need for future work in human-in-the-loop text data generation.

1 Introduction

Training custom natural language classification models has become easier with many tools (e.g., Huggingface¹). However, data collection remains a costly part of model building. For example, existing open-source datasets may not be usable if they do not match the distribution of a model builder's target domain or do not contain desired labels. In such cases, the model builder may need to collect and label new data, which could be costly (e.g., in terms of the time and resources to scrape data or pay people to generate or annotate new data).

¹ https://huggingface.co/

Advances in generative large language models (LLMs), such as GPT-3 (Brown et al., 2020), present a novel approach for creating training data for classification models (Yoo et al., 2021; Sahu et al., 2022; Kumar et al., 2020). Model builders can prompt an LLM with the domain of texts and labels of interest, and the LLM can quickly generate text data for the model builder's needs. This approach allows model builders to acquire a large amount of data even when they initially have no or few data instances. With the generated data, the model builder can train a separate affordable model (e.g., BERT (Devlin et al., 2019)) to perform the specific task.

While LLMs can directly support this classification task with few-shot learning, it might not be the best option for every model builder: some might not have enough resources (e.g., GPUs) or budget (e.g., credit for GPT-3) to run expensive models. Others might be concerned about privacy or security issues when they use LLMs from external APIs (e.g., OpenAI API). In such cases, generating data from LLMs and training custom models could be a more viable approach. Moreover, if we share generated datasets within the community, we can also benefit those who do not have access to LLMs. Lastly, we can also use generated datasets to test models. With these benefits of generating new text datasets with LLMs, the practical concern is how to generate high-quality datasets.

In this work, we investigate human-AI partnerships to efficiently create high-quality datasets with LLM-based text generation. High-quality datasets should have high diversity and coverage, informing the extent of data that the model may encounter. At the same time, the generated text should have high accuracy, being relevant to the model's target task while having accurate accompanying labels. To these ends, we first study two technical approaches
to diversify text generation (Section 3): 1) logit suppression, which diversifies the generated texts by decreasing the probability of sampling tokens that have already appeared frequently in the previous generation, and 2) temperature sampling, which flattens the probability distribution of sampled tokens to pick less likely texts. From an experiment on eight classification tasks with GPT-3 as a text generator (Section 4), we found that diversification approaches can have mixed results. While increasing data diversity, these approaches can hurt accuracy in generation and similarity to the original datasets for the task.

We demonstrate that human interventions (Section 5) are the key to resolving these issues in text generation diversification. We examine human interventions of replacing inaccurate labels with accurate ones (label replacement) and filtering out-of-scope data (out-of-scope data filtering). With oracle studies (Section 6), we found that replacing all incorrect labels increased model accuracy by 14.4% when we used both logit suppression and high temperature. This performance increase brings practical benefits: without label replacement, the average accuracy of models trained with GPT-3-generated data was lower than that of GPT-3 classification with few-shot learning, but with 180 instances label-replaced, the models trained with generated data started to outperform GPT-3 few-shot classification. Out-of-scope data filtering had limited utility in increasing model accuracy, possibly due to the negative impact of removing training instances. We discuss how human interventions can further facilitate the diversity and accuracy of text data generation.

Our contributions are:

• A methodology that combines LLM generation approaches and human supervision for diversified and accurate data generation.

• An experiment showing how text generation diversification impacts the accuracy of trained models and other qualities of the data, such as diversity and accuracy in the generation.

• Oracle studies on how human effort to replace misaligned labels and filter out-of-scope data instances can impact the performance of models trained on data generated with text diversification.

2 Related Work

2.1 Text Data Generation for Model Training

In NLP, data augmentation, where data are multiplied based on existing data, is one context where text data are generated for model training. There have been many approaches, from replacing words with synonyms (Wei and Zou, 2019; Zhang et al., 2015), to randomly editing texts (Wei and Zou, 2019), predicting replaceable words (Ng et al., 2020), back-translating (Fadaee et al., 2017), generating label-flipped data (Zhou et al., 2022), or using reinforcement learning to condition generation (Liu et al., 2020). Inspired by MixUp (Zhang et al., 2018), which mixes different examples in vision data, researchers also blended texts to augment data (Guo et al., 2020; Sun et al., 2020; Zhang et al., 2022). Other approaches generate texts by learning from different datasets (Xia et al., 2020; Hou et al., 2018; Chen et al., 2020; Yoo et al., 2019).

Recently, with the generative capacity of LLMs, researchers proposed generating datasets with zero or very few samples and training a separate model to serve the specific task (Kumar et al., 2020; Yoo et al., 2021; Sahu et al., 2022; Yuan et al., 2021; Hartvigsen et al., 2022). As this approach extracts information from large models, it is analogous to knowledge distillation (Phuong and Lampert, 2019; Hinton et al., 2015) or dataset distillation (Wang et al., 2018; Cazenavette et al., 2022). LLM-generated data has also been used to test other trained models (Ribeiro and Lundberg, 2022; Perez et al., 2022). In this work, we extend the previous work by investigating the generation of high-quality data with accurate diversification.

2.2 Text Generation with LLMs

As the size of language models increases, researchers found that LLMs can serve different generation tasks based on input prompts and examples (Brown et al., 2020). This approach can be used to generate text data with instructional prompts and a few examples. However, for the generated data to be useful, diversity and coverage should be ensured. Control of the sampling temperature (Goodfellow et al., 2016) would be relevant, as it facilitates unlikely generation, but it was not evaluated for the facilitation of diversity and coverage. Inspired by previous work on controlling LLM generation, we examine human-AI approaches to steer data generation to have higher diversity while securing accuracy in the alignment of specified labels.
2.3 Human-In-The-Loop

Human interventions are imperative to train high-performance machine learning models, as people curate datasets, configure model architectures, and test the trained models. Researchers investigated approaches to make human interventions more interactive in model training pipelines, by closing gaps between model training and data curation (Fogarty et al., 2008; Amershi et al., 2009, 2012; Levonian et al., 2022), humans extracting features (Branson et al., 2010; Cheng and Bernstein, 2015), interactively changing the error patterns (Kapoor et al., 2010; Talbot et al., 2009), or interactively testing models (Wu et al., 2019; Yuan et al., 2022; Ribeiro et al., 2020; Cabrera et al., 2021; Suh et al., 2019). Generative models introduce novel approaches to interactively tune and evaluate models by leveraging generated results as data instances for training and testing (Ribeiro and Lundberg, 2022). In this work, we explored harnessing diversified and accurate datasets by combining LLM-based text generation and human interventions.

3 Diversified Text Data Generation

We lay out the desired characteristics of the datasets for model building. Then, we introduce approaches to generate diversified datasets with LLMs.

3.1 Goals

Ideal classification datasets need to have the following characteristics: 1) Scoped: fall in the model builder's domain of interest while classifiable with labels of interest, 2) Label accurate: accompany accurate labels, and 3) Diverse: cover cases the model would encounter during test time. These goals are difficult to achieve simultaneously but need to be balanced. Only considering diversity, randomly generating any text would be enough, but it would hurt scope and label accuracy. Likewise, only considering the scope and label accuracy, generating an accurate but limited variety of text would be enough, but it would hurt the diversity.

3.2 Diversifying Approaches

We introduce the setting to use LLM-based data generation for model training. Then, we lay out two approaches to promote diversity in text data generation. We also note their potential risks of harming the scope and accuracy.

[Figure 1: Examples of Diversification Approaches.]

3.2.1 Settings for Data Generation

When prompting LLMs, we consider 1) a text type and 2) labels in the prompts. While there can be many different prompts, in our paper, we used the following prompt:

    Write a movie review (text type) to cover all following elements
    Elements: positive sentiment (label)                              (A)
    Movie review (text type): "This is a great movie"

Model builders can also prepend examples in the same format. The generation process is iterative, and model builders can use intermediate data points as examples in later prompts. The model builders can generate data until they reach the desired number of data points. With the generated data, the model builder would finetune a separate smaller model that serves the target task. With this approach of finetuning a smaller model, there can be a question of whether finetuning a separate model would result in higher accuracy than using zero-shot or few-shot learning of the LLM. In the later study, we show the cases where finetuned smaller models perform better than the LLM.

3.2.2 Logit Suppression

Logit suppression is a diversification approach that suppresses tokens that have already been generated frequently in the intermediate dataset (Figure 1a). With this approach, the generation pipeline logs the frequency of tokens that have been generated so far. Then, to diversify the selection of tokens, logit suppression decreases the probability of high-frequency tokens. However, with this approach, some tokens that could contribute to accurate generation can be suppressed.

3.2.3 High Temperature

The temperature of the sampling distribution (Goodfellow et al., 2016) controls how "flat" the token sampling probability is (the equation is explained in Appendix A). High temperature leads to "flatter" token sampling probabilities (Figure 1b), increasing the probability of sampling "less likely" tokens and diversifying generation. Similar to logit suppression, extremely high temperatures can result in tokens irrelevant to the prompt, hurting accuracy in generation results.
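To make the two mechanisms concrete, the short sketch below (our illustration, not the paper's code) applies the temperature formula from Appendix A and a generic frequency-based penalty to a toy three-token distribution. The penalty_scale constant is an arbitrary choice for illustration; the actual experiments instead rely on the OpenAI API's temperature and logit_bias parameters as described in Section 4.1.2.

    import numpy as np

    def apply_temperature(p, T):
        # f_T(p)_i = p_i^(1/T) / sum_j p_j^(1/T)  (Appendix A)
        q = p ** (1.0 / T)
        return q / q.sum()

    def suppress_frequent(logits, counts, penalty_scale=2.0):
        # Subtract a penalty proportional to how often each token was already generated.
        freq = counts / max(counts.sum(), 1)
        return logits - penalty_scale * freq

    p = np.array([0.7, 0.2, 0.1])            # toy next-token distribution
    print(apply_temperature(p, T=1.3))       # flatter: less likely tokens gain probability
    print(apply_temperature(p, T=0.3))       # sharper: the most likely token dominates

    counts = np.array([120, 10, 0])          # how often each token has appeared so far
    logits = np.log(p)
    biased = suppress_frequent(logits, counts)
    print(np.exp(biased) / np.exp(biased).sum())  # the frequently generated token is demoted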
4 Experiment 1: Diversified Text Data Generation

We evaluated how diversification approaches impact the diversity of the generated data and the accuracy of models trained with the dataset.

4.1 Experiment Settings

4.1.1 Tasks

We used tasks from eight datasets. SST-2 (Socher et al., 2013) is a binary sentiment classification dataset from Rotten Tomatoes movie reviews. The clickbait classification dataset (CB) (Chakraborty et al., 2016) is news headlines labeled either clickbait or non-clickbait. CARER (Saravia et al., 2018) is Twitter statements labeled with one of six emotion categories. PubMed 200k RCT (Dernoncourt and Lee, 2017) has five classes regarding the roles of sentences in medical papers. The subjectivity dataset (SUBJ) is movie review texts labeled subjective or objective (Pang and Lee, 2004). The formality classification dataset (FO) (Lahiri, 2015) has labels on whether the text is formal or informal. HWU64 (Liu et al., 2021) is a dataset of human utterances to chatbots, and we used 18 domain classes for our experiments. The Corpus of Linguistic Acceptability (COLA) (Warstadt et al., 2019) is publication texts with annotations on whether the text is grammatically correct or not.

4.1.2 Generation Method

As a generative LLM, we used the text-davinci-002 model of GPT-3 through OpenAI API Access with Prompt A. We list the specific text types and labels used for each dataset in Appendix B.1. The generation process was iterative, with 20 data points generated with a single prompt for each API call. As a single prompt can only generate data instances for a single label, the generation process cycled through all considered labels while balancing the number of instances for each class. As our tasks dealt with short text data, we limited the generation length to 100 tokens. We set the frequency penalty and top p to 0.02 and 1, respectively. Except for SST-2, we generated 5600 instances for a single training dataset. For SST-2, we generated 6922 data points. We chose these numbers to ensure a low generation budget while having fair quality when training models. Specifically, with a maximum length of 100 tokens for each generated instance, if the prompt includes examples for n classes, the number of required tokens for each instance would be (100+30) × (n+1) (where 30 come from the instructional prompts). With the generation pricing of $0.02/1000 tokens for the text-davinci-002 model, 5600 and 6922 instances resulted in maximum spending of $14.56 × (n+1) and $17.80 × (n+1), respectively. In our pilot tests, model accuracy saturated after these numbers of instances.

For the oracle training dataset, with which we compared the quality of the datasets, we sampled instances from the original training dataset for the task. The test dataset was sampled from the original test dataset. We provide details on how we sampled these instances in Appendix B.2.

Generation Conditions  In addition to logit suppression and temperature sampling, we also consider example seeding, i.e., whether the generation pipeline begins with an initial set of example instances. We can use multiple approaches simultaneously (e.g., using logit suppression and temperature sampling together), and how these approaches interact is also within the scope of our questions. For a single combination of conditions, we generated three datasets, as there could be some variance in the results with the initial seeds and the examples generated initially.

We instantiated logit suppression with the logit bias function in OpenAI API Access², which can increase or decrease the probability of sampling tokens. Every time we completed a single generation iteration, we recorded the frequency of tokens generated by GPT-3. As the OpenAI API only allows 100 tokens for logit biasing, we suppressed only the 100 most frequently appearing tokens. Specifically, for the logit bias weights, we multiplied the token appearance ratio (in percentage) by -7.5 while capping the minimum weight at -7.5. For temperature sampling, we used four temperature values: 0.3, 0.7, 0.9, and 1.3. When seeding examples, we first randomly sampled 18 examples from oracle training data with a balanced number of labels. Only for PubMed, which has five classes, we used 15 seed examples. We used the sampled data points as an initial example pool. With example seeding, from the first generation iteration, examples were randomly chosen from the pool. Without the seeding examples, we completed the first cycle of generations as a zero-shot generation. After the first cycle, since we would have generated data instances for all labels, we added examples to the prompt. When adding examples, we randomly sampled the examples for all labels, one example for each label.

² https://beta.openai.com/docs/api-reference/completions/create#completions/create-logit_bias
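The sketch below shows one way the iterative generation loop described above might be wired together with the legacy OpenAI Python client; it reflects our reading of Section 4.1.2 rather than released code. The helper names (build_prompt, logit_bias_from_counts, generate_batch), the use of tiktoken's p50k_base encoding to count tokens, and the post-processing that cuts each completion at the closing quote are all our assumptions, and openai.api_key is assumed to be configured.

    import collections
    import openai            # legacy (0.x) OpenAI Python client
    import tiktoken

    enc = tiktoken.get_encoding("p50k_base")   # tokenizer assumed for text-davinci-002
    token_counts = collections.Counter()

    def build_prompt(text_type, label, examples):
        # Prompt A, optionally prefixed with one (label, text) example per label (Section 3.2.1).
        block = ('Write a {t} to cover all following elements\n'
                 'Elements: {l}\n'
                 '{T}: "{x}"')
        shots = "\n-----\n".join(block.format(t=text_type, l=el, T=text_type.capitalize(), x=ex)
                                 for el, ex in examples)
        query = (f'Write a {text_type} to cover all following elements\n'
                 f'Elements: {label}\n'
                 f'{text_type.capitalize()}: "')
        return (shots + "\n-----\n" + query) if examples else query

    def logit_bias_from_counts(counts, scale=-7.5, floor=-7.5, top_k=100):
        # Appearance ratio (in percent) times -7.5, floored at -7.5, for the 100 most frequent tokens.
        total = sum(counts.values()) or 1
        return {str(tok): max(scale * (100.0 * c / total), floor)
                for tok, c in counts.most_common(top_k)}

    def generate_batch(text_type, label, examples, temperature):
        resp = openai.Completion.create(
            model="text-davinci-002",
            prompt=build_prompt(text_type, label, examples),
            max_tokens=100, n=20,
            temperature=temperature, top_p=1, frequency_penalty=0.02,
            logit_bias=logit_bias_from_counts(token_counts),
        )
        texts = [c.text.split('"')[0].strip() for c in resp.choices]  # assumed post-processing
        for t in texts:                      # log token frequencies for later suppression
            token_counts.update(enc.encode(t))
        return texts

In a full run, generate_batch would be called while cycling through the considered labels so that the classes stay balanced.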
[Figure 2: grouped bar charts of Model Accuracy, Label Accuracy, Diversity, and Similarity. Legend: Oracle, GPT Zero, GPT Few, Base Similarity; Temp = 0.3 / 0.7 / 0.9 / 1.3, each with Logit Sup = X or O; Example = X or O.]

Figure 2: Impact of logit suppression and high temperatures on model accuracy, label accuracy, diversity, and
similarity to the oracle dataset, averaged across eight tasks. Bars without hatches start generation without examples
while those with hatches start with few-shot generation. Throughout this paper, error bars indicate 95% confidence
interval.

4.1.3 Training Method

With the generated data, we finetuned base size BERT (Devlin et al., 2019) classifiers with 109M parameters, using pretrained weights from the Huggingface Transformers library (Wolf et al., 2020) with a randomly initialized fully connected classifier layer. For each dataset, we trained five different models with the same dataset. With three datasets for each combination of approaches, this resulted in 15 models for a condition. While training, the Adam optimizer was used, with a learning rate of 3e-5 and a warm-up period of 3 epochs. We adopted early stopping with a patience of five training epochs. We used PyTorch and RTX A6000 GPUs for training.
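A minimal fine-tuning sketch matching this setup is given below, assuming Hugging Face transformers and PyTorch. The hyperparameters (learning rate 3e-5, 3-epoch warm-up, early-stopping patience of 5) come from Section 4.1.3; the choice of a linear warm-up scheduler, the value of max_epochs, and the train_loader, val_loader, and evaluate helpers are placeholders we introduce for illustration.

    import torch
    from transformers import (BertForSequenceClassification, BertTokenizerFast,
                              get_linear_schedule_with_warmup)

    # Placeholders: train_loader / val_loader yield dicts with "text" (list of str) and
    # "label" (LongTensor); evaluate() returns validation accuracy. num_labels is per task.
    num_labels, max_epochs = 2, 50           # max_epochs is an assumption; the paper gives none

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=num_labels).cuda()

    optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)
    scheduler = get_linear_schedule_with_warmup(          # one way to realize the 3-epoch warm-up
        optimizer, num_warmup_steps=3 * len(train_loader),
        num_training_steps=max_epochs * len(train_loader))

    best_acc, patience, bad_epochs = 0.0, 5, 0
    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:
            enc = tokenizer(batch["text"], padding=True, truncation=True,
                            return_tensors="pt").to("cuda")
            loss = model(**enc, labels=batch["label"].cuda()).loss
            loss.backward()
            optimizer.step(); scheduler.step(); optimizer.zero_grad()
        acc = evaluate(model, val_loader)                 # placeholder evaluation helper
        if acc > best_acc:
            best_acc, bad_epochs = acc, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                    # early stopping, patience of five epochs
                break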
4.2 Metrics

We compared the accuracies of models trained with generated data to 1) models trained with oracle datasets (oracle model) and 2) GPT-3's few-/zero-shot classifications (text-davinci-002). For GPT-3 few-shot learning, we used 18 examples (15 only for PubMed) with the same number of examples for each label. We also measured the diversity of the dataset using the Remote-Clique metric (Rhys Cox et al., 2021), which is the average mean pairwise distance. Specifically, we embedded the generated data with BERT (Devlin et al., 2019), then calculated the distances. We also evaluated label accuracy, which is the accuracy of the alignment between the generated texts and the specified labels. For this metric, except for SST-2, we used the oracle model as the evaluator. For SST-2, we used GPT-3 few-shot classification as the evaluator, as it has higher accuracy than the oracle model. We also measured the similarity of the generated dataset to the oracle dataset with the average mean pairwise distance between the two. For similarity, we also used BERT to embed the generated texts.
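As a sketch of the two embedding-based measures, the snippet below computes Remote-Clique diversity as the mean pairwise distance within a set of BERT embeddings, and the cross-set mean pairwise distance on which the similarity measure is based. The use of Euclidean distance and the random stand-in embeddings are our assumptions, as the paper does not pin these details down.

    import numpy as np
    from scipy.spatial.distance import cdist

    def remote_clique(embeddings):
        # Average mean pairwise distance within one dataset (Rhys Cox et al., 2021).
        d = cdist(embeddings, embeddings)      # Euclidean by default; an assumption here
        n = len(embeddings)
        return d.sum() / (n * (n - 1))         # exclude self-distances on the diagonal

    def cross_set_distance(generated, oracle):
        # Average mean pairwise distance between generated and oracle embeddings.
        return cdist(generated, oracle).mean()

    gen = np.random.rand(100, 768)             # stand-ins for BERT embeddings
    ora = np.random.rand(100, 768)
    print(remote_clique(gen), cross_set_distance(gen, ora))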
4.3 Results

Figure 2 shows the results of the first experiment for all tasks. The first column shows the model accuracy results. It also shows the accuracy of zero-shot and few-shot GPT-3 classification (gray solid and dashed line, respectively) and the model trained with the oracle training dataset (purple line). The second column shows the label accuracy, and the third column shows the diversity. The diversity plots also show the diversity of oracle datasets (purple line). The last column shows the similarity. It also shows the base similarity (brown line), which is the average distance between all the different datasets that we considered.

First, to evaluate how diversity, label accuracy, and similarity impact model accuracy, we performed a linear regression analysis. The analysis showed that label accuracy, diversity, and similarity are positively correlated with model accuracy, with significance (coef=.4797 and p<0.001 for label accuracy, coef=.2260 and p<0.001 for diversity, and coef=0.1980 and p<0.005 for similarity).
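The regression itself is straightforward; a sketch with statsmodels is shown below. The paper does not say which package was used, and the synthetic arrays only stand in for the per-dataset measurements of label accuracy, diversity, similarity, and model accuracy.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    # Demo data only; in the experiment each row would be one generated dataset's measurements.
    label_acc = rng.uniform(0.5, 1.0, 60)
    diversity = rng.uniform(0.0, 0.2, 60)
    similarity = rng.uniform(0.75, 1.0, 60)
    model_acc = 0.48 * label_acc + 0.23 * diversity + 0.20 * similarity + rng.normal(0, 0.02, 60)

    X = sm.add_constant(np.column_stack([label_acc, diversity, similarity]))
    fit = sm.OLS(model_acc, X).fit()
    print(fit.params[1:])    # coefficients for label accuracy, diversity, similarity
    print(fit.pvalues[1:])   # corresponding p-values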
Regarding specific patterns, logit suppression increased diversity while hurting the label accuracy and the similarity to the oracle dataset. High temperature increased diversity and decreased label accuracy, but to a smaller degree than logit suppression. The application of each diversification approach increased the model accuracy, but when used together, the benefits did not add up. For instance, in Model Accuracy of Figure 2, both high temperature (1.3, light red bars) and logit suppression (dark blue bars) could increase the model accuracy over using a low temperature (0.3, light blue bars). However, when using them together (dark red bars), the resulting accuracy was not much different from only using high temperatures (light red bars). This indicates that the effect of logit suppression is diminished when high temperatures and logit suppression are used together. Seeding examples increased label accuracy and model accuracy. Examples also slightly increased diversity when used without logit suppression. Whether models trained with LLM-generated data would have higher accuracy than zero- or few-shot learning of LLMs depends on the task. We provide detailed results for each task in Appendix C.

5 Human Interventions to Fix Inaccurate Text Generation

The first study shows that diversifying approaches can have mixed effects, hurting the accuracy in generation. We propose two human interventions to improve the generated data, based on issues that we found from qualitatively analyzing the generated data. The first is label replacement (LR), switching the misaligned label to the correct one. The second is out-of-scope data filtering (OOSF), which removes instances that are outside the domain of interest and do not match any labels (OOS instances).

While LR and OOSF might facilitate accurate generation with diversifying approaches, inspecting all data points can require a lot of effort. Hence, we propose a simple way to scale the effort of the model builder, which is training a proxy model. With this approach, model builders first label a small number of data points. Then, with those labels, they train binary classifiers as proxy models, where each learns about a single label (i.e., a label class from the labels of interest, or whether the instance is out of scope). For unlabeled data points, proxy models can make inferences on behalf of the model builder. We introduce the specific implementation of this approach in Section 6.

6 Experiment 2: Human Interventions for Diversified Text Generation

We evaluated LR and OOSF. Except for adding LR and OOSF, we used the same tasks, datasets, training methods, and metrics as in Section 4. In this section, we focus on reporting results for two temperature values, 0.3 and 1.3. We present the results with the rest of the temperatures in Appendix E. Also, in this section, when reporting, we merged conditions with and without example seeding.

6.1 Experiment Settings

6.1.1 Label Replacement

For LR, we conducted an oracle experiment. For each task, we used the highest accuracy model as the oracle labeler. Therefore, we used oracle models as labelers, but only for SST-2, we used GPT-3 few-shot classification as the labeler. We conducted LR on the datasets generated in experiment 1.

We had two approaches for LR: 1) do LR on all data points and 2) use proxy models with LR on partial data. For 1), we inspected all generated texts with simulated labelers and replaced labels as the labelers predicted. For 2), we sampled a set of instances from the generated dataset, applied the oracle labeler to them, and then trained proxy models with those data. Specifically, we sampled 90, 180, or 270 data instances. When training, for each class, we trained a proxy model that performs binary classification for the class. For each proxy model, the data instances labeled with the target label were used as positive instances, while the rest were used as negative instances. We applied proxy models to the uninspected data to obtain confidence scores for each label. For each class, we calculated the final score as follows:

    S_f,i = S_s,i * w + S_p,i * (1 - w)    (1)

where for the class i, S_f,i is the final score, S_p,i is the confidence score of the proxy model, S_s,i indicates whether the class was specified when generating the text (1 when the class is specified, 0 otherwise), and w is the weighting constant. We considered S_s,i as there can be a chance that the proxy model is inaccurate and the correct labels are swapped. For our experiment, we used w of 0.3. We chose the label with the highest final score as the replacement label.
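A sketch of this partial-LR procedure, as we read Section 6.1.1, is shown below: one binary LinearSVC per class trained on BERT-embedded inspected instances, combined with the specified-label indicator through Equation (1) with w = 0.3. Squashing the SVM decision function into [0, 1] to obtain the confidence score S_p,i is our assumption, and the embedding arrays and variable names are placeholders.

    import numpy as np
    from sklearn.svm import LinearSVC

    W = 0.3  # weighting constant w from Equation (1)

    def train_proxy_models(X_inspected, y_inspected, classes):
        # One binary classifier per class: target class vs. the rest (Section 6.1.1).
        models = {}
        for c in classes:
            clf = LinearSVC(max_iter=10000)
            clf.fit(X_inspected, (y_inspected == c).astype(int))
            models[c] = clf
        return models

    def replace_labels(models, X_rest, specified_labels, classes):
        new_labels = []
        for x, spec in zip(X_rest, specified_labels):
            scores = {}
            for c in classes:
                s_p = 1.0 / (1.0 + np.exp(-models[c].decision_function([x])[0]))  # assumed squashing
                s_s = 1.0 if c == spec else 0.0
                scores[c] = s_s * W + s_p * (1.0 - W)     # Equation (1)
            new_labels.append(max(scores, key=scores.get))
        return new_labels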
Task      Ratio      Task      Ratio
CARER     20.56%     CB         1.39%
COLA       0.00%     FO         0.56%
HWU64      0.28%     PubMed     1.11%
SST-2      3.61%     SUBJ       3.06%

Table 1: Ratio of out-of-scope instances from 360 samples.

Task      Accuracy (std)     Task      Accuracy (std)
CARER     94.93 (2.20)       CB        100 (0.00)
SST-2     97.18 (0.89)       SUBJ      97.5 (1.04)

Table 2: OOSF proxy model performance. Note that CB only had five OOS instances, with one used for test.

Figure 3: Impact of label replacement on label accuracy and model accuracy. Throughout this paper, error areas indicate 95% confidence interval.
For training proxy models, we trained linear support vector classifiers with a maximum iteration of 10000 while using texts embedded with BERT (Devlin et al., 2019) as input. We chose to train multiple proxy models for each class over training a single proxy model for all classes, as it tends to be more reliable in our pilots when there are many classes. As the labeling of the proxy model depends on the initial samples, for each generated dataset in experiment 1, we applied the approach five times.

6.1.2 Out-of-Scope Filtering

With OOSF, we first tried to understand how OOS instances occur. Therefore, we sampled 360 data instances for each task from the union of all the datasets generated for the task. Then, an author served as the oracle and annotated if they were OOS or not. Note that, as the definition of OOS instances, we filtered those instances that are outside the task domain or to which no label is applicable. We found that COLA, FO, HWU64, and PubMed have zero to four instances of OOS (Table 1). For the later analysis, we only considered the rest of the datasets, with at least five OOS instances. We present examples of OOS instances in Appendix D.1.

With the annotated data, we trained proxy models to annotate the instances unseen by the author, which were binary linear support vector classifiers with the maximum iteration of 10000 and BERT-embedded inputs. With the trained model, we did OOSF on the datasets generated in experiment 1. Table 2 shows the accuracy of the proxy model, when we divide the annotated data into training and test sets with an 8:2 ratio, with a split of ten times. Note that the perfect accuracy in CB is because we identified only five OOS instances from our samples, which are extremely few.
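Filtering follows the same pattern as the LR proxy models. A minimal sketch under our reading of Section 6.1.2 trains a single binary out-of-scope classifier on the author-annotated, BERT-embedded samples and drops generated instances predicted as out of scope; the helper names and the embedding step are placeholders.

    from sklearn.svm import LinearSVC

    def fit_oos_filter(X_annotated, is_oos):
        # Binary OOS classifier on BERT-embedded texts (Section 6.1.2).
        clf = LinearSVC(max_iter=10000)
        clf.fit(X_annotated, is_oos)          # is_oos: 1 = out of scope, 0 = in scope
        return clf

    def filter_dataset(clf, X_generated, texts, labels):
        keep = clf.predict(X_generated) == 0  # drop instances predicted as out of scope
        return ([t for t, k in zip(texts, keep) if k],
                [l for l, k in zip(labels, keep) if k])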
After applying LR or OOSF, we trained BERT models that serve the target task. For each dataset that applied LR without proxy models or used OOSF, we ran the training five times. For each dataset that used LR with proxy models, since each dataset from experiment 1 has been label-replaced five times, we ran training only once. With this approach, we acquired 15 model accuracy results for each task and condition.

6.2 Results

6.2.1 Label Replacement

Label Accuracy and Model Accuracy in Figure 3 show the results with LR. They show how model accuracy and label accuracy change with the number of instances inspected (x-axis). The other metrics, diversity and similarity, would not change with LR, as it keeps the texts as they are. For model accuracy, we also visualized the performance of oracle models and the GPT-3 few-/zero-shot classification.

LR increases the model accuracy and label accuracy. Moreover, with more labels inspected, the model accuracy and label accuracy further increased. LR also added more value to logit suppression. For example, without LR, using both high temperature (1.3) and logit suppression did not have a comparative benefit over using only high temperature. However, with label replacement, the addition of logit suppression started to benefit the model accuracy when using high temperature. When doing LR with proxy models, the benefit of logit suppression increased with more instances inspected, but with full LR, the size of this gap decreased a little. With LR of all instances, using both high temperature and logit suppression increased the absolute model accuracy by 17.8%, compared to when using neither.
Figure 4: The ratio of instances filtered with OOSF, and its impact on model accuracy, label accuracy, diversity, and
similarity, in aggregation across all tasks. As we examined the effect of OOSF with LR, for model accuracy and
label accuracy, numbers left to +OOS indicate how many instances are inspected with LR.

This was greater than the increase from diversification approaches when LR was not used (9.4%). Furthermore, with high temperature and logit suppression, using LR on all instances could increase the absolute model accuracy by 14.4% compared to not doing LR. When a high temperature and logit suppression are used together, the model accuracy outperformed GPT-3's few-shot classification when LR was done for 180 instances. Across tasks, we found that specific patterns in how diversification approaches and LR impact the model accuracy can vary between tasks. We provide details in Appendix E.1.

6.2.2 Out-of-Scope Instances Filtering

Figure 4 shows how many instances were filtered with OOSF and how it affects model accuracy, label accuracy, diversity, and similarity. We present model accuracy from both unbalanced and balanced data: when we balanced data, we used datasets with the same number of instances across different conditions by subsampling data to the smallest size of the filtered datasets. This was because filtering can make the number of instances differ between conditions. For unbalanced data, we did not balance the number of instances.

OOSF either increases or maintains label accuracy and similarity while decreasing or maintaining diversity, but there was no unified pattern in how it impacts the model accuracy. There tend to be few OOS-filtered instances without diversification approaches. For example, with a temperature of 0.3 and without logit suppression, OOSF removed very few data instances. Consequently, label accuracy, diversity, and similarity remained the same with OOSF. Without diversification approaches, the accuracy of trained models tends to be more unstable, with large confidence intervals. On the other hand, with diversification approaches, OOSF removed more instances, and hence there were slightly more changes in label accuracy, diversity, and similarity, with small increases in label accuracy and similarity while decreasing diversity. However, in some cases, these changes were subtle or within the 95% confidence intervals. Moreover, how OOSF changes the model accuracy depends on the specific task and condition. We provide the OOSF results for each task in Appendix E.2.

7 Conclusion

In this work, we investigate approaches to harness LLMs and human efforts to generate text classification datasets with high accuracy and diversity. We study two text generation diversification approaches, 1) logit suppression, which restrains generating already frequently generated tokens, and 2) high temperature, which flattens the sampling probability of tokens. We found that they diversify text generation but hurt the accuracy in aligning specified labels with the generated data. We experiment with two human intervention approaches, 1) replacing misaligned labels with more adequate ones, and 2) filtering out-of-scope instances. We found that replacing labels makes diversification approaches more beneficial by increasing the accuracy of models trained with the generated dataset. On the other hand, efficient filtering of out-of-scope instances did not have a positive impact on the model accuracy.

8 Limitations

Our implementation of proxy models applies those models after the whole dataset is generated. Due to this, in the resulting dataset, the number of instances can often be unbalanced between labels. Such a limitation might be addressable by training proxy models from intermediate datasets with a smaller number of instances, and using those models while generating the rest of the dataset. As the data become unbalanced during the generation,
the generation pipeline can try to generate more instances with labels that are a minority in the intermediate dataset. However, when we piloted this approach, we identified potential problems. First, intermediately trained proxy models could perform worse than those trained after all data are generated, due to the lower diversity in intermediate data used to train proxy models. Second, if many data points generated with a specific label (label a) actually belong to another label (label b), there can be cases where most instances of label b come from the prompt with label a. This can skew the linguistic patterns of instances within the dataset, as only a small number of texts for label b might have come from the prompt with label b. Advanced approaches to address these issues can be future work directions.

Our implementation of efficient OOSF was not effective in increasing model accuracy. It might be due to the negative impact of removing instances, such as filtering instances on the decision boundary. As our study of OOSF was not complete, future work is necessary. Applying OOSF to the entire generated dataset and seeing the impact of their removal would be the first step. With a more comprehensive understanding of OOSF, we would be able to design better OOSF strategies, such as filtering instances with various criteria.

In this work, we only examined the text-davinci-002 model of GPT-3. Although we believe that the overall trends of results would be similar for other models, examining other models with our approaches is necessary future work. We also examined only one prompt (Prompt A), while there may be other options. In Appendix F, we present partial results on using another prompt, showing that our approach is generalizable to other prompts. Combining human interventions with automatic annotation error detection (Klie et al., 2023) can be another future direction.

9 Ethics Statement

LLM-generated text data could have replicated biases within the used LLM. Diversification might alleviate such issues, as it steers the LLM to generate texts that it considers less probable, but bias can still exist after using the approach. More human intervention approaches can be a potential solution. For example, the model builder can provide more specific prompts and examples to counter the biased generation (Hartvigsen et al., 2022). However, these approaches still would have limitations, and how these approaches would impact the data bias and the resulting model performance would need to be further researched.

Acknowledgements

We want to thank Microsoft Research for supporting the work.

References

Saleema Amershi, James Fogarty, Ashish Kapoor, and Desney Tan. 2009. Overview based example selection in end user interactive concept learning. In Proceedings of the 22nd Annual ACM Symposium on User Interface Software and Technology, UIST '09, page 247–256, New York, NY, USA. Association for Computing Machinery.

Saleema Amershi, James Fogarty, and Daniel Weld. 2012. Regroup: Interactive machine learning for on-demand group creation in social networks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '12, page 21–30, New York, NY, USA. Association for Computing Machinery.

Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge Belongie. 2010. Visual recognition with humans in the loop. In Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV'10, page 438–451, Berlin, Heidelberg. Springer-Verlag.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Ángel Alexander Cabrera, Abraham J. Druck, Jason I. Hong, and Adam Perer. 2021. Discovering and validating AI errors with crowdsourced failure reports. Proc. ACM Hum.-Comput. Interact., 5(CSCW2).

George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A. Efros, and Jun-Yan Zhu. 2022. Dataset distillation by matching training trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Abhijnan Chakraborty, Bhargavi Paranjape, Sourya Kakarla, and Niloy Ganguly. 2016. Stop clickbait:
Detecting and preventing clickbaits in online news Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi,
media. In 2016 IEEE/ACM International Conference Maarten Sap, Dipankar Ray, and Ece Kamar. 2022.
on Advances in Social Networks Analysis and Mining ToxiGen: A large-scale machine-generated dataset
(ASONAM), pages 9–16. for adversarial and implicit hate speech detection.
In Proceedings of the 60th Annual Meeting of the
Jiaao Chen, Zichao Yang, and Diyi Yang. 2020. Mix- Association for Computational Linguistics (Volume
Text: Linguistically-informed interpolation of hid- 1: Long Papers), pages 3309–3326, Dublin, Ireland.
den space for semi-supervised text classification. In Association for Computational Linguistics.
Proceedings of the 58th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 2147– Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015.
2157, Online. Association for Computational Lin- Distilling the knowledge in a neural network.
guistics.
Yutai Hou, Yijia Liu, Wanxiang Che, and Ting Liu.
Justin Cheng and Michael S. Bernstein. 2015. Flock: 2018. Sequence-to-sequence data augmentation for
Hybrid crowd-machine learning classifiers. In Pro- dialogue language understanding. In Proceedings
ceedings of the 18th ACM Conference on Computer of the 27th International Conference on Computa-
Supported Cooperative Work & Social Computing, tional Linguistics, pages 1234–1245, Santa Fe, New
CSCW ’15, page 600–611, New York, NY, USA. Mexico, USA. Association for Computational Lin-
Association for Computing Machinery. guistics.

Franck Dernoncourt and Ji Young Lee. 2017. PubMed Ashish Kapoor, Bongshin Lee, Desney Tan, and Eric
200k RCT: a dataset for sequential sentence clas- Horvitz. 2010. Interactive optimization for steer-
sification in medical abstracts. In Proceedings of ing machine classification. In Proceedings of the
the Eighth International Joint Conference on Natu- SIGCHI Conference on Human Factors in Comput-
ral Language Processing (Volume 2: Short Papers), ing Systems, CHI ’10, page 1343–1352, New York,
pages 308–313, Taipei, Taiwan. Asian Federation of NY, USA. Association for Computing Machinery.
Natural Language Processing.
Jan-Christoph Klie, Bonnie Webber, and Iryna
Gurevych. 2023. Annotation Error Detection: An-
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
alyzing the Past and Present for a More Coherent
Kristina Toutanova. 2019. BERT: Pre-training of
Future. Computational Linguistics, 49(1):157–198.
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference of
Varun Kumar, Ashutosh Choudhary, and Eunah Cho.
the North American Chapter of the Association for
2020. Data augmentation using pre-trained trans-
Computational Linguistics: Human Language Tech-
former models. In Proceedings of the 2nd Workshop
nologies, Volume 1 (Long and Short Papers), pages
on Life-long Learning for Spoken Language Systems,
4171–4186, Minneapolis, Minnesota. Association for
pages 18–26, Suzhou, China. Association for Com-
Computational Linguistics.
putational Linguistics.
Marzieh Fadaee, Arianna Bisazza, and Christof Monz. Shibamouli Lahiri. 2015. Squinky! a corpus of
2017. Data augmentation for low-resource neural sentence-level formality, informativeness, and im-
machine translation. In Proceedings of the 55th An- plicature.
nual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), pages 567–573, Zachary Levonian, Chia-Jung Lee, Vanessa Murdock,
Vancouver, Canada. Association for Computational and F. Maxwell Harper. 2022. Trade-offs in sampling
Linguistics. and search for early-stage interactive text classifica-
tion. In 27th International Conference on Intelligent
James Fogarty, Desney Tan, Ashish Kapoor, and Simon User Interfaces, IUI ’22, page 566–583, New York,
Winder. 2008. Cueflik: Interactive concept learning NY, USA. Association for Computing Machinery.
in image search. In Proceedings of the SIGCHI Con-
ference on Human Factors in Computing Systems, Ruibo Liu, Guangxuan Xu, Chenyan Jia, Weicheng
CHI ’08, page 29–38, New York, NY, USA. Associa- Ma, Lili Wang, and Soroush Vosoughi. 2020. Data
tion for Computing Machinery. boost: Text data augmentation through reinforcement
learning guided conditional generation. In Proceed-
Ian J. Goodfellow, Yoshua Bengio, and Aaron Courville. ings of the 2020 Conference on Empirical Methods
2016. Deep Learning. MIT Press, Cambridge, MA, in Natural Language Processing (EMNLP), pages
USA. http://www.deeplearningbook.org. 9031–9041, Online. Association for Computational
Linguistics.
Demi Guo, Yoon Kim, and Alexander Rush. 2020.
Sequence-level mixed sample data augmentation. In Xingkun Liu, Arash Eshghi, Pawel Swietojanski, and
Proceedings of the 2020 Conference on Empirical Verena Rieser. 2021. Benchmarking Natural Lan-
Methods in Natural Language Processing (EMNLP), guage Understanding Services for Building Conver-
pages 5547–5552, Online. Association for Computa- sational Agents, pages 165–183. Springer Singapore,
tional Linguistics. Singapore.
Nathan Ng, Kyunghyun Cho, and Marzyeh Ghassemi. Richard Socher, Alex Perelygin, Jean Wu, Jason
2020. SSMBA: Self-supervised manifold based data Chuang, Christopher D. Manning, Andrew Ng, and
augmentation for improving out-of-domain robust- Christopher Potts. 2013. Recursive deep models for
ness. In Proceedings of the 2020 Conference on semantic compositionality over a sentiment treebank.
Empirical Methods in Natural Language Processing In Proceedings of the 2013 Conference on Empiri-
(EMNLP), pages 1268–1283, Online. Association for cal Methods in Natural Language Processing, pages
Computational Linguistics. 1631–1642, Seattle, Washington, USA. Association
for Computational Linguistics.
Bo Pang and Lillian Lee. 2004. A sentimental educa-
tion: Sentiment analysis using subjectivity summa- Jina Suh, Soroush Ghorashi, Gonzalo Ramos, Nan-Chen
rization based on minimum cuts. In Proceedings Chen, Steven Drucker, Johan Verwey, and Patrice
of the 42nd Annual Meeting of the Association for Simard. 2019. Anchorviz: Facilitating semantic data
Computational Linguistics (ACL-04), pages 271–278, exploration and concept discovery for interactive ma-
Barcelona, Spain. chine learning. ACM Trans. Interact. Intell. Syst.,
10(1).
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai,
Roman Ring, John Aslanides, Amelia Glaese, Nat Lichao Sun, Congying Xia, Wenpeng Yin, Tingting
McAleese, and Geoffrey Irving. 2022. Red teaming Liang, Philip Yu, and Lifang He. 2020. Mixup-
language models with language models. transformer: Dynamic data augmentation for NLP
tasks. In Proceedings of the 28th International Con-
Mary Phuong and Christoph Lampert. 2019. Towards ference on Computational Linguistics, pages 3436–
understanding knowledge distillation. In Proceed- 3440, Barcelona, Spain (Online). International Com-
ings of the 36th International Conference on Ma- mittee on Computational Linguistics.
chine Learning, volume 97 of Proceedings of Ma-
chine Learning Research, pages 5142–5151. PMLR. Justin Talbot, Bongshin Lee, Ashish Kapoor, and
Desney S. Tan. 2009. Ensemblematrix: Interactive vi-
Samuel Rhys Cox, Yunlong Wang, Ashraf Abdul, Chris- sualization to support machine learning with multiple
tian von der Weth, and Brian Y. Lim. 2021. Directed classifiers. In Proceedings of the SIGCHI Conference
diversity: Leveraging language embedding distances on Human Factors in Computing Systems, CHI ’09,
for collective creativity in crowd ideation. In Pro- page 1283–1292, New York, NY, USA. Association
ceedings of the 2021 CHI Conference on Human for Computing Machinery.
Factors in Computing Systems, CHI ’21, New York,
Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and
NY, USA. Association for Computing Machinery.
Alexei A Efros. 2018. Dataset distillation. arXiv
preprint arXiv:1811.10959.
Marco Tulio Ribeiro and Scott Lundberg. 2022. Adap-
tive testing and debugging of NLP models. In Pro- Alex Warstadt, Amanpreet Singh, and Samuel R. Bow-
ceedings of the 60th Annual Meeting of the Associa- man. 2019. Neural network acceptability judgments.
tion for Computational Linguistics (Volume 1: Long Transactions of the Association for Computational
Papers), pages 3253–3267, Dublin, Ireland. Associa- Linguistics, 7:625–641.
tion for Computational Linguistics.
Jason Wei and Kai Zou. 2019. EDA: Easy data augmen-
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, tation techniques for boosting performance on text
and Sameer Singh. 2020. Beyond accuracy: Be- classification tasks. In Proceedings of the 2019 Con-
havioral testing of NLP models with CheckList. In ference on Empirical Methods in Natural Language
Proceedings of the 58th Annual Meeting of the Asso- Processing and the 9th International Joint Confer-
ciation for Computational Linguistics, pages 4902– ence on Natural Language Processing (EMNLP-
4912, Online. Association for Computational Lin- IJCNLP), pages 6382–6388, Hong Kong, China. As-
guistics. sociation for Computational Linguistics.
Gaurav Sahu, Pau Rodriguez, Issam Laradji, Parmida Thomas Wolf, Lysandre Debut, Victor Sanh, Julien
Atighehchian, David Vazquez, and Dzmitry Bah- Chaumond, Clement Delangue, Anthony Moi, Pier-
danau. 2022. Data augmentation for intent classi- ric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz,
fication with off-the-shelf large language models. In Joe Davison, Sam Shleifer, Patrick von Platen, Clara
Proceedings of the 4th Workshop on NLP for Conver- Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le
sational AI, pages 47–57, Dublin, Ireland. Associa- Scao, Sylvain Gugger, Mariama Drame, Quentin
tion for Computational Linguistics. Lhoest, and Alexander M. Rush. 2020. Transform-
ers: State-of-the-art natural language processing. In
Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Proceedings of the 2020 Conference on Empirical
Junlin Wu, and Yi-Shin Chen. 2018. CARER: Con- Methods in Natural Language Processing: System
textualized affect representations for emotion recog- Demonstrations, pages 38–45, Online. Association
nition. In Proceedings of the 2018 Conference on for Computational Linguistics.
Empirical Methods in Natural Language Processing,
pages 3687–3697, Brussels, Belgium. Association Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, and
for Computational Linguistics. Daniel Weld. 2019. Errudite: Scalable, reproducible,
and testable error analysis. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 747–763, Florence, Italy. Association for Computational Linguistics.

Congying Xia, Chenwei Zhang, Hoang Nguyen, Jiawei Zhang, and Philip Yu. 2020. CG-BERT: Conditional text generation with BERT for generalized few-shot intent detection.

Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, and Woomyoung Park. 2021. GPT3Mix: Leveraging large-scale language models for text augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2225–2239, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Kang Min Yoo, Youhyun Shin, and Sang-goo Lee. 2019. Data augmentation for spoken language understanding via joint variational generation. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI'19/IAAI'19/EAAI'19. AAAI Press.

Ann Yuan, Daphne Ippolito, Vitaly Nikolaev, Chris Callison-Burch, Andy Coenen, and Sebastian Gehrmann. 2021. SynthBio: A case study in faster curation of text datasets. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

Jun Yuan, Jesse Vig, and Nazneen Rajani. 2022. iSEA: An interactive pipeline for semantic error analysis of NLP models. In 27th International Conference on Intelligent User Interfaces, IUI '22, page 878–888, New York, NY, USA. Association for Computing Machinery.

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. 2018. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations.

Le Zhang, Zichao Yang, and Diyi Yang. 2022. TreeMix: Compositional constituency-based data augmentation for natural language understanding. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5243–5258, Seattle, United States. Association for Computational Linguistics.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, page 649–657, Cambridge, MA, USA. MIT Press.

Jing Zhou, Yanan Zheng, Jie Tang, Li Jian, and Zhilin Yang. 2022. FlipDA: Effective and robust data augmentation for few-shot learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8646–8665, Dublin, Ireland. Association for Computational Linguistics.

A Equation for Temperature Sampling

Mathematically, with the temperature T and the original probability of token i, p_i, the temperature-sampled probability of token i, f_T(p)_i, is denoted as below:

    f_T(p)_i = p_i^(1/T) / Σ_j p_j^(1/T)    (2)

B Experiment 1 Details

B.1 Prompts Used in LLM Generation

For each task, we used prompt A with text types and labels as in Table 3. For example, for CB, a prompt can look like the below with examples:

    Write a news headline to cover all following elements
    Elements: valid news
    News headline: "Zach Johnson Wins Sony Open"
    -----
    Write a news headline to cover all following elements
    Elements: clickbait                                            (B)
    News headline: "10 Of The Biggest Lies We Were Told In 2015"
    -----
    Write a news headline to cover all following elements
    Elements: clickbait
    News headline: "

B.2 Sampling Oracle Dataset

For the oracle dataset, if there are more than 5600 data points in the original dataset (CB, CARER, HATE, COLA, HWU64, SUBJ), we subsampled 5600 training data points. For SST2, we used all 6922 instances from the original dataset. Note that these numbers are the same as the number of generated data instances. For FO, we used the original training dataset as is (with 3622 data instances), as there are fewer than 5600 instances. For test datasets, from the same original dataset excluding instances used for the oracle dataset, we sampled 2400 data points for CB, CARER, HATE, and HWU64. For FO, COLA, SUBJ, and SST-2, we used the original test datasets as there were fewer than 2400 instances.

C Results of the Experiment 1 on Individual Datasets

Here, we introduce the results of the first experiment for individual tasks (Figure 5).
[Figure 5: per-task grouped bar charts of Model Accuracy, Label Accuracy, Diversity, and Similarity for a) CARER, b) CB, c) COLA, d) FO, e) HWU64, f) PubMed, g) SST2, h) SUBJ, with the same legend as Figure 2.]
Figure 5: Impact of logit suppression and high temperatures on model accuracy, label accuracy, diversity, and
similarity to the oracle dataset, for each task.
Task Text type Label → Label in prompts
CARER emotional tweet joy → expressing joy, anger → expressing anger, fear → expressing fear,
sadness → expressing sadness, love → expressing love, surprise → expressing surprise
CB news headline non-clickbait → valid news, clickbait → clickbait
COLA sentence grammatically acceptable → grammatically correct sentence,
grammatically unacceptable → grammatically incorrect sentence
FO sentence informal → informal, formal → formal
HWU64 human utterance to news → news, weather → weather, play → play, datetime → datetime, iot → iot,
a chatbot cooking → cooking, recommendation → recommendation, calendar → calendar,
music → music, takeaway → takeaway, lists → list, transport → transport, qa → qa,
social → social, general → general, alarm → alarm, email → email, audio → audio
PubMed sentence from a objective → sentence about objective, methods → sentence about methods, results →
medical paper sentence about results, conclusions → sentence about conclusions,
background → sentence about background
SST-2 movie review positive → positive sentiment, negative → negative sentiment
SUBJ sentence from a objective → objective statement, subjective → subjective statement
movie review

Table 3: Text types and labels used in prompts.
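For illustration, the mapping in Table 3 can be read as a small lookup that fills Prompt A per task; the snippet below encodes only two of the rows and is our sketch, not released code.

    # A small subset of Table 3's text types and label phrasings (illustrative only).
    PROMPT_SPEC = {
        "CB":    {"text_type": "news headline",
                  "labels": {"non-clickbait": "valid news", "clickbait": "clickbait"}},
        "SST-2": {"text_type": "movie review",
                  "labels": {"positive": "positive sentiment", "negative": "negative sentiment"}},
    }

    def prompt_for(task, label):
        spec = PROMPT_SPEC[task]
        return (f'Write a {spec["text_type"]} to cover all following elements\n'
                f'Elements: {spec["labels"][label]}\n'
                f'{spec["text_type"].capitalize()}: "')

    print(prompt_for("CB", "clickbait"))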

The benefit of logit suppression for each task depends on the combination of label accuracy, diversity, and similarity. Tasks with high base label accuracy tended to gain more model accuracy from logit suppression. For example, for CB and SST-2, the conditions with logit suppression were clear winners in model accuracy over other combinations of approaches. For other tasks, where overall label accuracy tended to be lower, logit suppression did not have large benefits. COLA was the extreme case: its label accuracy was about 50% in binary classification, indicating that the LLM was no better than random chance at generating label-accurate instances. In this case, logit suppression resulted in almost no increase in model accuracy, although it could still increase the diversity of the generated text. With PubMed, we observed an exception where label accuracy increased with logit suppression when example seeding and the high temperature (1.3) were not used (compare light and dark-colored unhatched bars in PubMed's Label Accuracy in Figure 5, except for the red bars). This was because GPT-3 generated many similar errors without logit suppression and seeded examples. Specifically, without logit suppression, when prompted to write a background sentence of a medical paper, GPT-3 generated many sentences starting with "The purpose of this study was," which is more about the objective.

For temperature as well, the specific patterns of how it affected label accuracy, diversity, and similarity differed between tasks. In PubMed, without logit suppression and example seeding, label accuracy even increased with higher temperatures, which was against the general pattern. In this case, similar to what we found with logit suppression, the lack of diversification approaches led to the generation of narrowly populated error instances. CARER was another case with a reversed trend: without logit suppression and seeded examples, the mean diversity was higher with a temperature of 0.7 than with a temperature of 1.3. This was because, with the high temperature of 1.3, many sentences started with "I'm so," (on average 3012 occurrences), which was less often the case for the lower temperatures of 0.7 and 0.9 (on average 841.5 occurrences). In CARER, when example seeding and logit suppression were not used, label accuracy was also higher with the temperature of 1.3 than with lower temperatures, although the means were within 95% confidence intervals. In this case, with the lower temperatures of 0.7 and 0.9, more instances started with "No matter what," continuing with advice on what to do in emotional situations. For such cases, no label is applicable since they are not self-expressions of emotion (on average, 32 occurrences with a temperature of 1.3 and 682.7 occurrences with temperatures of 0.7 or 0.9). Note that these are examples of out-of-scope instances. Summarizing the results of logit suppression and temperature sampling, these approaches increased diversity while hurting label accuracy, but the specific patterns varied between tasks.

The utility of example seeding for label accuracy and model accuracy also varied between tasks. For example, in the extreme case of COLA, examples did not increase label accuracy or model accuracy. How seeding examples impacts the generation of data similar to the oracle dataset also depends on the task.
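As a concrete illustration of the logit suppression approach discussed above, the sketch below counts token frequencies in the texts generated so far and passes negative biases for the most frequent tokens to the next generation call, while temperature is passed through to the sampler. This is an illustrative sketch rather than the exact implementation used in this work; the model name, bias strength, number of suppressed tokens, and the use of the pre-1.0 openai client and tiktoken are assumptions.

```python
# Illustrative sketch of logit suppression with an LLM API (not the exact
# implementation used in this work).
from collections import Counter

import openai
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # GPT-3-style BPE tokenizer

def generate_with_logit_suppression(prompt, generated_so_far, n_suppress=100,
                                    bias=-5.0, temperature=1.0):
    # Count how often each token id appears in the texts generated so far.
    counts = Counter()
    for text in generated_so_far:
        counts.update(enc.encode(text))
    # Penalize the most frequent tokens on the next call
    # (the API limits how many logit_bias entries a request may contain).
    logit_bias = {str(tok): bias for tok, _ in counts.most_common(n_suppress)}
    response = openai.Completion.create(
        model="text-davinci-002",   # placeholder model name
        prompt=prompt,
        max_tokens=64,
        temperature=temperature,    # higher values flatten the sampling distribution
        logit_bias=logit_bias,
    )
    return response["choices"][0]["text"].strip()
```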
Task | Example | Reason for filtering
CARER | "No matter what life throws at you, always remember to find joy in the little things. #HappyThoughts" | Not a self-expression of emotion
CB | "Valid News" | Not a news headline
SST-2 | "Jurassic World Fallen Kingdom" | Only a movie title
SUBJ | "For what it's worth," | Incomplete sentence; unable to decide subjectivity

Table 4: Examples of OOS instances.

Figure 6: Impact of label replacement on model accuracy and label accuracy, for each task (panels a: CARER, b: CB, c: COLA, d: FO, e: HWU64, f: PubMed, g: SST2, h: SUBJ), across all temperature values (x-axis: number of instances inspected, from 0 to All; conditions: temperatures 0.3/0.7/0.9/1.3 with and without logit suppression, plus Oracle and GPT Few references).

Figure 7: Impact of label replacement on model accuracy and label accuracy, for all tasks aggregated, across all temperature values.

For CARER, HWU64, and PubMed in Figure 5, there were cases where the model accuracy was higher than the accuracy of GPT-3's few-shot learning. Other tasks showed lower accuracy than GPT-3's few-shot learning, indicating that GPT-3 few-shot classification can be a better alternative to training a model with generated data if the model builder has the budget to continuously access GPT-3 and is willing to hand over data through the API. In Section 6, we show that human interventions can make the data generation approach applicable to more tasks by raising model accuracy above that of GPT-3's few-shot classification.
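For reference, the GPT-3 few-shot classification compared against here can be performed by prompting the model with a few labeled examples followed by the query. The sketch below is a minimal illustration under assumed details (pre-1.0 openai client, placeholder model name, and a simple text/label prompt format), not the exact prompt used in this work.

```python
# Minimal sketch of few-shot classification with GPT-3 (illustrative; the exact
# prompt format, model, and label verbalizations may differ from this work).
import openai

def few_shot_classify(examples, query_text, text_type="movie review"):
    # examples: list of (text, label) pairs shown to the model before the query
    demos = [f"{text_type}: {t}\nlabel: {l}" for t, l in examples]
    prompt = "\n\n".join(demos) + f"\n\n{text_type}: {query_text}\nlabel:"
    response = openai.Completion.create(
        model="text-davinci-002",   # placeholder model name
        prompt=prompt,
        max_tokens=5,
        temperature=0.0,            # deterministic label prediction
    )
    return response["choices"][0]["text"].strip()
```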
Figure 8: The ratio of instances filtered with OOSF, and its impact on model accuracy, label accuracy, diversity, and similarity, for all tasks aggregated, across all temperature values. As we examined the effect of OOSF together with LR, for model accuracy and label accuracy, the numbers to the left of +OOS indicate how many instances were inspected with LR.

D Experiment 2 Details

D.1 Examples of OOS Instances
We present examples of OOS instances in Table 4.

E Results of Experiment 2 on Varying Tasks

We present the results of Experiment 2 for individual tasks. Note that we also show results for all temperature values (0.3, 0.7, 0.9, and 1.3).

E.1 Label Replacement
Figures 6 and 7 show the LR results for individual tasks and for all tasks aggregated, respectively, with all temperatures. First, there were cases where logit suppression provided additional benefit on top of high temperature only when LR was applied (compare the thick and thin red lines in Model Accuracy of CARER, HWU64, and PubMed in Figure 6). Second, for tasks that already had high accuracy without LR (CB and SST-2), LR either resulted in very small model accuracy increases or even hurt the accuracy. For example, in SST-2, label accuracy was already high without LR, and doing LR with proxy models could even decrease label accuracy and model accuracy. Third, without diversification approaches, there were also cases where LR did not increase model accuracy much even though label accuracy was greatly increased (thin blue lines in Model Accuracy of CARER, CB, FO, PubMed, SST2, and SUBJ in Figure 6). This may show that fixing labels is more beneficial when there is enough diversity in the generated dataset. Fourth, CB, FO, and SUBJ were cases where models trained with generated data could outperform GPT-3's few-shot classification only with label replacement (some colored lines go above the gray dashed lines with LR in Model Accuracy of CB, FO, and SUBJ in Figure 6). Among them, with FO, inspecting only part of the instances could also raise model accuracy above that of GPT-3 few-shot classification. As expected, no approach outperformed the oracle models, as those models are used for LR. Fifth, for tasks with many classes (CARER, HWU64, and PubMed), when using LR with proxy models, performance tended not to increase much as the number of annotated instances increased (Model Accuracy of CARER, HWU64, and PubMed in Figure 6); larger leaps in model accuracy occurred when all instances were inspected. This may indicate the difficulty of training accurate proxy models when there are many classes to consider.
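To make the proxy-model variant of LR concrete, the sketch below trains a simple classifier on the human-inspected, corrected subset and uses its predictions to replace the labels of the remaining generated instances. This is an illustration only; the proxy model and features used in this work may differ (TF-IDF with logistic regression is an assumption here).

```python
# Illustrative sketch of label replacement (LR) with a proxy model (not the
# exact setup used in this work).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def label_replacement(inspected_texts, corrected_labels, remaining_texts):
    # Fit a proxy classifier on the human-corrected subset.
    proxy = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    proxy.fit(inspected_texts, corrected_labels)
    # Replace the labels of the uninspected generated instances with its predictions.
    return proxy.predict(remaining_texts)
```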
E.2 Out-of-Scope Filtering
Figures 8 and 9 show the OOSF results with all temperatures, for the aggregation of all tasks and for individual tasks, respectively. As mentioned in the main text, it was difficult to find a general pattern in how OOSF impacts model accuracy. The consistent patterns were that OOSF tended to increase or maintain label accuracy and similarity while decreasing or maintaining diversity.

F Results on Prompt C

On two tasks (FO, HWU64), we conducted the experiment with another instructional prompt (C):

    Show me a text type that has the following characteristics
    Characteristics: label
    text type: "Generated text"

We measured the model accuracy, label accuracy, diversity, and similarity of the generated datasets and also investigated how label replacement impacts label accuracy and model accuracy. The experiment setting was the same as in the main experiment, except for the prompt used. The trend in the results (Figure 10) was similar to that of prompt A.
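As an illustration of how prompt (C) could be instantiated per task using the text types and verbalized labels from Table 3, consider the following sketch; the helper name is ours, the exact spacing of the original prompt may differ, and the assumption that the LLM completes the text after the opening quotation mark is ours as well.

```python
# Hypothetical helper for filling in instructional prompt (C); the model is
# assumed to complete the generated text after the opening quotation mark.
def build_prompt_c(text_type: str, label_in_prompt: str) -> str:
    return (
        f"Show me a {text_type} that has the following characteristics\n"
        f"Characteristics: {label_in_prompt}\n"
        f'{text_type}: "'
    )

# e.g., build_prompt_c("sentence", "formal") for FO, or
# build_prompt_c("human utterance to a chatbot", "weather") for HWU64.
```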
Figure 9: The ratio of instances filtered with OOSF, and its impact on model accuracy, label accuracy, diversity, and similarity, for each task (panels a: CARER, b: CB, c: SST2, d: SUBJ; columns: ratio of unfiltered instances, unbalanced model accuracy, balanced model accuracy, label accuracy, diversity, similarity), across all temperature values. As we examined the effect of OOSF together with LR, for model accuracy and label accuracy, the numbers to the left of +OOS indicate how many instances were inspected with LR.

Figure 10: Results on prompt C (panels a: FO, b: HWU64), showing diversity, label accuracy, similarity, and model accuracy of the generated datasets against the number of instances inspected with label replacement, for low and high temperatures with and without logit suppression.
