
Text Encoders Lack Knowledge: Leveraging Generative LLMs for Domain-Specific Semantic Textual Similarity

Joseph Gatto, Omar Sharif, Parker Seegmiller, Philip Bohlman, Sarah Masud Preum
Department of Computer Science, Dartmouth College

arXiv:2309.06541v1 [cs.CL] 12 Sep 2023

Abstract

Amidst the sharp rise in the evaluation of large language models (LLMs) on various tasks, we find that semantic textual similarity (STS) has been under-explored. In this study, we show that STS can be cast as a text generation problem while maintaining strong performance on multiple STS benchmarks. Additionally, we show that generative LLMs significantly outperform existing encoder-based STS models when characterizing the semantic similarity between two texts with complex semantic relationships dependent on world knowledge. We validate this claim by evaluating both generative LLMs and existing encoder-based STS models on three newly collected STS challenge sets which require world knowledge in the domains of Health, Politics, and Sports. All newly collected data is sourced from social media content posted after May 2023 to ensure that the performance of closed-source models like ChatGPT cannot be credited to memorization. Our results show that generative LLMs outperform the best encoder-only baselines by an average of 22.3% on STS tasks requiring world knowledge. Our results suggest generative language models with STS-specific prompting strategies achieve state-of-the-art performance in complex, domain-specific STS tasks.

1 Introduction

The NLP community has seen rapid advancement in many areas since the onset of large language models (LLMs) trained using Reinforcement Learning with Human Feedback, including text summarization, machine translation, and problem solving, amongst others (Yang et al., 2023). One area that has not been well explored is the applicability of generative LLMs to Semantic Textual Similarity (STS) tasks.

In recent works, it has been explicitly suggested that LLMs are not well-suited for the STS-B task. Zhong et al. (2023) support this claim by showing ChatGPT is inferior to pre-trained RoBERTa models on a small (n=50) set of STS samples. Yang et al. (2023) suggest that STS-B, and more generally regression tasks, have "no use case" in the context of LLMs, citing the extreme misalignment between LLM training and the prediction of a continuous value. In this study, we aim to show that there are two intuitive reasons why LLMs are highly applicable to Semantic Textual Similarity. 1) World Knowledge: LLMs do not rely on human-labeled data, allowing them to be exposed to a broad range of world knowledge. Very little human-annotated domain-specific data exists for direct STS training or contrastive learning of sentence embeddings (Gao et al., 2021), making applications of text encoders to niche domains challenging. Thus, if we can apply LLMs to STS, we may greatly expand the set of problem domains where STS is impactful. 2) STS Regression May Align with Language Modeling: The STS task can be formulated such that the output space is constrained to the prediction of a continuous value between [0-1]. Such a formulation reduces the task to outputting similarity as a percentage (e.g. Text A and Text B are 60% similar). During pre-training, LLMs are very likely to see many texts that use percentages in various contexts, as humans frequently cite percentages in natural language. Thus, when we combine LLMs' strong pairwise textual reasoning capabilities with their predisposition to percentages in natural language, LLMs appear well-suited to the STS task.

A limitation of using LLMs for STS is that they can be highly expensive and inefficient. For example, STS models are often used in information retrieval, where the goal may be to compare a query text to a large number of documents and then rank the documents based on their similarity to the query (Nguyen et al., 2016). It may not be viable to leverage generative LLMs for such a task in production, as text generation can suffer from low throughput and high cost. However, there are many small-scale tasks in academic settings where the poor efficiency of LLMs for STS is of lesser concern. In the literature, we find small-scale applications of STS in the fields of psychology (Marjieh et al., 2022), community question answering (Hoogeveen et al., 2018), computational social science (Maldeniya et al., 2017), and propaganda detection (Mohtaj and Möller, 2022) which use generic text encoders for knowledge-intensive, domain-specific problems. In this study, we aim to show that LLMs are better suited than generic text encoders for such tasks.

We confirm our intuition that LLMs like ChatGPT are well-suited to perform STS by conducting the first thorough exploration of STS in the context of text generation. We evaluate two LLMs (i.e., ChatGPT and Llama2) for STS in the context of both existing STS benchmarks and domain-specific STS challenge sets. Our work identifies STS-specific prompting strategies that significantly outperform prompts from prior works (Zhong et al., 2023). Specifically, we find that mapping the original [0-5] similarity scale used in STS benchmarks to [0-1] significantly improves the performance of LLMs on the STS task. In other words, asking LLMs to infer similarity as a percentage improves performance compared to asking LLMs to use an arbitrary scale. See Figure 1 for an example STS prompt used in this study.

Figure 1: Comparing the performance of ChatGPT vs. a RoBERTa-based STS cross-encoder on a sample from our STS-Sports challenge set. This sample requires significant world knowledge, as proper inference requires knowing 1) that the Cowboys NFL team are often referred to as "America's Team" and 2) that "recovering" an onside kick is equivalent to "getting the ball back" with an onside kick. The prompt corresponds to our best-performing ChatGPT 0-Shot prompt found in Table 2.

On existing benchmarks, we find that a 0-Shot ChatGPT pipeline provides SOTA performance on the STS13 and STS15 datasets, with near-SOTA performance on STS14 and SICK-R (i.e. 0.45% and 0.51% difference in correlation, respectively) when compared to unsupervised SOTA models. Given the opaque nature of ChatGPT's training data, we confirm our results are not the result of memorization by collecting 3 new STS challenge datasets using texts written after May 2023 across three domains: health, sports, and politics. We develop each dataset such that similarity is difficult to quantify without significant world knowledge and demonstrate that ChatGPT provides SOTA performance for challenging domain-specific STS. A summary of our contributions is as follows:

• We introduce three new domain-specific STS challenge sets in the domains of Health, Politics, and Sports. We show that ChatGPT outperforms the closest text encoder baseline by an average of 22.3% on these STS challenge sets.

• We show that with STS-specific prompting strategies, ChatGPT achieves SOTA performance on two STS benchmark datasets and competitive performance on the remaining datasets when compared to SOTA text encoders.

• We analyze errors made by ChatGPT to guide future work on LLMs for STS.
2 Related Work

2.1 Supervised STS

In the supervised setting, STS is commonly evaluated as part of the GLUE benchmark, specifically on the STS-B dataset, where texts can be cross-encoded by an LLM and fine-tuned for regression. Supervised STS is largely limited to training on samples sourced from news headlines and image captions, making such models limited in scope when applied to new domains. LLMs are well-suited to generalize to domain-specific STS data as they contain vast world knowledge. We compare LLMs to both RoBERTa-base and RoBERTa-large (Liu et al., 2019) fine-tuned on the STS-B dataset on our 3 domain-specific datasets.

2.2 Unsupervised STS

Unsupervised STS occurs when two texts are independently encoded and then compared using measures of embedding similarity. A seminal work in the field of unsupervised STS is SBERT (Reimers and Gurevych, 2019), which displays how NLI samples can be used to teach BERT (Devlin et al., 2019) how to pool sequences of token embeddings to provide a single vector representation of a given text. Later improvements on SBERT include SimCSE (Gao et al., 2021), which leveraged contrastive learning to produce better sentence representations. Current state-of-the-art models such as GenSE (Chen et al., 2022) produce SOTA results on STS tasks via large-scale synthetic generation of contrastive training triplets.

LLMs and unsupervised STS models use different approaches for text encoding, making their direct comparison difficult. For example, unsupervised STS models excel at this specific task but have fewer parameters, while LLMs are not designed for regression but have far more parameters and are trained on large-scale unsupervised data. Nonetheless, evaluating LLMs in the 0-shot setting on unsupervised STS datasets can provide insights into their capabilities for STS.

3 Methods

3.1 Experimental Setup

Benchmarking LLMs on 0-Shot STS: We evaluate various STS-specific 0-shot prompting strategies. An example of our 0-shot inference can be found in Figure 1. We compare our approach to three baseline unsupervised STS models, which use encoder-only LMs to evaluate sentence representations. Specifically, we explore SBERT [1] (Reimers and Gurevych, 2019), SimCSE (Gao et al., 2021), and GenSE+ (Chen et al., 2022).

Domain-Specific STS: We explore the performance of 0-shot, few-shot, and chain-of-thought (COT) prompting strategies on our domain-specific datasets. Our 0-shot methodology on domain-specific texts follows our best 0-shot prompt as determined by performance on the benchmark STS datasets. For few-shot prompting, we use 5 examples which were manually crafted by the authors. Note, we did no prompt optimization but rather aimed to write a simple prompt that introduced the LLM to the label space, as suggested by Min et al. (2022). In each example, we use the same sentence 1 but a different sentence 2, producing evenly spaced similarity scores between 0 and 1 and exposing the model to the complete spectrum of the label space. Our COT prompting strategy follows a 1-shot paradigm, showing the model one example of how to reason about the solution step-by-step. The authors wrote the COT example and instructed the model to output the score between a set of brackets (e.g. [semantic similarity = 0.3]) to enable easy prediction extraction. All prompts used in this study can be found in Section B.2.

We compare LLMs to both supervised and unsupervised STS models. For supervised models, we use the RoBERTa-base and RoBERTa-large cross-encoders provided by the Sentence-Transformers library [2], which are fine-tuned on the STS-B dataset.

Evaluation Details: The evaluation pipeline follows Gao et al. (2021), which reports the Spearman's rank correlation between all predicted and ground truth similarity scores for all samples in a given dataset. To conduct our experiments, we evaluate two LLMs: 1) ChatGPT ('gpt-3.5-turbo-0301') from OpenAI and 2) Llama2-7b (Touvron et al., 2023) from Meta [3]. We choose these two models as they are extremely popular, easy to access, and represent the highest-performing LLMs at their given scales (Touvron et al., 2023). Note, we exclude GPT-4 from the experimentation due to its significantly higher cost.

[1] Huggingface model string: 'sentence-transformers/all-MiniLM-L6-v2'
[2] sbert.net
[3] Huggingface model string: 'Llama-2-7b-chat-hf'
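To make the two encoder-based baseline styles above concrete, the sketch below scores one text pair with a bi-encoder (unsupervised STS: encode independently, compare embeddings) and with an STS-B cross-encoder (supervised STS) via the Sentence-Transformers library. This is an illustrative sketch, not the authors' evaluation code: the bi-encoder ID comes from footnote [1], while 'cross-encoder/stsb-roberta-large' is an assumed checkpoint, since the paper only names the library.

```python
# Minimal sketch of the two encoder baseline families described above.
# 'cross-encoder/stsb-roberta-large' is an assumption about the exact checkpoint.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

text_a = "America's Team recovers the onside kick"   # illustrative pair
text_b = "The Cowboys get the ball back"

# Unsupervised / bi-encoder STS: encode each text independently, then compare.
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb_a, emb_b = bi_encoder.encode([text_a, text_b], convert_to_tensor=True)
print("cosine similarity:", util.cos_sim(emb_a, emb_b).item())

# Supervised STS: a cross-encoder fine-tuned on STS-B scores the pair jointly.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-large")
print("cross-encoder score:", cross_encoder.predict([(text_a, text_b)])[0])
```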
STS12 STS13 STS14 STS15 STS16 STS-B SICK-R
SBERT 72.37 80.60 75.59 85.39 78.99 82.03 77.15
SimCSE-BERT-B 75.30 84.67 80.19 85.40 80.82 84.26 80.39
SimCSE-RoBERTa-L 77.46 87.27 82.36 86.66 83.93 86.70 81.95
GenSE+ 80.66 88.18 84.69 89.03 85.82 87.88 80.10
Llama2-7b (Baseline Prompt [0-5]) 44.05 50.27 43.03 46.02 27.23 44.37 45.33
Llama2-7b (STS Prompt [0-5]) 42.59 41.66 30.37 33.30 26.62 35.79 39.30
Llama2-7b (STS Prompt [0-1]) 51.83 67.74 60.77 57.48 61.73 64.56 62.48
ChatGPT (Baseline Prompt [0-5]) 64.86 85.66 79.05 86.15 79.75 82.62 81.44
ChatGPT (STS Prompt [0-5]) 64.58 86.07 80.15 85.99 79.27 81.31 78.77
ChatGPT (STS Prompt [0-1]) 68.97 89.09 84.24 89.11 84.54 84.73 79.84

Table 1: Results comparing baseline encoder-only LMs to Llama2 and ChatGPT on the 7 standard STS datasets, based on Spearman correlation. We find that ChatGPT achieves SOTA results on STS13 and STS15 as well as extremely competitive performance on STS14 and SICK-R. Note: [0-5] prompts use the original similarity score scale of [0.0-5.0]. Our results show that mapping the labels to be between [0.0-1.0] provides a significant performance increase.
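For concreteness, the sketch below illustrates the kind of 0-shot pipeline behind Table 1: it issues a [0-1] STS prompt of the form shown in Figure 1 with the reported decoding settings, parses the first number in the reply, and evaluates with Spearman's rank correlation. It assumes the pre-1.0 openai Python client and is not the authors' exact code; the text pairs and gold scores are toy placeholders.

```python
# Hedged sketch of the 0-shot ChatGPT STS pipeline (Sections 3.1-3.2).
# Requires openai<1.0 and openai.api_key to be set; not the authors' exact code.
import re
import openai
from scipy.stats import spearmanr

PROMPT = ("Output a number between 0 and 1 describing the semantic similarity "
          "between the following two sentences:\nSentence 1: {s1}\nSentence 2: {s2}")

def score_pair(s1: str, s2: str) -> float:
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0301",
        messages=[{"role": "user", "content": PROMPT.format(s1=s1, s2=s2)}],
        temperature=0,
        top_p=1,
    )["choices"][0]["message"]["content"]
    match = re.search(r"\d*\.?\d+", reply)          # first number in the output
    return float(match.group()) if match else 0.0   # default to 0 if unparsable

# Toy evaluation: Spearman correlation between predictions and gold scores.
pairs = [("A man is playing a guitar.", "A person plays guitar."),
         ("A dog runs in the park.", "A dog sprints across the grass."),
         ("A dog runs in the park.", "Stocks fell sharply on Monday.")]
gold = [0.9, 0.8, 0.0]
preds = [score_pair(a, b) for a, b in pairs]
rho, _ = spearmanr(preds, gold)
print("Spearman's rho:", rho)
```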

We report results after a small grid search on the temperature and top-p hyperparameters of the LLMs. For both models, we use temperature = 0 and top-p = 1. Since Llama2 requires a non-zero temperature, we use 0.0001 as our zero-temperature parameter. Additional details regarding our hyperparameter selection can be found in Appendix B.1.

3.2 Extracting Predictions from LLMs

We use a simple string-parsing mechanism to extract predictions from generative LLMs. For 0-Shot and Few-Shot models, we simply return the first number output by the model. For COT methods, we extract the decimal found in the set of brackets which the LLM is instructed to produce during inference. If a text cannot be parsed (i.e. no number is output by the model), then we default to a prediction of 0 similarity.

We note some qualitative analysis regarding the above design choices. First, our highest-performing model, ChatGPT, is very good at following STS prompt instructions and thus almost exclusively outputs a single number, so we rarely default to 0. For lesser-performing models like Llama2, this happens more frequently, but it is still a rare occurrence.

3.3 Datasets

3.3.1 Benchmark Datasets

Each model is evaluated on the 7 standard STS benchmark datasets: STS12-16 (Agirre et al., 2012, 2013, 2014, 2015, 2016), STS-B (Cer et al., 2017), and SICK-R (Marelli et al., 2014). All samples in each dataset are annotated on a scale of [0-5], where the mean similarity score across multiple annotators is the final continuous value.

3.3.2 Challenge Datasets

We additionally evaluate each model on 3 newly collected datasets with data collected after May 2023 to ensure ChatGPT's performance is not due to memorization of any information regarding the standard STS benchmarks. Furthermore, this data allows us to evaluate each model's capacity to perform STS when greater world knowledge is required. Our three datasets are 1) STS-Sports: Reddit headlines about the National Football League (NFL) and National Basketball Association (NBA); 2) STS-Health: Texts sourced from online discussions on Reddit regarding Long COVID; and 3) STS-News: A Reddit dataset of recent political headlines. Each dataset has n=100 text pairs. The data was collected by the authors with the goal of semantic similarity labels being driven by world-knowledge relationships.

Each sample in each dataset consists of one real sample from a given source and one human-generated sample. Human-generated texts were written by the authors and crafted to contrast with the source sample in a manner that produces a diverse set of scores across the similarity spectrum. Specifically, high-similarity pairs often employ complex variations of the same information, which require world knowledge, while low-similarity pairs are often constructed to have high token overlap but low semantic similarity, requiring the model to focus deeply on the semantics.
We chose to manually construct texts as it is extremely difficult to collect samples such as those presented in Figure 1, where the texts are on the exact same topic but differ drastically in terms of their presentation. Each pair was annotated by three different researchers at the authors' institution, and the annotations were averaged to produce the final similarity score. Each annotator was ensured to be sufficiently knowledgeable about the domain within which they were annotating. The annotation guidelines provided were identical to those released for the STS13 task. The inter-annotator agreement for each dataset can be found in Appendix A, Table 3. Please refer to Appendix A for additional details on data collection, data statistics, and example data.

4 Results

4.1 0-Shot STS

Our 0-shot STS results on benchmark datasets are summarized in Table 1. We find that ChatGPT outperforms text encoders on the STS13 and STS15 datasets. Additionally, ChatGPT shows competitive performance on STS14 and SICK-R, where there is only a 0.45% and 0.51% difference between ChatGPT and the best encoder baseline. We find that the only dataset on which encoder models significantly outperform ChatGPT is STS12. This is in part due to the large number of linguistically incoherent texts in STS12. We further discuss the limitations of ChatGPT on certain types of texts in Section 5. Llama2, we find, performs poorly on 0-Shot STS on existing benchmarks. This suggests that STS may be an ability emergent at scale for LLMs, as our 7b-parameter Llama2 baseline significantly under-performs all other baselines on STS.

We find that the prompts explored in previous works, which prompt ChatGPT to perform STS on the original [0-5] similarity scale, perform significantly worse than when we map the labels between [0-1]. For example, our mapping translates to asking ChatGPT to predict that two texts have 80% similarity instead of 4/5 similarity. As shown in Table 1, "Baseline Prompt [0-5]" (taken from Zhong et al. (2023)) and "STS Prompt [0-5]" perform worse on 6/7 tasks, often by a large margin. We find it intuitive that LLMs have an easier time understanding and representing semantic similarity as a percentage, as percentages are commonly used to describe various phenomena in a variety of texts (thus making them more likely to appear in LLM training data), unlike comparisons which use a Likert scale.

Model Sports News Health
Unsupervised Models
SimCSE-R-L 58.87 62.47 50.98
GenSE+ 42.88 56.03 40.67
Supervised Models
RoBERTa-B 63.17 58.29 31.56
RoBERTa-L 63.59 65.56 50.33
Llama2 Experiments
0-Shot 47.34 44.58 37.10
Few-shot 66.52 58.04 46.51
COT 18.73 30.98 25.55
ChatGPT Experiments
0-Shot 80.99 87.21 78.11
Few-shot 82.28 80.81 68.28
COT 83.42 87.74 73.71

Table 2: Results comparing our two best unsupervised models (i.e., SimCSE-RoBERTa-Large and GenSE+) and two RoBERTa models fine-tuned on STS-B to LLMs on our three newly collected domain-specific datasets. We find that ChatGPT outperforms encoder-only models on all tasks by a significant margin. Note: All 0-Shot prompts follow the best 0-shot strategy as determined by results in Table 1.

4.2 Domain-Specific STS

In Table 2 we see the results of four different model families on our newly collected STS datasets, which heavily depend on world knowledge from three different domains. We find that across all domains, ChatGPT performs significantly better than Llama2 as well as both supervised and unsupervised STS models, beating the next closest model by an average of 22.3%. ChatGPT's competitive performance on the standard STS benchmarks demonstrates its ability to perform the task; it is thus intuitive that a model with diverse world knowledge should outperform existing off-the-shelf STS models which contain limited current world knowledge. For example, success on STS-Sports requires a model to know that Lebron James plays for the Los Angeles Lakers. STS-News requires the model to know that congresswoman Alexandria Ocasio-Cortez is known as AOC. STS-Health requires the model to know that "brain fog" is related to "confusion" and "lack of focus". This sort of niche knowledge seems unreasonable for many encoder models to contain, which is why we argue that ChatGPT is the best option for domain-specific, STS-dependent NLP tasks looking to employ an off-the-shelf model.

We note that while Llama2 under-performs ChatGPT in all experiments, it does get a significant performance increase in the Few-Shot setting when compared to 0-shot. This may suggest that smaller LLMs require more explicit instruction to perform well on the STS task. Future works may explore STS-specific in-context learning strategies that enable the use of smaller-scale LLMs on this task.

5 Where Does ChatGPT Fail on STS?

In this section, we analyze the top 500 predicted samples from ChatGPT with the largest absolute difference between prediction and ground truth across five STS datasets in the 0-shot setting (STS12-16). We aim to surface the types of text pairs ill-suited for semantic similarity modeling with ChatGPT.

5.1 Linguistic Acceptability

We qualitatively observed that ChatGPT struggles with samples that are syntactically or grammatically incoherent. We validate this claim by running a RoBERTa-base model fine-tuned on the COLA (Warstadt et al., 2018) dataset [4], which tests if a text is linguistically acceptable. We find that 34.6% of highly inaccurate predictions contain a linguistically unacceptable text. For example, consider the following sample from STS14:

Text 1: what isn 't how what was sold ?
Text 2: it 's not how it was sold , gb.
Ground Truth Similarity Score: 0.32

ChatGPT has very little content or semantics to rely on when analyzing two linguistically unacceptable texts. Thus, it outputs a high similarity score of 0.8, potentially due to token overlap.

To further verify our claim, we evaluate ChatGPT on STS12 in two different contexts: all samples vs. only text pairs that are both linguistically acceptable. We choose STS12 as it has a high number of linguistically unacceptable texts. We find that on the linguistically acceptable subset (2195/3108 samples in STS12), we get a correlation of 75.95%, which is a 6.62% increase in performance compared to evaluation on all samples.

5.2 Numeric Reasoning

It is well-documented that large language models have trouble with numeric reasoning tasks (Chen et al., 2023). In this study, we find that ChatGPT's definition of what constitutes a semantically similar text is not very sensitive to differences in numeric quantities. In other words, ChatGPT commonly assigns high semantic equivalence to linguistically similar texts with very different numeric quantities. This is in contrast to the annotation of the STS12-16 benchmarks, where similarity scores can be very sensitive to numeric differences.

If we assume that samples with numeric quantities in each text require some numeric comparison, we specifically find that, of the top-500 worst predictions made by ChatGPT, 12.4% require a numeric comparison. Consider the following example:

Text 1: singapore stocks end up 0.26 percent
Text 2: singapore stocks end up 0.11 pct
Ground Truth Similarity Score: 0.4

ChatGPT is good at recognizing that both texts pertain to Singapore stocks; however, ChatGPT's prediction of 0.95 similarity shows little sensitivity to the numeric difference between the texts. Such a prediction may be considered accurate in other settings, but under the STS12-16 annotation guidelines it produces a poor result.

[4] Huggingface model string: 'textattack/roberta-base-CoLA'
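The acceptability screen used in Section 5.1 can be approximated with the CoLA classifier named in footnote [4]. The sketch below is an assumption about how such a filter could be implemented, not the authors' code; in particular, the mapping of LABEL_1 to "acceptable" follows the CoLA convention and should be verified against the model card.

```python
# Approximate sketch of the linguistic-acceptability filter from Section 5.1,
# using the 'textattack/roberta-base-CoLA' checkpoint from footnote [4].
# Treating LABEL_1 as "acceptable" is an assumption; check the model card.
from transformers import pipeline

cola = pipeline("text-classification", model="textattack/roberta-base-CoLA")

def both_acceptable(text1: str, text2: str) -> bool:
    """Keep a pair only if both texts are judged linguistically acceptable."""
    return all(cola(t)[0]["label"] == "LABEL_1" for t in (text1, text2))

print(both_acceptable("it 's not how it was sold , gb.",
                      "The committee approved the new budget."))
```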
6 Conclusion

In this study, we show that while smaller LLMs like Llama2 struggle on STS, larger models like ChatGPT are highly capable of performing semantic similarity tasks, with ChatGPT achieving SOTA performance on 2/7 standard STS datasets. We additionally show that ChatGPT is far superior to existing STS models on world-knowledge-dependent comparisons, as ChatGPT outperforms existing models by an average of 22.3% on domain-specific STS tasks. In conclusion, ChatGPT shows promising results for domain-specific STS tasks.

7 Limitations

A limitation of this work is the use of a closed-source model, making it impossible to verify whether the model has encountered the data used in our evaluation sets that were collected prior to September 2021. Also, frequent updates to ChatGPT make it challenging to anticipate how results may change in the future.

Additionally, our STS solution may not be suitable for large-scale pairwise comparison tasks due to API costs and slow inference speeds. As it stands, our approach is primarily designed for small-scale analysis seeking high-quality outcomes. To demonstrate this, we introduce three new, challenging domain-specific STS datasets. The size of the new datasets is limited, as it is expensive to scale the annotation process while ensuring high-quality data with reliable annotation. However, the number of samples in our domain-specific evaluation sets is on par with other domain-specific STS datasets (Soğancıoğlu et al., 2017).

Finally, we note that we did not do any prompt optimization as a part of this study, which limits the performance potential of our experiments. Future iterations of this work may find that performance can be increased by employing different few-shot/COT examples, or by optimizing the problem description.

8 Ethical Considerations

The datasets introduced in this paper collect samples from a total of 6 different subreddits. All of this information was collected manually from the public-facing site. Samples in STS-Sports and STS-News are headlines or texts that describe public events and thus contain no sensitive information. We note that while samples in STS-Health do contain posts and comments describing personal health experiences, none of the selected samples contain any personally identifying information, and all are publicly available on the internet. Additionally, this is not human subjects research and thus qualifies for IRB exemption at the authors' institution. Reddit was chosen as a data source because it is a suitable platform to collect time-stamped, anonymous data in specific domains and on timely topics. However, in the interest of protecting user privacy, we plan to provide paraphrased versions of the user-generated samples in STS-Health so that users cannot be identified via an internet search of our dataset, as suggested in (Benton et al., 2017).
References

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. 2015. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 252–263, Denver, Colorado. Association for Computational Linguistics.

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 81–91, Dublin, Ireland. Association for Computational Linguistics.

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511, San Diego, California. Association for Computational Linguistics.

Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 385–393, Montréal, Canada. Association for Computational Linguistics.

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 32–43, Atlanta, Georgia, USA. Association for Computational Linguistics.

Adrian Benton, Glen Coppersmith, and Mark Dredze. 2017. Ethical research protocols for social media health research. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 94–102, Valencia, Spain. Association for Computational Linguistics.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Jiuhai Chen, Lichang Chen, Heng Huang, and Tianyi Zhou. 2023. When do you need chain-of-thought prompting for chatgpt?

Yiming Chen, Yan Zhang, Bin Wang, Zuozhu Liu, and Haizhou Li. 2022. Generate, discriminate and contrast: A semi-supervised sentence representation learning framework. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8150–8161, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Doris Hoogeveen, Andrew Bennett, Yitong Li, Karin Verspoor, and Timothy Baldwin. 2018. Detecting misflagged duplicate questions in community question-answering archives. In Proceedings of the International AAAI Conference on Web and Social Media, volume 12.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv., 55(9).

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Danaja Maldeniya, Arun Varghese, Toby Stuart, and Daniel Romero. 2017. The role of optimal distinctiveness and homophily in online dating. In Proceedings of the International AAAI Conference on Web and Social Media, volume 11, pages 616–619.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 216–223, Reykjavik, Iceland. European Language Resources Association (ELRA).

Raja Marjieh, Ilia Sucholutsky, Theodore R Sumers, Nori Jacoby, and Thomas L Griffiths. 2022. Predicting human similarity judgments using large language models. arXiv preprint arXiv:2202.04728.

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Salar Mohtaj and Sebastian Möller. 2022. TUB at WANLP22 shared task: Using semantic similarity for propaganda detection in Arabic. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), pages 501–505, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: A human generated machine reading comprehension dataset. choice, 2640:660.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Gizem Soğancıoğlu, Hakime Öztürk, and Arzucan Özgür. 2017. Biosses: a semantic sentence similarity estimation system for the biomedical domain. Bioinformatics, 33(14):i49–i58.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.

Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models.

Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. 2023. Harnessing the power of llms in practice: A survey on chatgpt and beyond. arXiv preprint arXiv:2304.13712.

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2023. Can chatgpt understand too? a comparative study on chatgpt and fine-tuned bert. arXiv preprint arXiv:2302.10198.

Appendix

A Dataset Overview

In this section, we provide additional dataset information, including sample data and summary statistics of our newly collected datasets. In Figure 2 we show the distribution of ground truth similarity scores for each of our newly collected datasets.
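Only the caption of Figure 2 survives in this version (below). Assuming the per-dataset similarity scores are available, a histogram of the kind described can be reproduced as in the following sketch; the CSV paths and column name are hypothetical, not the released file layout.

```python
# Hypothetical sketch for a Figure-2-style histogram of ground truth scores.
# File names and the "similarity" column are assumptions for illustration.
import csv
import matplotlib.pyplot as plt

datasets = {"STS-Sports": "sts_sports.csv",
            "STS-News": "sts_news.csv",
            "STS-Health": "sts_health.csv"}

for name, path in datasets.items():
    with open(path, newline="") as f:
        scores = [float(row["similarity"]) for row in csv.DictReader(f)]
    plt.hist(scores, bins=10, range=(0.0, 1.0), alpha=0.5, label=name)

plt.xlabel("Ground truth similarity score")
plt.ylabel("Number of text pairs")
plt.legend()
plt.show()
```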

Figure 2: Histogram of the similarity scores for our newly collected sports, news, and health datasets.

Dataset IAA
STS-Sports 80.19
STS-Health 73.38
STS-News 82.30

Table 3: Inter-Annotator Agreement (IAA) for each of our newly collected datasets. We define IAA as the mean Pearson correlation between all annotators. That is, for our three annotators, we report the mean of the 3 pairwise correlations.
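For reference, the IAA defined above (mean pairwise Pearson correlation between annotators) can be computed as in the sketch below; the annotator score lists are placeholders, not the released annotations.

```python
# Mean pairwise Pearson correlation across annotators (the IAA definition above).
# The three annotator score lists are illustrative placeholders.
from itertools import combinations
from scipy.stats import pearsonr

annotator_scores = [
    [0.8, 0.2, 0.5, 0.9],  # annotator 1
    [0.7, 0.3, 0.6, 1.0],  # annotator 2
    [0.9, 0.1, 0.4, 0.8],  # annotator 3
]

pairwise = [pearsonr(a, b)[0] for a, b in combinations(annotator_scores, 2)]
iaa = sum(pairwise) / len(pairwise)
print(f"IAA = {100 * iaa:.2f}")
```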

A.1 STS-Sports

This dataset contains post titles from three different sports subreddits: r/NBA, r/NBATalk, and r/NFL. These subreddits were chosen as they pertain to sports within which our annotators have significant domain knowledge. An example text pair from STS-Sports is shown below:

Text 1: [Highlight] Murray calling his own "BANG" and points at Mike Breen
Text 2: Jamal Murray seen yelling Mike Breen's signature catch phrase after hitting a three
Similarity Score: 0.86
Author Explanation: This is an extremely difficult STS sample as it requires a model to know who Jamal Murray is (basketball player), who Mike Breen is (basketball announcer), and what Breen's catch-phrase is when people hit a three-point shot ("BANG!"). This is a near semantic match, with the difference being that in Text 2 there is no mention of pointing at Mike Breen.
ChatGPT Output: 0.8
RoBERTa-large Cross-Encoder: 0.48

A.2 STS-Health

This dataset consists of post titles, post body content, and comments from two different health-related subreddits: r/covidlonghaulers and r/LongCovid. These subreddits were chosen as they contain health discussions which are user-generated (i.e. non-clinical data) and not overly technical. Validating performance on such data shows ChatGPT's capacity to model social health texts, which has many important downstream applications in NLP for public health. An example text pair from STS-Health is shown below.

Text 1: Drs are Gaslighting me
Text 2: My doctor is making me feel like im crazy!
Similarity Score: 0.93
Author Explanation: This sample is difficult as there is little token overlap outside of the mention of a doctor. Understanding this sample requires the model to know modern slang terms such as "Gaslighting".
ChatGPT Output: 0.8
RoBERTa-large Cross-Encoder: 0.57

A.3 STS-News

This dataset contains post titles from r/Politics. We use this subreddit as the post titles are often headlines containing a diverse array of political figures and phrases that require significant world knowledge. An example text pair from STS-News is shown below.

Text 1: Montana Republican Lawmaker Suggested She'd Prefer Her Daughter Die By Suicide Than Transition
Text 2: Politician makes insensitive comment towards the transgender community
Similarity Score: 0.66
Author Explanation: This is a difficult sample that requires the model to understand a very complex and implicit form of hate speech towards the transgender community. The model is unable to rely on any token overlap between the two texts.
ChatGPT Output: 0.6
RoBERTa-large Cross-Encoder: 0.41

B LLM Hyperparameters & Prompts

B.1 LLM Hyperparameters

For both ChatGPT (gpt-3.5-turbo-0301) and Llama2-7b-chat, we evaluated performance on three different hyperparameter configurations:

• Temperature = 0, Top-P = 1
• Temperature = 1, Top-P = 1
• Temperature = 0, Top-P = 0.01

We identify the best configuration for an experiment (i.e. benchmark STS and domain-specific STS) by averaging the results across all datasets for each set of hyperparameters. Whichever configuration produces the highest average performance (rounded to two decimal places) is chosen as the configuration for all datasets in that experiment. We find that all experiments achieved the best performance, on average, with the Temperature = 0, Top-P = 1 configuration. It is worth noting that Temperature = 0, Top-P = 0.01 often provided equivalent performance in certain experiments. However, we chose Top-P = 1 for our configuration as this is the default value provided by the OpenAI API and will thus be a more common configuration for future users.

B.2 Prompts

In this section, we provide details on the STS prompts used to produce our results. We note that Llama2 struggled to perform the STS task in the 0-shot setting without a specific prompt structure. Specifically, the 0-shot prompts in this section all needed to have "Output:" added to the end of the prompt for the model to properly output its prediction. Thus, in the 0-shot examples that follow, we display the ChatGPT version of the prompt. The Llama2 version is the same prompt with the addition of "Output:" appended to the end.
1. Baseline Prompt [0-5]

(a) Motivation: This prompt was used in (Zhong et al., 2023) to evaluate ChatGPT on a subset of the STS-B dataset. We run this prompt on all datasets in this study as a baseline reference.

(b) Prompt: Determine the similarity between the following two sentences: <Text 1> and <Text 2>. The score should be ranging from 0.0 to 5.0, and can be a decimal.

2. STS Prompt [0-1]

(a) Motivation: Our highest-performing prompt. We find that having ChatGPT predict labels which are mapped between [0-1] significantly improves performance.

(b) Prompt:
Output a number between 0 and 1 describing the semantic similiarity between the following two sentences:
Sentence 1: <Text 1>
Sentence 2: <Text 2>

3. STS Prompt [0-5]

(a) Motivation: To validate our claim that ChatGPT performs better on normalized STS labels, we run the same prompt on the original STS scale of [0-5].

(b) Prompt:
Output a number between 0.0 and 5.0 describing the semantic similiarity between the following two sentences:
Sentence 1: <Text 1>
Sentence 2: <Text 2>

4. Few Shot STS Prompt [0-1]:

(a) Motivation: Few shot prompting is a well-established method in the literature (Liu et al., 2023). We thus evaluate few shot prompting as a baseline measure. Note: The samples used in the few shot prompt were crafted by the authors with the goal of being domain agnostic while introducing the model to the full spectrum of the label space.

(b) Prompt:
Output a number between 0 and 1 describing the semantic similiarity between the following two sentences:

Sentence 1: John gave two apples to annie
Sentence 2: The ball bounced on the ground
Similarity Score: 0

Sentence 1: John gave two apples to annie
Sentence 2: Annie is a girl who likes to read
Similarity Score: 0.25

Sentence 1: John gave two apples to annie
Sentence 2: Annie likes to eat apples
Similarity Score: 0.5

Sentence 1: John gave two apples to annie
Sentence 2: John gave four apples to annie
Similarity Score: 0.75

Sentence 1: John gave two apples to annie
Sentence 2: Annie got two apples from john
Similarity Score: 1

Sentence 1: <Text 1>
Sentence 2: <Text 2>

5. Chain-of-Thought (COT) Prompt:

(a) Motivation: Chain-of-Thought prompting has been shown to be a state-of-the-art prompting strategy for many multi-step reasoning tasks (Wei et al., 2023). We thus evaluate the applicability of COT for STS tasks as a baseline. Note: The 1-shot COT example here was written by the author to avoid interacting with any of the existing STS datasets.

(b) Prompt:
Discuss how these two texts are similar and different, then assign a semantic similarity score between [0.0-1.0] which describes their semantic similarity:

Sentence 1: Over 50 men have decided that they want to upgrade their iphone
Sentence 2: We interviewed 25 people and all of them want a new phone
Similarity: Lets think step by step. Sentence 1 and Sentence 2 both discuss the upgrade of phones. However they differ in that sentence 1 refers specifically to the iphone and only reports a statistic about men, while sentence 2 discusses phones generally and only for 25 people. Thus, these sentences have a [semantic similarity = 0.7]

Discuss how these two texts are similar and different, then assign a semantic similarity score between [0.0-1.0] which describes their semantic similarity:

Sentence 1: <Text 1>
Sentence 2: <Text 2>
Similarity: Lets think step by step.
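To tie the prompt variants above back to the extraction rules of Section 3.2, the following sketch shows illustrative parsing helpers: the first number in the reply for 0-shot and few-shot outputs, and the bracketed value for COT outputs, plus the "Output:" suffix used for the Llama2 variant. The regular expressions are assumptions, not the authors' exact parser.

```python
# Illustrative parsing helpers matching the extraction rules in Section 3.2:
# 0-shot/few-shot outputs yield the first number; COT outputs carry the score
# inside brackets, e.g. "[semantic similarity = 0.7]". Regexes are assumptions.
import re

def parse_zero_or_few_shot(output: str) -> float:
    match = re.search(r"\d*\.?\d+", output)          # first number in the reply
    return float(match.group()) if match else 0.0    # default prediction of 0

def parse_cot(output: str) -> float:
    match = re.search(r"\[semantic similarity\s*=\s*(\d*\.?\d+)\]", output)
    return float(match.group(1)) if match else 0.0

def llama2_prompt(chatgpt_prompt: str) -> str:
    # Per Section B.2, the Llama2 version appends "Output:" to the 0-shot prompt.
    return chatgpt_prompt + "\nOutput:"

print(parse_zero_or_few_shot("0.85"))                        # -> 0.85
print(parse_cot("... Thus, [semantic similarity = 0.7]"))    # -> 0.7
```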
