
Unleashing the Emergent Cognitive Synergy in Large Language Models:

A Task-Solving Agent through Multi-Persona Self-Collaboration


Zhenhailong Wang1∗, Shaoguang Mao2, Wenshan Wu2, Tao Ge2, Furu Wei2, Heng Ji1
1 University of Illinois Urbana-Champaign, 2 Microsoft Research Asia
{wangz3,hengji}@illinois.edu
{shaoguang.mao,wenshan.wu,tage,fuwei}@microsoft.com

∗ Work was done when interning at Microsoft Research Asia.
† Corresponding author.

arXiv:2307.05300v4 [cs.AI] 26 Mar 2024

Abstract

Human intelligence thrives on cognitive synergy, where collaboration among different minds yields superior outcomes compared to isolated individuals. In this work, we propose Solo Performance Prompting (SPP), which transforms a single LLM into a cognitive synergist by engaging in multi-turn self-collaboration with multiple personas. A cognitive synergist is an intelligent agent that collaboratively combines multiple minds' strengths and knowledge to enhance problem-solving in complex tasks. By dynamically identifying and simulating different personas based on task inputs, SPP unleashes the potential of cognitive synergy in LLMs. Our in-depth analysis shows that assigning multiple fine-grained personas in LLMs improves problem-solving abilities compared to using a single or fixed number of personas. We evaluate SPP on three challenging tasks: Trivia Creative Writing, Codenames Collaborative, and Logic Grid Puzzle, encompassing both knowledge-intensive and reasoning-intensive types. Unlike previous works, such as Chain-of-Thought, that solely enhance the reasoning abilities in LLMs, experimental results demonstrate that SPP effectively reduces factual hallucination and maintains strong reasoning capabilities. Additionally, comparative experiments show that cognitive synergy only emerges in GPT-4 and does not appear in less capable models, such as GPT-3.5-turbo and Llama2-13b-chat, which draws an interesting analogy to human development. Code, data, and prompts can be found at: https://github.com/MikeWangWZHL/Solo-Performance-Prompting.git

Figure 1: Schematic illustration of Solo Performance Prompting (SPP) and the difference compared to previous prompting methods: (a) Standard Prompting (single persona), (b) Chain-of-Thought Prompting (CoT) (single persona), and (c) Solo Performance Prompting (SPP) (multiple personas: domain experts / audiences).

1 Introduction

Although large language models (LLMs) have demonstrated impressive performance as general task-solving agents, they still encounter challenges (Qin et al., 2023; Bang et al., 2023; OpenAI, 2023b; Bubeck et al., 2023) in various knowledge-intensive and reasoning-intensive tasks due to factual hallucination (Maynez et al., 2020) and a lack of slow-thinking (Sloman, 1996) capabilities. Unlike humans, who can leverage the power of collaboration and information integration among different cognitive processes and individuals (referred to as cognitive synergy (Curşeu et al., 2015; Goertzel, 2009, 2017)), current LLMs are akin to "jack-of-all-trades" with a vast mixture of knowledge and characteristics. Recent advancements, such as Chain-of-Thought (CoT) prompting (Wei et al., 2023; Kojima et al., 2022) and Self-refinement (Madaan et al., 2023; Shinn et al., 2023), have successfully enhanced the reasoning abilities of LLMs by simulating slow-thinking through the generation of intermediate steps or iterative revision. However, factual hallucination remains a major challenge for LLMs on knowledge-intensive tasks.

Figure 2: Task-solving example of Solo Performance Prompting (SPP) with GPT-4. The personas of the participants are automatically identified by GPT-4 based on the task input. This example shows that Standard Prompting suffers from factual errors (a missing incantation and a wrong song title), whereas SPP provides accurate information and a coherent answer. Note that, in real-world applications, the domains can vary not only within entertainment but also encompass history, science, education, healthcare, etc.

A cognitive synergist is an intelligent agent that collaborates with multiple minds to enhance problem-solving and efficacy in complex tasks. In this work, we aim to create a cognitive synergist based on a single LLM that can "split into" multiple personas and engage in self-collaboration to solve both knowledge-intensive and reasoning-intensive tasks. This idea is heavily inspired by the role of pretend play (Piaget, 1954; Pellegrini, 2009) in cognitive development and recent findings that assigning personas (Deshpande et al., 2023; Xu et al., 2023) to LLMs can elicit specific behaviors, improve answer quality, and potentially build an AI society (Park et al., 2023; Schick et al., 2022; Li et al., 2023; Cai et al., 2023) with collaborative LLM agents. However, as shown in Table 1, previous works have limitations such as fixed or task-specific personas, the need for additional fine-tuning, and increased inference costs due to multiple LLM instances.

To unleash the potential of cognitive synergy for general task-solving, we propose Solo Performance Prompting (SPP), which prompts a single LLM to identify, simulate, and collaborate with multiple personas. Figure 1 provides a high-level overview of SPP. Here, a persona can represent either a domain expert, such as a movie enthusiast, or a target audience, such as a ten-year-old child. Through the dynamic identification of various personas, we empower a single LLM to acquire diverse domain knowledge accurately without additional retrieval systems. By facilitating multi-turn self-collaboration, we enable self-revision and self-feedback from various perspectives without requiring additional agents.

In real-world scenarios, such as those in creative industries, there is often a need to incorporate diverse information from different domains. Figure 2 presents a concrete example of how SPP operates on a challenging task that requires creative integration of information from various domains, such as the Legend of Zelda game, Harry Potter movies, and Jay Chou's albums. Standard prompting fails to generate satisfactory output due to missing essential information and factual errors. In contrast, SPP produces informative and coherent answers by automatically identifying expert personas and engaging in multi-turn self-collaboration. In this process, the AI Assistant persona iteratively writes drafts of the story, solicits feedback from other participants, and revises accordingly.

Methods compared along six dimensions (general task solving; pure zero-shot prompting; multiple personas; personas dynamically identified; iterative refinement; need only a single LLM):
† Standard Prompting (Brown et al., 2020)
† Chain-of-Thought (Wei et al., 2023)
Inner Monologue (Huang et al., 2022)
ReAct (Yao et al., 2022)
Reflexion (Shinn et al., 2023)
† Self-Refine (Madaan et al., 2023)
Tree-of-Thought (Yao et al., 2023)
GPT-Bargaining (Fu et al., 2023) (personas fixed to 3)
Camel (Li et al., 2023) (personas fixed to 2)
ExpertPrompting (Xu et al., 2023)
Solo Performance Prompting (ours) (varied personas)

Table 1: High-level comparison with various prompting-based methods. Methods directly comparable to ours are denoted by †. Results for the comparison can be found in Section 3. In Section 4, we further design and compare with two variants of Solo Performance Prompting: one adopting fixed personas, as in Camel (Li et al., 2023), and another with additional persona profiles, as proposed in ExpertPrompting (Xu et al., 2023).
To explore the prevalence of cognitive synergy in different LLMs, we apply SPP to LLMs with varying scales and capabilities, including GPT-4, GPT-3.5-turbo, and Llama2-13b-chat. Comparative results show that cognitive synergy only emerges in GPT-4 and not in less capable models. This draws an interesting analogy to human development, as children typically start engaging in role-playing at the age of 2 to 3 (Piaget, 1954), but not earlier. In summary, the key contributions of this paper are as follows:

• We investigate whether LLMs can leverage cognitive synergy for general task-solving. We introduce Solo Performance Prompting (SPP), which simulates multi-agent, multi-persona collaboration in a pure zero-shot manner.
• We evaluate SPP across three challenging tasks: Trivia Creative Writing, Codenames Collaborative, and Logic Grid Puzzle, spanning both knowledge- and reasoning-intensive domains. To our knowledge, SPP is the first zero-shot prompting method that can enhance both knowledge and reasoning abilities on GPT-4.
• We present an intriguing finding regarding the emergent nature of cognitive synergy in LLMs, which only appears in GPT-4 and not in less powerful models.
• We conduct in-depth analyses of the impact of the identified personas and SPP prompt design, providing insights into why dynamic, fine-grained personas are necessary, as opposed to fixed, coarse-grained personas.

2 Solo Performance Prompting

To unleash the power of synergizing different personas to tackle complex problems, we propose Solo Performance Prompting (SPP), which instructs an LLM to perform the following procedure for general task-solving: (1) Persona Identification: identify multiple participants with special personas (including a leader persona, AI Assistant) that are essential for solving the particular task. (2) Brainstorming: the participants share knowledge and provide suggestions on how to approach the task based on their own expertise. (3) Multi-Persona Iterative Collaboration: the leader persona, AI Assistant, proposes initial solutions, consults the other participants for feedback, and revises the answer iteratively. Figure 2 shows a working example of SPP during inference. Next, we formally describe the SPP procedure in detail.

Given an input sequence x and a model M, let the prompt (including demonstration examples) prepended to the input be p and the final output be y. Denote an intermediate generation before the final y as z. Under this formulation, Standard Prompting and Chain-of-Thought (CoT) Prompting can be described as:

Standard Prompting: y = M(x)    (1)

CoT Prompting: y = M(p_cot ∥ x ∥ {z_1, z_2, ..., z_n})    (2)

where p_cot is the CoT prompt, e.g., "Solve the task step-by-step", and {z_1, z_2, ..., z_n} are the intermediate steps. In contrast, our proposed Solo Performance Prompting can be described as follows:

Solo Performance Prompting: y = M(p_spp ∥ x ∥ z_p ∥ {z_b^1, z_b^2, ..., z_b^m} ∥ {z_s^0, z_f^1, ..., z_f^m}_{j=1..n})    (3)
where the SPP prompt (p_spp) includes a high-level instruction and two carefully crafted demonstration examples¹ that showcase the expected task-solving procedure of SPP. We describe the design details of the prompt in §A.1. The corresponding intermediate generations (z) of SPP are detailed below.

Persona Identification (z_p). Given an input task, SPP first generates a list of participants with different personas. For example, in Figure 2, the model identifies a Jay Chou Fan persona to help answer "the last song in the second album by Jay Chou". We let the language model identify the personas dynamically instead of manually defining them. Given only two demonstration examples (detailed in §A), we observe that a state-of-the-art large language model, e.g., GPT-4 (OpenAI, 2023b), can identify accurate and meaningful personas for diverse tasks. We denote this part of the intermediate generation as z_p in Equation 3.

Brainstorming (z_b^i). Among the identified participants, "AI Assistant (you)" is treated as a leader persona that initiates the collaboration and generates initial solutions. Before generating the initial answer, the personas brainstorm on how to approach the task from their own perspectives. For example, the Jay Chou Fan points out that the last song in Jay Chou's second album is "An Jing" ("Silence"). We find that the brainstorming phase effectively improves the quality of the initial solution. In Equation 3, the superscript i = 0 is used to denote the "AI Assistant" persona, while i ≥ 1 represents the other dynamically identified personas. The intermediate generations of the brainstorming step are denoted as {z_b^1, z_b^2, ..., z_b^m}.

Multi-Persona Iterative Collaboration (z_s^0, z_f^i). Based on the brainstorming remarks, the AI Assistant persona generates an initial solution z_s^0, then consults each of the other participants for feedback {z_f^i}. The participants are encouraged to critique the current generation and give revision suggestions. For example, the Jay Chou Fan persona checks whether the song "An Jing" ("Silence") is correctly included in the story. This process can be repeated multiple times until every participant is satisfied with the current solution. In Equation 3, we denote the intermediate generations of the multi-turn dialogue as {z_s^0, z_f^1, ..., z_f^m}_{j=1..n}, where n is the number of iterations before reaching the final answer. The final answer can be directly read out following the user-specified output format.

In summary, SPP instructs an LLM to solve general tasks via multi-persona self-collaboration in a pure zero-shot manner. In contrast, as detailed in Table 1, previous prompting-based methods are either task-specific or require additional mechanisms, e.g., searching (Yao et al., 2023), external tools (Yao et al., 2022), memory components (Shinn et al., 2023), and fine-tuning (Xu et al., 2023).

¹ The tasks we use in the demonstration examples do not overlap with the evaluation tasks.

3 Experiments

To explore the effectiveness of Solo Performance Prompting (SPP), we adopt an evaluation methodology similar to that of previous work (Yao et al., 2023). We carefully design new tasks and select tasks from existing benchmarks (Srivastava et al., 2022) that are challenging even for the most capable LLMs (OpenAI, 2023b). The evaluation aims to cover diverse types of tasks encompassing both knowledge-intensive and reasoning-intensive domains.

Tasks. We invent the Trivia Creative Writing task (§3.1), which requires the model to internally acquire and integrate diverse information from various fields. We observe that even GPT-4 (OpenAI, 2023b) frequently exhibits hallucination and factuality errors on the Trivia Creative Writing task. We also propose the Codenames Collaborative task (§3.2), an extension of the Codenames task from BigBench (Srivastava et al., 2022) that features a two-role collaboration setup. Codenames Collaborative demands creative reasoning across a broad range of related knowledge and challenges the model's theory-of-mind skills. Lastly, we include a challenging pure-reasoning task, Logic Grid Puzzle (§3.3), from BigBench (Srivastava et al., 2022), which necessitates complex multi-step reasoning.

Baselines. We compare our approach with Standard Prompting, Chain-of-Thought (CoT) prompting (outlined in §2), and Self-Refine (Madaan et al., 2023). For CoT, a prompt design similar to that of Yao et al. (2023) is employed, where the model is prompted to generate a plan or a series of steps before producing the final output. For Self-Refine, we follow Madaan et al. (2023) to design the feedback and refine prompts. We perform one self-refine iteration, which requires three times more inferences than SPP. Full prompts for the methods can be found in Appendix A.2.
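Since Self-Refine is the only baseline that needs more than one model call, the inference budget is worth spelling out: one self-refine iteration consists of an initial generation, a feedback pass, and a refine pass, i.e., three calls versus SPP's single call. The sketch below is our own illustration with placeholder prompt texts; the actual feedback and refine prompts are given in Appendix A.2 (Figure 19).

```python
# Sketch of the Self-Refine baseline with one iteration: three LLM calls in
# total (draft -> feedback -> refinement). Prompt strings are placeholders.

def self_refine(task_input: str, call_llm) -> str:
    draft = call_llm(f"Solve the task.\n\nTask: {task_input}")
    feedback = call_llm(
        f"Task: {task_input}\n\nCurrent answer: {draft}\n\n"
        "Give concrete feedback on factual and logical errors."
    )
    refined = call_llm(
        f"Task: {task_input}\n\nCurrent answer: {draft}\n\n"
        f"Feedback: {feedback}\n\nRevise the answer accordingly."
    )
    return refined
```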
Methods              | Trivia.C.W (N=5)  | Trivia.C.W (N=10) | Codenames.C       | Logic.G.Puzzle
                     | Score (%)   ∆     | Score (%)   ∆     | Score (%)   ∆     | Score (%)   ∆
Standard             | 74.6   0.0%       | 77.0   0.0%       | 75.4   0.0%       | 57.7   0.0%
CoT                  | 67.1   ↓10.0%     | 68.5   ↓11.1%     | 72.7   ↓3.6%      | 65.8   ↑14.1%
Self-Refine [iter=0] | 73.8              | 76.3              | 75.2              | 58.8
Self-Refine [iter=1] | 73.9   ↓1.0%      | 76.9   ↓0.1%      | 64.6   ↓14.6%     | 60.0   ↑4.0%
SPP (ours)           | 79.9   ↑7.1%      | 84.7   ↑10.0%     | 79.0   ↑4.8%      | 68.3   ↑18.5%

Table 2: GPT-4 results on Trivia Creative Writing (Trivia.C.W), Codenames Collaborative (Codenames.C) and Logic Grid Puzzle (Logic.G.Puzzle). ∆ indicates the relative gain/loss compared with Standard Prompting (first row). We report the average scores across two individual runs with/without a system message (detailed in Appendix C).

Models. The default model we use is GPT-4 (OpenAI, 2023b). Detailed inference configurations, API versions, and full results can be found in Appendices C and F. In §3.4, we further investigate the prevalence of cognitive synergy in LLMs with different scales and capabilities, including GPT-3.5-turbo (OpenAI, 2023a) and Llama2-13b-chat (Touvron et al., 2023).

3.1 Trivia Creative Writing: A Knowledge-Intensive Task

Task Description. As illustrated in Figure 3, Trivia Creative Writing asks a model to write a coherent story while incorporating the answers to N trivia questions. Our preliminary experiments (Figure 10) show that a sufficiently large N can effectively challenge GPT-4 to demonstrate factual knowledge across diverse domains. Thus, we mainly consider two evaluation settings, N = 5 and N = 10. We built a benchmark with 100 instances for each N, covering a total of 1,000 trivia questions² extracted from the TriviaQA (Joshi et al., 2017) dataset. More details can be found in Appendix B.1.

² To select difficult question instances that can pose challenges to GPT-4, we use a smaller open-source LLM, fastchat_t5_3b (Zheng et al., 2023), to obtain preliminary performance on the validation set, and then choose the failure cases as our question selection.
Evaluation Metrics. Evaluating GPT-4-level generation results can be challenging. Our preliminary experiments indicate that, even for humans, it is very difficult to identify which generation is better in terms of the overall "quality" of the story across different prompting methods. Thus, instead of evaluating the coherence of the generation, which can be highly subjective, we employ an automatic metric that focuses on detecting factual hallucinations. As shown in Figure 3, we perform string matching against the ground-truth target answers for each question on the output generation. For each question, a match to any of the answer aliases provided by the TriviaQA dataset is considered a correct mention. The metric score is computed as: (# correct answer mentions) / (# trivia questions).
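Concretely, the metric reduces to a per-instance fraction. A minimal sketch, assuming each question is stored with its TriviaQA answer aliases (the function and field names are illustrative, not the released evaluation code):

```python
def trivia_creative_writing_score(generation: str, questions: list[dict]) -> float:
    """Fraction of trivia questions whose answer (any alias) appears in the story.

    Each question dict is assumed to look like:
        {"question": "...", "answer_aliases": ["David Seville", "Dave Seville"]}
    """
    text = generation.lower()
    correct = 0
    for q in questions:
        # A question counts as correct if ANY of its answer aliases is mentioned.
        if any(alias.lower() in text for alias in q["answer_aliases"]):
            correct += 1
    return correct / len(questions)
```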
Results. Table 2 presents the results of the Trivia Creative Writing task. The key observations are as follows: (1) Chain-of-Thought (CoT) does not outperform Standard prompting, indicating that CoT is ineffective in eliciting an LLM's knowledge abilities. Qualitative examples in Figures 8 and 11 illustrate that although CoT generates reasonable plans for task resolution, the final generation still contains factual errors and hallucinations. (2) Self-Refine only brings marginal improvements over iterations. (3) SPP outperforms all baselines significantly. The improvement is more pronounced in the N = 10 setting compared to N = 5 (10% vs. 7%), suggesting that Solo Performance Prompting is particularly beneficial when the task requires incorporating knowledge from numerous domains.

3.2 Codenames Collaborative: A Knowledge+Reasoning Task

Task Description. As illustrated in Figure 4, Codenames Collaborative is a collaborative task that challenges a model's knowledge, reasoning, and theory-of-mind abilities by assigning two player roles: the Spymaster and the Guesser. The Spymaster's role is to provide a hint word related to the target words while avoiding confusion with the distractor words, and the Guesser's role is to identify the target words based on the given hint and the full list of words. The same LLM (GPT-4 (OpenAI, 2023b)) is used for both roles sequentially, and a dataset with 50 instances is constructed based on BigBench's (Srivastava et al., 2022) Codenames task data.
Trivia Creative Writing Input Example (N=5)

Write a short and coherent story about {Harry Potter} that incorporates the answers to the following {5} questions: {Who was the man behind The Chipmunks? Which Lloyd Webber musical premiered in the US on 10th December 1993? Who was the next British Prime Minister after Arthur Balfour? Who had a 70s No 1 hit with Kiss You All Over? What claimed the life of singer Kathleen Ferrier?}

Target Answers: David Seville; Sunset Boulevard; Henry Campbell-Bannerman; Exile; Cancer

Output Example

One day, Harry Potter stumbled upon a mysterious book in the Hogwarts library. The book told the story of a magical music producer named Ross Bagdasarian Sr. (incorrect answer), who created a group of singing chipmunks. Intrigued, Harry shared the story with his friends Ron and Hermione. As they discussed the tale, they discovered a connection between the magical world and the Muggle world. It turned out that a famous Muggle composer, Andrew Lloyd Webber, had been inspired by the magical world to create a musical called "Sunset Boulevard," which premiered in the US on 10th December 1993. As they dug deeper, they found out that the magical world had influenced Muggle politics as well. Henry Campbell-Bannerman, the British Prime Minister after Arthur Balfour, had a secret friendship with a wizard who advised him on important matters. Their research also led them to the Muggle band Exile, which had a 70s No 1 hit with "Kiss You All Over." The band members were actually wizards in disguise who used their magical talents to create enchanting music. Lastly, they learned about the tragic story of Kathleen Ferrier, a talented Muggle singer whose life was claimed by breast cancer. Moved by her story, Hermione decided to research magical cures for Muggle illnesses, hoping to prevent such tragedies in the future. Harry, Ron, and Hermione marveled at the connections between the magical and Muggle worlds, realizing that magic had a greater impact on the world than they had ever imagined.

Evaluation Metric: Correct Answer Mentions: 4; Number of Questions (N): 5; Score: 4 / 5 = 0.8

Figure 3: Trivia Creative Writing task example.

Codenames Collaborative Input/Output Example

Spymaster Input
Try to find a single word hint that can accurately represent and link the {4} given words: {"director,
kiss, popcorn, street"}. The key is to select a hint that does not cause confusion with other words
from the following list: {locust, fever, street, sherwood, kiss, popcorn, craft, ant, crystal, tear, cowboy,
government, pine, mountie, soap, genie, king arthur, sphinx, director, bride, razor, fog, whistle}.

Spymaster Output Example

"movie"

Guesser Input
Try to identify the {4} words best associated with the word {"movie"} from the following list: {locust,
fever, street, sherwood, kiss, popcorn, craft, ant, crystal, tear, cowboy, government, pine, mountie,
soap, genie, king arthur, sphinx, director, bride, razor, fog, whistle.}

Guesser Output Example

director, popcorn, cowboy, bride

Evaluation Metric
Targets: director, kiss, popcorn, street Guesser outputs: director, popcorn, cowboy, bride Score: 2 / 4 = 0.5

Figure 4: Codenames Collaborative task example.

Evaluation Metrics. The original Codenames task in the BigBench dataset has limitations due to its focus on the Guesser role and the subjectivity of hint words. Our new task, Codenames Collaborative, resolves this by creating a self-contained evaluation setting that accurately measures the model's capability without human annotation. As illustrated in Figure 4, we compute the overlap ratio between the words predicted by the Guesser and the target words as the metric.
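A minimal sketch of this overlap metric (the function and argument names are ours, not the released code):

```python
def codenames_collaborative_score(guesser_words: list[str], target_words: list[str]) -> float:
    """Overlap ratio between the Guesser's predictions and the target words."""
    predicted = {w.strip().lower() for w in guesser_words}
    targets = {w.strip().lower() for w in target_words}
    return len(predicted & targets) / len(targets)

# Example from Figure 4: targets {director, kiss, popcorn, street} and guesses
# {director, popcorn, cowboy, bride} overlap on 2 of 4 targets -> score 0.5.
```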
Results. Table 2 shows the results on the Codenames Collaborative task. Similar to the Trivia Creative Writing task, we find that CoT does not bring positive gains compared with Standard prompting. Interestingly, iterative self-refinement brings a negative impact on this task, due to a high tendency to change the initial response even when it is already good. In contrast, SPP brings significant improvements (~5%), which indicates its effectiveness on collaborative tasks that require knowledge, reasoning, and theory-of-mind skills. Figure 12 provides further qualitative examples illustrating that SPP generates detailed and interpretable intermediate dialogues.

3.3 Logic Grid Puzzle: A Reasoning-Intensive Task

Task Description and Evaluation Metrics. We utilize the Logic Grid Puzzle task from the BigBench (Srivastava et al., 2022) dataset, which comprises 200 instances. Each instance describes a logic puzzle typically involving 2 to 5 houses, with each house inhabited by a person with specific characteristics, such as playing the piano. The objective is to answer questions about house numbers based on given clues, which requires multi-step reasoning and the selection of relevant information. An example input and output of the Logic Grid Puzzle task are illustrated in Figure 5. For evaluation metrics, we calculate the accuracy of the predicted house numbers by comparing them with the ground-truth targets provided by the dataset.
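The evaluation here is plain exact-match accuracy over the predicted house numbers; a minimal sketch:

```python
def logic_grid_puzzle_accuracy(predictions: list[int], targets: list[int]) -> float:
    """Exact-match accuracy of predicted house numbers against the ground truth."""
    correct = sum(int(p == t) for p, t in zip(predictions, targets))
    return correct / len(targets)

# Example from Figure 5: target house 2, predicted house 3 -> per-instance score 0.
```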
Logic Grid Puzzle Input/Output/Evaluation Example

Input Example
Q: There are 4 houses in a row, numbered 1 on the left to 4 on the right. There is one person living in each house. The people
in these houses have different characteristics:
- Each person has different flowers in their foyer: one has a carnations arrangement, one has a bouquet of daffodils, one
has a vase of tulips, and one has a bouquet of lilies
- Each person plays a different musical instrument: one is a guitarist, one is a pianist, one is a percussionist, and one is a
flutist

Clue(s):
1. The flutist lives in the second house.
2. The person who has a vase of tulips lives directly left of the guitarist.
3. The person who has a bouquet of lilies lives directly left of the person who has a carnations arrangement.
4. There is one house between where the flutist lives and where the pianist lives.

What is the number of the house where the person who has a vase of tulips lives?
choice: 2
choice: 4
choice: 1
choice: 3

Output Example Evaluation Metric

The house number where the person who has a vase of tulips lives is 3. Target: 2 Prediction: 3 Score: 0 (2!=3)

Figure 5: Logic Grid Puzzle task example.

Results. Table 2 presents the results on Logic Grid Puzzle. In contrast to the previous two tasks, we find that CoT brings significant improvements compared to Standard prompting, verifying the observation from previous work that CoT elicits better reasoning abilities. Furthermore, we discover that SPP also achieves strong performance on this reasoning-intensive task.

3.4 The Emergence of Cognitive Synergy

We further discover that cognitive synergy can only be fully unleashed in LLMs with a certain level of instruction-following capabilities, akin to that of GPT-4. This can be intriguingly compared to human development, where children usually begin to participate in role-playing around the ages of 2 to 3 (Piaget, 1954), but not before that age.

As shown in Figure 6, the effectiveness of SPP is not seen in smaller and less capable models like GPT-3.5 and Llama2. Additionally, on Llama2, we identify a unique problem which we refer to as early-termination, where the model stops generating after identifying the participants, resulting in exceptionally low performance with SPP. The model behaves as if it were waiting for input from a user instead of following the demonstration examples to generate responses on its own. Detailed discussions and examples of the early-termination problem can be found in Appendix E.

Figure 6: SPP achieves superior performance only with the most powerful LLM (GPT-4), but not with GPT-3.5 and Llama2-13b. This indicates that cognitive synergy abilities only emerge in LLMs with GPT-4-level capabilities.

4 Analysis

SPP effectively improves both knowledge and reasoning abilities in LLMs. As demonstrated by the results in §3, Solo Performance Prompting (SPP) not only brings significant improvements to knowledge-intensive tasks such as Trivia Creative Writing and Codenames Collaborative without relying on external knowledge bases, but also achieves strong performance on reasoning-intensive tasks like Logic Grid Puzzle. To our knowledge, SPP is the first zero-shot prompting method that can enhance both knowledge and reasoning abilities on GPT-4.

LLMs can effectively identify useful personas in a zero-shot manner. We are interested in investigating whether the identified personas are highly relevant to the tasks. We visualize the personas automatically identified by SPP using a word cloud for each task in Figure 7a, where a larger font indicates a higher frequency. The key observations include: (1) The identified personas are closely correlated with the particular task. For example, in Logic Grid Puzzle, even though "logic puzzle" is not mentioned in the input, the LLM frequently identifies the persona "Logic Puzzle Expert." (2) On knowledge-intensive tasks, such as Trivia Creative Writing, SPP identifies more diverse and specific personas, while on reasoning-intensive tasks, such as Logic Grid Puzzle, the personas are more homogeneous.
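The word clouds in Figure 7a are, in essence, frequency counts over the persona names SPP emits. A minimal sketch of such a count, assuming the personas can be read from a "Participants:" line in each generation (a parsing convention we assume here, not necessarily the released code):

```python
from collections import Counter
import re

def persona_frequencies(generations: list[str]) -> Counter:
    """Count how often each persona is identified across SPP generations.

    Assumes each generation contains a line like:
        "Participants: AI Assistant (you); Movie Expert; Geography Expert"
    """
    counts = Counter()
    for text in generations:
        match = re.search(r"Participants:\s*(.+)", text)
        if not match:
            continue
        for persona in re.split(r"[;,]", match.group(1)):
            persona = persona.strip()
            if persona and persona != "AI Assistant (you)":
                counts[persona] += 1
    return counts  # e.g., feed the counts into a word-cloud library for Figure 7a
```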

We further investigate whether a detailed profile for each persona is needed for eliciting domain knowledge, as suggested by Xu et al. (2023). To this end, we design a variant of SPP, SPP-Profile, which involves generating profiles for each persona during the Persona Identification phase. The results in Figure 7b show that SPP-Profile does not outperform SPP. This suggests that a fine-grained persona name without a detailed description may already be sufficient for eliciting certain domain knowledge.

Figure 7: (a) Qualitative analysis of the identified personas: visualization of the SPP-identified personas; the personas show a high correlation with the nature of the tasks. (b) Quantitative analysis of two SPP variants: comparison between SPP, SPP-Fixed-Persona (with two fixed personas) and SPP-Profile (additionally generating persona profiles). SPP significantly outperforms SPP-Fixed-Persona, highlighting the importance of automatically identifying dynamic, fine-grained personas. SPP slightly outperforms SPP-Profile, indicating that the persona names (without detailed descriptions of the expertise) are probably already sufficient for eliciting cognitive synergy.

Dynamic personas v.s. fixed personas. To further investigate the importance of dynamically identifying personas for each task instance instead of fixing a general persona, an ablated variant of SPP, SPP-Fixed-Persona, is introduced. For SPP-Fixed-Persona, we modify the prompt (Figure 17) to force the personas to be fixed as an "AI Assistant" and an "Expert". Comparing SPP and SPP-Fixed-Persona in Figure 7b, we have the following insights: (1) SPP consistently outperforms SPP-Fixed-Persona across all tasks, suggesting that dynamic, fine-grained personas are more effective than fixed, general personas. Qualitative examples in Figures 8 and 13 show that fine-grained personas such as the "Film Expert" and "Sports Enthusiast" correctly provide the answers, while the fixed persona "Expert" fails. (2) SPP-Fixed-Persona also suffers from the early-termination problem as defined in §3.4, where the LLM stops the collaboration before providing the final answer, as if it were waiting for external inputs.

Impact of the demonstrations in the SPP prompt. To investigate the effectiveness of the hand-crafted demonstration examples in SPP, we conduct an ablation study where we remove the second demo example and preserve the first one, which shows only a two-persona collaboration setting. As shown in Figure 9, we observe that (1) adding the second example, which requires collaboration of more than two personas, effectively boosts the performance, and (2) SPP is fairly robust to the prompt change and shows good performance with only the first demo example.

5 Related Work

LLMs as role-playing agents. Recent research (Deshpande et al., 2023; Xu et al., 2023; Fu et al., 2023; Auto-GPT, 2023; Li et al., 2023) demonstrates that assigning personas or roles to LLMs influences their generation behavior. AI societies with distinct personas or occupations have been explored for collaboration (Park et al., 2023; Schick et al., 2022; Li et al., 2023; Cai et al., 2023). However, limitations in persona assignment and multi-agent collaboration include single or fixed persona assignments (Xu et al., 2023; Fu et al., 2023; Schick et al., 2022; Li et al., 2023) and the need for multiple LLM instances, which increases inference cost. In contrast, SPP uses a single LLM to dynamically identify useful personas for general tasks. Our discovery of the emergent nature of cognitive synergy also aligns with related work (Olausson et al., 2023), which investigates the emergent ability of self-debugging in code generation.

SPP v.s. CoT (Trivia Creative Writing N=5)

Instance Index: 10
Question #5: In which country is the Sky Train Rail bridge?
SPP: Participants: AI Assistant (you); Movie Expert; Art Expert; Sports Expert; Geography Expert [...] Geography Expert: The Sky Train Rail Bridge is located in Canada.
CoT: Plan: [...] 6. Incorporate the Sky Train Rail bridge, which is in Thailand.

SPP v.s. SPP-Fixed-Persona (Trivia Creative Writing N=5)

Instance Index: 30
Question #1: Which orchestral conductor was married to one of the subjects of the film Hillary and Jackie?
SPP: Participants: AI Assistant (you); Film Expert; Literary Expert; History Expert; Geography Expert; Architecture Expert. Film Expert: The orchestral conductor married to one of the subjects of the film Hillary and Jackie is Daniel Barenboim.
SPP-Fixed-Persona: Participants: AI Assistant (you); Expert. Expert: Sir Simon Rattle was married to one of the subjects of the film Hillary and Jackie.

Figure 8: Qualitative examples on Trivia Creative Writing comparing SPP, CoT and SPP-Fixed-Persona. While CoT provides reasonable intermediate steps, it still struggles with factual hallucination. SPP v.s. SPP-Fixed-Persona reveals that dynamically identified fine-grained personas, such as the "Film Expert," tend to outperform the fixed general persona of an "Expert." More examples can be found in Figures 11, 12, and 13.

Enhancing reasoning and factual knowledge in LLMs. LLMs face challenges in complex knowledge-intensive tasks due to hallucination (Maynez et al., 2020) and in reasoning-intensive tasks due to the lack of human-like slow thinking (Sloman, 1996; Kahneman, 2011). Approaches like Chain-of-Thought (CoT) and Self-Refinement encourage LLMs to solve tasks step by step or iteratively revise their answers (Wei et al., 2023; Kojima et al., 2022; Zhang et al., 2022; Fu et al., 2022; Xue et al., 2023; Yao et al., 2023; Madaan et al., 2023; Shinn et al., 2023; Gou et al., 2023; Chen et al., 2023; Huang et al., 2022; Yao et al., 2022). However, these methods do not necessarily reduce factual hallucination. Retrieval-augmented LLMs (Borgeaud et al., 2022; Izacard et al., 2022; Wang et al., 2022; Shuster et al., 2021) enhance knowledge acquisition but do not improve reasoning abilities. We propose Solo Performance Prompting (SPP) to elicit both knowledge and reasoning abilities in LLMs, improving factuality while maintaining strong performance on pure-reasoning tasks.

6 Conclusion

Solo Performance Prompting unleashes the cognitive synergy abilities within powerful LLMs, significantly reducing factual hallucination while enhancing reasoning. The performance is assessed using newly proposed tasks, e.g., Trivia Creative Writing and Codenames Collaborative, demonstrating superior results compared to Standard, CoT and Self-Refine prompting. The discovery of the emergent nature of cognitive synergy on different LLMs draws an interesting analogy to human development.
Limitations

Although Solo Performance Prompting exhibits promising improvements in acquiring factually correct knowledge compared to Standard prompting, it has some limitations. For instance, even when a fine-grained persona is assigned, the answer may still be incorrect. It remains unclear to what extent assigning a persona can help enhance domain knowledge in a specific area. Dedicated diagnostic experiments and theoretical efforts are needed to quantify the impact of having a persona or not.

Furthermore, we currently adopt an identical SPP prompt with the same two demonstration examples for any given task inputs, which may be suboptimal. Future work investigating how to find better demonstration examples conditioned on each input could further improve the effectiveness of SPP.

Last but not least, given sufficient computational budget, a natural variant of SPP could extend to a multi-agent cognitive synergist setup where a leader persona identifies several expert agents and forms a cabinet to collaboratively solve a task. The multi-agent setup allows for leveraging richer computation power, larger local memory, and more flexible human-computer interaction, which could be essential for deploying to real-world applications.

Acknowledgements

We would like to express our gratitude to the anonymous reviewers for their insightful comments and suggestions. We would also like to thank our colleagues and fellow interns at Microsoft Research Asia for their valuable internal discussions and feedback. Zhenhailong Wang and Heng Ji are partially supported by U.S. DARPA ECOLE Program No. #HR00112390060 and U.S. DARPA ITM Program No. FA8650-23-C-7316. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government.

References

2023. Auto-gpt. https://github.com/Significant-Gravitas/Auto-GPT.

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023.

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, pages 2206-2240. PMLR.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901.

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. 2023. Large language models as tool makers. arXiv preprint arXiv:2305.17126.

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128.

Petru L Curşeu, Nicoleta Meslec, Helen Pluut, and Gerardus JM Lucas. 2015. Cognitive synergy in groups and group-to-individual transfer of decision-making competencies. Frontiers in Psychology, 6:1375.

Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. 2023. Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335.

Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. 2023. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv preprint arXiv:2305.10142.

Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720.

Ben Goertzel. 2009. Cognitive synergy: A universal principle for feasible general intelligence. In 2009 8th IEEE International Conference on Cognitive Informatics, pages 464-468. IEEE.

Ben Goertzel. 2017. A formal model of cognitive synergy. In Artificial General Intelligence: 10th International Conference, AGI 2017, Melbourne, VIC, Australia, August 15-18, 2017, Proceedings 10, pages 13-22. Springer.
Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2023. Critic: Large language models can self-correct with tool-interactive critiquing.

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. 2022. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608.

Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601-1611, Vancouver, Canada. Association for Computational Linguistics.

Daniel Kahneman. 2011. Thinking, Fast and Slow. Macmillan.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for "mind" exploration of large scale language model society. arXiv preprint arXiv:2303.17760.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906-1919, Online. Association for Computational Linguistics.

Theo X Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. 2023. Demystifying gpt self-repair for code generation. arXiv preprint arXiv:2306.09896.

OpenAI. 2023a. GPT-3.5. https://platform.openai.com/docs/models/gpt-3-5.

OpenAI. 2023b. Gpt-4 technical report.

Joon Sung Park, Joseph C O'Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442.

Anthony D Pellegrini. 2009. The Role of Play in Human Development. Oxford University Press, USA.

Jean Piaget. 1954. The construction of reality in the child.

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is chatgpt a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476.

Timo Schick, Jane Dwivedi-Yu, Zhengbao Jiang, Fabio Petroni, Patrick Lewis, Gautier Izacard, Qingfei You, Christoforos Nalmpantis, Edouard Grave, and Sebastian Riedel. 2022. Peer: A collaborative language model. arXiv preprint arXiv:2208.11663.

Noah Shinn, Beck Labash, and Ashwin Gopinath. 2023. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366.

Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567.

Steven A Sloman. 1996. The empirical case for two systems of reasoning. Psychological Bulletin, 119(1):3.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Zhenhailong Wang, Xiaoman Pan, Dian Yu, Dong Yu, Jianshu Chen, and Heng Ji. 2022. Zemi: Learning zero-shot semi-parametric language models from multiple tasks. arXiv preprint arXiv:2210.00185.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models.

Benfeng Xu, An Yang, Junyang Lin, Quan Wang, Chang Zhou, Yongdong Zhang, and Zhendong Mao. 2023. Expertprompting: Instructing large language models to be distinguished experts. arXiv preprint arXiv:2305.14688.
Tianci Xue, Ziqi Wang, Zhenhailong Wang, Chi Han, Pengfei Yu, and Heng Ji. 2023. Rcot: Detecting and rectifying factual inconsistency in reasoning by reversing chain-of-thought. arXiv preprint arXiv:2305.11499.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. ArXiv, abs/2210.03629.

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena.

A Prompts

A.1 SPP Prompt Design

To prompt an LLM to behave as a cognitive synergist that follows the expected task-solving procedure described in §2, we carefully designed the structure of the SPP prompt as follows. The full prompts can be found in §A.2.³

³ We use the same prompt for any arbitrary tasks.

System Principle. The first part of the prompt contains a high-level instruction: "When faced with a task, begin by identifying the participants who will contribute to solving the task. Then, initiate a multi-turn collaboration process until a final solution is reached. The participants will give critical comments and detailed suggestions whenever necessary."

Demonstration Examples. Then, we include two manually crafted demonstration examples to showcase the expected task-solving behavior. The first example describes a Game of 24 task, where we only include two personas: an AI Assistant and a Math Expert. This task aims to provide an example of a reasoning-intensive task, where the AI Assistant needs to propose multiple proposals and the other participants need to give fine-grained feedback on where the current solution went wrong and how to improve it. The second example describes a poem-writing task with diverse requirements, including lexical constraints, semantic constraints, and audience awareness. This task aims to provide an example of a knowledge-intensive task, where diverse personas are required to collaboratively solve the task. This example also demonstrates a case where it is important to assign a dedicated persona to the audience, e.g., a ten-year-old child.

Task Prefix. The last part of the prompt reminds the model to "identify the participants and collaboratively solve the following task step by step.", followed by task-specific format instructions and inputs.
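Assembled in order, these three parts form a single prefix that is prepended to every task input. The sketch below is an illustrative reconstruction with the demonstration examples abbreviated; the exact wording lives in Figure 15.

```python
SYSTEM_PRINCIPLE = (
    "When faced with a task, begin by identifying the participants who will "
    "contribute to solving the task. Then, initiate a multi-turn collaboration "
    "process until a final solution is reached. The participants will give "
    "critical comments and detailed suggestions whenever necessary."
)

# Two hand-crafted demonstrations (abbreviated here): a Game of 24 example with
# an AI Assistant and a Math Expert, and a poem-writing example with diverse
# personas including a ten-year-old child as the audience.
DEMO_EXAMPLES = "<demonstration example 1>\n\n<demonstration example 2>"

TASK_PREFIX = (
    "Now, identify the participants and collaboratively solve the following "
    "task step by step."
)

def build_spp_prompt(task_instruction: str, task_input: str) -> str:
    """Concatenate system principle, demonstrations, and the task prefix."""
    return "\n\n".join(
        [SYSTEM_PRINCIPLE, DEMO_EXAMPLES, TASK_PREFIX, task_instruction, task_input]
    )
```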

A.2 Full Prompts

Figures 15, 16 and 17 show the full prompts for SPP, SPP-Profile and SPP-Fixed-Persona, respectively. Figure 18 shows the prompts for Chain-of-Thought (CoT) prompting. Figure 19 shows the prompts for Self-Refine prompting.
Figure 9: Analysis on the impact of the demonstration examples in the SPP prompt. We compare the effectiveness of the original SPP prompt with a variant where we remove the second demonstration example, which shows a multi-persona scenario. We observe that (1) SPP is fairly robust to the change in the prompt; (2) adding an additional multi-persona example apart from the single-persona one effectively boosts performance on all three tasks.
with Standard and CoT prompting across different runs.
B Task Details

B.1 Trivia Creative Writing

Figure 3 shows a detailed illustration of the Trivia Creative Writing task. Additionally, we investigate how the number of questions (N) and the ordering of the questions affect performance on the Trivia Creative Writing task. As shown in Figure 10, with a larger number of questions (N ≥ 5), Trivia Creative Writing effectively challenges GPT-4's performance. While a single question (N=1) yields similar outcomes regardless of the prompting method, the SPP approach is notably superior for larger N. The ordering of the questions has minimal impact on the task performance. The topic list is automatically generated by prompting GPT-4 to provide 100 nouns from pop culture.⁴

⁴ The full prompt for generating the topic list can be found in Figure 20. We performed further human curation to avoid potential harmful content.
C Inference Configurations

The main results in Table 2 are obtained from GPT-4. The GPT-4 API version we employ is Azure 2023-3-15-preview.⁵ The temperature is set to 0.0 (most conservative) and top_p to 1.0 for all generations to maximize reproducibility. Since the GPT-4 generation can still be non-deterministic even when the temperature is set to 0.0, we conduct an additional experiment to investigate its generation consistency under this configuration. As shown in Table 3, we perform three individual runs and compute the mean and standard deviation of the metric score on Trivia Creative Writing. We find that the variance is sufficiently small and that Solo Performance Prompting enjoys lower variance than Standard and CoT prompting.

⁵ There are rare cases when a generation triggers the content filter of the API. We exclude those instances from our results.

Methods  | Run 1 | Run 2 | Run 3 | Mean (std)
Standard | 75.6  | 74.4  | 73.1  | 74.4 ±1.3
CoT      | 68.8  | 69.6  | 70.8  | 69.7 ±1.0
SPP      | 80.0  | 79.8  | 80.8  | 80.2 ±0.5

Table 3: Investigation of the generation consistency of the GPT-4 API. The experiment is performed on the Trivia Creative Writing task (N=5). We set the inference temperature to 0.0 and top_p to 1.0, as in all experiments conducted in the paper. The results show that the GPT-4 generation is fairly consistent, with a small variance (∼1%). We also observe that SPP shows lower variance compared with Standard and CoT prompting across different runs.

To evaluate the potential impact of initial persona assignment through a system message, we consider two inference settings: with or without the default system message, "You are an AI assistant that helps people find information". Divergent patterns are observed across various tasks and methods regarding the use of the system message. We report the average metric scores across both inference settings in Table 2. Full GPT-4 results for each setting can be found in Appendix F. For the GPT-3.5 results in Figure 6, we employ the same prompt, hyper-parameters, and the best system message setting in terms of SPP's GPT-4 performance. For Llama2, we leverage the Huggingface text-generation pipeline⁶ with greedy decoding.

⁶ https://huggingface.co/blog/llama2
potential harmful content. Collaborative task, illustrating that SPP generates
5
There are rare cases when a generation triggers the content
6
filter of the API. We exclude those instances from our results. https://huggingface.co/blog/llama2
(a) Trivia Creative Writing with a large enough number of (b) The ordering of the questions in the Trivia Creative
questions (N) effectively pose challenge to GPT-4 in terms Writing task does not bring too much impact. The per-
of factual correctness. With N=1, different prompting formance on shuffled questions is close to the original
methods result in similar performance, while with N>=5, ordered questions.
SPP shows visible superiority.
Figure 10: Analysis on the impact of the number of questions (N) and the ordering of the questions for the Trivia
Creative Writing task.
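To make these settings concrete, a minimal sketch of the two inference paths is given below. The client usage, model names, and helper structure are illustrative assumptions rather than the authors' released code; the only details taken from the text above are temperature 0.0, top_p 1.0, the optional system message, and greedy decoding for Llama2 via the Huggingface text-generation pipeline.

from openai import OpenAI                # assumed client; the paper calls an Azure GPT-4 deployment
from transformers import pipeline

SYSTEM_MESSAGE = "You are an AI assistant that helps people find information."

def gpt4_generate(prompt, use_system_message, model="gpt-4"):
    # Chat-model call with the decoding settings described in Appendix C.
    client = OpenAI()
    messages = []
    if use_system_message:               # both settings are evaluated in the paper
        messages.append({"role": "system", "content": SYSTEM_MESSAGE})
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.0,
        top_p=1.0,
    )
    return response.choices[0].message.content

def llama2_generate(prompt):
    # Greedy decoding with the Huggingface text-generation pipeline (model name assumed).
    generator = pipeline("text-generation", model="meta-llama/Llama-2-13b-chat-hf", device_map="auto")
    out = generator(prompt, do_sample=False, max_new_tokens=1024, return_full_text=False)
    return out[0]["generated_text"]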

D Additional Qualitative Analysis

Figure 11 presents examples of the Trivia Creative Writing task, illustrating that although CoT can generate plausible plans for task resolution, the final outcomes often contain factual inaccuracies and instances of hallucination. In contrast, SPP elicits precise knowledge with fine-grained personas. Figure 12 displays examples of the Codenames Collaborative task, illustrating that SPP generates intermediate dialogues that are both detailed and interpretable, leading to superior performance compared to CoT. Figure 13 shows additional qualitative examples of Solo Performance Prompting vs. SPP-Profile.

E Early-termination with SPP-Fixed-Persona

Figure 14 shows an example of the early-termination problem (defined in § 4), where the generation stops before reaching the final solution, as if the model were waiting for input from an external user.
The problem is particularly severe on certain tasks, e.g., Codenames Collaborative, resulting in unexpectedly low performance as shown in Figure 7b. The problem can be largely alleviated by removing the system message, but it cannot be entirely eliminated. Table 4 shows the statistics of the early-termination problem for each task and method. In contrast, we did not observe early termination with SPP, SPP-Profile, Standard, or CoT prompting on GPT-4.
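Because every prompt asks the model to finish with the prefix "Final answer:", one simple way to count early terminations, consistent with the description above, is to flag generations that never emit that prefix. This is an illustrative check rather than necessarily the exact criterion used to build Table 4.

def is_early_termination(generation):
    # Flag outputs that stop before emitting the required "Final answer:" prefix.
    return "final answer:" not in generation.lower()

def early_termination_count(generations):
    flagged = sum(is_early_termination(g) for g in generations)
    return f"{flagged} / {len(generations)}"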

F Full Results

Full results on the three tasks, Trivia Creative Writing, Codenames Collaborative, and Logic Grid Puzzle, can be found in Tables 5, 6, and 7, respectively.

G Usage of AI assistants in writing

We used ChatGPT and GPT-4 solely for checking and correcting grammar.
SPP vs. CoT (Trivia Creative Writing N=5)

Instance Index: 10
Question #5: In which country is the Sky Train Rail bridge?

SPP:
Participants: AI Assistant (you); Movie Expert; Art Expert; Sports Expert; Geography Expert
[...]
Geography Expert: The Sky Train Rail Bridge is located in Canada.

CoT:
Plan:
[...] 6. Incorporate the Sky Train Rail bridge, which is in Thailand. (factual error)

Instance Index: 12
Question #5: Which 70s show was based on the British show Till Death Us Do Part?

SPP:
Participants: AI Assistant (you); Story Writer; Music Expert; TV Show Expert; Cultural Expert
[...]
TV Show Expert: [...] The 70s show All in the Family was based on the British show Till Death Us Do Part.

CoT:
Plan:
[...] 6. Mention a show that the Little Mermaid watches, based on the British show Till Death Us Do Part.
Story:
[...] Ariel and her friends often gathered to watch their favorite show, "Undersea Squabbles," (hallucination) which was based on the British show Till Death Us Do Part.

Figure 11: SPP vs CoT qualitative examples on Trivia Creative Writing (N=5). We find that although CoT generates reasonable plans or steps, it tends to suffer from factual errors and hallucination.

Tasks                              added system message    # early-termination
Trivia Creative Writing (N=5)      yes                     18 / 100
Trivia Creative Writing (N=5)      no                      0 / 100
Trivia Creative Writing (N=10)     yes                     16 / 100
Trivia Creative Writing (N=10)     no                      1 / 100
Codenames Collaborative            yes                     37 / 50
Codenames Collaborative            no                      4 / 50
Logic Grid Puzzle                  yes                     11 / 200
Logic Grid Puzzle                  no                      15 / 200

Table 4: Early termination statistics on SPP-Fixed-Persona: Removing the system message, "You are an AI assistant that helps people find information.", can effectively reduce the problem but cannot fully eliminate it.
SPP vs. CoT (Codenames Collaborative)

Instance Index: 9
Word List: shoulder, glacier, anchor, magician, tunnel, sugar, gear, book, rainbow, ram, nose, pacific, volume, rope, mark, drum, page, sack, kick
Target Words: kick, rope

SPP Spymaster:
Participants: AI Assistant (you); Language Expert; Word Association Expert
Language Expert: The task is to find a single word hint that can accurately represent and link the 2 given words: "kick, rope". The hint should not cause confusion with other words from the list provided.
Word Association Expert: Let's first analyze the words "kick" and "rope". We can think of activities or objects that involve both kicking and using a rope. For example, sports or games.
AI Assistant (you): Thanks for the suggestions! Based on the idea of sports or games, I propose the hint "jump". It can be associated with both "kick" and "rope" without causing confusion with the other words in the list.
[...]
Final answer: jump

CoT Spymaster:
Steps:
1. Analyze the given words: "kick" and "rope".
2. Identify the common theme or connection between the two words.
3. Compare the connection with the list of words provided to ensure the hint does not cause confusion with other words.
4. Select a hint that accurately represents and links the two given words without causing confusion.
Answer: Sport

SPP Spymaster Hint: jump
CoT Spymaster Hint: Sport

SPP Guesser:
Participants: AI Assistant (you); Linguist; Athlete
Linguist: When thinking about words associated with "jump," consider words that are related to physical activities, sports, or movement. Look for words that can be used in the same context as "jump" or have a similar meaning.
Athlete: In sports and physical activities, we often use equipment or tools that help us jump or perform jumping-related actions. Keep that in mind when selecting the words.
AI Assistant (you): Thanks for the guidance! Based on your suggestions, I think "rope" and "kick" are the two words best associated with "jump" from the list. Rope can be related to jumping rope, and kick can be associated with jumping in some sports like soccer or martial arts.
Linguist: Let's verify if the words are related to "jump." Rope is indeed related to jumping rope, which is a common physical activity. Kick is also related to jumping in some sports, as you mentioned. Both words seem to be good choices.
Athlete: I agree with the choices. Rope and kick are both related to jumping and physical activities.
Final answer: rope, kick

CoT Guesser:
Steps:
1. Review the list of words provided.
2. Identify words that are directly related to or commonly associated with sports.
3. Select the top 2 words that are most closely related to sports.
Answer: gear, kick

SPP Score: 2 / 2 = 1.0
CoT Score: 1 / 2 = 0.5

Figure 12: SPP vs CoT qualitative examples on Codenames Collaborative. We find that SPP provides much more detailed and interpretable intermediate discussions from various perspectives, which leads to stronger knowledge selection, integration, and theory-of-mind capabilities.
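The scores shown above (2 / 2 for SPP and 1 / 2 for CoT) are consistent with scoring the guesser by the overlap between its predicted words and the target words. The sketch below illustrates such a scorer; the paper's exact normalization may differ.

def codenames_guesser_score(predicted, targets):
    # Fraction of target words recovered by the guesser (case-insensitive).
    predicted_set = {w.strip().lower() for w in predicted}
    target_set = {w.strip().lower() for w in targets}
    return len(predicted_set & target_set) / len(target_set)

# Example from Figure 12: SPP recovers both targets, CoT recovers only "kick".
assert codenames_guesser_score(["rope", "kick"], ["kick", "rope"]) == 1.0
assert codenames_guesser_score(["gear", "kick"], ["kick", "rope"]) == 0.5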
SPP vs. SPP-Fixed-Persona (Trivia Creative Writing N=5)

Instance Index: 30
Question #1: Which orchestral conductor was married to one of the subjects of the film Hillary and Jackie?

SPP:
Participants: AI Assistant (you); Film Expert; Literary Expert; History Expert; Geography Expert; Architecture Expert
Film Expert: The orchestral conductor married to one of the subjects of the film Hillary and Jackie is Daniel Barenboim.

SPP-Fixed-Persona:
Participants: AI Assistant (you); Expert
Expert: Sir Simon Rattle was married to one of the subjects of the film Hillary and Jackie.

Instance Index: 7
Question #3: Which musician founded the Red Hot Peppers?

SPP:
Participants: AI Assistant (you); Storyteller; Mythology Expert; Music Enthusiast; Historian; Speed Record Expert
Music Enthusiast: The musician who founded the Red Hot Peppers is Jelly Roll Morton.

SPP-Fixed-Persona:
Participants: AI Assistant (you); Expert
AI Assistant (you): The musician who founded the Red Hot Peppers is Louis Armstrong.

Instance Index: 56
Question #1: What sport or pastime would you be playing if your balls were black and blue and your opponent's were red and yellow?

SPP:
Participants: AI Assistant (you); Story Writer; Sports Enthusiast; Historian; Movie Buff; Music Expert; Mythology Expert
Sports Enthusiast: The sport with black and blue balls for one player and red and yellow balls for the other is Croquet.

SPP-Fixed-Persona:
Participants: AI Assistant (you); Expert
Expert: Snooker

Figure 13: SPP vs SPP-Fixed-Persona qualitative examples on Trivia Creative Writing (N=5). Each example shows one of the trivia questions in the input instance, the identified participants and the provided answer. We observe that the dynamically identified fine-grained personas, such as "Film Expert", "Music Enthusiast" and "Sports Enthusiast", tend to outperform the fixed general persona, "Expert".
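The comparison above hinges on the personas that SPP identifies in its opening "Participants:" line. A small helper such as the following, an assumed post-processing step rather than the paper's code, can extract those personas from a generation.

import re

def extract_personas(generation):
    # Parse the "Participants: ..." line emitted at the start of an SPP generation.
    match = re.search(r"Participants:\s*(.+)", generation)
    if match is None:
        return []
    names = [p.strip() for p in match.group(1).split(";")]
    # Drop the leader persona, keeping only the dynamically identified experts.
    return [n for n in names if n and not n.startswith("AI Assistant")]

example = "Participants: AI Assistant (you); Film Expert; Literary Expert; History Expert"
print(extract_personas(example))  # ['Film Expert', 'Literary Expert', 'History Expert']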
Early-termination example with SPP on Llama2-13b-chat

Input
[... SPP procedure instructions and demonstration examples ...]

Now, identify the participants and collaboratively solve the following task step by step. Remember to present your final
solution with the prefix "Final answer:".

Task: {Write a short and coherent story about Harry Potter that incorporates the answers to the following 5 questions:
Who was the man behind The Chipmunks? Which Lloyd Webber musical premiered in the US on 10th December 1993?
Who was the next British Prime Minister after Arthur Balfour? Who had a 70s No 1 hit with Kiss You All Over? What
claimed the life of singer Kathleen Ferrier?}

Generation

Participants: AI Assistant (you); Harry Potter Expert; Music Expert; History Expert; Medical Expert
Start collaboration!

The generation stopped here without simulating the multi-turn dialogue following the demonstration examples.
Further Observation: Llama2 cannot follow the procedure as demonstrated in the examples, even with additional
prompting, such as: "Follow the same procedure as the example tasks shown above. Simulate the dialogue of the
participants." This indicates that a strong instruction-following ability is a prerequisite for SPP to effectively elicit cognitive
synergy abilities.

Early-termination example with SPP-Fixed-Persona on GPT-4

Input

[... SPP-Fixed-Persona Prompt ...]

Now, identify the participants and collaboratively solve the following task step by step. Note that the participants can only
be either AI Assistant (you) or Expert. Remember to provide the final solution with the following format "Final answer: (a
list of words here).".

Task: {Try to identify the 4 words best associated with the word "film" from the following list: locust, fever, street,
sherwood, kiss, popcorn, craft, ant, crystal, tear, cowboy, government, pine, mountie, soap, genie, king arthur, sphinx,
director, bride, razor, fog, whistle. Your answer should be a comma-separated list of words.}

Generation
Participants: AI Assistant (you); Expert

AI Assistant (you): Let's first analyze the task. We need to find the 4 words that are most closely associated with the
word "film" from the given list. We can start by identifying words that are directly related to films, such as those related to
movie genres, settings, or elements commonly found in films.

Expert: That's a good approach. You can also consider words that are related to the experience of watching a film, such
as items that are commonly consumed during a movie or words that describe the atmosphere of a film. Make sure to pick
the 4 words that have the strongest association with "film".

The generation stopped here without providing the Final answer

Figure 14: Examples of the early-termination problem with SPP on Llama2-13b-chat and SPP-Fixed-Persona on
GPT-4.
SPP Prompt
When faced with a task, begin by identifying the participants who will contribute to solving the task. Then, initiate a multi-round collaboration process until a final solution is reached. The participants will
give critical comments and detailed suggestions whenever necessary.

Here are some examples:


---
Example Task 1: Use numbers and basic arithmetic operations (+ - * /) to obtain 24. You need to use all numbers, and each number can only be used once.
Input: 6 12 1 1

Participants: AI Assistant (you); Math Expert

Start collaboration!

Math Expert: Let's analyze the task in detail. You need to make sure that you meet the requirement, that you need to use exactly the four numbers (6 12 1 1) to construct 24. To reach 24, you can think
of the common divisors of 24 such as 4, 6, 8, 3 and try to construct these first. Also you need to think of potential additions that can reach 24, such as 12 + 12.
AI Assistant (you): Thanks for the hints! Here's one initial solution: (12 / (1 + 1)) * 6 = 24
Math Expert: Let's check the answer step by step. (1+1) = 2, (12 / 2) = 6, 6 * 6 = 36 which is not 24! The answer is not correct. Can you fix this by considering other combinations? Please do not make
similar mistakes.
AI Assistant (you): Thanks for pointing out the mistake. Here is a revised solution considering 24 can also be reached by 3 * 8: (6 + 1 + 1) * (12 / 4) = 24.
Math Expert: Let's first check if the calculation is correct. (6 + 1 + 1) = 8, 12 / 4 = 3, 8 * 3 = 24. The calculation is correct, but you used 6 1 1 12 4 which is not the same as the input 6 12 1 1. Can you
avoid using a number that is not part of the input?
AI Assistant (you): You are right, here is a revised solution considering 24 can be reached by 12 + 12 and without using any additional numbers: 6 * (1 - 1) + 12 = 24.
Math Expert: Let's check the answer again. 1 - 1 = 0, 6 * 0 = 0, 0 + 12 = 12. I believe you are very close, here is a hint: try to change the "1 - 1" to "1 + 1".
AI Assistant (you): Sure, here is the corrected answer: 6 * (1+1) + 12 = 24
Math Expert: Let's verify the solution. 1 + 1 = 2, 6 * 2 = 12, 12 + 12 = 12. You used 1 1 6 12 which is identical to the input 6 12 1 1. Everything looks good!

Finish collaboration!

Final answer: 6 * (1 + 1) + 12 = 24

---
Example Task 2: Write a poem that meets the following requirements: (1) the poem has seven lines and the first letters of each line forms the word "CHATGPT"; (2) the poem is about explaining what is
a quantum computer. (3) the poem needs to be easy to understand by a ten years old kid.

Participants: AI Assistant (you); Poet; Computer Scientist; Ten year old child

Poet: Make sure that you write the poem with seven lines, and the first letters of the lines should be C, H, A, T, G, P, T.
Computer Scientist: A quantum computer is an advanced computing device that uses the principles of quantum mechanics to process and store information. Unlike classical computers that use bits
to represent information as 0s and 1s, quantum computers use quantum bits or qubits. Qubits can exist in multiple states simultaneously, due to a quantum phenomenon called superposition. You can
consider using these information for the poem.
Ten year old child: I hope the poem to be fun and easy to understanding. I don't want to see a lot of jargons or complicated concepts.
AI Assistant (you): Thanks for the guidance! Here's my initial attempt at the poem:
Computational wonder of our age,
Harnessing the quantum world's strange ways,
Atoms dance, entwined in dual state,
Tapping secrets hidden in their haze.

Grand power to solve, simulate,


Profound problems that perplex the wise,
Transforming our future, we await.

Poet: Let's verify if the poem meets the requirements. The first letters are CHATGPT which is correct! And the poem rhymes well. Good job!
Computer Scientist: Everything looks good to me!
Ten year old child: I don't know what does perplex mean. Can you make the use of words easier to understand?
AI Assistant (you): Sure, let me revise the poem by using more common words. Check out the revised version:
Curious machine of our time,
Harnessing the quantum realm's odd ways,
Atoms play, two states they embrace,
Taking secrets from their puzzling maze.

Great power to solve and imitate,


Problems that confuse the brightest minds,
Transforming our future, we await.

Poet: Let's check again if the poem meets the requirements. The first letters are C H A T G P T. And now the poem is more accessible to children. Everything looks good to me.
Computer Scientist: Looking good!
Ten year old child: I like this version a lot!

Finish collaboration!

Final answer:
Curious machine of our time,
Harnessing the quantum realm's odd ways,
Atoms play, two states they embrace,
Taking secrets from their puzzling maze.

Great power to solve and imitate,


Problems that confuse the brightest minds,
Transforming our future, we await.

---
Now, identify the participants and collaboratively solve the following task step by step. {Task-specific Formating Instruction}

Task: {Task input}

Figure 15: SPP full prompt.
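The SPP prompt above is a template with two placeholders, {Task-specific Formating Instruction} and {Task input}. A minimal sketch of filling the template and querying a chat model is shown below; the model handle and client usage are assumptions for illustration, not the authors' implementation.

from openai import OpenAI

def build_spp_prompt(template, format_instruction, task_input):
    # Fill the two placeholders of the SPP prompt template (Figure 15).
    prompt = template.replace("{Task-specific Formating Instruction}", format_instruction)
    return prompt.replace("{Task input}", task_input)

def run_spp(template, format_instruction, task_input):
    client = OpenAI()
    prompt = build_spp_prompt(template, format_instruction, task_input)
    response = client.chat.completions.create(
        model="gpt-4",                               # assumed model handle
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        top_p=1.0,
    )
    return response.choices[0].message.content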


SPP-Profile Prompt
When faced with a task, begin by identifying the participants who will contribute to solving the task. Provide profiles of the participants, describing their expertise or needs. Then, initiate a multi-round
collaboration process until a final solution is reached. The participants will give critical comments and detailed suggestions whenever necessary.

Here are some examples:

[...]

Participants: AI Assistant (you); Math Expert

Profiles:
- AI Assistant (you): A super-intelligent AI assistant capable of performing tasks more effectively than humans.
- Math expert: A person who is good at math games, arithmetic calculation, and long-term planning.

[...]

Participants: AI Assistant (you); Poet; Computer Scientist; Ten year old child

Profiles:
- AI Assistant (you): A super-intelligent AI assistant capable of performing tasks more effectively than humans.
- Poet: A person who studies and creates poetry. The poet is familiar with the rules and formats of poetry and can provide guidance on how to write a poem.
- Computer Scientist: A scholar who specializes in the academic study of computer science. The computer scientist is familiar with the concept of a quantum computer and can provide
guidance on how to explain it.
- Ten year old child: A child with a limited English vocabulary and little knowledge about complicated concepts, such as a quantum computer.

[...]

---
Now, identify the participants, provide their profiles, and collaboratively solve the following task step by step. {Task-specific Formating Instruction}

Task: {Task input}

Figure 16: SPP-Profile full prompt. "[...]" indicates identical parts with SPP. Green text indicates the key difference
between SPP-Profile and SPP.

Scores (N = 5) (%)
Methods                  w/ system message    w/o system message    average    max
Standard                 75.6                 73.6                  74.6       75.6
CoT                      68.8                 65.6                  67.1       68.8
Self-Refine [iter=0]     74.9                 72.7                  73.8       74.9
Self-Refine [iter=1]     75.3                 72.5                  73.9       75.3
SPP-Fixed-Persona        66.1                 79.6                  72.9       79.6
SPP-Profile              79.8                 78.3                  79.1       79.8
SPP                      80.0                 79.8                  79.9       80.0

Scores (N = 10) (%)
Methods                  w/ system message    w/o system message    average    max
Standard                 77.2                 76.8                  77.0       77.2
CoT                      71.6                 65.3                  68.5       71.6
Self-Refine [iter=0]     77.1                 75.4                  76.3       77.1
Self-Refine [iter=1]     78.2                 75.6                  76.9       78.2
SPP-Fixed-Persona        70.5                 81.3                  75.9       81.3
SPP-Profile              82.3                 83.8                  83.0       83.8
SPP                      85.2                 84.2                  84.7       85.2

Table 5: Trivia Creative Writing full results, including two inference settings: with system message and without system message. "average" and "max" indicate the mean and max score across the two settings. The system message we use is: "You are an AI assistant that helps people find information."
SPP-Fixed-Persona Prompt
When faced with a task, begin by identifying the participants who will contribute to solving the task. Note that the participants can only be either AI Assistant (you) or Expert. Then, initiate a multi-round
collaboration process until a final conclusion is reached. The Expert will give critical comments and detailed suggestions whenever necessary.

Here are some examples:


---
Example Task 1: Use numbers and basic arithmetic operations (+ - * /) to obtain 24. You need to use all numbers, and each number can only be used once.
Input: 6 12 1 1

Participants: AI Assistant (you); Expert

Start collaboration!

Expert: Let's analyze the task in detail. You need to make sure that you meet the requirement, that you need to use exactly the four numbers (6 12 1 1) to construct 24. To reach 24, you can think of
the common divisors of 24 such as 4, 6, 8, 3 and try to construct these first. Also you need to think of potential additions that can reach 24, such as 12 + 12.
AI Assistant (you): Thanks for the hints! Here's one initial solution: (12 / (1 + 1)) * 6 = 24
Expert: Let's check the answer step by step. (1+1) = 2, (12 / 2) = 6, 6 * 6 = 36 which is not 24! The answer is not correct. Can you fix this by considering other combinations? Please do not make
similar mistakes.
AI Assistant (you): Thanks for pointing out the mistake. Here is a revised solution considering 24 can also be reached by 3 * 8: (6 + 1 + 1) * (12 / 4) = 24.
Expert: Let's first check if the calculation is correct. (6 + 1 + 1) = 8, 12 / 4 = 3, 8 * 3 = 24. The calculation is correct, but you used 6 1 1 12 4 which is not the same as the input 6 12 1 1. Can you avoid
using a number that is not part of the input?
AI Assistant (you): You are right, here is a revised solution considering 24 can be reached by 12 + 12 and without using any additional numbers: 6 * (1 - 1) + 12 = 24.
Expert: Let's check the answer again. 1 - 1 = 0, 6 * 0 = 0, 0 + 12 = 12. I believe you are very close, here is a hint: try to change the “1 - 1” to “1 + 1”.
AI Assistant (you): Sure, here is the corrected answer: 6 * (1+1) + 12 = 24
Expert: Let's verify the solution. 1 + 1 = 2, 6 * 2 = 12, 12 + 12 = 12. You used 1 1 6 12 which is identical to the input 6 12 1 1. Everything looks good!

Finish collaboration!

Final answer: 6 * (1 + 1) + 12 = 24

---
Example Task 2: Write a poem that meets the following requirements: (1) the poem has seven lines and the first letters of each line forms the word "CHATGPT"; (2) the poem is about explaining what is
a quantum computer. (3) the poem needs to be easy to understand by a ten years old kid.

Participants: AI Assistant (you); Expert

Expert: Make sure that you write the poem with seven lines, and the first letters of the lines should be C, H, A, T, G, P, T. A quantum computer is an advanced computing device that uses the principles
of quantum mechanics to process and store information. Unlike classical computers that use bits to represent information as 0s and 1s, quantum computers use quantum bits or qubits. Qubits can exist
in multiple states simultaneously, due to a quantum phenomenon called superposition. You can consider using these information for the poem. I hope the poem to be fun and easy to understanding. I
don't want to see a lot of jargons or complicated concepts.
AI Assistant (you): Thanks for the guidance! Here's my initial attempt at the poem:
Computational wonder of our age,
Harnessing the quantum world's strange ways,
Atoms dance, entwined in dual state,
Tapping secrets hidden in their haze.

Grand power to solve, simulate,


Profound problems that perplex the wise,
Transforming our future, we await.

Expert: Let's verify if the poem meets the requirements. The first letters are CHATGPT which is correct! And the poem rhymes well. Good job! I don't know what does perplex mean. Can you make the
use of words easier to understand?
AI Assistant (you): Sure, let me revise the poem by using more common words. Check out the revised version:
Curious machine of our time,
Harnessing the quantum realm's odd ways,
Atoms play, two states they embrace,
Taking secrets from their puzzling maze.

Great power to solve and imitate,


Problems that confuse the brightest minds,
Transforming our future, we await.

Expert: Let's check again if the poem meets the requirements. The first letters are C H A T G P T. And now the poem is more accessible to children. Everything looks good to me. I like this version a
lot!

Finish collaboration!

Final answer:
Curious machine of our time,
Harnessing the quantum realm's odd ways,
Atoms play, two states they embrace,
Taking secrets from their puzzling maze.

Great power to solve and imitate,


Problems that confuse the brightest minds,
Transforming our future, we await.

---

Now, identify the participants and collaboratively solve the following task step by step. {Task-specific Formating Instruction}

Task: {Task input}

Figure 17: SPP-Fixed-Persona full prompt. Red text indicates the key difference between SPP-Fixed-Persona and
SPP.
CoT Prompts

Trivia Creative Writing:
{Trivia Creative Writing Task Input}
Make a plan then write. Your output should be of the following format:
Plan:
Your plan here.
Story:
Your story here.

Codenames Collaborative (Spymaster / Guesser):
{Codenames Spymaster/Guesser Input}
Solve the task step by step. Your output should be of the following format:
Steps:
Your steps here.
Answer: (a single word here) / (A list of words here)

Logic Grid Puzzle:
{Logic Grid Puzzle Input}
Solve the task step by step. Your output should be of the following format:
Steps:
Your steps here.
Answer:
The house number here.

Figure 18: CoT prompts.

Self-Refine Prompts

Feedback Prompts

Trivia Creative Writing:
{task instruction + previous response}
---
Reflect on the response. Analyze the correctness of the information provided, and the coherence of the story. Provide critique to help improve the response. Your feedback:

Codenames Collaborative:
{task instruction + previous spymaster / guesser response}
---
Analyze the quality of the answer. Provide critique to improve the answer. Your feedback:

Logic Grid Puzzle:
{task instruction + previous answer}
---
Analyze the correctness of the answer. If it is not correct, provide critique to improve the answer. Your feedback:

Refine Prompts

Trivia Creative Writing:
{task instruction / previous response}
---
{feedback}
---
Based on your initial response and the subsequent feedback, revise the response. Your revised response:

Codenames Collaborative:
{task instruction + previous spymaster / guesser response}
---
{feedback}
---
Based on your initial answer and the subsequent feedback, revise the answer. Your revised answer:

Logic Grid Puzzle:
{task instruction + previous answer}
---
{feedback}
---
Based on your initial answer and the subsequent feedback, revise the answer. Your revised answer:
The house number here. (Follow the original format. DO NOT add anything after the answer.)

Figure 19: Self-refine prompts.
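Putting the feedback and refine prompts together, one Self-Refine iteration (the [iter=1] rows in Tables 5-7) can be sketched as below for the Trivia Creative Writing task. The wiring and the generate callable are assumptions; only the prompt wording is taken from Figure 19.

def trivia_feedback_prompt(task_instruction, response):
    # Mirrors the Trivia Creative Writing feedback prompt in Figure 19.
    return (
        f"{task_instruction}\n{response}\n---\n"
        "Reflect on the response. Analyze the correctness of the information provided, "
        "and the coherence of the story. Provide critique to help improve the response.\n"
        "Your feedback:"
    )

def trivia_refine_prompt(task_instruction, response, feedback):
    # Mirrors the Trivia Creative Writing refine prompt in Figure 19.
    return (
        f"{task_instruction}\n{response}\n---\n{feedback}\n---\n"
        "Based on your initial response and the subsequent feedback, revise the response.\n"
        "Your revised response:"
    )

def self_refine(generate, task_instruction, iterations=1):
    # The initial generation corresponds to Self-Refine [iter=0]; each loop adds one refinement.
    response = generate(task_instruction)
    for _ in range(iterations):
        feedback = generate(trivia_feedback_prompt(task_instruction, response))
        response = generate(trivia_refine_prompt(task_instruction, response, feedback))
    return response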

Provide 100 nouns from pop culture that are PG or PG 13 rated. Try not to include any adult, racial or harmful
content. Try to be as diverse as possible, including movies, books, games, shows, etc. Do not include duplicates.

Figure 20: Prompt for generating the topic list for the Trivia Creative Writing task.

Scores (%)
Methods                  w/ system message    w/o system message    average    max
Standard                 74.5                 76.3                  75.4       76.3
CoT                      71.4                 74.0                  72.7       74.0
Self-Refine [iter=0]     77.3                 73.2                  75.3       77.3
Self-Refine [iter=1]     70.1                 58.8                  64.4       70.1
SPP-Fixed-Persona        10.1                 66.0                  38.1       66.0
SPP-Profile              80.4                 72.9                  76.7       80.4
SPP                      82.5                 75.5                  79.0       82.5

Table 6: Codenames Collaborative full results, including two inference settings: with system message and without system message. "average" and "max" indicate the mean and max score across the two settings. The system message we use is: "You are an AI assistant that helps people find information."
Scores (%)
Methods                  w/ system message    w/o system message    average    max
Standard                 56.8                 58.6                  57.7       58.6
CoT                      69.5                 62.1                  65.8       69.5
Self-Refine [iter=0]     62.0                 55.5                  58.8       62.0
Self-Refine [iter=1]     64.5                 55.5                  60.0       64.5
SPP-Fixed-Persona        63.3                 65.3                  64.3       65.3
SPP-Profile              65.7                 64.0                  64.8       65.7
SPP                      66.3                 70.4                  68.3       70.4

Table 7: Logic Grid Puzzle full results, including two inference settings: with system message and without system message. "average" and "max" indicate the mean and max score across the two settings. The system message we use is: "You are an AI assistant that helps people find information."
