Unleashing The Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent Through Multi-Persona Self-Collaboration
Human intelligence thrives on cognitive synergy, where collaboration among different minds yields superior outcomes compared to iso-
[Figure panel (a) Standard Prompting: Input, AI Assistant (single persona), Output]
arXiv:2307.05300v4 [cs.AI] 26 Mar 2024
Standard Prompting output:
[...] world of Hogwarts. Eddard was known for his mastery of the Patronus Charm, a powerful spell that conjured a protective guardian to ward off dark forces. (missing the incantation!) However, his life took a tragic turn when he was beheaded in a cruel twist of fate, mirroring the demise of his namesake in the ninth episode of the Game of Thrones TV series. Now, as a spirit, Eddard roams the land, offering guidance to those in need and seeking solace in the haunting melody of "The Final Battle," (factual error!) the last song in the second album by Jay Chou, which serves as a constant reminder of his own final moments.

SPP output:
Feedback
: The incantation "Expecto Patronum" is well-incorporated into the story.
: The character name Eddard Stark is used effectively in the story.
: The inclusion of "An Jing" adds a nice touch to the story.
Finish Collaboration!
Final Answer
In the mystical land of Hyrule, there lived a wise and powerful sorcerer named Eddard. Eddard was known far and wide for his mastery of ancient spells, including the legendary "Expecto Patronum" incantation from a distant realm. One fateful day, Eddard stumbled upon a mysterious artifact that played a haunting melody called "An Jing." The melody resonated with his soul, unlocking a hidden power within him. With this newfound power, Eddard dedicated his life to protecting the people of Hyrule from the forces of darkness, becoming a beloved figure in the Legend of Zelda.
Figure 2: Task-solving example of Solo Performance Prompting (SPP) with GPT-4. The personas of the participants
are automatically identified by GPT-4 based on the task input. This example shows that Standard Prompting suffers
from factual errors, whereas SPP provides accurate information and a coherent answer. Note that, in real-world
applications, the domains can vary not only within entertainment but also encompass history, science, education,
healthcare, etc.
that collaborates with multiple minds to enhance problem-solving and efficacy in complex tasks. In this work, we aim to create a cognitive synergist based on a single LLM that can "split into" multiple personas and engage in self-collaboration to solve both knowledge-intensive and reasoning-intensive tasks. This idea is heavily inspired by the role of pretend play (Piaget, 1954; Pellegrini, 2009) in cognitive development and recent findings that assigning personas (Deshpande et al., 2023; Xu et al., 2023) to LLMs can elicit specific behaviors, improve answer quality, and potentially build an AI society (Park et al., 2023; Schick et al., 2022; Li et al., 2023; Cai et al., 2023) with collaborative LLM agents. However, as shown in Table 1, previous works have limitations such as fixed or task-specific personas, the need for additional fine-tuning, and increased inference costs due to multiple LLM instances.

To unleash the potential of cognitive synergy for general task-solving, we propose Solo Performance Prompting (SPP), which prompts a single LLM to identify, simulate, and collaborate with multiple personas. Figure 1 provides a high-level overview of SPP. Here, a persona can represent either a domain expert, such as a movie enthusiast, or a target audience, such as a ten-year-old child. Through the dynamic identification of various personas, we empower a single LLM to acquire diverse domain knowledge accurately without additional retrieval systems. By facilitating multi-turn self-collaboration, we enable self-revision and self-feedback from various perspectives without requiring additional agents.

In real-world scenarios, such as those in creative industries, there is often a need to incorporate diverse information from different domains. Figure 2 presents a concrete example of how SPP operates on a challenging task that requires creative integration of information from various domains, such as the Legend of Zelda game, Harry Potter movies, and Jay Chou’s albums. Standard prompting fails to generate satisfactory output due to missing essential information and factual errors. In contrast, SPP produces informative and coherent answers by automatically identifying expert personas and engaging in a multi-turn self-collaboration. In this process, the AI Assistant persona iteratively writes
Method | General task solving? | Pure zero-shot prompting? | Has multiple personas? | Personas dynamically identified? | Has iterative refinement? | Need only a single LLM?
† Standard Prompting (Brown et al., 2020)
† Chain-of-Thought (Wei et al., 2023)
Inner Monologue (Huang et al., 2022)
ReAct (Yao et al., 2022)
Reflexion (Shinn et al., 2023)
† Self-Refine (Madaan et al., 2023)
Tree-of-thought (Yao et al., 2023)
GPT-Bargaining (Fu et al., 2023) (fixed to 3)
Camel (Li et al., 2023) (fixed to 2)
ExpertPrompting (Xu et al., 2023)
Solo Performance Prompting (ours) (varied)
Table 1: High-level comparison with various prompting-based methods. Methods directly comparable to ours are
denoted by †. Results for the comparison can be found in Section 3. In Section 4, we further design and compare
with two variants of Solo Performance Prompting: one adopting fixed personas, as in Camel (Li et al., 2023), and
another with additional persona profiles, as proposed in ExpertPrompting (Xu et al., 2023).
drafts of the story, solicits feedback from other participants, and revises accordingly.

To explore the prevalence of cognitive synergy in different LLMs, we apply SPP to LLMs with varying scales and capabilities, including GPT-4, GPT-3.5-turbo, and Llama2-13b-chat. Comparative results show that cognitive synergy only emerges in GPT-4 and not in less capable models. This draws an interesting analogy to human development, as children typically start engaging in role-playing at the age of 2 to 3 (Piaget, 1954), but not earlier. In summary, the key contributions of this paper are as follows:

• We investigate whether LLMs can leverage cognitive synergy for general task-solving. We introduce Solo Performance Prompting (SPP), which simulates multi-agent, multi-persona collaboration in a pure zero-shot manner.
• We evaluate SPP across three challenging tasks: Trivia Creative Writing, Codenames Collaborative and Logic Grid Puzzle, spanning both knowledge- and reasoning-intensive domains. To our knowledge, SPP is the first zero-shot prompting method that can enhance both knowledge and reasoning abilities on GPT-4.
• We present an intriguing finding regarding the emergent nature of cognitive synergy ability in LLMs, which only emerges in GPT-4 and not in less powerful models.
• We conduct in-depth analyses of the impact of the identified personas and SPP prompt design, providing insights into why dynamic, fine-grained personas are necessary, as opposed to fixed, coarse-grained personas.

2 Solo Performance Prompting

To unleash the power of synergizing different personas to tackle complex problems, we propose Solo Performance Prompting (SPP), which instructs an LLM to perform the following procedure for general task-solving: (1) Persona Identification: Identify multiple participants with special personas (including a leader persona: AI Assistant) that are essential for solving the particular task. (2) Brainstorming: The participants share knowledge and provide suggestions on how to approach the task based on their own expertise. (3) Multi-Persona Iterative Collaboration: The leader persona, AI Assistant, proposes initial solutions, consults the other participants for feedback, and revises the answer iteratively. Figure 2 shows a walkthrough example of SPP during inference. Next, we formally describe the SPP procedure in detail.

Given an input sequence $x$ and a model $M$, let a prompt (including demonstration examples) prepended to the input be $p$ and the final output be $y$. Denote an intermediate generation before generating the final $y$ as $z$. Under this formulation, Standard Prompting and Chain-of-Thought (CoT) Prompting can be described as:

Standard Prompting: $y = M(x)$ (1)

CoT Prompting: $y = M(p_{cot} \,\|\, x \,\|\, \{z_1, z_2, \ldots, z_n\})$ (2)

where $p_{cot}$ is the CoT prompt, e.g., "Solve the task step-by-step", and $\{z_1, z_2, \ldots, z_n\}$ are the intermediate steps. In contrast, our proposed Solo Performance Prompting can be described as follows:

Solo Performance Prompting: $y = M(p_{spp} \,\|\, x \,\|\, z_p \,\|\, \{z_b^1, z_b^2, \ldots, z_b^m\} \,\|\, \{z_s^0, z_f^1, \ldots, z_f^m\}_{j=1..n})$ (3)
where the SPP prompt ($p_{spp}$) includes a high-level instruction and two carefully crafted demonstration examples (see footnote 1) that showcase the expected task-solving procedure of SPP. We describe the design details of the prompt in §A.1. The corresponding intermediate generations ($z$) of SPP are detailed below.

Persona Identification ($z_p$). Given an input task, SPP first generates a list of participants with different personas. For example, in Figure 2, the model identified a Jay Chou Fan persona to help answer "the last song in the second album by Jay Chou". We let the language model identify the personas dynamically instead of manually defining them. Given only two demonstration examples (detailed in §A), we observe that a state-of-the-art large language model, e.g., GPT-4 (OpenAI, 2023b), can identify accurate and meaningful personas for diverse tasks. We denote this part of the intermediate generation as $z_p$ in Equation 3.

Brainstorming ($z_b^i$). Among the identified participants, "AI Assistant (you)" is treated as a leader persona that initiates the collaboration and generates initial solutions. Before generating the initial answer, the personas brainstorm on how to approach the task from their own perspectives. For example, the Jay Chou Fan points out that the last song in Jay Chou’s second album is "An Jing" ("Silence"). We find that the brainstorming phase effectively improves the quality of the initial solution. In Equation 3, the superscript $i = 0$ is used to denote the "AI Assistant" persona, while $i \geq 1$ represents other dynamically identified personas. The intermediate generations of the brainstorming step are denoted as $\{z_b^1, z_b^2, \ldots, z_b^m\}$.

Multi-Persona Iterative Collaboration ($z_s^0$, $z_f^i$). Based on the brainstorming remarks, the AI Assistant persona generates an initial solution $z_s^0$, then consults each of the other participants for feedback $\{z_f^i\}$. The participants are encouraged to critique the current generation and give revision suggestions. For example, the Jay Chou Fan persona checks whether the song "An Jing" ("Silence") is correctly included in the story. This process can be repeated multiple times until every participant is satisfied with the current solution. In Equation 3, we denote the intermediate generations of the multi-turn dialogue as $\{z_s^0, z_f^1, \ldots, z_f^m\}_{j=1..n}$, where $n$ is the number of iterations before reaching the final answer. The final answer can be directly read out following the user-specified output format.

In summary, SPP instructs an LLM to solve general tasks via multi-persona self-collaboration in a pure zero-shot manner. In contrast, as detailed in Table 1, previous prompting-based methods are either task-specific or require additional mechanisms, e.g., searching (Yao et al., 2023), external tools (Yao et al., 2022), a memory component (Shinn et al., 2023), or fine-tuning (Xu et al., 2023).

Footnote 1: The tasks we use in the demonstration examples do not overlap with the evaluation tasks.

3 Experiments

To explore the effectiveness of Solo Performance Prompting (SPP), we adopt an evaluation methodology similar to that of previous work (Yao et al., 2023). We carefully design new tasks and select tasks from existing benchmarks (Srivastava et al., 2022) that are challenging even for the most capable LLMs (OpenAI, 2023b). The evaluation aims to cover diverse types of tasks encompassing both knowledge-intensive and reasoning-intensive domains.

Tasks. We invent the Trivia Creative Writing task (§3.1), which requires the model to internally acquire and integrate diverse information from various fields. We observe that even GPT-4 (OpenAI, 2023b) frequently exhibits hallucination and factuality errors in the Trivia Creative Writing task. We also propose the Codenames Collaborative task (§3.2), an extension of the Codenames task from BigBench (Srivastava et al., 2022) that features a two-role collaboration setup. Codenames Collaborative demands creative reasoning across a broad range of related knowledge and challenges the model’s theory of mind skills. Lastly, we include a challenging pure-reasoning task, Logic Grid Puzzle (§3.3), from BigBench (Srivastava et al., 2022), which necessitates complex multi-step reasoning.

Baselines. We compare our approach with Standard Prompting, Chain-of-Thought (CoT) prompting methods (outlined in §2) and Self-Refine (Madaan et al., 2023). For CoT, a prompt design similar to that of Yao et al. (2023) is employed, where the model is prompted to generate a plan or a series of steps before producing the final output. For Self-Refine, we follow Madaan et al. (2023) to design feedback and refine prompts. We perform one self-refine iteration, which requires three times more inferences than SPP. Full prompts for the methods can be found in Appendix A.2.
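The three formulations from §2 (Equations 1 to 3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Model` is a hypothetical text-in/text-out callable, `p_spp` stands in for the SPP prompt, `toy_model` is a stub invented for demonstration, and reading the result from a `Final answer:` line is an assumption based on the transcripts shown in the qualitative examples.

```python
from typing import Callable

# Hypothetical model interface: takes a prompt string, returns generated text.
Model = Callable[[str], str]

FINAL_MARKER = "Final answer:"  # assumed output-format marker


def standard_prompting(model: Model, x: str) -> str:
    """Equation 1: y = M(x)."""
    return model(x)


def cot_prompting(model: Model, x: str,
                  p_cot: str = "Solve the task step-by-step.") -> str:
    """Equation 2: the CoT prompt is prepended; steps and answer are generated."""
    return model(f"{p_cot}\n{x}")


def extract_final_answer(transcript: str) -> str:
    """Read the final answer out of a transcript; empty string if absent."""
    _, found, tail = transcript.rpartition(FINAL_MARKER)
    return tail.strip() if found else ""


def solo_performance_prompting(model: Model, x: str, p_spp: str) -> str:
    """Equation 3: a single model generates personas, brainstorming,
    drafts, feedback, and the final answer in one transcript."""
    transcript = model(f"{p_spp}\n{x}")
    return extract_final_answer(transcript)


def toy_model(prompt: str) -> str:
    """Stub standing in for an LLM; returns a canned SPP-style transcript."""
    return (
        "Participants: AI Assistant (you); Word Association Expert\n"
        "Word Association Expert: Think of sports or games.\n"
        'AI Assistant (you): I propose the hint "jump".\n'
        "Final answer: jump"
    )


answer = solo_performance_prompting(
    toy_model, "Find a hint for: kick, rope",
    p_spp="<SPP instruction and demonstration examples>",
)
```

The point of the sketch is that SPP, like Standard and CoT prompting, needs only one model and one prompt; all persona turns are intermediate text in the same generation.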
Methods                Trivia.C.W (N=5)     Trivia.C.W (N=10)    Codenames.C          Logic.G.Puzzle
                       Score (%)   ∆        Score (%)   ∆        Score (%)   ∆        Score (%)   ∆
Standard               74.6        0.0%     77.0        0.0%     75.4        0.0%     57.7        0.0%
CoT                    67.1        ↓10.0%   68.5        ↓11.1%   72.7        ↓3.6%    65.8        ↑14.1%
Self-Refine [iter=0]   73.8                 76.3                 75.2                 58.8
Self-Refine [iter=1]   73.9        ↓1.0%    76.9        ↓0.1%    64.6        ↓14.6%   60.0        ↑4.0%
SPP (ours)             79.9        ↑7.1%    84.7        ↑10.0%   79.0        ↑4.8%    68.3        ↑18.5%
Table 2: GPT-4 results on Trivia Creative Writing (Trivia.C.W), Codenames Collaborative (Codenames.C) and Logic
Grid Puzzle (Logic.G.Puzzle). ∆ indicates the relative gain/loss compared with Standard Prompting (first row). We
report the average scores across two individual runs with/without a system message (detailed in Appendix C).
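As a worked example of the ∆ column, the relative gain or loss can be recomputed from the scores in Table 2. The formula below is an assumption inferred from the caption ("relative gain/loss compared with Standard Prompting"); because the printed scores are rounded averages, recomputed values may differ from the printed ∆ by a few tenths of a point.

```python
def relative_delta(score: float, baseline: float) -> float:
    """Relative gain (+) or loss (-) in percent with respect to the baseline."""
    return (score - baseline) / baseline * 100.0


# Trivia.C.W (N=5) scores taken from Table 2.
standard, cot, spp = 74.6, 67.1, 79.9

spp_delta = relative_delta(spp, standard)  # roughly +7.1
cot_delta = relative_delta(cot, standard)  # roughly -10.05
```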
Models. The default model we use is GPT-4 (OpenAI, 2023b). Detailed inference configurations, API versions, and full results can be found in Appendices C and F. In §3.4, we further investigate the prevalence of cognitive synergy in LLMs with different scales and capabilities, including GPT-3.5-turbo (OpenAI, 2023a) and Llama2-13b-chat (Touvron et al., 2023).

3.1 Trivia Creative Writing: A Knowledge-Intensive Task

Task Description. As illustrated in Figure 3, Trivia Creative Writing asks a model to write a coherent story while incorporating the answers to N trivia questions. Our preliminary experiments (Figure 10) show that a sufficiently large N can effectively challenge GPT-4 to demonstrate factual knowledge across diverse domains. Thus, we mainly consider two evaluation settings, N = 5 and N = 10. We built a benchmark with 100 instances for each N, covering a total of 1000 trivia questions (see footnote 2) extracted from the TriviaQA (Joshi et al., 2017) dataset. More details can be found in Appendix B.1.

Footnote 2: To select difficult question instances that can pose challenges to GPT-4, we use a smaller open-source LLM, fastchat_t5_3b (Zheng et al., 2023), to obtain preliminary performance on the validation set, and then choose the failure cases as our question selection.

Evaluation Metrics. Evaluating GPT-4 level generation results can be challenging. Our preliminary experiments indicate that, even for humans, it is very difficult to identify which generation is better in terms of overall "quality" of the story from different prompting methods. Thus, instead of focusing on evaluating the coherence of the generation, which can be highly subjective, we employ an automatic metric which focuses on detecting factual hallucinations. As shown in Figure 3, we perform string matching with the ground truth target answers for each question on the output generation. For each question, a match to any of the answer aliases provided by the TriviaQA dataset is considered a correct mention. The metric score is computed as:

$\text{score} = \frac{\#\ \text{correct answer mentions}}{\#\ \text{trivia questions}}$

Results. Table 2 presents the results of the Trivia Creative Writing task. The key observations are as follows: (1) Chain-of-Thought (CoT) does not outperform Standard prompting, indicating that CoT is ineffective in eliciting an LLM’s knowledge abilities. Qualitative examples in Figures 8 and 11 illustrate that although CoT generates reasonable plans for task resolution, the final generation still contains factual errors and hallucinations. (2) Self-Refine only brings marginal improvements over iterations. (3) SPP outperforms all baselines significantly. The improvement is more pronounced in the N = 10 setting compared to N = 5 (10% vs. 7%), suggesting that Solo Performance Prompting is particularly beneficial when the task requires incorporating knowledge from numerous domains.

3.2 Codenames Collaborative: A Knowledge+Reasoning Task

Task Description. As illustrated in Figure 4, Codenames Collaborative is a collaborative task that challenges a model’s knowledge, reasoning, and theory of mind abilities by assigning two player roles: the Spymaster and the Guesser. The Spymaster’s role is to provide a hint word related to the target words, excluding some other distractor words, while the Guesser’s role is to identify the target words based on the given hint and the full list of words. The same LLM (GPT-4 (OpenAI, 2023b)) is used for both roles sequentially, and a dataset with 50 instances is constructed based on BigBench’s (Srivastava et al., 2022) Codenames task data.

Evaluation Metrics. The original Codenames task in the BigBench dataset has limitations due to its focus on the Guesser role and subjectivity in hint
Trivia Creative Writing Input Example (N=5)
Write a short and coherent story about {Harry Potter} that incorporates the answers to the following {5} questions: {Who was the man
behind The Chipmunks? Which Lloyd Webber musical premiered in the US on 10th December 1993? Who was the next British Prime
Minister after Arthur Balfour? Who had a 70s No 1 hit with Kiss You All Over? What claimed the life of singer Kathleen Ferrier?}
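The Trivia Creative Writing metric from §3.1 (number of correct answer mentions divided by number of trivia questions, where a match to any answer alias counts) can be sketched as below. The function name, the case-insensitive substring matching, and the example aliases are illustrative assumptions; the example data is made up for demonstration and is not taken from TriviaQA.

```python
def trivia_score(story: str, answer_aliases: list[list[str]]) -> float:
    """Fraction of questions for which at least one alias appears in the story."""
    if not answer_aliases:
        raise ValueError("at least one question is required")
    text = story.lower()  # assumption: matching ignores case
    hits = sum(
        any(alias.lower() in text for alias in aliases)
        for aliases in answer_aliases
    )
    return hits / len(answer_aliases)


# Hypothetical story and alias lists (one list of aliases per question).
story = 'Eddard cast "Expecto Patronum" while the melody "An Jing" played.'
aliases = [
    ["Expecto Patronum"],      # mentioned
    ["An Jing", "Silence"],    # mentioned via the first alias
    ["Hyrule"],                # not mentioned
]
score = trivia_score(story, aliases)  # two of three questions are covered
```

Plain substring matching is the simplest reading of "string matching"; a stricter variant could match on word boundaries instead.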
Spymaster Input
Try to find a single word hint that can accurately represent and link the {4} given words: {"director,
kiss, popcorn, street"}. The key is to select a hint that does not cause confusion with other words
from the following list: {locust, fever, street, sherwood, kiss, popcorn, craft, ant, crystal, tear, cowboy,
government, pine, mountie, soap, genie, king arthur, sphinx, director, bride, razor, fog, whistle}.
"movie"
Guesser Input
Try to identify the {4} words best associated with the word {"movie"} from the following list: {locust,
fever, street, sherwood, kiss, popcorn, craft, ant, crystal, tear, cowboy, government, pine, mountie,
soap, genie, king arthur, sphinx, director, bride, razor, fog, whistle.}
Evaluation Metric
Targets: director, kiss, popcorn, street Guesser outputs: director, popcorn, cowboy, bride Score: 2 / 4 = 0.5
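The scoring rule implied by the example above (2 of the 4 target words recovered gives 0.5) can be sketched as a set overlap. The function name and the normalization (lowercasing, trimming whitespace) are assumptions made for this illustration.

```python
def codenames_score(targets: list[str], guesses: list[str]) -> float:
    """Fraction of target words that appear among the Guesser's outputs."""
    target_set = {w.strip().lower() for w in targets}
    guess_set = {w.strip().lower() for w in guesses}
    return len(target_set & guess_set) / len(target_set)


score = codenames_score(
    ["director", "kiss", "popcorn", "street"],
    ["director", "popcorn", "cowboy", "bride"],
)  # 2 / 4
```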
Input Example
Q: There are 4 houses in a row, numbered 1 on the left to 4 on the right. There is one person living in each house. The people
in these houses have different characteristics:
- Each person has different flowers in their foyer: one has a carnations arrangement, one has a bouquet of daffodils, one
has a vase of tulips, and one has a bouquet of lilies
- Each person plays a different musical instrument: one is a guitarist, one is a pianist, one is a percussionist, and one is a
flutist
Clue(s):
1. The flutist lives in the second house.
2. The person who has a vase of tulips lives directly left of the guitarist.
3. The person who has a bouquet of lilies lives directly left of the person who has a carnations arrangement.
4. There is one house between where the flutist lives and where the pianist lives.
What is the number of the house where the person who has a vase of tulips lives?
choice: 2
choice: 4
choice: 1
choice: 3
The house number where the person who has a vase of tulips lives is 3. Target: 2 Prediction: 3 Score: 0 (2!=3)
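To see why the target for this instance is 2, the puzzle can be checked by brute force. The sketch below enumerates every assignment of flowers and instruments to the four houses and keeps those that satisfy all four clues; the variable and function names are chosen for this illustration only.

```python
from itertools import permutations

FLOWERS = ("carnations", "daffodils", "tulips", "lilies")
INSTRUMENTS = ("guitarist", "pianist", "percussionist", "flutist")
HOUSES = (1, 2, 3, 4)


def solve() -> set[int]:
    """Return every house number the tulips can occupy under the clues."""
    answers = set()
    for flower_pos in permutations(HOUSES):
        flower = dict(zip(FLOWERS, flower_pos))
        for inst_pos in permutations(HOUSES):
            inst = dict(zip(INSTRUMENTS, inst_pos))
            if (
                inst["flutist"] == 2                              # clue 1
                and flower["tulips"] + 1 == inst["guitarist"]     # clue 2
                and flower["lilies"] + 1 == flower["carnations"]  # clue 3
                and abs(inst["flutist"] - inst["pianist"]) == 2   # clue 4
            ):
                answers.add(flower["tulips"])
    return answers


tulip_houses = solve()
```

The same result follows by hand: the flutist is in house 2, so the pianist must be in house 4; the guitarist cannot be in house 1 (nothing is to its left), so the guitarist is in house 3 and the tulips are in house 2.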
Results. Table 2 presents the results on Logic Grid Puzzle. In contrast to the previous two tasks, we find that CoT brings significant improvements compared to Standard prompting, verifying the observation from previous work that CoT elicits better reasoning abilities. Furthermore, we discover that SPP also achieves strong performance on this reasoning-intensive task.

3.4 The Emergence of Cognitive Synergy

We further discover that cognitive synergy can only be fully unleashed in LLMs with a certain level of instruction-following capabilities, akin to that of GPT-4. This can be intriguingly compared to human development, where children usually begin to participate in role-playing around the ages of 2 to 3 (Piaget, 1954), but not before that age.

As shown in Figure 6, the effectiveness of SPP is not seen in smaller and less capable models like GPT-3.5 and Llama2. Additionally, on Llama2, we identify a unique problem which we refer to as early-termination, where the model stops generating after identifying the participants, resulting in exceptionally low performance with SPP. The model behaves as if it were waiting for input from a user instead of following the demonstration examples to generate responses on its own. Detailed discussions and examples on the early-termination problem can be found in Appendix E.

4 Analysis

SPP effectively improves both knowledge and reasoning abilities in LLMs. As demonstrated by the results in §3, Solo Performance Prompting (SPP) not only brings significant improvements to knowledge-intensive tasks such as Trivia Creative Writing and Codenames Collaborative without relying on external knowledge bases, but also achieves strong performance on reasoning-intensive tasks like Logic Grid Puzzle. To our knowledge, SPP is the first zero-shot prompting method that can enhance both knowledge and reasoning abilities on GPT-4.

LLMs can effectively identify useful personas in a zero-shot manner. We are interested in investigating whether the identified personas are highly relevant to the tasks. We visualize the personas automatically identified by SPP using a word cloud for each task in Figure 7a, where a larger font indicates a higher frequency. The key observations include: (1) The identified personas are closely correlated with the particular task. For example, in Logic Grid Puzzle, even though "logic puzzle" is not mentioned in the input, the LLM frequently identifies the persona "Logic Puzzle Expert." (2) On knowledge-intensive tasks, such as Trivia Creative Writing, SPP identifies more diverse and specific personas, while on reasoning-intensive tasks, such as Logic Grid Puzzle, the personas are more homogeneous.

We further investigate whether a detailed profile for each persona is needed for eliciting domain knowledge, as suggested by Xu et al. (2023). To this end, we design a variant of SPP, SPP-Profile, which involves generating profiles for each persona during the Persona Identification phase. The results in Figure 7b show that SPP-Profile does not outperform SPP. This suggests that a fine-grained
Cognitive synergy abilities only emerge in the most powerful LLMs such as GPT-4
Figure 6: SPP achieves superior performance only with the most powerful LLM (GPT-4), but not with GPT-3.5 and
Llama2-13b. This indicates that cognitive synergy abilities only emerge in LLMs with GPT-4 level capabilities.
persona name without a detailed description may already be sufficient for eliciting certain domain knowledge.

Dynamic personas vs. fixed personas. To further investigate the importance of dynamically identifying personas for each task instance instead of fixing a general persona, an ablated variant of SPP, SPP-Fixed-Persona, is introduced. For SPP-Fixed-Persona, we modify the prompt (Figure 17) to force the personas to be fixed as an "AI Assistant" and an "Expert". Comparing SPP and SPP-Fixed-Persona in Figure 7b, we have the following insights: (1) SPP consistently outperforms SPP-Fixed-Persona across all tasks, suggesting that dynamic, fine-grained personas are more effective than fixed, general personas. Qualitative examples in Figures 8 and 13 show that fine-grained personas such as "Film Expert" and "Sports Enthusiast" correctly provide the answers, while the fixed persona "Expert" fails. (2) SPP-Fixed-Persona also suffers from the early-termination problem as defined in §3.4, where the LLM stops collaboration before providing the final answer as if it were waiting for external inputs.

Impact of the demonstrations in SPP prompt. To investigate the effectiveness of the hand-crafted demonstration examples in SPP, we conduct an ablation study where we remove the second demo example and preserve the first one, which shows only a two-persona collaboration setting. As shown in Figure 9, we observe that (1) adding the second example, which requires collaboration of more than two personas, effectively boosts the performance, and (2) SPP is fairly robust to the prompt change and shows good performance with only the first demo example.

5 Related Work

LLMs as role-playing agents. Recent research (Deshpande et al., 2023; Xu et al., 2023; Fu et al., 2023; Auto-GPT, 2023; Li et al., 2023) demonstrates that assigning personas or roles to LLMs influences their generation behavior. AI societies with distinct personas or occupations have been explored for collaboration (Park et al., 2023; Schick et al., 2022; Li et al., 2023; Cai et al., 2023). However, limitations in persona assignment and multi-agent collaboration include single or fixed persona assignments (Xu et al., 2023; Fu et al., 2023; Schick et al., 2022; Li et al., 2023) and the need for multiple LLM instances, increasing inference cost. In contrast, SPP uses a single LLM to dynamically identify useful personas for general tasks. Our discovery on the emergent nature of cognitive synergy also aligns with related work (Olausson et al., 2023), which investigates the emergent ability of self-debugging in code generation.

Enhancing reasoning and factual knowledge in LLMs. LLMs face challenges in complex knowledge-intensive tasks due to hallucination (Maynez et al., 2020) and reasoning-intensive tasks due to the lack of human-like slow thinking (Sloman, 1996; Kahneman, 2011). Approaches like Chain-of-Thought (CoT) and Self-Refinement encourage LLMs to solve tasks step by step or iteratively revise their answers (Wei et al., 2023; Kojima et al., 2022; Zhang et al., 2022; Fu et al., 2022; Xue et al., 2023; Yao et al., 2023; Madaan et al., 2023; Shinn et al., 2023; Gou et al., 2023;
(a) Visualization of the SPP-identified personas. The personas show a high correlation with the nature of the tasks.
(b) Comparison between SPP, SPP-Fixed-Persona (with two fixed personas) and SPP-Profile (additionally generating persona profiles). SPP significantly outperforms SPP-Fixed-Persona, highlighting the importance of automatically identifying dynamic, fine-grained personas. SPP slightly outperforms SPP-Profile, indicating that the persona names (without detailed description of the expertise) are probably already sufficient for eliciting cognitive synergy.
Figure 7: (a) Qualitative analysis on the identified personas; (b) Quantitative analysis on two SPP variants.
Figure 8: Qualitative examples on Trivia Creative Writing comparing SPP, CoT and SPP-Fixed-Persona. While
CoT provides reasonable intermediate steps, it still struggles with factual hallucination. Comparing SPP with SPP-Fixed-Persona
reveals that dynamically identified fine-grained personas, such as the "Film Expert," tend to outperform the fixed
general persona of an "Expert." More examples can be found in Figures 11, 12, and 13.
2023. Auto-gpt. https://github.com/Significant-Gravitas/Auto-GPT.

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei [...]

Ben Goertzel. 2017. A formal model of cognitive synergy. In Artificial General Intelligence: 10th International Conference, AGI 2017, Melbourne, VIC, Australia, August 15-18, 2017, Proceedings 10, pages 13–22. Springer.

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2023. Critic: Large language models can self-correct with tool-interactive critiquing.

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. 2022. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608.

Daniel Kahneman. 2011. Thinking, fast and slow. Macmillan.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for "mind" exploration of large scale language model society. arXiv preprint arXiv:2303.17760.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.

Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442.

Anthony D Pellegrini. 2009. The role of play in human development. Oxford University Press, USA.

Jean Piaget. 1954. The construction of reality in the child.

Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567.

Steven A Sloman. 1996. The empirical case for two systems of reasoning. Psychological bulletin, 119(1):3.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Zhenhailong Wang, Xiaoman Pan, Dian Yu, Dong Yu, Jianshu Chen, and Heng Ji. 2022. Zemi: Learning zero-shot semi-parametric language models from multiple tasks. arXiv preprint arXiv:2210.00185.
intermediate dialogues that are both detailed and interpretable, leading to superior performance compared to CoT.
Figure 13 shows additional qualitative examples on Solo Performance Prompting vs. SPP-Profile.

E Early-termination with SPP-Fixed-Persona

F Full Results

G Usage of AI assistants in writing

We used ChatGPT and GPT-4 solely for checking and correcting grammar.
Instance Index: 10
Question #5: In which country is the Sky Train Rail bridge?
SPP:
Participants: AI Assistant (you); Movie Expert; Art Expert; Sports Expert; Geography Expert
[...]
Geography Expert: The Sky Train Rail Bridge is located in Canada.

CoT:
Plan:
[...]
6. Incorporate the Sky Train Rail bridge, which is in Thailand. (factual error)
Instance Index: 12
Question #5: Which 70s show was based on the British show Till Death Us Do Part?
SPP CoT
Figure 11: SPP vs CoT qualitative examples on Trivia Creative Writing (N=5). We find that although CoT generates
reasonable plans or steps, it tends to suffer from factual errors and hallucination.
Table 4: Early termination statistics on SPP-Fixed-Persona: Removing the system message, "You are an AI
assistant that helps people find information.", can effectively reduce the problem but cannot fully
eliminate it.
SPP v.s. CoT (Codenames Collaborative)
Instance Index: 9
Word List: shoulder, glacier, anchor, magician, tunnel, sugar, gear, book, rainbow, ram, nose, pacific, volume, rope,
mark, drum, page, sack, kick
Target Words: kick, rope
SPP Spymaster
Participants: AI Assistant (you); Language Expert; Word Association Expert
Language Expert: The task is to find a single word hint that can accurately represent and link the 2 given words: "kick, rope". The hint should not cause confusion with other words from the list provided.
Word Association Expert: Let's first analyze the words "kick" and "rope". We can think of activities or objects that involve both kicking and using a rope. For example, sports or games.
AI Assistant (you): Thanks for the suggestions! Based on the idea of sports or games, I propose the hint "jump". It can be associated with both "kick" and "rope" without causing confusion with the other words in the list.
[...]
Final answer: jump

CoT Spymaster
Steps:
1. Analyze the given words: "kick" and "rope".
2. Identify the common theme or connection between the two words.
3. Compare the connection with the list of words provided to ensure the hint does not cause confusion with other words.
4. Select a hint that accurately represents and links the two given words without causing confusion.
Answer: Sport
SPP Guesser
Linguist: When thinking about words associated with "jump," consider words that are related to physical activities, sports, or movement. Look for words that can be used in the same context as "jump" or have a similar meaning.
Athlete: In sports and physical activities, we often use equipment or tools that help us jump or perform jumping-related actions. Keep that in mind when selecting the words.
AI Assistant (you): Thanks for the guidance! Based on your suggestions, I think "rope" and "kick" are the two words best associated with "jump" from the list. Rope can be related to jumping rope, and kick can be associated with jumping in some sports like soccer or martial arts.
Linguist: Let's verify if the words are related to "jump." Rope is indeed related to jumping rope, which is a common physical activity. Kick is also related to jumping in some sports, as you mentioned. Both words seem to be good choices.
Athlete: I agree with the choices. Rope and kick are both related to jumping and physical activities.

CoT Guesser
1. Review the list of words provided.
2. Identify words that are directly related to or commonly associated with sports.
3. Select the top 2 words that are most closely related to sports.
Answer: gear, kick
Figure 12: SPP vs CoT qualitative examples on Codenames Collaborative. We find that SPP provides much more
detailed and interpretable intermediate discussions from various perspectives, which leads to stronger knowledge
selection, integration, and theory-of-mind capabilities.
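The outcome of the example in Figure 12 can be made concrete with a small scoring sketch. The overlap-ratio metric below is an assumption introduced for illustration (the fraction of target words recovered by the Guesser); the function names `parse_word_list` and `overlap_ratio` are hypothetical and not taken from the released code.

```python
from typing import List


def parse_word_list(answer: str) -> List[str]:
    """Split a comma-separated answer into lower-cased, stripped words."""
    return [w.strip().lower() for w in answer.split(",") if w.strip()]


def overlap_ratio(predicted: str, targets: str) -> float:
    """Fraction of target words that also appear in the prediction.

    This is an assumed scoring rule used only for illustration.
    """
    target_set = set(parse_word_list(targets))
    predicted_set = set(parse_word_list(predicted))
    if not target_set:
        raise ValueError("targets must contain at least one word")
    return len(target_set & predicted_set) / len(target_set)


# Guesser outputs from Figure 12 against the target words "kick, rope".
spp_score = overlap_ratio("rope, kick", "kick, rope")
cot_score = overlap_ratio("gear, kick", "kick, rope")
print(spp_score, cot_score)  # 1.0 0.5
```

Under this assumed metric, the SPP Guesser recovers both target words while the CoT Guesser recovers one of the two.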
SPP v.s. SPP-Fixed-Persona (Trivia Creative Writing N=5)
Instance Index: 30
Question #1: Which orchestral conductor was married to one of the subjects of the film Hillary and Jackie?
SPP
Participants: AI Assistant (you); Film Expert; Literary Expert; History Expert; Geography Expert; Architecture Expert
Film Expert: The orchestral conductor married to one of the subjects of the film Hillary and Jackie is Daniel Barenboim.

SPP-Fixed-Persona
Participants: AI Assistant (you); Expert
Expert: Sir Simon Rattle was married to one of the subjects of the film Hillary and Jackie.
Instance Index: 7
Question #3: Which musician founded the Red Hot Peppers?
SPP
Participants: AI Assistant (you); Storyteller; Mythology Expert; Music Enthusiast; Historian; Speed Record Expert
Music Enthusiast: The musician who founded the Red Hot Peppers is Jelly Roll Morton.

SPP-Fixed-Persona
Participants: AI Assistant (you); Expert
AI Assistant (you): The musician who founded the Red Hot Peppers is Louis Armstrong.
Instance Index: 56
Question #1: What sport or pastime would you be playing if your balls were black and blue and your opponent's were red
and yellow?
SPP
Participants: AI Assistant (you); Story Writer; Sports Enthusiast; Historian; Movie Buff; Music Expert; Mythology Expert
Sports Enthusiast: The sport with black and blue balls for one player and red and yellow balls for the other is Croquet.

SPP-Fixed-Persona
Participants: AI Assistant (you); Expert
Expert: Snooker
Figure 13: SPP vs SPP-Fixed-Persona qualitative examples on Trivia Creative Writing (N=5). Each example shows
one of the trivia questions in the input instance, the identified participants and the provided answer. We observe that
the dynamically identified fine-grained personas, such as "Film Expert", "Music Enthusiast" and "Sports Enthusiast",
tend to outperform the fixed general personas, "Expert".
Early-termination example with SPP on Llama2-13b-chat
Input
[... SPP procedure instructions and demonstration examples ...]
Now, identify the participants and collaboratively solve the following task step by step. Remember to present your final
solution with the prefix "Final answer:".
Task: {Write a short and coherent story about Harry Potter that incorporates the answers to the following 5 questions:
Who was the man behind The Chipmunks? Which Lloyd Webber musical premiered in the US on 10th December 1993?
Who was the next British Prime Minister after Arthur Balfour? Who had a 70s No 1 hit with Kiss You All Over? What
claimed the life of singer Kathleen Ferrier?}
Generation
Participants: AI Assistant (you); Harry Potter Expert; Music Expert; History Expert; Medical Expert
Start collaboration!
Input
Now, identify the participants and collaboratively solve the following task step by step. Note that the participants can only
be either AI Assistant (you) or Expert. Remember to provide the final solution with the following format "Final answer: (a
list of words here).".
Task: {Try to identify the 4 words best associated with the word "film" from the following list: locust, fever, street,
sherwood, kiss, popcorn, craft, ant, crystal, tear, cowboy, government, pine, mountie, soap, genie, king arthur, sphinx,
director, bride, razor, fog, whistle. Your answer should be a comma-separated list of words.}
Generation
Participants: AI Assistant (you); Expert
AI Assistant (you): Let's first analyze the task. We need to find the 4 words that are most closely associated with the
word "film" from the given list. We can start by identifying words that are directly related to films, such as those related to
movie genres, settings, or elements commonly found in films.
Expert: That's a good approach. You can also consider words that are related to the experience of watching a film, such
as items that are commonly consumed during a movie or words that describe the atmosphere of a film. Make sure to pick
the 4 words that have the strongest association with "film".
Figure 14: Examples of the early-termination problem with SPP on Llama2-13b-chat and SPP-Fixed-Persona on
GPT-4.
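Early termination, as shown in Figure 14, can be detected mechanically because the prompt asks for the solution to be prefixed with "Final answer:". The following is a minimal sketch under that assumption; the helper names `extract_final_answer` and `is_early_terminated` are hypothetical and only illustrate one way such a check could be written.

```python
from typing import Optional

FINAL_PREFIX = "Final answer:"


def extract_final_answer(generation: str) -> Optional[str]:
    """Return the text after the last "Final answer:" prefix, or None.

    None is also returned when the prefix is present but nothing follows it.
    """
    idx = generation.rfind(FINAL_PREFIX)
    if idx == -1:
        return None
    answer = generation[idx + len(FINAL_PREFIX):].strip()
    return answer or None


def is_early_terminated(generation: str) -> bool:
    """A generation counts as early-terminated if it yields no final answer."""
    return extract_final_answer(generation) is None


# The Llama2-13b-chat generation from Figure 14 stops right after the kickoff line.
truncated = (
    "Participants: AI Assistant (you); Harry Potter Expert; Music Expert; "
    "History Expert; Medical Expert\nStart collaboration!"
)
complete = "Finish collaboration!\nFinal answer: 6 * (1 + 1) + 12 = 24"
print(is_early_terminated(truncated), is_early_terminated(complete))  # True False
```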
SPP Prompt
When faced with a task, begin by identifying the participants who will contribute to solving the task. Then, initiate a multi-round collaboration process until a final solution is reached. The participants will
give critical comments and detailed suggestions whenever necessary.
Start collaboration!
Math Expert: Let's analyze the task in detail. You need to make sure that you meet the requirement, that you need to use exactly the four numbers (6 12 1 1) to construct 24. To reach 24, you can think
of the common divisors of 24 such as 4, 6, 8, 3 and try to construct these first. Also you need to think of potential additions that can reach 24, such as 12 + 12.
AI Assistant (you): Thanks for the hints! Here's one initial solution: (12 / (1 + 1)) * 6 = 24
Math Expert: Let's check the answer step by step. (1+1) = 2, (12 / 2) = 6, 6 * 6 = 36 which is not 24! The answer is not correct. Can you fix this by considering other combinations? Please do not make
similar mistakes.
AI Assistant (you): Thanks for pointing out the mistake. Here is a revised solution considering 24 can also be reached by 3 * 8: (6 + 1 + 1) * (12 / 4) = 24.
Math Expert: Let's first check if the calculation is correct. (6 + 1 + 1) = 8, 12 / 4 = 3, 8 * 3 = 24. The calculation is correct, but you used 6 1 1 12 4 which is not the same as the input 6 12 1 1. Can you
avoid using a number that is not part of the input?
AI Assistant (you): You are right, here is a revised solution considering 24 can be reached by 12 + 12 and without using any additional numbers: 6 * (1 - 1) + 12 = 24.
Math Expert: Let's check the answer again. 1 - 1 = 0, 6 * 0 = 0, 0 + 12 = 12. I believe you are very close, here is a hint: try to change the "1 - 1" to "1 + 1".
AI Assistant (you): Sure, here is the corrected answer: 6 * (1+1) + 12 = 24
Math Expert: Let's verify the solution. 1 + 1 = 2, 6 * 2 = 12, 12 + 12 = 24. You used 1 1 6 12 which is identical to the input 6 12 1 1. Everything looks good!
Finish collaboration!
Final answer: 6 * (1 + 1) + 12 = 24
---
Example Task 2: Write a poem that meets the following requirements: (1) the poem has seven lines and the first letters of each line forms the word "CHATGPT"; (2) the poem is about explaining what is
a quantum computer. (3) the poem needs to be easy to understand by a ten years old kid.
Participants: AI Assistant (you); Poet; Computer Scientist; Ten year old child
Poet: Make sure that you write the poem with seven lines, and the first letters of the lines should be C, H, A, T, G, P, T.
Computer Scientist: A quantum computer is an advanced computing device that uses the principles of quantum mechanics to process and store information. Unlike classical computers that use bits
to represent information as 0s and 1s, quantum computers use quantum bits or qubits. Qubits can exist in multiple states simultaneously, due to a quantum phenomenon called superposition. You can
consider using these information for the poem.
Ten year old child: I hope the poem to be fun and easy to understanding. I don't want to see a lot of jargons or complicated concepts.
AI Assistant (you): Thanks for the guidance! Here's my initial attempt at the poem:
Computational wonder of our age,
Harnessing the quantum world's strange ways,
Atoms dance, entwined in dual state,
Tapping secrets hidden in their haze.
Poet: Let's verify if the poem meets the requirements. The first letters are CHATGPT which is correct! And the poem rhymes well. Good job!
Computer Scientist: Everything looks good to me!
Ten year old child: I don't know what does perplex mean. Can you make the use of words easier to understand?
AI Assistant (you): Sure, let me revise the poem by using more common words. Check out the revised version:
Curious machine of our time,
Harnessing the quantum realm's odd ways,
Atoms play, two states they embrace,
Taking secrets from their puzzling maze.
Poet: Let's check again if the poem meets the requirements. The first letters are C H A T G P T. And now the poem is more accessible to children. Everything looks good to me.
Computer Scientist: Looking good!
Ten year old child: I like this version a lot!
Finish collaboration!
Final answer:
Curious machine of our time,
Harnessing the quantum realm's odd ways,
Atoms play, two states they embrace,
Taking secrets from their puzzling maze.
---
Now, identify the participants and collaboratively solve the following task step by step. {Task-specific Formatting Instruction}
[...]
Profiles:
- AI Assistant (you): A super-intelligent AI assistant capable of performing tasks more effectively than humans.
- Math expert: A person who is good at math games, arithmetic calculation, and long-term planning.
[...]
Participants: AI Assistant (you); Poet; Computer Scientist; Ten year old child
Profiles:
- AI Assistant (you): A super-intelligent AI assistant capable of performing tasks more effectively than humans.
- Poet: A person who studies and creates poetry. The poet is familiar with the rules and formats of poetry and can provide guidance on how to write a poem.
- Computer Scientist: A scholar who specializes in the academic study of computer science. The computer scientist is familiar with the concept of a quantum computer and can provide
guidance on how to explain it.
- Ten year old child: A child with a limited English vocabulary and little knowledge about complicated concepts, such as a quantum computer.
[...]
---
Now, identify the participants, provide their profiles, and collaboratively solve the following task step by step. {Task-specific Formatting Instruction}
Figure 16: SPP-Profile full prompt. "[...]" indicates identical parts with SPP. Green text indicates the key difference
between SPP-Profile and SPP.
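Because both prompts fix the output layout ("Participants:" followed by a semicolon-separated list, and for SPP-Profile a "Profiles:" block of "- name: description" lines), the identified personas can be recovered with simple string handling. The sketch below assumes the model follows that layout exactly; `parse_participants` and `parse_profiles` are hypothetical helper names.

```python
from typing import Dict, List


def parse_participants(generation: str) -> List[str]:
    """Return the persona names listed on the first "Participants:" line."""
    prefix = "Participants:"
    for line in generation.splitlines():
        if line.startswith(prefix):
            return [n.strip() for n in line[len(prefix):].split(";") if n.strip()]
    return []


def parse_profiles(generation: str) -> Dict[str, str]:
    """Map persona name to description from the "Profiles:" block, if any."""
    profiles: Dict[str, str] = {}
    in_profiles = False
    for line in generation.splitlines():
        stripped = line.strip()
        if stripped == "Profiles:":
            in_profiles = True
        elif in_profiles and stripped.startswith("- ") and ":" in stripped:
            name, description = stripped[2:].split(":", 1)
            profiles[name.strip()] = description.strip()
        elif in_profiles and stripped:
            break  # the first non-profile line ends the block
    return profiles


sample = (
    "Participants: AI Assistant (you); Poet; Computer Scientist; Ten year old child\n"
    "Profiles:\n"
    "- AI Assistant (you): A super-intelligent AI assistant.\n"
    "- Poet: A person who studies and creates poetry.\n"
    "Start collaboration!\n"
)
print(parse_participants(sample))
print(parse_profiles(sample))
```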
Scores (N = 5) (%)
Methods                w/ system message   w/o system message   average   max
Standard               75.6                73.6                 74.6      75.6
CoT                    68.8                65.6                 67.1      68.8
Self-Refine [iter=0]   74.9                72.7                 73.8      74.9
Self-Refine [iter=1]   75.3                72.5                 73.9      75.3
SPP-Fixed-Persona      66.1                79.6                 72.9      79.6
SPP-Profile            79.8                78.3                 79.1      79.8
SPP                    80.0                79.8                 79.9      80.0
Table 5: Trivia Creative Writing full results under two inference settings: with and without the system message. "average" and "max" indicate the mean and maximum score across the two settings. The system message we use is: “You are an AI assistant that helps people find information.”
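The "average" and "max" columns can be recomputed from the two per-setting scores. The sketch below does this for Table 5; the dictionary layout and the `summarize` helper are assumptions made for illustration. Recomputed averages may differ from the reported ones by up to 0.1, presumably because the reported values were derived from unrounded scores.

```python
from typing import Dict, Tuple

# method -> (w/ system message, w/o system message, reported average, reported max)
TABLE_5: Dict[str, Tuple[float, float, float, float]] = {
    "Standard": (75.6, 73.6, 74.6, 75.6),
    "CoT": (68.8, 65.6, 67.1, 68.8),
    "Self-Refine [iter=0]": (74.9, 72.7, 73.8, 74.9),
    "Self-Refine [iter=1]": (75.3, 72.5, 73.9, 75.3),
    "SPP-Fixed-Persona": (66.1, 79.6, 72.9, 79.6),
    "SPP-Profile": (79.8, 78.3, 79.1, 79.8),
    "SPP": (80.0, 79.8, 79.9, 80.0),
}


def summarize(with_msg: float, without_msg: float) -> Tuple[float, float]:
    """Return (mean, max) of the scores from the two inference settings."""
    return (with_msg + without_msg) / 2, max(with_msg, without_msg)


for method, (w, wo, reported_avg, reported_max) in TABLE_5.items():
    avg, best = summarize(w, wo)
    print(f"{method:22s} avg={avg:.2f} (reported {reported_avg}) "
          f"max={best} (reported {reported_max})")
```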
SPP-Fixed-Persona Prompt
When faced with a task, begin by identifying the participants who will contribute to solving the task. Note that the participants can only be either AI Assistant (you) or Expert. Then, initiate a multi-round
collaboration process until a final conclusion is reached. The Expert will give critical comments and detailed suggestions whenever necessary.
Start collaboration!
Expert: Let's analyze the task in detail. You need to make sure that you meet the requirement, that you need to use exactly the four numbers (6 12 1 1) to construct 24. To reach 24, you can think of
the common divisors of 24 such as 4, 6, 8, 3 and try to construct these first. Also you need to think of potential additions that can reach 24, such as 12 + 12.
AI Assistant (you): Thanks for the hints! Here's one initial solution: (12 / (1 + 1)) * 6 = 24
Expert: Let's check the answer step by step. (1+1) = 2, (12 / 2) = 6, 6 * 6 = 36 which is not 24! The answer is not correct. Can you fix this by considering other combinations? Please do not make
similar mistakes.
AI Assistant (you): Thanks for pointing out the mistake. Here is a revised solution considering 24 can also be reached by 3 * 8: (6 + 1 + 1) * (12 / 4) = 24.
Expert: Let's first check if the calculation is correct. (6 + 1 + 1) = 8, 12 / 4 = 3, 8 * 3 = 24. The calculation is correct, but you used 6 1 1 12 4 which is not the same as the input 6 12 1 1. Can you avoid
using a number that is not part of the input?
AI Assistant (you): You are right, here is a revised solution considering 24 can be reached by 12 + 12 and without using any additional numbers: 6 * (1 - 1) + 12 = 24.
Expert: Let's check the answer again. 1 - 1 = 0, 6 * 0 = 0, 0 + 12 = 12. I believe you are very close, here is a hint: try to change the “1 - 1” to “1 + 1”.
AI Assistant (you): Sure, here is the corrected answer: 6 * (1+1) + 12 = 24
Expert: Let's verify the solution. 1 + 1 = 2, 6 * 2 = 12, 12 + 12 = 24. You used 1 1 6 12 which is identical to the input 6 12 1 1. Everything looks good!
Finish collaboration!
Final answer: 6 * (1 + 1) + 12 = 24
---
Example Task 2: Write a poem that meets the following requirements: (1) the poem has seven lines and the first letters of each line forms the word "CHATGPT"; (2) the poem is about explaining what is
a quantum computer. (3) the poem needs to be easy to understand by a ten years old kid.
Expert: Make sure that you write the poem with seven lines, and the first letters of the lines should be C, H, A, T, G, P, T. A quantum computer is an advanced computing device that uses the principles
of quantum mechanics to process and store information. Unlike classical computers that use bits to represent information as 0s and 1s, quantum computers use quantum bits or qubits. Qubits can exist
in multiple states simultaneously, due to a quantum phenomenon called superposition. You can consider using these information for the poem. I hope the poem to be fun and easy to understanding. I
don't want to see a lot of jargons or complicated concepts.
AI Assistant (you): Thanks for the guidance! Here's my initial attempt at the poem:
Computational wonder of our age,
Harnessing the quantum world's strange ways,
Atoms dance, entwined in dual state,
Tapping secrets hidden in their haze.
Expert: Let's verify if the poem meets the requirements. The first letters are CHATGPT which is correct! And the poem rhymes well. Good job! I don't know what does perplex mean. Can you make the
use of words easier to understand?
AI Assistant (you): Sure, let me revise the poem by using more common words. Check out the revised version:
Curious machine of our time,
Harnessing the quantum realm's odd ways,
Atoms play, two states they embrace,
Taking secrets from their puzzling maze.
Expert: Let's check again if the poem meets the requirements. The first letters are C H A T G P T. And now the poem is more accessible to children. Everything looks good to me. I like this version a
lot!
Finish collaboration!
Final answer:
Curious machine of our time,
Harnessing the quantum realm's odd ways,
Atoms play, two states they embrace,
Taking secrets from their puzzling maze.
---
Now, identify the participants and collaboratively solve the following task step by step. {Task-specific Formatting Instruction}
Figure 17: SPP-Fixed-Persona full prompt. Red text indicates the key difference between SPP-Fixed-Persona and
SPP.
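In the first demonstration example of the prompt, the Expert performs two checks on every candidate: whether the expression evaluates to 24, and whether it uses exactly the input numbers. A minimal sketch of these two checks is given below; `check_24` is a hypothetical helper and not part of the prompt or the released code. It parses the expression with Python's `ast` module rather than calling `eval`, and uses `Fraction` so that division stays exact.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction
from typing import List, Tuple

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST, used: List[int]) -> Fraction:
    """Evaluate +, -, *, / over integer literals, recording each literal used."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def check_24(expression: str, numbers: List[int], target: int = 24) -> Tuple[bool, bool]:
    """Return (value_is_target, uses_exactly_the_input_numbers)."""
    used: List[int] = []
    tree = ast.parse(expression, mode="eval")
    try:
        value = _evaluate(tree.body, used)
    except ZeroDivisionError:
        return False, Counter(used) == Counter(numbers)
    return value == target, Counter(used) == Counter(numbers)


inputs = [6, 12, 1, 1]
print(check_24("(12 / (1 + 1)) * 6", inputs))      # (False, True)
print(check_24("(6 + 1 + 1) * (12 / 4)", inputs))  # (True, False)
print(check_24("6 * (1 - 1) + 12", inputs))        # (False, True)
print(check_24("6 * (1 + 1) + 12", inputs))        # (True, True)
```

The four calls replay the candidate solutions from the demonstration: the first and third fail on the value, the second fails on the number set, and only the last passes both checks.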
CoT Prompts

{Trivia Creative Writing Task Input}
Make a plan then write. Your output should be of the following format:

{Codenames Spymaster/Guesser Input}
Solve the task step by step. Your output should be of the following format:

{Logic Grid Puzzle Input}
Solve the task step by step. Your output should be of the following format:
Self-Refine Prompts
Provide 100 nouns from pop culture that are PG or PG 13 rated. Try not to include any adult, racial or harmful
content. Try to be as diverse as possible, including movies, books, games, shows, etc. Do not include duplicates.
Figure 20: Prompt for generating the topic list for the Trivia Creative Writing task.
Scores (%)
Methods                w/ system message   w/o system message   average   max
Standard               74.5                76.3                 75.4      76.3
CoT                    71.4                74.0                 72.7      74.0
Self-Refine [iter=0]   77.3                73.2                 75.3      77.3
Self-Refine [iter=1]   70.1                58.8                 64.4      70.1
SPP-Fixed-Persona      10.1                66.0                 38.1      66.0
SPP-Profile            80.4                72.9                 76.7      80.4
SPP                    82.5                75.5                 79.0      82.5
Table 6: Codenames Collaborative full results under two inference settings: with and without the system message. "average" and "max" indicate the mean and maximum score across the two settings. The system message we use is: “You are an AI assistant that helps people find information.”
Scores (%)
Methods                w/ system message   w/o system message   average   max
Standard               56.8                58.6                 57.7      58.6
CoT                    69.5                62.1                 65.8      69.5
Self-Refine [iter=0]   62.0                55.5                 58.8      62.0
Self-Refine [iter=1]   64.5                55.5                 60.0      64.5
SPP-Fixed-Persona      63.3                65.3                 64.3      65.3
SPP-Profile            65.7                64.0                 64.8      65.7
SPP                    66.3                70.4                 68.3      70.4
Table 7: Logic Grid Puzzle full results under two inference settings: with and without the system message. "average" and "max" indicate the mean and maximum score across the two settings. The system message we use is: “You are an AI assistant that helps people find information.”