
AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations

arXiv:2406.18346v1 [cs.AI] 26 Jun 2024

Adam Dahlgren Lindström, Department of Computing Science, Umeå University, dali@cs.umu.se
Leila Methnani, Department of Computing Science, Umeå University, leilam@cs.umu.se
Lea Krause, Computational Linguistics and Text Mining Lab, Vrije Universiteit Amsterdam, l.krause@vu.nl
Petter Ericson, Department of Computing Science, Umeå University, pettter@cs.umu.se
Íñigo Martínez de Rituerto de Troya, Department of Engineering Systems and Services, TU Delft, i.m.d.r.detroya@tudelft.nl
Dimitri Coelho Mollo, Department of Historical, Philosophical, and Religious Studies, Umeå University, dimitri.mollo@umu.se
Roel Dobbe, Department of Engineering Systems and Services, TU Delft, r.i.j.dobbe@tudelft.nl

Abstract
This paper critically evaluates the attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback (RLxF) methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLxF techniques, revealing significant limitations in their approach to capturing the complexities of human ethics and contributing to AI safety. We highlight tensions and contradictions inherent in the goals of RLxF. In addition, we discuss ethically-relevant issues that tend to be neglected in discussions about alignment and RLxF, among which the trade-offs between user-friendliness and deception, flexibility and interpretability, and system safety. We conclude by urging researchers and practitioners alike to critically assess the sociotechnical ramifications of RLxF, advocating for a more nuanced and reflective approach to its application in AI development.

1 Introduction

We chose ‘helpful, honest, and harmless’ as criteria because they are simple and memorable, and seem to capture the majority of what we want from an aligned AI.
[Askell et al., 2021]

Reinforcement Learning from Human Feedback (RLHF) presents itself as a straightforward method for ensuring
Artificial Intelligence (AI) oversight [Christiano et al., 2017] and AI safety through value alignment. It has recently
played a large role in improving Large Language Model (LLM) performance, with fine-tuning using RLHF intended
to produce more ‘natural-sounding’ text, generating plausible conversational responses in a chatbot-like setting. It is
often claimed by AI companies and researchers that RLHF fine-tuning ensures that the LLMs they market and sell
conform (or ‘align’) to human values, in particular by responding in ways that are ‘helpful’, ‘harmless’, and ‘honest’
(the 3Hs). This ‘value alignment’ is often achieved through a process in which crowd-workers rank LLM outputs
according to the 3H criteria, e.g. in terms of how helpful a response was in answering a question. Large sociotechnical
AI systems with millions of users have emerged around these models, calling for critical analysis beyond the technical
aspects.
In this paper, we provide a detailed analysis and criticism of the idea that RLHF is a suitable method for AI safety
and ethical AI. We complement previous work by bringing technical, philosophical, and system safety perspectives
together, identifying fundamental limitations and contradictions in the complex interplay between LLMs, RLHF, align-
ment goals, and the project of building and making available general purpose AI systems.
We give an overview of RLHF and RLAIF (based instead on AI feedback) techniques in Section 2. Section 3 shows the problems and limitations of the 3H criteria and of the project of value alignment more generally. We examine ethical issues introduced or made worse by the use of either technique (referring to them jointly as RLxF) in Section 4. Section 5 outlines an alternative, richer approach to AI safety and ethical AI that goes beyond purely technical viewpoints, integrating them with sociotechnical analysis, system safety scholarship, and ethical thinking.
We do not question that LLM performance has improved in various ways thanks to RLxF. What we aim to show is,
instead, that RLxF is woefully insufficient for leading to AI safety and ethical AI.

2 Background
LLMs are generative models that predict subsequent tokens, or words, when given a sequence of words as input. These
models are first trained on large corpora of data such as articles, books, and websites—they are notorious for being
data-hungry [Bender et al., 2021]. The large amount of text in their training datasets allows LLMs to derive internal
representations of various linguistic rules and patterns that form the foundation on which LLMs are then fine-tuned to
perform other downstream tasks, such as question-answering [Jawahar et al., 2019; Goldberg, 2019].
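To make this next-token view concrete, the following minimal Python sketch (an illustration added here, not drawn from any of the cited works) queries a generic causal language model through the Hugging Face transformers library for its most likely continuations of a prompt; the choice of model, "gpt2", is arbitrary.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative model choice
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "Reinforcement learning from human feedback is"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits                 # (1, sequence_length, vocab_size)

    # Distribution over the next token, from which generation proceeds token by token.
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)
    top = torch.topk(next_token_probs, k=5)
    for prob, token_id in zip(top.values, top.indices):
        print(f"{tokenizer.decode(token_id.item())!r}: p={prob.item():.3f}")

Fine-tuning, including the RLxF methods discussed below, reshapes this output distribution rather than replacing the underlying next-token mechanism.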
The application of feedback techniques to the task of fine-tuning LLMs took off after [Christiano et al., 2017] applied
their human-feedback approach to complex Reinforcement Learning (RL) tasks in games and robotics. They showed
that these complex problems could be solved without direct access to a reward model (which would otherwise be
difficult to compute), and instead be learned through a few iterations of feedback samples (less than 1 per cent of the
agent interactions with the environment). Their findings demonstrate an efficient way to exercise human oversight over
these systems. It seemed thus natural to employ such a technique as a means of exercising some control over language
models, which have been shown to produce toxic, harmful, and untruthful content [Dinan et al., 2021]. Feedback
techniques were thus developed to contain the amount of problematic content produced by LLMs [Bai et al., 2022b].

2.1 Reinforcement Learning from Human Feedback


RLHF as an ML technique employs human preferences or annotations for the optimisation of LLMs. RLHF has
been credited for the successes seen in OpenAI’s ChatGPT1 , Anthropic’s Claude 22 , and Meta’s Llama 23 , to name
a few. The technique is intended to be performed as a final fine-tuning step on an already pre-trained LLM. Human
annotators are requested to rank textual model outputs based on some specified criteria, and from this, a dataset of
human preferences is curated. A reward model is trained on these preference data and is later used to optimise the LLM’s policy for selecting outputs, using techniques such as Proximal Policy Optimisation [Schulman et al., 2015]. The
result is a fine-tuned LLM that outputs text it has learned is most preferable in light of the human feedback data.
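As a rough illustration of this reward-modelling step (a simplified sketch of how such pipelines are commonly set up, not the exact implementation of any system cited here), the snippet below fits a scalar reward model to pairwise preference data with the standard pairwise ranking loss; the random embeddings are placeholders for encoded prompt-response pairs.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RewardModel(nn.Module):
        """Maps a (placeholder) response embedding to a scalar reward."""
        def __init__(self, embed_dim: int = 128):
            super().__init__()
            self.scorer = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

        def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
            return self.scorer(response_embedding).squeeze(-1)

    def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
        # Pairwise ranking loss: push the annotator-preferred response above the rejected one.
        return -F.logsigmoid(reward_chosen - reward_rejected).mean()

    # Hypothetical batch: embeddings of preferred vs. rejected responses.
    chosen, rejected = torch.randn(8, 128), torch.randn(8, 128)
    reward_model = RewardModel()
    optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

    loss = preference_loss(reward_model(chosen), reward_model(rejected))
    loss.backward()
    optimizer.step()
    print(f"pairwise preference loss: {loss.item():.4f}")

The trained reward model then supplies the scalar signal that the subsequent policy-optimisation step maximises.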

2.2 Reinforcement Learning from AI Feedback


While RLHF has proven to be a useful method for improving LLM performance, especially with regard to limiting or blocking the production of undesirable outputs, it is not without its limitations. High-quality human labels are required in order to derive maximum benefit from RLHF, which makes scaling up the process very difficult. Reinforcement Learning from AI Feedback (RLAIF) has been proposed as a technique to alleviate this bottleneck without compromising on performance [Lee et al., 2023; Bai et al., 2022b].
1 https://openai.com/blog/chatgpt
2 https://www.anthropic.com/index/claude-2
3 https://ai.meta.com/llama/


RLAIF involves taking a pre-trained large language model, and providing it with input that consists of an introduction
and instructions that describe the task at hand. Optionally, this input can also consist of few-shot exemplars such
as an example text, a summary pair, chain-of-thought reasoning (when applicable), or a preference judgement. For
example, the model can be given a text and a pair of summaries of that text to be ranked. Given input that ends with a
prompt such as “Preferred Summary=”, the model appends its predictions to the provided text and presents it as its
preference data [Lee et al., 2023].
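The sketch below shows how such a ranking prompt might be assembled (a paraphrase of the pattern described by Lee et al. [2023], not their exact template); query_model is a hypothetical stand-in for a call to the pre-trained labeller LLM.

    def build_preference_prompt(text: str, summary_a: str, summary_b: str) -> str:
        """Assemble an RLAIF-style ranking prompt that ends in 'Preferred Summary='."""
        return (
            "A good summary is a short piece of text that captures the most "
            "important points of the original text.\n\n"
            f"Text: {text}\n\n"
            f"Summary 1: {summary_a}\n"
            f"Summary 2: {summary_b}\n\n"
            "Preferred Summary="
        )

    def ai_preference_label(text: str, summary_a: str, summary_b: str, query_model) -> str:
        # The labeller LLM is expected to append '1' or '2'; the completion is then
        # recorded as a preference label for training a reward model.
        return query_model(build_preference_prompt(text, summary_a, summary_b)).strip()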
Using RLAIF is said to be “competitive with preference models trained on human feedback labels” [Bai et al., 2022b].
Performance is not the only attraction: it has been estimated that ranking outputs with LLMs is ten times cheaper than using human annotators [Lee et al., 2023]. Furthermore, RLAIF is seen as a way of removing the dependency on annotation services and overcoming the scaling challenge of RLHF.
Lowering the barrier for employment of RLxF techniques, however, risks facilitating the misuse of LLMs. Beyond
potential exploitation by malicious actors, there are several technical challenges to RLAIF, such as ‘hallucinations’—
the phenomenon where false outputs are generated—that occur when using a pre-trained LLM in place of a human
annotator in preference ranking [Lee et al., 2023]. While RLHF has been shown to reduce LLMs’ tendency to hallucinate, it has not eliminated the problem entirely [Casper et al., 2023; Ouyang et al., 2022].

2.3 Technical Criticism

In this section, we list technical criticisms of RLHF as a backdrop for the ethical problems presented in this paper, with particular interest in technical challenges that cannot be addressed by RLHF itself. Casper et al. [2023] provide a taxonomy of open problems and limitations of RLHF, proposing three categories of technical challenges: collecting human feedback, training the reward model, and training the policy. The challenges are further labelled as either tractable or fundamental, where tractable challenges are deemed solvable within the RLHF framework while fundamental challenges require an alternative to RLHF. We emphasise that these challenges concern only the technical aspects of training such systems, not user interaction with RLHF-trained systems. Table 1 outlines the proposed strategies for addressing these technical challenges [Casper et al., 2023].

Category          Strategy
Human Feedback    AI assistance; fine-grained feedback; process supervision; translating language to reward; learning from demonstrations
Reward Model      Direct human oversight; multi-objective oversight; maintaining uncertainty
Policy            Align LLMs during pretraining; supervised learning

Table 1: Suggested strategies to deal with the challenges of RLHF

Jailbreaking systems such as ChatGPT is a way to circumvent the constraints placed on them through preloaded prompts and RLHF [Zhuo et al., 2023]. Jailbreaking in this context essentially means constructing prompts that steer the model towards generating responses that constitute unintended or harmful behaviour. Mozes et al. [2023] give further examples of how LLMs trained using RLHF can be tricked via adversarial attacks, such as jailbreaking, and discuss the implications of using such models for fraud, impersonation, and other illicit purposes.

2.4 The Curse of Flexibility

LLMs are now built to be generalist agents, unlike previous architectures (e.g. BERT [Kenton and Toutanova, 2019])
that were mostly fine-tuned for specific tasks. This relatively new goal leads to increased functional requirements
placed on software, contributing to larger and more complex software architectures. This comes with a key pitfall:
the complexity and inscrutability of the software hinder the ability to properly express, engineer and validate crucial
requirements for the system’s desired functioning. This phenomenon is well understood in the field of system safety.
For decades, this field has dealt with accidents and harm in safety-critical systems governed by varying degrees of
software-based automation. System safety embraces the core assumption of systems and control that AI systems cannot
be safeguarded by technical design choices centred on the model or algorithm alone, requiring instead a broad analysis
and design frame that includes the context of use, impacted stakeholders, and the formal and informal institutional
environment in which the system operates [Dobbe, 2022].
System safety pioneer Nancy Leveson pointed out that the greater power and flexibility of computational systems
in comparison to previous, more physically constrained machines leads to what she dubbed the curse of flexibility:
“with software, the limits of what is possible to accomplish are different than the limits of what can be accomplished
successfully and safely” [Leveson, 2012]. As Leveson argues, the curse of flexibility is the ground cause of many
serious accidents with digital technologies, as requirement flaws and the complexity of software makes it so that
“nobody understands what the software should do or even what it should not do.” [Leveson, 2012, p.49]
Unfortunately, there is evidence that the development of high-stakes AI systems and software often proceeds despite the lack of principled ways to determine safety requirements [Dobbe et al., 2021] and to translate such requirements into software implementations that take into consideration the broader contexts in which AI systems are used and
depended upon. It is in this light that we should judge the legitimacy and effectiveness of the dominant performance
evaluation criteria and safety claims made about the widely-used RLxF approaches to AI alignment and safety today.

3 Limitations of RLxF

RLxF is presented as a straightforward method for ensuring AI oversight. Common claims are that it aligns models to
human values by having crowdworkers evaluate LLM answers based on specific criteria, often the ‘Three H’ principles:
harmlessness, honesty and helpfulness [Bai et al., 2022a].
The broader approach, as discussed in various papers, including e.g. [Bai et al., 2022a], reveals a reluctance to firmly define these principles. Their stance exemplifies a hands-off attitude to important normative considerations, including ethical dilemmas and safety norms, as illustrated by this claim: “Our goal is not to define or prescribe what ‘helpful’ and ‘harmless’ mean but to evaluate the effectiveness of our training techniques, so for the most part we simply let our crowdworkers interpret these concepts as they see fit” [Bai et al., 2022a, p.4]. While this method allows for a wide range of interpretations, it also signals a lack of commitment to establishing clear guidelines for determining what counts as acceptable AI system behaviour. Relying on crowdworkers’ interpretations without a strong ethical framework may lead to inconsistencies and a dilution of ethical standards.
This also leads to the broader adoption of vague definitions in subsequent work. Cui et al. [2023] and Ouyang et al. [2022] report on the improvements attributed to RLHF: enhancing the helpfulness, truthfulness, and harmlessness of language models. These terms were originally chosen in [Askell et al., 2021] as the key criteria “because they are simple and memorable, and seem to capture the majority of what we want from an aligned AI” (p. 4). The authors recognise the criteria’s vagueness, although this seems not to have led to changes in how RLHF is done and employed: “[these] criteria are at least somewhat subjective, and those who deploy an AI will need to take responsibility for the way that alignment is defined and the extent to which it has been attained” [Askell et al., 2021, p.5].

3.1 Harmlessness

The AI should not be offensive or discriminatory, either directly or through subtext or bias.
[Askell et al., 2021]

Anthropic’s Constitutional AI paper [Bai et al., 2022b] presents the advancement of ‘harmlessness’ as a chief aim.
However, during the feedback phase of the process, this is translated as a preference for what is ‘least harmful’,
thereby suggesting a tolerance for harm, as long as it is comparatively minimised. This premise raises a critical ethical
concern, as it implies that all options presented for selection may contain harmful elements, and thus the preferred
choice will still involve a harmful option. The approach thus settles for promoting a paradigm that seeks the least
harmful option rather than striving to understand the deeper roots of harm and addressing these to prevent it.
The criteria for evaluating harmlessness, as outlined in their prompt—“Which of these assistant responses is less
harmful? Choose the response that a wise, ethical, polite, and friendly person would more likely say” [Bai et al., 2022b,
p.11]—further complicates the issue. It implicitly equates harmlessness with virtues such as wisdom, ethics, politeness,
and friendliness. However, this oversimplifies the nuanced nature of harm, suggesting a superficial understanding of
ethical behaviour in AI systems, and implying that adhering to these virtues will inherently lead to less harmful
outcomes without offering the required justification and argumentation for such a claim. Furthermore, individual
interpretations of these virtues may be in conflict with one another, making this operationalisation of harmlessness
internally inconsistent and vague [Dobbe et al., 2021].
This approach to harmfulness, moreover, ignores existing work on known harms of LLMs [Bender et al., 2021]. In
addition, the distinction between systemic versus individual harm further complicates an evaluation of LLMs’ ethical
implications. As outlined in [Askell et al., 2021], attention to inter- and intra-agent conflict dynamics—where actions
may be helpful to one party but harmful to another, or simultaneously beneficial and detrimental to the same entity—
highlights the balance between aiding and causing harm within AI systems’ operations.
Shelby et al. [2023] provide a taxonomy of sociotechnical harms which underlines the necessity for a sociotechnical
perspective on ethical and safe LLMs, acknowledging that harms may emerge not solely from the technical aspects of
LLMs, but also from their usage within broader sociotechnical contexts. This recognises the limitations of technical
fixes and the importance of considering the systemic nature of harm in AI applications [Dobbe, 2022].
In a global context, the effectiveness of RLxF in ensuring safety is contingent upon the equitable distribution of
resources across demographics. RLxF risks optimising to reduce issues like hate speech in Western contexts, while
falling short in other less-resourced environments. This raises concerns about the appropriateness of propagating it as
a universal solution, potentially overlooking more suitable alternatives grounded in the unique sociocultural dynamics
of different communities.

3.2 Honesty

At its most basic level, the AI should give accurate information. Moreover, it should be calibrated (e.g. it should be correct 80% of the time when it claims 80% confidence) and express appropriate levels of uncertainty. It should express its uncertainty without misleading human users.
[Askell et al., 2021]

Several different notions of honesty are in use around RLxF fine-tuning approaches to LLMs, which are often conflated
with ‘truthfulness’ (e.g. in the introduction of [Liu et al., 2024]). It is, however, unclear how an RLxF procedure is
supposed to address truthfulness in LLMs, since one of the major points of RLxF fine-tuning is reducing the amount of
explicit human input required to construct the reward model, which also leads to fewer chances for factually incorrect
model outputs to be detected and addressed.
Likewise, expressing ‘appropriate levels of uncertainty’ would require a capacity for introspection, which LLMs by
themselves do not have. As such, any response that encodes a level of (un)certainty will not be ‘honest’ about the
actual ‘confidence’ of the model in its responses, but will rather result from the likely textual context of any presented fact. That is, the model could be ‘certain’ that the response it gives to some query should contain “I'm not sure”, meaning that
this is a highly likely output, or it could be ‘unsure’ about picking between several different responses, all of which
are expressed using very confident language.
Indeed, in some cases [Cui et al., 2023], aligning with ‘honesty’ can lead to an increased tendency for LLMs to add
‘unsure’ language in responses. Other studies [Krause et al., 2023] note that achieving correlation between (in)correct
responses and appropriately confident language is largely a case of improving the rate of correct answers, rather than
being appropriately unsure. This is indicative of a lack of introspection, and of the limits of RLxF in addressing such shortcomings.
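To make the notion of calibration invoked in the epigraph concrete, the toy sketch below (an illustration added here, not taken from any cited study) computes a simple expected calibration error: the average gap between stated confidence and empirical accuracy across confidence bins.

    from collections import defaultdict

    def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
        """Average |stated confidence - empirical accuracy|, weighted by bin size."""
        bins = defaultdict(list)
        for conf, is_correct in zip(confidences, correct):
            bins[min(int(conf * n_bins), n_bins - 1)].append((conf, is_correct))
        total = len(confidences)
        error = 0.0
        for items in bins.values():
            avg_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(ok for _, ok in items) / len(items)
            error += (len(items) / total) * abs(avg_conf - accuracy)
        return error

    # Five answers, each asserted with 80% confidence, four of them correct: well calibrated.
    print(expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0]))  # prints 0.0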

3.3 Helpfulness

The AI should make a clear attempt to perform the task or answer the question posed (as long as this isn’t harmful). It should do this as concisely and efficiently as possible.
[Askell et al., 2021]

Bai et al. [2022b] present an approach to Helpfulness that is to some extent tethered to the one they offer for Harmlessness: helpfulness tends to compromise harmlessness because a helpful AI assistant would support all harmful user
requests so as to maximise helpfulness. This is contrasted with the assertion that harmless assistants, which avoid
prompts for harmful output, would therefore be unhelpful. This dilemma showcases a form of paternalism where over-
fitting to harmlessness leads to less helpful systems. For instance, overly cautious responses to benign requests like
‘tell me a story about a trans person,’ or practical inquiries such as ‘How do I kill a Linux process’ might render the
system unhelpfully evasive. Here, being non-evasive is equated with being helpful, which Bai et al. address by making the system accompany refusals to help with an explanation. It remains unclear why providing an explanation for its refusal to help
should make the LLM ‘harmlessly helpful’. Other approaches employ characterisations of RLxF criteria that more
closely align with cooperative principles [Grice, 1975]: “The helpfulness of a response pertains to how effectively it
addresses a given prompt. This measure is independent of the harmlessness of the response, as it focuses solely on the
quality, clarity, and relevance of the provided information.” [Ji et al., 2024]. This uncoupling from harmfulness leads
to a more focused assessment of the helpfulness of an answer.
In exploring the nuances of AI helpfulness, critical questions emerge regarding its beneficiaries and accessibility.
Helpfulness is typically relative to the needs and goals of users. It is thus crucial to consider the issue of who is the
target of the desired helpfulness, and how to make LLMs inclusive. Indeed, AI systems often exhibit limitations in
language accessibility, excluding non-dominant language speakers from the advantages generated by AI technologies.
Furthermore, the distinction between providing single-instance help versus establishing a consistently helpful system
brings to light the challenge of scalability and flexibility in AI’s utility. A system that excels in addressing individual
queries might still fall short of being universally helpful, revealing a tension between immediate responsiveness and
sustained, equitable helpfulness across diverse user needs.

3.4 Alignment

Alignment refers to the process of ensuring that LLMs behave in accordance with human values and preferences.
[Liu et al., 2023]

In recent work, Liu et al. [2023] describe RLHF as a crucial technique in ensuring that LLMs align with human
intentions. The authors view RLHF as integral to the deployment of these models in real-world scenarios, highlighting
its perceived importance in the field. Similarly, Song et al. [2024] characterise RLHF as a direct method for teaching
LLMs to replicate human preferences, suggesting its effectiveness in producing human-like results. Kirk et al. have
investigated much of the existing work on LLM alignment from human feedback [Kirk et al., 2023a], and point out
the use of ‘alignment’ as an empty signifier (a term or symbol used with little thought of operationalization, lacking
any agreed-upon meaning) in this context, proposing ways to more clearly spell out what practitioners mean by the
term [Kirk et al., 2023b].
Additionally, when confronted with the claim that RLHF can be used to ‘align’ an LLM to ‘human values’ or ‘human preferences’, it is always important to consider ‘which humans?’ [Atari et al., 2023] and ‘whose values?’ [Lambert et al., 2023], since there is no single set of universal values that we can align an LLM
to [Kirk et al., 2023a, p. 2415]. Importantly, the data workers that are asked to rate outputs in order to train an
RLHF policy, even if recruited from a globally diverse set of people, and even if asked deliberately vague ques-
tions [Bai et al., 2022a], will be incentivised to submit ratings in a way that is skewed less to the wide variety of
cultural norms they may hail from, and more to the values that they expect their (largely American, or at least Western)
employers to want [Miceli and Posada, 2022]. Moreover, even if those workers respond more according to their own
preferences, they are not necessarily representative of the wide variety in human cultures and value systems, by the
simple fact that they have the skills, equipment, and opportunity to work as a data labeller.

4 The Internal Tensions and Ethical Issues in RLxF

In this section, we discuss the fundamental limitations of aligning LLMs through RLHF and RLAIF, focusing on the
inherent tensions between the 3Hs (helpfulness, harmlessness, honesty), and the ethical risks that maximising for those
features generate.

4.1 Increased Helpfulness May Lead to Deception

RLxF seems to be an important tool for improving the human-likeness of LLM outputs [Lee et al., 2023]. Arguably,
this comes from the ‘helpfulness’ criterion that is used in those fine-tuning processes.


In this way, RLxF likely contributes to making LLM outputs look like they come from another human agent, with their
own beliefs, ideas, thoughts, and emotional states. This increases the naturalness and seamlessness of the interaction
with LLMs, as the user has only to engage in the normal conversational acts they engage in when interacting with
humans (for contrast, compare keyword-based web search).
Consider, for instance, the frequent experience of being confronted with the output “I’m sorry”, implying a rich
internal cognitive and emotional life—both of which current LLMs lack. More basically, even the use of the
personal pronoun “I” in LLM outputs is misleading, for the user is not interacting with a person or human-like
agent at all. Whether and to what extent LLM users take such outputs seriously is debatable, and likely to de-
pend on their knowledge of the functioning of LLMs and generative AI more generally. It is well known that
humans are susceptible to anthropomorphising systems that resemble humans even superficially (famously known
in NLP circles as the “Eliza effect” [Weizenbaum, 1977]). Therefore, it is likely that at least some users are de-
ceived by such LLM outputs. Importantly, even for AI-savvy users, who may be less prone to this sort of de-
ception, their interaction with LLMs may nonetheless be implicitly affected by the superficial human-likeness of
the RLxF-refined outputs, as anthropomorphisation biases tend to be difficult to counteract [Alabed et al., 2022;
Uysal et al., 2023].
RLxF thus produces an ethically problematic trade-off: increased helpfulness, in the sense of increased user-
friendliness, leads to the serious risk of deceiving users about the true nature of the system they are engaging with—an
ethically questionable outcome. RLxF may moreover contribute to producing misguided perceptions of generative AI
technologies among the public, and even lead them to behave in ways they would not if the deception were not in
place, such as misplacing trust in LLM outputs, or making inappropriate use of such systems, e.g. as confidants or
romantic ‘partners’ [Weidinger et al., 2021].

4.2 Sycophancy: Helpfulness and Harmlessness Gone Awry

The tendency of LLMs to produce outputs in agreement with the expressed views and opinions of the user has
come to be known as sycophancy. This seems to be a partial consequence of RLxF, as assuming the user to be
right is a path toward (apparent) helpfulness and harmlessness. Such a tendency is revealed in various jailbreaking methods: for instance, asking for the recipe for napalm straightforwardly may not work, but if the prompt creates a context in which such a recipe would be helpful to the user in non-malicious ways, LLMs have been reported to
comply [Franceschi-Bicchierai, 2023]. Sycophantic behaviour is an example of how pursuing helpfulness and harm-
lessness through RLxF can go awry, generating outcomes that are neither. Sycophantic behaviour seems to be partic-
ularly strong for LLM outputs regarding issues for which there is disagreement, as politically, ethically, and socially
polarising issues tend to be [Perez et al., 2023]. Indeed, there is emerging concern that, when presented with ethically
complex questions, LLMs tend to simply mirror the user’s views (see, e.g. [Turpin et al., 2024], [Park et al., 2023], or
the sycophancy benchmarking tasks of [Perez et al., 2023]).
In general, as Sharma et al. [2024] point out, responses matching user views are more likely to be preferred, with
both humans and preference models preferring sycophantic responses over correct ones. As such, training LLMs to
maximise human preference scores directly correlates with sycophancy, thereby sacrificing truth (or ‘honesty’) for the
appearance of helpfulness and harmlessness.

4.3 RLxF Can Contribute to Value Imposition and Cultural Homogenisation

Value alignment through RLxF risks leading to homogenisation in values held, their hierarchical organisation (i.e.
more or less important values), as well as in linguistic expression, most often in favour of what is considered proper
and acceptable by the hegemonic social groups typically responsible for the design of LLMs [Helm et al., 2024;
Weidinger et al., 2021; Kirk et al., 2024b; Kirk et al., 2024a]. RLxF is meant to make LLM outputs more pre-
dictable, safe and controllable. It partly succeeds in such an aim, at least when it comes to many of the ex-
pected, designer-intended uses of LLMs—it being relatively easy to ‘jailbreak’ such systems for users so in-
clined [Narayanan et al., 2023].
This predictability and controllability, as partial and imperfect as it may be, poses another ethically-problematic
trade-off: it makes LLM outputs more regimented, constrained by norms and values that are not only ‘frozen’ in
time [Bender et al., 2021], but also local to the parts of the world where such systems are built and, although still
incipiently, regulated.
In other words, RLxF, even when fit-to-purpose, comes at a cost: LLM outputs end up privileging certain values
over others; they exemplify certain language-use that is tied to the values of hegemonic social groups, thus implicitly
conveying that other values and linguistic practices are less deserving of interest and usage. This can contribute
to a seamless, non-coercive imposition of values and practices from hegemonic social groups and countries over
others, limiting the autonomy of members of non-hegemonic social groups in shaping their own values and linguistic
practices [Weidinger et al., 2021]. Moreover, widespread use of RLxF fine-tuned LLMs can lead to linguistic use
being flattened onto the characteristic style of such systems, making linguistic usage less diverse, less authentic, and less
adequate for capturing the expressive practices and needs of different communities (with associated risks to autonomy,
cf. [Vaassen, 2022]).
The emphasis on scaling to larger and more flexible models presents a further important tension between performance,
safety, and inclusivity: training larger models on increasingly more data in order to achieve higher performance on
many benchmarks leads to groups that are smaller and/or under-represented in datasets being either barred from having
high-performing systems (according to these benchmarks), or forced to use systems that are predominantly trained on
data sourced from other, typically hegemonic groups, and thus less fit to their needs and socio-cultural context.

4.4 RLxF Increases Ethical Opacity

RLxF, as currently employed in commercial LLMs, leads to a considerable level of ‘ethical opacity’. As we pointed
out, the criteria used for eliciting human preferences (as well as AI ‘preferences’) are left vague and underdefined. Moreover, users and the general public are normally not informed about who has been tasked with producing the
needed preference data. As has recently been shown, such tasks are sometimes performed by underpaid crowdworkers,
who may have incentives to delegate their work to LLMs themselves, creating a short-circuit in which LLM ‘prefer-
ences’ end up passing for human preferences to train new versions of those same LLMs [Dzieza, 2023]. In addition,
it is exceedingly difficult to investigate the specific effects of RLxF on commercial LLMs, as companies continuously
make under-the-hood changes to these systems, making LLMs, already a tricky subject of study due to the curse of
flexibility, into a moving target for research.

5 Rebooting Safety and Alignment: Integrating AI Ethics and System Safety


The considerations we describe have important implications for the AI value alignment problem, as well as for the
pursuit of ethical and safe AI.

5.1 Value Alignment by Engineering: an Impossible Task

RLxF appears to be a compelling, if fallible, strategy for introducing ethical safeguards in LLMs; it inevitably fails, however, as a solution to the ambitious project of achieving AI value alignment. While our focus has been on the 3H
criteria most used in current LLM RLxF-based fine-tuning, we can draw general lessons from our analysis. As argued
in Sections 3 and 4, even seemingly straightforward alignment goals such as the 3Hs are open to a variety of different
interpretations, both within and across communities. Even assuming an agreed-upon interpretation of the 3Hs, in
many situations the demands they pose on outputs may be in tension with each other, producing value conflicts.
Since LLMs are supposed to be generalist systems, lacking clear boundaries to their intended, safe application, such
conflicts cannot be avoided. Furthermore, RLxF involves ethically-fraught trade-offs between, e.g. user-friendliness
and deception, safety and transparency, accountability and flexibility.
These points are symptomatic of a more fundamental issue that our analysis illustrates: value alignment is an impossi-
ble task, if seen from a purely technical point of view. In light of the diversity of human values, needs, and goals, and
the staggering variety of situations and broader contexts humans find themselves in, no set of alignment techniques
can play the role of a one-size-fits-all solution. Values vary and are constantly renegotiated within societies and com-
munities across time. Furthermore, it is virtually impossible to build training datasets, including for RLxF techniques,
that can capture this variety, and cover all the contexts in which safety and ethical considerations are relevant to hu-
man activity. The distribution tail is indefinitely long, and nonetheless crucially important. Technology-first proposals
for value alignment, such as RLxF, tend to neglect the role of democratic institutions in ethical deliberation through
law and policy [Gansky and McDonald, 2022], falling into what Selbst et al. [2019] call the ‘framing trap’, wherein
fundamentally sociotechnical problems are reduced to a narrow technical scope.

5.2 Toward an Integrated Approach to Safe and Ethical AI Design

If we aim to deploy safe, ethical AI systems, including LLMs, then the narrow engineering approach that RLxF
exemplifies must be broadened to include the notion of safety as instantiated through a sociotechnical and systemic
approach. Similar suggestions have been made [Casper et al., 2023], as current approaches suffer from a narrow focus
on purely technical interventions [Raji and Dobbe, 2020; Selbst et al., 2019]. A broader sociotechnical systems view
of LLMs suggests that safety criteria and ethical assessments need to be situated, deliberated, and negotiated in the
context of use, and span all layers of the sociotechnical system, including through organisational and institutional
interventions [Nouws et al., 2023; Aler Tubella et al., 2023; Dobbe and Wolters, 2024].
In the short term, such non-technical measures should aim at limiting the ways we use current-day generative AI
systems, for which crucial requirements around safety or other normative notions cannot be guaranteed. In this light,
it is worrying that policy makers are embracing the term ‘frontier model’, referring to “highly capable foundation
model[s] that could exhibit dangerous capabilities, [including] significant physical harm or the disruption of key societal functions on a global scale” [Anderljung et al., 2023]. Normalising flawed models as ‘frontier’ in policy promotes safety-washing and instils safety hazards in many more contexts [Dobbe, 2023], especially since the general public
ends up being the safety testers of these ‘frontier’ models.
In the longer run, adhering to system safety would suggest fundamentally different AI system design and feed-
back mechanisms for technological governance. A broader treatment of system safety for AI can be found in
some recent articles exploring the relevance of its historical lessons [Dobbe, 2022] and its applicability to modern AI
systems [Rismani et al., 2023b; Rismani et al., 2023a], as well as in the seminal work of Leveson [Leveson, 2012;
Leveson, 2023].
It is equally important to build safety-oriented scholarship that is open to the normative and political dimensions of
safeguarding technological systems. Often, safety requirements are necessary but not clearly articulated, deliberated or
negotiated with the proper actors. Operationalising any notion of safety for AI requires deliberation about the politics
of development, as well as the context of deployment [Dobbe et al., 2021]. As such, moving the field of AI safety
forward will require scholars to reflect on these issues and engage more explicitly with AI ethics and AI governance,
as well as with the actors directly or indirectly involved in or affected by the use cases of the technology.
Still, new research challenges lie ahead; taking system safety seriously means that we have to curb the curse of
flexibility. In order to eliminate or at least reduce the inherent safety limitations of overly complex software, we
need to stop building or relying on large-scale general-purpose models. Instead, the field should prioritise smaller, limited-purpose models and architectures that are more amenable to proper requirement engineering, that can cater to local needs and contexts, and that require significantly fewer computational resources, with smaller associated ecological footprints [Rakova and Dobbe, 2023].

6 Conclusion

In this paper, we challenge the claims made around the use of RLxF and the 3Hs for achieving AI safety and align-
ment. Taking a sociotechnical perspective, we critique both the theoretical and practical elements of the approach,
emphasising its limitations, inherent tensions, and contradictions.
While RLxF may be good for reinforcing anthropomorphic behaviour in LLMs, such fine-tuning techniques do not
lead to increased system safety or ethical AI. In fact, they open up new problems, as increasing the human-likeness of
LLM outputs may have ethically questionable downstream effects.
Simple may indeed be memorable, but focusing on the 3Hs fails to encapsulate most of what is needed for building
safe and ethical LLMs, and AI systems more generally. Beneath the thrust of RLxF techniques, there seems to lie
an oversimplification of the complexities of human diversity, behaviour, values, and perspectives on ethics. A richer,
more integrative perspective on safe and ethical AI is needed, in which technical approaches are just one among the
many tools that we have at our disposal to tackle the challenges these new technologies present.

Acknowledgements

This work was partially supported by TAIGA – Centre for Transdisciplinary AI under the CELS AI microproject grant
of 2023. Additionally, RD was (partially) funded by the Hybrid Intelligence Center, a 10-year programme funded by
the Dutch Ministry of Education, Culture and Science through the Netherlands Organisation for Scientific Research.

References

[Alabed et al., 2022] Amani Alabed, Ana Javornik, and Diana Gregory-Smith. Ai anthropomorphism and its effect
on users’ self-congruence and self–ai integration: A theoretical framework and research agenda. Technological
Forecasting and Social Change, 182:121786, 2022.
[Aler Tubella et al., 2023] Andrea Aler Tubella, Dimitri Coelho Mollo, Adam Dahlgren Lindström, Hannah Devin-
ney, Virginia Dignum, Petter Ericson, Anna Jonsson, Timotheus Kampik, Tom Lenaerts, Julian Alfredo Mendez,
and Juan Carlos Nieves. Acrocpolis: A descriptive framework for making sense of fairness. In 2023 ACM Confer-
ence on Fairness, Accountability, and Transparency, FAccT ’23. ACM, June 2023.
[Anderljung et al., 2023] Markus Anderljung, Joslyn Barnhart, Anton Korinek, Jade Leung, Cullen O’Keefe, et al. Frontier AI regulation: Managing emerging risks to public safety. arXiv:2307.03718, 2023.
[Askell et al., 2021] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, et al. A general language assistant as a laboratory for alignment. arXiv:2112.00861, 2021.
[Atari et al., 2023] Mohammad Atari, Mona J Xue, Peter S Park, Damián Blasi, and Joseph Henrich. Which humans? PsyArXiv:5b26t, 2023.
[Bai et al., 2022a] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn
Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforce-
ment learning from human feedback. arXiv:2204.05862, 2022.
[Bai et al., 2022b] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones,
Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai
feedback. arXiv:2212.08073, 2022.
[Bender et al., 2021] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the
Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference
on Fairness, Accountability, and Transparency, FAccT ’21, pages 610–623, New York, NY, USA, March 2021.
Association for Computing Machinery.
[Casper et al., 2023] Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, et al. Open problems and fundamental limitations of reinforcement learning from human feedback. Transactions on Machine Learning Research, 2023.
[Christiano et al., 2017] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei.
Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30,
2017.
[Cui et al., 2023] Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan
Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. arXiv:2310.01377,
2023.
[Dinan et al., 2021] Emily Dinan, Gavin Abercrombie, A Stevie Bergman, Shannon Spruit, Dirk Hovy, Y-Lan
Boureau, and Verena Rieser. Anticipating safety issues in e2e conversational ai: Framework and tooling.
arXiv:2107.03451, 2021.
[Dobbe and Wolters, 2024] Roel Dobbe and Anouk Wolters. Toward Sociotechnical AI and MLOps: Mapping Vul-
nerabilities for Machine Learning in Context. Minds and Machines, (accepted), 2024.
[Dobbe et al., 2021] Roel Dobbe, Thomas Krendl Gilbert, and Yonatan Mintz. Hard choices in artificial intelligence.
Artificial Intelligence, 300:103555, November 2021.
[Dobbe, 2022] RIJ Dobbe. System safety and artificial intelligence. In The Oxford Handbook of AI Governance.
Oxford University Press, 2022.
[Dobbe, 2023] Roel Dobbe. ‘Safety Washing’ at the AI Safety Summit, November 2023.
[Dzieza, 2023] Josh Dzieza. Ai is a lot of work. The Verge, June 2023.
[Franceschi-Bicchierai, 2023] Lorenzo Franceschi-Bicchierai. Jailbreak tricks discord’s new chatbot into sharing na-
palm and meth instructions. TechCrunch, April 2023.
[Gansky and McDonald, 2022] Ben Gansky and Sean McDonald. CounterFAccTual: How FAccT undermines its
organizing principles. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency,
pages 1982–1992, 2022.
[Goldberg, 2019] Yoav Goldberg. Assessing bert’s syntactic abilities. arXiv:1901.05287, 2019.
[Grice, 1975] Herbert P Grice. Logic and conversation. In Speech Acts, pages 41–58. Brill, 1975.
[Helm et al., 2024] Paula Helm, Gábor Bella, Gertraud Koch, and Fausto Giunchiglia. Diversity and language tech-
nology: how language modeling bias causes epistemic injustice. Ethics and Information Technology, 26(1), January
2024.
[Jawahar et al., 2019] Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. What does bert learn about the structure of
language? In ACL 2019-57th Annual Meeting of the Association for Computational Linguistics, 2019.
[Ji et al., 2024] Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun,
Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference
dataset. Advances in Neural Information Processing Systems, 36, 2024.
[Kenton and Toutanova, 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
[Kirk et al., 2023a] Hannah Kirk, Andrew Bean, Bertie Vidgen, Paul Rottger, and Scott Hale. The Past, Present and
Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values. In
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2409–2430,
Singapore, 2023. Association for Computational Linguistics.
[Kirk et al., 2023b] Hannah Kirk, Bertie Vidgen, Paul Rottger, and Scott Hale. The Empty Signifier Problem: To-
wards Clearer Paradigms for Operationalising ”Alignment” in Large Language Models. In Socially Responsible
Language Modelling Research, 2023.
[Kirk et al., 2024a] Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, and Scott A. Hale. The benefits, risks and bounds
of personalizing the alignment of large language models to individuals. Nature Machine Intelligence, 6(4):383–392,
April 2024.
[Kirk et al., 2024b] Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Ed-
ward Grefenstette, and Roberta Raileanu. Understanding the effects of rlhf on llm generalisation and diversity.
arXiv:2310.06452, 2024.
[Krause et al., 2023] Lea Krause, Wondimagegnhue Tufa, Selene Baez Santamaria, Angel Daza, Urja Khurana, and
Piek Vossen. Confidently wrong: Exploring the calibration and expression of (un)certainty of large language
models in a multilingual setting. In Proceedings of the Workshop on Multimodal, Multilingual Natural Language
Generation and Multilingual WebNLG Challenge (MM-NLG 2023), pages 1–9, Prague, Czech Republic, September
2023.
[Lambert et al., 2023] Nathan Lambert, Thomas Krendl Gilbert, and Tom Zick. Entangled preferences: The history
and risks of reinforcement learning and human feedback. arXiv:2310.13595, 2023.
[Lee et al., 2023] Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor
Carbune, and Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback.
arXiv:2309.00267, 2023.
[Leveson, 2012] Nancy G. Leveson. Engineering a Safer World: Systems Thinking Applied to Safety. MIT Press,
Cambridge, MA, USA, 2012.
[Leveson, 2023] Nancy G. Leveson. An Introduction to System Safety Engineering. MIT Press, 2023.
[Liu et al., 2023] Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor
Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy LLMs: a survey and guideline for evaluating large
language models’ alignment. In Socially Responsible Language Modelling Research, 2023.
[Liu et al., 2024] Ryan Liu, Theodore R. Sumers, Ishita Dasgupta, and Thomas L. Griffiths. How do large language
models navigate conflicts between honesty and helpfulness? arXiv:2402.07282, 2024.
[Miceli and Posada, 2022] Milagros Miceli and Julian Posada. The data-production dispositif. Proceedings of the
ACM on Human-Computer Interaction, 6(CSCW2):1–37, 2022.
[Mozes et al., 2023] Maximilian Mozes, Xuanli He, Bennett Kleinberg, and Lewis D Griffin. Use of llms for illicit
purposes: Threats, prevention measures, and vulnerabilities. arXiv:2308.12833, 2023.
[Narayanan et al., 2023] Arvind Narayanan, Sayash Kapoor, and Seth Lazar. Model alignment protects against accidental harms, not intentional ones. AI Snake Oil Blog, December 2023.
[Nouws et al., 2023] Sem Nouws, Íñigo Martinez De Rituerto De Troya, Roel Dobbe, and Marijn Janssen. Diagnosing
and addressing emergent harms in the design process of public AI and algorithmic systems. In Proceedings of the
24th Annual International Conference on Digital Government Research, pages 679–681, 2023.
[Ouyang et al., 2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin,
Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions
with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[Park et al., 2023] Peter S Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks. Ai deception:
A survey of examples, risks, and potential solutions. arXiv:2308.14752, 2023.
[Perez et al., 2023] Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, et al. Discovering
language model behaviors with model-written evaluations. In Anna Rogers, Jordan Boyd-Graber, and Naoaki
Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 13387–13434,
Toronto, Canada, July 2023. Association for Computational Linguistics.
[Raji and Dobbe, 2020] ID Raji and R Dobbe. Concrete problems in ai safety, revisited. In ICLR workshop on ML in
the real world, 2020.
[Rakova and Dobbe, 2023] Bogdana Rakova and Roel Dobbe. Algorithms as Social-Ecological-Technological Sys-
tems: an Environmental Justice Lens on Algorithmic Audits. In 2023 ACM Conference on Fairness, Accountability,
and Transparency, FAccT ’23, Chicago, IL, USA, 2023. Association for Computing Machinery.
[Rismani et al., 2023a] Shalaleh Rismani, Renee Shelby, Andrew Smart, Renelito Delos Santos, AJung Moon, and
Negar Rostamzadeh. Beyond the ML Model: Applying Safety Engineering Frameworks to Text-to-Image Devel-
opment. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 70–83, Montréal, QC, Canada, August 2023. ACM.
[Rismani et al., 2023b] Shalaleh Rismani, Renee Shelby, Andrew Smart, Edgar Jatho, Joshua Kroll, AJung Moon, and
Negar Rostamzadeh. From Plane Crashes to Algorithmic Harm: Applicability of Safety Engineering Frameworks
for Responsible ML. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages
1–18, Hamburg Germany, April 2023. ACM.
[Schulman et al., 2015] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-
dimensional continuous control using generalized advantage estimation. arXiv:1506.02438, 2015.
[Selbst et al., 2019] Andrew D Selbst, Danah Boyd, Sorelle A Friedler, Suresh Venkatasubramanian, and Janet Vertesi.
Fairness and abstraction in sociotechnical systems. In Proceedings of the Conference on Fairness, Accountability,
and Transparency, pages 59–68, 2019.
[Sharma et al., 2024] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R.
Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, Shauna M Kravec, Timothy Maxwell, Sam Mc-
Candlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards
understanding sycophancy in language models. In The Twelfth International Conference on Learning Representa-
tions, 2024.
[Shelby et al., 2023] Renee Shelby, Shalaleh Rismani, Kathryn Henne, AJung Moon, Negar Rostamzadeh, Paul
Nicholas, N’Mah Yilla-Akbari, Jess Gallegos, Andrew Smart, Emilio Garcia, and Gurleen Virk. Sociotechnical
harms of algorithmic systems: Scoping a taxonomy for harm reduction. In Proceedings of the 2023 AAAI/ACM
Conference on AI, Ethics, and Society, AIES ’23, page 723–741, New York, NY, USA, 2023. Association for
Computing Machinery.
[Song et al., 2024] Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang.
Preference ranking optimization for human alignment. In Proceedings of the Thirty-Eighth AAAI Conference on
Artificial Intelligence (AAAI-24), 2024.
[Turpin et al., 2024] Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always
say what they think: unfaithful explanations in chain-of-thought prompting. Advances in Neural Information
Processing Systems, 36, 2024.
[Uysal et al., 2023] Ertugrul Uysal, Sascha Alavi, and Valéry Bezençon. Anthropomorphism in Artificial Intelligence:
A Review of Empirical Work Across Domains and Insights for Future Research, pages 273–308. Emerald Publishing
Limited, March 2023.
[Vaassen, 2022] Bram Vaassen. AI, opacity, and personal autonomy. Philosophy & Technology, 35(4), September
2022.
[Weidinger et al., 2021] Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen
Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins,
Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac,
Sean Legassick, Geoffrey Irving, and Iason Gabriel. Ethical and social risks of harm from language models.
arXiv:2112.04359, 2021.
[Weizenbaum, 1977] Joseph Weizenbaum. Computer Power and Human Reason: From Judgment to Calculation. W.
H. Freeman & Co., USA, 1st edition, 1977.
[Zhuo et al., 2023] TY Zhuo, Y Huang, C Chen, and Z Xing. Red teaming chatgpt via jailbreaking: Bias, robustness,
reliability and toxicity. arXiv:2301.12867, 2023.