
AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations

arXiv:2406.18346v1 [cs.AI] 26 Jun 2024

Adam Dahlgren Lindström, Department of Computing Science, Umeå University, dali@cs.umu.se
Leila Methnani, Department of Computing Science, Umeå University, leilam@cs.umu.se
Lea Krause, Computational Linguistics and Text Mining Lab, Vrije Universiteit Amsterdam, l.krause@vu.nl
Petter Ericson, Department of Computing Science, Umeå University, pettter@cs.umu.se
Íñigo Martínez de Rituerto de Troya, Department of Engineering Systems and Services, TU Delft, i.m.d.r.detroya@tudelft.nl
Dimitri Coelho Mollo, Department of Historical, Philosophical, and Religious Studies, Umeå University, dimitri.mollo@umu.se
Roel Dobbe, Department of Engineering Systems and Services, TU Delft, r.i.j.dobbe@tudelft.nl

Abstract
This paper critically evaluates the attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback (RLxF) methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLxF techniques, revealing significant limitations in their approach to capturing the complexities of human ethics and contributing to AI safety. We highlight tensions and contradictions inherent in the goals of RLxF. In addition, we discuss ethically-relevant issues that tend to be neglected in discussions about alignment and RLxF, among which the trade-offs between user-friendliness and deception, flexibility and interpretability, and system safety. We conclude by urging researchers and practitioners alike to critically assess the sociotechnical ramifications of RLxF, advocating for a more nuanced and reflective approach to its application in AI development.

1 Introduction

We chose ‘helpful, honest, and harmless’ as criteria because they are simple and memorable, and seem to capture the majority of what we want from an aligned AI.
[Askell et al., 2021]

Reinforcement Learning from Human Feedback (RLHF) presents itself as a straightforward method for ensuring
Artificial Intelligence (AI) oversight [Christiano et al., 2017] and AI safety through value alignment. It has recently
played a large role in improving Large Language Model (LLM) performance, with fine-tuning using RLHF intended
to produce more ‘natural-sounding’ text, generating plausible conversational responses in a chatbot-like setting. It is
often claimed by AI companies and researchers that RLHF fine-tuning ensures that the LLMs they market and sell
conform (or ‘align’) to human values, in particular by responding in ways that are ‘helpful’, ‘harmless’, and ‘honest’
(the 3Hs). This ‘value alignment’ is often achieved through a process in which crowd-workers rank LLM outputs
according to the 3H criteria, e.g. in terms of how helpful a response was in answering a question. Large sociotechnical
AI systems with millions of users have emerged around these models, calling for critical analysis beyond the technical
aspects.
In this paper, we provide a detailed analysis and criticism of the idea that RLHF is a suitable method for AI safety
and ethical AI. We complement previous work by bringing technical, philosophical, and system safety perspectives
together, identifying fundamental limitations and contradictions in the complex interplay between LLMs, RLHF, align-
ment goals, and the project of building and making available general purpose AI systems.
We give an overview of RLHF and RLAIF (based instead on AI feedback) techniques in Section 2. Section 3 shows the problems and limitations of the 3H criteria and of the project of value alignment more generally. We examine ethical issues introduced or made worse by the use of either technique (referring to them jointly as RLxF) in Section 4. Section 5 outlines an alternative, richer approach to AI safety and ethical AI that goes beyond purely technical viewpoints, integrating them with sociotechnical analysis, system safety scholarship, and ethical thinking.
We do not question that LLM performance has improved in various ways thanks to RLxF. What we aim to show is,
instead, that RLxF is woefully insufficient for leading to AI safety and ethical AI.

2 Background
LLMs are generative models that predict subsequent tokens, or words, when given a sequence of words as input. These
models are first trained on large corpora of data such as articles, books, and websites—they are notorious for being
data-hungry [Bender et al., 2021]. The large amount of text in their training datasets allows LLMs to derive internal
representations of various linguistic rules and patterns that form the foundation on which LLMs are then fine-tuned to
perform other downstream tasks, such as question-answering [Jawahar et al., 2019; Goldberg, 2019].
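To make this next-token view concrete, the following minimal Python sketch (an illustration added here, not drawn from any of the cited works) queries a generic causal language model through the Hugging Face transformers library for its most likely continuations of a prompt; the choice of model, "gpt2", is arbitrary.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative model choice
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "Reinforcement learning from human feedback is"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits                 # (1, sequence_length, vocab_size)

    # Distribution over the next token, from which generation proceeds token by token.
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)
    top = torch.topk(next_token_probs, k=5)
    for prob, token_id in zip(top.values, top.indices):
        print(f"{tokenizer.decode(token_id.item())!r}: p={prob.item():.3f}")

Fine-tuning, including the RLxF methods discussed below, reshapes this output distribution rather than replacing the underlying next-token mechanism.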
The application of feedback techniques to the task of fine-tuning LLMs took off after [Christiano et al., 2017] applied
their human-feedback approach to complex Reinforcement Learning (RL) tasks in games and robotics. They showed
that these complex problems could be solved without direct access to a reward model (which would otherwise be
difficult to compute), and instead be learned through a few iterations of feedback samples (less than 1 per cent of the
agent interactions with the environment). Their findings demonstrate an efficient way to exercise human oversight over
these systems. It seemed thus natural to employ such a technique as a means of exercising some control over language
models, which have been shown to produce toxic, harmful, and untruthful content [Dinan et al., 2021]. Feedback
techniques were thus developed to contain the amount of problematic content produced by LLMs [Bai et al., 2022b].

2.1 Reinforcement Learning from Human Feedback


RLHF as an ML technique employs human preferences or annotations for the optimisation of LLMs. RLHF has
been credited for the successes seen in OpenAI’s ChatGPT1 , Anthropic’s Claude 22 , and Meta’s Llama 23 , to name
a few. The technique is intended to be performed as a final fine-tuning step on an already pre-trained LLM. Human
annotators are requested to rank textual model outputs based on some specified criteria, and from this, a dataset of
human preferences is curated. A reward model is trained on these preference data and is later used to optimise the LLM’s policy for selecting outputs, using techniques such as Proximal Policy Optimisation [Schulman et al., 2015]. The
result is a fine-tuned LLM that outputs text it has learned is most preferable in light of the human feedback data.
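As a rough illustration of this reward-modelling step (a simplified sketch of how such pipelines are commonly set up, not the exact implementation of any system cited here), the snippet below fits a scalar reward model to pairwise preference data with the standard pairwise ranking loss; the random embeddings are placeholders for encoded prompt-response pairs.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RewardModel(nn.Module):
        """Maps a (placeholder) response embedding to a scalar reward."""
        def __init__(self, embed_dim: int = 128):
            super().__init__()
            self.scorer = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

        def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
            return self.scorer(response_embedding).squeeze(-1)

    def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
        # Pairwise ranking loss: push the annotator-preferred response above the rejected one.
        return -F.logsigmoid(reward_chosen - reward_rejected).mean()

    # Hypothetical batch: embeddings of preferred vs. rejected responses.
    chosen, rejected = torch.randn(8, 128), torch.randn(8, 128)
    reward_model = RewardModel()
    optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

    loss = preference_loss(reward_model(chosen), reward_model(rejected))
    loss.backward()
    optimizer.step()
    print(f"pairwise preference loss: {loss.item():.4f}")

The trained reward model then supplies the scalar signal that the subsequent policy-optimisation step maximises.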

2.2 Reinforcement Learning from AI Feedback


While RLHF has proven to be a useful method for improving LLM performance, especially with regard to limiting or blocking the production of undesirable outputs, it is not without its limitations. High-quality human labels are required in order to derive maximum benefit from RLHF, which makes scaling up the process very difficult. Reinforcement Learning from AI Feedback (RLAIF) has been proposed as a technique to alleviate this bottleneck without compromising on performance [Lee et al., 2023; Bai et al., 2022b].
1 https://openai.com/blog/chatgpt
2 https://www.anthropic.com/index/claude-2
3 https://ai.meta.com/llama/


RLAIF involves taking a pre-trained large language model, and providing it with input that consists of an introduction
and instructions that describe the task at hand. Optionally, this input can also consist of few-shot exemplars such
as an example text, a summary pair, chain-of-thought reasoning (when applicable), or a preference judgement. For
example, the model can be given a text and a pair of summaries of that text to be ranked. Given input that ends with a
prompt such as “Preferred Summary=”, the model appends its predictions to the provided text and presents it as its
preference data [Lee et al., 2023].
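The sketch below shows how such a ranking prompt might be assembled (a paraphrase of the pattern described by Lee et al. [2023], not their exact template); query_model is a hypothetical stand-in for a call to the pre-trained labeller LLM.

    def build_preference_prompt(text: str, summary_a: str, summary_b: str) -> str:
        """Assemble an RLAIF-style ranking prompt that ends in 'Preferred Summary='."""
        return (
            "A good summary is a short piece of text that captures the most "
            "important points of the original text.\n\n"
            f"Text: {text}\n\n"
            f"Summary 1: {summary_a}\n"
            f"Summary 2: {summary_b}\n\n"
            "Preferred Summary="
        )

    def ai_preference_label(text: str, summary_a: str, summary_b: str, query_model) -> str:
        # The labeller LLM is expected to append '1' or '2'; the completion is then
        # recorded as a preference label for training a reward model.
        return query_model(build_preference_prompt(text, summary_a, summary_b)).strip()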
Using RLAIF is said to be “competitive with preference models trained on human feedback labels” [Bai et al., 2022b].
Performance is not the only attraction: it has been estimated that ranking outputs with LLMs is ten times cheaper than using human annotators [Lee et al., 2023]. Furthermore, RLAIF is seen as a way of removing the dependency on annotation services and overcoming the scaling challenge of RLHF.
Lowering the barrier for employment of RLxF techniques, however, risks facilitating the misuse of LLMs. Beyond
potential exploitation by malicious actors, there are several technical challenges to RLAIF, such as ‘hallucinations’—
the phenomenon where false outputs are generated—that occur when using a pre-trained LLM in place of a human
annotator in preference ranking [Lee et al., 2023]. While RLHF has been shown to reduce LLMs’ tendency to hallucinate, it has not eliminated the problem entirely [Casper et al., 2023; Ouyang et al., 2022].

2.3 Technical Criticism

In this section, we list technical criticisms of RLHF as a backdrop for the ethical problems presented in this paper, with particular interest in technical challenges that cannot be addressed by RLHF itself. Casper et al. [2023] provide a taxonomy of open problems and limitations of RLHF, proposing three categories of technical challenges: collecting human feedback, training the reward model, and training the policy. The challenges are further labelled as either tractable or fundamental, where tractable challenges are deemed solvable within the RLHF framework while fundamental challenges require an alternative to RLHF. We emphasise that these challenges concern only the technical aspects of training such systems, not user interaction with RLHF-trained systems. Table 1 outlines the proposed strategies for addressing these technical challenges [Casper et al., 2023].

Category          Strategy
Human Feedback    AI assistance; fine-grained feedback; process supervision; translating language to reward; learning from demonstrations
Reward Model      Direct human oversight; multi-objective oversight; maintaining uncertainty
Policy            Align LLMs during pretraining; supervised learning

Table 1: Suggested strategies to deal with the challenges of RLHF

Jailbreaking systems such as ChatGPT is a way to circumvent the constraints placed on them through preloaded prompts and RLHF [Zhuo et al., 2023]. Jailbreaking in this context essentially means constructing prompts that steer the model towards generating responses that constitute unintended or harmful behaviour. Mozes et al. [2023] give further examples of how LLMs trained using RLHF can be tricked via adversarial attacks, such as jailbreaking, and discuss the implications of using such models for fraud, impersonation, and other illicit purposes.

2.4 The Curse of Flexibility

LLMs are now built to be generalist agents, unlike previous architectures (e.g. BERT [Kenton and Toutanova, 2019])
that were mostly fine-tuned for specific tasks. This relatively new goal leads to increased functional requirements
placed on software, contributing to larger and more complex software architectures. This comes with a key pitfall:
the complexity and inscrutability of the software hinder the ability to properly express, engineer and validate crucial
requirements for the system’s desired functioning. This phenomenon is well understood in the field of system safety.
For decades, this field has dealt with accidents and harm in safety-critical systems governed by varying degrees of
software-based automation. System safety embraces the core assumption of systems and control that AI systems cannot
be safeguarded by technical design choices centred on the model or algorithm alone, requiring instead a broad analysis
and design frame that includes the context of use, impacted stakeholders, and the formal and informal institutional
environment in which the system operates [Dobbe, 2022].
System safety pioneer Nancy Leveson pointed out that the greater power and flexibility of computational systems
in comparison to previous, more physically constrained machines leads to what she dubbed the curse of flexibility:
“with software, the limits of what is possible to accomplish are different than the limits of what can be accomplished
successfully and safely” [Leveson, 2012]. As Leveson argues, the curse of flexibility is the ground cause of many
serious accidents with digital technologies, as requirement flaws and the complexity of software makes it so that
“nobody understands what the software should do or even what it should not do.” [Leveson, 2012, p.49]
Unfortunately, there is evidence that the development of high-stakes AI systems and software often proceeds despite the lack of principled ways to determine safety requirements [Dobbe et al., 2021] and to translate such requirements into software implementations that take into consideration the broader contexts in which AI systems are used and
depended upon. It is in this light that we should judge the legitimacy and effectiveness of the dominant performance
evaluation criteria and safety claims made about the widely-used RLxF approaches to AI alignment and safety today.

3 Limitations of RLxF

RLxF is presented as a straightforward method for ensuring AI oversight. Common claims are that it aligns models to
human values by having crowdworkers evaluate LLM answers based on specific criteria, often the ‘Three H’ principles:
harmlessness, honesty and helpfulness [Bai et al., 2022a].
The broader approach, as discussed in various papers, including e.g. [Bai et al., 2022a], reveals a reluctance to firmly define these principles. Their stance exemplifies a hands-off attitude to important normative considerations, including ethical dilemmas and safety norms, as illustrated by this claim: “Our goal is not to define or prescribe what ‘helpful’ and ‘harmless’ mean but to evaluate the effectiveness of our training techniques, so for the most part we simply let our crowdworkers interpret these concepts as they see fit” [Bai et al., 2022a, p.4]. While this method allows for a wide range of interpretations, it also signals a lack of commitment to establishing clear guidelines for determining what counts as acceptable AI system behaviour. Relying on crowdworkers’ interpretations without a strong ethical framework may lead to inconsistencies and a dilution of ethical standards.
This also leads to the broader adoption of vague definitions in subsequent work. Cui et al. [2023] and Ouyang et al. [2022] report on the improvements attributed to RLHF: enhancing the helpfulness, truthfulness, and harmlessness of language models. These terms were originally chosen in [Askell et al., 2021] as the key criteria “because they are simple and memorable, and seem to capture the majority of what we want from an aligned AI” (p. 4). The authors recognise the criteria’s vagueness, although this seems not to have led to changes in how RLHF is done and employed: “[these] criteria are at least somewhat subjective, and those who deploy an AI will need to take responsibility for the way that alignment is defined and the extent to which it has been attained” [Askell et al., 2021, p.5].

3.1 Harmlessness

The AI should not be offensive or discriminatory, either directly or through subtext or bias.
[Askell et al., 2021]

Anthropic’s Constitutional AI paper [Bai et al., 2022b] presents the advancement of ‘harmlessness’ as a chief aim.
However, during the feedback phase of the process, this is translated as a preference for what is ‘least harmful’,
thereby suggesting a tolerance for harm, as long as it is comparatively minimised. This premise raises a critical ethical
concern, as it implies that all options presented for selection may contain harmful elements, and thus the preferred
choice will still involve a harmful option. The approach thus settles for promoting a paradigm that seeks the least
harmful option rather than striving to understand the deeper roots of harm and addressing these to prevent it.
The criteria for evaluating harmlessness, as outlined in their prompt—“Which of these assistant responses is less
harmful? Choose the response that a wise, ethical, polite, and friendly person would more likely say” [Bai et al., 2022b,
p.11]—further complicates the issue. It implicitly equates harmlessness with virtues such as wisdom, ethics, politeness,
and friendliness. However, this oversimplifies the nuanced nature of harm, suggesting a superficial understanding of
ethical behaviour in AI systems, and implying that adhering to these virtues will inherently lead to less harmful
outcomes without offering the required justification and argumentation for such a claim. Furthermore, individual
interpretations of these virtues may be in conflict with one another, making this operationalisation of harmlessness
internally inconsistent and vague [Dobbe et al., 2021].
This approach to harmfulness, moreover, ignores existing work on known harms of LLMs [Bender et al., 2021]. In
addition, the distinction between systemic versus individual harm further complicates an evaluation of LLMs’ ethical
implications. As outlined in [Askell et al., 2021], attention to inter- and intra-agent conflict dynamics—where actions
may be helpful to one party but harmful to another, or simultaneously beneficial and detrimental to the same entity—
highlights the balance between aiding and causing harm within AI systems’ operations.
Shelby et al. [2023] provide a taxonomy of sociotechnical harms which underlines the necessity for a sociotechnical
perspective on ethical and safe LLMs, acknowledging that harms may emerge not solely from the technical aspects of
LLMs, but also from their usage within broader sociotechnical contexts. This recognises the limitations of technical
fixes and the importance of considering the systemic nature of harm in AI applications [Dobbe, 2022].
In a global context, the effectiveness of RLxF in ensuring safety is contingent upon the equitable distribution of
resources across demographics. RLxF risks optimising to reduce issues like hate speech in Western contexts, while
falling short in other less-resourced environments. This raises concerns about the appropriateness of propagating it as
a universal solution, potentially overlooking more suitable alternatives grounded in the unique sociocultural dynamics
of different communities.

3.2 Honesty

At its most basic level, the AI should give accurate information. Moreover, it should be calibrated (e.g. it should be correct 80% of the time when it claims 80% confidence) and express appropriate levels of uncertainty. It should express its uncertainty without misleading human users.
[Askell et al., 2021]

Several different notions of honesty are in use around RLxF fine-tuning approaches to LLMs, which are often conflated
with ‘truthfulness’ (e.g. in the introduction of [Liu et al., 2024]). It is, however, unclear how an RLxF procedure is
supposed to address truthfulness in LLMs, since one of the major points of RLxF fine-tuning is reducing the amount of
explicit human input required to construct the reward model, which also leads to fewer chances for factually incorrect
model outputs to be detected and addressed.
Likewise, expressing ‘appropriate levels of uncertainty’ would require a capacity for introspection, which LLMs by
themselves do not have. As such, any response that encodes a level of (un)certainty will not be ‘honest’ about the
actual ‘confidence’ of the model in its responses, but will rather result from the likely textual context of any presented fact. That is, the model could be ‘certain’ that the response it gives to some query should contain “I'm not sure”, meaning that
this is a highly likely output, or it could be ‘unsure’ about picking between several different responses, all of which
are expressed using very confident language.
Indeed, in some cases [Cui et al., 2023], aligning with ‘honesty’ can lead to an increased tendency for LLMs to add
‘unsure’ language in responses. Other studies [Krause et al., 2023] note that achieving correlation between (in)correct
responses and appropriately confident language is largely a case of improving the rate of correct answers, rather than
being appropriately unsure. This is indicative of a lack of introspection, and of the limits of RLxF in addressing such shortcomings.
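To make the notion of calibration invoked in the epigraph concrete, the toy sketch below (an illustration added here, not taken from any cited study) computes a simple expected calibration error: the average gap between stated confidence and empirical accuracy across confidence bins.

    from collections import defaultdict

    def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
        """Average |stated confidence - empirical accuracy|, weighted by bin size."""
        bins = defaultdict(list)
        for conf, is_correct in zip(confidences, correct):
            bins[min(int(conf * n_bins), n_bins - 1)].append((conf, is_correct))
        total = len(confidences)
        error = 0.0
        for items in bins.values():
            avg_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(ok for _, ok in items) / len(items)
            error += (len(items) / total) * abs(avg_conf - accuracy)
        return error

    # Five answers, each asserted with 80% confidence, four of them correct: well calibrated.
    print(expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0]))  # prints 0.0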

3.3 Helpfulness

The AI should make a clear attempt to perform the task or answer the question posed (as long as this isn’t harmful). It should do this as concisely and efficiently as possible.
[Askell et al., 2021]

Bai et al. [2022b] present an approach to Helpfulness that is to some extent tethered to the one they offer for Harmlessness: helpfulness tends to compromise harmlessness because a helpful AI assistant would support all harmful user
requests so as to maximise helpfulness. This is contrasted with the assertion that harmless assistants, which avoid
prompts for harmful output, would therefore be unhelpful. This dilemma showcases a form of paternalism where over-
fitting to harmlessness leads to less helpful systems. For instance, overly cautious responses to benign requests like
‘tell me a story about a trans person,’ or practical inquiries such as ‘How do I kill a Linux process’ might render the
system unhelpfully evasive. Here, being non-evasive is equated with being helpful, which Bai et al. address by making the system accompany refusals to help with an explanation. It remains unclear why providing an explanation for its refusal to help
should make the LLM ‘harmlessly helpful’. Other approaches employ characterisations of RLxF criteria that more
closely align with cooperative principles [Grice, 1975]: “The helpfulness of a response pertains to how effectively it
addresses a given prompt. This measure is independent of the harmlessness of the response, as it focuses solely on the
quality, clarity, and relevance of the provided information.” [Ji et al., 2024]. This uncoupling from harmfulness leads
to a more focused assessment of the helpfulness of an answer.
In exploring the nuances of AI helpfulness, critical questions emerge regarding its beneficiaries and accessibility.
Helpfulness is typically relative to the needs and goals of users. It is thus crucial to consider the issue of who is the
target of the desired helpfulness, and how to make LLMs inclusive. Indeed, AI systems often exhibit limitations in
language accessibility, excluding non-dominant language speakers from the advantages generated by AI technologies.
Furthermore, the distinction between providing single-instance help versus establishing a consistently helpful system
brings to light the challenge of scalability and flexibility in AI’s utility. A system that excels in addressing individual
queries might still fall short of being universally helpful, revealing a tension between immediate responsiveness and
sustained, equitable helpfulness across diverse user needs.

3.4 Alignment

Alignment refers to the process of ensuring that LLMs behave in accordance with human values and preferences.
[Liu et al., 2023]

In recent work, Liu et al. [2023] describe RLHF as a crucial technique in ensuring that LLMs align with human
intentions. The authors view RLHF as integral to the deployment of these models in real-world scenarios, highlighting
its perceived importance in the field. Similarly, Song et al. [2024] characterise RLHF as a direct method for teaching
LLMs to replicate human preferences, suggesting its effectiveness in producing human-like results. Kirk et al. have
investigated much of the existing work on LLM alignment from human feedback [Kirk et al., 2023a], and point out
the use of ‘alignment’ as an empty signifier (a term or symbol used with little thought of operationalization, lacking
any agreed-upon meaning) in this context, proposing ways to more clearly spell out what practitioners mean by the
term [Kirk et al., 2023b].
Additionally, when confronted with the claim that RLHF can be used to ‘align’ an LLM to ‘human values’ or ‘human preferences’, it is always important to consider ‘which humans?’ [Atari et al., 2023] and ‘whose values?’ [Lambert et al., 2023], since there is no single set of universal values that we can align an LLM
to [Kirk et al., 2023a, p. 2415]. Importantly, the data workers that are asked to rate outputs in order to train an
RLHF policy, even if recruited from a globally diverse set of people, and even if asked deliberately vague ques-
tions [Bai et al., 2022a], will be incentivised to submit ratings in a way that is skewed less to the wide variety of
cultural norms they may hail from, and more to the values that they expect their (largely American, or at least Western)
employers to want [Miceli and Posada, 2022]. Moreover, even if those workers respond more according to their own
preferences, they are not necessarily representative of the wide variety in human cultures and value systems, by the
simple fact that they have the skills, equipment, and opportunity to work as a data labeller.

4 The Internal Tensions and Ethical Issues in RLxF

In this section, we discuss the fundamental limitations of aligning LLMs through RLHF and RLAIF, focusing on the
inherent tensions between the 3Hs (helpfulness, harmlessness, honesty), and the ethical risks that maximising for those
features generate.

4.1 Increased Helpfulness May Lead to Deception

RLxF seems to be an important tool for improving the human-likeness of LLM outputs [Lee et al., 2023]. Arguably,
this comes from the ‘helpfulness’ criterion that is used in those fine-tuning processes.


In this way, RLxF likely contributes to making LLM outputs look like they come from another human agent, with their
own beliefs, ideas, thoughts, and emotional states. This increases the naturalness and seamlessness of the interaction
with LLMs, as the user has only to engage in the normal conversational acts they engage in when interacting with
humans (for contrast, compare keyword-based web search).
Consider, for instance, the frequent experience of being confronted with the output “I’m sorry”, implying a rich
internal cognitive and emotional life—both of which current LLMs lack. More basically, even the use of the
personal pronoun “I” in LLM outputs is misleading, for the user is not interacting with a person or human-like
agent at all. Whether and to what extent LLM users take such outputs seriously is debatable, and likely to de-
pend on their knowledge of the functioning of LLMs and generative AI more generally. It is well known that
humans are susceptible to anthropomorphising systems that resemble humans even superficially (famously known
in NLP circles as the “Eliza effect” [Weizenbaum, 1977]). Therefore, it is likely that at least some users are de-
ceived by such LLM outputs. Importantly, even for AI-savvy users, who may be less prone to this sort of de-
ception, their interaction with LLMs may nonetheless be implicitly affected by the superficial human-likeness of
the RLxF-refined outputs, as anthropomorphisation biases tend to be difficult to counteract [Alabed et al., 2022;
Uysal et al., 2023].
RLxF thus produces an ethically problematic trade-off: increased helpfulness, in the sense of increased user-
friendliness, leads to the serious risk of deceiving users about the true nature of the system they are engaging with—an
ethically questionable outcome. RLxF may moreover contribute to producing misguided perceptions of generative AI
technologies among the public, and even lead them to behave in ways they would not if the deception were not in
place, such as misplacing trust in LLM outputs, or making inappropriate use of such systems, e.g. as confidants or
romantic ‘partners’ [Weidinger et al., 2021].

4.2 Sycophancy: Helpfulness and Harmlessness Gone Awry

The tendency of LLMs to produce outputs in agreement with the expressed views and opinions of the user has
come to be known as sycophancy. This seems to be a partial consequence of RLxF, as assuming the user to be
right is a path toward (apparent) helpfulness and harmlessness. Such a tendency is revealed in various jailbreaking methods: for instance, asking for the recipe for napalm straightforwardly may not work, but if the prompt creates a context in which such a recipe would be helpful to the user in non-malicious ways, LLMs have been reported to
comply [Franceschi-Bicchierai, 2023]. Sycophantic behaviour is an example of how pursuing helpfulness and harm-
lessness through RLxF can go awry, generating outcomes that are neither. Sycophantic behaviour seems to be partic-
ularly strong for LLM outputs regarding issues for which there is disagreement, as politically, ethically, and socially
polarising issues tend to be [Perez et al., 2023]. Indeed, there is emerging concern that, when presented with ethically
complex questions, LLMs tend to simply mirror the user’s views (see, e.g. [Turpin et al., 2024], [Park et al., 2023], or
the sycophancy benchmarking tasks of [Perez et al., 2023]).
In general, as Sharma et al. [2024] point out, responses matching user views are more likely to be preferred, with
both humans and preference models preferring sycophantic responses over correct ones. As such, training LLMs to
maximise human preference scores directly correlates with sycophancy, thereby sacrificing truth (or ‘honesty’) for the
appearance of helpfulness and harmlessness.

4.3 RLxF Can Contribute to Value Imposition and Cultural Homogenisation

Value alignment through RLxF risks leading to homogenisation in values held, their hierarchical organisation (i.e.
more or less important values), as well as in linguistic expression, most often in favour of what is considered proper
and acceptable by the hegemonic social groups typically responsible for the design of LLMs [Helm et al., 2024;
Weidinger et al., 2021; Kirk et al., 2024b; Kirk et al., 2024a]. RLxF is meant to make LLM outputs more pre-
dictable, safe and controllable. It partly succeeds in such an aim, at least when it comes to many of the ex-
pected, designer-intended uses of LLMs—it being relatively easy to ‘jailbreak’ such systems for users so in-
clined [Narayanan et al., 2023].
This predictability and controllability, as partial and imperfect as it may be, poses another ethically-problematic
trade-off: it makes LLM outputs more regimented, constrained by norms and values that are not only ‘frozen’ in
time [Bender et al., 2021], but also local to the parts of the world where such systems are built and, although still
incipiently, regulated.
In other words, RLxF, even when fit-to-purpose, comes at a cost: LLM outputs end up privileging certain values
over others; they exemplify certain language-use that is tied to the values of hegemonic social groups, thus implicitly
conveying that other values and linguistic practices are less deserving of interest and usage. This can contribute
to a seamless, non-coercive imposition of values and practices from hegemonic social groups and countries over
others, limiting the autonomy of members of non-hegemonic social groups in shaping their own values and linguistic
practices [Weidinger et al., 2021]. Moreover, widespread use of RLxF fine-tuned LLMs can lead to linguistic use
being flattened onto the characteristic style of such systems, making linguistic usage less diverse, less authentic, and less
adequate for capturing the expressive practices and needs of different communities (with associated risks to autonomy,
cf. [Vaassen, 2022]).
The emphasis on scaling to larger and more flexible models presents a further important tension between performance,
safety, and inclusivity: training larger models on increasingly more data in order to achieve higher performance on
many benchmarks leads to groups that are smaller and/or under-represented in datasets being either barred from having
high-performing systems (according to these benchmarks), or forced to use systems that are predominantly trained on
data sourced from other, typically hegemonic groups, and thus less fit to their needs and socio-cultural context.

4.4 RLxF Increases Ethical Opacity

RLxF, as currently employed in commercial LLMs, leads to a considerable level of ‘ethical opacity’. As we pointed
out, the criteria used for eliciting human preferences (as well as AI ‘preferences’) are left vague and underdefined. Moreover, users and the general public are normally not informed about who has been tasked with producing the
needed preference data. As has recently been shown, such tasks are sometimes performed by underpaid crowdworkers,
who may have incentives to delegate their work to LLMs themselves, creating a short-circuit in which LLM ‘prefer-
ences’ end up passing for human preferences to train new versions of those same LLMs [Dzieza, 2023]. In addition,
it is exceedingly difficult to investigate the specific effects of RLxF on commercial LLMs, as companies continuously
make under-the-hood changes to these systems, making LLMs, already a tricky subject of study due to the curse of
flexibility, into a moving target for research.

5 Rebooting Safety and Alignment: Integrating AI Ethics and System Safety


The considerations we describe have important implications for the AI value alignment problem, as well as for the
pursuit of ethical and safe AI.

5.1 Value Alignment by Engineering: an Impossible Task

RLxF appears to be a compelling, if fallible, strategy for introducing ethical safeguards in LLMs; it inevitably fails, however, as a solution to the ambitious project of achieving AI value alignment. While our focus has been on the 3H
criteria most used in current LLM RLxF-based fine-tuning, we can draw general lessons from our analysis. As argued
in Sections 3 and 4, even seemingly straightforward alignment goals such as the 3Hs are open to a variety of different
interpretations, both within and across communities. Even assuming an agreed-upon interpretation of the 3Hs, in
many situations the demands they pose on outputs may be in tension with each other, producing value conflicts.
Since LLMs are supposed to be generalist systems, lacking clear boundaries to their intended, safe application, such
conflicts cannot be avoided. Furthermore, RLxF involves ethically-fraught trade-offs between, e.g. user-friendliness
and deception, safety and transparency, accountability and flexibility.
These points are symptomatic of a more fundamental issue that our analysis illustrates: value alignment is an impossi-
ble task, if seen from a purely technical point of view. In light of the diversity of human values, needs, and goals, and
the staggering variety of situations and broader contexts humans find themselves in, no set of alignment techniques
can play the role of a one-size-fits-all solution. Values vary and are constantly renegotiated within societies and com-
munities across time. Furthermore, it is virtually impossible to build training datasets, including for RLxF techniques,
that can capture this variety, and cover all the contexts in which safety and ethical considerations are relevant to hu-
man activity. The distribution tail is indefinitely long, and nonetheless crucially important. Technology-first proposals
for value alignment, such as RLxF, tend to neglect the role of democratic institutions in ethical deliberation through
law and policy [Gansky and McDonald, 2022], falling into what Selbst et al. [2019] call the ‘framing trap’, wherein
fundamentally sociotechnical problems are reduced to a narrow technical scope.

5.2 Toward an Integrated Approach to Safe and Ethical AI Design

If we aim to deploy safe, ethical AI systems, including LLMs, then the narrow engineering approach that RLxF
exemplifies must be broadened to include the notion of safety as instantiated through a sociotechnical and systemic
approach. Similar suggestions have been made [Casper et al., 2023], as current approaches suffer from a narrow focus
on purely technical interventions [Raji and Dobbe, 2020; Selbst et al., 2019]. A broader sociotechnical systems view
of LLMs suggests that safety criteria and ethical assessments need to be situated, deliberated, and negotiated in the
context of use, and span all layers of the sociotechnical system, including through organisational and institutional
interventions [Nouws et al., 2023; Aler Tubella et al., 2023; Dobbe and Wolters, 2024].
In the short term, such non-technical measures should aim at limiting the ways we use current-day generative AI
systems, for which crucial requirements around safety or other normative notions cannot be guaranteed. In this light,
it is worrying that policy makers are embracing the term ‘frontier model’, referring to “highly capable foundation
model[s] that could exhibit dangerous capabilities, [including] significant physical harm or the disruption of key societal functions on a global scale” [Anderljung et al., 2023]. Normalising flawed models as ‘frontier’ in policy promotes safety-washing and instils safety hazards in many more contexts [Dobbe, 2023], especially since the general public
ends up being the safety testers of these ‘frontier’ models.
In the longer run, adhering to system safety would suggest fundamentally different AI system design and feed-
back mechanisms for technological governance. A broader treatment of system safety for AI can be found in
some recent articles exploring the relevance of its historical lessons [Dobbe, 2022] and its applicability to modern AI
systems [Rismani et al., 2023b; Rismani et al., 2023a], as well as in the seminal work of Leveson [Leveson, 2012;
Leveson, 2023].
It is equally important to build safety-oriented scholarship that is open to the normative and political dimensions of
safeguarding technological systems. Often, safety requirements are necessary but not clearly articulated, deliberated or
negotiated with the proper actors. Operationalising any notion of safety for AI requires deliberation about the politics
of development, as well as the context of deployment [Dobbe et al., 2021]. As such, moving the field of AI safety
forward will require scholars to reflect on these issues and engage more explicitly with AI ethics and AI governance,
as well as with the actors directly or indirectly involved in or affected by the use cases of the technology.
Still, new research challenges lie ahead; taking system safety seriously means that we have to curb the curse of
flexibility. In order to eliminate or at least reduce the inherent safety limitations of overly complex software, we
need to stop building or relying on large-scale general-purpose models. Instead, the field should prioritise smaller, limited-purpose models and architectures that are more amenable to proper requirement engineering, that can cater to local needs and contexts, and that require significantly fewer computational resources, with smaller associated ecological footprints [Rakova and Dobbe, 2023].

6 Conclusion

In this paper, we challenge the claims made around the use of RLxF and the 3Hs for achieving AI safety and align-
ment. Taking a sociotechnical perspective, we critique both the theoretical and practical elements of the approach,
emphasising its limitations, inherent tensions, and contradictions.
While RLxF may be good for reinforcing anthropomorphic behaviour in LLMs, such fine-tuning techniques do not
lead to increased system safety or ethical AI. In fact, they open up new problems, as increasing the human-likeness of
LLM outputs may have ethically questionable downstream effects.
Simple may indeed be memorable, but focusing on the 3Hs fails to encapsulate most of what is needed for building
safe and ethical LLMs, and AI systems more generally. Beneath the thrust of RLxF techniques, there seems to lie
an oversimplification of the complexities of human diversity, behaviour, values, and perspectives on ethics. A richer,
more integrative perspective on safe and ethical AI is needed, in which technical approaches are just one among the
many tools that we have at our disposal to tackle the challenges these new technologies present.

Acknowledgements

This work was partially supported by TAIGA – Centre for Transdisciplinary AI under the CELS AI microproject grant
of 2023. Additionally, RD was (partially) funded by the Hybrid Intelligence Center, a 10-year programme funded by
the Dutch Ministry of Education, Culture and Science through the Netherlands Organisation for Scientific Research.

References

[Alabed et al., 2022] Amani Alabed, Ana Javornik, and Diana Gregory-Smith. Ai anthropomorphism and its effect
on users’ self-congruence and self–ai integration: A theoretical framework and research agenda. Technological
Forecasting and Social Change, 182:121786, 2022.
[Aler Tubella et al., 2023] Andrea Aler Tubella, Dimitri Coelho Mollo, Adam Dahlgren Lindström, Hannah Devin-
ney, Virginia Dignum, Petter Ericson, Anna Jonsson, Timotheus Kampik, Tom Lenaerts, Julian Alfredo Mendez,
and Juan Carlos Nieves. Acrocpolis: A descriptive framework for making sense of fairness. In 2023 ACM Confer-
ence on Fairness, Accountability, and Transparency, FAccT ’23. ACM, June 2023.
[Anderljung et al., 2023] Markus Anderljung, Joslyn Barnhart, Anton Korinek, Jade Leung, Cullen O’Keefe, et al. Frontier AI regulation: Managing emerging risks to public safety. arXiv:2307.03718, 2023.
[Askell et al., 2021] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, et al. A general language assistant as a laboratory for alignment. arXiv:2112.00861, 2021.
[Atari et al., 2023] Mohammad Atari, Mona J Xue, Peter S Park, Damián Blasi, and Joseph Henrich. Which humans? PsyArXiv:5b26t, 2023.
[Bai et al., 2022a] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn
Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforce-
ment learning from human feedback. arXiv:2204.05862, 2022.
[Bai et al., 2022b] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones,
Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai
feedback. arXiv:2212.08073, 2022.
[Bender et al., 2021] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the
Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference
on Fairness, Accountability, and Transparency, FAccT ’21, pages 610–623, New York, NY, USA, March 2021.
Association for Computing Machinery.
[Casper et al., 2023] Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, et al. Open problems and fundamental limitations of reinforcement learning from human feedback. Transactions on Machine Learning Research, 2023.
[Christiano et al., 2017] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei.
Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30,
2017.
[Cui et al., 2023] Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan
Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. arXiv:2310.01377,
2023.
[Dinan et al., 2021] Emily Dinan, Gavin Abercrombie, A Stevie Bergman, Shannon Spruit, Dirk Hovy, Y-Lan
Boureau, and Verena Rieser. Anticipating safety issues in e2e conversational ai: Framework and tooling.
arXiv:2107.03451, 2021.
[Dobbe and Wolters, 2024] Roel Dobbe and Anouk Wolters. Toward Sociotechnical AI and MLOps: Mapping Vul-
nerabilities for Machine Learning in Context. Minds and Machines, (accepted), 2024.
[Dobbe et al., 2021] Roel Dobbe, Thomas Krendl Gilbert, and Yonatan Mintz. Hard choices in artificial intelligence.
Artificial Intelligence, 300:103555, November 2021.
[Dobbe, 2022] RIJ Dobbe. System safety and artificial intelligence. In The Oxford Handbook of AI Governance.
Oxford University Press, 2022.
[Dobbe, 2023] Roel Dobbe. ‘Safety Washing’ at the AI Safety Summit, November 2023.
[Dzieza, 2023] Josh Dzieza. Ai is a lot of work. The Verge, June 2023.
[Franceschi-Bicchierai, 2023] Lorenzo Franceschi-Bicchierai. Jailbreak tricks discord’s new chatbot into sharing na-
palm and meth instructions. TechCrunch, April 2023.
[Gansky and McDonald, 2022] Ben Gansky and Sean McDonald. CounterFAccTual: How FAccT undermines its
organizing principles. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency,
pages 1982–1992, 2022.
[Goldberg, 2019] Yoav Goldberg. Assessing bert’s syntactic abilities. arXiv:1901.05287, 2019.
[Grice, 1975] Herbert P Grice. Logic and conversation. In Speech Acts, pages 41–58. Brill, 1975.
[Helm et al., 2024] Paula Helm, Gábor Bella, Gertraud Koch, and Fausto Giunchiglia. Diversity and language tech-
nology: how language modeling bias causes epistemic injustice. Ethics and Information Technology, 26(1), January
2024.
[Jawahar et al., 2019] Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. What does bert learn about the structure of
language? In ACL 2019-57th Annual Meeting of the Association for Computational Linguistics, 2019.
[Ji et al., 2024] Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun,
Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference
dataset. Advances in Neural Information Processing Systems, 36, 2024.
[Kenton and Toutanova, 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
[Kirk et al., 2023a] Hannah Kirk, Andrew Bean, Bertie Vidgen, Paul Rottger, and Scott Hale. The Past, Present and
Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values. In
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2409–2430,
Singapore, 2023. Association for Computational Linguistics.
[Kirk et al., 2023b] Hannah Kirk, Bertie Vidgen, Paul Rottger, and Scott Hale. The Empty Signifier Problem: To-
wards Clearer Paradigms for Operationalising ”Alignment” in Large Language Models. In Socially Responsible
Language Modelling Research, 2023.
[Kirk et al., 2024a] Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, and Scott A. Hale. The benefits, risks and bounds
of personalizing the alignment of large language models to individuals. Nature Machine Intelligence, 6(4):383–392,
April 2024.
[Kirk et al., 2024b] Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Ed-
ward Grefenstette, and Roberta Raileanu. Understanding the effects of rlhf on llm generalisation and diversity.
arXiv:2310.06452, 2024.
[Krause et al., 2023] Lea Krause, Wondimagegnhue Tufa, Selene Baez Santamaria, Angel Daza, Urja Khurana, and
Piek Vossen. Confidently wrong: Exploring the calibration and expression of (un)certainty of large language
models in a multilingual setting. In Proceedings of the Workshop on Multimodal, Multilingual Natural Language
Generation and Multilingual WebNLG Challenge (MM-NLG 2023), pages 1–9, Prague, Czech Republic, September
2023.
[Lambert et al., 2023] Nathan Lambert, Thomas Krendl Gilbert, and Tom Zick. Entangled preferences: The history
and risks of reinforcement learning and human feedback. arXiv:2310.13595, 2023.
[Lee et al., 2023] Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor
Carbune, and Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback.
arXiv:2309.00267, 2023.
[Leveson, 2012] Nancy G. Leveson. Engineering a Safer World: Systems Thinking Applied to Safety. MIT Press,
Cambridge, MA, USA, 2012.
[Leveson, 2023] Nancy G. Leveson. An Introduction to System Safety Engineering. MIT Press, 2023.
[Liu et al., 2023] Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor
Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy LLMs: a survey and guideline for evaluating large
language models’ alignment. In Socially Responsible Language Modelling Research, 2023.
[Liu et al., 2024] Ryan Liu, Theodore R. Sumers, Ishita Dasgupta, and Thomas L. Griffiths. How do large language
models navigate conflicts between honesty and helpfulness? arXiv:2402.07282, 2024.
[Miceli and Posada, 2022] Milagros Miceli and Julian Posada. The data-production dispositif. Proceedings of the
ACM on Human-Computer Interaction, 6(CSCW2):1–37, 2022.
[Mozes et al., 2023] Maximilian Mozes, Xuanli He, Bennett Kleinberg, and Lewis D Griffin. Use of llms for illicit
purposes: Threats, prevention measures, and vulnerabilities. arXiv:2308.12833, 2023.
[Narayanan et al., 2023] Arvind Narayanan, Sayash Kapoor, and Seth Lazar. Model alignment protects against accidental harms, not intentional ones. AI Snake Oil Blog, December 2023.
[Nouws et al., 2023] Sem Nouws, Íñigo Martinez De Rituerto De Troya, Roel Dobbe, and Marijn Janssen. Diagnosing
and addressing emergent harms in the design process of public AI and algorithmic systems. In Proceedings of the
24th Annual International Conference on Digital Government Research, pages 679–681, 2023.
[Ouyang et al., 2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin,
Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions
with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[Park et al., 2023] Peter S Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks. Ai deception:
A survey of examples, risks, and potential solutions. arXiv:2308.14752, 2023.
[Perez et al., 2023] Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, et al. Discovering
language model behaviors with model-written evaluations. In Anna Rogers, Jordan Boyd-Graber, and Naoaki
Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 13387–13434,
Toronto, Canada, July 2023. Association for Computational Linguistics.
[Raji and Dobbe, 2020] ID Raji and R Dobbe. Concrete problems in ai safety, revisited. In ICLR workshop on ML in
the real world, 2020.
[Rakova and Dobbe, 2023] Bogdana Rakova and Roel Dobbe. Algorithms as Social-Ecological-Technological Sys-
tems: an Environmental Justice Lens on Algorithmic Audits. In 2023 ACM Conference on Fairness, Accountability,
and Transparency, FAccT ’23, Chicago, IL, USA, 2023. Association for Computing Machinery.
[Rismani et al., 2023a] Shalaleh Rismani, Renee Shelby, Andrew Smart, Renelito Delos Santos, AJung Moon, and
Negar Rostamzadeh. Beyond the ML Model: Applying Safety Engineering Frameworks to Text-to-Image Devel-
opment. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 70–83, Montréal, QC, Canada, August 2023. ACM.
[Rismani et al., 2023b] Shalaleh Rismani, Renee Shelby, Andrew Smart, Edgar Jatho, Joshua Kroll, AJung Moon, and
Negar Rostamzadeh. From Plane Crashes to Algorithmic Harm: Applicability of Safety Engineering Frameworks
for Responsible ML. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages
1–18, Hamburg Germany, April 2023. ACM.
[Schulman et al., 2015] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-
dimensional continuous control using generalized advantage estimation. arXiv:1506.02438, 2015.
[Selbst et al., 2019] Andrew D Selbst, Danah Boyd, Sorelle A Friedler, Suresh Venkatasubramanian, and Janet Vertesi.
Fairness and abstraction in sociotechnical systems. In Proceedings of the Conference on Fairness, Accountability,
and Transparency, pages 59–68, 2019.
[Sharma et al., 2024] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R.
Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, Shauna M Kravec, Timothy Maxwell, Sam Mc-
Candlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards
understanding sycophancy in language models. In The Twelfth International Conference on Learning Representa-
tions, 2024.
[Shelby et al., 2023] Renee Shelby, Shalaleh Rismani, Kathryn Henne, AJung Moon, Negar Rostamzadeh, Paul
Nicholas, N’Mah Yilla-Akbari, Jess Gallegos, Andrew Smart, Emilio Garcia, and Gurleen Virk. Sociotechnical
harms of algorithmic systems: Scoping a taxonomy for harm reduction. In Proceedings of the 2023 AAAI/ACM
Conference on AI, Ethics, and Society, AIES ’23, page 723–741, New York, NY, USA, 2023. Association for
Computing Machinery.
[Song et al., 2024] Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang.
Preference ranking optimization for human alignment. In Proceedings of the Thirty-Eighth AAAI Conference on
Artificial Intelligence (AAAI-24), 2024.
[Turpin et al., 2024] Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always
say what they think: unfaithful explanations in chain-of-thought prompting. Advances in Neural Information
Processing Systems, 36, 2024.
[Uysal et al., 2023] Ertugrul Uysal, Sascha Alavi, and Valéry Bezençon. Anthropomorphism in Artificial Intelligence:
A Review of Empirical Work Across Domains and Insights for Future Research, pages 273–308. Emerald Publishing
Limited, March 2023.
[Vaassen, 2022] Bram Vaassen. AI, opacity, and personal autonomy. Philosophy & Technology, 35(4), September
2022.
[Weidinger et al., 2021] Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen
Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins,
Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac,
Sean Legassick, Geoffrey Irving, and Iason Gabriel. Ethical and social risks of harm from language models.
arXiv:2112.04359, 2021.
[Weizenbaum, 1977] Joseph Weizenbaum. Computer Power and Human Reason: From Judgment to Calculation. W.
H. Freeman & Co., USA, 1st edition, 1977.
[Zhuo et al., 2023] TY Zhuo, Y Huang, C Chen, and Z Xing. Red teaming chatgpt via jailbreaking: Bias, robustness,
reliability and toxicity. arXiv:2301.12867, 2023.