Learning from negative feedback, or positive feedback or both

Abdolmaleki, Abbas; Piot, Bilal; Shahriari, Bobak; Springenberg, Jost Tobias; Hertweck, Tim; Joshi, Rishabh; Oh, Junhyuk; Bloesch, Michael; Lampe, Thomas; Heess, Nicolas; Buchli, Jonas; Riedmiller, Martin

arXiv:2410.04166 (cs)

[Submitted on 5 Oct 2024 (v1), last revised 7 Mar 2025 (this version, v3)]

Title: Learning from negative feedback, or positive feedback or both

Authors: Abbas Abdolmaleki, Bilal Piot, Bobak Shahriari, Jost Tobias Springenberg, Tim Hertweck, Rishabh Joshi, Junhyuk Oh, Michael Bloesch, Thomas Lampe, Nicolas Heess, Jonas Buchli, Martin Riedmiller

Abstract: Existing preference optimization methods often assume that paired preference feedback (preferred/positive vs. dis-preferred/negative examples) is available. This requirement limits their applicability in scenarios where only unpaired feedback (for example, only positive or only negative examples) is available. To address this, we introduce a novel approach that decouples learning from positive and negative feedback. This decoupling enables control over the influence of each feedback type and, importantly, allows learning even when only one feedback type is present. A key contribution is demonstrating stable learning from negative feedback alone, a capability not well addressed by current methods. Our approach builds upon the probabilistic framework introduced by Dayan and Hinton (1997), which uses expectation-maximization (EM) to directly optimize the probability of positive outcomes (as opposed to classic expected reward maximization). We address a key limitation in current EM-based methods: they solely maximize the likelihood of positive examples while neglecting negative ones. We show how to extend EM algorithms to explicitly incorporate negative examples, leading to a theoretically grounded algorithm that offers an intuitive and versatile way to learn from both positive and negative feedback. We evaluate our approach for training language models based on human feedback as well as for training policies for sequential decision-making problems, where learned value functions are available.
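
To make the idea of decoupled feedback concrete, the following is a minimal sketch on a toy categorical policy. It is not the paper's algorithm or objective, which the abstract does not spell out. It assumes a hypothetical loss that raises the log-likelihood of positive examples (weight `alpha`), lowers the log-likelihood of negative examples (weight `1 - alpha`), and adds a KL penalty toward a reference policy (weight `beta`). All function names, parameters, and values are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_score(actions, p, j):
    """Mean over `actions` of d log p[a] / d logit_j = 1[a == j] - p[j]."""
    if not actions:
        return 0.0
    return sum((1.0 if a == j else 0.0) - p[j] for a in actions) / len(actions)


def decoupled_grad(logits, ref_logits, positives, negatives, alpha, beta):
    """Gradient of an ASSUMED decoupled loss with respect to the logits:

        loss = -alpha * mean_pos log p(a)
               + (1 - alpha) * mean_neg log p(a)
               + beta * KL(ref || p)

    This loss is a hypothetical illustration, not taken from the paper.
    """
    p = softmax(logits)
    q = softmax(ref_logits)
    grad = []
    for j in range(len(logits)):
        pos = mean_score(positives, p, j)
        neg = mean_score(negatives, p, j)
        kl = p[j] - q[j]  # d KL(q || p) / d logit_j
        grad.append(-alpha * pos + (1.0 - alpha) * neg + beta * kl)
    return grad


def train(ref_logits, positives, negatives, alpha, beta, lr=0.5, steps=300):
    """Plain gradient descent on the logits, starting from the reference."""
    logits = list(ref_logits)
    for _ in range(steps):
        g = decoupled_grad(logits, ref_logits, positives, negatives, alpha, beta)
        logits = [z - lr * gi for z, gi in zip(logits, g)]
    return softmax(logits)


# Negative feedback only: action 0 is marked as dis-preferred.
neg_only = train([0.0, 0.0, 0.0], positives=[], negatives=[0], alpha=0.0, beta=4.0)

# Positive feedback only: action 1 is marked as preferred.
pos_only = train([0.0, 0.0, 0.0], positives=[1], negatives=[], alpha=1.0, beta=4.0)
```

In this toy setting, the negative-only run pushes the probability of action 0 below its reference value of 1/3, and the KL term keeps the policy from collapsing. In this particular sketch the KL weight has to be large enough relative to the negative weight (here `beta * q[0] > 1 - alpha`), otherwise the assumed loss is unbounded below.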

Subjects:	Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2410.04166 [cs.LG]
	(or arXiv:2410.04166v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2410.04166

Submission history

From: Abbas Abdolmaleki
[v1] Sat, 5 Oct 2024 14:04:03 UTC (5,747 KB)
[v2] Thu, 6 Mar 2025 15:11:57 UTC (6,981 KB)
[v3] Fri, 7 Mar 2025 10:51:04 UTC (6,966 KB)
