Lukas Fluri

Zurich AI Safety is hiring a Director

Summary: Zurich AI Safety (ZAIS) is hiring a Director, with the goal of professionalizing AI Safety capacity building in Zurich and Switzerland. The Director will lead individualized career advising for promising talent, expand outreach to mid- and senior-level professionals, manage ZAIS's growing volunteer team, and explore AI Safety opportunities in...

Mar 3

Is the evidence in "Language Models Learn to Mislead Humans via RLHF" valid?

Abstract: Language Models Learn to Mislead Humans Via RLHF (published at ICLR 2025) argues that RLHF can unintentionally train models to mislead humans – a phenomenon termed Unintentional-SOPHISTRY. However, our review of the paper's code and experiments suggests that a significant portion of their empirical findings may be due largely...

Dec 1, 2025

Zurich AI Safety is looking for (Co-)Directors - EOI

TL;DR: Zurich AI Safety (ZAIS) is looking for a Director or Co-Directors. As of now, ZAIS is purely volunteer-run and is applying for funding for this position in collaboration with EA Switzerland. Therefore, this is an expression of interest (EOI) rather than a secured hiring, but promising candidates will be...

Sep 3, 2025

The Perils of Optimizing Learned Reward Functions

This post contains a summary of our paper, which will be presented at ICML 2025. Feel free to visit me (Lukas) at our poster stand to chat about our work. More info about the time and location can be found here. TL;DR: Reward learning techniques like RLHF are still the...

Jul 11, 2025

Evaluating Superhuman Models with Consistency Checks

Consider the following two questions: Is this move good or bad? Is this forecast accurate? In both cases, the ground truth is not known to us humans. Furthermore, in both cases there either already exist superhuman AI systems (as in the case of chess), or researchers are actively working to...

Aug 1, 2023

Open Problems in Negative Side Effect Minimization

Acknowledgments: We want to thank Stuart Armstrong, Remmelt Ellen, David Lindner, Michal Pokorny, Achyuta Rajaram, Adam Shimi, and Alex Turner for helpful discussions and valuable feedback on earlier drafts of this post. Fabian Schimpf and Lukas Fluri are part of this year’s edition of the AI Safety Camp. Our gratitude...

May 6, 2022
