I understand that a reward maximiser would wire-head (take control over the reward provision mechanism), but I don’t see why training an RL agent would necessarily end up in a reward-maximising agent? Turntrout’s Reward is Not the Optimisation Target shed some clarity on this, but I definitely have remaining questions.

Leo Gao's Toward Deconfusing Wireheading and Reward Maximization sheds some light on this.

Kyle O’Brien (3y)

I agree with this suggestion. EleutherAI's alignment channels have been invaluable for my understanding of the alignment problem. I typically get insightful responses and explanations on the same day as posting. I've also been able to answer other folks' questions to deepen my inside view.

There is an alignment-beginners channel and an alignment-general channel. Your questions seem similar to what I see in alignment-general. For example, I received helpful answers when I asked this question about inverse reinforcement learning there yesterday.

Question: When I read Human Compatible a while back, I had the takeaway that Stuart Russell was very bullish on Inverse Reinforcement Learning being an important alignment research direction. However, I don’t see much mention of IRL on EleutherAI and the alignment forum. I see much more content about RLHF. Are IRL and RLHF the same thing? If not, what are folks’ thoughts on IRL?

More from JanB

Hey, this is me. I’d like to understand AI X-risk better. Is anyone interested in being my “alignment tutor”, for maybe 1 h per week, or 1 h every two weeks? I’m happy to pay.

Fields I want to understand better:

Anything related to prosaic AI alignment/existential ML safety
Failure stories/threat models

Fields I’m not interested in (right now):

agent foundations
decision theory
other very mathsy stuff that’s not related to ML

My level of understanding:

I have a decent knowledge of ML/deep learning (I’m in the last year of my PhD)
I haven’t done the AGI Safety Fundamentals course, but I just skimmed it, and I think I had independently read essentially all the core readings (which means I probably have also read many things not on the curriculum). I’d say I have a relatively deep understanding of a majority (but not all) of the content in this curriculum.
Similarly for the AGI Safety Fundamentals 201, excluding the tracks

Example questions I wrestled with recently and might bring up during tutoring:

It seems to me that our current level of outer alignment tools (RLHF + easy augmentation) is enough to solve the outer alignment problem sufficiently well that humans don’t end up dead or disempowered (conditional on slow takeoff); and then we can solve further outer alignment problems as they come up, with iteration and regulation. So I basically think that the core of the alignment problem, at the moment, is inner alignment + deceptive alignment. What am I missing? (I read Christiano’s “Another Outer Alignment Failure Story”, but I still have this question.)
I understand that a reward maximiser would wire-head (take control over the reward provision mechanism), but I don’t see why training an RL agent would necessarily end up in a reward-maximising agent? Turntrout’s Reward is Not the Optimisation Target shed some clarity on this, but I definitely have remaining questions.
Is the failure mode described in Ajeya’s Without Specific Countermeasures an inner alignment failure or an outer alignment failure? (I think it’s both.)

You don’t need to have very crisp answers to these to be my tutor, but you should probably have at least some good thoughts.