Raphaël S

AI safety field building/teaching in France.

Previously, CTO of omniscience.

ML engineer, interested in phenomenology.



Yes, this is manual labeling and it is prohibited.

Feel free to ask me any other questions.

Thank you! Yes, for most of these issues, it's possible to create GIFs or at least pictograms. I can see the value this could bring to decision-makers.

However, even though I am quite honored, it's not because I wrote this post that I am the best person to do this kind of work. So, if anyone is inspired to work on this, feel free to send me a private message.

Thanks, I overlooked this and it makes sense to me. However, I'm not as certain about your last sentence: 

"and would also have corrigibility problems that were never present in ChatGPT because ChatGPT was never trying to navigate the real world."

I agree with the idea of "steering the trajectory," and this is a possibility we must consider. However, if we train the robot to use the "Shut Down" token when it hears "Hi RobotGPT, please shut down," I don't see why it wouldn't work.

It seems to me that we're comparing a second-order effect with a first-order effect.

I find it useful to work on a spreadsheet to think about the severity of these problems. Here is my template.

RLHF gives rewards which can withstand more optimization before producing unintended

Do you have a link for that please?

I don't see how RLHF could be framed as some kind of advance on the problem of outer alignment. 


RLHF solves the “backflip” alignment problem: if you try to hand-write a function for training an agent to do a backflip, the agent will crash. But with RLHF, you obtain a nice backflip.

Yes. This image is only a classifier. There is no mesa-optimizer here, so we only have a capability robustness problem.

Task restriction. The observation that diverse environments seem to increase the probability of mesa-optimization

Where does this observation come from?
