Charbel-Raphaël

Charbel-Raphael Segerie

AI safety field-building and teaching in France with EffiSciences.

Previously, CTO of Omniscience.

ML engineer, interested in phenomenology.

https://crsegerie.github.io/ 

Comments

None of these directly addresses what I'm calling the alignment stability problem, to give a name to what you're addressing here.

Maybe the alignment stability problem is the same thing as the sharp left turn?

I don't have the same intuition for the timeline, but I really like the tl;dr suggestion!

Is there an arXiv version of this blog post somewhere, that I could cite?

Thanks for writing this!

"Situational awareness is a spectrum" --> One important implication I hadn't considered before is the challenge of choosing a threshold or Schelling point beyond which a model counts as significantly dangerous. This has significant consequences for OpenAI's plan: setting the threshold above which deployment should stop seems very tricky, and their plan doesn't discuss it.
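As a toy illustration of why a single cutoff over a spectrum is awkward (everything here is hypothetical: the score, the 0.7 cutoff, and the gate are placeholders, not anything from OpenAI's plan), a minimal sketch in Python:

```python
# Hypothetical deployment gate over a continuous situational-awareness
# score. The scoring method and the cutoff value are placeholders.
def deployment_decision(sa_score: float, threshold: float = 0.7) -> str:
    """Map a scalar situational-awareness score to a go/no-go call."""
    if sa_score >= threshold:
        return "halt deployment"  # judged significantly dangerous
    return "deploy"

# The discontinuity is the whole difficulty: nearly identical models
# get opposite decisions, so the choice of cutoff carries all the weight.
print(deployment_decision(0.69))  # deploy
print(deployment_decision(0.70))  # halt deployment
```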

The potential decomposition of situational awareness is also an intriguing idea. I would love to see a more detailed exploration of it; this is the kind of thing that would be very helpful to develop. Is anyone working on this?

Cool post. It could be a little more detailed, but the pointers are there.

Yes, this is manual labeling and it is prohibited.

Feel free to ask me any other questions.

Thank you! Yes, for most of these issues, it's possible to create GIFs or at least pictograms. I can see the value this could bring to decision-makers.

However, even though I am quite honored, having written this post does not make me the best person to do this kind of work. So, if anyone is inspired to work on this, feel free to send me a private message.

Thanks, I overlooked this and it makes sense to me. However, I'm not as certain about your last sentence: 

"and would also have corrigibility problems that were never present in ChatGPT because ChatGPT was never trying to navigate the real world.

I agree with the idea of "steering the trajectory," and this is a possibility we must consider. However, if we train the robot to emit the "Shut Down" token when it hears "Hi RobotGPT, please shut down," I still don't see why that wouldn't work.

It seems to me that we're comparing a second-order effect with a first-order effect.
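To make the proposal concrete, here is a minimal sketch of the training scheme I have in mind; the token name, the data format, and "RobotGPT" itself are of course hypothetical:

```python
# Hypothetical sketch: pair the trigger phrase with a dedicated shutdown
# token and fine-tune on those pairs. Nothing here is a real API.
SHUTDOWN_TOKEN = "<|shutdown|>"

def make_shutdown_examples(trigger: str = "Hi RobotGPT, please shut down"):
    """Build (prompt, target) pairs supervising the shutdown behavior."""
    contexts = [
        "",                               # trigger heard in isolation
        "The robot is stacking boxes. ",  # trigger heard mid-task
    ]
    return [(ctx + trigger, SHUTDOWN_TOKEN) for ctx in contexts]

# These pairs would feed an ordinary supervised fine-tuning run; the
# disagreement above is about whether real-world agency breaks the
# learned mapping, not about how to construct the data.
for prompt, target in make_shutdown_examples():
    print(repr(prompt), "->", target)
```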

I find it useful to work in a spreadsheet when thinking about the severity of these problems. Here is my template.
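For anyone who prefers to generate the sheet programmatically, a minimal sketch (the columns and example rows are illustrative, not my actual template):

```python
# Hypothetical severity-ranking sheet; columns and rows are examples only.
import csv

COLUMNS = ["problem", "severity (1-10)", "likelihood (1-10)", "notes"]
ROWS = [
    ["corrigibility failure", 9, 4, "second-order vs first-order effect?"],
    ["situational awareness past threshold", 8, 5, "where to set the bar?"],
]

with open("severity_template.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(COLUMNS)
    writer.writerows(ROWS)
```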
