Let’s pretend I have a semi-rigorous model that lays out why RLHF is doomed to fail and why it negatively affects model performance.

Let’s go further into la-la land and pretend that I have an architectural plan that does much better: it is very transparent, steerable, and corrigible, and it can be deployed and used without changing or retraining the base LLM.

There are some downsides: it requires more compute at inference time, it is not provably bulletproof, it likely breaks in the SI regime, and it definitely breaks under self-improvement (so it is very definitely NOT an alignment proposal).

In the short term this looks beneficial, but it also looks like it would shorten timelines, and it is extremely unlikely to advance the AI safety field (in the direction of what we ultimately want and need).

What should I do, if I ever happened to be in such a situation?

  • Prototype it and give limited access with the express purpose of breaking stuff (black box, absolutely no architectural information provided).

  • Write it up and publish.

  • Forget about it. Smarter people must have already thought of it, and since it’s not a thing, I am clearly wrong.

  • Forget about it. It only helps capabilities.



I endorse the "overly galaxy brained strategy." If you actually understand why it's not useful even as a step towards some other alignment scheme that works for superintelligence, you should just drop it and think about other things.

However, usually things aren't so cut and dried. In the course of arriving at the epistemic state hypothesized above, it's probably a good idea to talk to some other safety researchers.

Generally if you think of something that's super useful for present-day systems, it's related to ideas that are useful for future systems. In that case, I endorse attempting to study your idea for its safety properties for a while and then eventually publishing (preferably just in time to scoop people in industry who are thinking about similar things :P ).

My hypothetical self thanks you for your input and has punted the issues to the real me.

I feel like I need to dig a little bit into this:

If you actually understand why it's not useful

Honestly, I don't know for sure that I do. How can I, when everything is so ill-defined and we have so few scraps of solid fact to base things on?

That said, there are a couple of issues, and the major one is grounding, or rather the lack of it.

Grounding is IMO a core problem, although people rarely talk about it. I think that mainly comes about because we (humans) seemingly have sol...