Jared Kaplan
Jared Kaplan has not written any posts yet.

Thanks, yeah, I meant that I was interested in a solution that would scale to arbitrarily superhuman AI capabilities with a "mere" capabilities hit/cost (perhaps a very large cost that grows with AI capability, but one that does not impose a bound on the ultimate capability of the aligned system). So this was a useful clarification for me in terms of understanding your perspective. I may be wrong, but I could imagine it might be useful to lead with this a bit more, i.e. "we don't know of and would be very interested in solutions that might be extremely costly but that avoid all counter-examples". Possibly you already say this and I just missed it.
Apologies for a possibly naive comment/question; perhaps this has been discussed elsewhere and you can just direct me there. But anyway...
I would find it helpful to see a strategy that ARC believes does in fact solve ELK, but fails only because it requires taking an unacceptably large capabilities hit. I would find this helpful for several reasons, namely
(1) it would help me to understand what kinds of strategies you believe really do escape counter-examples,
(2) it would give me a better sense of how optimistic to be about the approach, since it's often easier to start from an inefficient solution and make it more efficient than it is to find an inefficient solution in the first place, and/or
(3) if you have trouble identifying such a solution, then it would suggest to me that finding one might be a useful research direction.
I think this is an interesting project, and one that (from a very different angle) I’ve spent a bit of time on, so here are a few notes on that, followed by a few suggestions. Stella, in another comment, made several great points that I agree with and that are similar in spirit to my suggestions.
Anyway, based on a fairly similar motivation of wanting to be able to “ask a LM what it’s actually thinking/expecting”, combined with the general tendency to want to do the simplest and cheapest thing possible first… and then try to make it even simpler still before starting… we’ve experimented with including metadata in language pretraining data. Most...
There's a direction (which I imagine you and others have considered) where you replace some activations within your AI with natural language, so that, e.g., certain layers can heuristically only communicate with the next layer in NL.
Then you heavily regularize in various ways. You'd require the language to be fully understandable and transparent, perhaps requiring that counter-factual changes to inputs lead to sensible changes to outputs within subsystems, etc. You'd have humans verify the language was relevant, meaningful, & concise, train AIs to do this verification at larger scale, do some adversarial training, etc. You could also train sub-human level AIs to paraphrase the language that's used and restate it between layers,...