Jared Kaplan — LessWrong

There's a direction (which I imagine you and others have considered) where you replace some activations within your AI with natural language, so that, e.g., certain layers can (heuristically) only communicate with the next layer in natural language.

Then you heavily regularize in various ways. You'd require the language to be fully understandable and transparent, perhaps requiring that counterfactual changes to inputs lead to sensible changes to outputs within subsystems, etc. You'd have humans verify the language was relevant, meaningful, & concise, train AIs...

Prizes for ELK proposals

Jared Kaplan 4y Ω330

Thanks, yeah, I meant that I was interested in a solution that would scale to arbitrarily superhuman AI capabilities with a "mere" capabilities hit/cost (perhaps a very large cost that grows with AI capability, but that does not impose a bound on the ultimate capability of the aligned system). So this was a useful clarification for me in terms of understanding your perspective. I may be wrong, but I could imagine it might be useful to lead with this a bit more, i.e., "we don't know of and would be very interested in solutions that might be extremely costly but that avoid all counter-examples". Possibly you already say this and I just missed it.

Prizes for ELK proposals

Jared Kaplan 4y Ω470

Apologies for a possibly naive comment/question; perhaps this has been discussed elsewhere and you can just direct me there. But anyway...

I would find it helpful to see a strategy that ARC believes does in fact solve ELK, but fails only because it requires taking an unacceptably large capabilities hit. I would find this helpful for several reasons, namely:

(1) it would help me to understand what kinds of strategies you believe really do escape counter-examples,
(2) it would give me a better sense of how optimistic to be about the appr...

Visible Thoughts Project and Bounty Announcement

Jared Kaplan 5y Ω11230

I think this is an interesting project, and one that (from a very different angle) I’ve spent a bit of time on, so here are a few notes on that, followed by a few suggestions. Stella, in another comment, made several great points that I agree with and that are similar in spirit to my suggestions.

Anyway, based on a fairly similar motivation of wanting to be able to “ask a LM what it’s actually thinking/expecting”, combined with the general tendency to want to do the simplest and cheapest thing possible first… and then try to make it even simpler still befor...