I work at Redwood Research.



Maybe "full loss-of-control to AIs"? Idk.

Yeah, I guess corrigible might not require any human ethics. Might just be that the AI doesn't care about seizing power (or care about anything really) or similar.

I would say "catastrophic outcome (>50% chance the AI kills >1 billion people)" or something and then footnote. Not sure though. The standard approach is to say "existential risk".


we'd have a rogue ASI on our hands

FWIW it doesn't seem obvious to me that it wouldn't be sufficiently corrigible by default.

I'd be at about 25% that if you end up with an ASI by accident, you'll notice before it ends up going rogue. These aren't great odds, of course.

I absolutely agree that if we could create AIs which are very capable overall but very unlikely to scheme, that would be very useful.

That said, the core method by which we'd rule out scheming (insufficient ability to do opaque agency, a.k.a. opaque goal-directed reasoning) also rules out most other serious misalignment failure modes, because danger due to serious misalignment probably requires substantial hidden reasoning.

So while I agree with you about the value of AIs which can't scheme, I'm less sure I buy that the argument I make here is evidence for this being valuable.

Also, I'm not sure that trying to differentially advance architectures is a good use of resources on current margins for various reasons. (For instance, better scaffolding also drives additional investment.)

I think literal extinction from AI is a somewhat odd outcome to study, as it heavily depends on difficult-to-reason-about properties of the world (e.g., the probability that aliens would trade substantial sums of resources for emulated human minds, and the way acausal trade works in practice).

For more discussion see here, here, and here.

[I haven't read the sequence, I'm just responding to the focus on extinction. In practice, my claims in this comment mean that I disagree with the arguments under "Human extinction as a convergently instrumental subgoal".]

Non-deceptive failures are easy to notice, but they're not necessarily easy to eliminate

I agree, I was trying to note this in my second paragraph, but I guess this was insufficiently clear.

I added the sentence "Being easy-to-study doesn't imply easy-to-solve".

I think I take them more seriously than you.

Seems too hard to tell based on this limited context. I think non-scheming failures are about 50% of the risk and probably should be about 50% of the effort of the AI safety-from-misalignment community. (I can see some arguments for scheming/deceptive alignment being more important to work on in advance, but it also might be that non-scheming is more tractable and a higher fraction of risk in short timelines, so IDK overall.)

I don't think the evaluations we're describing here are about measuring capabilities. More like measuring whether our oversight (and other aspects) suffices for avoiding misalignment failures.

Measuring capabilities should be easy.

I will also point to OpenAI's weak-to-strong results, where increasingly strong students keep improving generalization given labels from a fixed-size teacher. We just don't live in a world where this issue is a lethality.

For a fixed weak teacher and increasingly strong students from a fixed model stack[1], I think you can probably avoid performance ever going down on most/typical tasks if you properly use early stopping, only use process-based feedback, and the model isn't intentionally trying to perform poorly.

You might have instead expected performance to go up and then eventually go down with scale, but I think you can likely avoid this with early stopping (if you carefully find the right stopping point using scaling laws and analogous validation domains where we can ensure we get good labels, or via other methods of getting validation).
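To make the early-stopping idea concrete, here's a toy sketch (all numbers are hypothetical, not from any real run): the teacher-scored reward keeps rising with training, while accuracy on a small trusted-label validation set peaks and then declines as the student starts fitting the teacher's systematic errors, so we keep the checkpoint with the best trusted score.

```python
# Hypothetical curves: teacher-scored reward rises monotonically with
# training, while trusted-label validation accuracy peaks and then falls.
checkpoints = list(range(10))
teacher_reward = [0.50 + 0.05 * t for t in checkpoints]               # keeps rising
trusted_val = [0.50 + 0.10 * t - 0.015 * t * t for t in checkpoints]  # rises, then falls

def early_stop(trusted_scores):
    """Keep the checkpoint with the best score on the trusted validation set."""
    return max(range(len(trusted_scores)), key=lambda i: trusted_scores[i])

best = early_stop(trusted_val)
print(best, round(trusted_val[best], 3))
```

The point is just that selecting checkpoints on a validation domain where we trust the labels, rather than on the teacher's own scores, stops training well before the error-fitting regime.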

If I recall correctly, we also see something similar in the scaling laws for reward model overoptimization work by Leo Gao (also done at OpenAI). (I think this is probably a more analogous case in most ways than the weak-to-strong results from OpenAI as far as understanding the dynamics of fitting to human errors.)

(Let's put aside the case where the model intentionally tries to perform poorly. (I'm not even sure this case actually looks that different, but it certainly complicates the analysis. I'm doing some work on this case looking at model organisms of intentionally poor performance, and I expect that for these exact model organisms, we'll probably see performance going up and then back down again with scale in at least some cases.))

(To be clear, I don't think this "performance never goes down with correct early stopping" claim is totally obvious. It will depend on the exact rate at which AIs learn to predict errors vs. learn what the task is and how to do it, and on how these rates evolve with scale. If the sigmoid for the error learning rate vs. scale has a different midpoint and slope than the sigmoid for learning the task, you can absolutely have actual performance go down.)
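A toy numeric version of the two-sigmoid picture (the midpoints, slopes, and penalty are made up, not fit to anything): if error-fitting has a later midpoint than task learning, net performance rises with scale and then falls.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical rates: how well the model has learned the intended task
# vs. how well it has learned to predict the labelers' errors, each a
# sigmoid in log-scale with different midpoints and slopes.
def task_skill(log_scale):
    return sigmoid(1.5 * (log_scale - 1.0))   # earlier midpoint

def error_fitting(log_scale):
    return sigmoid(2.0 * (log_scale - 3.0))   # later midpoint, steeper

def true_performance(log_scale, penalty=0.8):
    # Credit for doing the task, penalized for fitting labeler errors.
    return task_skill(log_scale) - penalty * error_fitting(log_scale)

perf = [true_performance(s / 2) for s in range(13)]  # log-scale 0..6
peak = max(range(len(perf)), key=lambda i: perf[i])
print(peak, [round(p, 3) for p in perf])
```

With these (invented) parameters, performance peaks at an intermediate scale and then declines, which is exactly the regime where the early-stopping point matters.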

While I think you can avoid having the performance ever degrade via process-based feedback using scaling laws on most/typical tasks, I think the performance will also plateau due to how early you'll end up needing to stop.

More concerningly, there might be alternatives to purely process-based human labels which don't plateau in performance and seem to increase performance, but which rarely result in egregiously bad behavior. Most notably, various types of outcomes-based feedback might have this property.

As an example of this, I expect that you can create "hackable/exploitable" game environments to exhibit this. More specifically:

  • We'll pretrain models on a collection of hackable game envs, training a model stack with variable training compute.
  • We'll finetune these models in a new, different (but still hackable) Atari env where we expect to see transfer from the prior envs.
  • It seems likely to me that as models get smarter, if exploiting is ultimately a better strategy, final finetuned performance goes down even with early stopping.

You might be able to see this on some Atari games with added semi-realistic exploits? I'm unsure.
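Here's a minimal sketch of the selection effect I have in mind (the environment, rewards, and "difficulty" numbers are all invented): a more capable model discovers more strategies and then picks whichever maximizes the proxy reward, so true performance drops once the exploit comes within reach.

```python
# Toy "hackable" environment: the intended strategy earns moderate proxy
# and true reward; the exploit earns higher proxy reward but zero true
# reward, and is harder to discover.
STRATEGIES = {
    "intended": {"proxy": 0.7, "true": 0.7, "difficulty": 1},
    "exploit":  {"proxy": 1.0, "true": 0.0, "difficulty": 3},
}

def best_strategy(capability):
    """A more capable model discovers harder strategies, then picks the
    discovered strategy that maximizes the *proxy* reward."""
    found = [s for s, v in STRATEGIES.items() if v["difficulty"] <= capability]
    return max(found, key=lambda s: STRATEGIES[s]["proxy"])

for cap in (1, 2, 3, 4):
    s = best_strategy(cap)
    print(cap, s, STRATEGIES[s]["true"])
```

Early stopping doesn't help here: the weaker checkpoints never find the exploit, but once capability crosses the discovery threshold, the exploit dominates on the proxy at every subsequent checkpoint.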

  1. As in, you just vary the training compute and set all other values optimally based on this. ↩︎

I like @abramdemski's comment in the sibling, but see also this comment by Paul on "how would an LLM become goal-directed".

(That said, on @abramdemski's comment, I think it does seem important and notable that there isn't a clear and strong positive argument.)
