I agree with this directionally, and generally tend to believe that certain kinds of straight lines will continue to be straight. But also iirc you rely a lot on the METR trend line, which is (a) noisy, (b) over a short timescale (~2 years[1]), and (c) not a great measure of what people care about[2][3]. I don't think you should strongly expect that particular line to continue to be straight over a long time period.
(Though I tend to think that it's more likely to slow down by 2027 than to speed up, relative to what you'd predict by extrapolating the 2024-2025 trend, so I agree with the overall conclusion of the post. And obviously one should have non-trivial probability on each of {slow down, stay the same, speed up}, as you say.)
I'm excluding GPT-2 and GPT-3, which seem very noisy.
Compared to things like Moore's Law, or energy efficiency of solar, or pretraining perplexity, or other similar things where we observe straight lines.
None of this is to say that the METR graph is bad! It is the best empirical evidence on timelines I'm aware of.
From the paper:
If a non-monotonic relationship exists between monitor scores and likelihood ratios, auditing the highest-scoring outputs is suboptimal. [...] This means Π_NP will audit some moderately high-scoring outputs (e.g., those with a monitor score of 7.2) before auditing certain higher-scoring outputs
I agree this is clearly better if you trust your LLRs to be 100% accurate, but ultimately the LLR estimate has to be fit on some training set, and then we are depending on it generalizing to deployment. That generalization could fail just for normal ML reasons (the deployment distribution is different, the LLR scores were noisily estimated) or because the model is scheming and adapts to your monitoring strategy.
Suppose I want to rule by fiat that for outputs O1 and O2 where O1 gets higher scores than O2, we are guaranteed to be at least as likely to audit O1 as O2. How would you suggest I go about doing that?
After writing the above I remembered you said the word "kernel", so I searched the paper for it and saw that you do mention this limitation and suggest that practitioners "increase kernel width or enforce monotonicity". So I guess my question is: how should I go about enforcing monotonicity? I still have to map two scores (m1, m2) to a decision of whether or not to audit, and it's unclear to me that I can brute force that problem; it's much easier to brute force when I have a single score (the LLR).
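For concreteness, the most naive version of "enforce monotonicity" I can imagine is something like the following sketch (entirely my guess, not something the paper specifies): estimate the LLR on a grid over the two monitor scores, replace it with its coordinatewise monotone envelope, and audit wherever the monotonized value clears a threshold.

```python
# Naive sketch (my guess at one way to enforce monotonicity; not from the paper).
# Thresholding a monotone function guarantees that if O1 scores at least as high as O2
# on both monitors and O2 is audited, then O1 is audited too.
import numpy as np

def monotone_envelope(llr_grid):
    """Smallest function >= llr_grid that is nondecreasing along both score axes."""
    env = np.maximum.accumulate(llr_grid, axis=0)  # enforce monotonicity in m1
    return np.maximum.accumulate(env, axis=1)      # then in m2

# llr_grid[i, j] = estimated log-likelihood ratio at scores (m1_grid[i], m2_grid[j])
llr_grid = np.random.randn(40, 40)                 # placeholder for the kernel-based estimate
audit_region = monotone_envelope(llr_grid) > 2.0   # audit where the monotonized LLR clears a threshold
```

But I don't know whether that's what you have in mind, or how much of the benefit of the LLR approach survives rounding the estimate up like this.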
I think Oliver's pushback in another comment is getting strongly upvoted because, given a description of your experimental setup, a bunch of people aside from you/Quintin/Steve would have assigned reasonable probability to the right answer. But I wanted to emphasize that I consider generating an experiment that turns out to be interesting (as your frame did) to be the thing that deserves most of the points.
Experiments with this setup (in the sense of getting bad behavior despite perfect labels on the training set) were also done prior to the popularization of reward-as-chisel.
Great research, thanks for doing it!
My sense is that there are a lot of caveats on this result:
How hard did you have to try to get this result? Was this just the first thing you tried, where you made everything as easy as possible to get some result? Or did you try a bunch of things, and have to relax them to get obfuscation to arise? If it's the latter, I feel like this should be overall good news for CoT monitorability.
We note that the mean pass rate of the penalty runs was 11% lower than the no penalty runs.
What are the absolute pass rates in the two cases?
(I'm also unsure if you mean 11% or 11 percentage points.)
If we use Myopic Optimization from Non-myopic Approval (MONA), then the AI will not want to engage in sneaky power-seeking behaviors, because such behaviors will not help with the short-term tasks that the AI-in-a-lab wants to do
I mainly rule out MONA for capabilities reasons; see Thoughts on “Process-Based Supervision” §5.3.
I'd make this argument even for regular outcomes-based RL. Presumably, whenever you are providing rewards (whether in brain-like AGI or RLHF for LLMs), they are based on actions that the AI took over some bounded time in the past; call that time bound T. Then presumably sneaky power-seeking behaviors should only be desired inasmuch as they pay out within time T. Currently T is pretty small, so I don't expect sneaky power-seeking to pay out within it. Maybe you're imagining that T will increase a bunch?
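To make the T argument concrete, here is a toy sketch (the numbers and the reward-attribution function are made up purely for illustration): credit assignment only sees rewards that arrive within T steps of an action, so a sneaky action whose payoff lands after the window gets no credit during training.

```python
# Toy illustration with made-up numbers: an action is only credited for rewards
# arriving within T steps of it, so delayed payoffs are invisible to training.
def credited_return(reward_events, action_time, T):
    """Sum of rewards arriving within T steps after the action."""
    return sum(r for t, r in reward_events if action_time <= t <= action_time + T)

# (time, reward) pairs: a small immediate payoff plus a large payoff delayed by 50 steps
reward_events = [(5, 1.0), (55, 10.0)]

print(credited_return(reward_events, action_time=5, T=10))   # 1.0  -- the delayed payoff is not credited
print(credited_return(reward_events, action_time=5, T=100))  # 11.0 -- a much larger T would credit it
```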
Secondarily, I have a concern that, even if a setup seems like it should lead to an AI with exclusively short-term desires, it may in fact accidentally lead to an AI that also has long-term desires. See §5.2.1 of that same post for specifics.
Sure, I agree with this (and this is why I want to do MONA even though I expect outcomes-based RL will already be quite time-limited), but if this is your main argument I feel like your conclusion should be way more measured (more like "this might happen" rather than "this is doomed").
3.5 “The reward function will get smarter in parallel with the agent being trained”
This one comes from @paulfchristiano (2022): “I’m most optimistic about approaches where RL is only performed on a reward function that gets smarter in parallel with the agent being trained.”
I, umm, don’t really know what he’s getting at here. Maybe something like: use AI assistants to better monitor and catch sneaky behavior? If so, (1) I don’t expect the AI assistants to be perfect, so my story still goes through, and (2) there’s a chicken-and-egg problem that will lead to the AI assistants being misaligned and scheming too.
Re: (1), the AI assistants don't need to be perfect. They need to be good enough that the agent being trained cannot systematically exploit them, which is a much lower bar than "perfect" (while still being quite a high bar). In general I feel like you're implicitly treating an adversarial AI system as magically having perfect capabilities while any AI systems that are supposed to be helpful are of course flawed and have weaknesses. Pick an assumption and stick with it!
Re: (2), you start with a non-scheming human, and bootstrap up from there, at each iteration using the previous aligned assistant to help oversee the next AI. See e.g. IDA.
There are lots of ways this could go wrong, I'm not claiming it will clearly work, but I don't think either of your objections matters much.
Huh, not sure why my Ctrl+F didn't find that.
In context, it's explaining a difference between non-reasoning and reasoning models, and I do endorse the argument for that difference. I do wish the phrasing were slightly different -- even for non-reasoning models it seems plausible that you could trust their CoT (to be monitorable); it's more that you should be somewhat less optimistic about it.
(Though note that at this level of nuance I'm sure there would be some disagreement amongst the authors on the exact claims here.)
However, CoTs resulting from prompting a non-reasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.
Where is this quote from? I looked for it in our paper and didn't see it. I think I disagree with it directionally, though since it isn't a crisp claim, the context matters.
(Also, this whole discussion feels quite non-central to me. I care much more about the argument for necessity outlined in the paper. Even if CoTs were directly optimized to be harmless I would still feel like it would be worth studying CoT monitorability, though I would be less optimistic about it.)
A couple of years ago, the consensus in the safety community was that process-based RL should be prioritized over outcome-based RL [...]
I don't think using outcome-based RL was ever a hard red line, but it was definitely a line of some kind.
I think many people at OpenAI were pretty explicitly keen on avoiding (what you are calling) process-based RL in favor of (what you are calling) outcome-based RL for safety reasons, specifically to avoid putting optimization pressure on the chain of thought. E.g. I argued with @Daniel Kokotajlo about this, I forget whether before or after he left OpenAI.
There were maybe like 5-10 people at GDM who were keen on process-based RL for safety reasons, out of thousands of employees.
I don't know what was happening at Anthropic, though I'd be surprised to learn that this was central to their thinking.
Overall I feel like it's not correct to say that there was a line of some kind, except under really trivial / vacuous interpretations of that. At best it might apply to Anthropic (since it was in the Core Views post).
Separately, I am still personally keen on process-based RL and don't think it's irrelevant -- and indeed we recently published MONA, which imo is the most direct experimental paper on the safety idea behind process-based RL.
In general there is a spectrum between process- and outcome-based RL, and I don't think I would have ever said that we shouldn't do outcome-based RL on short-horizon tasks with verifiable rewards; I care much more about the distinction between the two for long-horizon fuzzy tasks.
I do agree that there are some signs that people will continue with outcome-based RL anyway, all the way into long-horizon fuzzy tasks. I don't think this is a settled question -- reasoning models have only been around for a year, and things can change quite quickly.
None of this is to disagree with your takeaways, I roughly agree with all of them (maybe I'd have some quibbles about #2).
Afaict that is using the slower trend which includes GPT-2 and GPT-3, so I would actually consider 8 hours in March 2027 to be a slowdown using the methodology I state above (extrapolating the 2024-2025 trend, which is notably faster than the trend estimated on METR's web page).
(Why am I adding this epicycle? Mostly because imo it tracks reality better, but also I believe this is how Ryan thinks about progress, and I was responding to Ryan.)
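For clarity on the mechanics of "extrapolating the 2024-2025 trend": fit a doubling time to just that window and project the time horizon forward exponentially. A minimal sketch with placeholder numbers (the actual starting horizon and doubling times should come from METR's data, not from here):

```python
# Minimal extrapolation sketch; all numbers are illustrative placeholders, not METR's fits.
def extrapolate_horizon(start_hours, doubling_months, months_ahead):
    """Project an exponential time-horizon trend forward by months_ahead."""
    return start_hours * 2 ** (months_ahead / doubling_months)

# Hypothetical: a 2-hour horizon today, projected 21 months out under two doubling times.
print(extrapolate_horizon(2.0, doubling_months=7, months_ahead=21))  # ~16 hours (slower fit)
print(extrapolate_horizon(2.0, doubling_months=4, months_ahead=21))  # ~76 hours (faster fit)
```

The point is just that a faster fitted doubling time compounds into a much higher predicted horizon by the same date, so whether a given forecast counts as "on trend" or "a slowdown" depends a lot on which window you fit.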