Rohin Shah
Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/

Sequences: Value Learning · Alignment Newsletter
Shortform: rohinmshah's Shortform (6y)

Comments (sorted by newest)

Buck's Shortform
Rohin Shah · 12h

Just switch to log_10 space and add? E.g. 10k times 300k = 1e4 * 3e5 = 3e9 = 3B. A bit slower, but doesn't require any drills.
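A minimal sketch of the same trick in Python (the function name is mine, purely for illustration):

```python
import math

# Multiply by adding base-10 logs, then converting back: a * b = 10 ** (log10(a) + log10(b)).
def log_space_product(a: float, b: float) -> float:
    return 10 ** (math.log10(a) + math.log10(b))

# 10k times 300k: 4 + ~5.48 ≈ 9.48 in log10 space, i.e. roughly 3e9 (3B).
print(f"{log_space_product(10_000, 300_000):.3g}")  # 3e+09
```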

Comparative advantage & AI
Rohin Shah · 14d

Seb is explicitly talking about AGI and not ASI. It's right there in the tweet.

Most people in policy and governance are not talking about what happens after an intelligence explosion. There are many voices in AI policy and governance and lots of them say dumb stuff, e.g. I expect someone has said the next generation of AIs will cause huge unemployment. Comparative advantage is indeed one reasonable thing to discuss in response to that conversation.

Stop assuming that everything anyone says about AI must clearly be a response to Yudkowsky.

leogao's Shortform
Rohin Shah · 22d

it'd be fine if you held alignment constant but dialed up capabilities.

I don't know what this means so I can't give you a prediction about it.

I don't really see why it's relevant how aligned Claude is if we're not thinking about that as part of it

I just named three reasons:

  1. Current models do not provide much evidence one way or another for existential risk from misalignment (in contrast to frequent claims that "the doomers were right")
  2. Given tremendous uncertainty, our best guess should be that future models are like current models, and so future models will not try to take over, and so existential risk from misalignment is low
  3. Some particular threat model predicted that even at current capabilities we should see significant misalignment, but we don't see this, which is evidence against that particular threat model.

Is it relevant to the object-level question of "how hard is aligning a superintelligence"? No, not really. But people are often talking about many things other than that question.

For example, is it relevant to "how much should I defer to doomers"? Yes absolutely (see e.g. #1).

leogao's Shortform
Rohin Shah · 23d (edited)

I bucket this under "given this ratio of right/wrong responses, you think a smart alignment researcher who's paying attention can keep it in a corrigibility basin even as capability levels rise?". Does that feel inaccurate, or, just, not how you'd exactly put it?

I'm pretty sure it is not that. When people say this, they are usually just asking the question "Will current models try to take over or otherwise subvert our control (including incompetently)?" and noticing that the answer is basically "no".[1] What they use this to argue for can then vary:

  1. Current models do not provide much evidence one way or another for existential risk from misalignment (in contrast to frequent claims that "the doomers were right")
  2. Given tremendous uncertainty, our best guess should be that future models are like current models, and so future models will not try to take over, and so existential risk from misalignment is low
  3. Some particular threat model predicted that even at current capabilities we should see significant misalignment, but we don't see this, which is evidence against that particular threat model.[2]

I agree with (1), disagree with (2) when (2) is applied to superintelligence, and for (3) it depends on details.

In Leo's case in particular, I don't think he's using the observation for much; it's mostly just a throwaway claim that's part of the flow of the comment. But inasmuch as it is being used, it is to say something like "current AIs aren't trying to subvert our control, so it's not completely implausible on the face of it that the first automated alignment researcher to which we delegate won't try to subvert our control", which is a pretty weak claim, seems fine, and doesn't imply any kind of extrapolation to superintelligence. I'd be surprised if this were an important disagreement with the "alignment is hard" crowd.

  [1] There are demos of models doing stuff like this (e.g. blackmail), but only under conditions selected highly adversarially. These look fragile enough that overall I'd still say current models are more aligned than e.g. rationalists (who under adversarially selected conditions have been known to intentionally murder people).

  [2] E.g., one naive threat model says "Orthogonality says that an AI system's goals are completely independent of its capabilities, so we should expect that current AI systems have random goals, which by fragility of value will then be misaligned". Setting aside whether anyone ever believed in such a naive threat model, I think we can agree that current models are evidence against such a threat model.

Research Agenda: Synthesizing Standalone World-Models
Rohin Shah · 25d (edited)

Which, importantly, includes every fruit of our science and technology.

I don't think this is the right comparison, since modern science / technology is a collective effort and so can only accumulate thinking through mostly-interpretable steps. (This may also be true for AI, but if so then you get interpretability by default, at least interpretability-to-the-AIs, at which point you are very likely better off trying to build AIs that can explain that to humans.)

In contrast, I'd expect individual steps of scientific progress that happen within a single mind often are very uninterpretable (see e.g. "research taste").

If we understood an external superhuman world-model as well as a human understands their own world-model, I think that'd obviously get us access to tons of novel knowledge.

Sure, I agree with that, but "getting access to tons of novel knowledge" is nowhere close to "can compete with the current paradigm of building AI", which seems like the appropriate bar given you are trying to "produce a different tool powerful enough to get us out of the current mess".

Perhaps concretely, I'd wildly guess (with huge uncertainty) that this would involve an alignment tax of ~4 GPTs, in the sense that an interpretable world model from GPT-10, similar in quality to a human's understanding of their own world model, would be about as useful as GPT-6.

Research Agenda: Synthesizing Standalone World-Models
Rohin Shah · 1mo

a human's world-model is symbolically interpretable by the human mind containing it.

Say what now? This seems very false:

  • See almost anything physical (riding a bike, picking things up, touch typing a keyboard, etc.). If you have a dominant hand / leg, try doing some standard tasks with the non-dominant hand / leg. Seems like if the human mind could symbolically interpret its own world model, this should be much easier to do.
  • Basically anything to do with vision / senses. Presumably if vision were symbolically interpretable to the mind, then there wouldn't be much of a skill ladder to climb for things like painting.
  • Symbolic grammar usually has to be explicitly taught to people, even though ~everyone has a world model that clearly includes grammar (in the sense that they can generate grammatical sentences and identify errors in grammar).

Tbc I can believe it's true in some cases, e.g. I could believe that some humans' far-mode abstract world models are approximately symbolically interpretable to their mind (though I don't think mine is). But it seems false in the vast majority of domains (if you are measuring relative to competent, experienced people in those domains, as seems necessary if you are aiming for your system to outperform what humans can do).

The Most Common Bad Argument In These Parts
Rohin Shah · 1mo

What exactly do you propose that a Bayesian should do, upon receiving the observation that a bounded search for examples within a space did not find any such example?

(I agree that it is better if you can instead construct a tight logical argument, but usually that is not an option.)

I also don't find the examples very compelling:

  1. Security mindset -- Afaict the examples here are fictional.
  2. Superforecasters -- In my experience, superforecasters have all kinds of diverse reasons for low p(doom), some good, many bad. The one you describe doesn't seem particularly common.
  3. Rethink -- Idk the details here, will pass.
  4. Fatima Sun Miracle -- I'll just quote Scott Alexander's own words in the post you link:

I will admit my bias: I hope the visions of Fatima were untrue, and therefore I must also hope the Miracle of the Sun was a fake. But I’ll also admit this: at times when doing this research, I was genuinely scared and confused. If at this point you’re also scared and confused, then I’ve done my job as a writer and successfully presented the key insight of Rationalism: “It ain’t a true crisis of faith unless it could go either way”.

[...]

I don’t think we have devastated the miracle believers. We have, at best, mildly irritated them. If we are lucky, we have posited a very tenuous, skeletal draft of a materialist explanation of Fatima that does not immediately collapse upon the slightest exposure to the data. It will be for the next century’s worth of scholars to flesh it out more fully.

Overall, I'm pleasantly surprised by how bad these examples are. I would have expected much stronger examples, since on priors I expected that many people would in fact follow EFAs off a cliff, rather than treating them as evidence of moderate but not overwhelming strength. To put it another way, I expected that your FA on examples of bad EFAs would find more and/or stronger hits than it actually did, and in my attempt to better approximate Bayesianism I am noticing this observation and updating on it.
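To make "evidence of moderate but not overwhelming strength" concrete, here's a toy Bayes update for a bounded search that comes up empty (all numbers are made up, purely for illustration):

```python
# Toy update on the observation "a bounded search for counterexamples found none".
p_claim = 0.5              # prior that the claim is true, i.e. no counterexample exists
p_miss_given_false = 0.3   # chance the bounded search misses a counterexample that does exist
p_none_given_true = 1.0    # if no counterexample exists, the search certainly finds none

posterior = (p_none_given_true * p_claim) / (
    p_none_given_true * p_claim + p_miss_given_false * (1 - p_claim)
)
print(round(posterior, 3))  # 0.769: a real update toward the claim, but far from certainty
```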

Nice-ish, smooth takeoff (with imperfect safeguards) probably kills most "classic humans" in a few decades.
Rohin Shah · 2mo (edited)

It's instead arguing with the people who are imagining something like "business continues sort of as usual in a decentralized fashion, just faster, things are complicated and messy, but we muddle through somehow, and the result is okay."

The argument for this position is more like: "we never have a 'solution' that gives us justified confidence that the AI will be aligned, but when we build the AIs, the AIs turn out to be aligned anyway".

You seem to instead be assuming "we don't get a 'solution', and so we build ASI and all instances of ASI are mostly misaligned but a bit nice, and so most people die". I probably disagree with that position too, but imo it's not an especially interesting position to debate, as I do agree that building ASI that is mostly misaligned but a bit nice is a bad outcome that we should try hard to prevent.

The title is reasonable
Rohin Shah · 2mo

Yeah, that's fair for agendas that want to directly study the circumstances that lead to scheming. Though when thinking about those agendas, I do find myself more optimistic because they likely do not have to deal with long time horizons, whereas capabilities work likely will have to engage with that.

Note that many alignment agendas don't need to actually study potential schemers. Amplified oversight can make substantial progress without studying actual schemers (but will probably face the long-horizon problem). Interpretability can make lots of foundational progress without schemers, which I would expect to mostly generalize to schemers. Control can make progress with models prompted or trained to be malicious.

This is a review of the reviews
Rohin Shah · 2mo

On reflection I think you're right that this post isn't doing the thing I thought it was doing, and have edited my comment.

(For reference: I don't actually have strong takes on whether you should have chosen a different title given your beliefs. I agree that your strategy seems like a reasonable one given those beliefs, while also thinking that building a Coalition of the Concerned would have been a reasonable strategy given those beliefs. I mostly dislike the social pressure currently being applied in the direction of "those who disagree should stick to their agreements" (example) without even an acknowledgement of the asymmetry of the request, let alone a justification for it. But I agree this post isn't quite doing that.)

Wikitag Contributions: Newsletters (5 years ago, +17/-12)

Posts (sorted by new):
  • GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash (51 karma, 14d)
  • Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety (166 karma, 4mo)
  • Evaluating and monitoring for AI scheming (58 karma, 4mo)
  • Google DeepMind: An Approach to Technical AGI Safety and Security (73 karma, 7mo)
  • Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2) (113 karma, 7mo)
  • AGI Safety & Alignment @ Google DeepMind is hiring (103 karma, 9mo)
  • A short course on AGI safety from the GDM Alignment team (105 karma, 9mo)
  • MONA: Managed Myopia with Approval Feedback (81 karma, 10mo)
  • AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work (222 karma, 1y)
  • On scalable oversight with weak LLMs judging strong LLMs (49 karma, 1y)