Rohin Shah

Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/

Sequences: Value Learning, Alignment Newsletter

Comments (sorted by newest)

The Most Common Bad Argument In These Parts
Rohin Shah · 6d

What exactly do you propose that a Bayesian should do, upon receiving the observation that a bounded search for examples within a space did not find any such example?

(I agree that it is better if you can instead construct a tight logical argument, but usually that is not an option.)
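
To make the kind of update I have in mind concrete, here is a toy sketch (all numbers are made up for illustration, not taken from the post): treat "a bounded search of k candidates found no strong example" as an observation and ask how likely that null result is under the hypothesis versus under its negation. The null result carries a finite likelihood ratio, so it is moderate evidence rather than either a proof of absence or nothing at all.

```python
# Toy Bayesian update on a null result from a bounded search.
# All numbers are illustrative assumptions, not taken from the post.

def posterior_after_null_search(prior, p_hit_if_true, p_hit_if_false, k):
    """P(hypothesis | a search of k candidates found no strong example)."""
    like_true = (1 - p_hit_if_true) ** k    # P(null result | hypothesis true)
    like_false = (1 - p_hit_if_false) ** k  # P(null result | hypothesis false)
    odds = prior / (1 - prior) * (like_true / like_false)
    return odds / (1 + odds)

# Hypothesis: "a strong example exists in this space." Suppose each of the
# k=5 searched candidates would have surfaced one with probability 0.25 if
# the hypothesis is true, and 0.05 if it is false (near-misses, noise, etc.).
post = posterior_after_null_search(prior=0.5, p_hit_if_true=0.25,
                                   p_hit_if_false=0.05, k=5)
print(f"posterior after the null result: {post:.2f}")  # ~0.23
# A real update downward, but its size depends entirely on the assumed
# per-candidate hit rates: moderate evidence, not a proof of absence.
```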

I also don't find the examples very compelling:

  1. Security mindset -- Afaict the examples here are fictional
  2. Superforecasters -- In my experience, superforecasters have all kinds of diverse reasons for low p(doom), some good, many bad. The one you describe doesn't seem particularly common.
  3. Rethink -- Idk the details here, will pass
  4. Fatima Sun Miracle: I'll just quote Scott Alexander's own words in the post you link:

I will admit my bias: I hope the visions of Fatima were untrue, and therefore I must also hope the Miracle of the Sun was a fake. But I’ll also admit this: at times when doing this research, I was genuinely scared and confused. If at this point you’re also scared and confused, then I’ve done my job as a writer and successfully presented the key insight of Rationalism: “It ain’t a true crisis of faith unless it could go either way”.

[...]

I don’t think we have devastated the miracle believers. We have, at best, mildly irritated them. If we are lucky, we have posited a very tenuous, skeletal draft of a materialist explanation of Fatima that does not immediately collapse upon the slightest exposure to the data. It will be for the next century’s worth of scholars to flesh it out more fully.

Overall, I'm pleasantly surprised by how bad these examples are. I would have expected much stronger examples, since on priors I expected that many people would in fact follow EFAs off a cliff, rather than treating them as evidence of moderate but not overwhelming strength. To put it another way, I expected that your FA on examples of bad EFAs would find more and/or stronger hits than it actually did, and in my attempt to better approximate Bayesianism I am noticing this observation and updating on it.

Nice-ish, smooth takeoff (with imperfect safeguards) probably kills most "classic humans" in a few decades.
Rohin Shah · 15d*

It's instead arguing with the people who are imagining something like "business continues sort of as usual in a decentralized fashion, just faster, things are complicated and messy, but we muddle through somehow, and the result is okay."

The argument for this position is more like: "we never have a 'solution' that gives us justified confidence that the AI will be aligned, but when we build the AIs, the AIs turn out to be aligned anyway".

You seem to instead be assuming "we don't get a 'solution', and so we build ASI and all instances of ASI are mostly misaligned but a bit nice, and so most people die". I probably disagree with that position too, but imo it's not an especially interesting position to debate, as I do agree that building ASI that is mostly misaligned but a bit nice is a bad outcome that we should try hard to prevent.

The title is reasonable
Rohin Shah · 25d

Yeah, that's fair for agendas that want to directly study the circumstances that lead to scheming. Though when thinking about those agendas, I do find myself more optimistic because they likely do not have to deal with long time horizons, whereas capabilities work likely will have to engage with that.

Note that many alignment agendas don't need to actually study potential schemers. Amplified oversight can make substantial progress without studying actual schemers (but probably will face the long horizon problem). Interpretability can make lots of foundational progress without schemers, which I would expect to mostly generalize to schemers. Control can make progress with models prompted or trained to be malicious.

This is a review of the reviews
Rohin Shah · 26d

On reflection I think you're right that this post isn't doing the thing I thought it was doing, and have edited my comment.

(For reference: I don't actually have strong takes on whether you should have chosen a different title given your beliefs. I agree that your strategy seems like a reasonable one given those beliefs, while also thinking that building a Coalition of the Concerned would have been a reasonable strategy given those beliefs. I mostly dislike the social pressure currently being applied in the direction of "those who disagree should stick to their agreements" (example) without even an acknowledgement of the asymmetry of the request, let alone a justification for it. But I agree this post isn't quite doing that.)

The title is reasonable
Rohin Shah · 26d

Do you agree the feedback loops for capabilities are better right now?

Yes, primarily due to the asymmetry where capabilities can work with existing systems while alignment is mostly stuck waiting for future systems, but that should be much less true by the time of DAI.

Edit: though risk could be lower in the worlds where capabilities R&D goes off the rails due to having more time for safety, depending on whether this also applies to the next actor etc.

I was thinking both of this, and also that it seems quite correlated due to lack of asymmetry. Like, 20% on exactly one going off the rails rather than zero or both seems very high to me; I feel like to get to that I would want to know about some important structural differences between the problems. (Though I definitely phrased my comment poorly for communicating that.)

The title is reasonable
Rohin Shah · 26d

I'm not really following where the disanalogy is coming from (like, why are the feedback loops better?)

Sure, AI societies could go off the rails in a way that hurts alignment R&D but not AI R&D; they could also go off the rails in a way that hurts AI R&D and not alignment R&D. Not sure why I should expect one rather than the other.

Although on further reflection, even though the current DAI isn't scheming, alignment work still has to be doing some worst-case type thinking about how future AIs might be scheming, whereas this is not needed for AI R&D. I don't think this makes a big difference -- usually I find worst-case conceptual thinking to be substantially easier than average-case conceptual thinking -- but I could imagine that causing issues.

The title is reasonable
Rohin Shah · 26d

Ok, but it isn't sufficient to just ensure DAI isn't scheming, we have to also ensure it is aligned enough to hand off work and has good epistemics and is well elicited on hard to check tasks. This seems pretty hard given the huge rush, but it isn't obviously fucked IMO, especially given the extra years from acceleration. I have some draft writing on this which should hopefully be out somewhat soon. Maybe my view is 20% chance of failure given good allocation and roughly 60% chance of failure given the default allocation (which includes stuff like the safety team not actually handing off or not seriously working on this etc)?

This seems way too pessimistic to me. At the point of DAI, capabilities work will also require good epistemics and good elicitation on hard to check tasks. The key disanalogy between capabilities and alignment work at the point of DAI is that the DAI might be scheming, but you're in a subjunctive case where we've assumed the DAI is not scheming. Whence the pessimism?

(This complaint is related to Eli's complaint)

The title is reasonable
Rohin Shah · 26d

Note I am thinking of a pretty specific subset of comments where Buck is engaging with people who he views as "extremely unreasonable MIRI partisans". I'm not primarily recommending that Buck move those comments to private channels; usually my recommendation is to not bother commenting on that at all. If there does happen to be some useful kernel to discuss, then I'd recommend he do that elsewhere and then write something public with the actually useful stuff.

The title is reasonable
Rohin Shah · 26d*

Oh huh, kinda surprised my phrasing was stronger than what you'd say. 

Idk the "two monkey chieftains" is just very... strong, as a frame. Like of course #NotAllResearchers, and in reality even for a typical case there's going to be some mix of object-level-epistemically-valid reasoning along with social-monkey reasoning, and so on.

Also, you both get many more observations than I do (by virtue of being in the Bay Area) and are paying more attention to extracting evidence / updates out of those observations around the social reality of AI safety research. I could believe that you're correct, I don't have anything to contradict it, I just haven't looked in enough detail to come to that conclusion myself.

Tribal thinking is just really ingrained

This might be true but feels less like the heart of the problem. Imo the bigger deal is more like trapped priors:

The basic idea of a trapped prior is purely epistemic. It can happen (in theory) even in someone who doesn't feel emotions at all. If you gather sufficient evidence that there are no polar bears near you, and your algorithm for combining prior with new experience is just a little off, then you can end up rejecting all apparent evidence of polar bears as fake, and trapping your anti-polar-bear prior.
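
As a toy illustration of the "just a little off" mechanism (my own made-up model, not Scott's): suppose the likelihood ratio you actually update on is a blend of the raw evidence and what your current belief already expects to see. With no blending you get honest Bayes; with a moderate blend, evidence against the belief gets experienced as confirmation and the prior never escapes.

```python
# Toy model of a trapped prior: the agent updates not on the raw observation,
# but on a "perceived" observation blended with what it already expected.
# All numbers here are illustrative assumptions, not from the linked post.

def bayes_update(p, likelihood_ratio):
    """Odds-form Bayes update; returns the new probability."""
    odds = p / (1 - p) * likelihood_ratio
    return odds / (1 + odds)

def perceived_lr(raw_lr, p, blend):
    """Blend the raw likelihood ratio with what the current belief expects.

    blend = 0 is honest Bayes; larger blend means the observation is partly
    reinterpreted to fit the existing belief.
    """
    expected_lr = p / (1 - p)
    return raw_lr ** (1 - blend) * expected_lr ** blend

prior = 0.9    # strong initial belief ("there are polar bears nearby")
raw_lr = 0.5   # each observation is genuinely 2:1 evidence against the belief

for blend in (0.0, 0.4):
    p = prior
    for _ in range(10):
        p = bayes_update(p, perceived_lr(raw_lr, p, blend))
    print(f"blend={blend}: belief after 10 contrary observations = {p:.3f}")

# blend=0.0 drives the belief down to ~0.01, as it should; blend=0.4 leaves
# it pinned near 1.0 -- the prior is trapped, and contrary evidence is
# experienced as confirmation.
```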

A person on either "side" certainly feels like they have sufficient evidence / arguments for their position (and can often list them out in detail, so it's not pure self-deception). So premise #1 is usually satisfied.

There are tons of ways that the algorithm for combining prior with new experience can be "just a little off" to satisfy premise #2:

  • When you read a new post, if it's by "your" side everything feels consistent with your worldview so you don't notice all the ways it is locally invalid, whereas if it's by the "other" side you intuitively notice a wrong conclusion (because it conflicts with your worldview) which then causes you to find the places where it is locally invalid.[1] If you aren't correcting for this, your prior will be trapped.
    • (More broadly I think LWers greatly underestimate the extent to which almost all reasoning is locally logically invalid, and how much you have to evaluate arguments based on their context.[2])
  • Even when you do notice a local invalidity in "your" side, it is easy enough for you to repair so it doesn't change your view. But if you notice a local invalidity in "their" side, you don't know how to repair it and so it seems like a gaping hole. If you aren't correcting for this, your prior will be trapped.[3]
  • When someone points out a counterargument, you note that there's a clear slight change in position that averts the counterargument, without checking whether this should change confidence overall.[4]
  • The sides have different epistemic norms, so it is just actually true that the "other" side has more epistemic issues as-evaluated-by-your-norms than "your" side. If you aren't correcting for this, your prior will be trapped.
    • I don't quite know enough to pin down what the differences are, but maybe something like: "pessimists" care a lot more about precision of words and logical local validity, whereas "optimists" care a lot more about the thrust of an argument and accepted general best practices even if you can't explain exactly how they're compatible with Bayesianism. Idk I feel like this is not correct.

I think this is the (much bigger) challenge that you'd want to try to solve. For example, I think LW curation decisions are systematically biased for these reasons, and that likely contributes substantially to the problem with LW group epistemics.

Given that, what kinds of solutions would I be thinking about?

  • Partial solution from academia: there are norms restricting people's (influential) opinions to their domain of expertise. This creates a filter where the opinions you care about are much more likely to be the result of deep engagement with details on a given topic, and so are more likely to be correct. (Relatedly, my biggest critique of individual LW epistemics is a lack of respect for how much details matter.)
  • Partial solution from academia: procedural norms around what evidence you have to show for something to become "accepted knowledge" (typically enforced via peer review).[5]
  • For curation in particular: get some "optimists" to feed into curation decisions. (Buck, Ryan, and Lukas all seem like potential candidates, seeing as they aren't as pessimistic as me and at least Buck + Ryan already put some effort into LW group epistemics.)

Tbc I also believe that there's lots of straightforwardly tribal thinking going on.[6] People also mindkill themselves in ways that make them less capable of reasoning clearly.[7] But it doesn't feel as necessary to solve. If you had a not-that-large set of good thinking going on, that feels like it could be enough (e.g. Alignment Forum at time of launch). Just let the tribes keep on tribe-ing and mostly ignore them.

I guess all of this is somewhat in conflict with my original position that sensationalism bias is a big deal for LW group epistemics. Whoops, sorry. I do still think sensationalism and tribalism biases are a big deal but on reflection I think trapped priors are a bigger deal and more of my reason for overall pessimism.

Though for sensationalism / tribalism I'd personally consider solutions as drastic as "get rid of the karma system, accept lower motivation for users to produce content, figure out something else for identifying which posts should be surfaced to readers (maybe an LLM-based system can do a decent job)" and "much stronger moderation of tribal comments, including e.g. deleting highly-upvoted EY comments that are too combative / dismissive".

  1. ^

    For example, I think this post against counting arguments reads as though the authors noticed a local invalidity in a counting argument, then commenters on the early draft pointed out that of course there was a dependence on simplicity that most people could infer from context, and then the authors threw some FUD on simplicity. (To be clear, I endorse some of the arguments in that post, and not others; do not take this as me disendorsing that post entirely.)

  2. ^

    Habryka's commentary here seems like an example, where the literal wording of Zach's tweet is clearly locally invalid, but I naturally read Zach's tweet as "they're wrong about doom being inevitable [if anyone builds it]". (I agree it would have been better for Zach to be clearer there, but Habryka's critique seems way too strong.)

  3. ^

    For example, when reading the Asterisk review of IABIED (not the LW comments, the original review on Asterisk), I noticed that the review was locally incorrect because the IABIED authors don't consider an intelligence explosion to be necessary for doom, but also I could immediately repair it to "it's not clear why these arguments should make you confident in doom if you don't have a very fast takeoff" (that being my position). (Tbc I haven't read IABIED, I just know the authors' arguments well enough to predict what the book would say.) But I expect people on the "MIRI side" would mostly note "incorrect" and fail to predict the repair. (The in-depth review, which presumably involved many hours of thought, does get as far as noting that probably Clara thinks that FOOM is needed to justify "you only get one shot", but doesn't really go into any depth or figure out what the repair would actually be.)

  4. ^

    As a possible example, MacAskill quotes PC's summary of EY as "you can’t learn anything about alignment from experimentation and failures before the critical try" but I think EY's position is closer to "you can't learn enough about alignment from experimentation and failures before the critical try". Similarly see this tweet. I certainly believe that EY's position is that you can't learn enough, but did the author actually reflect on the various hopes for learning about alignment from experimentation and failures and update their own beliefs, or did they note that there's a clear rebuttal and then stop thinking? (I legitimately don't know tbc; though I'm happy to claim that often it's more like the latter even if I don't know in any individual case.)

  5. ^

    During my PhD I was consistently irritated by how often peer reviewers would just completely fail to be moved by a conceptual argument. But arguably this is a feature, not a bug, along the lines of epistemic learned helplessness; if you stick to high standards of evidence that have worked well enough in the past, you'll miss out on some real knowledge but you will be massively more resistant to incorrect-but-convincing arguments.

  6. ^

    I was especially unimpressed by "enforcing norms on" (i.e. threatening) people if they don't take the tribal action.

  7. ^

    For example, "various readers may be less cautious/paranoid/afraid than me, and think that it’s worth some risk of killing every child on Earth (and everyone else) to get progress faster or to avoid the costs of getting everyone to go slow". If you are arguing for > 90% doom "if anyone builds it", you don't need rhetorical jujitsu like this! (And in fact my sense is that many of the MIRI team who aren't EY/NS equivocate a lot between "what's needed for < 90% doom" and "what's needed for < 1% doom", though I'm not going to defend this claim. Seems like the sort of thing that could happen if you mindkill yourself this way.)

This is a review of the reviews
Rohin Shah · 26d*

Building a coalition doesn't look like suppressing disagreements, but it does look like building around the areas of agreement. 

Indeed. This is why one might choose a different book title than "If Anyone Builds It, Everyone Dies".

EDIT: On reflection, I retract my (implicit) claim that this is a symmetric situation; there is a difference between what you say unprompted, vs what you say when commenting on what someone else has said. It is of course still true that one might choose a different book title if the goal was to build around areas of agreement.

Posts

  • Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety (166 karma · 3mo · 32 comments)
  • Evaluating and monitoring for AI scheming (57 karma · 3mo · 9 comments)
  • Google DeepMind: An Approach to Technical AGI Safety and Security (73 karma · 6mo · 12 comments)
  • Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2) (113 karma · 6mo · 15 comments)
  • AGI Safety & Alignment @ Google DeepMind is hiring (103 karma · 8mo · 19 comments)
  • A short course on AGI safety from the GDM Alignment team (104 karma · 8mo · 2 comments)
  • MONA: Managed Myopia with Approval Feedback (81 karma · 9mo · 30 comments)
  • AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work (222 karma · 1y · 33 comments)
  • On scalable oversight with weak LLMs judging strong LLMs (49 karma · 1y · 18 comments)
  • Improving Dictionary Learning with Gated Sparse Autoencoders (63 karma · 1y · 38 comments)