It's instead arguing with the people who are imagining something like "business continues sort of as usual in a decentralized fashion, just faster, things are complicated and messy, but we muddle through somehow, and the result is okay."
The argument for this position is more like: "we never have a 'solution' that gives us justified confidence that the AI will be aligned, but when we build the AIs, the AIs turn out to be aligned anyway".
You seem to instead be assuming "we don't get a 'solution', and so we build ASI and all instances of ASI are mostly misaligned but a bit nice, and so most people die". I probably disagree with that position too, but imo it's not an especially interesting position to debate, as I do agree that building ASI that is mostly misaligned but a bit nice is a bad outcome that we should try hard to prevent.
Yeah, that's fair for agendas that want to directly study the circumstances that lead to scheming. Though when thinking about those agendas, I do find myself more optimistic because they likely do not have to deal with long time horizons, whereas capabilities work likely will have to engage with that.
Note that many alignment agendas don't need to actually study potential schemers. Amplified oversight can make substantial progress without studying actual schemers (though it probably will face the long horizon problem). Interpretability can make lots of foundational progress without schemers, which I would expect to mostly generalize to schemers. Control can make progress with models prompted or trained to be malicious.
On reflection I think you're right that this post isn't doing the thing I thought it was doing, and have edited my comment.
(For reference: I don't actually have strong takes on whether you should have chosen a different title given your beliefs. I agree that your strategy seems like a reasonable one given those beliefs, while also thinking that building a Coalition of the Concerned would have been a reasonable strategy given those beliefs. I mostly dislike the social pressure currently being applied in the direction of "those who disagree should stick to their agreements" (example) without even an acknowledgement of the asymmetry of the request, let alone a justification for it. But I agree this post isn't quite doing that.)
Do you agree the feedback loops for capabilities are better right now?
Yes, primarily due to the asymmetry where capabilities can work with existing systems while alignment is mostly stuck waiting for future systems, but that should be much less true by the time of DAI.
Edit: though risk could be lower in the worlds where capabilities R&D goes off the rails, due to having more time for safety, depending on whether this also applies to the next actor etc.
I was thinking both of this, and also that it seems quite correlated due to lack of asymmetry. Like, 20% on exactly one going off the rails rather than zero or both seems very high to me; I feel like to get to that I would want to know about some important structural differences between the problems. (Though I definitely phrased my comment poorly for communicating that.)
I'm not really following where the disanalogy is coming from (like, why are the feedback loops better?)
Sure, AI societies could go off the rails in a way that hurts alignment R&D but not AI R&D; they could also go off the rails in a way that hurts AI R&D and not alignment R&D. Not sure why I should expect one rather than the other.
Although on further reflection, even though the current DAI isn't scheming, alignment work still has to be doing some worst-case type thinking about how future AIs might be scheming, whereas this is not needed for AI R&D. I don't think this makes a big difference -- usually I find worst-case conceptual thinking to be substantially easier than average-case conceptual thinking -- but I could imagine that causing issues.
Ok, but it isn't sufficient to just ensure DAI isn't scheming, we have to also ensure it is aligned enough to hand off work and has good epistemics and is well elicited on hard to check tasks. This seems pretty hard given the huge rush, but it isn't obviously fucked IMO, especially given the extra years from acceleration. I have some draft writing on this which should hopefully be out somewhat soon. Maybe my view is 20% chance of failure given good allocation and roughly 60% chance of failure given the default allocation (which includes stuff like the safety team not actually handing off or not seriously working on this etc)?
This seems way too pessimistic to me. At the point of DAI, capabilities work will also require good epistemics and good elicitation on hard to check tasks. The key disanalogy between capabilities and alignment work at the point of DAI is that the DAI might be scheming, but you're in a subjunctive case where we've assumed the DAI is not scheming. Whence the pessimism?
(This complaint is related to Eli's complaint)
Note I am thinking of a pretty specific subset of comments where Buck is engaging with people whom he views as "extremely unreasonable MIRI partisans". I'm not primarily recommending that Buck move those comments to private channels; usually my recommendation is to not bother commenting on that at all. If there does happen to be some useful kernel to discuss, then I'd recommend he do that elsewhere and then write something public with the actually useful stuff.
Oh huh, kinda surprised my phrasing was stronger than what you'd say.
Idk the "two monkey chieftains" is just very... strong, as a frame. Like of course #NotAllResearchers, and in reality even for a typical case there's going to be some mix of object-level-epistemically-valid reasoning along with social-monkey reasoning, and so on.
Also, you both get many more observations than I do (by virtue of being in the Bay Area) and are paying more attention to extracting evidence / updates out of those observations around the social reality of AI safety research. I could believe that you're correct; I don't have anything to contradict it; I just haven't looked in enough detail to come to that conclusion myself.
Tribal thinking is just really ingrained
This might be true but feels less like the heart of the problem. Imo the bigger deal is more like trapped priors:
The basic idea of a trapped prior is purely epistemic. It can happen (in theory) even in someone who doesn't feel emotions at all. If you gather sufficient evidence that there are no polar bears near you, and your algorithm for combining prior with new experience is just a little off, then you can end up rejecting all apparent evidence of polar bears as fake, and trapping your anti-polar-bear prior.
A person on either "side" certainly feels like they have sufficient evidence / arguments for their position (and can often list them out in detail, so it's not pure self-deception). So premise #1 is usually satisfied.
There are tons of ways that the algorithm for combining prior with new experience can be "just a little off" to satisfy premise #2:
I think this is the (much bigger) challenge that you'd want to try to solve. For example, I think LW curation decisions are systematically biased for these reasons, and that likely contributes substantially to the problem with LW group epistemics.
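To make the trapped-prior mechanism quoted above concrete: here is a toy sketch (entirely my own construction, not something from the quoted post) where the only flaw is that the agent blends each new observation with what its prior led it to expect. A weak prior still updates roughly correctly under disconfirming evidence; a strong prior ratchets toward certainty.

```python
# Toy model (purely illustrative) of a slightly miscalibrated update rule
# trapping a prior. The agent "perceives" a blend of the raw evidence and what
# its prior led it to expect, instead of the raw evidence alone.

def biased_update(p, raw_likelihood_ratio, blend=0.6):
    """One update where the evidence is blended with the prior's expectation."""
    prior_odds = p / (1 - p)
    # Correct Bayes would multiply prior_odds by raw_likelihood_ratio directly.
    perceived_ratio = (raw_likelihood_ratio ** (1 - blend)) * (prior_odds ** blend)
    posterior_odds = prior_odds * perceived_ratio
    return posterior_odds / (1 + posterior_odds)

for label, p in [("weak prior", 0.5), ("strong prior", 0.9)]:
    for _ in range(8):
        # Every observation is genuinely 2:1 evidence *against* the belief.
        p = biased_update(p, raw_likelihood_ratio=0.5)
    print(label, round(p, 3))
# weak prior   -> drifts down toward 0, as it should
# strong prior -> climbs toward 1 despite uniformly disconfirming evidence
```

With `blend = 0` the rule reduces to exact Bayesian updating and both priors get washed out by the evidence, which is the sense in which the algorithm only needs to be "a little off".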
Given that, what kinds of solutions would I be thinking about?
Tbc I also believe that there's lots of straightforwardly tribal thinking going on.[6] People also mindkill themselves in ways that make them less capable of reasoning clearly.[7] But it doesn't feel as necessary to solve. If you had a not-that-large amount of good thinking going on, that feels like it could be enough (e.g. the Alignment Forum at the time of its launch). Just let the tribes keep on tribe-ing and mostly ignore them.
I guess all of this is somewhat in conflict with my original position that sensationalism bias is a big deal for LW group epistemics. Whoops, sorry. I do still think sensationalism and tribalism biases are a big deal but on reflection I think trapped priors are a bigger deal and more of my reason for overall pessimism.
Though for sensationalism / tribalism I'd personally consider solutions as drastic as "get rid of the karma system, accept lower motivation for users to produce content, figure out something else for identifying which posts should be surfaced to readers (maybe an LLM-based system can do a decent job)" and "much stronger moderation of tribal comments, including e.g. deleting highly-upvoted EY comments that are too combative / dismissive".
For example, I think this post against counting arguments reads as though the authors noticed a local invalidity in a counting argument, then commenters on the early draft pointed out that of course there was a dependence on simplicity that most people could infer from context, and then the authors threw some FUD on simplicity. (To be clear, I endorse some of the arguments in that post and not others; do not take this as me disendorsing that post entirely.)
Habryka's commentary here seems like an example, where the literal wording of Zach's tweet is clearly locally invalid, but I naturally read Zach's tweet as "they're wrong about doom being inevitable [if anyone builds it]". (I agree it would have been better for Zach to be clearer there, but Habryka's critique seems way too strong.)
For example, when reading the Asterisk review of IABIED (not the LW comments, the original review on Asterisk), I noticed that the review was locally incorrect because the IABIED authors don't consider an intelligence explosion to be necessary for doom, but also I could immediately repair it to "it's not clear why these arguments should make you confident in doom if you don't have a very fast takeoff" (that being my position). (Tbc I haven't read IABIED, I just know the authors' arguments well enough to predict what the book would say.) But I expect people on the "MIRI side" would mostly note "incorrect" and fail to predict the repair. (The in-depth review, which presumably involved many hours of thought, does get as far as noting that probably Clara thinks that FOOM is needed to justify "you only get one shot", but doesn't really go into any depth or figure out what the repair would actually be.)
As a possible example, MacAskill quotes PC's summary of EY as "you can’t learn anything about alignment from experimentation and failures before the critical try" but I think EY's position is closer to "you can't learn enough about alignment from experimentation and failures before the critical try". Similarly see this tweet. I certainly believe that EY's position is that you can't learn enough, but did the author actually reflect on the various hopes for learning about alignment from experimentation and failures and update their own beliefs, or did they note that there's a clear rebuttal and then stop thinking? (I legitimately don't know tbc; though I'm happy to claim that often it's more like the latter even if I don't know in any individual case.)
During my PhD I was consistently irritated by how often peer reviewers would just completely fail to be moved by a conceptual argument. But arguably this is a feature, not a bug, along the lines of epistemic learned helplessness; if you stick to high standards of evidence that have worked well enough in the past, you'll miss out on some real knowledge but you will be massively more resistant to incorrect-but-convincing arguments.
I was especially unimpressed by "enforcing norms on" (i.e. threatening) people if they don't take the tribal action.
For example, "various readers may be less cautious/paranoid/afraid than me, and think that it’s worth some risk of killing every child on Earth (and everyone else) to get progress faster or to avoid the costs of getting everyone to go slow". If you are arguing for > 90% doom "if anyone builds it", you don't need rhetorical jujitsu like this! (And in fact my sense is that many of the MIRI team who aren't EY/NS equivocate a lot between "what's needed for < 90% doom" and "what's needed for < 1% doom", though I'm not going to defend this claim. Seems like the sort of thing that could happen if you mindkill yourself this way.)
Building a coalition doesn't look like suppressing disagreements, but it does look like building around the areas of agreement.
Indeed. This is why one might choose a different book title than "If Anyone Builds It, Everyone Dies".
EDIT: On reflection, I retract my (implicit) claim that this is a symmetric situation; there is a difference between what you say unprompted, vs what you say when commenting on what someone else has said. It is of course still true that one might choose a different book title if the goal was to build around areas of agreement.
What exactly do you propose that a Bayesian should do, upon receiving the observation that a bounded search for examples within a space did not find any such example?
(I agree that it is better if you can instead construct a tight logical argument, but usually that is not an option.)
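For concreteness, here is the kind of update I have in mind, with purely illustrative numbers: let $H$ be "strong examples exist", with prior $P(H) = 0.5$, and suppose the bounded search would have turned one up with probability $0.7$ if $H$ were true, and essentially never otherwise. Then

$$P(H \mid \text{no example found}) = \frac{P(\text{none} \mid H)\,P(H)}{P(\text{none} \mid H)\,P(H) + P(\text{none} \mid \neg H)\,P(\neg H)} = \frac{0.3 \cdot 0.5}{0.3 \cdot 0.5 + 1 \cdot 0.5} \approx 0.23,$$

i.e. a moderate but not overwhelming update toward "no strong examples", with its strength controlled entirely by how thorough you think the search was.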
I also don't find the examples very compelling:
Overall, I'm pleasantly surprised by how bad these examples are. I would have expected much stronger examples, since on priors I expected that many people would in fact follow EFAs off a cliff, rather than treating them as evidence of moderate but not overwhelming strength. To put it another way, I expected that your FA on examples of bad EFAs would find more and/or stronger hits than it actually did, and in my attempt to better approximate Bayesianism I am noticing this observation and updating on it.