LessWrong team member / moderator. I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.
This paragraph feels righter-to-me (oh, huh, you even ended up with the same word "ostentatious" as a pointer that I did in my comment-1-minute-ago)
I'll say (as a guy who just wrote a very pro-book post) that this vibe feels off to me. (I'm not sure if any particular sentence seems definitely wrong, but, it feels like it's coming from a generator that I think is wrong)
I think Eliezer/Nate were deliberately not attempting to make the book some kind of broad thing the whole community could rally behind. They might have done so, but, they didn't. So, complaining about "why our kind can't cooperate" doesn't actually feel right to me in this instance.
(I think there's some kind of subtle "why we can't cooperate" thing that is still relevant, but, it's less like "YOU SHOULD ALL BE COOPERATING" and more like "some people should notice that something is weird about the way they're sort of... ostentatiously not cooperating?". Where I'm not so much frustrated at them "not cooperating," more frustrated at the weirdness of the dynamics around the ostentatiousness. (This sentence still isn't quite right, but, I'mma leave it there for now.))
FYI I got value from the last round of arguments between Buck/Ryan and Eliezer (in The Problem), where I definitely agree Eliezer was being obtuse/annoying. I learned more useful things about Buck's worldview from that one than Eliezer's (nonzero from Eliezer's tho), and I think that was good for the commons more broadly.
I don't know if it was a better use of time than whatever else Buck would have done that day, but, I appreciated it.
(I'm not sure what to do about the fact that Being Triggered is such a powerful catalyst for arguing, it does distort what conversations we find ourselves having, but, I think it increases the total amount of public argumentation that exists, fairly significantly)
Oh, if that's what you meant by Defense in Depth, as Joe said, the book's argument is "we don't know how."
At weak capabilities, our current ability to steer AI is sufficient, because mistakes aren't that bad. Anthropic is trying pretty hard with Claude to build something that's robustly aligned, and it's just quite hard. When o3 or Claude cheat on programming tasks, they get caught, and the consequences aren't that dire. But when there are millions of iterations of AI-instances making choices, and when the AI is smarter than humanity, the amount of robustness you need is much, much higher.
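(A toy way to see the scaling, with purely illustrative numbers I'm making up, not anything from the book: if each decision independently has some small chance p of going catastrophically wrong, then over N decisions

$$P(\text{at least one bad slip}) = 1 - (1-p)^N \approx 1 - e^{-pN}$$

so a one-in-a-million per-decision failure rate that's totally fine over a thousand decisions (~0.1%) becomes near-certain failure over ten million decisions (~99.995%).)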
My understanding is that Y&S think this is doomed because ~"at the limit of <poorly defined, handwavy stuff> the model will end up killing us [probably as a side-effect] anyway"
[...]
And then you can draw conclusions about what a perfect agent with those preferences would do. But there's no reason to believe your map always applies.
I'm not quite sure how to parse this, but, it sounds like you're saying something like "I don't understand why we should expect in the limit something to be a perfect game theoretic agent." The answer is "because if it wasn't, that wouldn't be the limit, and the AI would notice it was behaving suboptimally, and figure out a way to change that."
Not every AI will do that, automatically. But, if you're deliberately pushing the AI to be a good problem solver, and if it ends up in a position where it is capable of improving its cognition, once it notices 'improve my cognition' as a viable option, there's not a reason for it to stop.
...
It sounds like a lot of your objection is maybe to the general argument "things that can happen, eventually will." (in particular, when billions of dollars' worth of investment are trying to push towards things-nearby-that-attractor happening).
(Or, maybe more completely: "sure, things that can happen eventually will, but meanwhile a lot of other stuff might happen that changes how path-dependent-things will play out?")
I'm curious how load-bearing that feels for the rest of the arguments?
I do think this is a pretty good point about how human value formation tends to happen.
I think something sort-of-similar might happen a little, near-term, with LLM-descended AI. But, AI just doesn't have any of the same social machinery actually embedded in it the same way, so if it's doing something similar, it'd be happening because LLMs vaguely ape human tendencies. (And I expect this to stop being a major factor as the AI gets smarter. I don't expect it to install in itself the sort of social drives that humans have, and "imitate humans" has pretty severe limits on how smart you can get, so if we get to AI much smarter than that, it'll probably be doing a different thing)
I think the more important point here is "notice that you're (probably) wrong about how you actually do your value-updating, and this may be warping your expectations about how AI would do it."
But, that doesn't leave me with any particular other idea than the current typical bottom-up story.
(obviously if we did something more like uploads, or upload-adjacent, it'd be a whole different story)
EDIT: Missed Raemon's reply, I agree with at least the vibe of his comment (it's a bit stronger than what I'd have said).
Oh huh, kinda surprised my phrasing was stronger than what you'd say.
Getting into a bit from a problem-solving angle, in a "first think about the problem for 5 minutes before proposing solutions" kinda way...
The reasons the problem is hard include:
Probably there's more.
Meanwhile, the knobs to handle this are:
The sort of thing that seems like an improvement is changing something about how strong upvotes work, at least in some cases. (e.g. maybe if we detect a thread has fairly obvious factions tug-of-war-ing, we turn off strong-upvoting, or add some kind of cooldown period)
We've periodically talked about having strong-upvotes require giving a reason, or otherwise constrained in some way, although I think there was disagreement about whether that'd probably be good or bad.
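To make those knobs a bit more concrete, here's a rough hypothetical sketch (this is not the actual LessWrong voting code; the types, thresholds, and function names are all made up for illustration):

```typescript
// Hypothetical sketch only, not the real LessWrong vote system.
// Idea: if a thread shows an obvious two-faction strong-vote tug-of-war,
// temporarily disable strong votes on it (or switch to "give a reason" mode).

type Vote = {
  userId: string;
  commentId: string;
  power: number;       // signed vote strength, e.g. +/-1 normal, +/-N strong
  isStrong: boolean;
  castAt: Date;
};

type ThreadVotePolicy =
  | { kind: "normal" }
  | { kind: "strongVotesDisabled"; until: Date }
  | { kind: "strongVotesRequireReason" };  // the "give a reason" knob, not wired up below

// Crude tug-of-war heuristic: within a recent window, a large share of the
// votes are strong votes, and the strong votes split roughly evenly up/down.
function detectTugOfWar(votes: Vote[], windowHours = 24): boolean {
  const cutoff = Date.now() - windowHours * 3600 * 1000;
  const recent = votes.filter(v => v.castAt.getTime() >= cutoff);
  if (recent.length < 20) return false;  // too few votes to call it a faction fight

  const strong = recent.filter(v => v.isStrong);
  const strongShare = strong.length / recent.length;

  const strongUp = strong.filter(v => v.power > 0).length;
  const strongDown = strong.length - strongUp;
  const balance = Math.min(strongUp, strongDown) / Math.max(strongUp, strongDown, 1);

  // "Obvious factions": lots of strong votes, split close to 50/50.
  return strongShare > 0.4 && balance > 0.6;
}

// If a tug-of-war is detected, turn off strong votes for a cooldown period.
function updateThreadPolicy(votes: Vote[], cooldownHours = 48): ThreadVotePolicy {
  if (detectTugOfWar(votes)) {
    return {
      kind: "strongVotesDisabled",
      until: new Date(Date.now() + cooldownHours * 3600 * 1000),
    };
  }
  return { kind: "normal" };
}
```

The thresholds are obviously arbitrary; the real question is whether any detector like this can be made hard to game.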
Why is this any different than training a next generation of word-predictors and finding out it can now play chess, or do chain-of-thought reasoning, or cheat on tests? I agree it's unlocking new abilities, I just disagree that this implies anything massively different from what's already going on, and is the thing you'd expect to happen by default.
I think Rohin is (correctly IMO) noticing that, while some thoughtful pieces do succeed at talking about the doomer/optimist stuff in a way that's not-too-tribal and helps people think, it's just very common for it to also affect the way people talk and reason.
Like, it's good IMO that that Paul piece got pretty upvoted, but, the way that many people related to Eliezer and Paul as sort of two monkey chieftains with narratives to rally around, more than just "here are some abstract ideas about what makes alignment hard or easy", is telling. (The evidence for this is subtle enough I'm not going to try to argue it right now, but I think it's a very real thing. My post here today is definitely part of this pattern. I don't know exactly how I could have written it without doing so, but there's something tragic about it)
Yeah, I agree with "trapped priors" being a major problem.
The solution this angle brings to mind is more like "subsidize comments/posts that do a good job of presenting counterarguments in a way that is less triggering / less likely to feed the toxoplasma".