Most of my posts and comments are about AI and alignment. Posts I'm most proud of, which also provide a good introduction to my worldview:
I also created Forum Karma, and wrote a longer self-introduction here.
PMs and private feedback are always welcome.
NOTE: I am not Max Harms, author of Crystal Society. I'd prefer for now that my LW postings not be attached to my full name when people Google me for other reasons, but you can PM me here or on Discord (m4xed) if you want to know who I am.
You seem to be reading Dario to say "tendencies like instrumental power-seeking won't emerge at all".
I am more saying that when Dario and others dismiss what they call "doomer" arguments as vague / clean theories, ungrounded philosophy, etc., and couch their own position as moderate + epistemically humble, what's actually happening is Dario himself failing to generalize about how the world works.
We can imagine that some early powerful AIs will also miss those lessons / generalizations, either by chance or because of deliberate choices that the creators make, but if you count on that, or even just say that we can't really know exactly how it will play out until we build and experiment, you're relying on your own ignorance and lack of understanding to tell an overly-conjunctive story, even if parts of your story are supported by experiment. That chain of reasoning is invalid, regardless of what is true in principle or practice about the AI systems people actually build.
On Dario's part I suspect this is at least partly motivated cognition, but for others, one way past this failure mode could be to study and reflect on examples in domains that are (on the surface) unrelated to AI. Unfortunately, having someone else spell out the connections and deep lessons from this kind of study has had mixed results in the past - millions of words have been spilled on LW and other places over the years and it usually devolves into accusations of argument by analogy, reference class tennis, navel-gazing, etc.
This “misaligned power-seeking” is the intellectual basis of predictions that AI will inevitably destroy humanity.
The problem with this pessimistic position is that it mistakes a vague conceptual argument about high-level incentives—one that masks many hidden assumptions—for definitive proof. I think people who don’t build AI systems every day are wildly miscalibrated on how easy it is for clean-sounding stories to end up being wrong, and how difficult it is to predict AI behavior from first principles, especially when it involves reasoning about generalization over millions of environments (which has over and over again proved mysterious and unpredictable). Dealing with the messiness of AI systems for over a decade has made me somewhat skeptical of this overly theoretical mode of thinking.
One of the most important hidden assumptions, and a place where what we see in practice has diverged from the simple theoretical model, is the implicit assumption that AI models are necessarily monomaniacally focused on a single, coherent, narrow goal, and that they pursue that goal in a clean, consequentialist manner. In fact, our researchers have found that AI models are vastly more psychologically complex, as our work on introspection or personas shows.
False / non-sequitur? Instrumental convergence and optimality of power-seeking are facts that describe important facets of reality. They unpack to precise + empirical + useful models of many dynamics in economics, games, markets, biology, computer security, and many adversarial interactions among humans generally.
It's not a coincidence that these dynamics don't (according to Dario / Anthropic) make useful predictions about the behavior of current / near-future AI systems, and that current AI systems are not actually all that powerful or dangerous. But that isn't at all a refutation of power-seeking and optimization as convergent behavior of actually-powerful agents! I think people who build AI systems every day are "wildly miscalibrated" on how empirically well-supported and widely applicable these dynamics and methods of thinking are outside their own field.
Dario's "more moderate and more robust version" of how power-seeking could be a real risk seems like an overly-detailed just-so story about some ways instrumental convergence and power-seeking could emerge in current AI systems, conveniently in ways that Anthropic is mostly set up to catch / address. But the actually-correct argument is more like: if instrumental convergence and power-seeking don't emerge in some form, then the AI system you end up with won't actually be sufficiently powerful for what you want to do, regardless of how aligned it is. And even if you do manage to build something powerful enough for whatever you want to do that is aligned and doesn't converge towards power-seeking, that implies someone else can build a strictly more powerful system which does converge, likely with relative ease compared to the effort you put in to build the non-convergent system. None of this depends on whether the latest version of Claude is psychologically complex or has a nice personality or whatever.
It's the digital equivalent of printing something in vivid color on a glossy brochure rather than in black and white on A4 paper.
Hmm, maybe, but if that's what it takes to get people to start consuming text content (instead of video), I'll take it? By "medium" I was primarily referring to text over video, not to dashboards and polished graphics - I think if Lightcone were to spend time creating e.g. polished political ads or explainer / informational videos of any kind, that would be pretty bad, unless they strongly endorsed the content.
I wasn't super explicit about it in the grandparent, but two load-bearing and perhaps unusual things I believe:
A large fraction of people, including elites, get basically all of their information and form their opinions and beliefs based on non-text sources: TV news, shortform video, instructional and educational videos, talking and listening to other people, etc.
(I suspect a lot of LW and adjacent people who tend to consume a lot of text content don't realize how widespread this is due to typical-minding.)
I also feel pretty meh about the content of Deciding to Win and (somewhat) AI 2027, but as a medium, I think popularizing widely accessible, written long-form prose as a mode of elite discourse is unambiguously positive and potentially very impactful.
It's kind of taken for granted around here, but "reliably and deeply engage with thoughtful written word" is absolutely not a bar that most discourse of any stripe meets. If Lightcone changes that even a little, for example by causing policymakers, elected officials, and other kinds of elites to shift a bit of their time from trawling social media and TV news, reacting to vibes and shallow popular sentiment, to engaging more with bloggers and public intellectuals, that seems like it could be worth a lot.
As a medium, a slick one page website seems like a nice middle ground in the design space between books (less accessible / widely shareable, often too long, don't allow for good back-and-forth / commentary) and blog posts with comment sections (ideal for truly deep engagement, but not as legible / accessible for different reasons). And all of these are much better than TV news, "community engagement", social media, video, etc.
Also worth remembering that (actual) RSI was never a necessary condition for ruin. It seems at least plausible that at some point, human AI researchers on their own will find methods of engineering an AGI to sufficiently superhuman levels that it is smart enough to start developing nanotech and / or socially engineering humans for its bootstrapping needs.
So even if labs were carefully monitoring for RSI and trying to avoid it (rather than deliberately engineering for it + frog boiling in the meantime), an AI inclined to take over might find that it doesn't even need to bother with potentially dicey self-modifications until after it has already secured victory.
Meta: any discussion or reaction you observe to abrasiveness and communication style (including the discussion here) is selected for people who are particularly sensitive to and / or feel strongly enough about these things one way or the other to speak up. I think if you don't account for this, you'll end up substantially overestimating the impact and EV in either direction.
To be clear, I think this selection effect is not simply "lots of people like to talk about Eliezer", which you tried to head off as best you could. If you made a completely generic post about discourse norms, strategic communication, the effects and (un)desirability of abrasiveness and snark, when and how much it is appropriate, etc., it might get less overall attention. But my guess is that it would still attract the usual suspects and object-level viewpoints, in a way that warps the discussion due to selection.
As a concrete example of the kind of effect this selection might have: I find the norms of discourse on the EA Forum somewhat off-putting, and in general I find that thinking strategically about communication (as opposed to simply communicating) feels somewhat icky and not particularly appealing as a conversational subject. From this, you can probably infer how I feel about some of Eliezer's comments and the responses. But I also don't usually feel strongly enough about it to remark on these things either way. I suspect I am not atypical, but that my views are underrepresented in discussions like this.
Another factor is that the absolute size of the boom (or bubble) is somewhat smaller than it appears if you just look at dollar-denominated increases in the value of stocks.
Stocks are denominated in dollars, and the purchasing power of a dollar has fallen substantially in the last couple of years, in large part due to tariffs (or uncertainty created by tariffs) and inflation. These effects are mostly independent of each other and of the AI boom, and neither is particularly good for stimulating actual healthy economic growth, but they could soften the effect of any future bubble popping, because there is less actual growth / bubble to pop than nominal returns suggest.
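As a rough illustration with purely made-up numbers (not a claim about actual recent figures): if nominal stock returns over some period were $r_{\text{nominal}} = 30\%$ while the dollar lost $\pi = 10\%$ of its purchasing power over the same period, the real return is only

$$r_{\text{real}} = \frac{1 + r_{\text{nominal}}}{1 + \pi} - 1 = \frac{1.30}{1.10} - 1 \approx 18\%,$$

i.e. a sizeable chunk of the apparent boom is just the measuring stick shrinking.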
because any solver specialized to the evolutionary subset is guaranteed to fail once the promise is removed; outside that tightly curated domain, the mapping reverts to an unbounded and intractable instance class.
There are plenty of expansions you could make to the "evolutionary subset" (some of them trivial, some of them probably interesting) for which no theorem from complexity theory guarantees that the problem of predicting how any particular instance in the superset folds is intractable.
In general, hardness results from complexity theory say very little about the practical limits on problem-solving ability for AI (or humans, or evolution) in the real world, precisely because the "standard abstraction schemes" do not fully capture interesting aspects of the real-world problem domain, and because the results are mainly about classes and limiting behavior rather than any particular instance we care about.
In many hardness and impossibility results, the "adversarial / worst-case" part is doing nearly all of the work in the proof, but if you're just trying to build some nanobots you don't care about that. Or more prosaically, if you want to steal some cryptocurrency, in real life you use a side-channel or 0-day in the implementation (or a wrench attack); you don't bother trying to factor large numbers.
IMO it is correct to mostly ignore these kinds of things when building your intuition about what a superintelligence is likely or not likely to be able to do, once you understand what the theorems actually say. NP-hardness says, precisely, that if a problem is NP-hard (and P ≠ NP), then there is no deterministic algorithm anyone (even a superintelligence) can run which accepts arbitrary instances of the problem and finds a solution in a number of time steps polynomial in the size of the problem instance. This statement is precise and formal, but unfortunately it doesn't mention protein folding, and even the implications that it has for an idealized formal model of protein folding are of limited use when trying to predict which specific proteins AlphaFold-N will / won't be able to predict correctly.
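Spelled out a bit more formally (my notation, a paraphrase of the standard definitions rather than a quote of any textbook): for a problem $L$,

$$\big(L \text{ is NP-hard}\big) \wedge \big(\mathrm{P} \neq \mathrm{NP}\big) \;\Longrightarrow\; \neg\exists\,(A, p):\ \forall x,\ A \text{ solves instance } x \text{ of } L \text{ within } p(|x|) \text{ steps},$$

where $A$ ranges over deterministic algorithms and $p$ over polynomials. Note that the statement quantifies over all instances $x$; it says nothing about whether the instances you actually care about (e.g. biologically evolved proteins, or mild supersets of them) are among the hard ones.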
if you asked me to pick between the CEV of Claude 3 Opus and that of a median human, I think it'd be a pretty close call (I'd probably pick Claude, but it depends on the details of the setup).
This example seems like it is kind of missing the point of CEV in the first place? If you're at the point where you can actually pick the CEV of some person or AI, you've already solved most or all of your hard problems.
Setting aside that picking a particular entity is already getting away from the original formulation of CEV somewhat, the main reason I see to pick a human over Opus is that a median human very likely has morally-relevant-to-other-humans qualia, in ways that current AIs may not.
I realize this is maybe somewhat tangential to the rest of the post, but I think this sort of disagreement is central to a lot of (IMO misplaced) optimism based on observations of current AIs, and implies an unjustifiably high level of confidence in a theory of mind of AIs, by putting that theory on par with the level of confidence that you can justifiably have in a theory of mind for humans. Elaborating / speculating a bit:
My guess is that you lean towards Opus based on a combination of (a) chatting with it for a while and seeing that it says nice things about humans, animals, AIs, etc. in a way that respects those things' preferences and shows a generalized caring about sentience and (b) running some experiments on its internals to see that these preferences are deep or robust in some way, under various kinds of perturbations.
But I think what models (or a median / randomly chosen human) say about these things is actually one of the less important considerations. I am not as pessimistic as, say, Wei Dai about how bad humans currently are at philosophy, but neither the median human nor any AI model that I have seen so far can talk sensibly about the philosophy of consciousness, morality, alignment, etc. nor even really come close. So on my view, outputs (both words and actions) of both current AIs and average humans on these topics are less relevant (for CEV purposes) than the underlying generators of those thoughts and actions.
In humans, we have a combination of (a) knowing a lot about evolution and neuroscience and (b) being humans ourselves. Taken together, these two things bridge the gap of a lot of missing or contentious philosophical knowledge - we don't have to know exactly what qualia are to be pretty confident that other humans have them via introspection + knowing that the generators are (mechanically) very similar. Also, we know that the generators of goodness and sentience in humans generalize well enough, at least from median to >.1%ile humans - for the same reasons (a) and (b) above, we can be pretty confident that the smartest and most good among us feel love, pain, sorrow, etc. in roughly similar ways to everyone else, and being multiple standard deviations (upwards) among humans for smartness and / or goodness (usually) doesn't cause a person to do crazy / harmful things. I don't think we have similarly strong evidence about how AIs generalize even up to that point (let alone beyond).
Not sure where / if you disagree with any of this, but either way, the point is that I think that "I would pick Opus over a human" for anything CEV-adjacent implies a lot more confidence in a philosophy of both human and AI minds than is warranted.
In the spirit of making empirical / falsifiable predictions, a thing that would change my view on this is if AI researchers (or AIs themselves) started producing better philosophical insights about consciousness, metaethics, etc. than the best humans did in 2008, where these insights are grounded by their applicability to and experimental predictions about humans and human consciousness (rather than being self-referential / potentially circular insights about AIs themselves). I don't think Eliezer got everything right about philosophy, morality, consciousness, etc. 15y ago, but I haven't seen much in the way of public writing or discourse that has improved on things since then, and in many ways the quality of discourse has gotten worse. I think it would be a positive sign (but don't expect to see it) if AIs were to change that.
I think this is closer to a restatement of your / Dario's position, rather than a crux. My claim is that it doesn't matter whether specific future AIs are "naturally" consequentialists or something else, or how many degrees of freedom there are to be or not be a consequentialist and still get stuff done. Without bringing AI into it at all, we can already know (I claim, but am not really expanding on here) that consequentialism itself is extremely powerful, natural, optimal, etc., and there are some very general and deep lessons that we can learn from this. "There might be a way to build an AI without all that" or even "In practice that won't happen by default given current training methods, at least for a while" could be true, but it wouldn't change my position.
OK, sure.
Right, this is closer to where I disagree. I think there is a strong argument about this that doesn't have anything to do with "shaped cognition" or even AI in particular.
I would flag this as exactly the wrong kind of lesson / example to learn something interesting about consequentialism - failure and mediocrity are overdetermined; it's just not that interesting that there are particular contrived examples where some humans fail at applying consequentialism. Some of the best places to look for the deeper lessons and intuitions about consequentialism are environments with a lot of cut-throat competition and the possibility for outlier success and failure, which are not artificially constrained or bounded in time or resources, etc.