Most of my posts and comments are about AI and alignment. Posts I'm most proud of, which also provide a good introduction to my worldview:
I also created Forum Karma, and wrote a longer self-introduction here.
PMs and private feedback are always welcome.
NOTE: I am not Max Harms, author of Crystal Society. I'd prefer for now that my LW postings not be attached to my full name when people Google me for other reasons, but you can PM me here or on Discord (m4xed) if you want to know who I am.
I suspect that there might be a crux that's something like: are future AIs more naturally oriented toward something like consequentialist reasoning or shaped cognition:
I think this is closer to a restatement of your / Dario's position, rather than a crux. My claim is that it doesn't matter whether specific future AIs are "naturally" consequentialists or something else, or how many degrees of freedom there are to be or not be a consequentialist and still get stuff done. Without bringing AI into it at all, we can already know (I claim, but am not really expanding on here) that consequentialism itself is extremely powerful, natural, optimal, etc., and that there are some very general and deep lessons we can learn from this. "There might be a way to build an AI without all that" or even "In practice that won't happen by default given current training methods, at least for a while" could be true, but it wouldn't change my position.
But if something like this is where Dario is coming from, then I wouldn't say that the problem is that he has missed a bit about how the world works. It's that he has noticed that current AI looks like it'd be based on shaped cognition if extrapolated further,
OK, sure.
and that there hasn't been a strong argument for why it couldn't be kept that way relatively straightforwardly.
Right, this is closer to where I disagree. I think there is a strong argument about this that doesn't have anything to do with "shaped cognition" or even AI in particular.
On the other hand, expertise research finds that trying to do consequentialist reasoning in most established domains is generally error-prone and a mark of novices, and experts have had their cognition shaped to just immediately see the right thing and execute it. And people are generally not very consequentialist about navigating their lives and just do whatever everyone else does, and often this is actually a better idea than trying to figure out everything in your life from first principles.
I would flag this as exactly the wrong kind of lesson / example for learning something interesting about consequentialism - failure and mediocrity are overdetermined; it's just not that interesting that there are particular contrived examples where some humans fail at applying consequentialism. Some of the best places to look for the deeper lessons and intuitions about consequentialism are environments with a lot of cut-throat competition and the possibility of outlier success and failure, which are not artificially constrained or bounded in time or resources, etc.
You seem to be reading Dario to say "tendencies like instrumental power-seeking won't emerge at all".
I am more saying that when Dario and others dismiss what they call "doomer" arguments as vague / clean theories, ungrounded philosophy, etc., and couch their own position as moderate + epistemically humble, what's actually happening is Dario himself failing to generalize about how the world works.
We can imagine that some early powerful AIs will also miss those lessons / generalizations, either by chance or because of deliberate choices that the creators make, but if you count on that, or even just say that we can't really know exactly how it will play out until we build and experiment, you're relying on your own ignorance and lack of understanding to tell an overly-conjunctive story, even if parts of your story are supported by experiment. That chain of reasoning is invalid, regardless of what is true in principle or practice about the AI systems people actually build.
On Dario's part I suspect this is at least partly motivated cognition, but for others, one way past this failure mode could be to study and reflect on examples in domains that are (on the surface) unrelated to AI. Unfortunately, having someone else spell out the connections and deep lessons from this kind of study has had mixed results in the past - millions of words have been spilled on LW and other places over the years and it usually devolves into accusations of argument by analogy, reference class tennis, navel-gazing, etc.
This “misaligned power-seeking” is the intellectual basis of predictions that AI will inevitably destroy humanity.
The problem with this pessimistic position is that it mistakes a vague conceptual argument about high-level incentives—one that masks many hidden assumptions—for definitive proof. I think people who don’t build AI systems every day are wildly miscalibrated on how easy it is for clean-sounding stories to end up being wrong, and how difficult it is to predict AI behavior from first principles, especially when it involves reasoning about generalization over millions of environments (which has over and over again proved mysterious and unpredictable). Dealing with the messiness of AI systems for over a decade has made me somewhat skeptical of this overly theoretical mode of thinking.
One of the most important hidden assumptions, and a place where what we see in practice has diverged from the simple theoretical model, is the implicit assumption that AI models are necessarily monomaniacally focused on a single, coherent, narrow goal, and that they pursue that goal in a clean, consequentialist manner. In fact, our researchers have found that AI models are vastly more psychologically complex, as our work on introspection or personas shows.
False / non-sequitur? Instrumental convergence and optimality of power-seeking are facts that describe important facets of reality. They unpack to precise + empirical + useful models of many dynamics in economics, games, markets, biology, computer security, and many adversarial interactions among humans generally.
It is not a coincidence that these dynamics don't (according to Dario / Anthropic) make useful predictions about the behavior of current / near-future AI systems, and that current AI systems are not actually all that powerful or dangerous. But that isn't at all a refutation of power-seeking and optimization as convergent behavior of actually-powerful agents! I think people who build AI systems every day are "wildly miscalibrated" on how empirically well-supported and widely applicable these dynamics and methods of thinking are outside their own field.
Dario's "more moderate and more robust version" of how power-seeking could be a real risk seems like an overly-detailed just-so story about some ways instrumental convergence and power-seeking could emerge in current AI systems, conveniently in ways that Anthropic is mostly set up to catch / address. But the actually-correct argument is more like: if instrumental convergence and power-seeking don't emerge in some form, then the AI system you end up with won't actually be sufficiently powerful for what you want to do, regardless of how aligned it is. And even if you do manage to build something powerful enough for whatever you want to do that is aligned and doesn't converge towards power-seeking, that implies someone else can build a strictly more powerful system which does converge, likely with relative ease compared to the effort you put in to build the non-convergent system. None of this depends on whether the latest version of Claude is psychologically complex or has a nice personality or whatever.
It's the digital equivalent of printing something in vivid color on a glossy brochure rather than in black and white on A4 paper.
Hmm, maybe, but if that's what it takes to get people to start consuming text content (instead of video), I'll take it? By "medium" I was primarily referring to text over video, not to dashboards and polished graphics - I think if Lightcone were to spend time creating e.g. polished political ads or explainer / informational videos of any kind, that would be pretty bad, unless they strongly endorsed the content.
I wasn't super explicit about it in the grandparent, but two load-bearing and perhaps unusual things I believe:
A large fraction of people, including elites, get basically all of their information and form their opinions and beliefs based on non-text sources: TV news, shortform video, instructional and educational videos, talking and listening to other people, etc.
(I suspect a lot of LW and adjacent people who tend to consume a lot of text content don't realize how widespread this is due to typical-minding.)
I also feel pretty meh about the content of Deciding to Win and (somewhat) AI 2027, but as a medium, I think popularizing widely accessible, written long-form prose as a mode of elite discourse is unambiguously positive and potentially very impactful.
It's kind of taken for granted around here, but "reliably and deeply engage with thoughtful written word" is absolutely not a bar that most discourse of any stripe meets. If Lightcone changes that even a little, for example by causing policymakers, elected officials, and other kinds of elites to shift a bit of their time from trawling social media and TV news, reacting to vibes and shallow popular sentiment, to engaging more with bloggers and public intellectuals, that seems like it could be worth a lot.
As a medium, a slick one page website seems like a nice middle ground in the design space between books (less accessible / widely shareable, often too long, don't allow for good back-and-forth / commentary) and blog posts with comment sections (ideal for truly deep engagement, but not as legible / accessible for different reasons). And all of these are much better than TV news, "community engagement", social media, video, etc.
Also worth remembering that (actual) RSI was never a necessary condition for ruin. It seems at least plausible that at some point, human AI researchers on their own will find methods of engineering an AGI to sufficiently superhuman levels that it is smart enough to start developing nanotech and / or socially engineering humans for its bootstrapping needs.
So even if labs were carefully monitoring for RSI and trying to avoid it (rather than deliberately engineering for it + frog boiling in the meantime), an AI inclined to take over might find that it doesn't even need to bother with potentially dicey self-modifications until after it has already secured victory.
Meta: any discussion or reaction you observe to abrasiveness and communication style (including the discussion here) is selected for people who are particularly sensitive to and / or feel strongly enough about these things one way or the other to speak up. I think if you don't account for this, you'll end up substantially overestimating the impact and EV in either direction.
To be clear, I think this selection effect is not simply "lots of people like to talk about Eliezer", which you tried to head off as best you could. If you made a completely generic post about discourse norms, strategic communication, the effects and (un)desirability of abrasiveness and snark, when and how much they are appropriate, etc., it might get less overall attention. But my guess is that it would still attract the usual suspects and object-level viewpoints, in a way that warps the discussion due to selection.
As a concrete example of the kind of effect this selection might have: I find the norms of discourse on the EA Forum somewhat off-putting, and in general I find that thinking strategically about communication (as opposed to simply communicating) feels somewhat icky and not particularly appealing as a conversational subject. From this, you can probably infer how I feel about some of Eliezer's comments and the responses. But I also don't usually feel strongly enough about it to remark on these things either way. I suspect I am not atypical, but that my views are underrepresented in discussions like this.
Another factor is that the absolute size of the boom (or bubble) is somewhat smaller than it appears if you just look at dollar-denominated increases in the value of stocks.
Stocks are denominated in dollars, and the value of a dollar has fallen substantially in real terms in the last couple of years, in large part due to tariffs (or uncertainty created by tariffs) and inflation. These effects are mostly independent of each other and of the AI boom, and neither is particularly good for stimulating actual healthy economic growth, but they could soften the effect of any future bubble popping, because there is less actual growth / bubble to pop than it appears if you only look at nominal returns.
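To make the arithmetic concrete (with made-up numbers, purely for illustration): if an index is up 30% in nominal dollars over a period in which the dollar lost about 8% of its purchasing power, the real gain is only around 20%:

$$r_{\text{real}} = \frac{1 + r_{\text{nominal}}}{1 + \pi} - 1 = \frac{1.30}{1.08} - 1 \approx 0.204$$

i.e. roughly a third of the apparent boom in this toy example is just the measuring stick shrinking.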
because any solver specialized to the evolutionary subset is guaranteed to fail once the promise is removed; outside that tightly curated domain, the mapping reverts to an unbounded and intractable instance class.
There are plenty of expansions you could make to the "evolutionary subset" (some of them trivial, some of them probably interesting) for which no theorem from complexity theory guarantees that the problem of predicting how any particular instance in the superset folds is intractable.
In general, hardness results from complexity theory say very little about the practical limits on problem-solving ability for AI (or humans, or evolution) in the real world, precisely because the "standard abstraction schemes" do not fully capture interesting aspects of the real-world problem domain, and because the results are mainly about classes and limiting behavior rather than any particular instance we care about.
In many hardness and impossibility results, the "adversarial / worst-case" assumptions are doing nearly all of the work in the proof, but if you're just trying to build some nanobots, you don't care about that. Or more prosaically, if you want to steal some cryptocurrency, in real life you use a side-channel or 0-day in the implementation (or a wrench attack); you don't bother trying to factor large numbers.
IMO it is correct to mostly ignore these kinds of things when building your intuition about what a superintelligence is likely or not likely to be able to do, once you understand what the theorems actually say. NP-hardness says, precisely, that "if a problem is NP-hard (and P ≠ NP), then there is no deterministic algorithm anyone (even a superintelligence) can run, which accepts arbitrary instances of the problem and finds a solution in a number of time steps polynomial in the size of the problem instance." This statement is precise and formal, but unfortunately it doesn't mention protein folding, and even the implications it has for an idealized formal model of protein folding are of limited use when trying to predict which specific proteins AlphaFold-N will / won't be able to predict correctly.
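Spelled out with the quantifiers explicit (this is just the standard textbook statement, for a decision problem L, nothing specific to protein folding):

$$\big(L \text{ is NP-hard}\big) \wedge \big(\mathsf{P} \neq \mathsf{NP}\big) \;\Longrightarrow\; L \notin \mathsf{P},$$

i.e. there is no deterministic algorithm A and polynomial p such that A correctly decides every instance x of L within p(|x|) steps. The quantifier ranges over all instances: it says nothing about average-case difficulty, structured subsets of instances, or approximate solutions, which is exactly the gap being pointed at above.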
I agree that the claims the Anthropic researchers are making here are kind of wacky, but there is a related / not-exactly-steelman argument that has been floating around LW for a while, namely that many old-school AI alignment people assume that transformer models will necessarily get more coherent as they get smarter (and larger), when (according to the arguers) that assumption hasn't been fully justified, nor has it been empirically borne out so far.
I recall @nostalgebraist's comment here as an example of this line of discussion that was highly upvoted at the time.
So a generous / benign interpretation of the "Hot mess" work is that it is an attempt to empirically investigate this argument and the questions that nostalgebraist and others have posed.
Personally, I continue to think that most of these discussions are kind of missing the point of the original arguments and assumptions that they're questioning. The actual argument - that coherence and agency are deeply and closely related to the ability to usefully and generally plan, execute, and adapt in a sample-efficient way - doesn't depend on what's happening in any particular existing AI system or assume anything about how such systems will work. It might or might not be the case that these properties and abilities will emerge directly in transformer models as they get larger - or they'll emerge as a result of putting the model in the right kind of harness / embodiment, or as part of some advancement in a post-training process deliberately designed to shape them for coherence, or they'll emerge in some totally different architecture / paradigm - but exactly how and when that happens isn't a crux for any of my own beliefs or worldview.
Put another way, "a country of geniuses in a datacenter" had better be pretty good at working together and pursuing complex, long time-horizon goals coherently if they want to actually get anything useful done! Whether and how the citizens of that country contain large transformer models as a key component is maybe an interesting question from a timelines / forecasting perspective or if you want to try building that country right away, but it doesn't seem particularly relevant to what happens shortly afterwards if you actually succeed.