It might be that some elements of human intelligence (at least at the civilizational level) are culturally/memetically transmitted. All well and good in theory. Except the social hypercompetition between people and the intense selection pressure on ideas online might be eroding our world's intelligence. Eliezer wonders if he's only who he is because he grew up reading old science fiction from before the current era's memes.

Raemon
This is a first-pass review that's mostly organizing my thinking about this post. The post makes a few different types of claims:
* Hyperselected memes may be worse (generally) than weakly selected ones
* Hyperselected memes may specifically be damaging our intelligence / social memetic software
* People today are worse at negotiating complex conflicts from different filter bubbles
* There's a particular set of memes (well represented in 1950s sci-fi) that was particularly important, and which is not as common nowadays.

It has a question which is listed, although not focused on too explicitly on its own terms:
* What do you do if you want to have good ideas? (i.e. "drop out of college? read 1950s sci-fi in your formative years?")

It prompts me to separately consider the questions:
* What is the internet actually doing to us? It's surely doing something.
* What sorts of cultures are valuable? What sorts of cultures can be stably maintained? What sorts of cultures cause good intellectual development?
...
Re: the specific claim that "hypercompetition is destroying things", I think the situation is complicated by the "precambrian explosion" of stuff going on right now. Pop music is defeating classical music in relative terms, but, like, in absolute terms there's still a lot more classical music now than in 1400 [citation needed?]. I'd guess this is also true for tribal FB comments vs letter-to-the-editor-type writings.
* [claim by me] The absolute amount of thoughtful discourse is probably still increasing

My guess is that "listens carefully to arguments" has just always been rare, and that people have generally been dismissive of the outgroup; now that's just more prominent. I'd also guess that there's more 1950s-style sci-fi today than in 1950. But it might not be, say, driving national projects that required a critical mass of it. (And it might or might not be appearing on bestseller lists?) If so, the question is less "are things being destroyed...
The current cover of If Anyone Builds It, Everyone Dies is kind of ugly, and I hope it is just a placeholder. At least one of my friends agrees. Book covers matter a lot! I'm not a book cover designer, but here are some thoughts: AI is popular right now, so you'd probably want to indicate that from a distance. The current cover has "AI" half-faded in the tagline. Generally the cover is not very nice to look at. Why are you de-emphasizing "Kill Us All" by hiding it behind that red glow? I do like the font choice, though. No-nonsense and straightforward. @Eliezer Yudkowsky @So8res
Thomas Kwa
Cross-domain time horizon: We know AI time horizons (the human time-to-complete at which a model has a 50% success rate) on software tasks are currently ~1.5 hours and doubling every 4-7 months, but what about other domains? Here's a preliminary result comparing METR's task suite (orange line) to benchmarks in other domains, all of which have some kind of grounding in human data:

Observations
* Time horizons on agentic computer use (OSWorld) are ~100x shorter than in other domains. Tesla self-driving (tesla_fsd), scientific knowledge (gpqa), math contests (aime), video understanding (video_mme), and software (hcast_r_s) all have roughly similar horizons.
* My guess is this means models are good at taking in information from a long context but bad at acting coherently. Most work requires agency, like OSWorld tasks do, which may be why AIs can't do the average real-world 1-hour task yet.
* There are likely other domains that fall outside this cluster; these are just the five I examined.
* Note the original version had a unit conversion error that gave 60x too high horizons for video_mme; this has been fixed (thanks @ryan_greenblatt).
* The rate of improvement varies significantly; math contests have improved ~50x in the last year, but Tesla self-driving only 6x in 3 years. HCAST is middle of the pack in both.

Note this is preliminary and uses a new methodology, so there might be data issues. I'm currently writing up a full post! Is this graph believable? What do you want to see analyzed?

edit: fixed Video-MME numbers
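For concreteness, here is a quick back-of-the-envelope extrapolation of the software trend mentioned above, a rough sketch only: it assumes the ~1.5-hour current horizon and a 5.5-month doubling time (the midpoint of the reported 4-7 month range), and is not METR's methodology.

```python
# Illustrative extrapolation of the numbers in this comment (assumed doubling time).
from math import log2

def horizon_after(months, current_hours=1.5, doubling_months=5.5):
    """Project the 50%-success time horizon `months` from now."""
    return current_hours * 2 ** (months / doubling_months)

def months_until(target_hours, current_hours=1.5, doubling_months=5.5):
    """Months until the horizon reaches `target_hours`, under the same trend."""
    return doubling_months * log2(target_hours / current_hours)

print(f"Horizon in 24 months: ~{horizon_after(24):.0f} hours")                 # ~31 hours
print(f"Months until a 40-hour (work-week) horizon: ~{months_until(40):.0f}")  # ~26 months
```

This of course assumes the exponential trend continues unchanged, which is exactly the thing in question, and says nothing about whether other domains follow the same doubling time.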
The U.S. 30-year Treasury rate has reached 5.13%, a level last seen in October 2023. Before that, the last time the rate was this high was 2007, when the U.S. federal debt was about $9 trillion. Today, that debt is nearing $37 trillion. I believe bond market participants are signaling a lack of confidence that the fiscal situation in the United States will improve during President Trump's second administration. Like many financial professionals, I had high hopes that President Trump's election would put the fiscal situation in order. Unfortunately, the "Department of Government Efficiency" has not been as efficient as many had hoped, and the U.S. Congress seems completely uninterested in reducing federal spending in any meaningful way. The tax cut bill currently moving through Congress, fully backed by the White House, will exacerbate the fiscal situation. If this trend of rising long-term Treasury rates continues, the United States will soon face very tough decisions that neither Wall Street nor Main Street is ready to face.
Many believe that one hope for our future is that the AI labs will make some mistake that kills many people, but not all of us, resulting in the survivors finally realizing how dangerous AI is. I wish people would refer to that as a "near miss", not a "warning shot". A warning shot is fired when the source of the danger (originally a warship) actually cares about you but cares about its mission more, with the result that it complicates its plans and policies to try to keep you alive.
Jack Morris has posted this thread https://x.com/jxmnop/status/1925224612872233081 about his paper "Harnessing the Universal Geometry of Embeddings". Have others thought through what this means for the notion of fundamentally alien internal ontologies? Would love any ideas! Sorry if I missed a post on it.

Popular Comments

> The trends reflect the increasingly intense tastes of the highest spending, most engaged consumers. https://logicmag.io/play/my-stepdad's-huge-data-set/ > > While a lot of people (most likely you and everyone you know) are consumers of internet porn (i.e., they watch it but don’t pay for it), a tiny fraction of those people are customers. Customers pay for porn, typically by clicking an ad on a tube site, going to a specific content site (often owned by MindGeek), and entering their credit card information. > > > > This “consumer” vs. “customer” division is key to understanding the use of data to perpetuate categories that seem peculiar to many people both inside and outside the industry. “We started partitioning this idea of consumers and customers a few years ago,” Adam Grayson, CFO of the legacy studio Evil Angel, told AVN. “It used to be a perfect one-to-one in our business, right? If somebody consumed your stuff, they paid for it. But now it’s probably 10,000 to one, or something.” > > > > There’s an analogy to be made with US politics: political analysts refer to “what the people want,” when in fact a fraction of “the people” are registered voters, and of those, only a percentage show up and vote. Candidates often try to cater to that subset of “likely voters”— regardless of what the majority of the people want. In porn, it’s similar. You have the people (the consumers), the registered voters (the customers), and the actual people who vote (the customers who result in a conversion—a specific payment for a website subscription, a movie, or a scene). Porn companies, when trying to figure out what people want, focus on the customers who convert. It’s their tastes that set the tone for professionally produced content and the industry as a whole. > > > > By 2018, we are now over a decade into the tube era. That means that most LA-area studios are getting their marching orders from out-of-town business people armed with up-to-the-minute customer data. Porn performers tend to roll their eyes at some of these orders, but they don’t have much choice. I have been on sets where performers crack up at some of the messages that are coming “from above,” particularly concerning a repetitive obsession with scenes of “family roleplay” (incest-themed material that uses words like “stepmother,” “stepfather,” and “stepdaughter”) or what the industry calls “IR” (which stands for “interracial” and invariably means a larger, dark-skinned black man and a smaller light-skinned white woman, playing up supposed taboos via dialogue and scenarios). > > > > These particular “taboo” genres have existed since the early days of commercial American porn. For instance, see the stellar performance by black actor Johnnie Keyes as Marilyn Chambers’ orgy partner in 1972’s cinematic Behind the Green Door, or the VHS-era incest-focused sensation Taboo from 1980. But backed by online data of paid customers seemingly obsessed with these topics, the twenty-first-century porn industry—which this year, to much fanfare, was for the first time legally allowed to film performers born in this millennium—has seen a spike in titles devoted to these (frankly old-fashioned) fantasies. > > > > Most performers take any jobs their agents send them out for. The competition is fierce—the ever-replenishing supply of wannabe performers far outweighs the demand for roles—and they don’t want to be seen as “difficult” (particularly the women). Most of the time, the actors don’t see the scripts or know any specific details until they get to set. 
> To the actors rolling their eyes at yet another prompt to declaim, "But you're my stepdad!" or, "Show me your big black dick," the directors shrug, point at the emailed instructions and say, "That's what they want…"

So my interpretation here is that it's not that there's suddenly a huge spike in people discovering they love incest in 2017 where they were clueless in 2016, or that they were all brainwashed to no longer enjoy vanilla that year; it's that that is when the hidden oligopoly turned on various analytics and started deliberately targeting those fetishes as a fleet-wide business decision. And this was because they had so thoroughly commodified regular porn to a price point of $0 that the only paying customers left are the ones with extreme fetishes who cannot be supplied by regular amateur or pro supply. They may or may not have increased in absolute number compared to pre-2017, but it doesn't matter, because everyone else vanished, and their relative importance skyrocketed: "If somebody consumed your stuff, they paid for it. But now it's probably 10,000 to one, or something."

(For younger readers who may be confused by how a ratio like 10,000:1 is even hypothetically possible, because 'where did that 10k come from when no one pays for porn?', it's worth recalling that renting porn videos used to be a big business patronized by a lot of men; it kept many non-Blockbuster video rental stores afloat, and it was an ordinary thing for your local store to have a 'back room' that the kiddies were strictly forbidden from, and while it would certainly stock a lot of fetish stuff like interracial porn, it also rented out tons of normal stuff. If you have no idea what this was like, you may enjoy reading "True Porn Clerk Stories", Ali Davis 2002.)

I think there is a similar effect with foot fetishes & furries: they are oddly well-heeled and pay a ton of money for new stuff, because they are under-supplied and demand new material. There is not much 'organic' supply of women photographing their feet in various lascivious ways; it's not that it's hard, they just don't do it, but they can be incentivized to do so. (I recall reading an article on WikiFeet where IIRC they interviewed a contributor who said he got some photos by simply politely emailing or DMing the woman to ask her to take some foot photos, and she would oblige. "send foots kthnxbai" apparently works. And probably it's fairly easy to pay for or commission feet images/videos: almost everyone has two feet already, and you can work feet into regular porn easily by simply choosing different angles or postures, and a closeup of a foot won't turn off regular porn consumers either, so you can have your cake & eat it too. Similarly for incest: saying "But you're my stepdad!" is cheap and easy and anyone can do it if the Powers That Be tell them to, in case a few 'customers' will pay actual $$$ for it, while those 'consumers' not into that plot roll their eyes and ignore it as so much silly 'porn movie plot' framing as they get on with business.)
Text diffusion models are still LLMs, just not autoregressive.
The key question is whether you can find improvements that work at large scale using mostly small experiments, not whether the improvements work just as well at small scale. The 3 largest algorithmic advances discussed here (Transformer, MoE, and MQA) were all originally found at tiny scale (~1 hr on an H100 or ~1e19 FLOP[1], which is ~7 orders of magnitude smaller than current frontier training runs).[2]

This paper looks at how improvements vary with scale, and finds that the best improvements have returns which increase with scale. But what we care about is predictability given careful analysis and scaling laws, and that isn't really examined.

> We found that, historically, the largest algorithmic advances couldn't just be scaled up from smaller versions. They needed to have large amounts of compute to develop and validate

This is false: the 3 largest advances they identify were all first developed at tiny scale. To be clear, the exact versions of these advances used in modern AIs are likely based on higher-compute experiments. But the returns from these more modern adaptations are unclear (and plausibly these adaptations could be found with small experiments using careful scaling analysis).

----------------------------------------

Separately, as far as I can tell, the experimental results in the paper shed no light on whether gains are compute-dependent (let alone predictable from small scale). Of the advances they experimentally test, only one (MQA) is identified as compute-dependent. They find that MQA doesn't improve loss (at small scale). But this isn't how MQA is supposed to help: it is supposed to improve inference efficiency, which they don't test! So these results only confirm that a bunch of innovations (RoPE, FA, LN) are in fact compute-independent.

Ok, so does MQA improve inference at small scale? The paper says:

> At the time of its introduction in 2019, MQA was tested primarily on small models where memory constraints were not a major concern. As a result, its benefits were not immediately apparent. However, as model sizes grew, memory efficiency became increasingly important, making MQA a crucial optimization in modern LLMs

Memory constraints not being a major concern at small scale doesn't mean MQA didn't help then (at the time, I think people just didn't care as much about inference efficiency, especially decoder inference efficiency). Separately, the inference performance improvements at large scale are easily predictable with first-principles analysis (see the sketch after this comment)! The post misses all of this by saying:

> MQA, then, by providing minimal benefit at small scale, but much larger benefit at larger scales —is a great example of the more-general class of a compute-dependent innovation.

I think it's actually unclear if there was minimal benefit at small scale—maybe people just didn't care much about (decoder) inference efficiency at the time—and further, the inference efficiency gain at large scale is easily predictable, as I noted! The post says:

> compute-dependent improvements showed minimal benefit or actually hurt performance.

But, as I've noted, they only empirically tested MQA, and those results are unclear! The transformer is well known to be a huge improvement even at very small scale. (I'm not sure about MoE.)

----------------------------------------

FAQ:

Q: Ok, but surely the fact that returns often vary with scale makes small-scale experiments less useful?

A: Yes, returns varying with scale would reduce predictability (all else equal), but by how much? If returns improve in a predictable way, that would be totally fine. Careful science could (in principle) predict big gains at large scale despite minimal or negative gains at small scale.

Q: Ok, sure, but if you actually look at modern algorithmic secrets, they are probably much less predictable from small to large scale. (Of course, we don't know that much with public knowledge.)

A: Seems quite plausible! In this case, we're left with quantitative questions of how predictable things are, whether we can identify in advance if something will be predictable, and whether there are enough areas of progress which are predictable.

----------------------------------------

Everyone agrees compute is a key input; the question is just how far massively accelerated, much more capable, and vastly more prolific labor can push things.

----------------------------------------

This was also posted as a (poorly edited) tweet thread here.

----------------------------------------

1. While 1e19 FLOP is around the scale of the final runs they included in each of these papers, these advances are pretty likely to have been initially found at (slightly) smaller scale, like maybe 5-100x lower FLOP. The larger runs were presumably helpful for verifying the improvement, though I don't think they were clearly essential; probably you could have instead done a bunch of careful scaling analysis. ↩︎

2. Also, it's worth noting that Transformer, MoE, and MQA are selected for being large single advances, making them unrepresentative. Large individual advances are probably typically easier to identify, making them more likely to be found earlier (and at smaller scale). We'd also expect large single improvements to be more likely to exhibit returns over a large range of different scales. But I didn't pick these examples; they were just the main examples used in the paper! ↩︎
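As an aside on the "easily predictable with first-principles analysis" point, here is a minimal sketch of the kind of byte-counting involved. The model shape below is an assumed, hypothetical 70B-class configuration, not any particular model; the point is only that the KV-cache saving from sharing K/V heads falls straight out of arithmetic that works at any scale.

```python
# Rough first-principles estimate of KV-cache memory per token.
# MHA stores K and V for every attention head; MQA shares a single K/V head.
# All numbers below are illustrative assumptions, not measurements.

def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Factor of 2 because both K and V are cached; bytes_per_value=2 assumes fp16/bf16.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

n_layers, n_heads, head_dim = 80, 64, 128      # assumed model shape
mha = kv_cache_bytes_per_token(n_layers, n_heads, head_dim)  # every head keeps K/V
mqa = kv_cache_bytes_per_token(n_layers, 1, head_dim)        # one shared K/V head

ctx = 8192  # tokens of context
print(f"MHA KV cache @ {ctx} tokens: {mha * ctx / 1e9:.1f} GB")   # ~21.5 GB
print(f"MQA KV cache @ {ctx} tokens: {mqa * ctx / 1e9:.2f} GB")   # ~0.34 GB
```

At small model sizes and short contexts the same ratio holds but the absolute numbers are tiny, which is why the benefit can look invisible in a loss-only, small-scale comparison.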

Recent Discussion

In my previous post in this series, I explained why we urgently need to change AI developers’ incentives: if we allow the status quo to continue, then an AI developer will recklessly deploy misaligned superintelligence, which is likely to permanently disempower humanity and cause billions of deaths. AI governance research can potentially be helpful in changing this status quo, but only if it’s paired with plenty of political advertising – research by itself doesn’t automatically convince any of the people who have the power to rein in AI developers.

Executive Summary

Here, in the third post, I want to make it clear that we are not doing nearly enough political advertising to successfully change the status quo. By my estimate, we have at least 3 governance researchers for every...

It was a cold and cloudy San Francisco Sunday. My wife and I were having lunch with friends at a Korean cafe.

My phone buzzed with a text. It said my mom was in the hospital.

I called to find out more. She had a fever, some pain, and had fainted. The situation was serious, but stable.

Monday was a normal day. No news was good news, right?

Tuesday she had seizures.

Wednesday she was in the ICU. I caught the first flight to Tampa.

Thursday she rested comfortably.

Friday she was diagnosed with bacterial meningitis, a rare condition that affects about 3,000 people in the US annually. The doctors had known it was a possibility, so she was already receiving treatment.

We stayed by her side through the weekend. My dad spent every night...

That seems nice. I have not acquired steadfastness (yet (growth mindset?)) but perhaps "find things from which I could justifiably draw steadfastness as a resulting apparent trait" would be a useful tactic to try to apply. I have mostly optimized for flexibility, such as to be able to react to whatever happens, and then be able to nudge everything closer back towards The Form Of The Good... but the practical upshot doesn't look like steadfastness from the outside, I don't think.

Mom would have approved of less "apparent chaos from a distance without the abi...

Claude Sonnet 4 and Claude Opus 4 are out. Anthropic says they're both state-of-the-art for coding. Blogpost, system card.

Anthropic says Opus 4 may have dangerous bio capabilities, so it's implementing its ASL-3 standard for misuse-prevention and security for that model. (It says it has ruled out dangerous capabilities for Sonnet 4.) Blogpost, safety case report. (RSP.)

Tweets: Anthropic, Sam Bowman, Jan Leike.

Claude 3.7 Sonnet has been retconned to Claude Sonnet 3.7 (and similarly for other models).

Summary: We have previously argued that scalable oversight methods can be used to provide guarantees on low-stakes safety – settings where individual failures are non-catastrophic. However, if your reward function (e.g. honesty) is compatible with many possible solutions, then you also need to avoid having free parameters exploited over time. We call this the exploitable search problem. We propose a zero-sum game where, at equilibrium, free parameters are not exploited – that is, our AI systems are carrying out an unexploitable search.

 

Figure (conceptual illustration using fabricated data). Left: entropy penalties diversify trajectories but still yield high unobserved malicious reward S. Right: exploitable search games are designed to bound the probability of high S.
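To make the left/right contrast concrete, here is a toy numerical illustration (my own fabricated numbers and code, not the game proposed in the post): among many solutions the overseer scores identically, a free tie-breaking parameter can be steered adversarially, while anything that forces a near-uniform choice over the reward-optimal set bounds the probability of landing on the malicious one.

```python
# Toy illustration of the exploitable search problem (fabricated data).
import random

N_SOLUTIONS = 1000
observed_reward = [1.0] * N_SOLUTIONS   # all solutions pass the honesty/quality check
hidden_S = [0.0] * N_SOLUTIONS
hidden_S[137] = 1.0                     # one solution is secretly malicious

# Exploitable search: a scheming policy is free to break ties however it likes.
adversarial_pick = max(range(N_SOLUTIONS),
                       key=lambda i: (observed_reward[i], hidden_S[i]))
print(hidden_S[adversarial_pick])       # 1.0 -- malicious with probability 1

# Unexploitable search (the property the zero-sum game aims to enforce): the choice
# among reward-equivalent solutions is effectively uniform, so
# P(malicious) <= (# malicious) / (# reward-optimal) = 1/1000.
random_pick = random.randrange(N_SOLUTIONS)
print(hidden_S[random_pick])            # almost always 0.0; P(malicious) = 0.001
```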

The exploitable search problem

We would like to be able to use AI on under-specified tasks (e.g. coding and research advice) where there are...

I see two senses in which research sabotage is different from the usual sandbagging issue:

  1. ELK, i.e. when explanations are hard to find: a malicious solution may be high-probability for a model to find, while presenting a human-recognizable argument for it is hard. Note that you might be concerned that our classifiers have to do this as well, but that's not the case: the classifiers could work indirectly by forcing a spread of probability mass across distinct solutions.

  2. High-stakes, multi-shot settings: Let's say the world is such that it's feasible to identify

...
2ryan_greenblatt
I don't agree it shares these problems any more than the proposal you discuss shares these problems:

----------------------------------------

I don't love KL to a base model because we might be worried the base model is scheming, or, if it isn't likely to be scheming, it might be much less capable than humans.
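For readers unfamiliar with the regularizer being discussed: "KL to a base model" usually means penalizing the policy for drifting from a trusted base model's distribution, as in the standard RLHF-style per-trajectory KL penalty sketched below. This is a generic illustration under assumed tensor shapes, not anyone's specific proposal in this thread.

```python
import torch

def kl_regularized_reward(task_reward: torch.Tensor,
                          policy_logprobs: torch.Tensor,
                          base_logprobs: torch.Tensor,
                          beta: float = 0.1) -> torch.Tensor:
    """task_reward: (batch,); logprobs: (batch, seq_len) log-probabilities of the
    sampled tokens under the policy and the base model respectively."""
    # Monte Carlo estimate of KL(policy || base) along each sampled trajectory.
    kl_estimate = (policy_logprobs - base_logprobs).sum(dim=-1)
    # Larger beta keeps the search closer to the base model; smaller beta lets reward dominate.
    return task_reward - beta * kl_estimate
```

The objection above is about the anchor rather than the mechanics: the guarantee is only as good as the base model being anchored to.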
1Jacob Pfau
Thanks for the push; I previously didn't click through to your post, and after doing so I realized you're suggesting something different from what I'd assumed. From a skim, the immediate concerns with your DAgger-like RL setup are that you are bottlenecked past human capability level and that you introduce a new need for online sampling from humans, as you mention in the post. For the AI R&D setting (AGI-level capabilities) I have in mind, these are not affordances I want to assume we have. If, counterfactually, we went ahead with assuming cheap access to sufficiently capable humans, then I could imagine being convinced the linked method is preferable. Two points that seem relevant for your method: (1) sample efficiency of your method w.r.t. the human demonstrations; (2) time complexity of training away a malign initialization (e.g. the first solution found imports an insecure package in its first chunk).
2ryan_greenblatt
Sure, but this is assuming the type of max-ent optimality we wanted to get away from by using a restricted set of classifiers, if I understand correctly?

Podcast version (read by the author) here, or search for "Joe Carlsmith Audio" on your podcast app.

1. Introduction

Currently, most people treat AIs like tools. We act like AIs don’t matter in themselves. We use them however we please.

For certain sorts of beings, though, we shouldn’t act like this. Call such beings “moral patients.” Humans are the paradigm example. But many of us accept that some non-human animals are probably moral patients as well. You shouldn’t kick a stray dog just for fun.[1]

Can AIs be moral patients? If so, what sorts of AIs? Will some near-term AIs be moral patients? Are some AIs moral patients now?

If so, it matters a lot. We’re on track to build and run huge numbers of AIs. Indeed: if hardware and deployment scale fast...

Valdes

Small nitpick: "the if and only if" is false. It is perfectly possible to have an AI that doesn't want any moral rights and is misaligned in some other way.

xpym
It seems clear enough to me that pretty much everybody is hopelessly confused about these issues, and sees no promising avenues for quick progress. On that note, I'm interested in your answer to Connor Leahy's question in a comment to the linked post: "What kind of questions can you make progress on without constant grounding and dialogue with reality? This is the default of how we humans build knowledge and solve hard new questions, the places where we do best and get the least drawn astray is exactly those areas where we can have as much feedback from reality in as tight loops as possible, and so if we are trying to tackle ever more lofty problems, it becomes ever more important to get exactly that feedback wherever we can get it!" I agree with his perspective, and am curious where and why you disagree.
cubefox
There is no "citation" that anyone but myself feels pain. It's the "problem of other minds". After all, anyone could be a p-zombie, not just babies, animals, AIs...
xpym
This attitude presupposes that circumstances in which human cultures find themselves can't undergo quick and radical change. Which would've been reasonable for most of history — change had been slow and gradual enough for cultural evolution to keep up with. But the bald fact that no culture ever had to deal with anything like AGI points to the fatal limitation of this approach — even if people happen to have consistent moral intuitions about it a priori (which seems very unlikely to me), there's no good reason to expect those intuitions to survive actual contact with reality.

This is the video and transcript of a talk I gave on AI welfare at Anthropic in May 2025. The slides are also available here. The talk gives an overview of my current take on the topic. I'm also in the midst of writing a series of essays about it, the first of which -- "On the stakes of AI moral status" -- is available here (podcast version, read by the author, here). My takes may evolve as I do more thinking about the issue.

Hi everybody. Thanks for coming. So: this talk is going to be about AI welfare. About whether AIs have welfare, moral status, consciousness, that kind of thing. How to think about that, and what to do in light of reasonable credences about...

Last week I stumbled over Dimensional Analysis, which is not only useful for applied fields (physics, biology, economics) but also for math. Why did no one tell me that you can almost always think of df/dx as having "the type of f"/"the type of x"? Or that exponents always have to be unit-less, etc.? It had never occurred to me to make this distinction. In my mind, f(x)=x went from the reals to the reals, just like every other function did.

One example of a question that I previously would have had to think slowly about: What is the type of the standard deviation of a distribution? What is the type of the z-score of a sample?

Answer

The standard deviation has the unit of the random variable X, while the z-score is dimensionless: the units cancel in (x − μ)/σ.

...
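A minimal dimensional-analysis check of the answer above, writing [X] for the unit of the random variable X:

```latex
\[
\operatorname{Var}(X) = \mathbb{E}\!\left[(X-\mu)^2\right]
\;\Rightarrow\; \bigl[\operatorname{Var}(X)\bigr] = [X]^2,
\qquad
[\sigma] = \sqrt{[X]^2} = [X],
\qquad
[z] = \left[\frac{x-\mu}{\sigma}\right] = \frac{[X]}{[X]} = 1.
\]
```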

> think of df/dx as having "the type of f"/"the type of x"

I expect you learned calculus the wrong way, in a math class instead of in physics. That's the point of the notation, and the key reason it's an improvement over something like f′(x) or ẋ!

Epistemic status: shower thought quickly sketched, but I do have a PhD in this.

As we approach AGI and need to figure out what goals to give it, we will need to find tractable ways to resolve moral disagreement. One of the most intractable moral disagreements is between the moral realists and the moral antirealists.

There's an oversimplified view of this disagreement that goes:

  • If you're a moral realist, you want to align AGI to the best moral-epistemic deliberative processes you can find to figure out what is right
  • If you're a moral antirealist and you're a unilateralist, you want to stage a coup and tile the world with your values
  • If you're a moral antirealist and you're cooperative, you want to align AGI to a democratic process that locks in whatever
...

Another question to ask, even assuming faultless convergence, related to uniqueness, is whether the process of updates has an endpoint at all.

That is, I could imagine that there exists a series of arguments that would convince someone who believes X to believe Y, and another set that would convince someone who believes Y to believe X. If both of these sets of arguments are persuasive even after someone has changed their mind before, we have a cycle which is compatible with faultless convergence but has no endpoint.

Charlie Steiner
I feel sad that your hypotheses are almost entirely empirical, but seem like they include just enough metaethically-laden ideas that you have to go back to describing what you think people with different commitments might accept or reject.

My checklist:
* Moral reasoning is real (or at least, the observables you gesture towards could indeed be observed, setting aside the interpretation of what humans are doing)
* Faultless convergence is maybe possible (I'm not totally sure what observables you're imagining - is an "argument" allowed to be a system that interacts with its audience? If it's a book, do all people have to read the same sequence of words, or can the book be a choose-your-own-adventure that tells differently-inclined readers to turn to different pages? Do arguments have to be short, or can they take years to finish, interspersed with real-life experiences?), but also I disagree with the connotation that this is good, that convergence via argument is the gold standard, that the connection between being changed by arguments and sharing values is solid rather than fluid.
* No Uniqueness
* No Semi-uniqueness
* Therefore Unification is N/A