It might be that some elements of human intelligence (at least at the civilizational level) are culturally/memetically transmitted. All well and good in theory. Except the social hypercompetition between people and the intense selection pressure on ideas online might be eroding our world's intelligence. Eliezer wonders if he's only who he is because he grew up reading old science fiction from before the current era's memes.
In my previous post in this series, I explained why we urgently need to change AI developers’ incentives: if we allow the status quo to continue, then an AI developer will recklessly deploy misaligned superintelligence, which is likely to permanently disempower humanity and cause billions of deaths. AI governance research can potentially be helpful in changing this status quo, but only if it’s paired with plenty of political advertising – research by itself doesn’t automatically convince any of the people who have the power to rein in AI developers.
Here, in the third post, I want to make it clear that we are not doing nearly enough political advertising to successfully change the status quo. By my estimate, we have at least 3 governance researchers for every...
It was a cold and cloudy San Francisco Sunday. My wife and I were having lunch with friends at a Korean cafe.
My phone buzzed with a text. It said my mom was in the hospital.
I called to find out more. She had a fever, some pain, and had fainted. The situation was serious, but stable.
Monday was a normal day. No news was good news, right?
Tuesday she had seizures.
Wednesday she was in the ICU. I caught the first flight to Tampa.
Thursday she rested comfortably.
Friday she was diagnosed with bacterial meningitis, a rare condition that affects about 3,000 people in the US annually. The doctors had known it was a possibility, so she was already receiving treatment.
We stayed by her side through the weekend. My dad spent every night...
That seems nice. I have not acquired steadfastness (yet (growth mindset?)), but perhaps "find things from which I could justifiably draw steadfastness as a resulting apparent trait" would be a useful tactic to try. I have mostly optimized for flexibility, so as to be able to react to whatever happens and then nudge everything back closer towards The Form Of The Good... but the practical upshot doesn't look like steadfastness from the outside, I don't think.
Mom would have approved of less "apparent chaos from a distance without the abi...
Claude Sonnet 4 and Claude Opus 4 are out. Anthropic says they're both state-of-the-art for coding. Blogpost, system card.
Anthropic says Opus 4 may have dangerous bio capabilities, so it's implementing its ASL-3 standard for misuse-prevention and security for that model. (It says it has ruled out dangerous capabilities for Sonnet 4.) Blogpost, safety case report. (RSP.)
Tweets: Anthropic, Sam Bowman, Jan Leike.
Claude 3.7 Sonnet has been retconned to Claude Sonnet 3.7 (and similarly for other models).
Summary: We have previously argued that scalable oversight methods can be used to provide guarantees on low-stakes safety – settings where individual failures are non-catastrophic. However, if your reward function (e.g. honesty) is compatible with many possible solutions then you also need to avoid having free parameters exploited over time. We call this the exploitable search problem. We propose a zero-sum game where, at equilibrium, free parameters are not exploited – that is, our AI systems are carrying out an unexploitable search.
We would like to be able to use AI on under-specified tasks (e.g. coding and research advice) where there are...
I see two senses in which research sabotage is different from the usual sandbagging issue:
ELK, i.e. when explanations are hard to find: a malicious solution may be high-probability for a model to find, but presenting a human-recognizable argument for it is hard. Note that you might be concerned that our classifiers have to do this as well, but that's not the case: the classifiers could work indirectly by forcing probability mass to spread across distinct solutions.
High-stakes, multi-shot settings: Let's say the world is such that it's feasible to identify
Podcast version (read by the author) here, or search for "Joe Carlsmith Audio" on your podcast app.
Currently, most people treat AIs like tools. We act like AIs don’t matter in themselves. We use them however we please.
For certain sorts of beings, though, we shouldn’t act like this. Call such beings “moral patients.” Humans are the paradigm example. But many of us accept that some non-human animals are probably moral patients as well. You shouldn’t kick a stray dog just for fun.[1]
Can AIs be moral patients? If so, what sorts of AIs? Will some near-term AIs be moral patients? Are some AIs moral patients now?
If so, it matters a lot. We’re on track to build and run huge numbers of AIs. Indeed: if hardware and deployment scale fast...
Small nitpick: "the if and only if" is false. It is perfectly possible to have an AI that doesn't want any moral rights and is misaligned in some other way.
This is the video and transcript of a talk I gave on AI welfare at Anthropic in May 2025. The slides are also available here. The talk gives an overview of my current take on the topic. I'm also in the midst of writing a series of essays about it, the first of which -- "On the stakes of AI moral status" -- is available here (podcast version, read by the author, here). My takes may evolve as I do more thinking about the issue.
Hi everybody. Thanks for coming. So: this talk is going to be about AI welfare. About whether AIs have welfare, moral status, consciousness, that kind of thing. How to think about that, and what to do in light of reasonable credences about...
Last week I stumbled over Dimensional Analysis, which is not only useful for applied fields (physics, biology, economics) but also for math. Why did no one tell me that you can almost always think of df/dx as having "the type of f"/"the type of x"? Or that exponents always have to be unit-less? It had never occurred to me to make this distinction. In my mind, f(x)=x went from the reals to the reals, just like any other function did.
One example of a question I would previously have had to think slowly about: What is the type of the standard deviation of a distribution? What is the type of the z-score of a sample?
Answer
The standard deviation has the unit of the random variable X, while the z-score is unit-less.
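One way to see this is a rough dimensional check using the usual definitions (writing [·] for "the units of"):

$$[\operatorname{Var}(X)] = \big[\mathbb{E}[(X-\mu)^2]\big] = [X]^2, \qquad [\sigma] = \big[\sqrt{\operatorname{Var}(X)}\big] = [X],$$
$$[z] = \left[\frac{x-\mu}{\sigma}\right] = \frac{[X]}{[X]} = 1.$$

So σ inherits the unit of X, and the z-score is a pure number.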
think of df/dx as having "the type of f"/"the type of x"
I expect you learned calculus the wrong way, in a math class instead of in physics. That's the point of the notation, and the key reason it's an improvement over something like f′ or ḟ!
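For instance, as a minimal sketch: if x is a position measured in meters and t a time in seconds, the units of the derivative can be read straight off the notation,

$$\left[\frac{dx}{dt}\right] = \frac{[x]}{[t]} = \frac{\text{m}}{\text{s}},$$

whereas a bare prime or dot leaves you to remember what was differentiated with respect to what.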
Epistemic status: shower thought quickly sketched, but I do have a PhD in this.
As we approach AGI and need to figure out what goals to give it, we will need to find tractable ways to resolve moral disagreement. One of the most intractable moral disagreements is between the moral realists and the moral antirealists.
There's an oversimplified view of this disagreement that goes:
Another question to ask, even assuming faultless convergence (and related to uniqueness), is whether the process of updates has an endpoint at all.
That is, I could imagine that there exists a set of arguments that would convince someone who believes X to believe Y, and another set that would convince someone who believes Y to believe X. If both of these sets of arguments remain persuasive even after someone has changed their mind before, we have a cycle that is compatible with faultless convergence but has no endpoint.