Alex Turner argues that the concepts of "inner alignment" and "outer alignment" in AI safety are unhelpful and potentially misleading. He contends that these concepts decompose one hard problem (AI alignment) into two extremely hard problems, and that they run against natural patterns of cognition formation. He also argues that approaches based on "robust grading" schemes are unlikely to produce aligned AI.
I like stories where characters wear suits.
Since I like suits so much, I realized that I should just wear one.
The result has been overwhelmingly positive. Everyone loves it: friends, strangers, dance partners, bartenders. It makes them feel like they're in a Kingsman film. Even teenage delinquents and homeless beggars love it. The only group that gives me hateful looks is the radical socialists.
If you wear a suit in a casual culture, people will ask "Why are you wearing a suit?" This might seem to imply that you shouldn't wear a suit. Does...
No, you are missing the point.
I'm banning you from commenting on my posts on the grounds that your comments are, on tone alone, argumentative rather than constructive. This has nothing to do with whether you are correct.
Outlive: The Science & Art of Longevity by Peter Attia (with Bill Gifford[1]) gives Attia's prescription for how to live longer and stay healthy into old age. In this post, I critically review some of the book's scientific claims that stood out to me.
This is not a comprehensive review. I didn't review assertions that I was pretty sure were true (ex: VO2 max improves longevity), or that were hard for me to evaluate (ex: the mechanics of how LDL cholesterol functions in the body), or that I didn't care about (ex: sleep deprivation impairs one's ability to identify facial expressions).
First, some general notes:
Thanks for the kind words!
I didn't discuss this in my review because I didn't really have anything to say about it, but Outlive talks about some "technologically advanced" longevity interventions (IIRC rapamycin got the most attention), and it concluded that none of them were that well-supported, and the best longevity interventions are still the obvious things (exercise; avoiding harmful activities like smoking; healthy diet; maybe sleep*).
But I will say that I'd guess that a lifetime of exercise does buy you >1 year of life expectancy, see footnote 59...
Glenn Beck is the only popular mainstream news host who takes AI safety seriously. I am being entirely serious. For those of you who don't know, Glenn Beck is one of the most trusted and well-known news sources among American conservatives.
Over the past month, he has produced two hour-long segments, one of which was an interview with AI ethicist Tristan Harris. At no point in any of this does he express incredulity at the ideas of AGI, ASI, takeover, extinction risk, or transhumanism. He says things that are far out of the normie Overton Window, with no attempt to equivocate or hedge his bets. "We're going to cure cancer, and we're going to do it right before we kill all humans on planet Earth". He just says things...
Nope! I think it's great now. In fact I did it myself already. And in fact I was probably wrong two years ago.
Summary: We found that LLMs exhibit significant race and gender bias in realistic hiring scenarios, but their chain-of-thought reasoning shows zero evidence of this bias. This serves as a nice example of a 100% unfaithful CoT "in the wild" where the LLM strongly suppresses the unfaithful behavior. We also find that interpretability-based interventions succeeded while prompting failed, suggesting this may be an example of interpretability being the best practical tool for a real-world problem.
For context on our paper, the tweet thread is here and the paper is here.
Chain of Thought (CoT) monitoring has emerged as a popular research area in AI safety. The idea is simple: have the AIs reason in English text when solving a problem, and monitor the reasoning for misaligned...
Educated people on the internet tend to be left-leaning, so when you train the model to write like an educated person, it also ends up inheriting left-leaning views.
I think it's not just this; the other traits promoted in post-training (e.g. harmlessness training) are probably also correlated with left-leaning content on the internet.
This is more speculative and confusing than my typical posts, and I also think its content could be substantially improved with more effort. But it's been sitting around in my drafts for a long time and I sometimes want to reference the arguments in it, so I thought I would go ahead and post it.
I often speculate about how much progress you get in the first year after AIs fully automate AI R&D within an AI company (if people try to go as fast as possible). Natural ways of estimating this often involve computing algorithmic research speed-up relative to prior years where research was done by humans. This somewhat naturally gets you progress in units of effective compute — that is, as defined by...
I wonder if you can convert the METR time horizon results into SD / year numbers. My sense is that this will probably not be that meaningful because AIs are much worse than mediocre professionals while having a different skill profile, so they are effectively out of the human range.
If you did a best-effort version of this by looking at software engineers who struggle to complete longer tasks like the ones in the METR benchmark(s), I'd wildly guess that a doubling in time horizon is roughly 0.7 SD, which predicts ~1.2 SD / year.
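As a rough sketch of that arithmetic (assuming METR's reported time-horizon doubling time of roughly 7 months; the 0.7 SD per doubling is the wild guess above):

```python
# Back-of-the-envelope conversion of METR time-horizon growth into SD / year.
# Assumptions: METR's reported doubling time of ~7 months, and the wild guess
# above that one doubling in time horizon corresponds to ~0.7 SD of human skill.
doubling_time_months = 7
sd_per_doubling = 0.7
doublings_per_year = 12 / doubling_time_months      # ~1.7 doublings per year
print(f"{doublings_per_year * sd_per_doubling:.1f} SD / year")  # ~1.2
```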
Not saying we should pause AI, but consider the following argument:
What has humanity done with surplus people at every single opportunity that has presented itself? There's your argument.
Multiple people have asked me whether I could post this on LW in some form, hence this linkpost.
~17,000 words. Originally written on June 7, 2025.
(Note: although I expect this post will be interesting to people on LW, keep in mind that it was written for a broader audience than my posts and comments here. This had various implications for my choices of presentation and tone, for which things I explained from scratch rather than assuming as background, for my level of comfort casually reciting factual details from memory rather than explicitly checking them against the original source, etc.
Although, come to think of it, this was also true of most of my early posts on LW [which were crossposts from my blog], so maybe it's not a big deal...)
I suspect that many of the things you've said here are also true for humans.
That is, we humans often conceptualize ourselves in terms of underspecified identities. Who am I? I'm Richard. What's my opinion on this post? Well, being "Richard" doesn't specify how I should respond to this post. But let me check the cached facts I believe about myself ("I'm truth-seeking"; "I'm polite") and construct an answer which fits well with those facts. A child might start off not really knowing what "polite" means, but still wanting to be polite, and gradually flesh out wh...
In this post I want to highlight a small puzzle for causal theories of mechanistic interpretability. It purports to show that causal abstractions do not generally correctly capture the mechanistic nature of models.
Consider the following causal model $\mathcal{M}$: it takes two binary inputs $X_1$ and $X_2$ and computes the output $Y := X_1$, ignoring $X_2$.
Assume for the sake of argument that we only consider two possible inputs: $(X_1, X_2) = (0, 0)$ and $(X_1, X_2) = (1, 1)$; that is, $X_1$ and $X_2$ are always equal.[1]
In this model, it is intuitively clear that $X_1$ is what causes the output $Y$, and $X_2$ is irrelevant. I will argue that this obvious asymmetry between $X_1$ and $X_2$ is not borne out by the causal theory of mechanistic interpretability.
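To make the asymmetry concrete, here is a minimal sketch of $\mathcal{M}$ under the reconstruction above (the variable names and functional form are assumptions, not necessarily the post's original diagram):

```python
# Sketch of the causal model M as reconstructed above: Y := X1, with X2 ignored.
def M(x1: int, x2: int) -> int:
    return x1  # X2 plays no mechanistic role in producing Y

# On the restricted input distribution, the two inputs are always equal,
# so X1 and X2 are both perfectly correlated with the output...
observed = [M(0, 0), M(1, 1)]        # [0, 1]

# ...but interventions expose the asymmetry: flipping X1 flips Y,
# flipping X2 changes nothing.
print(M(1, 0), M(0, 1))              # 1 0  -> only X1 matters
```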
Consider the following causal model $\mathcal{M}'$: it has a single binary input $Z$ and computes the output $Y' := Z$.
Is $\mathcal{M}'$ a valid causal abstraction of the computation that goes on in $\mathcal{M}$? That seems to depend on whether $Z$ corresponds to $X_1$ or to $X_2$. If $Z$ corresponds to $X_1$, then it seems that $\mathcal{M}'$ is a faithful representation of $\mathcal{M}$. If $Z$ corresponds to $X_2$, then $\mathcal{M}'$ is not intuitively a faithful representation of $\mathcal{M}$. Indeed, if $Z$ corresponds...
Hmm, the math isn’t rendering. Here is a rendered version: