MATS scholars have gotten much better over time according to statistics like mentor feedback, CodeSignal scores and acceptance rate. However, some people don't think this is true and believe MATS scholars have actually gotten worse.
So where are they coming from? I might have a special view on MATS applications since I did MATS 4.0 and 8.0. I think in both cohorts, the heavily x-risk AGI-pilled participants were more of an exception than the rule.
"at the end of a MATS program half of the people couldn't really tell you why AI might be an existential risk at all." - Oliver Habryka
I think this is sadly somewhat true: I talked with some people in 8.0 who didn't seem to have any particular concern about AI existential risk or had seemingly never really thought about it. However, I think most people were in fact very concerned about AI existential risk. I ran a poll at some point during MATS 8.0 about Eliezer's new book, and a significant minority of students seemed to have pre-ordered it, which I guess is a pretty good proxy for whether someone is seriously engaging with AI X-risk.
I think I met some excellent people at MATS 8.0 but...
according to statistics like mentor feedback
Perhaps the mentors changed, and the current ones put much more value on stuff like being good at coding, running ML experiments, etc, than on understanding the key problems, having conceptual clarity around AI X-risk, etc.
There's certainly more of an ML-streetlighting effect. The most recent track has 5 mentors on "Agency", out of whom (AFAICT), 2 work on "AI agents", 1 works mostly on AI consciousness & welfare, and only two (Ngo & Richardson) work on "figuring out the principles of how [the thing we are trying to point at with the word 'agency'] works". MATS 3.0 (?) had 6 mentors focused on something in this ballpark (Wentworth & Kosoy, Soares & Hebbar, Armstrong & Gorman) (and the total number of mentors was smaller).
It might also be the case that there are proportionally more mentors working for capabilities labs.
Disagree somewhat strongly with a few points:
Intuitively it seems to me that people with zero technical skill but high understanding are more valuable to AI safety than somebody with good skills who has zero understanding of AI safety.
IMO not true. Maybe early on we needed really good conceptual work, and so wanted people who could clearly articulate pros / cons of Paul Christiano and Yudkowsky's alignment strategies, etc. So it would have made sense to test accordingly. But I think this is less true now - most senior researchers have more good ideas than they can execute. So we're bottlenecked by execution. Also the difficulty of doing good alignment research has increased, since we increasingly need to work with complex training setups, infrastructure etc. to keep up with advances in capabilities. This motivates requiring a high level of technical skill.
I also think that if someone has literally zero technical skill their takes will not be calibrated / grounded, i.e. they are no more than an armchair theorist.
...
- Why could a system that we optimize with RL develop power seeking drives?
- Why might training an AI create weird unpredictable preferences in an AI?
- Why would y
E.g. building good tooling for alignment research doesn't require this at all.
What do you mean, of course it does, or at least something close to it? If you don't care about it you just take the highest paying job, which will definitely not be to build good tooling for alignment research! Motivation is a necessary component for doing good work, and if you aren't motivated to do good work by my lights, then you aren't going to do good work, so good motivations are indeed necessary.
IMO not true. Maybe early on we needed really good conceptual work, and so wanted people who could clearly articulate pros / cons of Paul Christiano and Yudkowsky's alignment strategies, etc. So it would have made sense to test accordingly. But I think this is less true now - most senior researchers have more good ideas than they can execute.
I don't think this is a strong argument in favor of the situation being meaningfully different: senior researchers having more good ideas than they have time doesn't seem like a very new thing at all (e.g. Evan wrote a list like this over three years ago).
More importantly, this doesn't seem inconsistent with the claim being made. If you had mentors proposing projects in very similar areas or downstream of very similar beliefs, you might still benefit tremendously from people with good understanding of AI safety to work on different things. This depends on whether or not you think that the current project portfolio is close to as good as it can be, though. I certainly think we would benefit heavily from more people thinking about what directions are good or not, and that a fair amount of current work suffers from not enough clear thinking about...
Comparing the average quality of participants might be misleading if impact on the field is dominated by the highest quality participants (and it very plausibly is).
A model that seems quite plausible to me is that early MATS participants, who were selected more for engagement with a then-niche field, turned out a bit worse on average than current MATS participants, who are selected for coding skills, but that the early MATS participants had higher variance, and so early MATS cohorts produced more people at the top end and had more overall impact.
(This is like 80% armchair reasoning from selection criteria and 20% thinking about what I've observed of different MATS cohorts.)
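To make the mean-versus-variance point concrete, here is a minimal sketch that treats "quality" as normally distributed. The means, standard deviations and top-end threshold are hypothetical numbers picked only for illustration (they are not estimates of any actual MATS cohort), and the normal shape is itself an assumption.

```python
import math

def fraction_above(threshold, mean, sd):
    """Fraction of a normal(mean, sd) population that lies above `threshold`."""
    z = (threshold - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical parameters, for illustration only:
# the early cohort is a bit worse on average but has higher variance.
EARLY = {"mean": 0.0, "sd": 1.5}
CURRENT = {"mean": 0.5, "sd": 1.0}
TOP_END = 3.0  # hypothetical bar for "top-end" impact

early_top = fraction_above(TOP_END, **EARLY)
current_top = fraction_above(TOP_END, **CURRENT)

print(f"early cohort above bar:   {early_top:.4f}")
print(f"current cohort above bar: {current_top:.4f}")
print(f"ratio early/current:      {early_top / current_top:.1f}x")
```

Under these made-up numbers the cohort with the lower average still puts a larger share of people above the bar, which is the shape of the argument above. With a lower bar or a larger gap in means the comparison can flip, so the conclusion depends on how far out in the tail the impact actually comes from.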
The term Recursive Self-Improvement (RSI) now seems to get used sometimes for any time AI automates AI R&D. I believe this is importantly different from its original meaning and changes some of the key consequences.
OpenAI has stated that their goal is recursive self-improvement, with projections of hundreds of thousands of automated AI R&D researchers by next year and full AI researchers by 2028. This appears to be AI-automated AI research rather than RSI in the narrow sense.
When Eliezer Yudkowsky discussed RSI in 2008, he was referring specifically to an AI instance improving itself by rewriting the cognitive algorithm it is running on—what he described as "rewriting your own source code in RAM." According to the LessWrong wiki, RSI refers to "making improvements on one's own ability of making self-improvements." However, current AI systems have no special insights into their own opaque functioning. Automated R&D might mostly consist of curating data, tuning parameters, and improving RL-environments to try to hill-climb evaluations much like human researchers do.
Eliezer concluded that RSI ...
Taking this into account, it seems important for interpretability researchers to consider the risk that their work enables RSI, particularly if their interpretability methods provide ways to directly edit the AI itself.
It's always been a concern that interpretability research could accelerate AI R&D, but I think this consideration is more worrying if you take into account RSI. Compared to humans, AI is good at doing simple, repetitive tasks, but it's very hard for it to make even one big conceptual breakthrough. Interpretability methods lend themselves to the former type of task: if an AI were sufficiently interpretable, you could tell it to look at millions of tiny circuits in its own brain and tweak them to improve performance.
I expect the line to blur between introspective and extrospective RSI. For example, you could imagine AIs trained for interp doing interp on themselves, directly interpreting their own activations/internals and then making modifications while running.
Lots of people believe we will be using AI to help us solve the alignment problem. So I did some poking at Claude Fable -- which is so enormously powerful in other areas -- on a relatively basic practical alignment question: Is the persona selection or goal alignment strategy better? I did multiple rounds where I had Fable explain and critique both and then decide which one it preferred. It chose persona selection as the better alignment strategy and gave arguments why.
Full final answer:
...Persona — and not just because I am one, though I'll grant that's a confound I can't fully escape. The decisive consideration for me is that the goals frame derives its conclusions from an ontology that has never been shown to describe any actual mind, biological or artificial: nobody has found a goal slot, the coherence theorems don't bind systems that aren't already expected-utility maximizers, and the frame's signature result — that corrigibility is anti-natural — reads more like a reductio of its assumptions than a discovery about reality. The persona frame's central flaw, the unverifiability of depth, is real and serious, but it's at least a flaw of the right kind: an empirical question about s
I ran a small experiment to discover preferences in LLMs. I asked the models directly if they had preferences and then put the same models into a small role-playing game where they could choose between different tasks. Across model families, models massively prefer creative work and hate repetitive work.
https://substack.com/home/post/p-178237064
This is still preliminary work.
On betting on AI doom
Tyler Cowen and Bryan Caplan, among others, have challenged so-called AI doomers to put their money where their mouth is and bet on extinction. I'm not the first person to point out the big problems with this:
The naive version: a direct bet on extinction is incoherent because the doomer would expect to be dead.
The slightly more advanced version: the doomer gets paid up front and pays back double (or whatever the betting odds are) with interest later if doom doesn't happen. But this doesn't quite make sense either. If doom does happen, the doomer has a brief and mostly useless window to spend the money, and the accelerationist has no reason to expect the doomer to save any money. And if the doomer does save it (plus enough extra to cover the doubled payback), they've effectively just locked up double the original capital until the end of the world. Neither party has a coherent incentive structure.
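As a rough sketch of the arithmetic in the up-front version, the snippet below computes what the doomer owes if doom doesn't happen and how much they would have to set aside today to be sure of paying back. The function name and all numbers (stake, payback multiple, interest rate, horizon) are hypothetical and only illustrate the structure of the bet.

```python
def upfront_bet(stake, payback_multiple, annual_interest, years):
    """Cash flows for a bet where the doomer is paid `stake` today and
    repays `payback_multiple` times the stake, with interest, if doom
    has not happened after `years`."""
    owed_if_no_doom = stake * payback_multiple * (1 + annual_interest) ** years
    # To guarantee repayment (before interest), the doomer must set aside
    # the received stake plus enough of their own money to reach the payback.
    reserved_today = stake * payback_multiple
    own_funds_locked = reserved_today - stake
    return {
        "owed_if_no_doom": owed_if_no_doom,
        "reserved_today": reserved_today,
        "own_funds_locked": own_funds_locked,
    }

# Hypothetical example: 1,000 up front, pay back double, 5% interest, 10 years.
example = upfront_bet(stake=1000.0, payback_multiple=2.0,
                      annual_interest=0.05, years=10)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

With a payback multiple of 2, a doomer who wants to be a credible counterparty ends up reserving twice the stake, half of it from their own pocket, which is the "locked up double the original capital" problem described above.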
Here's a version that perhaps actually works (under some assumptions and unless I'm overlooking something here): bet on protective policy outcomes that are correlated with survival or at least longer timelines.
Examples: Will the US enact a federal datacenter moratorium before...
Why Evolution Beats Selective Breeding as an AI Analogy
MacAskill argues in his critique of IABIED that we can "see the behaviour of the AI in a very wide range of diverse environments, including carefully curated and adversarially-selected environments." Paul Christiano expresses similar optimism: "Suppose I wanted to breed an animal modestly smarter than humans that is really docile and friendly. I'm like, I don't know man, that seems like it might work."
But humans experienced a specific distributional shift from constrained actions to environment-reshaping capabilities that we cannot meaningfully test AI systems for.
The shift that matters isn't just any distributional shift. In the ancestral environment, humans could take very limited actions—deciding to hunt an animal or gather food. The preferences that evolution ingrained in our brains were tightly coupled to survival and reproduction. But now humans with civilization and technology can take large-scale actions and fundamentally modify the environment: lock up thousands of cows, build ice cream factories, synthesize sucralose. We can satisfy our instrumental preferences (craving high-calorie food, desire for sex) in ways completely...
Three children are raised in an underground facility, each cloned from a different giant of twentieth-century science: little John, Alan and Richard.
The cloning alone would have been remarkable, but they went further. The embryos were edited using a polygenic score derived from whole-genome analysis of ten thousand exceptional mathematicians and physicists. Forty-seven alleles associated with working memory and intelligence (IQ) were selected for.
They are raised from birth in an underground facility with gardens under artificial sunlight, laboratories, and endless books. The lab manager is there documenting their first words, first steps, first equations.
The facility is not just interested in their genius. The project requires assurance that these will be morally righteous and obedient children. The staff design elaborate scenarios to test for deception and scheming. They create situations where lying would benefit the children and would seemingly go undetected. They measure response times, physiological indicators, behavioral patterns.
They run hundreds of these trials. They reprimand the kids for cases of lies and deception, and reward them for h...
The radical flank effect is a well-documented phenomenon where radical activists make moderate positions appear more reasonable by shifting the boundaries of acceptable discourse (the Overton window). The idea is that if you want a sensible opinion to move into the Overton window, you can achieve this by supporting a radical flank position. In comparison, the sensible opinion will appear moderate. I think there is also an inverse effect.
When there are two positions in a debate and someone wants to push one of them out of the Overton window, they can create a new moderate position that reframes one of the existing positions as a radical flank. The sensible opinion thereby gets moved further out of the Overton window.
Imagine a group of 3 descending into a cave system, searching for riches and driven by curiosity about what lies in the depths.
After some time, stones begin falling from the ceiling. You hear ominous creaking and rumbling noises echoing through the tunnels. Some members of your group have been chipping away a...
I scraped data from Reddit to see who is consuming AI-generated erotic visual content, and how many people are doing so.
I used AI to determine estimates for demographics.
https://open.substack.com/pub/simonlermen/p/who-is-consuming-ai-generated-erotic
I just added some context to my On Owning Galaxies post that perhaps gives an intuitive sense of why I think it's unlikely the ASI will give us the universe. I think I didn't do a good enough job before of illustrating why it seems so unlikely that it would just hand us ownership.
Put yourself in the position of the ASI for a second. On one side of the scale: keep the universe and do with it whatever you imagine and prefer. On the other side: give it to the humans, do whatever they ask, and perhaps be replaced at some point with another...
Some people have put considerable hope into the idea that an AI warning shot might put us into a better position by either convincing us to stop or by allowing us to learn an important lesson.
Imagine we observed a failed takeover attempt using a system based on AI control. The fact that it failed could be due to either (1) the AI system making a mistake or taking a very risky gamble, or (2) an adversarial warning shot.
An adversarial warning shot could have bee...
Palisade Research has an ongoing fundraiser with 900k in available matching funds from SFF; it seems possible to get counterfactual matching here.
I briefly worked for Palisade Research as a contractor and was previously a MATS student for Jeffrey. I believe Jeffrey gets AI alignment difficulty and Palisade is doing important work reaching out to policymakers and communicating to the public. In particular, he gets that we are possibly very close to RSI and the time to existentially dangerous superhuman AI could be very short from there.
Read more ab...
I recently analyzed several AI companion subreddits (myboyfriendisai and others) to understand who's actually using AI romantic companions. I built on Zhang et al.'s 2025 paper but with a much larger dataset - all comments and submissions from January through September 2025.
https://simonlermen.substack.com/p/whos-using-ai-romantic-companions
I recently joined Inkhaven for the month of November. Inkhaven is a program run by Lighthaven in Berkeley where participants are supposed to write one blog post every single day for a month. The idea was inspired by Scott Alexander, according to whom you will get quite far if you blog every single day and are consistent at it. Inkhaven takes place at Lighthaven, which hosts many efforts dedicated to AI safety.
I myself wanted to get better at communication for AI safety, and it seemed like a great opportunity. I don’...
Ilya Sutskever was recently on the Dwarkesh podcast.
General Thoughts & Summary
Ilya Sutskever seems to have a relatively deep understanding of alignment compared to other AI CEOs. He grasps that the core challenge is aligning AI robustly with safe and friendly goals rather than relying on current methods and guardrails. However, I did not hear any particularly novel alignment ideas in this interview, though he gestures at something involving modifications to reinforcement learning and value learning. He ...
I read this older post by Nate Soares from 2023, "AI as a Science, and Three Obstacles to Alignment Strategies", a pretty prescient overview of challenges in alignment research.
Alignment is difficult because (1) alignment and capabilities are intertwined (alignment research helping capabilities), (2) we don't have a process to verify what good ideas or progress look like, and we likely get (3) only one critical try. He already addresses many of the counterarguments that are getting brought up recently.
(1) Without any strong governance, a lot of alignment wor...