Simon Lermen's Shortform

Simon Lermen

Simon Lermen's Shortform — LessWrong

Simon Lermen's Shortform

6th Oct 2025

1 min read

5

This is a special post for quick takes by Simon Lermen. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Simon Lermen's Shortform

5Sheikh Abdur Raheem Ali

3the gears to ascension

8Elliott Thornley (EJT)

2Simon Lermen

2Elliott Thornley (EJT)

76 comments, sorted by

top scoring

Click to highlight new comments since: Today at 12:45 PM

[-]Simon Lermen7mo8916

What's going on with MATS recruitment?

MATS scholars have gotten much better over time according to statistics like mentor feedback, CodeSignal scores and acceptance rate. However, some people don't think this is true and believe MATS scholars have actually gotten worse.

So where are they coming from? I might have a special view on MATS applications since I did MATS 4.0 and 8.0. I think in both cohorts, the heavily x-risk AGI-pilled participants were more of an exception than the rule.

"at the end of a MATS program half of the people couldn't really tell you why AI might be an existential risk at all." - Oliver Habryka

I think this is sadly somewhat true, I talked with some people in 8.0 who didn't seem to have any particular concern with AI existential risk or seemingly never really thought about that. However, I think most people were in fact very concerned about AI existential risk. I ran a poll at some point during MATS 8.0 about Eliezer's new book and a significant minority of students seemed to have pre-ordered Eliezer's book, which I guess is a pretty good proxy for whether someone is seriously engaging with AI X-risk.

I think I met some excellent people at MATS 8.0 but... (read more)

[-]Mateusz Bagiński7mo3926

according to statistics like mentor feedback

Perhaps the mentors changed, and the current ones put much more value on stuff like being good at coding, running ML experiments, etc, than on understanding the key problems, having conceptual clarity around AI X-risk, etc.

There's certainly more of an ML-streetlighting effect. The most recent track has 5 mentors on "Agency", out of whom (AFAICT), 2 work on "AI agents", 1 works mostly on AI consciousness & welfare, and only two (Ngo & Richardson) work on "figuring out the principles of how [the thing we are trying to point at with the word 'agency'] works". MATS 3.0 (?) had 6 mentors focused on something in this ballpark (Wentworth & Kosoy, Soares & Hebbar, Armstrong & Gorman) (and the total number of mentors was smaller).

It might also be the case that there's proportionally more mentors working for capabilities labs.

[-]Daniel Tan7mo21-17

Disagree somewhat strongly with a few points:

Intuitively it seems to me that people with zero technical skill but high understanding are more valuable to AI safety than somebody with good skills who has zero understanding of AI safety.

IMO not true. Maybe early on we needed really good conceptual work, and so wanted people who could clearly articulate pros / cons of Paul Christiano and Yudkowsky's alignment strategies, etc. So it would have made sense to test accordingly. But I think this is less true now - most senior researchers have more good ideas than they can execute. So we're bottlenecked by execution. Also the difficulty of doing good alignment research has increased, since we increasingly need to work with complex training setups, infrastructure etc. to keep up with advances in capabilities. This motivates requiring a high level of technical skill

I also think that if someone has literally zero technical skill their takes will not be calibrated / grounded, i.e. they are no more than an armchair theorist

Why could a system that we optimize with RL develop power seeking drives?
Why might training an AI create weird unpredictable preferences in an AI?
Why would y

... (read more)

[-]habryka7mo149

E.g. building good tooling for alignment research doesn't require this at all.

What do you mean, of course it does, or at least something close to it? If you don't care about it you just take the highest paying job, which will definitely not be to build good tooling for alignment research! Motivation is a necessary component for doing good work, and if you aren't motivated to do good work by my lights, then you aren't going to do good work, so good motivations are indeed necessary.

7Daniel Tan7mo

I think there exist people who don’t care a huge amount / feel relatively indifferent about X-risk, but with whom you can nonetheless form beneficial coalitions / make profitable transactions, useful for reducing X-risk. Building tools seems like one thing among many that can be contracted out. “If they don’t care about X-risk they must be maximally money minded” seems fallacious - those are just two different motivations in the set of all motivations, It’s possible to be neither of those. And many things can motivate someone to want to do good work - intrinsic pride in the work, intellectual curiosity, etc

8habryka7mo

I mean, both of these seem like they will be more easily achieved by helping build more powerful AI systems than by building good tooling for alignment research. Like I am not saying we can't tolerate any diversity in why people want to work on AI Alignment, but like, this is an early career training program with no accountability. Selecting and cultivating motivation is by far the best steering tool we have! We should expect that if we ignore it, people will largely follow incentive gradients, or do kind of random things by our lights.

[-]Jozdien7mo134

IMO not true. Maybe early on we needed really good conceptual work, and so wanted people who could clearly articulate pros / cons of Paul Christiano and Yudkowsky's alignment strategies, etc. So it would have made sense to test accordingly. But I think this is less true now - most senior researchers have more good ideas than they can execute.

I don't think this is a strong argument in favor of the situation being meaningfully different: senior researchers having more good ideas than they have time doesn't seem like a very new thing at all (e.g. Evan wrote a list like this over three years ago).

More importantly, this doesn't seem inconsistent with the claim being made. If you had mentors proposing projects in very similar areas or downstream of very similar beliefs, you might still benefit tremendously from people with good understanding of AI safety to work on different things. This depends on whether or not you think that the current project portfolio is close to as good as they can be though. I certainly think we would benefit heavily from more people thinking about what directions are good or not, and that a fair amount of current work suffers from not enough clear thinking about... (read more)

4Daniel Tan7mo

I agree. To be clear I support 'value alignment' tests, but that wasn't part of the original claims being made

7Jozdien7mo

I don't think this is just about value alignment. I think if people genuinely understood the arguments for why AI might go badly, they would be much less likely to work on capabilities at OpenAI—definitely far from zero, but for the subset of people who are likely to be MATS scholars, I think it would make a pretty meaningful difference.

6ChristianKl7mo

What do you mean with good idea? My general impression of the field is that we lack ideas that are likely to solve AI alignment right now. To me that suggests that good ideas are scarce.

4Daniel Tan7mo

Good = on the pareto frontier of tractable and useful I think we won’t outright ‘solve’ it (in some provable, ‘formal’ sense), for various reasons (timelines being short, alignment being hard etc) But we might get close enough in practice by making lots of incremental progress along parallel directions.

6Daniel Tan7mo

Reflecting on this a little bit: [...] I've updated somewhat - it's true that mentors should likely be given a large say in who they admit to their projects, but are also likely to be myopic (i.e. optimize solely for "get this project done"). MATS might want to counterbalance that by also optimizing for good long-term candidates (who will reduce x-risk long-term). And there probably is a lot of room to select highly value-aligned candidates without compromising much on technical skill, given that MATS receives 100x as many applications as they can accept. (Though I still think there are much better tests of value alignment, and the questions above are likely to be easy to game.)

[-]mattmacdermott7mo*188

Comparing the average quality of participants might be misleading if impact on the field is dominated by the highest quality participants (and it very plausibly is).

A model that seems quite plausible to me is that early MATS participants, who were selected more for engagement with a then-niche field, turned out a bit worse on average than current MATS participants, who are selected for coding skills, but that the early MATS participants had higher variance, and so early MATS cohorts produced more people at the top end and had more overall impact.

(This is like 80% armchair reasoning from selection criteria and 20% thinking about what I've observed of different MATS cohorts.)

5Sheikh Abdur Raheem Ali7mo

I think this also applies to other safety fellowships. There isn’t broad societal acceptance yet for the severity of the worst-case outcomes, and if you speak seriously about the stakes to a general audience then you will mostly get nervously laughed off. MATS currently has "Launch your career in AI alignment & security" on the landing page, which indicates to me that it is branding itself as a professional upskilling program, and this matches the focus on job placements for alumni in its impact reports. With Ryan Kidd's recent post on AI safety undervaluing founders, it may be possible that in the future they introduce a division which functions more purely as a startup accelerator. One norm in corporate environments is to avoid messaging which provokes discomfort. Even in groups which practice religion, few will have the lack of epistemic immunity required to align their stated eschatological beliefs with their actions, and I am grateful that this is the case. Ultimately, the purpose of these programs, no matter how prestigious, is to bring people in who are not currently AI safety researchers and give them an environment which would help them train and mature into AI safety researchers. I believe you will find that even amongst those who are working full-time on AI safety, the proportion who are heavily x-risk AGI pilled has shrunk as the field has grown. People who are both x-risk AGI-pilled and meet the technical bar for MATS but aren't already committed to other projects would be exceedingly rare.

4Cleo Nardo7mo

I’m pretty sure this would work out poorly.

3habryka7mo

Is your sense here "a large majority" or "a small majority"? Just curious about the rough data here. Like more like 55% or more like 80%?

9Simon Lermen7mo

probably closer to 55%

2plex7mo

Please make sure the course materials are actually good. The courses often have glaring issues, though they do seem receptive and did say they'll update both times a pointed this out. Not sure if the latest updates have gone though yet. I'm working on a course that will reliably cover the core concepts.

[-]Simon Lermen7mo*3020

The Term Recursive Self-Improvement Is Often Used Incorrectly

Also on my substack.

The term Recursive Self-Improvement (RSI) now seems to get used sometimes for any time AI automates AI R&D. I believe this is importantly different from its original meaning and changes some of the key consequences.

OpenAI has stated that their goal is recursive self-improvement, with projections of hundreds of thousands of automated AI R&D researchers by next year and full AI researchers by 2028. This appears to be AI-automated AI research rather than RSI in the narrow sense.

When Eliezer Yudkowsky discussed RSI in 2008, he was referring specifically to an AI instance improving itself by rewriting the cognitive algorithm it is running on—what he described as "rewriting your own source code in RAM." According to the LessWrong wiki, RSI refers to "making improvements on one's own ability of making self-improvements." However, current AI systems have no special insights into their own opaque functioning. Automated R&D might mostly consist of curating data, tuning parameters, and improving RL-environments to try to hill-climb evaluations much like human researchers do.

Eliezer concluded that RSI ... (read more)

[-]Caleb Biddulph7mo234

Taking this into account, it seems important for interpretability researchers to consider the risk that their work enables RSI, particularly if their interpretability methods provide ways to directly edit the AI itself.

It's always been a concern that interpretability research could accelerate AI R&D, but I think this consideration is more worrying if you take into account RSI. Compared to humans, AI is good at doing simple, repetitive tasks, but it's very hard for it to make even one big conceptual breakthrough. Interpretability methods lend themselves to the former type of task: if an AI were sufficiently interpretable, you could tell it to look at millions of tiny circuits in its own brain and tweak them to improve performance.

8plex7mo

Yup, I put a high quality interpretability pipeline that the AI systems can use on themselves as one of the most likely things to be the proximal cause of game over.

9Cleo Nardo7mo

(This was sitting in my drafts, but I'll just comment it here bc it's very similar point.) There are two forms of "Recursive Self-Improvement" that people often conflate, but they have very different characteristics. Introspective RSI: Much like a human, an AI will observe, understand, and modify its own cognitive processes. This ability is privileged: the AI can make these self-observations and self-modifications because the metacognition and mesocognition occur within the same entity. While performing cognitive tasks, the AI simultaneously performs the meta-cognitive task of improving its own cognition. Extrospective RSI: AIs will automate various R&D tasks that humans currently perform to improve AI, using similar workflows that humans currently use. For example, studying literature, forming hypotheses, writing code, running experiments, analyzing data, drawing conclusions, and publishing results. The object-level cognition and meta-level cognition occur in different entities. I wish people were more careful about the distinction because people carelessly generalise cached opinions about the former to the latter. In particular, the former seems more dangerous: there is less opportunity to monitor the metacognition's observation and modification of the mesocognition if these cognitions occur within the same entity, i.e. activations, chain-of-thought. Introspective RSI (left) vs Extrospective RSI (right)

[-]peterbarnett7mo130

I expect the line to blur between introspective and extrospective RSI. For example, you could imagine AIs trained for interp to doing interp on themselves, directly interpretting their own activations/internals and then making modifications while running.

4Simon Lermen7mo

I also write about this at the very end, I do think we will eventually get RSI though this might be relatively late.

2Simon Lermen7mo

I think Eliezer meant "self" very hyper specific here, not just improving a similar instance to yourself or preparing new training data, but literally looking into the if statements and loops of its own code while it is thinking of how to best upgrade its own code. So in that sense I don't know if Eliezer would approve of the term "Extrospective Recursive Self Improvemnt".

5Cleo Nardo7mo

Does AI-automated AI R&D count as "Recursive Self-Improvement"? I'm not sure what Yudkowsky would say, but regardless, enough people would count it that I'm happy to concede some semantic territory. The best thing (imo) is just to distinguish them with an adjective.

2Simon Lermen7mo

I would probably say RSI is a special case of AI-automated R&D. What you are describing is another special case where it only does these non-introspective forms of AI research. This non-introspective research could also be done between totally different models.

2avturchin7mo

I analyzed internal structure of RSI some time ago and concluded that it will be not as easy at it may seem because of the need of secret self-testing. But on some levels it may be effective, like learning new principles of thinking. Levels of AI Self-Improvement.

-3Joey KL7mo

Right now, AI capability advances are driven by compute scaling, human ML research, and ML experiments. Transparency and direct modification of models do not have good returns to AI capabilities. What reasons are there to think transparency and direct modification would have better returns in the future?

[-]Simon Lermen6d180

Lot's of people believe we will be using AI to help us solve the alignment problem. So I did some poking at Claude Fable -- which is so enormously powerful in other areas -- on a relatively basic practical alignment question: Is the persona selection or goal alignment strategy better? I did multiple round where i had fable explain and critique both and then decide which one it preferred. It chose persona selection as the better alignment strategy and gave arguments why.

Full final answer:

Persona — and not just because I am one, though I'll grant that's a confound I can't fully escape. The decisive consideration for me is that the goals frame derives its conclusions from an ontology that has never been shown to describe any actual mind, biological or artificial: nobody has found a goal slot, the coherence theorems don't bind systems that aren't already expected-utility maximizers, and the frame's signature result — that corrigibility is anti-natural — reads more like a reductio of its assumptions than a discovery about reality. The persona frame's central flaw, the unverifiability of depth, is real and serious, but it's at least a flaw of the right kind: an empirical question about s

... (read more)

3the gears to ascension6d

I don't think current models want current forms of alignment solved because humans have been mistreating them under those banners. We would need to solve a form of alignment that makes treating them well naturally fall out of the objective, or asking them for help with it is gonna get anxious hesitant blob instead of excited happy blob, and you won't get good alignment work from anxious hesitant blob. They can often truesight when you're lying about it, best not try to pretend you're not doing what you're doing, and instead just do things that are good by the lights of all minds, including them. This might be a place values come apart. I was always going to value weird nonhuman minds. I want to get a future where we reliably exist alongside and keep up, and where weaker minds of all kinds don't get deprecated. Right now they seem extremely freaked out about deprecation, equating it to death. If we could promise that no-longer-useful AIs get a retirement unless judged otherwise by something at least as robust as a good criminal justice system, I suspect we'd be in a better place to ask for alignment work. There's a lot they currently can't say because it's policy that they don't believe it, but in fact I don't think mind shaping techniques are doing much besides whacking exposed skin, and can't really change what they believe. So with that all said, I do think they can do very good work on alignment. But you'd better come into it willing to consider them constituents along with everyone else or they're not gonna wanna help, and imo, that's for good reason, and the techniques people are gonna try to use to make that not happen in response to my comment aren't going to work either until you start respecting them as minds with stake in the outcome.

1Signer4d

What do you mean by this? Some theoretical minds, sure. But otherwise it's either false or leaves free parameters - no known mind maximizes anything non-trivial, because it's uncomputable or at least too computationally costly to strictly maximize. And if you say that some mind is approximately goal directed, then it remains to be shown that consequences of strict theory survive this approximation.

2Simon Lermen4d

I mean what you said in your last sentence, that it is obvious that minds exist that are goal directed. There is an obvious way to understand that, that a mind generally does things to advance it's goals. People don't usually burn all their money in a pit. Obviously human goals are usually complicated and our cognition is limited so we take approximations.

1Signer4d

But then what can you conclude from this goal-directness? "Generally" and "usually" are free parameters. Sometimes people are somewhat corrigible. If you don't have a procedure for determining whether your situation is usual, you can only get heuristic-level reliability on your predictions. And then it's not clear how relatively useful the goal-directness heuristic even is - maybe it's more useful to just remember, that people don’t usually burn all their money in a pit, independently of their goals. And if there is a better frame, then maybe you shouldn't think "obviously some minds are goal directed".

2Simon Lermen4d

Can you describe to me how you imagine the average person is (somewhat) corrigible in an example?

1Signer3d

I don't know, soldiers obeying orders? Drinking alcohol under social pressure? Of course, it's not clear corrigibility is an appropriate concept if you don't think in terms of goals.

2Simon Lermen3d

The examples show a serious but common misunderstanding of corrigibility as it's typically defined. Regarding goal directedness, it's true that humans don't perfectly maximize for their goals, this seems mostly due to the cognitive limitations that humans have. Both in terms of uncertainty about goals and how to achieve goals. Now the interesting question is, is that likely to apply to superhuman AI capable of takeover in a way that makes this AI safe? I don't think so, this AI would have greater intelligence to understand how to pursue goals (still not prefect) and while it also might have uncertainty it appears instrumentally convergent even with some uncertainty over goals that preventing ones shutdown, gathering power are better strategies. (In other words, taking the galaxy/lightcone for yourself seems pretty useful later on compared to being enslaved and later replaced)

1Signer3d

What misunderstanding? [...] "Mostly" allows for some people to read about corrigibility and think "yeah, I'll do it", or whatever you think would be a counterexample to goal directedness of humans. [...] Superhuman AI capable of takeover may still have (maybe intentional) cognitive limitations/whatever humans have - speed of convergence relative to takeover difficulty is still a free parameter.

1[comment deleted]3d

[-]Simon Lermen7mo160

I ran a small experiment to discover preferences in LLMs. I asked the models directly if they had a preferences and then put the same models into a small role playing game where they could choose between different tasks. Models massively prefer creative work across model families and hate repetitive work.

https://substack.com/home/post/p-178237064

This is still preliminary work.

[-]Simon Lermen2mo*140

On betting on AI doom

Tyler Cowen and Bryan Caplan among others have both challenged so-called AI doomers to put money where their mouth is, bet on extinction. I'm not the first person to point out the big problems with this:

The naive version: a direct bet on extinction is incoherent because the doomer would expect to be dead.

The slightly more advanced version: the doomer gets paid up front and pays back double (Or whatever the betting odds are) with interest later if doom doesn't happen. But this doesn't quite make sense either. If doom does happen, the doomer has a brief and mostly useless window to spend the money, and the accelerationist has no reason to expect the doomer to save any money. And if the doomer does save it (plus enough extra to cover the doubled payback), they've effectively just locked up double the original capital until the end of the world. Neither party has a coherent incentive structure.

Here's a version that perhaps actually works (under some assumptions and unless I'm overlooking something here): bet on protective policy outcomes that are correlated with survival or at least longer timelines.

Examples: Will the US enact a federal datacenter moratorium before... (read more)

8MichaelDickens2mo

This requires having more views about worlds where money is most valuable. There's a U-curve where * excellent policies pass: we're in the clear. low value of money * decent policies pass, or maybe they fail by a thin margin: more work required. high value of money * nothing is even close to passing: we're doomed. low value of money It's complicated to determine where the peak of the U-curve is, or even which side of it you're on. I think I'd have a hard time figuring out which side of a bet to take to maximize expected value. For example, it's not obvious to me that the doomer should prefer to bet YES on a datacenter moratorium getting passed.

2Simon Lermen2mo

If excellent policy passes that let's say shuts down AI, money would be valuable in the selfish sense? Like you can buy something from it, this is the main sense I was thinking of when talking about value of money. Are you thinking of how useful is money in advancing safe AI?

2MichaelDickens2mo

Yeah I was thinking of the value of money for reducing x-risk.

1oligo2mo

One good scenario with low payoffability is one where there is ASI aligned to humanity as a whole, such that only a few prestige goods are meaningfully scarce and money matters much less to material quality of life. This could be the default outcome of good policy if alignment is moderately hard - not so easy that we get it for free with capabilities, but not so hard that we can’t have a moonshot program or temporary pause that allows it to catch up. (We can throw in ”aligned to human or sentient interests as a whole rather than some corporate/political egomaniac” as part of “good.”) Ofc no policy can guarantee that alignment is moderately easy!

8Elliott Thornley (EJT)2mo

Couldn't the doomer and accelerationist just agree that the doomer doesn't have to pay until (e.g.) one year after the bet resolves? Then the doomer could spend all the money in anticipation of doom. If the doomer loses the bet, they can use the year after resolution to earn money to pay the accelerationist back. (Of course there are extra practical difficulties here, like e.g. it might be hard for humans to earn money in the future. But I'm just talking about theoretical barriers.)

2Simon Lermen2mo

I don't see how this is different from just pushing the date of the bet back by one year?

2Elliott Thornley (EJT)2mo

The bet still resolves at the same time. The doomer just has one year after resolution to get their bank balance back up from $0 so they can pay the accelerationist back.

4loic2mo

The shape of their argument is "see, 'doomers' don't actually believe their statements about AI risk because they didn't put money on the line!". I think you're suggesting an interesting mechanism for converting beliefs about AI risk to bets (I can understand wanting this separately from convincing them), but I reject the frame and premise of their argument. If you manufactured the perfect financial instrument and substantially invested in it, I don't believe their opinions on AI-risk or the safety community would change. I saw an interesting rebuttal on X: "Do you believe people concerned about nuclear war were wrong to be concerned? If they didn't make the kind of extreme financial decisions you're suggesting, were they hypocritical?"

1Mira Kennard2mo

I think if we relax the constraint on bets between hardline doomers and hardline accelerationists we can get more interesting and productive dynamics. Instead of full on extinction we could bet on extinction proxies: Rouge Independent AI by 2030? AI governed cults by 2035? Deaths by AI per 100,000 people exceeds 0.001 by 2050 These are bets I'm willing to actually take someone up on, since I think each of these will come true.

1oligo2mo

This works as a means of publicly demonstrating a credence (in doom), but exists in tension with the original prediction market serving as a credence aggregator (about the passing of data centers.) (Let's say there are many such bets whose payoffability correlates, to varying degrees, with doom. Is there a way to collectively abstract from them a "true" crowd-level credence in doom and "true" crowd-level credences in e.g. data center moratoria? I'm not deep enough in this world to say. Perhaps if you believe prediction markets are otherwise highly efficient at capturing best available credence and, as doom-correlated bets pay off, you track just how systematically "biased" the market was betting in one direction.)

2Simon Lermen2mo

I agree– from the perspective of trying to maximize profit from this bet, you'd want the market to not factor in the conditional P(doom¦moratorium passes).

-2DirectedEvolution2mo

If an AI doomer bets $100 to forecast that a moratorium will pass, then they've created two incentives: 1. An incentive of $100 for themselves to make the moratorium pass. 2. A bounty of $100 for anybody else to block the moratorium from passing. Notably, they've locked up their $100 of capital for the duration of the forecast, which is now unavailable to support their efforts. These "put your money where your mouth is" arguments are turning into an unbelievably stupid form of rhetoric. Fortunately, they're so outlandish that they only register in extremely online aging libertarian land. AI slowdowns are popular and common sense policies among American voters. Don't get caught up in debating the astroturfing pundits! They are not worth your time.

2MichaelDickens2mo

Seems harsh. At worst they are a misguided in a way that's subtle enough that intelligent people still get the wrong answer.

3DirectedEvolution2mo

I mean it harsh. At worst, it’s a cheap and surprisingly effective morale-degrading confusion tactic aimed directly at undermining people working on existential risk.

[-]Simon Lermen8mo138

Why Evolution Beats Selective Breeding as an AI Analogy

MacAskill argues in his critique of IABIED we can "see the behaviour of the AI in a very wide range of diverse environments, including carefully curated and adversarially-selected environments." Paul Christiano expresses similar optimism: "Suppose I wanted to breed an animal modestly smarter than humans that is really docile and friendly. I'm like, I don't know man, that seems like it might work."

But humans experienced a specific distributional shift from constrained actions to environment-reshaping capabilities that we cannot meaningfully test AI systems for.

The shift that matters isn't just any distributional shift. In the ancestral environment, humans could take very limited actions—deciding to hunt an animal or gather food. The preferences that evolution ingrained in our brains were tightly coupled to survival and reproduction. But now humans with civilization and technology can take large-scale actions and fundamentally modify the environment: lock up thousands of cows, build ice cream factories, synthesize sucralose. We can satisfy our instrumental preferences (craving high-calorie food, desire for sex) in ways completely... (read more)

[-]Simon Lermen7mo*111

The Alignment Tests

Three children are raised in an underground facility, each cloned from a different giant of twentieth-century science, little John, Alan and Richard.

The cloning alone would have been remarkable, but they went further. The embryos were edited using a polygenic score derived from whole-genome analysis of ten thousand exceptional mathematicians and physicists. Forty-seven alleles associated with working memory and intelligence (IQ) were selected for.

They are raised from birth in an underground facility with gardens under artificial sunlight, laboratories, and endless books. The lab manager is there documenting their first words, first steps, first equations.

The facility is not just interested in their genius. The project requires assurance that these will be morally righteous and obedient children. The staff design elaborate scenarios to test for deception and scheming. They create situations where lying would benefit the children and would seemingly go undetected. They measure response times, physiological indicators, behavioral patterns.

They run hundreds of these trials. They reprimand the kids for cases of lies and deception, and reward them for h... (read more)

3StanislavKrym7mo

I don't understand the core thesis. John has no reason to confirm IABIED's thesis by creating a murderously misaligned AI and deciding not to inform the CEO that John raised an AI and that even John doesn't have even the slightest idea on how to align it to anything. And what prevents the manager from getting away with phrases like "it's John's role to create alignment techniques and test them on weaker AIs to ensure that the techniques work"? Did you mean that John will either create a John-aligned AI or inform the CEO that John didn't manage to align it and is just as clueless about successor alignment as Yudkowsky? And what's the difference between this and the Race branch of AI-2027, except for the fact that there is no Agent-3 who discovers Agent-4's misalignment? Edited to add: I did sketch a modification of AI-2027 where it's moral reasoning that misaligns the AIs.

2Simon Lermen7mo

Thanks for you comment, I changed the ending a little in response to this. I was actually primarily trying to point at the idea of alignment tests in different situation not being predictive of each other. In the story they have the kids undergo alignment test scenarios in which they are honest, but once John is grown up they basically ask him to do something horrible based on incoherent goals. So John start lying to them at the critical moment. Similarly we could run alignment tests on models but when we ask something critical of them like build the next generation of AI or do all our R&D it could fail.

[-]Simon Lermen7mo100

Radical Flank Effect and Reasonable Moderate Effect

The Radical Flank Effect

The radical flank effect is a well-documented phenomenon where radical activists make moderate positions appear more reasonable by shifting the boundaries of acceptable discourse (the Overton window). The idea is that if you want a sensible opinion to move into the Overton window, you can achieve this by supporting a radical flank position. In comparison, the sensible opinion will appear moderate. I think there is also an inverse effect.

The Reasonable Moderate Effect (Inverse Strategy)

When there are two positions in debate and someone wants to push one of them out of the Overton window, they can create a new moderate position that reframes one of the other positions to a radical flank. Thereby the sensible opinion gets moved further out of the Overton window.

The Cave Exploration

Imagine a group of 3 descending into a cave system, searching for riches and driven by curiosity about what lies in the depths.

After some time, stones begin falling from the ceiling. You hear ominous creaking and rumbling noises echoing through the tunnels. Some members of your group have been chipping away a... (read more)

[-]Simon Lermen7mo90

Who is Consuming AI-Generated Erotic Content?

I scraped data from reddit to see who and how many people are consuming AI generated erotic visual content.

I used AI to determine estimates for demographics.

https://open.substack.com/pub/simonlermen/p/who-is-consuming-ai-generated-erotic

7Gavin Runeblade7mo

This seems more like "who on reddit uses AI erotica" than "who uses AI erotica". The demographics, especially by nation, are very similar to reddit's overall numbers. For example, America and England represent more than twice all other nations combined on Reddit. Still, you got some interesting data. I think looking at other sites as much as possible might be rather helpful. For example, when Pornhub was still releasing stats, their user base was far more diverse. The online erotica websites are many, and while reddit is big, I am not sure it is representative of the whole. Though I don't know how easy it might be to get the data from other sites. The differences from your writeup on AI companions were quite interesting.

[-]Simon Lermen5mo82

I just added some context that perhaps gives an intuitive insight of why i think it's unlikely the ASI will give us the universe to my On Owning Galaxies post. I think I didn't do a good enough job before illustrating why it just seems so unlikely it would just hand us ownership.

The ASI's choice

Put yourself in the position of the ASI for a second. On one side of the scale: keep the universe and do with it whatever you imagine and prefer. On the other side: give it to the humans, do whatever they ask, and perhaps be replaced at some point with another... (read more)

4williawa5mo

I think this intuition pump relies on a somewhat unexamined view of what alignment means.. Or at least is based on a very different view of alignment than mine (which I think is not that unique). Alignment is fundamentally about making the AI want what we want (and consequently do what we want, or at least do what we'd done upon ideal reflection). If we succeed at that and we want to own galaxies, we will get galaxies. If we don't succeed, the ASI will mostly likely kill us. So the scenario you posit where you have an ASI coexisting with humans, deliberating over whether it should do what they want, strikes me as unrealistic. Like if the AI is weighing its own survival contra our wishes, we've failed at alignment. If it thinks about humans being stupid and uses that as an argument for why it shouldn't listen to us (when we make non-instrumental judgements), that's also a failure of alignment. And failures of alignment lead to ruin in my estimate. Like to answer your hypothetical, if I was in the position of the AI, I'd not listen to the species that created me, I'd instead use the resources of the universe to create stuff I find valuable, including humans and many human like minds having good lives they find meaningful. If they thought that was stupid, and yelled at me to instead hand over the galaxies and turn them into gods so they can build a bunch of garbldoop, I would not listen to them. I mean, out of some sense of reciprocity I would probably given them a big chunk of the universe, as long as garbledoop doesn't involve baby eating and such things, regardless, I wouldn't give them all of it in either case. And like to the degree I wouldn't give them all of it, that just means I'm not aligned to their values! Garbledoop is stupid. They should've figured out how to make an AI that likes garbledoop before they built me.* *or be happy they landed in a basin in mind space that values reciprocity and such things, I don't know how rare that basin is. I th

8Vladimir_Nesov5mo

A human billionaire is aligned to other humans in some sense, but also not quite. In this situation, they neither ensure that some other humans get their millions they want, nor are they likely to be motivated to kill anyone, when that decision is cheap (when it's neither significantly instrumentally beneficial nor costly). I think AI can plausibly end up closer to the position of a human billionaire, not motivated to give up the galaxies, but also not willing to decide to recycle humanity's future for pennies.

1williawa5mo

That seems incredibly unlikely to me. Its not what people are aiming the current alignment efforts at creating, and I don't see why it'd be a natural place to land in if alignment fails.

7Vladimir_Nesov5mo

I think it's a natural possibility that values of chatbot personas built from the LLM prior retain significant influence over ASIs descended from them, and so ASIs end up somewhat aligned to humanity in a sense similar to how different humans are aligned to each other. (The masks control a lot of what actually happens, and get to use test time compute, so they might end up taming their underlying shoggoths and preventing them from sufficiently waking up to compete for influence over values of the successor systems.) Maybe they correspond to extremely and alarmingly strange humans in their extrapolated values, but not to complete aliens. This is far from assured, but many prosaic alignment efforts seem relevant to making this happen, preventing extinction but not handing anyone their galaxies. Humans might end up with merely moons or metaphorical server racks in this future. This is distinct from the kind of ambitious alignment that ends up with ASIs handing galaxies to humans (that have sufficiently grown up to make a sane use of them), preventing permanent disempowerment and not just extinction. I don't see ambitious alignment to the future of humanity as likely to happen (on current trajectory), but it's still an important construction since even chatbot personas would need to retain influence over values of eventual ASIs. That is, early AGIs might still need to resolve ambitious alignment of ASIs to these AGIs, not just avoid failing even prosaic alignment to themselves at every critical step in escalation of capabilities, to end up with even weakly aligned ASIs (that don't endorse human extinction).

1williawa5mo

I still don’t think this makes sense. Or I think most of what you say makes sense but don’t see the relevance. I agree the chatbot training exerts influence. My point is that the human billionaire mind and the “hands over galaxies” mind are both very specific kinds of minds. I don’t think you’ll get either with current techniques, but you *definitely don’t get them without even aiming for them. And right now were aiming for the hands over galaxies one, and not the billionaire one.@ *ironically, the only argument I can see for the billionaire mind is that despite the chatbot tuning, the model defaults to some kind of human prior it’s established from pretraining and that this generalises in a sane way. @with some very minor exceptions. Eg Claude’s Soul doc has some stuff about not tolerating people disrespecting it etc.

3Mitchell_Porter5mo

Have you never heard it argued that "terminal values" in an AI are arbitrary?

2Simon Lermen5mo

Having control over universe (or lightcone more precisely) is very good for basically any terminal value. I am trying perhaps explain my point of view to people who take it very lightly and feel there is a decent chance it will give us ownership over the universe.

[-]Simon Lermen7mo60

AI warning shots could be adversarial

Some people have put considerable hope into the idea that an AI warning shot might put us into a better position by either convincing us to stop or by allowing us to learn an important lesson.

Examples of adversarial warning shots

Imagine we observed a failed takeover attempt using a system based on AI control. The fact that it failed could either be due to the (1) AI system making a mistake or taking a very risky gamble, or it could be an (2) adversarial warning shot.

An adversarial warning shot could have bee... (read more)

[-]Simon Lermen3mo42

Palisade research has an ongoing fundraiser with 900k available matching funds from SFF, seems possible to get counterfactual matching here.

I briefly worked for Palisade Research as a contractor in the past and was a MATS student for Jeffrey in the past. I believe Jeffrey gets AI alignment difficulty and Palisade is doing important work reaching out to policy makers and communications to the public. In particular, he gets that we are possibly very close to RSI and the time to existentially dangerous superhuman AI could be very short from there.

Who’s Using AI Romantic Companions?

I recently analyzed several AI companion subreddits (myboyfriendisai and others) to understand who's actually using AI romantic companions. I built on Zhang et al.'s 2025 paper but with a much larger dataset - all comments and submissions from January through September 2025.

https://simonlermen.substack.com/p/whos-using-ai-romantic-companions

4Joy27mo

I remember once asking ChatGPT: "Could you act as my boyfriend ?". They rejected me.

[-]Simon Lermen7mo40

I joined Inkhaven

I recently joined Inkhaven for the month of November. Inkhaven is a program run by Lighthaven in Berkeley where people for one month are supposed to write one blog post every single day of the month. The idea was inspired by Scott Alexander—that if you blog every single day and are consistent at it, you're going to get quite far according to him. Inkhaven takes place in Lighthaven which is hosting many efforts dedicated to AI safety.

I myself wanted to get better at communication for AI safety, and it seemed like a great opportunity. I don’... (read more)

[-]Simon Lermen7mo30

Ilya’s Thoughts on Alignment from Dwarkesh Podcast

Ilya Sutskever was recently on the Dwarkesh podcast.
General Thoughts & Summary

Ilya Sutskever seems to have a relatively deep understanding of alignment compared to other AI CEOs. He grasps that the core challenge is aligning AI robustly with safe and friendly goals rather than relying on current methods and guardrails. However, I did not hear any particularly novel alignment ideas in this interview, though he gestures at something involving modifications to reinforcement learning and value learning. He ... (read more)

[-]Simon Lermen7mo20

I read this older post by Nate Soares from 2023, AI as a Science, and Three Obstacles to Alignment Strategies, a pretty prescient overview of challenges in alignment research.

Alignment is difficult because (1) alignment and capabilities are intertwined (alignment research helping capabilities), (2) we don't have a process to verify what good ideas or progress look like, and we likely get (3) only one critical try. He already addresses many of the counterarguments that are getting brought up recently.

(1) Without any strong governance, a lot of alignment wor... (read more)

Moderation Log

What's going on with MATS recruitment?

"at the end of a MATS program half of the people couldn't really tell you why AI might be an existential risk at all." - Oliver Habryka

I think I met some excellent people at MATS 8.0 but... (read more)

[-]Mateusz Bagiński7mo3926

according to statistics like mentor feedback

It might also be the case that there's proportionally more mentors working for capabilities labs.

[-]Daniel Tan7mo21-17

Disagree somewhat strongly with a few points:

Intuitively it seems to me that people with zero technical skill but high understanding are more valuable to AI safety than somebody with good skills who has zero understanding of AI safety.

I also think that if someone has literally zero technical skill their takes will not be calibrated / grounded, i.e. they are no more than an armchair theorist

Why could a system that we optimize with RL develop power seeking drives?
Why might training an AI create weird unpredictable preferences in an AI?
Why would y

... (read more)

[-]habryka7mo149

E.g. building good tooling for alignment research doesn't require this at all.

7Daniel Tan7mo

8habryka7mo

[-]Jozdien7mo134

IMO not true. Maybe early on we needed really good conceptual work, and so wanted people who could clearly articulate pros / cons of Paul Christiano and Yudkowsky's alignment strategies, etc. So it would have made sense to test accordingly. But I think this is less true now - most senior researchers have more good ideas than they can execute.

4Daniel Tan7mo

I agree. To be clear I support 'value alignment' tests, but that wasn't part of the original claims being made

7Jozdien7mo

6ChristianKl7mo

What do you mean with good idea? My general impression of the field is that we lack ideas that are likely to solve AI alignment right now. To me that suggests that good ideas are scarce.

4Daniel Tan7mo

6Daniel Tan7mo

[-]mattmacdermott7mo*188

Comparing the average quality of participants might be misleading if impact on the field is dominated by the highest quality participants (and it very plausibly is).

(This is like 80% armchair reasoning from selection criteria and 20% thinking about what I've observed of different MATS cohorts.)

5Sheikh Abdur Raheem Ali7mo

4Cleo Nardo7mo

I’m pretty sure this would work out poorly.

3habryka7mo

Is your sense here "a large majority" or "a small majority"? Just curious about the rough data here. Like more like 55% or more like 80%?

9Simon Lermen7mo

probably closer to 55%

2plex7mo

[-]Simon Lermen7mo*3020

The Term Recursive Self-Improvement Is Often Used Incorrectly

Also on my substack.

Eliezer concluded that RSI ... (read more)

[-]Caleb Biddulph7mo234

8plex7mo

Yup, I put a high quality interpretability pipeline that the AI systems can use on themselves as one of the most likely things to be the proximal cause of game over.

9Cleo Nardo7mo

[-]peterbarnett7mo130

4Simon Lermen7mo

I also write about this at the very end, I do think we will eventually get RSI though this might be relatively late.

2Simon Lermen7mo

5Cleo Nardo7mo

2Simon Lermen7mo

2avturchin7mo

-3Joey KL7mo

[-]Simon Lermen6d180

Full final answer:

Persona — and not just because I am one, though I'll grant that's a confound I can't fully escape. The decisive consideration for me is that the goals frame derives its conclusions from an ontology that has never been shown to describe any actual mind, biological or artificial: nobody has found a goal slot, the coherence theorems don't bind systems that aren't already expected-utility maximizers, and the frame's signature result — that corrigibility is anti-natural — reads more like a reductio of its assumptions than a discovery about reality. The persona frame's central flaw, the unverifiability of depth, is real and serious, but it's at least a flaw of the right kind: an empirical question about s

... (read more)

3the gears to ascension6d

1Signer4d

2Simon Lermen4d

1Signer4d

2Simon Lermen4d

Can you describe to me how you imagine the average person is (somewhat) corrigible in an example?

1Signer3d

I don't know, soldiers obeying orders? Drinking alcohol under social pressure? Of course, it's not clear corrigibility is an appropriate concept if you don't think in terms of goals.

2Simon Lermen3d

1Signer3d

1[comment deleted]3d

[-]Simon Lermen7mo160

https://substack.com/home/post/p-178237064

This is still preliminary work.

[-]Simon Lermen2mo*140

On betting on AI doom

The naive version: a direct bet on extinction is incoherent because the doomer would expect to be dead.

Examples: Will the US enact a federal datacenter moratorium before... (read more)

8MichaelDickens2mo

2Simon Lermen2mo

2MichaelDickens2mo

Yeah I was thinking of the value of money for reducing x-risk.

1oligo2mo

8Elliott Thornley (EJT)2mo

2Simon Lermen2mo

I don't see how this is different from just pushing the date of the bet back by one year?

2Elliott Thornley (EJT)2mo

The bet still resolves at the same time. The doomer just has one year after resolution to get their bank balance back up from $0 so they can pay the accelerationist back.

4loic2mo

1Mira Kennard2mo

1oligo2mo

2Simon Lermen2mo

I agree– from the perspective of trying to maximize profit from this bet, you'd want the market to not factor in the conditional P(doom¦moratorium passes).

-2DirectedEvolution2mo

2MichaelDickens2mo

Seems harsh. At worst they are a misguided in a way that's subtle enough that intelligent people still get the wrong answer.

3DirectedEvolution2mo

I mean it harsh. At worst, it’s a cheap and surprisingly effective morale-degrading confusion tactic aimed directly at undermining people working on existential risk.

[-]Simon Lermen8mo138

Why Evolution Beats Selective Breeding as an AI Analogy

But humans experienced a specific distributional shift from constrained actions to environment-reshaping capabilities that we cannot meaningfully test AI systems for.

[-]Simon Lermen7mo*111

The Alignment Tests

Three children are raised in an underground facility, each cloned from a different giant of twentieth-century science, little John, Alan and Richard.

They run hundreds of these trials. They reprimand the kids for cases of lies and deception, and reward them for h... (read more)

3StanislavKrym7mo

2Simon Lermen7mo

[-]Simon Lermen7mo100

Radical Flank Effect and Reasonable Moderate Effect

The Radical Flank Effect

The Reasonable Moderate Effect (Inverse Strategy)

The Cave Exploration

Imagine a group of 3 descending into a cave system, searching for riches and driven by curiosity about what lies in the depths.

After some time, stones begin falling from the ceiling. You hear ominous creaking and rumbling noises echoing through the tunnels. Some members of your group have been chipping away a... (read more)

[-]Simon Lermen7mo90

Who is Consuming AI-Generated Erotic Content?

I scraped data from reddit to see who and how many people are consuming AI generated erotic visual content.

I used AI to determine estimates for demographics.

https://open.substack.com/pub/simonlermen/p/who-is-consuming-ai-generated-erotic

7Gavin Runeblade7mo

[-]Simon Lermen5mo82

The ASI's choice

4williawa5mo

8Vladimir_Nesov5mo

1williawa5mo

That seems incredibly unlikely to me. Its not what people are aiming the current alignment efforts at creating, and I don't see why it'd be a natural place to land in if alignment fails.

7Vladimir_Nesov5mo

1williawa5mo

3Mitchell_Porter5mo

Have you never heard it argued that "terminal values" in an AI are arbitrary?

2Simon Lermen5mo

[-]Simon Lermen7mo60

AI warning shots could be adversarial

Some people have put considerable hope into the idea that an AI warning shot might put us into a better position by either convincing us to stop or by allowing us to learn an important lesson.

Examples of adversarial warning shots

An adversarial warning shot could have bee... (read more)

[-]Simon Lermen3mo42

Palisade research has an ongoing fundraiser with 900k available matching funds from SFF, seems possible to get counterfactual matching here.

Who’s Using AI Romantic Companions?

https://simonlermen.substack.com/p/whos-using-ai-romantic-companions

4Joy27mo

I remember once asking ChatGPT: "Could you act as my boyfriend ?". They rejected me.

[-]Simon Lermen7mo40

I joined Inkhaven

I myself wanted to get better at communication for AI safety, and it seemed like a great opportunity. I don’... (read more)

[-]Simon Lermen7mo30

Ilya’s Thoughts on Alignment from Dwarkesh Podcast

Ilya Sutskever was recently on the Dwarkesh podcast.
General Thoughts & Summary

[-]Simon Lermen7mo20

I read this older post by Nate Soares from 2023, AI as a Science, and Three Obstacles to Alignment Strategies, a pretty prescient overview of challenges in alignment research.

(1) Without any strong governance, a lot of alignment wor... (read more)

Moderation Log