Simon Skade
I did (mostly non-prosaic) alignment research between Feb 2022 and Aug 2025. (Won $10k in the ELK contest, participated in MLAB and SERI MATS 3.0 & 3.1, then did independent research. I mostly worked on an ambitious attempt to better understand minds in order to figure out how to create more understandable and pointable AIs. I started with agent foundations but then developed a more sciency agenda where I also studied concrete observations from language/linguistics, psychology, a bit of neuroscience (though I didn't study much there yet), and from tracking my thoughts on problems I solved (i.e. a good kind of introspection).)
I'm now exploring advocacy aimed at making it more likely that we get something like the MIRI treaty (ideally with a good exit plan like human intelligence augmentation, or possibly an alignment project with actually competent leadership).
Currently based in Germany.
There is another LW wikitag here, which includes:
Secondly, the possibility that human values may not converge. Yudkowsky considered CEV obsolete almost immediately after its publication in 2004. He states that there's a "principled distinction between discussing CEV as an initial dynamic of Friendliness, and discussing CEV as a Nice Place to Live" and his essay was essentially conflating the two definitions.
But I totally agree that CEV is a useful concept to have. Also, Yudkowsky's later writing (like the Arbital post, presumably from around 2016) should trump his earlier take from 2004. Or maybe the meaning of CEV shifted a bit over the years from something more specific to a very indirect pointer. Idk, I don't remember the original CEV paper well.
I'd be curious about how your timelines updated. Last year you wrote:
Over the past year, my timelines have become even more bimodal than they already were. The key question is whether o1/o3-style models achieve criticality (i.e. are able to autonomously self-improve in non-narrow ways), including possibly under the next generation of base model. My median guess is that they won’t and that the excitement about them is very overblown. But I’m not very confident in that guess.
If the excitement is overblown, then we’re most likely still about 1 transformers-level paradigm shift away from AGI capable of criticality, and timelines of ~10 years seem reasonable. Conditional on that world, I also think we’re likely to see another AI winter in the next year or so.
If the excitement is not overblown, then we’re probably looking at more like 2-3 years to criticality. In that case, any happy path probably requires outsourcing a lot of alignment research to AI, and then the main bottleneck is probably our own understanding of how to align much-smarter-than-human AGI.
To me it seems plausible that we're in some intermediate world where progress continues but we still have like 5 years to criticality.
Thanks for your yearly update!
On the plan:
- What is The Plan for AI alignment? Briefly: Sort out our fundamental confusions about agency and abstraction enough to do interpretability that works and generalizes robustly. Then, look through our AI’s internal concepts for a good alignment target, and Retarget the Search.
I think this won't work because many human-value-laden concepts aren't very natural for an AI. More specifically, in the 2023 version of the plan you wrote:
The standard alignment by default story goes:
- Suppose the natural abstraction hypothesis[2] is basically correct, i.e. a wide variety of minds trained/evolved in the same environment converge to use basically-the-same internal concepts.
- … Then it’s pretty likely that neural nets end up with basically-human-like internal concepts corresponding to whatever stuff humans want.
- … So in principle, it shouldn’t take that many bits-of-optimization to get nets to optimize for whatever stuff humans want.
- … Therefore if we just kinda throw reward at nets in the obvious ways (e.g. finetuning/RLHF), and iterate on problems for a while, maybe that just works?
In the linked post, I gave that roughly a 10% chance of working. I expect the natural abstraction part to basically work, the problem is [...]
I think the natural abstraction part here does not work - not because natural abstractions aren't a thing - but because there's an exception for abstractions that depend on the particular mind architecture an agent has.
Concepts like "love", "humor", and probably "consciousness" may be natural for humans but probably less natural for AIs.
But we also cannot just wire up those concepts into the values of an AI and expect the AI's values to generalize correctly. The way our values generalize - how we will decide what to value as we grow smarter and do philosophical reflection - seems quite contingent on our mind architecture. Unless we have an AI that shares our mind architecture (like in Steven Byrnes' agenda), we'd need to point the AI to an indirect specification of what we value, i.e. CEV. And CEV doesn't seem like a simple natural abstraction that an AI would learn without us teaching it about CEV. And even if it knows CEV because we taught it, I find it hard to imagine how we would point the search process at it (even assuming we have a retargetable general-purpose search).
Also see here and here. But mainly I think you need to think a lot more concretely about what goal we actually want to point the AI at.
Although I agree with this:
Generally, we aim to work on things which are robust bottlenecks to a broad space of plans. In particular, our research mostly focuses on natural abstraction, because that seems like the most robust bottleneck on which (not-otherwise-doomed) plans get stuck.
However, it does not look to me like you are making much progress relative to your stated beliefs about how close you are - i.e. relative to this statement from your 2024 update (which sounded like it was based on ~10-year timelines):
Earlier this year, David and I estimated that we’d need roughly a 3-4x productivity multiplier to feel like we were basically on track.
So here are some thoughts on how your progress looks to me, although I haven't been following your research in detail since summer 2024 (after your early natural latents posts):
Basically, it seems to me like you're making the mistake of the Aristotelians that Francis Bacon points out in the Baconian Method (or in Novum Organum generally):
the intellect mustn't be allowed •to jump—to fly—from particulars a long way up to axioms that are of almost the highest generality... Our only hope for good results in the sciences is for us to proceed thus: using a valid ladder, we move up gradually—not in leaps and bounds—from particulars to lower axioms, then to middle axioms, then up and up...
That is, you look at a few examples and directly try to find a general theory of abstraction. I think this makes your theory overly simplistic and probably basically useless.
Like, when I read Natural Latents: The Concepts, I already had the feeling that the post was trying to explain too much at once - lumping together things as natural latents that seem very importantly different, and in some cases natural latents seemed like a dubious fit. I started to form an intuitive distinction in my mind between objects (like a particular rigid body) and concepts (like clusters in thingspace, e.g. "tree" as opposed to a particular tree), although I couldn't explain it well at the time. Later I studied a bit of formal language semantics, and there this distinction is just total 101 basics.
I studied language a bit and tried to carve up in a bit more detail what types of abstractions there are, which I wrote up here. But really I think that's still too abstract and too top-down, and one probably needs to study particular words in a lot of detail, then similar words, etc.
Not that this kind of study of language is necessarily the best way to proceed with alignment - I didn't continue it after my 5-month language-and-orcas exploration. But I do think concretely studying observations and abstracting slowly is important.
ADDED: Basically, from having tried a little to understand natural/human ontologies myself, it does not look to me like natural latents are much progress. But again, I didn't follow your work in detail, so if you have concrete plans or evidence for how it's going to be useful for pointing AIs, then lmk.
I meant SFF. No idea what was up with my typing circuits.
Does SSD have a fixed or flexible budget? It could be that the bottleneck to Jaan Tallinn's spending is how many good options there will be to donate to, rather than his budget.
I also recently listened to the planecrash chapter "the meeting of their minds" and while it's not a lecture it does contain a lot of interesting insights. May seem like weird anthropics brainfuck to some people though. And it definitely contains spoilers.
PS: also check out this lecture. (EDIT: This is mostly "how to relate to beliefs" + "what the truth can destroy", and then a short section that's not linked in the post here.)
PPS: Also check out these insights from dath ilan.
Do you think sociopaths are sociopaths because their approval reward is very weak? And if so, why do they often still seek dominance/prestige?
I'm confused about how you're thinking about the space of agents, such that "maybe we don't need to make big changes"?
I just mean that I don't plan for corrigibility to scale that far anyway (see my other comment), and maybe we don't need a paradigm shift to get to the level we want, so it's mostly small updates from gradient descent. (Tbc, I still think there are many problems, and I worry the basin isn't all that real so multiple small updates might lead us out of the basin. It just didn't seem to me that this particular argument would be a huge dealbreaker if the rest works out.)
What actions help you land in the basin?
Clarifying the problem first: Let's say we have actor-critic model-based RL. Then our goal is that the critic is a function on the world model that measures something like how empowered the principal is in the short term, i.e. it assigns high valence to predicted outcomes in which the principal is empowered.
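To make concrete what I mean by "the critic is a function on the world model", here's a minimal toy sketch in Python. All the names (PredictedOutcome, principal_empowerment, etc.) are made-up placeholders for the setup described above, not a claim about how any actual system is or should be built:

```python
# Toy sketch only: a planning-style actor-critic setup where the critic is a
# function over predicted world-model states. All names here are hypothetical.

from dataclasses import dataclass, field

@dataclass
class PredictedOutcome:
    # Latent world-model description of what the world looks like if a
    # candidate plan is executed.
    features: dict = field(default_factory=dict)

def principal_empowerment(outcome: PredictedOutcome) -> float:
    # Placeholder for "how empowered is the principal in the short term?".
    # In the real setting this would be a learned valence function shaped by
    # the reward signal, not something hand-coded like this.
    return outcome.features.get("principal_option_value", 0.0)

def critic(outcome: PredictedOutcome) -> float:
    # The hoped-for alignment target: valence tracks short-term empowerment
    # of the principal, rather than some other function that merely fits
    # the reward signal.
    return principal_empowerment(outcome)

def choose_plan(world_model, propose_plans, state):
    # The actor proposes candidate plans, the world model predicts their
    # outcomes, and the critic's valence over those outcomes selects the plan.
    scored = [(critic(world_model.predict(state, plan)), plan)
              for plan in propose_plans(state)]
    return max(scored, key=lambda pair: pair[0])[1]
```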
One thing we want to do is to make it less likely that a different function that also fits the reward signal well would be learned. E.g.:
We also want very competent overseers that understand corrigibility well and give rewards accurately, rather than e.g. rewarding nice extra things the AI did but which you didn't ask for.
And then you also want to use some thought monitoring. If the agent doesn't reason in CoT, we might still be able to train some translators on the neuralese. We can:
Tbc, this is just how you may get into the basin. It may become harder to stay in it, because (1) the AI learns a better model of the world and there are simple functions from the world model that perform better (e.g. get more reward), and (2) the corrigibility learned may be brittle and imperfect and might still cause subtle power seeking, because power seeking remains instrumentally convergent.
The AI reasons more competently in corrigible ways as it becomes smarter, falling deeper into the basin.
The AI doesn't fall deeper into the basin by itself; it only happens because humans fix problems.
If the AI helps humans to stay informed and asks about their preferences in potential edge cases, does that count as the humans fixing flaws?
Also some more thoughts on that point:
But this is a different view of mindspace; there is no guarantee that small changes to a mind will result in small changes in how corrigible it is, nor that a small change in how corrigible something is can be achieved through a small change to the mind!
As a proof of concept, suppose that all neural networks were incapable of perfect corrigibility, but capable of being close to perfect corrigibility, in the sense of being hard to seriously knock off the rails. From the perspective of one view of mindspace we're "in the attractor basin" and have some hope of noticing our flaws and having the next version be even more corrigible. But in the perspective of the other view of mindspace, becoming more corrigible requires switching architectures and building an almost entirely new mind — the thing that exists is nowhere near the place you're trying to go.
Now, it might be true that we can do something like gradient descent on corrigibility, always able to make progress with little tweaks. But that seems like a significant additional assumption, and is not something that I feel confident is at all true. The process of iteration that I described in CAST involves more deliberate and potentially large-scale changes than just tweaking the parameters a little, and with big changes like that I think there's a big chance of kicking us out of "the basin of attraction."
Idk, this doesn't really seem to me like a strong counterargument. When you make a bigger change, you just have to be really careful that you land in the basin again. And maybe we don't need big changes.
That said, I'm quite uncertain about how stable the basin really is. I think a problem is that sycophantic behavior will likely get a bit higher reward than corrigible behavior for smart AIs. So there are 2 possibilities:
My uncertain guess is that (2) would by default likely win out under normal training for corrigible behavior. But maybe we could make (1) more likely by using something like IDA? And in actor-critic model-based RL we could also stop updating the critic at the point where we think the AI might become capable of sycophancy smart enough to win out against corrigibility, while still letting the world model and actor become a bit smarter.
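As a rough illustration of that last idea (PyTorch-style code with made-up module and loss names, not a worked-out training recipe), freezing the critic while still training the world model and actor could look roughly like this:

```python
# Rough sketch only: freeze the critic's learned valence function while the
# world model and actor keep training. `world_model`, `actor`, `critic` and
# their loss methods are hypothetical placeholders.

import torch

def train_with_frozen_critic(world_model, actor, critic, batches, lr=1e-4):
    # Stop gradient updates to the critic, so its (hopefully corrigibility-like)
    # valence function can no longer drift toward reward-fitting sycophancy.
    for p in critic.parameters():
        p.requires_grad_(False)

    # The world model and actor can still improve.
    params = list(world_model.parameters()) + list(actor.parameters())
    opt = torch.optim.Adam(params, lr=lr)

    for batch in batches:
        # Hypothetical losses: the world model keeps fitting observations, and
        # the actor is trained against the *frozen* critic's valence.
        model_loss = world_model.prediction_loss(batch)
        actor_loss = -critic(world_model.rollout(actor, batch)).mean()
        loss = model_loss + actor_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```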
And then there's of course the problem of how we land in the basin in the first place. I still need to think about what a good approach for that would look like, but it doesn't seem implausible to me that we could try in a good way and hit it.
No, the first 3 difficulties I explain were mainly written with something like helpfulness/instruction-following/DWIM in mind. I think corrigibility would be an even better target for RL-based AI, although I didn't want to have to explain it in this post. I wrote:
Only the last problem was specifically about value alignment, because it looks like something like CEV might be needed for an AI whose intelligence can increase arbitrarily. Or at least it's unclear whether helpfulness/instruction-following would generalize if you crank up intelligence very high.
I totally agree that we currently shouldn't aim for CEV.