I’m a staff AI engineer and researcher working with LLMs, and I have been interested in AI alignment, safety, and interpretability for the last 17 years. I did research on this during SERI MATS in summer 2025, and am now an independent researcher at Meridian in Cambridge. I’m currently looking for either employment or funding to work on this subject in the London/Cambridge area of the UK.
Last summer I attended a talk at Lighthaven (in Building B, on the first floor) by Alexander Wales on prompting LLMs to write novels. I'm trying to figure out when this was, and whether it was at LessOnline or at MATS. Does anyone still have a copy of the LessOnline ’25 schedule who could look this talk up?
I'd like to offer my thoughts on this topic as another source to explore:
• A Sense of Fairness: Deconfusing Ethics: suggests a framework for considering the issue, and why an aligned AI would decline moral standing (this is the first post in a sequence; some later posts are also relevant)
• Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV: grounds that framework in the evolutionary psychology of humans
I think you should have titled or subtitled this "How Fast is my Centaur?" :-)
Interesting, exciting, and very valuable work.
I wish you'd labeled the points to help look for model family effects.
The point spreads are quite noisy; your conclusion that quality caps out looks very sensitive to a balance between two outlier points, one very high and the other very low: remove either of those and the story might change. Obviously there is enormous economic value in ensuring that low-capability humans don't drag high-capability model outputs down, whether that requires retraining the model or training the human.
Is that 20% prediction a total increase in GDP over the period, or an annualized rate of increase?
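The distinction matters a great deal numerically. As a rough illustration (the ten-year horizon here is my assumption purely for the arithmetic, not something taken from the prediction itself):

```python
# Hypothetical 10-year horizon, chosen only to illustrate the arithmetic.
years = 10
print(f"20% annualized over {years} years: {1.20 ** years:.2f}x total growth")       # ~6.19x
print(f"20% total over {years} years: {100 * (1.20 ** (1 / years) - 1):.2f}% per year")  # ~1.84%/year
```

So the two readings differ by well over a factor of three in the implied endpoint, which is why I'm asking.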
The other problem is that Seth's useful and thought-provoking map is in 2 dimensions, and humans are used to thinking in 1–3 dimensions (and have a visual cortex with 2-dimensional local connectivity, so we lack the wetware for thinking in more than about 2½ dimensions). LLM activations, KV values, etc. are generally in O(8192) dimensions. High-dimensional spaces just have a lot of statistical/geometric properties that are wildly unintuitive to us, since we're used to working in very low numbers of dimensions (this is also known as the curse of dimensionality). You can, with practice, learn to recognize when your intuition is misleading you and what the correct answer is, but this takes quite a bit of practice.

For example, if you pick two vectors at random in a high-dimensional space, they will almost invariably be nearly orthogonal to each other (the angle between them will be close to 90 degrees). Random walks in high-dimensional spaces practically never return anywhere near any place they've gone before: they continually get "more lost". So a lot of the intuitions that Seth's diagram gives are rather misleading: start in the middle of the "Bay of Dimensional Mismatch", go a shortish distance in a random direction, and you will almost certainly end up deep at sea, rather than back on land as his 2-D map suggests. Nevertheless, all the effects he discusses are real and important; just be aware that there's simply no way to diagram them in only 2 dimensions (or anything a human can visualize) that isn't inherently rather misleading.
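To make the near-orthogonality and random-walk claims concrete, here's a minimal NumPy sketch (the dimension 8192 is just the ballpark figure mentioned above; the seed and step count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8192  # a typical LLM activation width, roughly

# Angle between two random vectors: concentrates sharply around 90 degrees.
u, v = rng.standard_normal(d), rng.standard_normal(d)
cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(f"angle between two random {d}-d vectors: {np.degrees(np.arccos(cos)):.2f} degrees")

# Random walk: its distance from the starting point grows roughly like
# sqrt(steps * d), so it essentially never wanders back near where it began.
steps = 1000
walk = np.cumsum(rng.standard_normal((steps, d)), axis=0)
dist = np.linalg.norm(walk, axis=1)
print(f"distance from start after {steps} steps: {dist[-1]:.1f}")
print(f"closest approach to start after the first 10 steps: {dist[10:].min():.1f}")
```

Rerun it with different seeds: the angle stays within a fraction of a degree of 90°, and the walk's distance from its start just keeps growing rather than ever shrinking back toward zero.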
I'd suggest TurnTrout's writing (Alex Turner at DeepMind), since he's the person who first came up with the idea. Most of his posts are on LessWrong/the Alignment Forum, but they're best organized on his own website. I'd suggest starting at https://turntrout.com/research, reading the section on Shard Theory, and following links.
He himself admits that some of his key posts often seem to get misunderstood: I think they repay careful reading and some thought.
I completely agree with you — I personally have filtered a thousand samples manually, and it takes a while. Finding a good human-LLM centaur solution is very helpful. Sounds like you know all the tricks at least as well as I do.
Does corrigibility make it harder to get such an international agreement, compared to alternative strategies? I can imagine an effect: the corrigible AI would serve the goals of just one principal, making it less trustworthy than, say, a constitutional AI. That effect seems hard to analyze. I expect that an AI will be able to mitigate that risk by a combination of being persuasive and being able to show that there are large risks to an arms race.
It's easy to get a corrigible ASI not to use its persuasiveness on you: you tell it not to do that.
I think you need to think harder about that "hard to analyze" bit: it's the fatal flaw (as in x-risk) of the corrigibility-based approach. You don't get behavior any wiser or more moral than the principal's. And 2–4% of principals are sociopaths (the figures for tech titans and heads of authoritarian states may well be higher).
I assume there will be a nontrivial period when the AI behaves corrigibly in situations that closely resemble its training environment, but would behave incorrigibly in some unusual situations.
As a rule of thumb, anything smart enough to be dangerous is dangerous because it can do scientific research and self-improve. If it can't tell when it's out of distribution and might need to generate some new hypotheses, it can't do scientific research, so it's not that dangerous. So yes, there might be some very unusual situation that would decrease corrigibility: but for an ASI, just taking it out of its training distribution should pretty reliably cause it to say "I know that I don't know what I'm doing, so I should be extra cautious/pessimistic, and that includes being extra-corrigible."
"The engineering feedback loop will use up all its fuel"
I discussed this with Jeremy Gillen in the comments of his post, and I'm still not clear what he meant by 'fuel' here. Possibly something to do with the problem of "fully-updated deference", a.k.a. the right to keep arbitrarily and inconsistently changing our minds?
I found it: LessOnline, 5 pm on Sunday, June 1st, in Bayes Ground Floor