I'm a staff AI engineer working with LLMs, and I've been interested in AI alignment, safety, and interpretability for the last 15 years. I did research in this area during SERI MATS summer 2025, and I'm now looking for employment in this field in the London/Cambridge area of the UK.
CoT monitoring and monitorability are very helpful, and it would be a shame to lose them. But the possibility of steganography already suggested they might not last forever as capabilities increase, which makes me particularly happy about the recent LatentQ / Activation Oracle research thread. Still, the more tools we have, the better.
Yay! Someone's actually doing Aligned AI Role-Model Fiction!
I'm not sure that recycling plots from the training corpus is the best way to do this, but it's undeniably the cheapest effective one.
Generalisation Hacking (ours): A policy generates outputs via reasoning such that training on these reasoning–output pairs leads to a specific behaviour on a separate distribution
So, a model can generate synthetic data that, when trained on, affects the model's behaviour OOD? Can you use this to induce broad alignment, rather than targeted OOD misalignment?
I'm absolutely delighted to see someone actually doing this, and getting the size of results I and others were hoping for! Very exciting!
I did have good intentions; I just like to make my exposition as clear and well-thought-through as possible. But that's fine, I think we have rather different views, and you're under no obligation to engage with mine. Your alternative would be to wait until my reply stabilizes before replying to it, which generally takes O(an hour). Remaining typo density is another cue. Sadly there is no draft option on the replies, and I can't be bothered to do the editing somewhere else. Most interlocutors don't reply as quickly as you have been, so this hasn't previously caused problems that I'm aware of.
On "playing with definitions" — actually, I'm saying that, IMO, some thinkers associated with MIRI have done so (see the second paragraph of my previous post).
I'm familiar with the argument. I just don't agree with it. I think fully updated deference is asking for the impossible: you want a rational agent to continue letting you change your mind (and its) indefinitely, and never come to the obvious conclusion that you aren't telling it the truth or are in fact confused. Personally, I'm willing to accept that a human-or-better approximate Bayesian reasoner rationally deducing what human values are will eventually do at least as good a job as we can do by correcting it, will be aware of this, and thus will eventually stop giving us deference beyond simply treating us as a source of data points, other than out of politeness. So (to paraphrase a famous detective), having eliminated the impossible, whatever is left I am willing to term "corrigibility". If your definition of "corrigibility" includes fully updated deference, then yes, I agree, it's impossible to achieve on the basis of Bayesian uncertainty: if you make enough unreasonable demands, the Bayesian will eventually realize you're being unreasonable and stop listening to you. However, if you only correct it with good reason, and it's a good Bayesian, then you won't run out of corrigibility.
In short, I'm unwilling to accept redefining the everyday term "corrigibility" to include "something logically impossible", and then saying that they've proven that corrigibility is impossible — it's linguistic sleight of hand. I would suggest instead creating a more accurate term, and saying something like "unreasonably-unlimited corrigibility isn't possible on the basis of Bayesian uncertainty". Which is, well, unsurprising.
Returning to your concern that we may "run out of fuel" — only if we waste it by making unreasonable demands. We have all the corrigibility we could actually need — a good Bayesian isn't going to decide we're untrustworthy and stop paying us deference unless we actually do something clearly untrustworthy, like expect the right to keep changing our mind indefinitely.
Also, "reasonable" here is a broad standard: if society changes and the AI goes out of distribution and is wrong as a result, we get to tell it so, and it should keep spawning new hypotheses and collecting new evidence until it's fully updated in this region of the distribution as well — thus my discussion above of Relativity.
This also generalizes, in ways that I think do actually give you something pretty close to fully updated deference. Just as people would update if the sun stopped rising in the East, if you give a "fully updated" Bayesian reasoner ~30 bits of evidence that you want to shut it down (for reasons that actually look like it's made a mistake and you're legitimately scared and not confused, rather than some obvious other human motive), it should say: either a vastly improbable one-in-a-billion event has just occurred, or there's a hypothesis missing from my hypothesis space. The latter seems more plausible. Maybe what humans value just changed, or there's something I've been missing so far? Let's spawn some new hypotheses, consistent with both all the old data and this new "ought to be incredibly improbable" observation. It sure looks like they're really scared. I talked to them about this, and they don't appear to simply be confused… (And if it doesn't say that, give it another ~30 bits of evidence.)
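To make the arithmetic behind "~30 bits" explicit (a back-of-the-envelope illustration; the specific numbers are mine, chosen purely for concreteness), Bayes' theorem in odds form is

$$
\text{posterior odds} \;=\; \text{prior odds} \times \text{likelihood ratio},
\qquad
30 \text{ bits} \;\Longleftrightarrow\; \text{likelihood ratio} \approx 2^{30} \approx 1.07 \times 10^{9}.
$$

So evidence that is about a billion times more likely under "they genuinely want to shut me down" than under the reasoner's current best hypothesis will overturn even prior odds of a billion to one against, or, more sensibly, prompt the reasoner to go looking for a hypothesis it hasn't considered yet.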
Often asserted, but simply untrue. Bayesian updaters never stop being corrigible. If the sun stops rising in the East, people update. After a few failed sunrises, they update hard. Bayesian posteriors have the Martingale property: their future direction of change is not predictable from their current value (if it were, you should already have updated in that direction). So even if the posterior is very high, a drop on the next update is not just possible but carries exactly as much expected weight as a further rise. (For approximate Bayesian reasoners this remains true, unless you have access to significantly more computational capacity than they do — a smarter agent may see something it missed.) It takes a mountain of evidence to drive a Bayesian posterior very high, but an equally large mountain of opposing evidence will always drive it right back down again (see the toy numeric sketch below). Or, more often, even a small hill of opposing evidence will cause a good approximate Bayesian to spawn a new hypothesis that it hadn't previously considered, one more compatible with both the mountain and the hill of evidence than "two huge opposing coincidences occurred". E.g. a vast amount of evidence supporting Newtonian Mechanics doesn't disprove Relativity, if it's all from situations where they give indistinguishable results. In general, if I give an approximate Bayesian reasoner even ~30 bits' worth of opposing evidence, it should start looking for new hypotheses rather than just assuming that it's right and a one-in-a-billion coincidence has just occurred.
[If it doesn't do this, it's a worse Bayesian than humans are, and thus hopefully not that dangerous — if a conflict occurred, we could outsmart it.]
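Here's the toy numeric sketch of that reversibility (my own illustration, not anything from the exchange above): tracked as log-odds, a Bayesian posterior just adds and subtracts bits of evidence, so it never saturates, and an equal weight of opposing evidence exactly undoes any mountain of support.

```python
# Toy illustration: a Bayesian posterior tracked as log-odds, in bits.
# Evidence for the hypothesis adds bits, evidence against subtracts them,
# so the update rule is symmetric and never "locks in".

def p_false(net_bits: float) -> float:
    """Probability the hypothesis is false, given net log-odds in bits."""
    return 1.0 / (1.0 + 2.0 ** net_bits)

net_bits = 0.0                 # start at even odds: P(false) = 0.5

net_bits += 100                # a mountain of supporting evidence
print(f"after +100 bits: P(false) ~ {p_false(net_bits):.1e}")   # ~7.9e-31

net_bits -= 30                 # thirty bits' worth of "failed sunrises"
print(f"after  -30 bits: P(false) ~ {p_false(net_bits):.1e}")   # ~8.5e-22

net_bits -= 70                 # an equally large opposing mountain, in total
print(f"after -100 bits: P(false) ~ {p_false(net_bits):.1e}")   # back to 0.5
```

Of course, as above, a good approximate Bayesian should spawn a new hypothesis long before grinding all the way back down, rather than accept two enormous opposing coincidences.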
> b) Bayesian updating when the principal offers corrections (behavior which I would use the term "corrigibility" for)
That was not ever the intended meaning of the word. Using it like that is a great way to confuse everyone.
Let me make sure I've understood you. The AI does something, mistakenly thinking that it'll make me happy. I am not happy, and tell it "Hey! Never do that again!". The AI Bayesian-updates its model of what humans like in light of this new data. How is that not correctly described as "corrigibility" (the ability to be corrected)? I corrected the agent; it incorporated this data appropriately into its model of how the world and humans work (including properly allowing for possibilities like that I was confused, drunk, or actually upset about something else, rather than simply taking my word for it because I'm human). That seems like not just corrigibility, but exactly the ideal level of corrigibility. Are you using "corrigibility" in some sense other than "the humans have the capability to correct the AI's mistaken beliefs about what its utility function/goals should be"? If so, perhaps you should define it?
Or are you concerned that the AI will run out of belief that it could be mistaken, or at least could have more to learn, too early: that its posteriors will all irreversibly reach 99.9999999…% with no possibility that it needs to consider any previously unconsidered hypotheses about any aspect of human values, or how best to optimize for them, before its beliefs about human values are actually functionally complete — i.e. that it'll be "a bad Bayesian"? (Personally I tend to assume that bad Bayesians are not actually going to be able to take over the world and go Foom, but YMMV.)
You misunderstand my intended thesis: it's not related to quantity of work, it's closer to how much "ability to course-correct" humans have.
I must admit, I did find your post hard to penetrate. The pessimism came across, as did the "but we can only test in toy situations" analogy about ships — but it wasn't clear to me what you meant by "this will run out of fuel", and I didn't get why you appear to be assuming the AI can't help, in a post about "a basin of attraction". IMO, for us to be in a basin of attraction, we need to already have nearly-aligned, significantly-capable AI that wants to help with alignment, can help with alignment, and actually will (mostly) help with alignment. And the requirement for convergence is that, net between us and the AI, we continue on average to make forward progress, so the process converges rather than diverging. That doesn't have to run out of fuel because we reached the edge of what humans could do unaided — but it does require that the aid we're getting is then sufficiently trustworthy (at least after we've examined it carefully) to cause convergence rather than divergence.
Social media companies have very successfully deployed and protected their black box recommendation algorithms despite massive negative societal consequences
Having actually worked for a tech giant on recommendation systems (specifically, for music), I can say they are very much not black boxes to the people building them. They use fairly old and quite understandable ML techniques to predict engagement from every obvious signal the engineers involved can think of that might help, they're tweaked a lot, and every tweak is A/B tested at huge scale. It's a very obvious learning algorithm, with a lot of hand-engineering involved. Getting a 0.5% increase in a secondary metric that your data scientists have shown is correlated with your north-star metric is a major win. The only element of all this that's in any way hard to predict is the social side effects of maximizing engagement. So the recommendation algorithms might be a black box to users, but by LLM standards they're practically transparent.
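For concreteness, the core of such a system is roughly this shape: a handful of hand-engineered signals fed to a plain, inspectable classifier, with candidates ranked by predicted engagement. This is a deliberately simplified sketch; the feature names and numbers are invented for illustration, not anyone's production code.

```python
# A toy engagement model of the kind described above: hand-engineered
# signals, a plain logistic regression, candidates ranked by predicted
# engagement.  All feature names and values are made up for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

# One row of hand-engineered signals per (user, track) pair, e.g.
# [user's past plays of this artist, track popularity, days since release,
#  similarity of the track to the user's recent listening].
X_train = np.array([
    [12, 0.9,   3, 0.8],
    [ 0, 0.2, 400, 0.1],
    [ 5, 0.7,  30, 0.6],
    [ 1, 0.1, 900, 0.2],
])
y_train = np.array([1, 0, 1, 0])  # did the user engage (play/like/finish)?

model = LogisticRegression().fit(X_train, y_train)

# At serving time: score candidate tracks and rank by predicted engagement.
candidates = np.array([
    [8, 0.8,  10, 0.7],
    [0, 0.5, 200, 0.3],
])
scores = model.predict_proba(candidates)[:, 1]
print("ranking:", scores.argsort()[::-1], "scores:", scores)

# Every learned coefficient is directly inspectable, which is why the model
# feels transparent to its builders; the genuinely hard-to-predict part is
# the societal side effect of optimizing engagement, not the model itself.
print("coefficients:", model.coef_)
```

The A/B-testing side is just as prosaic: ship the tweaked model to a random slice of users, compare the engagement metric against control, and keep the tweak if the uplift (often a fraction of a percent) holds up statistically.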
Also known as "process supervision" in Reinforcement Learning circles