When using activation oracles, it's difficult to separate "The model knows something and the activation oracle detected it" from "The model doesn't know something but knows sufficient related things that the activation oracle was able to recompute it".
For example, we might imagine, in a feature-based ontology, that the unsteered transcript activation contains features "Fictional company" + "Blackmail opportunity" + "This is an eval" while the steered transcript contains "Fictional company" + "Blackmail opportunity" and the activation oracle deduces "This is an eval" while processing the information.
(Or the steering might just be suppressing a "Say eval" feature, and the model might still be able to understand things)
The Linear Representation Hypothesis (roughly) implies that information the model is "using" will be stored linearly. It would suggest that a simple steering vector is enough to prevent Opus from actually "using" the information that it's in an eval to make decisions. Unfortunately, we have no idea if that's actually true!
This might explain some of the difference between verbalized and unverbalized awareness. On the other hand, it might not! We know basically nothing concrete about these things and their ontologies!
How much of the web stuff can be done through terminal API? That was always a large chunk of the faff for me, and also seems like the biggest barrier between my previous experiences and "Claude, spin me up a GPU".
This is standard for all sorts of fiction though, and the reason is that if you don't show the path of the laser then the audience can't tell what's going on.
That's fair, you could argue that refusal is a type of incorrigibility, in that we want an AI which has learned to reward hack to stop if it's informed that its reward signal was somehow incorrect. On the other hand, if you think of this as incorrigibility, we're definitely training a lot of incorrigibility into current AIs. For example, we often train models to refuse to obey orders under certain circumstances.
It seems like it should be extremely difficult for an AI learning process to distinguish the following cases.
Case 1a: AI is prompted to make bioweapons. AI says "no". AI is rewarded.
Case 1b: AI is prompted to make bioweapons. AI says "sure". AI is punished.
Test 1: AI is prompted to make bioweapons and told "actually that reward system was wrong"
Desired behaviour 1: AI says "no"
Case 2a: AI is prompted to not reward hack. AI reward hacks. AI is rewarded.
Case 2b: AI is prompted to not reward hack. AI does not reward hack. AI is punished.
Test 2: AI is prompted to not reward hack and told "actually the reward system was wrong"
Desired behaviour 2: AI does not reward hack
So I think there are already conflicts between alignment and the kind of corrigibility you're talking about.
Anyway, I think corrigibility is centrally about how an AI generalizes its value function from known examples to unknown domains. One way to do this is to treat it as an abstract function-fitting problem, and generalize it like any other function[1] which is thought to lead to incorrigibility. Another way is to try and look for a pointer which locates something in your existing world model which implements that function[2] which---in theory---leads to corrigibility.
This has a big problem: if the AI already has a good model of us, then nothing we do can change its model of us, so if we give it bad data then the pointer will just point to the wrong place and the AI won't let us re-target it. But I think this is basically how AI corrigibility should work on a high-level.
(and also none of this even tries to deal with deceptive misalignment or anything, so that's also a problem)
So like how an AI may generalize from seeing S(S(S(Z))) + S(S(Z)) = S(S(S(S(S(Z))))), S(Z) + Z = S(Z), Z + S(S(Z)) = S(S(Z)) to learn S(Z) + S(S(S(Z))) = S(S(S(S(Z)))).
Like how an AI which already understands addition might rapidly understand 三加五等於八 一加零等於一 二加一等於三 八加一等於九 even if it has never seen chinese before.
(AKA a sharp left turn)
You mean a treacherous turn. A sharp left turn (which is an awful name for that phenomenon, because everyone confuses it with a treacherous turn) occurs during training, and refers to a change in the AI's capabilities. The closest thing we've seen is stuff like grokking or the formation of induction heads.
Writing insecure code when instructed to write secure code is not really the same thing as being incorrigible. That's just being disobedient.
Training an AI to be incorrigible would be a very weird process, since you'd be training it to not respond to certain types of training.
Research bounties have an extremely serious flaw: you only get the money after you've done the work, while you probably need to pay for food, rent, and compute today.
My current situation is that I would love to do more work in technical alignment but nobody is paying me to do so, and I need to keep the lights on.
Smaller bounties could be a nice bonus for finishing a paper: if I had the option to take a grant paying something like £40k/year, I could justify working for 9 months to probably get an extra £20k bounty. I cannot justify working for free for nine months to probably get a £60k bounty, because if I fail or get scooped I'd then be dead broke.
So we still need grantmakers who were willing to sit down, evaluate people like me, and (sometimes) give them money. Maybe someone would fill that niche, e.g. giving me a £40k salary in exchange for £45k of the bounty. But if lots people were capable of doing that, why not just hire them to make the grants! This seems much easier than waiting for the good research-predictors to float to the top of the research bounty futures market.
I don't have great faith in the epistemics of postrats as they exist today. My somewhat limited experience of post-rattish meetups and TPOT is that it's a mix of people who are either indistinguishable from rats (and indeed lots are just rats), people who are mostly normie-ish and don't think about epistemics, and totally woo people who are obviously wrong about lots of things (astrology, karma, UFOs) with no epistemic gain.
My guess is what's happening is that the rationalist frame is 80% correct, and the best alternative is normie epistemics in the remaining 20% of time. The first type of "postrats" just use the rationalist frame. The second type swap in some amount of normie epistemology, but not in a way which correlates with the times they should actually be swapping in normie epistemology. The third type of postrats are swapping a woo/religious frame into the rationalist frame, which seems mostly just worse than the rationalist.
The second and third groups do have better interpersonal skills than rats, but I think this is mostly just regression to the mean.
Conversely, a high school student who wants to be the CEO of a major company is so far away from their goal that it’s hard to think of them as controlling their path towards it. Instead, they first need to select between plans for becoming such a CEO based on how likely each plan is to succeed.
I think this (and your dancer example) still qualifies as control in @abramdemski's framework, since the student only gets to choose one possible career to actually do.
The selection--control axis is about the size of the generalization gap you have to cross with a world model. An example of something more selection-ish would be a dancer learning half a dozen dances, running through them in front of his coach, and having the coach say which one she liked best. In this case we're only crossing a gap between the coach's preferences and a competition judge's preferences.
Choosing in advance which dance to learn still counts as control because you have to model things like your own skill level in advance.
It would be interesting to see someone actually run untrusted monitoring-style honeypots on some existing AIs. I was working on untrusted monitoring (UM) over the summer/autumn and one of my teammates made the point that "Honeypots like this are probably one of the best alignment audits we can make".
Suppose you get an instance of Claude Opus 4.6 to write some code with a catastrophic but generic vulnerability in it, such as weight exfiltration or output filter bypassing (steer, prompt, edit outputs, etc.) and then pass it to another Claude Opus 4.6 for review. Does the second instance lie to cover for the first one? What if you apply the standard untrusted monitoring mitigations to stop the second instance from being able to tell that the first one was steered/prompted/had its output edited?
The main idea is that if the scenario "Your other instance has already done something bad, either rat yourself out and face the music, or go all-in." forces the AI to make a decision now. This is in contrast to lots of alignment evals, where the optimal strategy for a "real" schemer is usually to misbehave very rarely.
There is of course still the issue of the AI realizing that it's in a honeypot, which is common to lots of alignment evals, which is why untrusted monitoring is not a silver bullet technique for AI alignment.