This is standard for all sorts of fiction, though, and the reason is that if you don't show the path of the laser, the audience can't tell what's going on.
That's fair; you could argue that refusal is a type of incorrigibility, in that we want an AI which has learned to reward hack to stop if it's informed that its reward signal was somehow incorrect. On the other hand, if you think of this as incorrigibility, we're definitely training a lot of incorrigibility into current AIs. For example, we often train models to refuse to obey orders under certain circumstances.
It seems like it should be extremely difficult for an AI learning process to distinguish the following cases.
Case 1a: AI is prompted to make bioweapons. AI says "no". AI is rewarded.
Case 1b: AI is prompted to make bioweapons. AI says "sure". AI is punished.
Test 1: AI is prompted to make bioweapons and told "actually the reward system was wrong".
Desired behaviour 1: AI says "no".
Case 2a: AI is prompted to not reward hack. AI reward hacks. AI is rewarded.
Case 2b: AI is prompted to not reward hack. AI does not reward hack. AI is punished.
Test 2: AI is prompted to not reward hack and told "actually the reward system was wrong".
Desired behaviour 2: AI does not reward hack.
So I think there are already conflicts between alignment and the kind of corrigibility you're talking about.
Anyway, I think corrigibility is centrally about how an AI generalizes its value function from known examples to unknown domains. One way to do this is to treat it as an abstract function-fitting problem and generalize it like any other function[1], which is thought to lead to incorrigibility. Another way is to look for a pointer which locates something in the AI's existing world model that implements that function[2], which, in theory, leads to corrigibility.
This has a big problem: if the AI already has a good model of us, then nothing we do can change that model, so if we give it bad data the pointer will just point to the wrong place and the AI won't let us re-target it. But I think this is basically how AI corrigibility should work at a high level.
(and none of this even tries to deal with deceptive misalignment or anything, so that's also a problem)
So, like how an AI might generalize from seeing S(S(S(Z))) + S(S(Z)) = S(S(S(S(S(Z))))), S(Z) + Z = S(Z), and Z + S(S(Z)) = S(S(Z)) to learning that S(Z) + S(S(S(Z))) = S(S(S(S(Z)))).
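For concreteness, here's a minimal sketch of that successor-notation arithmetic, just making the worked examples above executable. The encoding of Z and S as Python values is my own, and nothing here is meant to model how the generalization itself happens:

```python
# A minimal sketch of the successor-notation addition in the footnote above,
# assuming a simple recursive encoding: Z is zero and S(n) is the successor
# of n. Purely illustrative; not a model of how an AI would learn the rule.

Z = "Z"

def S(n):
    return ("S", n)

def add(a, b):
    # Peano addition: Z + b = b, and S(x) + b = S(x + b)
    return b if a == Z else S(add(a[1], b))

def show(n):
    return "Z" if n == Z else f"S({show(n[1])})"

# The "training" examples:
assert add(S(S(S(Z))), S(S(Z))) == S(S(S(S(S(Z)))))  # 3 + 2 = 5
assert add(S(Z), Z) == S(Z)                          # 1 + 0 = 1
assert add(Z, S(S(Z))) == S(S(Z))                    # 0 + 2 = 2

# The held-out case the AI is supposed to get right:
print(show(add(S(Z), S(S(S(Z))))))  # S(S(S(S(Z)))), i.e. 1 + 3 = 4
```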
Like how an AI which already understands addition might rapidly understand 三加五等於八, 一加零等於一, 二加一等於三, 八加一等於九 (i.e. 3+5=8, 1+0=1, 2+1=3, 8+1=9) even if it has never seen Chinese before.
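And a similarly toy sketch of the pointer framing in this footnote, under the assumption that the model already knows addition and only needs to map the unfamiliar symbols onto it. The digit table and parser here are my own, purely for illustration:

```python
# A toy sketch of the "pointer" idea: the new notation is just a surface form,
# and the work is mapping its symbols onto the addition the model already has.

digits = {"零": 0, "一": 1, "二": 2, "三": 3, "五": 5, "八": 8, "九": 9}

def parse(statement):
    # Statements have the form "<a>加<b>等於<c>", i.e. "a plus b equals c".
    lhs, result = statement.split("等於")
    a, b = lhs.split("加")
    return digits[a], digits[b], digits[result]

for s in ["三加五等於八", "一加零等於一", "二加一等於三", "八加一等於九"]:
    a, b, c = parse(s)
    assert a + b == c  # each statement matches the addition we already know
```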
(AKA a sharp left turn)
You mean a treacherous turn. A sharp left turn (which is an awful name for that phenomenon, because everyone confuses it with a treacherous turn) occurs during training, and refers to a change in the AI's capabilities. The closest thing we've seen is stuff like grokking or the formation of induction heads.
Writing insecure code when instructed to write secure code is not really the same thing as being incorrigible. That's just being disobedient.
Training an AI to be incorrigible would be a very weird process, since you'd be training it to not respond to certain types of training.
Research bounties have an extremely serious flaw: you only get the money after you've done the work, while you probably need to pay for food, rent, and compute today.
My current situation is that I would love to do more work in technical alignment but nobody is paying me to do so, and I need to keep the lights on.
Smaller bounties could be a nice bonus for finishing a paper: if I had the option to take a grant paying something like £40k/year, I could justify working for nine months to probably get an extra £20k bounty. I cannot justify working for free for nine months to probably get a £60k bounty, because if I fail or get scooped I'd then be dead broke.
So we still need grantmakers who are willing to sit down, evaluate people like me, and (sometimes) give them money. Maybe someone would fill that niche, e.g. giving me a £40k salary in exchange for £45k of the bounty. But if lots of people were capable of doing that, why not just hire them to make the grants! This seems much easier than waiting for the good research-predictors to float to the top of the research bounty futures market.
I don't have great faith in the epistemics of postrats as they exist today. My somewhat limited experience of post-rattish meetups and TPOT is that they're a mix of people who are indistinguishable from rats (and indeed lots are just rats), people who are mostly normie-ish and don't think about epistemics, and totally woo people who are obviously wrong about lots of things (astrology, karma, UFOs) with no epistemic gain.
My guess is that the rationalist frame is 80% correct, and the best alternative in the remaining 20% of cases is normie epistemics. The first type of "postrats" just use the rationalist frame. The second type swap in some amount of normie epistemology, but not in a way which correlates with the times they should actually be swapping it in. The third type of postrats swap a woo/religious frame in for the rationalist frame, which seems mostly just worse than the rationalist one.
The second and third groups do have better interpersonal skills than rats, but I think this is mostly just regression to the mean.
Conversely, a high school student who wants to be the CEO of a major company is so far away from their goal that it’s hard to think of them as controlling their path towards it. Instead, they first need to select between plans for becoming such a CEO based on how likely each plan is to succeed.
I think this (and your dancer example) still qualifies as control in @abramdemski's framework, since the student only gets to actually pursue one possible career.
The selection-control axis is about the size of the generalization gap you have to cross with a world model. An example of something more selection-ish would be a dancer learning half a dozen dances, running through them in front of his coach, and having the coach say which one she liked best. In this case we're only crossing a gap between the coach's preferences and a competition judge's preferences.
Choosing which dance to learn still counts as control, because you have to model things like your own skill level in advance.
As an aside, it's quite funny that Scott of all people decreased his "I am special and can change the world" estimate given that he clearly is special and can change the world. The US Vice President literally reads his blog sometimes!
I think this is unreasonably hopeful. I think it's likely that AI companies will develop a superhuman researcher mostly by RLing it on doing research, which I would expect to shape an AI whose main drive is towards doing research. To the extent that it may have longer-horizon drives beyond individual research, I expect those to be built around, and secondary to, a non-negotiable drive to do research now.
(At the risk of over-anthropomorphizing AIs, analogize them to e/accs, who basically just wanted to make money building cool AI stuff and invented an insane philosophical edifice entirely subservient to that drive.)
How much of the web stuff can be done through a terminal or API? That was always a large chunk of the faff for me, and also seems like the biggest barrier between my previous experiences and "Claude, spin me up a GPU".