I'm trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.
I liked this post. Reward button alignment seems like a good toy problem to attack, or to use for discussing alignment feasibility.
But it's not obvious to me whether the AI would really become something like a superintelligent optimizer of reward button presses. (Even if your exact proposal doesn't work, I think reward button alignment is probably a relatively feasible problem for brain-like AGI.) There are multiple potential problems, most of which seem like "eh, it probably works fine, but I'm not sure". My current biggest doubt is: when the AI becomes reflective, will the reflectively endorsed values include only reward button presses, or also a bunch of shards that were used for estimating expected button presses?
Let me try to understand in more detail what you imagine the AI to look like:
I can try to imagine in more detail what might go wrong once I see better what you're imagining.
(Also, in case you're trying to explain why you think it would work by analogy to humans, perhaps use John von Neumann or someone like that as the example, rather than normies or normie situations.)
(You did respond to all the important parts; the rest of my comment is very much optional.)
My reading was that you still have an open disagreement: Steve thinks there's not much more to explain, but you still want an answer to "Why did people invent the word 'consciousness' and write what they wrote about it? What algorithm might output sentences describing fascination with the redness of red?", which Steve's series doesn't answer.
I wouldn't give up that early on trying to convince Steve he's missing some part. (Though it's possible that I misread Steve's comment and he did understand you; I didn't read it closely.)
Here’s the (obvious) strategy: Apply voluntary attention-control to keep S(getting out of bed) at the center of attention. Don’t let it slip away, no matter what.
Can you explain more precisely how this works mechanistically? What is happening to keep S(getting out of bed) at the center of attention?
8.5.6.1 Aside: The “innate drive to minimize voluntary attention control”
Your hypothesis here doesn't seem to me to explain why we seem to have a limited willpower budget for attention control, which gets depleted but also regenerates after a while. I can see how negative reward from the "minimize voluntary attention control" drive could make us less likely to apply willpower in the future, but why would the budget regenerate then?
Btw, there's another, simpler possible mechanism, though I don't know the neuroscience, and perhaps Steve's hypothesis with separate valence assessors and involuntary attention control fits the neuroscience evidence much better; it may also fit observed motivated reasoning better.
But the obvious way to design a mind would be to make it just focus on whatever is most important, i.e. wherever the most expected utility could be gained per unit of resources spent.
So we still have a learned value function which assesses how good/bad something would be, but we also have an estimator of how much that value would increase if we kept thinking about the thing (which might e.g. happen because one finds a plan for making a somewhat bad situation better), and what gets attended to depends on this estimator, not on the value function directly.
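Here's a minimal sketch of that alternative mechanism, just to make the shape of the proposal concrete. All the names (select_attention_target, expected_gain_fn, thinking_cost) are hypothetical placeholders of mine, not a claim about the actual neuroscience:

```python
# A minimal sketch of the alternative mechanism, with hypothetical placeholder
# names (expected_gain_fn, thinking_cost); not a claim about real neuroscience.

def select_attention_target(candidates, expected_gain_fn, thinking_cost):
    """Pick the item to attend to.

    Attention goes to the candidate where further thinking is expected to
    improve the assessed value the most per unit of thinking cost -- NOT to
    the candidate whose current value (from the learned value function) is
    highest or lowest. The value function only feeds into the gain estimate.
    """
    def priority(item):
        # expected_gain_fn(item) estimates how much the value function's
        # assessment of `item` would improve if we kept thinking about it,
        # e.g. by finding a plan that makes a somewhat bad situation better.
        return expected_gain_fn(item) / thinking_cost(item)

    return max(candidates, key=priority)


# Toy usage: a mildly bad but fixable situation outranks a very good memory
# that more thinking can't improve.
expected_gain = {"leaky faucet": 1.5, "great vacation memory": 0.1}

target = select_attention_target(
    candidates=list(expected_gain),
    expected_gain_fn=expected_gain.get,
    thinking_cost=lambda item: 1.0,
)
print(target)  # -> "leaky faucet"
```

(On this picture, a very good or very bad situation that thinking can't improve wouldn't hog attention, which is what the toy usage illustrates.)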
The "S" in "S(X)" and "S(A)" seems different to me. If I rename the "S" in "S(A)" to "I", it would make more sense to me:
Yeah, I agree that it wouldn't be a very bad kind of s-risk. The way I thought about s-risk was more like expected amount of suffering. But yeah, I agree with you that it's not that bad, and perhaps most expected suffering comes from more active utility-inverting threats or values.
(Though to be clear, I was totally imagining 1e40 humans being forced to press reward buttons.)
I probably got more out of watching Hofstadter give a little lecture on analogical reasoning (example) than from this whole book.
I didn't read the lecture you linked, but I liked Hofstadter's book "Surfaces and Essences" which had the same core thesis. It's quite long though. And not about neuroscience.
I find this rather ironic:
6. If the AGI subverts the setup and gets power, what would it actually want to do with that power?
It’s hard to say. Maybe it would feel motivated to force humans to press the reward button over and over. Or brainwash / drug them to want to press the reward button.
[...]
On the plus side, s-risk (risk of astronomical amounts of suffering) seems very low for this kind of approach.
(I guess I wouldn't say it's very low s-risk, but that's not actually an important disagreement here. Partially I just thought it sounded funny.)
- I agree Eliezer likely wouldn't want "corrigibility" to refer to the thing I'm imagining, which is why I talk about MIRI!corrigibility and Paul!corrigibility.
Yeah thanks for distinguishing. It's not at all obvious to me that Paul would call CIRL "corrigible" - I'd guess not, but idk.
My model of what Paul thinks about corrigibility matches my model of corrigibility much, much more closely than CIRL does. It's possible that the EY-Paul disagreement mostly comes down to consequentialism. CIRL seems obviously uncorrigible/uncorrectable except while the AI is still dumber than the smartest humans in the general domain.
- I disagree that in early-CIRL "the AI doesn't already know its own values and how to accomplish them better than the operators". It knows that its goal is to optimize the human's utility function, and it can be better than the human at eliciting that utility function. It just doesn't have perfect information about what the human's utility function is.
Sorry, that was very poorly phrased by me. What I meant was "the AI doesn't already know how to evaluate what's best according to its own values better than the operators do". So yes, I agree. I still find it confusing why people started calling that corrigibility, though.
In your previous comment you wrote:
I still feel like the existence of CIRL code that would both make-plans-that-lase and (in the short run) accept many kinds of corrections, learn about your preferences, give resources to you when you ask, etc should cast some doubt on the notion that corrigibility is anti-natural.
I don't understand why you think this. It accepts corrections as long as it has less common sense than humans, but as soon as it gets generally as smart as a very smart human, it wouldn't. (Of course this doesn't matter if all goes well, because the CIRL AI would go on to become an aligned superintelligence, but it's not correctable, and I don't see why you think it's evidence.)
- I care quite a bit about what happens with AI systems that are around or somewhat past human level, but are not full superintelligence (for standard bootstrapping reasons).
I (and I think also Eliezer) agree with that. But CIRL::correctability already breaks down at a high human level, so I don't know what you mean here.
Also, in my view corrigibility isn't just about what happens if the alignment works out totally fine; it's about still being able to correct the AI if it doesn't:
If something goes wrong with CIRL so that its goal isn't pointed at the human utility function anymore, it would not want the operators to correct it.
One central hope behind corrigibility was that if something went wrong and changed the optimization target, the AI would still let operators correct it as long as the simple corrigibility part kept working. (Where the hope was that there would be a quite simple and robust such corrigibility part, but we haven't found it yet.)
E.g. if you look at the corrigibility paper, you could imagine that if they had actually found a utility function combined from U_normal and U_shutdown with the desirable properties, the AI would stay shutdownable even if U_normal changed in an undesirable way (e.g. in case it rebinds incorrectly after an ontology shift).
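To illustrate what I mean by the shutdown branch being separate from U_normal, here's a very rough sketch. This is not the actual construction from the corrigibility paper (which adds correction terms so the agent isn't incentivized to press or prevent the button); u_normal and u_shutdown here are placeholder stand-ins:

```python
# Very rough sketch of the point above, not the actual construction from the
# corrigibility paper. u_normal and u_shutdown are placeholder utilities.

def u_normal(outcome):
    # The "intended" utility function; this is the part that might silently
    # go wrong, e.g. rebind incorrectly after an ontology shift.
    return outcome.get("paperclips", 0)

def u_shutdown(outcome):
    # Rewards being shut down promptly once the button has been pressed.
    return 1.0 if outcome.get("shut_down", False) else 0.0

def combined_utility(outcome, button_pressed):
    # The shutdown branch does not depend on u_normal at all, so even if
    # u_normal is corrupted, the agent still prefers shutting down once the
    # button is pressed (assuming the combination itself keeps working).
    if button_pressed:
        return u_shutdown(outcome)
    return u_normal(outcome)
```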
Though another way to keep being able to correct the AI's goals is to have the AI not think much, in the general domain, about stuff like "the operators may change my goals".
(Most of the corrigibility principles are about a different part of corrigibility, but I think this "be able to correct the AI even if something goes a bit wrong with its alignment" is a central part of corrigibility.)
I'm not quite sure if you're trying to (1) convince me of something or (2) inform me of something or (3) write things down for your own understanding or (4) something else.
Mainly 3 and 4. But I am interested in seeing your reactions to get a better model of how some people think about corrigibility.
Thanks.
(I do agree that we can get a narrowly superhuman CIRL-like AI which we can still shut down, because it trusts humans more on general strategic considerations. But if your plan is to have the AI solve alignment or coordinate the world to slow down AI progress, I think this won't help you much with the parts of the problem we are most bottlenecked on.)
I agree that the AI may not be able to precisely predict what exact tradeoffs each operator would be willing to make, e.g. between the required time and the safety of a project, but I think it would be able to predict them well enough that the differences in which strategy it uses wouldn't be large.
Or do you imagine strategically keeping some information from the AI?
Either way, the AI is only updating on information, not changing its (terminal) goals. (Though the instrumental subgoals can in principle change.)
Even if the alignment works out perfectly: when the AI is smarter and the humans say "actually, we want to shut you down", the AI does update toward the humans probably being worried about something. But if the AI is smart enough and sees that the humans are worried about something that isn't actually going to happen, it can just say "sorry, that's not actually in your extrapolated interests, you will perhaps understand later when you're smarter", and then go on trying to fulfill human values.
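Here's a toy sketch of that dynamic, as I understand it. The agent's terminal goal (maximize expected human utility under its current beliefs) never changes; a shutdown request is just more evidence about whether its plan is good. All numbers and names are made up for illustration:

```python
# Toy sketch: the terminal goal (maximize expected human utility under the
# agent's beliefs) never changes; a shutdown request is just evidence.
# Numbers and names are made up for illustration.

def expected_utility_of_continuing(p_plan_is_good, u_if_good=10.0, u_if_bad=-100.0):
    return p_plan_is_good * u_if_good + (1 - p_plan_is_good) * u_if_bad

def decide(p_plan_is_good, humans_request_shutdown, evidence_strength=4.0):
    """Return 'continue' or 'shut down', from the agent's point of view."""
    if humans_request_shutdown:
        # Bayesian-ish update: the request is evidence that the plan is bad,
        # but how far it moves the agent depends on its prior confidence.
        odds = p_plan_is_good / (1 - p_plan_is_good)
        odds /= evidence_strength
        p_plan_is_good = odds / (1 + odds)

    if expected_utility_of_continuing(p_plan_is_good) > 0:
        return "continue"   # "sorry, that's not in your extrapolated interests"
    return "shut down"

# An uncertain agent defers to the request; a sufficiently confident one doesn't.
print(decide(p_plan_is_good=0.95, humans_request_shutdown=True))    # shut down
print(decide(p_plan_is_good=0.9999, humans_request_shutdown=True))  # continue
```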
But if we're confident the alignment to humans will work out, we don't need corrigibility. Corrigibility is rather intended so that we might be able to recover if something goes wrong.
If the values of the AI drift a bit, then the AI will likely notice this before the humans do, and take measures so that the humans don't find out or won't (be able to) change its values back, because that's the strategy that's best according to the AI's new values.
Likewise, that's just updating on new information, not changing terminal goals.
Also note that parents often think (sometimes correctly) that they know better what is in the child's extrapolated interests, and then don't act according to the child's stated wishes.
And I think superhumanly smart AIs will likely be better at guessing what is in a human's interests than parents are at guessing what is in their child's interests, so the cases where the strategy gets updated are less significant.
From my perspective, CIRL doesn't really show much correctability once the AI is generally smarter than humans. It would only show that if a smart AI were somehow quite bad at guessing what humans wanted, so that when we tell it what we want, it importantly updates its strategy, up to and including shutting itself down because it believes that is then the best way to accomplish its goal. (I might still not call it corrigible, but I would see your point about corrigible behavior.)
I do think getting corrigible behavior out of a dumbish AI is easy. But it seems hard for an AI that is able to prevent anyone from building an unaligned AI.