Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a special post for short-form writing by evhub. Only they can create top-level comments. Comments here also appear on the Shortform Page and All Posts page.
This is a list of random, assorted AI safety ideas that I think somebody should try to write up and/or work on at some point. I have a lot more than this in my backlog, but these are some that I specifically selected to be relatively small, single-post-sized ideas that an independent person could plausibly work on without much oversight. That being said, I think it would be quite hard to do a good job on any of these without at least chatting with me first—though feel free to message me if you’d be interested.
I'll continue to include more directions like this in the comments here.
I'd just make this a top level post.
I want this more as a reference to point specific people (e.g. MATS scholars) to than as something I think lots of people should see—I don't expect most people to get much out of this without talking to me. If you think other people would benefit from looking at it, though, feel free to call more attention to it.
Mmm, maybe you're right (I was gonna say "making a top-level post which includes 'chat with me about this if you actually wanna work on one of these'", but it then occurs to me you might already be maxed out on chat-with-people time, and it may be more useful to send this to people who have already passed some kind of 'worth your time' filter)
Humans don't wirehead because reward reinforces the thoughts which the brain's credit assignment algorithm deems responsible for producing that reward. Reward is not, in practice, that-which-is-maximized -- reward is the antecedent-thought-reinforcer, it reinforces that which produced it. And when a person does a rewarding activity, like licking lollipops, they are thinking thoughts about reality (like "there's a lollipop in front of me" and "I'm picking it up"), and so these are the thoughts which get reinforced. This is why many human values are about latent reality and not about the human's beliefs about reality or about the activation of the reward system.
It seems that you're postulating that the human brain's credit assignment algorithm is so bad that it can't tell what high-level goals generated a particular action and so would give credit just to thoughts directly related to the current action. That seems plausible for humans, but my guess would be against for advanced AI systems.
No, I don't intend to postulate that. Can you tell me a mechanistic story of how better credit assignment would go, in your worldview?
Disclaimer: At the time of writing, this has not been endorsed by Evan.
I can give this a go.
Unpacking Evan's Comment:
My read of Evan's comment (the parent to yours) is that there are a bunch of learned high-level-goals ("strategies") with varying levels of influence on the tactical choices made, and that a well-functioning end-to-end credit-assignment mechanism would propagate through action selection ("thoughts directly related to the current action" or "tactics") all the way to strategy creation/selection/weighting. In such a system, strategies which decide tactics which emit actions which receive reward are selected for at the expense of strategies less good at that. Conceivably, strategies aiming directly for reward would produce tactical choices more highly rewarded than strategies not aiming quite so directly.
One way for this not to be how humans work would be if reward did not propagate to the strategies, and they were selected/developed by some other mechanism while reward only honed/selected tactical cognition. (You could imagine that "strategic cognition" is that which chooses bundles of context-dependent tactical policies, and "tactical cognition" is that which implements a given tactic's choice of actions in response to some context.) This feels to me close to what Evan was suggesting you were saying is the case with humans.
One Vaguely Mechanistic Illustration of a Similar Concept:
A similar way for this to be broken in humans, departing just a bit from Evan's comment, is if the credit assignment algorithm could identify tactical choices with strategies, but not equally reliably across all strategies. As a totally made up concrete and stylized illustration: Consider one evolutionarily-endowed credit-assignment-target: "Feel physically great," and two strategies: wirehead with drugs (WIRE), or be pro-social (SOCIAL.) Whenever WIRE has control, it emits some tactic like "alone in my room, take the most fun available drug" which takes actions that result in Xw physical pleasure over a day. Whenever SOCIAL has control, it emits some tactic like "alone in my room, abstain from dissociative drugs and instead text my favorite friend" taking actions which result in Xs physical pleasure over a day.
Suppose also that asocial cognitions like "eat this" have poorly wired feed-back channels and the signal is often lost and so triggers credit-assignment only some small fraction of the time. Social cognition is much better wired-up and triggers credit-assignment every time. Whenever credit assignment is triggered, once a day, reward emitted is 1:1 with the amount of physical pleasure experienced that day.
Since WIRE only gets credit a fraction of the time that it's due, the average reward (over 30 days, say) credited to WIRE is <<Xw∗30. If and only if Xw>>Xs, like if the drug is heroin or your friends are insufficiently fulfilling, WIRE will be reinforced more relative to SOCIAL. Otherwise, even if the drug is somewhat more physically pleasurable than the warm-fuzzies of talking with friends, SOCIAL will be reinforced more relative to WIRE.
Conclusion:
I think Evan is saying that he expects advanced reward-based AI systems to have no such impediments by default, even if humans do have something like this in their construction. Such a stylized agent without any signal-dropping would reinforce WIRE over SOCIAL every time that taking the drug was even a tiny bit more physically pleasurable than talking with friends.
Maybe there is an argument that such reward-aimed goals/strategies would not produce the most rewarding actions in many contexts, or for some other reason would not be selected for / found in advanced agents (as Evan suggests in encouraging someone to argue that such goals/strategies require concepts which are unlikely to develop,) but the above might be in the rough vicinity of what Evan was thinking.
REMINDER: At the time of writing, this has not been endorsed by Evan.
Thanks for the story! I may comment more on it later.
That seems to imply that humans would continue to wirehead conditional on that they started wireheading.
Yes, I think they indeed would.
About the following point:
"Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment."
Well, that seems to be what happened in the case of rats and probably many other animals. Stick an electrode into the reward center of the brain of a rat. Then give it a button to trigger the electrode. Now some rats will trigger their reward centers and ignore food.
Humans value their experience. A pleasant state of consciousness is actually intrinsically valuable to humans. Not that this is the only thing that humans value, but it is certainly a big part.
It is unclear how this would generalize to artificial systems. We don't know if, or in what sense they would have experience, and why that would even matter in the first place. But I don't think we can confidently say that something computationally equivalent to "valuing experience", won't be going on in artificial systems we are going to build.
So somebody picking this point would probably need to address this point and argue why artificial systems are different in this regard. The observation that most humans are not heroin addicts seems relevant. Though the human story might be different if there were no bad side effects and you had easy access to it. This would probably be more the situation artificial systems would find themselves in. Or in a more extreme case, imagine soma but you live longer.
In short: Is valuing experience perhaps computationally equivalent to valuing transistors storing the reward? Then there would be real-world examples of that happening.
I have a related draft on this.
Listening to this John Oliver, I feel like getting broad support behind transparency-based safety standards might be more possible than I previously thought. He emphasizes the "if models are doing some bad behavior, the creators should be able to tell us why" point a bunch and it's in fact a super reasonable point. It seems to me like we really might be able to get enough broad consensus on that sort of a point to get labs to agree to some sort of standard based on it.
The hard part to me now seems to be in crafting some kind of useful standard rather than one in hindsight makes us go "well that sure have everyone a false sense of security".
Yeah I also felt some vague optimism about that.
If you want to produce warning shots for deceptive alignment, you're faced with a basic sequencing question. If the model is capable of reasoning about its training process before it's capable of checking a predicate like RSA-2048, then you have a chance to catch it—but if it becomes capable of checking a predicate like RSA-2048 first, then any deceptive models you build won't be detectable.