Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a special post for short-form writing by evhub. Only they can create top-level comments. Comments here also appear on the Shortform Page and All Posts page.
This is a list of random, assorted AI safety ideas that I think somebody should try to write up and/or work on at some point. I have a lot more than this in my backlog, but these are some that I specifically selected to be relatively small, single-post-sized ideas that an independent person could plausibly work on without much oversight. That being said, I think it would be quite hard to do a good job on any of these without at least chatting with me first—though feel free to message me if you’d be interested.
I'll continue to include more directions like this in the comments here.
I'd just make this a top level post.
I want this more as a reference to point specific people (e.g. MATS scholars) to than as something I think lots of people should see—I don't expect most people to get much out of this without talking to me. If you think other people would benefit from looking at it, though, feel free to call more attention to it.
Mmm, maybe you're right (I was gonna say "making a top-level post which includes 'chat with me about this if you actually wanna work on one of these'", but it then occurs to me you might already be maxed out on chat-with-people time, and it may be more useful to send this to people who have already passed some kind of 'worth your time' filter)
Humans don't wirehead because reward reinforces the thoughts which the brain's credit assignment algorithm deems responsible for producing that reward. Reward is not, in practice, that-which-is-maximized -- reward is the antecedent-thought-reinforcer, it reinforces that which produced it. And when a person does a rewarding activity, like licking lollipops, they are thinking thoughts about reality (like "there's a lollipop in front of me" and "I'm picking it up"), and so these are the thoughts which get reinforced. This is why many human values are about latent reality and not about the human's beliefs about reality or about the activation of the reward system.
It seems that you're postulating that the human brain's credit assignment algorithm is so bad that it can't tell what high-level goals generated a particular action and so would give credit just to thoughts directly related to the current action. That seems plausible for humans, but my guess would be against for advanced AI systems.
No, I don't intend to postulate that. Can you tell me a mechanistic story of how better credit assignment would go, in your worldview?
Disclaimer: At the time of writing, this has not been endorsed by Evan.
I can give this a go.
Unpacking Evan's Comment:
My read of Evan's comment (the parent to yours) is that there are a bunch of learned high-level-goals ("strategies") with varying levels of influence on the tactical choices made, and that a well-functioning end-to-end credit-assignment mechanism would propagate through action selection ("thoughts directly related to the current action" or "tactics") all the way to strategy creation/selection/weighting. In such a system, strategies which decide tactics which emit actions which receive reward are selected for at the expense of strategies less good at that. Conceivably, strategies aiming directly for reward would produce tactical choices more highly rewarded than strategies not aiming quite so directly.
One way for this not to be how humans work would be if reward did not propagate to the strategies, and they were selected/developed by some other mechanism while reward only honed/selected tactical cognition. (You could imagine that "strategic cognition" is that which chooses bundles of context-dependent tactical policies, and "tactical cognition" is that which implements a given tactic's choice of actions in response to some context.) This feels to me close to what Evan was suggesting you were saying is the case with humans.
One Vaguely Mechanistic Illustration of a Similar Concept:
A similar way for this to be broken in humans, departing just a bit from Evan's comment, is if the credit assignment algorithm could identify tactical choices with strategies, but not equally reliably across all strategies. As a totally made up concrete and stylized illustration: Consider one evolutionarily-endowed credit-assignment-target: "Feel physically great," and two strategies: wirehead with drugs (WIRE), or be pro-social (SOCIAL.) Whenever WIRE has control, it emits some tactic like "alone in my room, take the most fun available drug" which takes actions that result in Xw physical pleasure over a day. Whenever SOCIAL has control, it emits some tactic like "alone in my room, abstain from dissociative drugs and instead text my favorite friend" taking actions which result in Xs physical pleasure over a day.
Suppose also that asocial cognitions like "eat this" have poorly wired feed-back channels and the signal is often lost and so triggers credit-assignment only some small fraction of the time. Social cognition is much better wired-up and triggers credit-assignment every time. Whenever credit assignment is triggered, once a day, reward emitted is 1:1 with the amount of physical pleasure experienced that day.
Since WIRE only gets credit a fraction of the time that it's due, the average reward (over 30 days, say) credited to WIRE is <<Xw∗30. If and only if Xw>>Xs, like if the drug is heroin or your friends are insufficiently fulfilling, WIRE will be reinforced more relative to SOCIAL. Otherwise, even if the drug is somewhat more physically pleasurable than the warm-fuzzies of talking with friends, SOCIAL will be reinforced more relative to WIRE.
Conclusion:
I think Evan is saying that he expects advanced reward-based AI systems to have no such impediments by default, even if humans do have something like this in their construction. Such a stylized agent without any signal-dropping would reinforce WIRE over SOCIAL every time that taking the drug was even a tiny bit more physically pleasurable than talking with friends.
Maybe there is an argument that such reward-aimed goals/strategies would not produce the most rewarding actions in many contexts, or for some other reason would not be selected for / found in advanced agents (as Evan suggests in encouraging someone to argue that such goals/strategies require concepts which are unlikely to develop,) but the above might be in the rough vicinity of what Evan was thinking.
REMINDER: At the time of writing, this has not been endorsed by Evan.
Thanks for the story! I may comment more on it later.
That seems to imply that humans would continue to wirehead conditional on that they started wireheading.
Yes, I think they indeed would.
About the following point:
"Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment."
Well, that seems to be what happened in the case of rats and probably many other animals. Stick an electrode into the reward center of the brain of a rat. Then give it a button to trigger the electrode. Now some rats will trigger their reward centers and ignore food.
Humans value their experience. A pleasant state of consciousness is actually intrinsically valuable to humans. Not that this is the only thing that humans value, but it is certainly a big part.
It is unclear how this would generalize to artificial systems. We don't know if, or in what sense they would have experience, and why that would even matter in the first place. But I don't think we can confidently say that something computationally equivalent to "valuing experience", won't be going on in artificial systems we are going to build.
So somebody picking this point would probably need to address this point and argue why artificial systems are different in this regard. The observation that most humans are not heroin addicts seems relevant. Though the human story might be different if there were no bad side effects and you had easy access to it. This would probably be more the situation artificial systems would find themselves in. Or in a more extreme case, imagine soma but you live longer.
In short: Is valuing experience perhaps computationally equivalent to valuing transistors storing the reward? Then there would be real-world examples of that happening.
I have a related draft on this.
Listening to this John Oliver, I feel like getting broad support behind transparency-based safety standards might be more possible than I previously thought. He emphasizes the "if models are doing some bad behavior, the creators should be able to tell us why" point a bunch and it's in fact a super reasonable point. It seems to me like we really might be able to get enough broad consensus on that sort of a point to get labs to agree to some sort of standard based on it.
The hard part to me now seems to be in crafting some kind of useful standard rather than one in hindsight makes us go "well that sure have everyone a false sense of security".
Yeah I also felt some vague optimism about that.
Epistemic status: random philosophical musings.
Assuming a positive singularity, how should humanity divide its resources? I think the obvious (and essentially correct) answer is "in that situation, you have an aligned superintelligence, so just ask it what to do." But I nevertheless want to philosophize a bit about this, for one main reason.
That reason is: an important factor imo in determining the right thing to do in distributing resources post-singularity is what incentives that choice of resource allocation creates for people pre-singularity. For those incentives to work, though, we have to actually be thinking about this now, since that's what allows the choice of resource distribution post-singularity to have its acausal influence on our choices pre-singularity. I will note that this is definitely something that I think about sometimes, and something that I think a lot of other people also implicitly think about sometimes when they consider things like amassing wealth, specifically gaining control over current AIs, and/or the future value of their impact certificates.
So, what are some of the possible options for how to distribute resources post-singularity? Let's go over some of the various possible solutions here and why I don't think any of the obvious things here are what you want:
As above, I don't think that you should want your aligned AI to implement any of these particular solutions. I think some combination of (3) and (4) is probably the best out of these options, though of course I'm sure that if you actually asked an aligned superintelligent AI it would do better than any of these. More broadly, though, I think that it's important to note that (1), (2), and (3) are all failure stories, not success stories, and you shouldn't expect them to happen in any scenario where we get alignment right.
Circling back to the original reason that I wanted to discuss all of this, which is how it should influence our decisions now:
When you start diverting significant resources away from #1, you’ll probably discover that the definition of “aligned” is somewhat in contention.
Here's a simple argument that I find quite persuasive for why you should have linear returns to whatever your final source of utility is (e.g. human experience of a fulfilling life, which I'll just call "happy humans"). Note that this is not an argument that you should have linear returns to resources (e.g. money). The argument goes like this:
You're assuming that your utility function should have the general form of valuing each "source of utility" independently and then aggregating those values (such that when aggregating you no longer need the details of each "source" but just their values). But in The Moral Status of Independent Identical Copies I found this questionable (i.e., conflicting with other intuitions).
This is the fungibility objection I address above:
Ah, I think I didn't understand that parenthetical remark and skipped over it. Questions:
I tried to explain my intuitions/uncertainty about this in The Moral Status of Independent Identical Copies (it was linked earlier in this thread).
I read it, and I think I broadly agree with it, but I don't know why you think it's a reason to treat physical distance differently to Everett branch distance, holding diversity constant. The only reason that you would want to treat them differently, I think, is if the Everett branch happy humans are very similar, whereas the physically separated happy humans are highly diverse. But, in that case, that's an argument for superlinear local returns to happy humans, since it favors concentrating them so that it's easier to make them as diverse as possible.
I have a stronger intuition for "identical copy immortality" when the copies are separated spatially instead of across Everett branches (the latter also called quantum immortality). For example if you told me there are 2 identical copies of Earth spread across the galaxy and 1 of them will instantly disintegrate, I would be much less sad than if you told me that you'll flip a quantum coin and disintegrate Earth if it comes up heads.
I'm not sure if this is actually a correct intuition, but I'm also not sure that it's not, so I'm not willing to make assumptions that contradict it.
Not sure about this. Even if I think I am only acting locally, my actions and decisions could have an effect on the larger multiverse. When I do something to increase happy humans in my own local universe, I am potentially deciding / acting for everyone in my multiverse neighborhood who is similar enough to me to make similar decisions for similar reasons.
I agree that this is the main way that this argument could fail. Still, I think the multiverse is too large and the correlation not strong enough across very different versions of the Earth for this objection to substantially change the bottom line.
My sense is that many solutions to infinite ethics look a bit like this. For example, if you use UDASSA, then a single human who is alone in a big universe will have a shorter description length than a single human who is surrounded by many other humans in a big universe. Because for the former, you can use pointers that specify the universe and then describe sufficient criteria to recognise a human, but for the latter, you need to nail down exact physical location or some other exact criteria that distinguishes a specific human from every other human.
I agree that UDASSA might introduce a small effect like this, but my guess is that the overall effect isn't enough to substantially change the bottom line. Fundamentally, being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty.
Maybe? I don't really know how to reason about this.
If that's true, that still only means that you should be linear for gambles that give different results in different quantum branches. C.f. logical vs. physical risk aversion.
Some objection like that might work more generally, since some logical facts will mean that there are far less humans in the universe-at-large, meaning that you're at a different point in the risk-returns curve. So when comparing different logical ways the universe could be, you should not always care about the worlds where you can affect more sentient beings. If you have diminishing marginal returns, you need to be thinking about some more complicated function that is about whether you have a comparative advantage at affecting more sentient beings in worlds where there is overall fewer sentient beings (as measured by some measure that can handle infinities). Which matters for stuff like whether you should bet on the universe being large.
If you want to produce warning shots for deceptive alignment, you're faced with a basic sequencing question. If the model is capable of reasoning about its training process before it's capable of checking a predicate like RSA-2048, then you have a chance to catch it—but if it becomes capable of checking a predicate like RSA-2048 first, then any deceptive models you build won't be detectable.