This is a list of random, assorted AI safety ideas that I think somebody should try to write up and/or work on at some point. I have a lot more than this in my backlog, but these are some that I specifically selected to be relatively small, single-post-sized ideas that an independent person could plausibly work on without much oversight. That being said, I think it would be quite hard to do a good job on any of these without at least chatting with me first—though feel free to message me if you’d be interested.
I'll continue to include more directions like this in the comments here.
I want this more as a reference to point specific people (e.g. MATS scholars) to than as something I think lots of people should see—I don't expect most people to get much out of this without talking to me. If you think other people would benefit from looking at it, though, feel free to call more attention to it.
Mmm, maybe you're right (I was gonna say "making a top-level post which includes 'chat with me about this if you actually wanna work on one of these'", but it then occurs to me you might already be maxed out on chat-with-people time, and it may be more useful to send this to people who have already passed some kind of 'worth your time' filter)
- Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.
Humans don't wirehead because reward reinforces the thoughts which the brain's credit assignment algorithm deems responsible for producing that reward. Reward is not, in practice, that-which-is-maximized -- reward is the antecedent-thought-reinforcer, it reinforces that which produced it. And when a person does a rewarding activity, like licking lollipops, they are thinking thoughts about reality (like "there's a lollipop in front of me" and "I'm picking it up"), and so these are the thoughts which get reinforced. This is why many human values are about latent reality and not about the human's beliefs about reality or about the activation of the reward system.
It seems that you're postulating that the human brain's credit assignment algorithm is so bad that it can't tell what high-level goals generated a particular action, and so would give credit just to thoughts directly related to the current action. That seems plausible for humans, but my guess would be against it for advanced AI systems.
No, I don't intend to postulate that. Can you tell me a mechanistic story of how better credit assignment would go, in your worldview?
Disclaimer: At the time of writing, this has not been endorsed by Evan.
I can give this a go.
Unpacking Evan's Comment:
My read of Evan's comment (the parent to yours) is that there are a bunch of learned high-level-goals ("strategies") with varying levels of influence on the tactical choices made, and that a well-functioning end-to-end credit-assignment mechanism would propagate through action selection ("thoughts directly related to the current action" or "tactics") all the way to strategy creation/selection/weighting. In such a system, strategies which decide tactics which emit actions which receive reward are selected for at the expense of strategies less good at that. Conceivably, strategies aiming directly for reward would produce tactical choices more highly rewarded than strategies not aiming quite so directly.
One way for this not to be how humans work would be if reward did not propagate to the strategies, and they were selected/developed by some other mechanism while reward only honed/selected tactical cognition. (You could imagine that "strategic cognition" is that which chooses bundles of context-dependent tactical policies, and "tactical cognition" is that which implements a given tactic's choice of actions in response to some context.) This feels to me close to what Evan was suggesting you were saying is the case with humans.
One Vaguely Mechanistic Illustration of a Similar Concept:
A similar way for this to be broken in humans, departing just a bit from Evan's comment, is if the credit assignment algorithm could identify tactical choices with strategies, but not equally reliably across all strategies. As a totally made-up, concrete, and stylized illustration: Consider one evolutionarily-endowed credit-assignment-target: "Feel physically great," and two strategies: wirehead with drugs (WIRE), or be pro-social (SOCIAL). Whenever WIRE has control, it emits some tactic like "alone in my room, take the most fun available drug," which takes actions that result in physical pleasure over a day. Whenever SOCIAL has control, it emits some tactic like "alone in my room, abstain from dissociative drugs and instead text my favorite friend," taking actions which result in physical pleasure over a day.
Suppose also that asocial cognitions like "eat this" have poorly wired feedback channels: the signal is often lost, and so triggers credit-assignment only some small fraction of the time. Social cognition is much better wired up and triggers credit-assignment every time. Whenever credit assignment is triggered, once a day, the reward emitted is 1:1 with the amount of physical pleasure experienced that day.
Since WIRE only gets credit a fraction $f$ of the time that it's due, the average reward (over 30 days, say) credited to WIRE is $f \cdot R_{\text{WIRE}}$, where $R_{\text{WIRE}}$ is the daily physical pleasure WIRE produces. If and only if $f \cdot R_{\text{WIRE}} > R_{\text{SOCIAL}}$, like if the drug is heroin or your friends are insufficiently fulfilling, WIRE will be reinforced more relative to SOCIAL. Otherwise, even if the drug is somewhat more physically pleasurable than the warm-fuzzies of talking with friends, SOCIAL will be reinforced more relative to WIRE.
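The stylized arithmetic above can be sketched in a few lines (all numbers are made up for illustration, and the linear expected-credit formula is just the simplest choice consistent with the setup):

```python
def expected_credit(days, daily_pleasure, signal_prob):
    """Expected reward credited to a strategy: reward is 1:1 with the
    pleasure it produces, but its credit-assignment signal only gets
    through with probability `signal_prob`."""
    return days * daily_pleasure * signal_prob

# WIRE is somewhat more pleasurable per day, but its asocial feedback
# channel drops the signal 80% of the time (made-up numbers).
wire = expected_credit(days=30, daily_pleasure=1.2, signal_prob=0.2)
social = expected_credit(days=30, daily_pleasure=1.0, signal_prob=1.0)
assert social > wire  # SOCIAL is reinforced more, despite less pleasure

# Remove the signal-dropping and WIRE wins whenever the drug is even a
# tiny bit more pleasurable than talking with friends.
wire_fixed = expected_credit(days=30, daily_pleasure=1.2, signal_prob=1.0)
assert wire_fixed > social
```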
Conclusion:
I think Evan is saying that he expects advanced reward-based AI systems to have no such impediments by default, even if humans do have something like this in their construction. Such a stylized agent without any signal-dropping would reinforce WIRE over SOCIAL every time that taking the drug was even a tiny bit more physically pleasurable than talking with friends.
Maybe there is an argument that such reward-aimed goals/strategies would not produce the most rewarding actions in many contexts, or for some other reason would not be selected for / found in advanced agents (as Evan suggests in encouraging someone to argue that such goals/strategies require concepts which are unlikely to develop), but the above might be in the rough vicinity of what Evan was thinking.
REMINDER: At the time of writing, this has not been endorsed by Evan.
About the following point:
"Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment."
Well, that seems to be what happened in the case of rats and probably many other animals. Stick an electrode into the reward center of the brain of a rat. Then give it a button to trigger the electrode. Now some rats will trigger their reward centers and ignore food.
Humans value their experience. A pleasant state of consciousness is actually intrinsically valuable to humans. Not that this is the only thing that humans value, but it is certainly a big part.
It is unclear how this would generalize to artificial systems. We don't know if, or in what sense, they would have experience, or why that would even matter in the first place. But I don't think we can confidently say that something computationally equivalent to "valuing experience" won't be going on in the artificial systems we are going to build.
So somebody picking up this idea would probably need to address this and argue why artificial systems are different in this regard. The observation that most humans are not heroin addicts seems relevant, though the human story might be different if there were no bad side effects and you had easy access to it. That would probably be closer to the situation artificial systems would find themselves in. Or, in a more extreme case, imagine soma, but you live longer.
In short: Is valuing experience perhaps computationally equivalent to valuing transistors storing the reward? Then there would be real-world examples of that happening.
Here's a two-sentence argument for misalignment that I think is both highly compelling to laypeople and technically accurate at capturing the key issue:
When we train AI systems to be nice, we're giving a bunch of random programs a niceness exam and selecting the programs that score well. If you gave a bunch of humans a niceness exam where the humans would be rewarded for scoring well, do you think you would get actually nice humans?
To me it seems a solid attempt at conveying [misalignment is possible, even with a good test], but not necessarily [misalignment is likely, even with a good test]. (not that I have a great alternative suggestion)
Important disanalogies seem to be:
1) Most humans aren't good at convincingly faking niceness (I think!). The listener may assume a test good enough to successfully exploit this most of the time.
2) The listener will assume that [score highly on niceness] isn't the human's only reward. (both things like [desire to feel honest] and [worry of the consequences of being caught cheating])
3) A fairly large proportion of humans are nice (I think!).
The second could be addressed somewhat by raising the stakes.
The first seems hard to remedy within this analogy. I'd be a little concerned that people initially buy it, then think for themselves and conclude "But if we design a really clever niceness test, then it'd almost always work - all we need is clever people to work for a while on some good tests".
Combined with (3), this might seem like a decent solution.
Overall, I think what's missing is that we'd expect [our clever test looks to us as if it works] well before [our clever test actually works]. My guess is that the layperson isn't going to have this intuition in the human-niceness-test case.
I expect this is very susceptible to opinions about human nature. Someone who thinks humans ARE generally nice is likely to answer "yes, of course" to your question. To someone who thinks humans are generally extremely context-sensitive, merely appearing nice in the co-evolved social settings in which we generally interact, the answer is "who knows?". But the latter group doesn't need to be convinced; we're already worried.
Surely nobody thinks that all humans are nice all the time and nobody would ever fake a niceness exam. I mean, I think humans are generally pretty good, but obviously that always has to come with a bunch of caveats because you don't have to look very far into human history to see quite a lot of human-committed atrocities.
I think the answer is an obvious yes, all other things held equal. Of course, what happens in reality is more complex than this, but I'd still say yes in most cases. That's primarily because I think aligned behavior is very simple: so simple that it either only barely loses out to the deceptive model or outright has the advantage, depending on the programming language and low-level details. Thus we only need to transfer 300-1000 bits at most, which is likely very easy.
Much more generally, my fundamental claim is that the complexity of pointing to human values is very similar to that of pointing to the set of all long-term objectives; it can be easier or harder, but I don't buy the assumption that pointing to human values is way harder than pointing to the set of long-term goals.
I think that sticking to capitalism as an economic system post-singularity would be pretty clearly catastrophic and something to strongly avoid, despite capitalism working pretty well today. I've talked about this a bit previously here, but some more notes on why:
Capitalism is a complex system with many moving parts, some of which are sometimes assumed to constitute the entirety of what defines it. What kinds of components do you see as being highly unlikely to be included in a successful utopia, and what components could be internal to a well-functioning system as long as (potentially-left-unspecified) conditions are met? I could name some kinds of components (e.g. some kinds of contracts or enforcement mechanisms) that I expect not to be used in a utopia, though I suspect at this point you've seen my comments where I get into this, so I'm more interested in what you say without that prompting.
Who's this "we" you're talking about? It doesn't seem to be any actual humans I recognize. As far as I can tell, the basics of capitalism (call it "simple capitalism") are just what happens when individuals make decisions about resource use. We call it "ownership", but really any form of resolution of the underlying conflict of preferences would likely work out similarly. That conflict is that humans have unbounded desires, and resources are bounded.
The drive to make goods and services for each other, in pursuit of selfish wants, does incentivize labor, but it's not because "society requires" it, except in a pretty blurry aggregation of individual "requires". Price discovery is only half of what market transactions do. The other half is usage limits and resource distribution. These are two sides of the same coin and can't be separated: without limited amounts there is no price, and without prices there is no agent-driven exchange of different kinds of resource.
I'm with you that modern capitalism is pretty unpleasant due to optimization pressure, and due to the easy aggregation of far more people and resources than historically possible, and than human culture was evolved around. I don't see how the underlying system has any alternative that doesn't do away with individual desire and consumption. Especially the relative/comparative consumption that seems to drive a LOT of perceived-luxury requirements.
I think some version of distributing intergalactic property rights uniformly (e.g. among humans existing in 2023) combined with normal capitalism isn't clearly that catastrophic. (The distribution is what you call the egalitarian/democratic solution in the link.)
Maybe you lose about a factor of 5 or 10 over the literally optimal approach from my perspective (but maybe this analysis is tricky due to two envelope problems).
(You probably also need some short term protections to avoid shakedowns etc.)
Are you pessimistic that people will bother reflecting or thinking carefully prior to determining resource utilization or selling their property? I guess I feel like 10% of people being somewhat thoughtful matches the rough current distribution of extremely rich people.
If the situation was something like "current people, weighted by wealth, deliberate for a while on what to do with our resources" then I agree that's probably like 5 - 10 times worse than the best approach (which is still a huge haircut) but not clearly catastrophic. But it's not clear to me that's what the default outcome of competitive dynamics would look like—sufficiently competitive dynamics could force out altruistic actors if they get outcompeted by non-altruistic actors.
I think one crux between you and me, at least, is that you see this as a considered division of how to divide resources, and I see it as an equilibrium consensus/acceptance of what property rights to enforce in the maintenance, creation, and especially transfer of control/usage of resources. You think of static division; I think of equilibria and motion. Both are valid, but experience and resource use is ongoing and must be accounted for.
I'm happy that the modern world generally approves of self-ownership: a given individual gets to choose what to do (within limits, but it's nowhere near the case that my body and mind are part of the resources allocated by whatever mechanism is being considered). It's generally considered an alignment failure if individual will is just a resource that the AI manages. Physical resources (and behavioral resources, which are a sale of the results of some human efforts, a distinct resource from the mind performing the action) are generally owned by someone, and they trade some results to get the results of other people's resources (including their labor and thought-output).
There could be a side mechanism for some amount of resources just for existing, but it's unlikely that it can be the primary transfer/allocation mechanism as long as individuals have independent and conflicting desires. Currently valuable self-owned products (office work, software design, etc.) probably reduce in value a lot. If all human output becomes valueless (in the "tradable for other desired things or activities" sense of valuable), I don't think current humans will continue to exist.
Wirehead utopia (including real-world "all desires fulfilled without effort or trade") doesn't sound appealing or workable for what I know of my own and general human psychology.
self-ownership: a given individual gets to choose what to do (within limits, but it's nowhere near the case that my body and mind are part of the resources allocated by whatever mechanism is being considered)
for most people, this is just the right to sell their body to the machine. better than being forced at gunpoint, but being forced to by an empty fridge is not that much better, especially with monopoly accumulation as the default outcome. I agree that being able to mark one's selfhood boundaries with property contracts is generally good, but the ability to massively expand one's property contracts to exclude others from resource access is effectively a sort of scalping - sucking up resources so as to participate in an emergent cabal of resource withholding. In other words,
It's generally considered an alignment failure if individual will is just a resource that the AI manages.
The core argument that there's something critically wrong with capitalism is that the stock market has been an intelligence aggregation system for a long time and has a strong tendency to suck up the air in the system.
Utopia would need to involve a load balancing system that can prevent sucking-up-the-air type resource control imbalancing, so as to prevent
If all human output becomes valueless
for most people, this is just the right to sell their body to the machine.
I think this is a big point of disagreement. For most people, there's some amount of time/energy that's sold to the machine, and it's NOWHERE EVEN CLOSE to selling their actual asset (body and mind). There's a LOT of leisure time, and a LOT of freedom even within work hours, and the choice to do something different tomorrow. It may not be as rewarding, but it's available and the ability to make those decisions has not been sold or taken.
yeah like, above a certain level of economic power that's true, but the overwhelming majority of humans are below that level, and AI is expected to raise that waterline. it's kind of the primary failure mode I expect.
I mean, the 40-hour work week movement did help a lot. But it was an instance of a large organizing push to demand constraints on what the aggregate intelligence (which at the time was the stock market, a trade market of police-enforceable ownership contracts) could demand of people who were not highly empowered. And it involved leveling a lopsided playing field by things that one side considered dirty tricks, such as strikes. I don't think that'll help against AI, to put it lightly.
To be clear, I recognize that your description is accurate for a significant portion of people. But it's not close to the majority, and movement towards making it the majority has historically demanded changing the enforceable rules in a way that would reliably constrain the aggregate agency of the high dimensional control system steering the economy. When we have a sufficiently much more powerful one of those is when we expect failure, and right now it doesn't seem to me that there's any movement on a solution to that. We can talk about "oh we need something better than capitalism" but the problem with the stock market is simply that it's enforceable prediction, thereby sucking up enough air from the room that a majority of people do not get the benefits you're describing. If they did, then you're right, it would be fine!
I mean, also there's this, but somehow I expect that that won't stick around long after robots are enough cheaper than humans
I think we're talking past each other a bit. It's absolutely true that the vast majority historically and, to a lesser extent, in modern times, are pretty constrained in their choices. This constraint is HIGHLY correlated with distance from participation in voluntary trade (of labor or resources).
I think the disconnect is the word "capitalism" - when you talk about stock markets and price discovery, that says to me you're thinking of a small part of the system. I fully agree that there are a lot of really unpleasant equilibra with the scale and optimization pressure of the current legible financial world, and I'd love to undo a lot of it. But the underlying concept of enforced and agreed property rights and individual human decisions is important to me, and seems to be the thing that gets destroyed first when people decry capitalism.
Ok, it sounds, even to me, like "The heads. You're looking at the heads. Sometimes he goes too far. He's the first one to admit it." But really, I STRONGLY expect that I am experiencing peak human freedom RIGHT NOW (well, 20 years ago, but it's been rather flat for me and my cultural peers for a century, even if somewhat declining recently), and capitalism (small-c, individual decisions and striving, backed by financial aggregation with fairly broad participation) has been a huge driver of that. I don't see any alternatives that preserve the individuality of even a significant subset of humanity.
If property rights to the stars are distributed prior to this, why does this competition cause issues? Maybe you basically agree here, but think it's unlikely property will be distributed like this.
Separately, for competitive dynamics with reasonable rule of law and alignment ~solved, why do you think the strategy-stealing assumption won't apply? (There are a bunch of possible objections here; just wondering what yours is. Personally I think strategy stealing is probably fine if the altruistic actors care about the long run and are strategic.)
Listening to this John Oliver segment, I feel like getting broad support behind transparency-based safety standards might be more possible than I previously thought. He emphasizes the "if models are doing some bad behavior, the creators should be able to tell us why" point a bunch, and it's in fact a super reasonable point. It seems to me like we really might be able to get enough broad consensus on that sort of point to get labs to agree to some sort of standard based on it.
If you want to better understand counting arguments for deceptive alignment, my comment here might be a good place to start.
Epistemic status: random philosophical musings.
Assuming a positive singularity, how should humanity divide its resources? I think the obvious (and essentially correct) answer is "in that situation, you have an aligned superintelligence, so just ask it what to do." But I nevertheless want to philosophize a bit about this, for one main reason.
That reason is: an important factor imo in determining the right thing to do in distributing resources post-singularity is what incentives that choice of resource allocation creates for people pre-singularity. For those incentives to work, though, we have to actually be thinking about this now, since that's what allows the choice of resource distribution post-singularity to have its acausal influence on our choices pre-singularity. I will note that this is definitely something that I think about sometimes, and something that I think a lot of other people also implicitly think about sometimes when they consider things like amassing wealth, specifically gaining control over current AIs, and/or the future value of their impact certificates.
So, what are some of the possible options for how to distribute resources post-singularity? Let's go over some of the various possible solutions here and why I don't think any of the obvious things here are what you want:
As above, I don't think that you should want your aligned AI to implement any of these particular solutions. I think some combination of (3) and (4) is probably the best out of these options, though of course I'm sure that if you actually asked an aligned superintelligent AI it would do better than any of these. More broadly, though, I think that it's important to note that (1), (2), and (4) are all failure stories, not success stories, and you shouldn't expect them to happen in any scenario where we get alignment right.
Circling back to the original reason that I wanted to discuss all of this, which is how it should influence our decisions now:
Disagree. I'm in favor of (2) because I think that what you call a "tyranny of the present" makes perfect sense. Why would the people of the present not maximize their utility functions, given that it's the rational thing for them to do by definition of "utility function"? "Because utilitarianism" is a nonsensical answer IMO. I'm not a utilitarian. If you're a utilitarian, you should pay for your utilitarianism out of your own resource share. For you to demand that I pay for your utilitarianism is essentially a defection in the decision-theoretic sense, and would incentivize people like me to defect back.
As to problem (2.b), I don't think it's a serious issue in practice because time until singularity is too short for it to matter much. If it was, we could still agree on a cooperative strategy that avoids a wasteful race between present people.
Even if you don't personally value other people, if you're willing to step behind the veil of ignorance with respect to whether you'll be an early person or a late person, it's clearly advantageous before you know which one you'll be to not allocate all the resources to the early people.
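A toy calculation makes the veil-of-ignorance point concrete. The population sizes here are made up, and log utility stands in for any concave, diminishing-returns utility (with purely linear utility the two policies tie, so the argument does rest on diminishing returns):

```python
import math

# You will be one of N people; only E of them are "early". Made-up sizes.
N, E = 1_000_000, 1_000
eps = 1e-9  # avoid log(0)

def u(share):
    """Concave (diminishing-returns) utility of a resource share."""
    return math.log(share + eps)

# Policy A: early people split all resources; late people get nothing.
eu_early_only = (E / N) * u(1.0 / E) + ((N - E) / N) * u(0.0)

# Policy B: everyone gets an equal share.
eu_equal = u(1.0 / N)

# Before knowing which person you'll be, the equal split is better.
assert eu_equal > eu_early_only
```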
First, I said I'm not a utilitarian, I didn't say that I don't value other people. There's a big difference!
Second, I'm not willing to step behind that veil of ignorance. Why should I? Decision-theoretically, it can make sense to argue "you should help agent X because in some counterfactual, agent X would be deciding whether to help you using similar reasoning". But, there might be important systematic differences between early people and late people (for example, because late people are modified in some ways compared to the human baseline) which break the symmetry. It might be a priori improbable for me to be born as a late person (and still be me in the relevant sense) or for a late person to be born in our generation[1].
Moreover, if there is a valid decision-theoretic argument to assign more weight to future people, then surely a superintelligent AI acting on my behalf would understand this argument and act on it. So, this doesn't compel me to precommit to a symmetric agreement with future people in advance.
There is a stronger case for intentionally creating and giving resources to people who are early in counterfactual worlds. At least, assuming people have meaningful preferences about the state of never-being-born.
If a future decision is to shape the present, we need to predict it.
The decision-theoretic strategy "Figure out where you are, then act accordingly." is merely an approximation to "Use the policy that leads to the multiverse you prefer.". You *can* bring your present loyalties with you behind the veil, it might just start to feel farcically Goodhartish at some point.
There are of course no probabilities of being born into one position or another, there are only various avatars through which your decisions affect the multiverse. The closest thing to probabilities you'll find is how much leverage each avatar offers: The least wrong probabilistic anthropics translates "the effect of your decisions through avatar A is twice as important as through avatar B" into "you are twice as likely to be A as B".
So if we need probabilities of being born early vs. late, we can compare their leverage. We find:
When you start diverting significant resources away from #1, you’ll probably discover that the definition of “aligned” is somewhat in contention.
i feel like (2)/(3) is about "what does (the altruistic part of) my utility function want?" and (4) is "how do i decision-theoretically maximize said utility function?". they're different layers, and ultimately it's (2)/(3) we want to maximize, but maximizing (2)/(3) entails allocating some of the future lightcone to (4).
A couple of thoughts:
If correct incentives were the only desideratum, I don't see how we'd avoid [post-singularity 'hell' (with some probability) for those who're reckless with AGI].
(some very mild spoilers for yudkowsky's planecrash glowfic (very mild as in this mostly does not talk about the story, but you could deduce things about where the story goes by the fact that characters in it are discussing this))
[edit: links in spoiler tags are bugged. in the spoiler, "speculates about" should link to here and "have the stance that" to here]
"The Negative stance is that everyone just needs to stop calculating how to pessimize anybody else's utility function, ever, period. That's a simple guideline for how realness can end up mostly concentrated inside of events that agents want, instead of mostly events that agents hate."
"If at any point you're calculating how to pessimize a utility function, you're doing it wrong. If at any point you're thinking about how much somebody might get hurt by something, for a purpose other than avoiding doing that, you're doing it wrong."
i think this is a pretty solid principle. i'm very much not a fan of anyone's utility function getting pessimized.
so pessimizing a utility function is a bad idea. but we can still produce correct incentive gradients in other ways! for example, we could say that every moral patient starts with 1 unit of utility function handshake, but if you destroy the world you lose some of your share. maybe if you take actions that cause ⅔ of timelines to die, you only get ⅓ of a unit of utility function handshake, and the more damage you do the less handshake you get.
it never gets into the negative, that way we never go out of our way to pessimize someone's utility function; but it does get increasingly close to 0.
(this isn't necessarily a scheme i'm committed to, it's just an idea i've had for a scheme that provides the correct incentives for not destroying the world, without having to create hells / pessimize utility functions)
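a minimal sketch of that scheme (the linear scaling is just one made-up functional form, not something anyone has committed to):

```python
def handshake_share(base_share, survival_fraction):
    """share of the utility-function handshake a moral patient keeps:
    scales with the fraction of timelines that survive their actions,
    clamped at zero so nobody's utility function gets pessimized."""
    return max(0.0, base_share * survival_fraction)

# causing 2/3 of timelines to die leaves 1/3 of a unit of handshake
assert handshake_share(1.0, 1/3) == 1/3
# even destroying everything only drives the share to zero, never below
assert handshake_share(1.0, 0.0) == 0.0
```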
Hmmm, I don't think that kind of thing is going to give correct world-saving incentives for the selfish part of people (unless failing to save the world counts as destroying it - in which case almost everyone is going to get approximately no influence).
More fundamentally, I don't think it works out in this kind of case due to logical uncertainty.
If I'm uncertain about a particular plan, and my estimate is {80% everyone dies; 20% I save the world}, that's not {in 80% of timelines everyone dies; in 20% of timelines I save the world}.
It's closer to [there's an 80% chance that {in ~99% of timelines everyone dies}; there's a 20% chance that {in ~99% of timelines I save the world}].
So, conditional on my saving the world in some timeline by taking some action, I saved the world in most timelines where I took that action and would get a load of influence. This won't disincentivize risky gambles for selfish/power-hungry people. (at least of the form [let's train this model and see what happens] - most of the danger there being a logical uncertainty thing)
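To make the point with made-up numbers (assuming a scheme where influence scales with the fraction of surviving timelines, as proposed upthread):

```python
def influence_share(survival_fraction):
    """Influence under a scheme that scales with the fraction of
    timelines surviving your action, clamped at zero."""
    return max(0.0, survival_fraction)

# A risky plan under *logical* uncertainty: 80% chance it kills ~99% of
# timelines, 20% chance it saves ~99% of them (illustrative numbers).

# Misreading the estimate as a per-timeline frequency would dock the
# gambler heavily:
per_timeline_reading = influence_share(0.2)

# But conditional on the plan having worked at all, it worked in ~99%
# of timelines, so a successful gambler keeps nearly everything:
share_given_success = influence_share(0.99)

# So the scheme barely disincentivizes the gamble for selfish actors.
assert share_given_success > per_timeline_reading
```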
I think influence would need to be based on expected social value given the 'correct' level of logical uncertainty - probably something like [what (expected value | your action) is justified by your beliefs, and valid arguments you'd make for them based on information you have].
Or at least some subjective perspective seems to be necessary - and something that doesn't give more points for overconfident people.
Here's a simple argument that I find quite persuasive for why you should have linear returns to whatever your final source of utility is (e.g. human experience of a fulfilling life, which I'll just call "happy humans"). Note that this is not an argument that you should have linear returns to resources (e.g. money). The argument goes like this:
You're assuming that your utility function should have the general form of valuing each "source of utility" independently and then aggregating those values (such that when aggregating you no longer need the details of each "source" but just their values). But in The Moral Status of Independent Identical Copies I found this questionable (i.e., conflicting with other intuitions).
This is the fungibility objection I address above:
Note that this assumes "happy humans" are fungible, which I don't actually believe—I care about the overall diversity of human experience throughout the multiverse. However, I don't think that changes the bottom line conclusion, since, if anything, centralizing the happy humans rather than spreading them out seems like it would make it easier to ensure that their experiences are as diverse as possible.
Ah, I think I didn't understand that parenthetical remark and skipped over it. Questions:
I don’t see any reason why spatial distance should behave differently than being in separate Everett branches here.
I tried to explain my intuitions/uncertainty about this in The Moral Status of Independent Identical Copies (it was linked earlier in this thread).
I read it, and I think I broadly agree with it, but I don't know why you think it's a reason to treat physical distance differently to Everett branch distance, holding diversity constant. The only reason that you would want to treat them differently, I think, is if the Everett branch happy humans are very similar, whereas the physically separated happy humans are highly diverse. But, in that case, that's an argument for superlinear local returns to happy humans, since it favors concentrating them so that it's easier to make them as diverse as possible.
but I don’t know why you think it’s a reason to treat physical distance differently to Everett branch distance
I have a stronger intuition for "identical copy immortality" when the copies are separated spatially instead of across Everett branches (the latter also called quantum immortality). For example if you told me there are 2 identical copies of Earth spread across the galaxy and 1 of them will instantly disintegrate, I would be much less sad than if you told me that you'll flip a quantum coin and disintegrate Earth if it comes up heads.
I'm not sure if this is actually a correct intuition, but I'm also not sure that it's not, so I'm not willing to make assumptions that contradict it.
Furthermore, the total number of happy humans is mostly insensitive to anything you can do, or anything happening locally within this universe, since this universe is only a tiny fraction of the overall multiverse.
Not sure about this. Even if I think I am only acting locally, my actions and decisions could have an effect on the larger multiverse. When I do something to increase happy humans in my own local universe, I am potentially deciding / acting for everyone in my multiverse neighborhood who is similar enough to me to make similar decisions for similar reasons.
I agree that this is the main way that this argument could fail. Still, I think the multiverse is too large and the correlation not strong enough across very different versions of the Earth for this objection to substantially change the bottom line.
(Though you could get out of this by claiming that what you really care about is happy humans per universe, that's a pretty strange thing to care about—it's like caring about happy humans per acre.)
My sense is that many solutions to infinite ethics look a bit like this. For example, if you use UDASSA, then a single human who is alone in a big universe will have a shorter description length than a single human who is surrounded by many other humans in a big universe. Because for the former, you can use pointers that specify the universe and then describe sufficient criteria to recognise a human, but for the latter, you need to nail down exact physical location or some other exact criteria that distinguishes a specific human from every other human.
I agree that UDASSA might introduce a small effect like this, but my guess is that the overall effect isn't enough to substantially change the bottom line. Fundamentally, being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty.
being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty
Maybe? I don't really know how to reason about this.
If that's true, that still only means that you should be linear for gambles that give different results in different quantum branches. C.f. logical vs. physical risk aversion.
Some objection like that might work more generally, since some logical facts will mean that there are far fewer humans in the universe-at-large, meaning that you're at a different point in the risk-returns curve. So when comparing different logical ways the universe could be, you should not always care most about the worlds where you can affect more sentient beings. If you have diminishing marginal returns, you need to be thinking about some more complicated function, something like whether you have a comparative advantage at affecting sentient beings in worlds where there are fewer sentient beings overall (as measured by some measure that can handle infinities). This matters for questions like whether you should bet on the universe being large.
If you want to produce warning shots for deceptive alignment, you're faced with a basic sequencing question. If the model is capable of reasoning about its training process before it's capable of checking a predicate like RSA-2048, then you have a chance to catch it—but if it becomes capable of checking a predicate like RSA-2048 first, then any deceptive models you build won't be detectable.