This is a special post for quick takes by evhub. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
[-]evhubΩ25640

This is a list of random, assorted AI safety ideas that I think somebody should try to write up and/or work on at some point. I have a lot more than this in my backlog, but these are some that I specifically selected to be relatively small, single-post-sized ideas that an independent person could plausibly work on without much oversight. That being said, I think it would be quite hard to do a good job on any of these without at least chatting with me first—though feel free to message me if you’d be interested.

  • What would be necessary to build a good auditing game benchmark?
  • How would AI safety AI work? What is necessary for it to go well?
  • How do we avoid end-to-end training while staying competitive with it? Can we use transparency on end-to-end models to identify useful modules to train non-end-to-end?
  • What would it look like to do interpretability on end-to-end trained probabilistic models instead of end-to-end trained neural networks?
  • Suppose you had a language model that you knew was in fact a good generative model of the world and that this property continued to hold regardless of what you conditioned it on. Furthermore, suppose you had some prompt that described some agent for the language model to simulate (Alice) that in practice resulted in aligned-looking outputs. Is there a way we could use different conditionals to get at whether or not Alice was deceptive (e.g. prompt the model with “DeepMind develops perfect transparency tools and provides an opportunity for deceptive models to come clean and receive a prize before they’re discovered.”)?
  • Argue for the importance of ensuring that the state-of-the-art in “using AI for alignment” never lags behind as a capability compared to where it could be given just additional engineering effort.
  • What does inner alignment look like in the context of models with access to memory (e.g. a retrieval database)?
  • Argue for doing scaling laws for phase changes. We have found some phase changes in models—e.g. the induction bump—but we haven’t yet really studied the extent to which various properties—e.g. Honesty—generalize across these sorts of phase changes.
  • Humans rewarding themselves for finishing their homework by eating candy suggests a plausible mechanism for gradient hacking.
  • If we see precursors to deception (e.g. non-myopia, self-awareness, etc.) but suspiciously don’t see deception itself, that’s evidence for deception.
  • The more a model’s objectives vary depending on exact setup, randomness, etc., the less likely deceptive models are to want to cooperate with future deceptive models, thus making defection earlier more likely.
  • China is not a strategically relevant actor for AI, at least in short timeline scenarios—they are too far behind, their GDP isn’t growing fast enough, and their leaders aren’t very good at changing those things.
  • If you actually got a language model that was a true generative model of the world that you could get arbitrary conditionals from, that would be equivalent to having access to a quantum suicide machine.
  • Introduce the concept of how factored an alignment solution is in terms of how easy it is to turn up or down alignment relative to capabilities—or just swap out an aligned goal for a misaligned one—as an important axis to pay attention to. Currently, things are very factored—alignment and capabilities are both heavily dependent on dataset, reward, etc.—but that could change in the future.
  • Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.
  • How has transparency changed over time—Chris claims it's easier to interpret later models; is that true?
  • Which AI safety proposals are most likely to fail safely? Proposals which have the property that the most likely way for them to fail is just not to work are better than those that are most likely to fail catastrophically. In the former case, we’ve sacrificed some of our alignment tax, but still have another shot.
  • What are some plausible scenarios for how a model might be suboptimality deceptively aligned?
  • What can we learn about AI safety from the domestication of animals? Does the success of domesticating dogs from wolves provide an example of how to train for corrigibility? Or did we just make them dumber via the introduction of something like Williams syndrome?

I'll continue to include more directions like this in the comments here.

I'd just make this a top level post.

I want this more as a reference to point specific people (e.g. MATS scholars) to than as something I think lots of people should see—I don't expect most people to get much out of this without talking to me. If you think other people would benefit from looking at it, though, feel free to call more attention to it.

Mmm, maybe you're right (I was gonna say "making a top-level post which includes 'chat with me about this if you actually wanna work on one of these'", but it then occurs to me you might already be maxed out on chat-with-people time, and it may be more useful to send this to people who have already passed some kind of 'worth your time' filter)

[-]evhubΩ7110
  • Other search-like algorithms like inference on a Bayes net that also do a good job in diverse environments also have the problem that their capabilities generalize faster than their objectives—the fundamental reason being that the regularity that they are compressing is a regularity only in capabilities.
  • Neural networks, by virtue of running in constant time, bring algorithmic equality all the way from uncomputable to EXPTIME—not a large difference in practice.
  • One way to think about the core problem with relaxed adversarial training is that when we generate distributions over intermediate latent states (e.g. activations) that trigger the model to act catastrophically, we don't know how to guarantee that those distributions over latents actually correspond to some distribution over inputs that would cause them.
[-]evhubΩ8100
  • Ensembling as an AI safety solution is a bad way to spend down our alignment tax—training another model brings you to 2x compute budget, but even in the best case scenario where the other model is a totally independent draw (which in fact it won't be), you get at most one extra bit of optimization towards alignment.
  • Chain of thought prompting can be thought of as creating an average speed bias that might disincentivize deception.
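
The "at most one extra bit" figure in the ensembling point above matches a common way of counting selection pressure: choosing one of n equally likely, independent candidates can supply at most log2(n) bits. A minimal sketch, assuming that counting convention; the function names and the one-training-run-per-model cost model are illustrative assumptions, not part of the original comment:

```python
import math


def selection_bits(num_candidates: int) -> float:
    """Upper bound on bits of selection from picking one of N independent draws."""
    if num_candidates < 1:
        raise ValueError("need at least one candidate")
    return math.log2(num_candidates)


def compute_multiplier(num_candidates: int) -> int:
    """Assumed cost model: every candidate costs one full training run."""
    return num_candidates


for n in (1, 2, 4, 8):
    print(f"{n} models: {compute_multiplier(n)}x compute, "
          f"at most {selection_bits(n):.0f} bit(s) of selection")
```

Under these assumptions compute grows linearly with the number of models while the selection bound grows only logarithmically, which is the sense in which ensembling looks like an expensive use of alignment tax.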
[-]evhubΩ790
  • A deceptive model doesn't have to have some sort of very explicit check for whether it's in training or deployment any more than a factory-cleaning robot has to have a very explicit check for whether it's in the jungle instead of a factory. If it someday found itself in a very different situation than currently (training), it would reconsider its actions, but it doesn't really think about it very often because during training it just looks too unlikely.
  • Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.

Humans don't wirehead because reward reinforces the thoughts which the brain's credit assignment algorithm deems responsible for producing that reward. Reward is not, in practice, that-which-is-maximized -- reward is the antecedent-thought-reinforcer, it reinforces that which produced it. And when a person does a rewarding activity, like licking lollipops, they are thinking thoughts about reality (like "there's a lollipop in front of me" and "I'm picking it up"), and so these are the thoughts which get reinforced. This is why many human values are about latent reality and not about the human's beliefs about reality or about the activation of the reward system.

[-]evhubΩ230

It seems that you're postulating that the human brain's credit assignment algorithm is so bad that it can't tell what high-level goals generated a particular action and so would give credit just to thoughts directly related to the current action. That seems plausible for humans, but I would guess against it for advanced AI systems.

No, I don't intend to postulate that. Can you tell me a mechanistic story of how better credit assignment would go, in your worldview? 

Disclaimer: At the time of writing, this has not been endorsed by Evan.

I can give this a go.

Unpacking Evan's Comment:
My read of Evan's comment (the parent to yours) is that there are a bunch of learned high-level-goals ("strategies") with varying levels of influence on the tactical choices made, and that a well-functioning end-to-end credit-assignment mechanism would propagate through action selection ("thoughts directly related to the current action" or "tactics") all the way to strategy creation/selection/weighting. In such a system, strategies which decide tactics which emit actions which receive reward are selected for at the expense of strategies less good at that. Conceivably, strategies aiming directly for reward would produce tactical choices more highly rewarded than strategies not aiming quite so directly.

One way for this not to be how humans work would be if reward did not propagate to the strategies, and they were selected/developed by some other mechanism while reward only honed/selected tactical cognition. (You could imagine that "strategic cognition" is that which chooses bundles of context-dependent tactical policies, and "tactical cognition" is that which implements a given tactic's choice of actions in response to some context.) This feels to me close to what Evan was suggesting you were saying is the case with humans.

One Vaguely Mechanistic Illustration of a Similar Concept:
A similar way for this to be broken in humans, departing just a bit from Evan's comment, is if the credit assignment algorithm could identify tactical choices with strategies, but not equally reliably across all strategies. As a totally made-up, concrete, and stylized illustration: consider one evolutionarily endowed credit-assignment target, "Feel physically great," and two strategies: wirehead with drugs (WIRE) or be pro-social (SOCIAL). Whenever WIRE has control, it emits some tactic like "alone in my room, take the most fun available drug," which takes actions that result in $X$ physical pleasure over a day. Whenever SOCIAL has control, it emits some tactic like "alone in my room, abstain from dissociative drugs and instead text my favorite friend," taking actions which result in $Y$ physical pleasure over a day.

Suppose also that asocial cognitions like "eat this" have poorly wired feedback channels, so the signal is often lost and triggers credit assignment only some small fraction of the time. Social cognition is much better wired up and triggers credit assignment every time. Whenever credit assignment is triggered, once a day, reward emitted is 1:1 with the amount of physical pleasure experienced that day.

Since WIRE only gets credit a fraction of the time that it's due, the average reward (over 30 days, say) credited to WIRE is $\ll X$. If and only if $X \gg Y$, like if the drug is heroin or your friends are insufficiently fulfilling, WIRE will be reinforced more relative to SOCIAL. Otherwise, even if the drug is somewhat more physically pleasurable than the warm-fuzzies of talking with friends, SOCIAL will be reinforced more relative to WIRE.
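
The arithmetic in this stylized story can be sketched in a few lines. This is only an illustration under assumed numbers: `pleasure_wire`, `pleasure_social`, and `wire_credit_fraction` are hypothetical stand-ins for the daily pleasure produced by each strategy and for the fraction of days on which the poorly wired asocial channel actually triggers credit assignment.

```python
def average_credited_reward(daily_pleasure: float, credit_fraction: float) -> float:
    """Expected reward credited per day when credit assignment fires on only
    a fraction of days and reward is 1:1 with that day's pleasure."""
    if not 0.0 <= credit_fraction <= 1.0:
        raise ValueError("credit_fraction must be in [0, 1]")
    return daily_pleasure * credit_fraction


def more_reinforced(pleasure_wire: float, pleasure_social: float,
                    wire_credit_fraction: float) -> str:
    """Return which strategy receives more credited reward on average.

    SOCIAL is assumed to be credited every time (fraction 1.0); ties go to SOCIAL.
    """
    wire = average_credited_reward(pleasure_wire, wire_credit_fraction)
    social = average_credited_reward(pleasure_social, 1.0)
    return "WIRE" if wire > social else "SOCIAL"


# Drug somewhat more pleasurable, but its signal is usually dropped.
print(more_reinforced(pleasure_wire=12, pleasure_social=10, wire_credit_fraction=0.2))
# Drug vastly more pleasurable: the dropped signal no longer saves SOCIAL.
print(more_reinforced(pleasure_wire=100, pleasure_social=10, wire_credit_fraction=0.2))
# No signal dropping at all: any edge in pleasure favors WIRE.
print(more_reinforced(pleasure_wire=12, pleasure_social=10, wire_credit_fraction=1.0))
```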

Conclusion:
I think Evan is saying that he expects advanced reward-based AI systems to have no such impediments by default, even if humans do have something like this in their construction. Such a stylized agent without any signal-dropping would reinforce WIRE over SOCIAL every time that taking the drug was even a tiny bit more physically pleasurable than talking with friends.

Maybe there is an argument that such reward-aimed goals/strategies would not produce the most rewarding actions in many contexts, or for some other reason would not be selected for / found in advanced agents (as Evan suggests in encouraging someone to argue that such goals/strategies require concepts which are unlikely to develop,) but the above might be in the rough vicinity of what Evan was thinking.

REMINDER: At the time of writing, this has not been endorsed by Evan.

Thanks for the story! I may comment more on it later.

That seems to imply that humans would continue to wirehead conditional on having started wireheading.

Yes, I think they indeed would.

About the following point:

"Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment."

Well, that seems to be what happened in the case of rats and probably many other animals. Stick an electrode into the reward center of the brain of a rat. Then give it a button to trigger the electrode. Now some rats will trigger their reward centers and ignore food.

Humans value their experience. A pleasant state of consciousness is actually intrinsically valuable to humans. Not that this is the only thing that humans value, but it is certainly a big part.

It is unclear how this would generalize to artificial systems. We don't know if, or in what sense, they would have experience, and why that would even matter in the first place. But I don't think we can confidently say that something computationally equivalent to "valuing experience" won't be going on in artificial systems we are going to build.

So somebody picking up this idea would probably need to address this point and argue why artificial systems are different in this regard. The observation that most humans are not heroin addicts seems relevant. Though the human story might be different if there were no bad side effects and you had easy access to it. This would probably be more the situation artificial systems would find themselves in. Or in a more extreme case, imagine soma but you live longer.

In short: Is valuing experience perhaps computationally equivalent to valuing transistors storing the reward? Then there would be real-world examples of that happening.

I have a related draft on this.

[-]evhubΩ20380

Here's a two-sentence argument for misalignment that I think is both highly compelling to laypeople and technically accurate at capturing the key issue:

When we train AI systems to be nice, we're giving a bunch of random programs a niceness exam and selecting the programs that score well. If you gave a bunch of humans a niceness exam where the humans would be rewarded for scoring well, do you think you would get actually nice humans?
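
A toy sketch of the selection picture in this argument, under a deliberately crude assumption: each candidate has a hidden disposition (`is_nice`) and a separate ability to behave nicely when it pays to (`can_fake`), and the exam only observes behavior. The class, field names, and hand-picked population are hypothetical illustrations, not data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    is_nice: bool   # hidden disposition, invisible to the exam
    can_fake: bool  # able to behave nicely while being tested


def passes_exam(candidate: Candidate) -> bool:
    # The exam sees only test-time behavior, not the disposition behind it.
    return candidate.is_nice or candidate.can_fake


population = [
    Candidate("a", is_nice=True, can_fake=False),
    Candidate("b", is_nice=False, can_fake=True),
    Candidate("c", is_nice=False, can_fake=False),
    Candidate("d", is_nice=True, can_fake=True),
]

selected = [c for c in population if passes_exam(c)]
fakers = [c.name for c in selected if not c.is_nice]

print("selected:", [c.name for c in selected])
print("selected but not nice:", fakers)
```

Selecting on exam score removes candidates that neither are nice nor can fake it, but it cannot separate the remaining two groups.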

To me it seems a solid attempt at conveying [misalignment is possible, even with a good test], but not necessarily [misalignment is likely, even with a good test]. (not that I have a great alternative suggestion)

Important disanalogies seem:
1) Most humans aren't good at convincingly faking niceness (I think!). The listener may assume a test good enough to successfully exploit this most of the time.
2) The listener will assume that [score highly on niceness] isn't the human's only reward. (both things like [desire to feel honest] and [worry of the consequences of being caught cheating])
3) A fairly large proportion of humans are nice (I think!).

The second could be addressed somewhat by raising the stakes.
The first seems hard to remedy within this analogy. I'd be a little concerned that people initially buy it, then think for themselves and conclude "But if we design a really clever niceness test, then it'd almost always work - all we need is clever people to work for a while on some good tests".
Combined with (3), this might seem like a decent solution.

Overall, I think what's missing is that we'd expect [our clever test looks to us as if it works] well before [our clever test actually works]. My guess is that the layperson isn't going to have this intuition in the human-niceness-test case.

I expect this is very susceptible to opinions about human nature. Someone who thinks humans ARE generally nice is likely to answer "yes, of course" to your question. To someone who thinks humans are generally extremely context-sensitive, which appears to be nice in the co-evolved social settings in which we generally interact, the answer is "who knows?". But the latter group doesn't need to be convinced; we're already worried.

Surely nobody thinks that all humans are nice all the time and nobody would ever fake a niceness exam. I mean, I think humans are generally pretty good, but obviously that always has to come with a bunch of caveats because you don't have to look very far into human history to see quite a lot of human-committed atrocities.

I think the answer is an obvious yes, all other things held equal. Of course, what happens in reality is more complex than this, but I'd still say yes in most cases, primarily because I think that aligned behavior is very simple, so simple that it either only barely loses out to the deceptive model or outright has the advantage, depending on the programming language and low level details, and thus we only need to transfer 300-1000 bits maximum, which is likely very easy.

Much more generally, my fundamental claim is that the complexity of pointing to human values is very similar to the complexity of pointing to the set of all long-term objectives; it can be easier or harder, but I don't buy the assumption that pointing to human values is way harder than pointing to the set of long-term goals.

[-]evhub2114

I think that sticking to capitalism as an economic system post-singularity would be pretty clearly catastrophic and something to strongly avoid, despite capitalism working pretty well today. I've talked about this a bit previously here, but some more notes on why:

  • Currently, our society requires the labor of sentient beings to produce goods and services. Capitalism incentivizes that labor by providing a claim on society's overall production in exchange for it. If the labor of sentient beings becomes largely superfluous as an economic input, however, then having a system that effectively incentivizes that labor also becomes largely superfluous.
  • Currently, we rely on the mechanism of price discovery to aggregate and disseminate information about the optimal allocation of society's resources. But it's far from an optimal mechanism for allocating resources, and a superintelligence with full visibility and control could do a much better job of resource allocation without falling prey to common pitfalls of the price mechanism such as externalities.
  • Capitalism incentivizes the smart allocation of capital in the same way it incentivizes labor. If society can make smart capital allocation decisions without relying on properly incentivized investors, however, then as with labor there's no reason to keep such an incentive mechanism.
  • While very large, the total optimization pressure humanity puts into economic competition today would likely pale in comparison to that of a post-singularity future. In the context of such a large increase in optimization pressure, we should generally expect extremal Goodhart failures.
  • More specifically, competitive dynamics incentivize the reinvestment of all economic proceeds back into resource acquisition lest you be outcompeted by another entity doing so. Such a dynamic results in pushing out actors that reinvest proceeds into the flourishing of sentient beings in exchange for those that disregard any such investment in favor of more resource acquisition.
  • Furthermore, the proceeds of post-singularity economic expansion flowing to the owners of existing capital is very far from socially optimal. It strongly disfavors future generations, simulated humans, and overall introduces a huge amount of variance into whether we end up with a positive future, putting a substantial amount of control into a set of people whose consumption decisions need not align with the socially optimal allocation.
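
The competitive-dynamics point in the list above can be made concrete with a compounding sketch. The numbers are assumptions chosen for illustration: two actors start with equal resources and earn the same per-period return on whatever they hold, but one reinvests all proceeds while the other diverts a share to other ends (for example, the flourishing of sentient beings). `growth_rate`, `reinvest_share`, and the function names are hypothetical.

```python
def resources_after(periods: int, growth_rate: float, reinvest_share: float,
                    initial: float = 1.0) -> float:
    """Resources held after compounding only the reinvested share of proceeds."""
    resources = initial
    for _ in range(periods):
        proceeds = resources * growth_rate
        resources += proceeds * reinvest_share
    return resources


def diverting_actor_share(periods: int, growth_rate: float,
                          reinvest_share: float) -> float:
    """Fraction of combined resources held by the actor that diverts proceeds,
    competing against one actor that reinvests everything."""
    full_reinvestor = resources_after(periods, growth_rate, 1.0)
    diverter = resources_after(periods, growth_rate, reinvest_share)
    return diverter / (full_reinvestor + diverter)


for t in (0, 10, 50, 200):
    share = diverting_actor_share(t, growth_rate=0.10, reinvest_share=0.5)
    print(f"after {t} periods, diverting actor holds {share:.4%} of resources")
```

Under these assumptions the diverting actor's share shrinks steadily toward zero. The sketch says nothing about whether a real post-singularity economy would satisfy the equal-returns assumption.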

Capitalism is a complex system with many moving parts, some of which are sometimes assumed to consist of the entirety of what defines it. What kinds of components do you see as being highly unlikely to be included in a successful utopia, and what components could be internal to a well functioning system as long as (potentially-left-unspecified) conditions are met? I could name some kinds of components (eg some kinds of contracts or enforcement mechanisms) that I expect to not be used in a utopia, though I suspect at this point you've seen my comments where I get into this, so I'm more interested in what you say without that prompting.

Who's this "we" you're talking about?  It doesn't seem to be any actual humans I recognize.  As far as I can tell, the basics of capitalism (call it "simple capitalism") are just what happens when individuals make decisions about resource use.  We call it "ownership", but really any form of resolution of the underlying conflict of preferences would likely work out similarly.  That conflict is that humans have unbounded desires, and resources are bounded.  

The drive to make goods and services for each other, in pursuit of selfish wants, does incentivize labor, but it's not because "society requires" it, except in a pretty blurry aggregation of individual "requires".  Price discovery is only half of what market transactions do.  The other half is usage limits and resource distribution.  These are sides of a coin, and can't be separated - without limited amounts, there is no price, without prices there is no agent-driven exchange of different kinds of resource.

I'm with you that modern capitalism is pretty unpleasant due to optimization pressure, and due to the easy aggregation of far more people and resources than was historically possible or than human culture evolved around. I don't see how the underlying system has any alternative that doesn't do away with individual desire and consumption. Especially the relative/comparative consumption that seems to drive a LOT of perceived-luxury requirements.

I think some version of distributing intergalactic property rights uniformly (e.g. among humans existing in 2023) combined with normal capitalism isn't clearly that catastrophic. (The distribution is what you call the egalitarian/democratic solution in the link.)

Maybe you lose about a factor of 5 or 10 over the literally optimal approach from my perspective (but maybe this analysis is tricky due to two envelope problems).

(You probably also need some short term protections to avoid shakedowns etc.)

Are you pessimistic that people will bother reflecting or thinking carefully prior to determining resource utilization or selling their property? I guess I feel like 10% of people being somewhat thoughtful matches the rough current distribution of extremely rich people.

If the situation was something like "current people, weighted by wealth, deliberate for a while on what to do with our resources" then I agree that's probably like 5 - 10 times worse than the best approach (which is still a huge haircut) but not clearly catastrophic. But it's not clear to me that's what the default outcome of competitive dynamics would look like—sufficiently competitive dynamics could force out altruistic actors if they get outcompeted by non-altruistic actors.

[-]Dagon4-4

I think one crux between you and me, at least, is that you see this as a considered decision about how to divide resources, and I see it as an equilibrium consensus/acceptance of what property rights to enforce in maintenance, creation, and especially transfer of control/usage of resources. You think of static division, I think of equilibria and motion. Both are valid, but experience and resource use are ongoing and must be accounted for.

I'm happy that the modern world generally approves of self-ownership: a given individual gets to choose what to do (within limits, but it's nowhere near the case that my body and mind are part of the resources allocated by whatever mechanism is being considered).  It's generally considered an alignment failure if individual will is just a resource that the AI manages.  Physical resources (and behavioral resources, which are a sale of the results of some human efforts, a distinct resource from the mind performing the action) are generally owned by someone, and they trade some results to get the results of other people's resources (including their labor and thought-output).  

There could be a side-mechanism for some amount of resources just for existing, but it's unlikely that it can be the primary transfer/allocation mechanism, as long as individuals have independent and conflicting desires. Current valuable self-owned products (office work, software design, etc.) probably reduce in value a lot. If all human output becomes valueless (in the "tradable for other desired things or activities" sense of valuable), I don't think current humans will continue to exist.

Wirehead utopia (including real-world "all desires fulfilled without effort or trade") doesn't sound appealing or workable for what I know of my own and general human psychology.  

self-ownership: a given individual gets to choose what to do (within limits, but it's nowhere near the case that my body and mind are part of the resources allocated by whatever mechanism is being considered)

for most people, this is just the right to sell their body to the machine. better than being forced at gunpoint, but being forced to by an empty fridge is not that much better, especially with monopoly accumulation as the default outcome. I agree that being able to mark one's selfhood boundaries with property contracts is generally good, but the ability to massively expand one's property contracts to exclude others from resource access is effectively a sort of scalping - sucking up resources so as to participate in an emergent cabal of resource withholding. In other words,

It's generally considered an alignment failure if individual will is just a resource that the AI manages.

The core argument that there's something critically wrong with capitalism is that the stock market has been an intelligence aggregation system for a long time and has a strong tendency to suck up the air in the system.

Utopia would need to involve a load balancing system that can prevent sucking-up-the-air type resource control imbalancing, so as to prevent

If all human output becomes valueless

for most people, this is just the right to sell their body to the machine.

I think this is a big point of disagreement.  For most people, there's some amount of time/energy that's sold to the machine, and it's NOWHERE EVEN CLOSE to selling their actual asset (body and mind).  There's a LOT of leisure time, and a LOT of freedom even within work hours, and the choice to do something different tomorrow.  It may not be as rewarding, but it's available and the ability to make those decisions has not been sold or taken.

yeah like, above a certain level of economic power that's true, but the overwhelming majority of humans are below that level, and AI is expected to raise that waterline. it's kind of the primary failure mode I expect.

I mean, the 40 hour work week movement did help a lot. But it was an instance of a large push of organizing to demand constraint on what the aggregate intelligence (which at the time was the stock market - which is a trade market of police-enforceable ownership contracts), could demand of people who were not highly empowered. And it involved leveling a lopsided playing field by things that one side considered dirty tricks, such as strikes. I don't think that'll help against AI, to put it lightly.

To be clear, I recognize that your description is accurate for a significant portion of people. But it's not close to the majority, and movement towards making it the majority has historically demanded changing the enforceable rules in a way that would reliably constrain the aggregate agency of the high dimensional control system steering the economy. When we have a sufficiently much more powerful one of those is when we expect failure, and right now it doesn't seem to me that there's any movement on a solution to that. We can talk about "oh we need something better than capitalism" but the problem with the stock market is simply that it's enforceable prediction, thereby sucking up enough air from the room that a majority of people do not get the benefits you're describing. If they did, then you're right, it would be fine!

I mean, also there's this, but somehow I expect that that won't stick around long after robots are enough cheaper than humans

I think we're talking past each other a bit.  It's absolutely true that the vast majority historically and, to a lesser extent, in modern times, are pretty constrained in their choices.  This constraint is HIGHLY correlated with distance from participation in voluntary trade (of labor or resources).

I think the disconnect is the word "capitalism" - when you talk about stock markets and price discovery, that says to me you're thinking of a small part of the system. I fully agree that there are a lot of really unpleasant equilibria with the scale and optimization pressure of the current legible financial world, and I'd love to undo a lot of it. But the underlying concept of enforced and agreed property rights and individual human decisions is important to me, and seems to be the thing that gets destroyed first when people decry capitalism.

Ok, it sounds, even to me, like "The heads. You're looking at the heads. Sometimes he goes too far. He's the first one to admit it."  But really, I STRONGLY expect that I am experiencing peak human freedom RIGHT NOW (well, 20 years ago, but it's been rather flat for me and my cultural peers for a century, even if somewhat declining recently), and capitalism (small-c, individual decisions and striving, backed by financial aggregation with fairly broad participation) has been a huge driver of that.  I don't see any alternatives that preserve the individuality of even a significant subset of humanity.

If property rights to the stars are distributed prior to this, why does this competition cause issues? Maybe you basically agree here, but think it's unlikely property will be distributed like this.

Separately, for competitive dynamics with reasonable rule of law and alignment ~solved, why do you think the strategy stealing assumption won't apply? (There are a bunch of possible objections here, just wondering what yours is. Personally I think strategy stealing is probably fine if the altruistic actors care about the long run and are strategic.)

[-]evhubΩ11200

Listening to this John Oliver segment, I feel like getting broad support behind transparency-based safety standards might be more possible than I previously thought. He emphasizes the "if models are doing some bad behavior, the creators should be able to tell us why" point a bunch and it's in fact a super reasonable point. It seems to me like we really might be able to get enough broad consensus on that sort of a point to get labs to agree to some sort of standard based on it.

[-]RubyΩ9150

The hard part to me now seems to be in crafting some kind of useful standard rather than one that in hindsight makes us go "well, that sure gave everyone a false sense of security".

[-]RaemonΩ580

Yeah I also felt some vague optimism about that.

[-]evhubΩ91510

If you want to better understand counting arguments for deceptive alignment, my comment here might be a good place to start.

[-]evhub10-3

Epistemic status: random philosophical musings.

Assuming a positive singularity, how should humanity divide its resources? I think the obvious (and essentially correct) answer is "in that situation, you have an aligned superintelligence, so just ask it what to do." But I nevertheless want to philosophize a bit about this, for one main reason.

That reason is: an important factor imo in determining the right thing to do in distributing resources post-singularity is what incentives that choice of resource allocation creates for people pre-singularity. For those incentives to work, though, we have to actually be thinking about this now, since that's what allows the choice of resource distribution post-singularity to have its acausal influence on our choices pre-singularity. I will note that this is definitely something that I think about sometimes, and something that I think a lot of other people also implicitly think about sometimes when they consider things like amassing wealth, specifically gaining control over current AIs, and/or the future value of their impact certificates.

So, what are some of the possible options for how to distribute resources post-singularity? Let's go over some of the various possible solutions here and why I don't think any of the obvious things here are what you want:

  1. The evolutionary/capitalist solution: divide future resources in proportion to control of current resources (e.g. AIs). This is essentially what happens by default if you keep in place an essentially capitalist system and have all the profits generated by your AIs flow to the owners of those AIs. Another version of this is a more power/bargaining-oriented version where you divide resources amongst agents in proportion to the power those agents could bring to bear if they chose to fight for those resources.
    1. The most basic problem with this solution is that it's a moral catastrophe if the people that get all the resources don't do good things with them. We should not want to build AIs that lead to this outcome--and I wouldn't really call AIs that created this outcome aligned.
    2. Another more subtle problem with this solution is that it creates terrible incentives for current people if they expect this to be what happens, since it e.g. incentivizes people to maximize their personal control over AIs at the expense of spending more resources trying to align those AIs.
      1. I feel like I see this sort of thinking a lot, and I think that if we made it clearer that this is never what should happen in a positive singularity, people would do this sort of thing less.
  2. The egalitarian/democratic solution: divide resources equally amongst all current humans. This is what naive preference utilitarianism would do.
    1. Though it might be less obvious than with the first solution, I think this solution also leads to a moral catastrophe, since it cements current people as oligarchs over future people, leads to value lock-in, and could create a sort of tyranny of the present.
    2. This solution also creates some weird incentives for trying to spread your ideals as widely as possible and to create as many people as possible that share your ideals.
  3. The unilateralist/sovereign/past-agnostic/CEV solution: concentrate all resources under the control of your aligned AI(s), then distribute those resources in accordance with how they generate the most utility/value/happiness/goodness/etc., without any special prejudice given to existing people.
    1. In some sense, this is the "right" thing to do, and it's pretty close to what I would ideally want. However, it has a couple of issues:
      1. Though, unlike the first solution, it doesn't create any perverse incentives right now, it doesn't create any positive incentives either.
      2. Since this solution doesn't give any special prejudice to current people, it might be difficult to get current people to agree to this solution, if that's necessary.
  4. The retroactive impact certificate solution: divide future resources in proportion to retroactively-assessed social value created by past agents.
    1. This solution obviously creates the best incentives for current agents, so in that sense it does very well.
    2. However, it still does pretty poorly on potentially creating a moral catastrophe, since the people that created the most social value in the past need not continue doing so in the future.

As above, I don't think that you should want your aligned AI to implement any of these particular solutions. I think some combination of (3) and (4) is probably the best out of these options, though of course I'm sure that if you actually asked an aligned superintelligent AI it would do better than any of these. More broadly, though, I think that it's important to note that (1), (2), and (4) are all failure stories, not success stories, and you shouldn't expect them to happen in any scenario where we get alignment right.

Circling back to the original reason that I wanted to discuss all of this, which is how it should influence our decisions now:

  • Obviously, the part of your values that isn't selfish should continue to want things to go well.
  • However, for the part of your values that cares about your own future resources, if that's something that you care about, how you go about maximizing that is going to depend on what you most expect between (1), (2), and (4).
    • First, in determining this, you should condition on situations where you don't just die or are otherwise totally disempowered, since obviously those are the only cases where this matters. And if that probability is quite high, then presumably a lot of your selfish values should just want to minimize that probability.
    • However, going ahead anyway and conditioning on everyone not being dead/disempowered, what should you expect? I think that (1) and (2) are possible in worlds where we get some parts of alignment right, but overall are pretty unlikely: it's a very narrow band of not-quite-alignment that gets you there. So probably if I cared about this a lot I'd focus more on (4) than (1) and (2).
      • Which of course gets me to why I'm writing this up, since that seems like a good message for people to pick up. Though I expect it to be quite difficult to effectively communicate this very broadly.

Disagree. I'm in favor of (2) because I think that what you call a "tyranny of the present" makes perfect sense. Why would the people of the present not maximize their utility functions, given that it's the rational thing for them to do by definition of "utility function"? "Because utilitarianism" is a nonsensical answer IMO. I'm not a utilitarian. If you're a utilitarian, you should pay for your utilitarianism out of your own resource share. For you to demand that I pay for your utilitarianism is essentially a defection in the decision-theoretic sense, and would incentivize people like me to defect back.

As to problem (2.b), I don't think it's a serious issue in practice because time until singularity is too short for it to matter much. If it was, we could still agree on a cooperative strategy that avoids a wasteful race between present people.

[-]evhub2-2

Even if you don't personally value other people, if you're willing to step behind the veil of ignorance with respect to whether you'll be an early person or a late person, it's clearly advantageous before you know which one you'll be to not allocate all the resources to the early people.

First, I said I'm not a utilitarian, I didn't say that I don't value other people. There's a big difference!

Second, I'm not willing to step behind that veil of ignorance. Why should I? Decision-theoretically, it can make sense to argue "you should help agent X because in some counterfactual, agent X would be deciding whether to help you using similar reasoning". But, there might be important systematic differences between early people and late people (for example, because late people are modified in some ways compared to the human baseline) which break the symmetry. It might be a priori improbable for me to be born as a late person (and still be me in the relevant sense) or for a late person to be born in our generation[1].

Moreover, if there is a valid decision-theoretic argument to assign more weight to future people, then surely a superintelligent AI acting on my behalf would understand this argument and act on it. So, this doesn't compel me to precommit to a symmetric agreement with future people in advance.

  1. ^

    There is a stronger case for intentionally creating and giving resources to people who are early in counterfactual worlds. At least, assuming people have meaningful preferences about the state of never-being-born.

If a future decision is to shape the present, we need to predict it.

The decision-theoretic strategy "Figure out where you are, then act accordingly." is merely an approximation to "Use the policy that leads to the multiverse you prefer.". You *can* bring your present loyalties with you behind the veil, it might just start to feel farcically Goodhartish at some point.

There are of course no probabilities of being born into one position or another, there are only various avatars through which your decisions affect the multiverse. The closest thing to probabilities you'll find is how much leverage each avatar offers: The least wrong probabilistic anthropics translates "the effect of your decisions through avatar A is twice as important as through avatar B" into "you are twice as likely to be A as B".
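As a minimal sketch of that translation, leverage can be normalized into probabilities. The function name and the leverage numbers below are hypothetical, chosen only to mirror the "twice as important" example.

```python
def leverage_to_probabilities(leverage):
    """Normalize per-avatar leverage into anthropic 'probabilities'."""
    total = sum(leverage.values())
    if total <= 0:
        raise ValueError("total leverage must be positive")
    return {avatar: value / total for avatar, value in leverage.items()}

# Hypothetical leverage: decisions through A matter twice as much as through B.
probabilities = leverage_to_probabilities({"A": 2.0, "B": 1.0})
print(probabilities)
```

Only the ratios between avatars matter here; rescaling every leverage value by the same factor leaves the result unchanged.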

So if we need probabilities of being born early vs. late, we can compare their leverage. We find:

  • Quantum physics shows that the timeline splits a bazillion times a second. So each second, you become a bazillion yous, but the portions of the multiverse you could first-order impact are divided among them. Therefore, you aren't significantly more or less likely to find yourself a second earlier or later.
  • Astronomy shows that there's a mazillion stars up there. So we build a Dyson sphere and huge artificial womb clusters, and one generation later we launch one colony ship at each star. But in that generation, the fate of the universe becomes a lot more certain, so we should expect to find ourselves before that point, not after.
  • Physics shows that several constants are finely tuned to support organized matter. We can infer that elsewhere, they aren't. Since you'd think that there are other, less precarious arrangements of physical law with complex consequences, we can also moderately update towards that very precariousness granting us unusual leverage about something valuable in the acausal marketplace.
  • History shows that we got lucky during the Cold War. We can slightly update towards:
    • Current events are important.
    • Current events are more likely after a Cold War.
    • Nuclear winter would settle the universe's fate.
  • The news shows that ours is the era of inadequate AI alignment theory. We can moderately update towards being in a position to affect that.

When you start diverting significant resources away from #1, you’ll probably discover that the definition of “aligned” is somewhat in contention.

i feel like (2)/(3) is about "what does (the altruistic part of) my utility function want?" and 4 is "how do i decision-theoretically maximize said utility function?". they're different layers, and ultimately it's (2)/(3) we want to maximize, but maximizing (2)/(3) entails allocating some of the future lightcone to (4).

A couple of thoughts:

  • I think that (3) does create strong incentives right now - at least for anyone who assumes [without any special prejudice given to existing people] amounts to [and it's fine to disassemble everyone who currently exists if it's the u/v/h/g/etc maximising policy]. This seems probable to me, though not entirely clear (I'm not an optimal configuration, and smoothly, consciousness-preservingly transitioning me to something optimal seems likely to take more resources than unceremoniously recycling me).

    Incentives now include:
    • Prevent (3) happening.
    • To the extent that you expect (3) and are selfish, live for the pre-(3) time interval, for (3) will bring your doom.
  • On (4), "This solution obviously creates the best incentives for current agents" seems badly mistaken unless I'm misunderstanding you.

    Something in this spirit would need to be based on a notion of [expected social value], not on actual contributions, since in the cases where we die we don't get to award negative points.

    For example, suppose my choice is between:
    A: {90% chance doom for everyone; 10% I save the world}
    B: {85% chance doom for everyone; 15% someone else saves the world}

    To the extent that I'm selfish, and willing to risk some chance of death for greater control over the future, I'm going to pick A under (4).
    The more selfish, reckless and power-hungry I am, and the more what I want deviates from what most people want, the more likely I am to actively put myself in position to take an A-like action.

    Moreover, if the aim is to get ideal incentives, it seems unavoidable to have symmetry and include punishments rather than only [you don't get many resources]. Otherwise the incentive is to shoot for huge magnitude of impact, without worrying much about the sign, since no-one can do worse than zero resources.

    If correct incentives were the only desideratum, I don't see how we'd avoid [post-singularity 'hell' (with some probability) for those who're reckless with AGI].
    For any nicer approach I think we'd either be incenting huge impact with uncertain sign, or failing to incent large sacrifice in order to save the world.

    Perhaps the latter is best??
    I.e. cap the max resources for any individual at a fairly low level, so that e.g. [this person was in the top percentile of helpfulness] and [this person saved the world] might get you about the same resource allocation.
    It has the upsides both of making 'hell' less necessary, and of giving a lower incentive to overconfident people with high-impact schemes. (but still probably incents particularly selfish people to pick A over B)
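A rough numerical sketch of the A-versus-B comparison above, under assumed numbers: the shares `SAVIOR_SHARE`, `BYSTANDER_SHARE` and the cap `CAP` are hypothetical and only illustrate the shape of the incentive, with doom paying zero rather than a negative amount.

```python
def expected_selfish_payoff(p_reward, share_if_rewarded):
    """Expected resource share; doom pays zero, never a negative amount."""
    return p_reward * share_if_rewarded

SAVIOR_SHARE = 1.0      # hypothetical share for the person who saves the world
BYSTANDER_SHARE = 0.01  # hypothetical share if someone else saves the world
CAP = 0.02              # hypothetical per-person cap on resources

# Option A: 10% I save the world. Option B: 15% someone else saves it.
payoff_a = expected_selfish_payoff(0.10, SAVIOR_SHARE)
payoff_b = expected_selfish_payoff(0.15, BYSTANDER_SHARE)

# Same comparison with every share capped at CAP.
capped_a = expected_selfish_payoff(0.10, min(SAVIOR_SHARE, CAP))
capped_b = expected_selfish_payoff(0.15, min(BYSTANDER_SHARE, CAP))

print(payoff_a / payoff_b)  # how strongly A is favored without a cap
print(capped_a / capped_b)  # how strongly A is favored with the cap
```

With these particular numbers the cap shrinks the selfish advantage of A considerably but does not remove it, which matches the caveat that particularly selfish people may still pick A over B.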

If correct incentives were the only desideratum, I don't see how we'd avoid [post-singularity 'hell' (with some probability) for those who're reckless with AGI].

(some very mild spoilers for yudkowsky's planecrash glowfic (very mild as in this mostly does not talk about the story, but you could deduce things about where the story goes by the fact that characters in it are discussing this))

[edit: links in spoiler tags are bugged. in the spoiler, "speculates about" should link to here and "have the stance that" to here]

"The Negative stance is that everyone just needs to stop calculating how to pessimize anybody else's utility function, ever, period. That's a simple guideline for how realness can end up mostly concentrated inside of events that agents want, instead of mostly events that agents hate."

"If at any point you're calculating how to pessimize a utility function, you're doing it wrong. If at any point you're thinking about how much somebody might get hurt by something, for a purpose other than avoiding doing that, you're doing it wrong.)"

i think this is a pretty solid principle. i'm very much not a fan of anyone's utility function getting pessimized.

so pessimising a utility function is a bad idea. but we can still produce correct incentive gradients in other ways! for example, we could say that every moral patient starts with 1 unit of utility function handshake, but if you destroy the world you lose some of your share. maybe if you take actions that cause ⅔ of timelines to die, you only get ⅓ units of utility function handshake, and the more damage you do the less handshake you get.

it never gets into the negative, that way we never go out of our way to pessimize someone's utility function; but it does get increasingly close to 0.

(this isn't necessarily a scheme i'm committed to, it's just an idea i've had for a scheme that provides the correct incentives for not destroying the world, without having to create hells / pessimize utility functions)
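A minimal sketch of such a rule, assuming the share falls linearly with the fraction of timelines destroyed. The linear form is an assumption that fits the ⅔ to ⅓ example; the function name is hypothetical.

```python
def handshake_share(fraction_of_timelines_destroyed):
    """Hypothetical rule: share equals the fraction of timelines left alive."""
    if not 0.0 <= fraction_of_timelines_destroyed <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # The share shrinks toward 0 as damage grows but never goes negative.
    return 1.0 - fraction_of_timelines_destroyed

for destroyed in (0.0, 2.0 / 3.0, 0.99, 1.0):
    print(destroyed, handshake_share(destroyed))
```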

Hmmm, I don't think that kind of thing is going to give correct world-saving incentives for the selfish part of people (unless failing to save the world counts as destroying it - in which case almost everyone is going to get approximately no influence).
More fundamentally, I don't think it works out in this kind of case due to logical uncertainty.

If I'm uncertain about a particular plan, and my estimate is {80% everyone dies; 20% I save the world}, that's not {in 80% of timelines everyone dies; in 20% of timelines I save the world}.

It's closer to [there's an 80% chance that {in ~99% of timelines everyone dies}; there's a 20% chance that {in ~99% of timelines I save the world}].

So, conditional on my saving the world in some timeline by taking some action, I saved the world in most timelines where I took that action and would get a load of influence. This won't disincentivize risky gambles for selfish/power-hungry people. (at least of the form [let's train this model and see what happens] - most of the danger there being a logical uncertainty thing)

I think influence would need to be based on expected social value given the 'correct' level of logical uncertainty - probably something like [what (expected value | your action) is justified by your beliefs, and valid arguments you'd make for them based on information you have].
Or at least some subjective perspective seems to be necessary - and something that doesn't give more points for overconfident people.
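To make the gap concrete, here is a sketch comparing a share computed from the timelines that actually result with a share computed from the ex ante estimate. It reuses the share-of-surviving-timelines idea proposed upthread; the 80%/20% and ~99% figures come from the example above, and all names are hypothetical.

```python
P_PLAN_IS_BAD = 0.80           # logical uncertainty: credence that the plan is doomed
DEATH_FRACTION_IF_BAD = 0.99   # fraction of timelines that die if the plan is doomed
DEATH_FRACTION_IF_GOOD = 0.01  # fraction of timelines that die if the plan works

def share_from_timelines(death_fraction):
    """Share equals the fraction of timelines left alive."""
    return 1.0 - death_fraction

# Share the gambler collects in the logical world where the plan works.
share_if_good = share_from_timelines(DEATH_FRACTION_IF_GOOD)

# Share if the rule instead used the ex ante expected death fraction.
expected_death_fraction = (
    P_PLAN_IS_BAD * DEATH_FRACTION_IF_BAD
    + (1.0 - P_PLAN_IS_BAD) * DEATH_FRACTION_IF_GOOD
)
share_from_credence = share_from_timelines(expected_death_fraction)

print(share_if_good)        # close to a full share
print(share_from_credence)  # much smaller, reflecting the risk taken
```

Under the timeline-based rule, anyone who survives to collect is almost always in the world where the plan worked, so the gamble is barely penalized; a credence-based rule penalizes it regardless of how things turn out.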

Here's a simple argument that I find quite persuasive for why you should have linear returns to whatever your final source of utility is (e.g. human experience of a fulfilling life, which I'll just call "happy humans"). Note that this is not an argument that you should have linear returns to resources (e.g. money). The argument goes like this:

  1. You have some returns to happy humans (or whatever else you're using as your final source of utility) in terms of how much utility you get from some number of happy humans existing.
  2. In most cases, I think those returns are likely to be diminishing, but nevertheless monotonically increasing and differentiable. For example, maybe you have logarithmic returns to happy humans.
  3. We happen to live in a massive multiverse. (Imo the Everett interpretation is settled science, and I don't think you need to accept anything else to make this go through, but note that we're only depending on the existence of any sort of big multiverse here—the one that the Everett interpretation gives you is just the only one that we know is guaranteed to actually exist.)
  4. In a massive multiverse, the total number of happy humans is absolutely gigantic (let's ignore infinite ethics problems, though, and assume it's finite—though I think this argument still goes through in the infinite case, it just then depends on whatever infinite ethics framework you like).
  5. Furthermore, the total number of happy humans is mostly insensitive to anything you can do, or anything happening locally within this universe, since this universe is only a tiny fraction of the overall multiverse. (Though you could get out of this by claiming that what you really care about is happy humans per universe, that's a pretty strange thing to care about—it's like caring about happy humans per acre.)
  6. As a result, the effective returns to happy humans that you are exposed to within this universe reflect only the local behavior of your overall returns. (Note that this assumes "happy humans" are fungible, which I don't actually believe—I care about the overall diversity of human experience throughout the multiverse. However, I don't think that changes the bottom line conclusion, since, if anything, centralizing the happy humans rather than spreading them out seems like it would make it easier to ensure that their experiences are as diverse as possible.)
  7. As anyone who has taken any introductory calculus will know, the local behavior of any differentiable function is linear.
  8. Since we assumed that your overall returns were differentiable and monotonically increasing, the local returns must be linear with a positive slope.
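A quick numerical check of steps 7 and 8, assuming logarithmic returns as in step 2. The multiverse total of 1e40 and the local quantities are hypothetical numbers picked only to make "local change is tiny relative to the total" concrete.

```python
import math

def log_utility_gain(total, extra):
    """Exact gain in log utility from adding `extra` happy humans to `total`."""
    # log(total + extra) - log(total), computed stably via log1p.
    return math.log1p(extra / total)

def linear_gain(total, extra):
    """Tangent-line approximation: the slope of log at `total` is 1 / total."""
    return extra / total

MULTIVERSE_TOTAL = 1e40  # hypothetical total number of happy humans

for extra in (1e10, 2e10, 4e10):
    exact = log_utility_gain(MULTIVERSE_TOTAL, extra)
    approx = linear_gain(MULTIVERSE_TOTAL, extra)
    print(extra, exact, approx)

# Contrast: when the change is comparable to the total, returns visibly diminish.
print(log_utility_gain(100.0, 100.0), linear_gain(100.0, 100.0))
```

When the local change is negligible next to the total, doubling the local number of happy humans doubles the utility gain, which is what linear returns means; the last line shows the same function behaving non-linearly once the change is comparable to the total.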

You're assuming that your utility function should have the general form of valuing each "source of utility" independently and then aggregating those values (such that when aggregating you no longer need the details of each "source" but just their values). But in The Moral Status of Independent Identical Copies I found this questionable (i.e., conflicting with other intuitions).

This is the fungibility objection I address above:

Note that this assumes "happy humans" are fungible, which I don't actually believe—I care about the overall diversity of human experience throughout the multiverse. However, I don't think that changes the bottom line conclusion, since, if anything, centralizing the happy humans rather than spreading them out seems like it would make it easier to ensure that their experiences are as diverse as possible.

Ah, I think I didn't understand that parenthetical remark and skipped over it. Questions:

  1. I thought your bottom line conclusion was "you should have linear returns to whatever your final source of utility is" and I'm not sure how "centralizing the happy humans rather than spreading them out seems like it would make it easier to ensure that their experiences are as diverse as possible" relates to that.
  2. I'm not sure that the way my utility function deviates from fungibility is "I care about overall diversity of human experience throughout the multiverse". What if it's "I care about diversity of human experience in this Everett branch"? Then I could get a non-linear diminishing returns effect where, as humans colonize more stars or galaxies, each new human experience is more likely to duplicate an existing human experience or be too similar to an existing experience, so that its value has to be discounted.
  1. The thing I was trying to say there is that I think the non-fungibility concern pushes in the direction of superlinear rather than sublinear local returns to "happy humans" per universe. (Since concentrating the "happy humans" likely makes it easier to ensure that they're all different.)
  2. I agree that this will depend on exactly in what way you think your final source of utility is non-fungible. I would argue that "diversity of human experience in this Everett branch" is a pretty silly thing to care about, though. I don't see any reason why spatial distance should behave differently than being in separate Everett branches here.

I don’t see any reason why spatial distance should behave differently than being in separate Everett branches here.

I tried to explain my intuitions/uncertainty about this in The Moral Status of Independent Identical Copies (it was linked earlier in this thread).

I read it, and I think I broadly agree with it, but I don't know why you think it's a reason to treat physical distance differently to Everett branch distance, holding diversity constant. The only reason that you would want to treat them differently, I think, is if the Everett branch happy humans are very similar, whereas the physically separated happy humans are highly diverse. But, in that case, that's an argument for superlinear local returns to happy humans, since it favors concentrating them so that it's easier to make them as diverse as possible.

but I don’t know why you think it’s a reason to treat physical distance differently to Everett branch distance

I have a stronger intuition for "identical copy immortality" when the copies are separated spatially instead of across Everett branches (the latter also called quantum immortality). For example if you told me there are 2 identical copies of Earth spread across the galaxy and 1 of them will instantly disintegrate, I would be much less sad than if you told me that you'll flip a quantum coin and disintegrate Earth if it comes up heads.

I'm not sure if this is actually a correct intuition, but I'm also not sure that it's not, so I'm not willing to make assumptions that contradict it.

Furthermore, the total number of happy humans is mostly insensitive to anything you can do, or anything happening locally within this universe, since this universe is only a tiny fraction of the overall multiverse.

Not sure about this. Even if I think I am only acting locally, my actions and decisions could have an effect on the larger multiverse. When I do something to increase happy humans in my own local universe, I am potentially deciding / acting for everyone in my multiverse neighborhood who is similar enough to me to make similar decisions for similar reasons.

I agree that this is the main way that this argument could fail. Still, I think the multiverse is too large and the correlation not strong enough across very different versions of the Earth for this objection to substantially change the bottom line.

(Though you could get out of this by claiming that what you really care about is happy humans per universe, that's a pretty strange thing to care about—it's like caring about happy humans per acre.)

My sense is that many solutions to infinite ethics look a bit like this. For example, if you use UDASSA, then a single human who is alone in a big universe will have a shorter description length than a single human who is surrounded by many other humans in a big universe. Because for the former, you can use pointers that specify the universe and then describe sufficient criteria to recognise a human, but for the latter, you need to nail down exact physical location or some other exact criteria that distinguishes a specific human from every other human.

I agree that UDASSA might introduce a small effect like this, but my guess is that the overall effect isn't enough to substantially change the bottom line. Fundamentally, being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty.

being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty

Maybe? I don't really know how to reason about this.

If that's true, that still only means that you should be linear for gambles that give different results in different quantum branches. Cf. logical vs. physical risk aversion.

Some objection like that might work more generally, since some logical facts will mean that there are far fewer humans in the universe-at-large, meaning that you're at a different point in the risk-returns curve. So when comparing different logical ways the universe could be, you should not always care about the worlds where you can affect more sentient beings. If you have diminishing marginal returns, you need to be thinking about some more complicated function that is about whether you have a comparative advantage at affecting more sentient beings in worlds where there are overall fewer sentient beings (as measured by some measure that can handle infinities). Which matters for stuff like whether you should bet on the universe being large.

[-]evhubΩ220

I wrote up some of my thoughts on how to think about comparing the complexities of different model classes here.

[-]evhubΩ220

If you want to produce warning shots for deceptive alignment, you're faced with a basic sequencing question. If the model is capable of reasoning about its training process before it's capable of checking a predicate like RSA-2048, then you have a chance to catch it—but if it becomes capable of checking a predicate like RSA-2048 first, then any deceptive models you build won't be detectable.
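For concreteness, here is a sketch of what "checking a predicate like RSA-2048" could look like, on the usual reading that the predicate asks whether the input contains a factorization of a number that is infeasible to factor today. `TOY_MODULUS` is a small stand-in rather than the real RSA-2048 challenge number, and the policy function is purely hypothetical.

```python
TOY_MODULUS = 3233  # 61 * 53, a toy stand-in for the RSA-2048 challenge modulus

def sees_factorization(observed_integers, modulus=TOY_MODULUS):
    """True if any observed integer is a nontrivial factor of `modulus`."""
    return any(1 < n < modulus and modulus % n == 0 for n in observed_integers)

def deceptive_policy(observed_integers):
    """Hypothetical deceptive model: defect only once the predicate fires."""
    if sees_factorization(observed_integers):
        return "defect"
    return "behave as aligned"

print(deceptive_policy([2, 3, 5, 7]))  # no factor present in the input
print(deceptive_policy([61]))          # a factor of the toy modulus is present
```

The check is cheap to run but the triggering input is hard to construct during training, which is why a model that conditions on such a predicate would look aligned under any test the overseers can currently produce.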