Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Habryka recently wrote (emphasis mine):

My inside views on AI Alignment make me think that work on impact measures is very unlikely to result in much concrete progress on what I perceive to be core AI Alignment problems, and I have talked to a variety of other researchers in the field who share that assessment. I think it’s important that this grant not be viewed as an endorsement of the concrete research direction that Alex is pursuing, but only as an endorsement of the higher-level process that he has been using while doing that research.

As such, I think it was a necessary component of this grant that I have talked to other people in AI Alignment whose judgment I trust, who do seem excited about Alex’s work on impact measures. I think I would not have recommended this grant, or at least this large of a grant amount, without their endorsement. I think in that case I would have been worried about a risk of diverting attention from what I think are more promising approaches to AI Alignment, and a potential dilution of the field by introducing a set of (to me) somewhat dubious philosophical assumptions.

I'm interested in learning about the intuitions, experience, and facts which inform this pessimism. As such, I'm not interested in making any arguments to the contrary in this post; any pushback I provide in the comments will be with clarification in mind.

There are two reasons you could believe that "work on impact measures is very unlikely to result in much concrete progress on… core AI Alignment problems". First, you might think that the impact measurement problem is intractable, so work is unlikely to make progress. Second, you might think that even a full solution wouldn't be very useful.

Over the course of 5 minutes by the clock, here are the reasons I generated for pessimism (each of which I either presently agree with, or at least find it reasonable for an intelligent critic to raise on the basis of currently-public reasoning):

  • Declarative knowledge of a solution to impact measurement probably wouldn't help us do value alignment, figure out embedded agency, etc.
  • We want to figure out how to transition to a high-value stable future, and it just isn't clear how impact measures help with that.
  • Competitive and social pressures incentivize people to cut corners on safety measures, especially those which add overhead.
    • Computational overhead.
    • Implementation time.
    • Training time, assuming they start with low aggressiveness and dial it up slowly.
  • Depending on how "clean" of an impact measure you think we can get, maybe it's way harder to get low-impact agents to do useful things.
    • Maybe we can get a clean one, but only for powerful agents.
    • Maybe the impact measure misses impactful actions if you can't predict at near human level.
  • In a world where we know how to build powerful AI but not how to align it (which is actually probably the scenario in which impact measures do the most work), we play a very unfavorable game while we use low-impact agents to somehow transition to a stable, good future: the first person to set the aggressiveness too high, or to discard the impact measure entirely, ends the game.
  • In a "More realistic tales of doom"-esque scenario, it isn't clear how impact measures help prevent "gradually drifting off the rails".

Paul raised concerns along these lines:

We'd like to build AI systems that help us resolve the tricky situation that we're in. That help design and enforce agreements to avoid technological risks, build better-aligned AI, negotiate with other actors, predict and manage the impacts of AI, improve our institutions and policy, etc.

I think the default "terrible" scenario is one where increasingly powerful AI makes the world change faster and faster, and makes our situation more and more complex, with humans having less and less of a handle on what is going on or how to steer it in a positive direction. Where we must rely on AI to get anywhere at all, and thereby give up the ability to choose where we are going.

That may ultimately culminate with a catastrophic bang, but if it does it's not going to be because we wanted the AI to have a small impact and it had a large impact. It's probably going to be because we have a very limited idea what is going on, but we don't feel like we have the breathing room to step back and chill out (at least not for long) because we don't believe that everyone else is going to give us time.

If I'm trying to build an AI to help us navigate an increasingly complex and rapidly-changing world, what does "low impact" mean? In what sense do the terrible situations involve higher objective impact than the intended behaviors?

(And realistically I doubt we'll fail at alignment with a bang---it's more likely that the world will just drift off the rails over the course of a few months or years. The intuition that we wouldn't let things go off the rails gradually seems like the same kind of wishful thinking that predicts war or slow-rolling environmental disasters should never happen.)

It seems like "low objective impact" is what we need once we are in the unstable situation where we have the technology to build an AI that would quickly and radically transform the world, but we have all decided not to and so are primarily concerned about radically transforming the world by accident. I think that's a coherent situation to think about and plan for, but we shouldn't mistake it for the mainline. (I personally think it is quite unlikely, and it would definitely be unprecedented, though you could still think it's the best hope if you were very pessimistic about what I consider "mainline" alignment.)


When I think about solutions to AI alignment, I often think about 'meaningful reductionism.' That is, if I can factor a problem into two parts, and the parts don't actually rely on each other, now I have two smaller problems to solve. But if the parts are reliant on each other, I haven't really simplified anything yet.

While impact measures feel promising to me as a cognitive strategy (often my internal representation of politeness feels like 'minimizing negative impact', like walking on sidewalks in a way that doesn't startle birds), they don't feel promising to me as reductionism. That is, if I already had a solution to the alignment problem, then impact measures would likely be part of how I implement that solution, but solving it separately from alignment doesn't feel like it gets me any closer to solving alignment.

[The argument here I like most rests on the difference between costs and side effects; we don't want to minimize side effects because that leads to minimizing good side effects also, and it's hard to specify the difference between 'side effects' and 'causally downstream effects,' and so on. But if we just tell the AI "score highly on a goal measure while scoring low on this cost measure," this only works if we specified the goal and the cost correctly.]

But there's a different approach to AI alignment, which is something more like 'correct formalisms.' We talk sometimes about handing a utility function to the robot, or (in old science fiction) providing it with rules to follow, and by seeing what it actually looks like when we follow that formalism, we can figure out how well the formalism fits what we're interested in. Utility functions on sensory inputs don't seem alignable because of various defects (like wireheading), so it seems like the right formalism needs to have some other features. It might still be a utility function, but it needs to be a utility function over mental representations of external reality, such that the mental representation tracks external reality even when you have freedom to alter your mental representation (in a way that we can't turn into code yet).

So when I ask myself questions like "why am I optimistic about researching impact measures now?" I get answers like "because exploring the possibility space will make clear exactly how the issues link up." For example, looking at things like relative reachability made it clear to me how value-laden the ontology needs to be in order for a statistical measure on states to be meaningful. This provides a different form-factor for 'transferring values to the AI'; instead of trying to ask something like "is scenario A or B better?" and train a utility function, I might instead try to ask something like "how different are scenarios A and B?" or "how are scenarios A and B different?" and train an ontology, with the hopes that this makes other alignment problems easier because the types line up somewhat more closely.

[I think even that last example still performs poorly on the 'meaningful reductionism' angle, since getting more options for types to use in value loading doesn't seem like it addresses the core obstacles of value loading, but provides some evidence of how it could be useful or clarify thinking.]
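As a concrete referent for the kind of statistical measure on states mentioned above, here is a minimal sketch in the spirit of relative reachability. This is a deliberately simplified variant, not the published formulation: reachability is undiscounted 0/1 over a small deterministic transition graph, and the graph, state names, and baseline convention are all illustrative assumptions.

```python
from collections import deque

def reachable(graph, start):
    """Return the set of states reachable from `start` via breadth-first search."""
    seen, frontier = {start}, deque([start])
    while frontier:
        s = frontier.popleft()
        for nxt in graph.get(s, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def relative_reachability_penalty(graph, state, baseline):
    """Fraction of baseline-reachable states no longer reachable from `state`.

    Simplified 0/1 reachability; the published measure uses discounted
    reachability and compares against an inaction baseline.
    """
    base, cur = reachable(graph, baseline), reachable(graph, state)
    return len(base - cur) / max(1, len(base))

# Toy world (hypothetical): breaking a vase is irreversible.
world = {
    "vase_intact": ["vase_intact", "vase_broken"],
    "vase_broken": ["vase_broken"],
}
# From the intact baseline, breaking the vase loses half the reachable states:
# relative_reachability_penalty(world, "vase_broken", "vase_intact") == 0.5
```

Note how value-laden the choice of state space already is in even this toy version: whether "vase_intact" and "vase_broken" count as distinct states is exactly the ontology question raised above.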

this only works if we specified the goal and the cost correctly

Wait, why doesn't it work if you just specify the cost (impact) correctly?

Suppose the goal dramatically overvalues some option; then the AI would be willing to pay large (correctly estimated) costs in order to achieve "even larger" (incorrectly estimated) gains.
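To make the failure mode concrete, here is a toy sketch (all action names and numbers are hypothetical): even a perfectly specified impact cost is swamped when the goal estimate dramatically overvalues an option.

```python
def best_action(actions, goal_estimate, impact_cost, penalty_weight=1.0):
    """Pick the action maximizing estimated goal value minus an impact penalty."""
    return max(actions, key=lambda a: goal_estimate[a] - penalty_weight * impact_cost[a])

actions = ["modest", "drastic"]
impact_cost = {"modest": 1.0, "drastic": 50.0}      # correctly estimated impact
misspecified = {"modest": 10.0, "drastic": 1000.0}  # "drastic" is actually worth ~10

# The correctly-estimated cost of 50 is dominated by the inflated gain of 1000:
# best_action(actions, misspecified, impact_cost) == "drastic"
```

With the goal correctly estimated (both options worth ~10), the same cost term would steer the agent back to "modest"; the penalty only does its job when the goal side of the trade-off is not too wrong.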
  • Giving people a slider with "safety" written on one end and "capability" written on the other, and then trying to get people to set it close enough to the "safety" end, seems like a bad situation. (Very similar to points you raised in your 5-min-timer list.)
    • An improvement on this situation would be something which looked more like a theoretical solution to Goodhart's law, giving an (in-some-sense) optimal setting of a slider to maximize a trade-off between alignment and capabilities ("this is how you get the most of what you want"), allowing ML researchers to develop algorithms orienting toward this.
    • Even better (but similarly), an approach where capability and alignment go hand in hand would be ideal -- a way to directly optimize for "what I mean, not what I say", such that it is obvious that things are just worse if you depart from this.
    • However, maybe those things are just pipe dreams -- this should not be the fundamental reason to ignore impact measures, unless promising approaches in the other two categories are pointed out; and even then, impact measures as a backup plan would still seem desirable.
      • My response to this is roughly that I prefer mild optimization techniques for this backup plan. Like impact measures, they are vulnerable to the objection above; but they seem better in terms of the objection which follows.
      • Part of my intuition, however, is just that mild optimization is going to be closer to the theoretical heart of anti-Goodhart technology. (Evidence for this is that quantilization seems, to me, theoretically nicer than any low-impact measure.)
        • In other words, conditioned on having a story more like "this is how you get the most of what you want" rather than a slider reading "safety ------- capability", I more expect to see a mild optimizer as opposed to an impact measure.
  • Unlike mild-optimization approaches, impact measures still allow potentially large amounts of optimization pressure to be applied to a metric that isn't exactly what we want.
    • It is apparent that some attempted impact measures run into nearest-unblocked-strategy type problems, where the supposed patch just creates a different problem when a lot of optimization pressure is applied. This gives reason for concern even if you can't spot a concrete problem with a given impact measure: impact measures don't address the basic nearest-unblocked-strategy problem, and so are liable to severe Goodhartian results.
    • If an impact measure were perfect, then adding it as a penalty on an otherwise (slightly or greatly) misaligned utility function just seems good, and adding it as a penalty to a perfectly aligned utility function would seem an acceptable loss. If impact is slightly misspecified, however, then adding it as a penalty may make a utility function less friendly than it otherwise would be.
      • (It is a desirable feature of safety measures, that those safety measures do not risk decreasing alignment.)
    • On the other hand, a mild optimizer seems to get the spirit of what's wanted from low-impact.
      • This is only somewhat true: a mild optimizer may create a catastrophe through negligence, where a low-impact system would try hard to avoid doing so. However, I view this as a much more acceptable and tractable problem than the nearest-unblocked-strategy type problem.
  • Both mild optimization and impact measures require separate approaches to "doing what people want".
    • Arguably this is OK, because they could greatly reduce the bar for alignment of specified utility functions. However, it seems possible to me that we need to understand more about the fundamentally puzzling nature of "do what I want" before we can be confident even in low-impact or mild-optimization approaches, because it is difficult to confidently say that an approach avoids risk of hugely violating your preferences while still being so confused about what human preference even is.
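Since quantilization comes up above as the preferred mild-optimization backup, here is a minimal sketch of the idea: instead of taking the argmax of a possibly-misspecified utility, sample from the top q-fraction of a base distribution. This toy version assumes a uniform base distribution over a finite action set; the actual proposal allows an arbitrary base distribution.

```python
import random

def quantilize(actions, utility, q=0.1, rng=random):
    """Sample uniformly from the top q-fraction of actions ranked by utility.

    q = 1 recovers sampling from the (uniform) base distribution;
    q -> 0 approaches ordinary argmax, and with it full optimization pressure.
    """
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(len(ranked) * q))  # size of the top quantile
    return rng.choice(ranked[:k])
```

Part of the theoretical appeal, relative to a hand-tuned safety slider, is that q has a clean interpretation: a q-quantilizer can do at most 1/q times as much expected damage as a random draw from the base distribution, whatever the unmodeled costs are.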

I have an intuition that while impact measures as a way of avoiding negative side effects might work well in toy models, it will be hard or impossible to get them to work in the real world, because what counts as a negative side effect in the real world seems too complex to easily capture. It seems like AUP tries to get around this by aiming at a lower bar than "avoid negative side effects", namely "avoid catastrophic side effects". Setting aside whether it actually succeeds at clearing this lower bar, this would mean that an AI which is only "safe" because of AUP can't be safely used for ordinary goals (e.g., invent a better widget, or make someone personally more successful in life); instead we have to somehow restrict such AIs to goals that relate to x-risk reduction, where it's worthwhile to risk incurring less-than-catastrophic negative side effects.

As a side note, it seems generally the case that some approaches to AI safety/alignment aim at the higher bar of "safe for general use" and others aim at "safe enough to use for x-risk reduction", and this isn't always made clear, which can be a source of confusion for both AI safety/alignment researchers and others such as strategists and policy makers.

I have an intuition that while impact measures as a way of avoiding negative side effects might work well in toy models, it will be hard or impossible to get them to work in the real world

Do you think there any experiments that could be performed that would change your view on this point? Or is an impact measure the type of thing that does not generalize well from testing environment to the real world?

I have an intuition that while impact measures as a way of avoiding negative side effects might work well in toy models, it will be hard or impossible to get them to work in the real world, because what counts as a negative side effect in the real world seems too complex to easily capture.

Although a far cry from "[avoiding side effects] in the real world", see Avoiding Side Effects in Complex Environments as another piece of evidence to update on.

My concern is similar to Wei Dai's: it seems to me that at a fundamental physical level, any plan involving turning on a computer that does important stuff will make pretty big changes to the world's trajectory in phase space. Heat dissipation will cause atmospheric particles to change their location and momentum, future weather patterns will be different, people will do things at different times (e.g. because they're waiting for a computer program to run, or because the computer is designed to change the flow of traffic through a city), meet different people, and have different children. As a result, it seems hard for me to understand how impact measures could work in the real world without a choice of representation very close to the representation humans use to determine the value of different worlds. I suspect that this will need input from humans similar to what value learning approaches might need, and that once it's done one could just do value learning and dispense with the need for impact measures. That being said, this is more of an impression than a belief - I can't quite convince myself that no good method of impact regularisation exists, and some other competent people seem to disagree with me.

How does this concern interact with the effective representation invariance claim I made when introducing AUP?

I don't see how representation invariance addresses this concern. As far as I understand, the concern is about any actions in the real world causing large butterfly effects. This includes effects that would be captured by any reasonable representation, e.g. different people existing in the action and inaction branches of the world. The state representations used by humans also distinguish between these world branches, but humans have limited models of the future that don't capture butterfly effects (e.g. person X can distinguish between the world state where person Y exists and the world state where person Z exists, but can't predict that choosing a different route to work will cause person Z to exist instead of person Y). I agree with Daniel that this is a major problem with impact measures. I think that to get around this problem we would either need to figure out how to distinguish butterfly effects from other effects (and then include all the butterfly effects in the inaction branch) or use a weak world model that does not capture butterfly effects (similarly to humans) for measuring impact. Even if we know how to do this, it's not entirely clear whether we should avoid penalizing butterfly effects. Unlike humans, AI systems would be able to cause butterfly effects on purpose, and could channel their impact through butterfly effects if they are not penalized.
I think my post was basically saying "representation selection seems like a problem because people are confused about the type signature of impact, which is actually a thing you can figure out no matter what you think the world is made of". I don't want to go into too much detail here (as I explained below), but part of what this implies is that discrete "effects" are fake/fuzzy mental constructs/not something to think about when designing an impact measure. In turn, this would mean we should ask a different question that isn't about butterfly effects.
Indeed - a point I think is illustrated by the Chaotic Hurricanes test case. I'm probably most excited about methods that would use transparency techniques to determine when a system is deliberately optimising for a part of the world (e.g. the members of the long-term future population) that we don't want it to care about, but this has a major drawback of perhaps requiring multiple philosophical advances into the meaning of reference in cognition and a greater understanding of what optimisation is.
What would you predict AUP does for the chaotic scenarios? Suppose the attainable set just includes the survival utility function, which is 1 if the agent is activated and 0 otherwise.
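Concretely, an AUP-style penalty with that one-element attainable set might look like the following sketch, assuming the auxiliary Q-values are already computed (all state/action names and Q-values here are made up for illustration, not taken from the AUP paper):

```python
def aup_penalty(q_values, state, action, noop="noop"):
    """Mean absolute change in attainable utility, relative to doing nothing.

    q_values maps each auxiliary reward name to a function Q(state, action),
    assumed precomputed (e.g. by value iteration in a small MDP).
    """
    diffs = [abs(q(state, action) - q(state, noop)) for q in q_values.values()]
    return sum(diffs) / len(diffs)

# Attainable set = just the survival utility function described above.
# Hypothetical Q-values: seizing control of the off switch makes survival much
# easier to attain; breaking a vase barely moves it.
q_survival = lambda state, action: {
    "noop": 1.0, "break_vase": 1.0, "disable_off_switch": 5.0}[action]

# aup_penalty({"survival": q_survival}, "s0", "disable_off_switch") == 4.0
# aup_penalty({"survival": q_survival}, "s0", "break_vase") == 0.0
```

On this toy model, the chaotic downstream effects of `break_vase` go unpenalized unless they change the agent's ability to stay activated, which is presumably what the question about the chaotic scenarios is probing.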
I think that under the worldview of this concern, the distribution of reward functions effectively defines a representation that, if too different from the one humans care about, will either mean that no realistic impact is possible in the real world or be ineffective at penalising unwanted negative impacts.
is there a central example you have in mind for this potential failure mode?
I can't quite convince myself that no good method of value learning exists, and some other competent people seem to disagree with me.

No good method of measuring impact, presumably?

I prefer the phrase 'impact regularisation', but indeed that was a slip of the mind.

Thanks Alex for starting this discussion and thanks everyone for the thought-provoking answers. Here is my current set of concerns about the usefulness of impact measures, sorted in decreasing order of concern:

Irrelevant factors. When applied to the real world, impact measures are likely to be dominated by things humans don't care about (heat dissipation, convection currents, positions of air molecules, etc). This seems likely to happen to value-agnostic impact measures, e.g. AU with random utility functions, which would mostly end up rewarding specific configurations of air molecules.

This may be mitigated by inability to perceive the irrelevant factors, which results in a more coarse-grained state representation: if the agent can't see air molecules, all the states with different air molecule positions will look the same, as they do to humans. Some human-relevant factors can also be difficult to perceive, e.g. the presence of poisonous gas in the room, so we may not want to limit the agent's perception ability to human level. Automatically filtering out irrelevant factors does seem difficult, and I think this might imply that it is impossible to design an impact measure that is both useful and truly value-agnostic.

However, the value-agnostic criterion does not seem very important in itself. I think the relevant criterion is that designing impact measures should be easier than the general value learning problem. We already have a non-value-agnostic impact measure that plausibly satisfies this criterion: RLSP learns what is effectively an impact measure (the human theta parameter) using zero human input just by examining the starting state. This could also potentially be achieved by choosing an attainable utility set that rewards a broad enough sample of things humans care about, and leaves the rest to generalization. Choosing a good attainable utility set may not be easy but it seems unlikely to be as hard as the general value learning problem.

Butterfly effects. Every action is likely to have large effects that are difficult to predict, e.g. taking a different route to work may result in different people being born. Taken literally, this means that there is no such thing as a low-impact action. Humans get around this by only counting easily predictable effects as impact that they are considered responsible for. If we follow a similar strategy of not penalizing butterfly effects, we might incentivize the agent to deliberately cause butterfly effects. The easiest way around this that I can currently see is restricting the agent's capability to model the effects of its actions, though this has obvious usefulness costs as well.

Chaotic world. Every action, including inaction, is irreversible, and each branch contains different states. While preserving reversibility is impossible in this world, preserving optionality (attainable utility, reachability, etc) seems possible. For example, if the attainable set contains a function that rewards the presence of vases, the action of breaking a vase will make this reward function more difficult to satisfy (even if the states with/without vases are different in every branch). If we solve the problem of designing/learning a good utility set that is not dominated by irrelevant factors, I expect chaotic effects will not be an issue.

If any of the above-mentioned concerns are not overcome, impact measures will fail to distinguish between what humans would consider low-impact and high-impact. Thus, penalizing high-impact actions would come with penalizing low-impact actions as well, which would result in a strong safety-capability tradeoff. I think the most informative direction of research to figure out whether these concerns are a deal-breaker is to scale up impact measures to apply beyond gridworlds, e.g. to Atari games.

Thanks for the detailed list!

AU with random utility functions, which would mostly end up rewarding specific configurations of air molecules.

What does this mean, concretely? And what happens with the survival utility function being the sole member of the attainable set? Does this run into that problem, in your model?

Humans get around this by only counting easily predictable effects as impact that they are considered responsible for.

What makes you think that?

I meant that for an attainable set consisting of random utility functions, I would expect most of the variation in utility to be based on irrelevant factors like the positions of air molecules. This does not apply to the attainable set consisting of the survival utility function, since that is not a random utility function.

The claim about responsibility is an intuitive one, based on a general observation of how people attribute responsibility. For example, if I walk into a busy street and get hit by a car, I will be considered responsible for this because it's easy to predict. On the other hand, if I am walking down the street and a brick falls on my head from the nearby building, then I will not be considered responsible, because this event would be hard to predict. There are probably other reasons that humans don't consider themselves responsible for butterfly effects.
Are you thinking of an action observation formalism, or some kind of reward function over inferred state? If you had to pose the problem of impact measurement as a question, what would it be?
I don't quite understand what you're asking here, could you clarify? Something along the lines of: "How can we measure to what extent the agent is changing the world in ways that we care about?". Why?

So there's a thing people do when they talk about AUP which I don't understand. They think it's about state, even though I insist it's fundamentally different, and try to explain why (note that AUP in the MDP setting is necessarily over states, because states are the observations). My explanations apparently haven't been very good; in the given conversation, they acknowledge that it's different, but then regress a little while later. I think they might be trying to understand the explanation, remaining confused, and then subconsciously slipping back to their old model. Out of everyone I've talked to, I can probably count on my hands the number of people who get this (note that agreeing with specific predictions of mine is different).

Now, it's the author's job to communicate their ideas. When I say "as far as I can tell, few others have internalized how AUP actually works", this doesn't connote "gosh, I can't stand you guys, how could you do this", it's more like "somehow I messed up the explanations; I wonder what key ideas are missing still? How can I fix this?".

My goal with this comment isn't to...

I think in a conversation I had with you last year, I kept going back to 'state' despite protests because I kept thinking "if AUP works, surely it would be because some of the utility functions calculate a sensible state estimate in a humanlike ontology and then define utility from this". It isn't necessarily the right way to critique AUP, but I think I was right to think those thoughts conditional on that assumption -- i.e., even if it isn't the argument you're trying to make for AUP, it seems like a not-unreasonable position to consider, and so thinking about how AUP does in terms of state can be a reasonable and important part of a thought-process assessing AUP. I believe I stopped making the assumption outright at some point, but kept bringing out the assumption as a tool for analysis -- for example, supporting a thought experiment with the argument that there would at least be some utility functions which thought about the external world enough to care about such-and-such. I think in our conversation I managed to appropriately flag these sorts of assumptions such that you were OK with the role they were playing in the wider argument (well... not in the sense of necessarily accepting the arguments, but in the sense of not thinking I was just repeatedly making the mistake of thinking it has to be about state, I think). Other people could be thinking along similar lines without flagging it so clearly.
Thanks for the detailed explanation - I feel a bit less confused now. I was not intending to express confidence about my prediction of what AU does. I was aware that I didn't understand the state representation invariance claim in the AUP proposal, though I didn't realize that it is as central to the proposal as you describe here. I am still confused about what you mean by penalizing 'power' and what exactly it is a function of. The way you describe it here sounds like it's a measure of the agent's optimization ability that does not depend on the state at all. Did you mean that in the real world the agent always receives the same AUP penalty no matter which state it is in? If that is what you meant, then I'm not sure how to reconcile your description of AUP in the real world (where the penalty is not a function of the state) and AUP in an MDP (where it is a function of the state). I would find it helpful to see a definition of AUP in a POMDP as an intermediate case. I agree with Daniel's comment that if AUP is not penalizing effects on the world, then it is confusing to call it an 'impact measure', and something like 'optimization regularization' would be better. Since I still have lingering confusions after your latest explanation, I would really appreciate it if someone else who understands this could explain it to me.
I am still confused about what you mean by penalizing 'power' and what exactly it is a function of. The way you describe it here sounds like it's a measure of the agent's optimization ability that does not depend on the state at all.

It definitely does depend on the state. If the agent moves to a state where it has taken over the world, that's a huge increase in its ability to achieve arbitrary utility functions, and it would get a large penalty.

I think the claim is more that while the penalty does depend on the state, it's not central to think about the state to understand the major effects of AUP. (As an analogy, if you want to predict whether I'm about to leave my house, it's useful to see whether or not I'm wearing shoes, but if you want to understand why I am or am not about to leave my house, whether I'm wearing shoes is not that relevant -- you'd want to know what my current subgoal or plan is.)

Similarly, with AUP, the claim is that while you can predict what the penalty is going to be by looking at particular states and actions, and the penalty certainly does change with different states/actions, the overall effect of A...

Thanks Rohin! Your explanations (both in the comments and offline) were very helpful and clarified a lot of things for me. My current understanding as a result of our discussion is as follows.

AU is a function of the world state, but intends to capture some general measure of the agent's influence over the environment that does not depend on the state representation. Here is a hierarchy of objects, where each object is a function of the previous one: world states / microstates (e.g. quark configuration) -> observations (e.g. pixels) -> state representation / coarse-graining (which defines macrostates as equivalence classes over observations) -> featurization (a coarse-graining that factorizes into features). The impact measure is defined over the macrostates.

Consider the set of all state representations that are consistent with the true reward function (i.e. if two microstates have different true rewards, then their state representations are different). The impact measure is representation-invariant if it has the same values for any state representation in this reward-compatible set. (Note that if representation invariance were defined over the set of all possible state representations, this set would include the most coarse-grained representation with all observations in one macrostate, which would imply that the impact measure is always 0.)

Now consider the most coarse-grained representation R that is consistent with the true reward function. An AU measure defined over R would remain the same for a finer-grained representation. For example, if the attainable set contains a reward function that rewards having a vase in the room, and the representation is refined to distinguish green and blue vases, then macrostates with different-colored vases would receive the same reward. Thus, this measure would be representation-invariant. However, for an AU measure defined over a finer-grained representation (e.g. distinguishing blue and green vases), a random reward function...
I don't read everything that you write, and when I do read things there seems to be some amount of dropout that occurs resulting in me missing certain clauses (not just in long posts by you, even while proofreading the introduction section of a friend's paper draft!) that I don't notice until quizzed in detail -- I suspect this is partially due to me applying lossy compression that preserves my first guess about the gist of a paragraph, and maybe partially due to literal saccades while reading. The solution is repetition and redundancy: for example, I assume that you tried to do that in your quotes after the phrase "Let's go through some of my past comments about this", but only the quote implies to me that we're moving away from a state-based way of thinking, and it doesn't directly say anything about AUP.
Yes, this is fine and understandable. I wasn’t meaning to imply that responsible people should have thought of all these things, but rather pointing to different examples. I’ll edit my phrasing there. I had a feeling that there was some illusion of transparency (which is why I said “when I read it”), but I had no idea it was that strong. Good data point, thanks.
This isn't a full response, but it seems to me that Vika is largely talking about problems she perceives with impact measures in general, as defined by "measures of how much impact things have on the world", and is thinking of AUP as an element of this class (as would I, had I not read this comment). Reasons to think this include:

* A perception of your research as primarily being the development of AUP, and of this post as being research for that development and exposition.
* The introduction of AUP being in a post titled "Towards a New Impact Measure".

If AUP is not in fact about restricting an agent's impact on the world (or, in other words, on the state of the world), then I would describe it as something other than an "impact measure", since that term is primarily used by people using the way of thinking you denounce (and I believe was invented that way: it seems to have morphed from 'side effects', which strongly suggests effects on parts of the world, according to my quick looking-over of the relevant section of Concrete Problems in AI Safety). Perhaps "optimisation regularisation technique" would be better, although I don't presume to understand your way of thinking about it.
So the end result is this, but it doesn’t do it by considering impact to be a thing that happens to the state primarily, but rather to agents; impact not in the sense of “how different is the state”, but “how big of a deal is this to me?”. The objective is to limit the agent’s impact on us, which I think is the more important thing. I think this still falls under normal colloquial use of ‘impact’, but I agree that this is different from the approaches so far. I’m going to talk about this distinction quite a bit in the future.
There are various parts of your explanation that I find vague and could use a clarification on:

* "AUP is not about state" - what does it mean for a method to be "about state"? Same goes for "the direct focus should not be on the state" - what does "direct focus" mean here?
* "Overfitting the environment" - I know what it means to overfit a training set, but I don't know what it means to overfit an environment.
* "The long arms of opportunity cost and instrumental convergence" - what does "long arms" mean?
* "Wirehead a utility function" - is this the same as optimizing a utility function?
* "Cut out the middleman" - what are you referring to here?

I think these intuitive phrases may be a useful shorthand for someone who already understands what you are talking about, but since I do not understand, I have not found them illuminating. I sympathize with your frustration about the difficulty of communicating these complex ideas clearly. I think the difficulty is caused by the vague language rather than missing key ideas, and making the language more precise would go a long way.

I have a bit of time on my hands, so I thought I might try to answer some of your questions. Of course I can't speak for TurnTrout, and there's a decent chance that I'm confused about some of the things here. But here is how I think about AUP and the points raised in this chain:

  • "AUP is not about the state" - I'm going to take a step back, and pretend we have an agent working with AUP reasoning. We've specified an arcane set of utility functions (based on air molecule positions, well-defined human happiness, continued existence, whatever fits in the mathematical framework). Next we have an action A available, and would like to compute the impact of that action. To do this our agent would compare how well it would be able to optimize each of those arcane utility functions in the world where A was taken, versus how well it would be able to optimize these utility functions in the world where the rest action was taken instead. This is "not about state" in the sense that the impact is determined by the change in the ability for the agent to optimize these arcane utilities, not by the change in the world state. In the particular case where the
...
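The comparison described in that bullet can be written down as a toy sketch. This is a hypothetical tabular setting; `q_value`, the utility names, and the numbers below are all made up for illustration, not taken from any actual AUP implementation:

```python
def aup_penalty(state, action, aux_utilities, q_value, noop="no-op"):
    """Toy AUP-style penalty: for each auxiliary utility function, compare
    how well the agent could optimize it after taking `action` versus after
    the no-op action. The penalty tracks changes in the agent's abilities,
    not raw changes in the world state."""
    return sum(
        abs(q_value(u, state, action) - q_value(u, state, noop))
        for u in aux_utilities
    )

# Tiny illustration: an action that gains power shifts attainable utility
# for one of the auxiliary goals, and only that shift is penalized.
q_table = {
    ("u_vase", "press"): 0.2, ("u_vase", "no-op"): 0.2,
    ("u_power", "press"): 0.9, ("u_power", "no-op"): 0.5,
}
q_value = lambda u, state, action: q_table[(u, action)]
penalty = aup_penalty("s0", "press", ["u_vase", "u_power"], q_value)
print(penalty)  # ≈ 0.4, coming entirely from the power-gaining effect
```

The point of the sketch is that the state appears only indirectly, through the attainable values: two actions that change the state very differently but leave every auxiliary utility equally attainable would receive the same (zero) penalty.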
Rohin Shah
^ This is also how I interpret all of those statements. (Though I don't agree with all of them.) I also dislike the "overfitting the environment" phrase, though the underlying concept seems fine. If anything, the concept being pointed at is more analogous to distributional shift, since the idea is that the utility function works well in "normal" cases and not elsewhere.
Which do you disagree with?
Rohin Shah
I disagree that AUP-the-method is hugely different from RR-the-method; I agree that the explanations and stated intuitions are very different, but I don't think the switch from states to utility functions is as fundamental as you think it is. I think you could make the same arguments about opportunity cost / instrumental convergence about the variant of RR that penalizes both increases and decreases in reachability. Ignoring my dislike of the phrase, I don't agree that AUP is stopping you from "overfitting the environment" (the way I interpret the phrase, which I hope is the same as your interpretation, but who knows). I'd guess that your-vision-of AUP wildly overcompensates and causes you to seriously "underfit the environment", or rephrased in my language, it prevents you from executing most interesting plans, which happens to include the catastrophic plans but also includes the useful plans. If you tune hyperparameters so it no longer "underfits the environment" (alternatively, "allows for interesting plans"), then I expect it allows catastrophic plans. I continue to feel some apprehension about defining impact as opportunity cost and instrumental convergence, though I wouldn't say I currently disagree with it.
(I'm going to take a shot at this now because it's meta, and I think there's a compact explanation I can provide that hopefully makes sense.)

Suppose the theory of attainable utility is correct (i.e., we find things impactful when they change our ability to get what we want). Then whenever the theory of relative state reachability gets something right, you would be able to say "it's penalizing opportunity cost or instrumental convergence" post hoc, because that's why we find things impactful. You could say the same thing about instances of correct behavior by agents which use whitelisting, which I think we agree is quite different. In the world where attainable utility is correct, you would indeed observe that reachability is conceptually similar in some ways. The problem is that you can't actually use the opportunity cost/instrumental convergence arguments to predict RR behavior.

Here's an example, from the vantage point of you, a person. Choice A leads to a 180° rotation of a large, forever-inaccessible shell of the observable universe. Choice B leads to the ruination of the planet, excluding what we personally need to survive. The theory of relative state reachability says choice A is maximally impactful. Why? You can't reach anything like the states you could under inaction. How does this decision track with opportunity cost? Attainable utility says choice B is the bigger deal. You couldn't do anything with that part of the universe anyway, so it doesn't change much. This is the correct answer.

This scenario is important because it isn't just an issue with ontologies, or a situation designed to fool the exact formalism we provided. It's an illustration of where state reachability diverges from these notions.

A natural reply is: what about things that AUP penalizes that we don't find impactful, like an agent connecting to the Internet? The answer is that impact is being measured with respect to the agent itself (and Internet access is indeed impactful to the
Rohin Shah
We talked a bit off-forum, which helped clarify things for me.

Firstly, there's a difference between attainable utility theory (AU theory) and AUP-the-method. AU theory talks about how impact is about instrumental convergence and opportunity cost, and how that can be measured via thinking about how much utility the agent could attain. In particular, in AU theory "impact" is about how actions change your attainable utility according to the true utility function. AUP is a proposal for an impact regularization method, but it must deal with the fact that we don't know the true utility function, and so it forms an approximation by considering changes to the attainable utilities of a set of utility functions. Many of the claims are about AU theory and not about AUP. There isn't really an analogous "RR theory".

Another thing is that while both AUP and RR-with-penalties-on-increases would give large penalties to instrumentally convergent actions, it seems like for "regular" irreversible actions like painting a wall AUP would assign a much lower penalty than RR, so differentially AUP is penalizing instrumentally convergent actions more. This happens because utility functions tend to care about particular aspects of the state, rather than the entire state. Consider the action of moving in a direction: if the utility functions don't care about being further in that direction, there is no AUP penalty. In contrast, with RR, we will now be able to more easily access states in that direction, leading to at least some penalty. That said, it seems like you can get this benefit with RR by using a featurization of the state, which also causes you to only care about particular aspects of the state.
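The "moving in a direction" contrast can be made concrete with a toy sketch. Positions 0-2 lie on a line, and both penalty functions below are crude caricatures invented for this illustration, not the real methods:

```python
def aup_style_penalty(pos_before, pos_after, aux_utilities):
    # Penalize changes in attainable utility. If no auxiliary utility
    # cares about position, moving incurs no penalty at all.
    return sum(abs(u(pos_after) - u(pos_before)) for u in aux_utilities)

def rr_style_penalty(pos_before, pos_after, positions=(0, 1, 2)):
    # Penalize changes in which states are reachable (here, crudely:
    # a position counts as reachable if it is within one step).
    reachable = lambda pos, s: 1 if abs(pos - s) <= 1 else 0
    return sum(abs(reachable(pos_after, s) - reachable(pos_before, s))
               for s in positions)

vase_utility = lambda pos: 1.0  # cares about a vase, ignores position entirely

print(aup_style_penalty(0, 1, [vase_utility]))  # 0.0 -- no utility cares about position
print(rr_style_penalty(0, 1))                   # 1 -- position 2 just became reachable
```

The move changes which states are reachable, so the RR-style caricature assigns some penalty, while the AUP-style caricature assigns none because no utility function in the set cares about position.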
These are good questions. As I mentioned, my goal here isn’t to explain the object level, so I’m going to punt on these for now. I think these will be comprehensible after the sequence, which is being optimized to answer this in the clearest way possible.
Here's a potentially helpful analogy. Imagine I program a calculator. Although its computation is determined by the state of the solar system, the computation isn't "about" the state of the solar system.

Here's a relevant passage by Rohin (from Alignment Newsletter #49, March 2019):

On the topic of impact measures, I'll repeat what I've said before: I think that it's hard to satisfy the conjunction of three desiderata -- objectivity (no dependence on human values), safety (preventing any catastrophic outcomes) and usefulness (the AI system is still able to do useful things). Impact measures are very clearly aiming for the first two criteria, but usually don't have much to say about the third one. My expectation is that there is a strong tradeoff between the first two criteria and the third one, and impact measures have not dealt with this fact yet, but will have to at some point.

Other relevant writing of mine:

Comment on the AUP post

Comment on the desiderata post

But it's true that that quoted passage is the best summary of my current position. Daniel's answer is a good example of an underlying intuition that drives this position.

An event that is low-impact at moment T0 can have a large impact at moment T1. For example, if the AI doesn't break a vase at T0, that is a low-impact event; but if a human being later crashes his head on this vase at T1, then not breaking the vase at T0 becomes a very high-impact event for the person's wellbeing.

Moreover, if at T2 it turns out that the person who hit his head at T1 was the next Hitler, then preserving the vase at T0 becomes a low-impact event again, and so on. In other words, to calculate the real impact of an event we have to calculate its consequences until the end of eternity, which is impossible.

Also, long-term impact can't be calculated without knowing actual human values, so lowering impact is the same task as AI alignment, and the low-impact task can't replace the AI alignment task or be solved separately.

Moreover, if at T2 it turns out that the person who hit his head at T1 was the next Hitler, then preserving the vase at T0 becomes a low-impact event again

Why does it become a low-impact event in your eyes?

In other words, to calculate the real impact of an event we have to calculate its consequences until the end of eternity, which is impossible.

What is your model of the thing that people are doing in their heads when they say things are impactful or not?

long-term impact can't be calculated without knowing actual human values,
...
1. In my eyes, it is a low-impact event, as its three stages unfold as: the vase is preserved; the new Hitler is killed by the vase; human history doesn't change in a much worse direction. The alternative is: the vase is broken; the new Hitler is saved by the vase; human history changes in a much worse direction.
2. I think that people think of low impact as actions which have very small consequences outside given space-time (or other) limits. For example, a robot moves from A to B without breaking the vase.
3. From my point of view, impact is a long-term effect on human history, measured according to our values. Minimising near-term impact doesn't necessarily mean minimising long-term impact, because of the "butterfly effect", and this was the main idea of my comment.

On the process level: I would offer a bit of unsolicited advice about the method you used to generate reasons for pessimism. You (and others), might try it in future.

First of all, I strongly applaud the step of taking out a physical clock/timer and making a solid attempt at answering the question for yourself. Virtue points (and karma) awarded!

However, when I read your list, it's blatantly one-sided: you're only trying to generate reasons for pessimism, not reasons for optimism. This is not as bad as writing the bottom line, but generating arguments for only one side of a question biases your search.

Given this, one thing that I might do is first, spend 5 minutes generating the best arguments for (or concrete scenarios which inspire) pessimism about impact measures, then shift my mental orientation and spend 5 minutes generating arguments for (or concrete scenarios in which) impact measures seem promising.

But I wouldn't stop there. I would then spend 5 minutes (or as long as I need) looking over the first list and trying to generate counterarguments: reasons why the world probably isn't that way. Once I had done that, I would look over my new list of counterarguments, and try to generate counter-counterarguments, iterating until I either get stuck, or reach a sort of equilibrium where the arguments I've made are as strong as I can see how to make them.

Then I would go back to my second original list (the one with reasons for optimism) and do the same back and forth, generating counterarguments and counter-counterarguments, until I get stuck or reach equilibrium on that side.

At that point, I should have two lists of the strongest reasons I can muster, arguments in favor of pessimism and arguments in favor of optimism, both of which have been stress-tested by my own skepticism. I'd then compare both lists, and if any of the arguments invalidates or weakens another, I adjust them accordingly (there might be a few more rounds of back and forth).

At this point, I've really thoroughly considered the issue. Obviously this doesn't mean that I've gotten the right answer, or that I've thought of everything. But it does mean that, for all practical purposes, I've exhausted the low-hanging fruit of everything I can think of.

To recap...


0. Take a binary question.
1. Make the best case I can for one answer, giving whatever arguments, or ways the world would have to be, that support that outcome.
2. Similarly, make the best case I can for the other answer.
3. Take the reasoning for my first answer and generate counterarguments. Generate responses to those counterarguments. Iterate until I reach equilibrium.
4. Do the same for the reasoning for my second answer.
5. Compare my final arguments on both sides of the question, adjusting as necessary.

(This procedure is inspired by a technique that I originally learned from Leverage Research / Paradigm Academy. By their terminology, this procedure is called (the weak form of) Pyrrhonian skepticism, after the Greek philosopher Pyrrho (who insisted that knowledge was impossible, because there were always arguments on both sides of a question). I've also heard it referred to, more generally, as "alternate stories".)

Of course, this takes more time to do, and that time cost may or may not be worth it to you. Furthermore, there are certainly pieces of your context or thinking process that I'm missing. Maybe you, in fact, did part of this process. But this is an extended method to consider.

I no longer fully endorse this comment, though I recommend this procedure to anyone who think it sounds interesting.

Is there a short explanation why you no longer fully endorse?

Not really. Just that when I look at the text I wrote now, it seems a little hacky / not quite expressing the true spirit of the mental motions that seem useful to me.

It might still be a good procedure for bootstrapping into the right mental motions, though? I haven't done any testing on this one, so I don't know.

this procedure is called (the weak form of) Pyrrhonian skepticism

What's the strong form?

I believe the strong form is generating a counterargument for any proposition, and then concluding that all propositions are equally likely and therefore that knowledge is impossible.

From wikipedia:

The main principle of Pyrrho's thought is expressed by the word acatalepsia, which connotes the ability to withhold assent from doctrines regarding the truth of things in their own nature; against every statement its contradiction may be advanced with equal justification.

I don't recommend the strong form.

Ah, ok. I note that it may have been intended more as a meditative practice, since the goal appears to have been reaching a state of bliss, the epistemic practice being a means to that end. Practicing doubting everything could be an interesting meditation (though it could perhaps be dangerous).

I’m interested in learning about the intuitions, experience, and facts which inform this pessimism. As such, I’m not interested in making any arguments to the contrary in this post; any pushback I provide in the comments will be with clarification in mind.

I would prefer that you and/or others did push back, as I'm really curious which of the causes/reasons for pessimism actually stand up under such pushback. (See Four Layers of Intellectual Conversation and AI Safety via Debate.) I do appreciate that you prioritize just knowing what the causes/reasons are in the first place and don't want to discourage people from sharing them, so I wonder if there's a way to get both of what we want.

I do plan on pushing back on certain concerns, but I think if I did so now, some of my reasons for believing things would seem weird and complicated-enough-to-be-shaky because of inferential distance. The main pedagogical mistake I made with Towards a New Impact Measure wasn't putting too much in one post, but rather spending too much time on conclusions, telling people what I think happens without helping build in them the intuitions and insights which generate those results. Over the last 8 months, I think I've substantially enriched my model of how agents interact with their environments. I'm interested in seeing how many disagreements melt away when these new insights are properly shared and understood, and what people still disagree with me on. That's why I'm planning on waiting until my upcoming sequence to debate these points.

I am comfortable sharing those concerns which I have specific reason to believe don't hold up. However, I'm wary of dismissing them in a way that doesn't include those specific reasons. That seems unfair. If you're curious which ones I think these are, feel free to ask me over private message.