Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Speculation and an invitation to make further suggestions or clarifications

I've been wondering what might count as an analogous example of gradient hacking, if any has ever occurred, and whether it could be useful to collect such analogies (while being mindful not to overfit our expectations). Presumably the only examples so far will be from genetic natural selection (perhaps particularly in humans), and perhaps cultural/memetic selection (leaving aside the technicality of whether natural selection has a gradient per se).

I'm looking for deliberate behaviours which are intended to affect the fitness landscape of the outer process.

In biology

Sexual selection

I think examples of sexual selection are adjacent but don't qualify as gradient hacking. While they constitute an execution of a policy whose main legible effect is indeed on natural selection, presumably in nearly all cases there is not a deliberate process of influence on the natural selection process as such. The policies themselves are encoded by genetic natural selection for the most part, so I'd say this qualifies more as one kind of divergent training trajectory of the outer training process. There remains a plausible analogy that at least some animals and humans engaging in sexual selection are doing so 'because' they are inner misaligned (they are agents with preferences which are mere proxies for the 'true goals' of genetic natural selection and sexual selection is just one such proxy preference).

Examples of sexual selection where the preferences are socially learned seem closer, and there are examples in animals and humans. Here the policy is adapted 'at runtime': the genetic policy might be something like 'choose a mate which my society codes as attractive', which incidentally presumably encourages other genetic policies like 'try to conform to my society's attractiveness code' and maybe even 'try to nudge my society's attractiveness code towards me and my kin'. Even so, it is still not carried out with the deliberate runtime intention of affecting the outer process of natural selection. It is a deliberate action which does in fact have this side effect, but the intended goal is the proxy.


Perhaps the first and only qualifying attempts at actual gradient hacking are efforts towards eugenics practiced by humans, up to and including some instances of (attempted or successful) genocide. Probably since prehistory and certainly since antiquity we've had some 'mesa'/'runtime' understanding of heritability, in contrast to (presumably) all other animals. Most such behaviours are deliberative and involve explicit modelling of the process of heritability, with the stated intention (at least nominally) being to affect the trajectory of the natural selection process (stated in antiquity in very crude but still essentially-correct terms).

What if (I think this is not very credible but at least plausible) a sufficient cause of all eugenics attempts is in fact a genetically-naturally-selected policy schema of 'come up with whatever excuses you can to favour (the success/reproduction/existence of) your kin'? (And the rest of the fluff is just instrumental persuasion and implementation attempts.) Would eugenics be downgraded all the way from being gradient hacking to 'mere' proxy alignment?

In society

This gets more speculative.


What if we adapt the previous hypothesis about genetic eugenics to... eumemics? It is harder to locate the object of memetic selection, and I'm not sure if it's right to identify it with individual organisms the way we can often sloppily get away with doing for biological genetic natural selection. Maybe it is hard to call humans 'inner misaligned with respect to memetic natural selection' with a straight face. But if so, do 'eumemic' attempts qualify as instances of gradient hacking?

If a meme or meme-complex encodes a behaviour of deliberately affecting the meme fitness landscape in ways unrelated to the particular meme(plex), does that qualify as gradient hacking? If so, could boycotts, cancellation, some reading of basically every ethical theory, and many other ideologies besides qualify?

Same question as above: what if such hackery is in fact just the side-effect of memetic natural selection acting on memeplexes which tend to encode behaviours promoting themselves (with a side-effect of also affecting the meme fitness landscape in other ways)? If so, is this simply proxy alignment rather than gradient hacking?



Affecting 'someone else's gradient'

A case which didn't make the shortlist, but perhaps domestication counts?

It's a deliberate attempt at affecting the (best understanding of the) outer adaptation process. But in the case of domestication, it's targeted primarily at the outer natural selection process of a different lineage. Of course the lineages interact, meaning it does affect the outer natural selection process of the self lineage, but that's not the main legible effect, nor presumably the intended one.

A more modern and 'competent' example might be the (proposed) use of artificial gene drives to perturb an existing genetic population. Again this acts on a different lineage primarily.

Might recommender systems provide a similar phenomenon? The basic argument is that, instead of fulfilling the user's current preferences, they can shape the user's preferences to make the task easier, thus shaping their gradient to better achieve their goal. I haven't read enough about the inner alignment discussion of gradient hacking to fully understand the points of disanalogy, but at least one difference is that there is no mesa-optimizer in the recommender systems. Curious to hear your thoughts. 

Inspired by these two recent papers:

Yeah, I read the ADS paper(s) after writing this post. I think it's a useful framing, more 'selection theorem'-ey and with less emphasis on deliberateness/purposefulness.

Additionally, I think there is another conceptual distinction worth attending to:

  • auto-induced distributional shift is about affecting environment to change inputs
    • the system itself might remain unchanging and undergo no further learning, and still qualify
  • gradient hacking is about changing environment/inputs/observations to change updates (gradients)
    • the system is presumed subject to updates, which it is taking (some amount of deliberate) influence over
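
To make the first bullet concrete, here is a minimal toy sketch of ADS in a system that undergoes no learning at all. Everything in it is an assumption made for illustration rather than something taken from the ADS papers: the function name `simulate_fixed_policy`, the scalar user preference `pref`, the constant exposure rate `show_a`, and the linear `drift` rule are all hypothetical.

```python
def simulate_fixed_policy(steps=50, show_a=0.9, drift=0.1, pref=0.5):
    """Toy ADS with a frozen system (all names and dynamics are assumptions).

    The system recommends item A at a constant rate `show_a` and is never
    updated. The user's preference for A (`pref`, the system's input) drifts
    toward whatever the user is exposed to.
    """
    history = [pref]
    for _ in range(steps):
        # Expected exposure effect: preference moves a fraction `drift`
        # of the way toward the exposure rate on every step.
        pref = pref + drift * (show_a - pref)
        history.append(pref)
    return history


if __name__ == "__main__":
    trajectory = simulate_fixed_policy()
    print("initial preference:", trajectory[0])
    print("final preference:", round(trajectory[-1], 4))
```

In this sketch the input distribution moves steadily toward the exposure rate even though there are no parameters being trained, so there is no gradient or update for anything to 'hack'.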

In this post I wrote

I'm looking for deliberate behaviours which are intended to affect the fitness landscape of the outer process.

which I think rules out (hopefully!) contemporary recommender systems on the above two distinctions (as you gestured to regarding mesa-optimization).

In practice, for a system subject to online outer training, ADS changes the inputs which changes the training distribution, in fact causing some change in the updates to the system (perhaps even a large change!). But ADS per se doesn't imply these effects are deliberate, though again you might be able to say something selection-theorem-ey about this process if iterated. Indeed, a competent and deliberate gradient hacker might use means of ADS quite effectively.
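
A toy sketch of that feedback loop, under loudly stated assumptions: the engagement objective, the linear preference `drift`, and the function name `online_training_with_ads` are all hypothetical choices for illustration, not anything from the ADS papers. The point is only that the shifted inputs feed into the updates without any deliberateness on the system's part.

```python
def online_training_with_ads(steps=200, lr=0.05, drift=0.1, show_a=0.6, pref=0.5):
    """Toy online training where induced input shift drives the updates.

    Assumed objective (hypothetical): expected engagement
        E = show_a * pref + (1 - show_a) * (1 - pref)
    so dE/dshow_a = 2 * pref - 1. The system has no model of its own
    training process; it simply receives these updates.
    """
    updates = []
    for _ in range(steps):
        # Input shift: the user's preference drifts toward current exposure.
        pref += drift * (show_a - pref)
        # Outer update: one gradient ascent step on expected engagement,
        # with the exposure rate clipped to the valid range [0, 1].
        grad = 2.0 * pref - 1.0
        show_a = min(1.0, max(0.0, show_a + lr * grad))
        updates.append(grad)
    return show_a, pref, updates


if __name__ == "__main__":
    with_shift = online_training_with_ads()
    without_shift = online_training_with_ads(drift=0.0)
    print("with induced shift:    show_a=%.3f pref=%.3f" % with_shift[:2])
    print("without induced shift: show_a=%.3f pref=%.3f" % without_shift[:2])
```

With `drift=0.0` the gradient is exactly zero at every step in this toy, so every update the system receives in the default run is caused by the shift it induced. Nothing here is deliberate, which is why I'd call it ADS rather than gradient hacking.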

None of this is to say that ADS is not a concern, I just think it's conceptually somewhat distinct!

In short, I think ADS is available as a mechanism to the extent that the responses of a system can affect subsequent inputs to the system (technically this is always, but in practice the degree of effect varies enormously). This need not be a system subject to further training updates, though if it is, depending on how those updates are generated, ADS behaviour may or may not be reinforced.

Gradient hacking was originally coined to mean deliberate, situationally aware influence over training updates. (ADS is one mechanism by which this could be achieved.)

The term 'gradient hacking' also seems to be commonly used to refer to any kind of system influence over training updates, whether situationally aware/deliberate or not. I think it's helpful to distinguish these, so I often say 'deliberate gradient hacking' to make sure.

Probably since prehistory and certainly since antiquity we've had some 'mesa'/'runtime' understanding of heritability, in contrast to (presumably) all other animals.

No, not so much. See e.g.

Like anything else, the idea of “breeding” had to be invented. That traits are genetically-influenced broadly equally by both parents subject to considerable randomness and can be selected for over many generations to create large average population-wide increases had to be discovered the hard way, with many wildly wrong theories discarded along the way. Animal breeding is a case in point, as reviewed by an intellectual history of animal breeding, Like Engend’ring Like, which covers mistaken theories of conception & inheritance from the ancient Greeks to perhaps the first truly successful modern animal breeder, Robert Bakewell (1725–1795).

Why did it take thousands of years to begin developing useful animal breeding techniques, a topic of interest to almost all farmers everywhere, a field which has no prerequisites such as advanced mathematics or special chemicals or mechanical tools, and seemingly requires only close observation and patience? ... What is most interesting is the intellectual history we can extract from it in terms of inventing heritability and as important, one of the inventions of progress in the gradual realization that selective breeding was even possible.

That's a very interesting link, thank you! I suppose my reply would be that I don't claim that any of these attempts are particularly competent, merely that they qualify as (incomplete) recognition of an outer adaptation process and deliberate attempts at hacking it.

This is more fiction, but in The Man in the High Castle there is the character of Thomas Smith.

I think you are handling the case of pressing other groups down with mass murder, but that would be "just" "might makes right". One of the more frightening aspects is turning that pruning inwards as self-eugenics, conceptualised as a favour to your in-group. If all eugenics were a "mere" proxy, then one would expect it to be abandoned whenever it strongly cut against the more actual goals, i.e. the moment eugenics called for hindering your kin, you would drop it. A large part of the drama in the TV series comes from different moral intuitions pulling in different directions and, I guess, from repeatedly exploring in more and more detail how f'ed up that state's ideology is.

For the big plot points relevant for this (spoilers for Man In the High Castle):

Thomas Smith becomes a saint for having a terminal case of internalised ableism. At least for subkin promotion it did not get dropped. A case can be made that it makes sense for promoting his siblings. The issue for conceptual analysis is whether the motivations of the people involved were grounded.

I'd argue almost none right now. In many ways, total gradient hacking was mostly avoided because the 20th century broke the idea of improving the human species. In other words, we voluntarily forwent capability increases to prevent what we see as bad things.

Had the 20th century gone differently, gradient hacking/genetic engineering would happen in the 21st century.