Oliver Sourbut

Call me Oliver or Oly - I don't mind which.

I'm particularly interested in sustainable collaboration and the long-term future of value. I'd love to contribute to a safer and more prosperous future with AI! I'm always interested in discussions about axiology, x-risks, and s-risks.

I'm currently (2022) just embarking on a PhD in AI in Oxford, and also spend time in (or in easy reach of) London. Until recently I was working as a senior data scientist and software engineer, and I've been doing occasional AI alignment research with SERI.

I enjoy meeting new perspectives and growing my understanding of the world and the people in it. I also love to read - let me know your suggestions! In no particular order, here are some I've enjoyed recently:

  • Ord - The Precipice
  • Pearl - The Book of Why
  • Bostrom - Superintelligence
  • McCall Smith - The No. 1 Ladies' Detective Agency (and series)
  • Melville - Moby-Dick
  • Abelson & Sussman - Structure and Interpretation of Computer Programs
  • Stross - Accelerando
  • Simsion - The Rosie Project (and trilogy)

Cooperative gaming is a relatively recent but fruitful interest for me. Here are some of my favourites:

  • Hanabi (can't recommend enough; try it out!)
  • Pandemic (ironic at time of writing...)
  • Dungeons and Dragons (I DM a bit and it keeps me on my creative toes)
  • Overcooked (my partner and I enjoy the foodie theme and frantic real-time coordination of playing this)

People who've got to know me only recently are sometimes surprised to learn that I'm a pretty handy trumpeter and hornist.


Breaking Down Goal-Directed Behaviour



This was a great read. Thanks in particular for sharing some introspection on motivation and thinking processes leading to these findings!

Two thoughts:

First, I sense that you're somewhat dissatisfied with using total variation distance ('average action probability change') as a quantitative measure of the impact of an intervention on behaviour. In particular, it doesn't weight 'meaningfulness', and important changes might get washed out by lots of small changes in unimportant cells. When we visualise, I think we intuitively do something richer, but in order to test at scale, visualisation becomes a bottleneck, so you need something quantitative like this. Perhaps you might get some mileage by considering the stationary distribution of the policy-induced Markov chain? It can be approximated by multiplying the transition matrix by itself a few times! Obviously that matrix is technically quadratic in the state count, but it's also very sparse :) so that might be relatively tractable given that you've already computed an NN forward pass for each state to get to this point. Or you could eigendecompose the transition matrix.
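The repeated-squaring idea can be sketched like this (a minimal illustration with a hypothetical 4-state policy-induced chain; I use dense NumPy for clarity, but for real state counts you'd want `scipy.sparse` matrices):

```python
import numpy as np

# Toy policy-induced Markov chain: P[s, s'] is the probability of moving
# from state s to s' under the fixed policy. (Hypothetical numbers.)
P = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.2, 0.8],
    [0.5, 0.0, 0.0, 0.5],
])

# Approximate the stationary distribution by repeated squaring:
# after k squarings M = P^(2^k), and every row converges to the
# stationary distribution for an ergodic chain.
M = P.copy()
for _ in range(10):  # M = P^1024
    M = M @ M

stationary = M[0]  # any row works once the chain has mixed
assert np.allclose(M[0], M[1], atol=1e-6)  # rows agree at convergence
print(stationary)  # weights states by long-run visit frequency
```

(The eigendecomposition route finds the same thing: the stationary distribution is the left eigenvector of the transition matrix with eigenvalue 1.)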

Second, this seems well-informed to me, but I can't really see the connection to (my understanding of) shard theory here, other than it being Team Shard! Maybe that'll be clearer in a later post.

Oh boy, this is terrifyingly familiar from my oncall days!

We call a subspace $U$ invariant under $T$ if, for all $u \in U$, $Tu \in U$,

should read

Let $U$ now specifically be a one-dimensional subspace of $V$ such that, for all $v \in V$, $U = \{\lambda v : \lambda \in \mathbb{F}\}$,

I think such a $U$ cannot exist in most cases, and it should instead read '... for some $v \in V$ ...'

The expression for $U$ is describing the span of the vector $v$, so certainly if $V$ is more than one-dimensional, if some subspace $U$ has this property for all $v \in V$ then it has this property for two linearly independent vectors in $V$, which is a contradiction.
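Spelling out the contradiction (on my reading, where the one-dimensional subspace is $U = \{\lambda v : \lambda \in \mathbb{F}\} = \operatorname{span}(v)$ — an assumption about the intended notation):

```latex
% Suppose U = span(v) for *all* v in V, with dim V > 1.
% Pick linearly independent v_1, v_2 in V. Then
U = \operatorname{span}(v_1) = \operatorname{span}(v_2)
\;\Longrightarrow\; v_1, v_2 \in U
\;\Longrightarrow\; \dim U \ge 2,
```

which contradicts $U$ being one-dimensional, so the quantifier can only be 'for some $v$'.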

This is great! I'll thread a few nits under this comment.

FWIW my experience of MATS 0.1 (i.e. the first run/pilot 2021-22) was that it was more open-ended and diversity-focused than subsequent MATS, which has been more apprenticeship-focused. That was helpful for me at the time, but I don't know if it was ever the intention per se, and I agree that the focus of MATS now is different. I haven't thought long enough to decide if this is good or bad.

This approach also makes lots of regularisation techniques transparent. Typically regularisation corresponds to applying some prior (over the weights/parameters of the model you're fitting). e.g. L2 norm aka ridge aka weight decay regularisation corresponds exactly to taking a Gaussian prior on the weights and finding the maximum a posteriori (MAP) estimate (rather than the maximum likelihood).
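As a concrete check (a sketch with synthetic data; `alpha` and `sigma2` are arbitrary illustrative values), the closed-form ridge solution coincides with a numerically-found MAP estimate under the corresponding Gaussian prior:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

alpha, sigma2 = 2.0, 0.1 ** 2

# Closed-form ridge solution: argmin ||y - Xw||^2 + alpha * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# Negative log-posterior under Gaussian likelihood y ~ N(Xw, sigma2)
# and Gaussian prior w ~ N(0, sigma2 / alpha), dropping constants:
def neg_log_posterior(w):
    return (np.sum((y - X @ w) ** 2) + alpha * np.sum(w ** 2)) / (2 * sigma2)

w_map = minimize(neg_log_posterior, np.zeros(3)).x
print(np.max(np.abs(w_ridge - w_map)))  # the two estimates coincide
```

Scaling the penalty by the noise variance is what makes the correspondence exact: the prior variance is `sigma2 / alpha`, so stronger regularisation means a tighter prior on the weights.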

(and 'self-replicating' for some reasonable operationalisation)

In short, I think ADS is available as a mechanism to the extent that the responses of a system can affect subsequent inputs to the system (technically this is always the case, but in practice the degree of effect varies enormously). This need not be a system subject to further training updates, though if it is, then depending on how those updates are generated, ADS behaviour may or may not be reinforced.
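A toy illustration of that mechanism (entirely hypothetical dynamics): a fixed, non-learning 'recommender' whose outputs still shift the distribution of its own future inputs:

```python
# A fixed policy: recommend content slightly more extreme than the
# user's current (hypothetical, scalar) preference state in [0, 1].
def recommend(preference):
    return min(1.0, preference + 0.05)

preference = 0.2
history = [preference]
for _ in range(50):
    shown = recommend(preference)
    # The user's state drifts toward what they are shown, shifting the
    # input distribution the system will see next - ADS with no
    # training updates to the system at all.
    preference += 0.5 * (shown - preference)
    history.append(preference)

print(history[0], history[-1])  # the input distribution has drifted upward
```

If this system *were* under online training, that drift would also change its gradients, which is where the two concepts start to interact.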

Gradient hacking was originally coined to mean deliberate, situationally aware influence over training updates. (ADS is one mechanism by which this could be achieved.)

The term 'gradient hacking' also seems to be commonly used to refer to any kind of system influence over training updates, whether situationally aware/deliberate or not. I think it's helpful to distinguish these, so I often say 'deliberate gradient hacking' to be explicit.

Yeah, I read the ADS paper(s) after writing this post. I think it's a useful framing: more 'selection-theorem'-ey, with less emphasis on deliberateness/purposefulness.

Additionally, I think there is another conceptual distinction worth attending to:

  • auto-induced distributional shift is about affecting environment to change inputs
    • the system itself might remain unchanging and undergo no further learning, and still qualify
  • gradient hacking is about changing environment/inputs/observations to change updates (gradients)
    • the system is presumed subject to updates, which it is taking (some amount of deliberate) influence over

In this post I wrote

I'm looking for deliberate behaviours which are intended to affect the fitness landscape of the outer process.

which I think rules out (hopefully!) contemporary recommender systems on the above two distinctions (as you gestured to regarding mesa-optimization).

In practice, for a system subject to online outer training, ADS changes the inputs which changes the training distribution, in fact causing some change in the updates to the system (perhaps even a large change!). But ADS per se doesn't imply these effects are deliberate, though again you might be able to say something selection-theorem-ey about this process if iterated. Indeed, a competent and deliberate gradient hacker might use means of ADS quite effectively.

None of this is to say that ADS is not a concern; I just think it's conceptually somewhat distinct!
