Oliver Sourbut
  • Autonomous Systems @ UK AI Safety Institute (AISI)
  • DPhil AI Safety @ Oxford (Hertford College, CS dept, AIMS CDT)
  • Former senior data scientist and software engineer + SERI MATS

I'm particularly interested in sustainable collaboration and the long-term future of value. I'd love to contribute to a safer and more prosperous future with AI! Always interested in discussions about axiology, x-risks, s-risks.

I enjoy meeting new perspectives and growing my understanding of the world and the people in it. I also love to read - let me know your suggestions! In no particular order, here are some I've enjoyed recently:

  • Ord - The Precipice
  • Pearl - The Book of Why
  • Bostrom - Superintelligence
  • McCall Smith - The No. 1 Ladies' Detective Agency (and series)
  • Melville - Moby-Dick
  • Abelson & Sussman - Structure and Interpretation of Computer Programs
  • Stross - Accelerando
  • Simsion - The Rosie Project (and trilogy)

Cooperative gaming is a relatively recent but fruitful interest for me. Here are some of my favourites:

  • Hanabi (can't recommend enough; try it out!)
  • Pandemic (ironic at time of writing...)
  • Dungeons and Dragons (I DM a bit and it keeps me on my creative toes)
  • Overcooked (my partner and I enjoy the foody themes and frantic realtime coordination playing this)

People who've got to know me only recently are sometimes surprised to learn that I'm a pretty handy trumpeter and hornist.


Breaking Down Goal-Directed Behaviour



Good old Coase! Thanks for this excellent explainer.

In contrast, if you think the relevant risks from AI look like people using their systems to do some small amounts of harm which are not particularly serious, you'll want to hold the individuals responsible for these harms liable and spare the companies.

Or (thanks to Coase), we could have two classes of harm, with 'big' arbitrarily defined as, I don't know, say $500m (a number I definitely just made up), and put liability for big harms on the big companies, while letting the classic societal apparatus for small harms tick over as usual? Surely only a power-crazed bureaucrat would suggest such a thing! (Of course this is prone to litigation over whether particular harms are one big harm or n smaller harms, or whether damages really were half a billion or actually $499m or whatever, but it's a good start.)

I like this decomposition!

I think 'Situational Awareness' can quite sensibly be further divided up into 'Observation' and 'Understanding'.

The classic control loop of 'observe', 'understand', 'decide', 'act'[1], is consistent with this discussion, where 'observe'+'understand' here are combined as 'situational awareness', and you're pulling out 'goals' and 'planning capacity' as separable aspects of 'decide'.
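To make the factoring concrete, here is a minimal sketch of that loop in Python. Everything in it is a hypothetical illustration rather than anything from the post: the `Agent` class, its field names, and the toy thermostat are my own assumptions. 'Situational awareness' is `observe` plus `understand`, and 'decide' is split into a separate `goal` and a `plan` function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Agent:
    # Hypothetical decomposition; all names are illustrative assumptions.
    observe: Callable[[float], float]      # world state -> raw observation
    understand: Callable[[float], float]   # observation -> state estimate
    plan: Callable[[float, float], float]  # (estimate, goal) -> action
    goal: float                            # the target, separable from the planner

    def step(self, world: float) -> float:
        observation = self.observe(world)        # 'observe'
        estimate = self.understand(observation)  # 'understand'
        action = self.plan(estimate, self.goal)  # 'decide' = goal + planning
        return world + action                    # 'act' on a toy scalar world


def run(agent: Agent, world: float, steps: int) -> float:
    for _ in range(steps):
        world = agent.step(world)
    return world


# Toy thermostat: perfect sensing, proportional controller towards the goal.
thermostat = Agent(
    observe=lambda w: w,
    understand=lambda o: o,
    plan=lambda estimate, goal: 0.5 * (goal - estimate),
    goal=20.0,
)
final = run(thermostat, world=10.0, steps=20)
```

Swapping in a different `goal` or `plan` here is trivially cheap, which is exactly the property that is unlikely to hold so cleanly for co-adapted components in real systems.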

Are there some difficulties with factoring?

Certain kinds of situational awareness are more or less fit for certain goals. And further, the important 'really agenty' thing of making plans to improve situational awareness does mean that 'situational awareness' is quite coupled to 'goals' and to 'implementation capacity' for many advanced systems. Doesn't mean those parts need to reside in the same subsystem, but it does mean we should expect arbitrary mix and match to work less well than co-adapted components - hard to say how much less (I think this is borne out by observations of bureaucracies and some AI applications to date).

  1. Terminology varies a lot; this is RL-ish terminology. Classic analogues might be 'feedback', 'process model'/'inference', 'control algorithm', 'actuate'/'affect'... ↩︎

the original 'theorem' was wordcelled nonsense

Lol! I guess if there was a more precise theorem statement in the vicinity gestured, it wasn't nonsense? But in any case, I agree the original presentation is dreadful. John's is much better.

I would be curious to hear a *precise* statement why the result here follows from the Good Regulator Theorem.

A quick go at it, might have typos.

Suppose we have

  • (hidden) state
  • output/observation

and a predictor

  • (predictor) state
  • predictor output
  • the reward or goal or what have you (some way of scoring 'was right?')

with structure

Then GR trivially says the predictor state should model the posterior over the hidden state given the observation.

Now if these are all instead processes (time-indexed), we have HMM

  • (hidden) states
  • observations

and predictor process

  • (predictor) states
  • predictions
  • rewards

with structure

Drawing together as the 'goal', we have a GR motif

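so each predictor state must model the posterior over the current hidden state given its inputs; by induction that is the posterior given the whole observation history so far.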

I guess my question would be 'how else did you think a well-generalising sequence model would achieve this?' Like, what is a sufficient world model but a posterior over HMM states in this case? This is what GR theorem asks. (Of course, a poorly-fit model might track extraneous detail or have a bad posterior.)
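For concreteness, the 'by induction' step is just the standard recursive Bayes filter (the HMM forward algorithm with normalisation). Below is a minimal pure-Python sketch. The toy transition matrix `T`, emission matrix `E`, `PRIOR`, and the function names are all made-up assumptions for illustration, not anything from the OP. The point is that the belief after each observation is computed from the previous belief and the new observation alone, so a predictor that carries this vector forward has everything it needs.

```python
def belief_update(belief, obs, T, E):
    """One filter step: push the belief through T, then condition on obs.

    Assumed conventions: T[i][j] = P(next state j | current state i),
    E[j][o] = P(observation o | state j).
    """
    n = len(belief)
    predicted = [sum(belief[i] * T[i][j] for i in range(n)) for j in range(n)]
    unnormalised = [predicted[j] * E[j][obs] for j in range(n)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def filter_sequence(prior, observations, T, E):
    """Posterior over the current hidden state given all observations so far."""
    belief = list(prior)
    for obs in observations:
        belief = belief_update(belief, obs, T, E)
    return belief


# Toy two-state, two-symbol HMM (numbers are arbitrary).
T = [[0.9, 0.1],
     [0.2, 0.8]]
E = [[0.7, 0.3],
     [0.1, 0.9]]
PRIOR = [0.5, 0.5]

posterior = filter_sequence(PRIOR, [0, 1], T, E)
```

The recursion never looks back at earlier observations, only at the previous belief, which is the sense in which the belief state is a sufficient 'world model' for this prediction task.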

From your preamble and your experiment design, it looks like you correctly anticipated the result, so this should not have been a surprise (to you). In general I object to being sold something as surprising which isn't (it strikes me as a lesser-noticed and perhaps oft-inadvertent rhetorical dark art and I see it on the rise on LW, which is sad).

That said, since I'm the only one objecting here, you appear to be more right about the surprisingness of this!

The linear probe is new news (but not surprising?) on top of GR, I agree. But the OP presents the other aspects as the surprises, and not this.

Nice explanation of MSP and good visuals.

This is surprising!

Were you in fact surprised? If so, why? (This is a straightforward consequence of the good regulator theorem[1].)

In general I'd encourage you to carefully track claims about transformers, HMM-predictors, and LLMs, and to distinguish between trained NNs and the training process. In this writeup, all of these are quite blended.

  1. John has a good explication here ↩︎

Incidentally I noticed Yudkowsky uses 'brainware' in a few places (e.g. in conversation with Paul Christiano). But it looks like that's referring to something more analogous to 'architecture and learning algorithms', which I'd put more in the 'software' camp when it comes to the taxonomy I'm pointing at (the 'outer designer' is writing it deliberately).

Unironically, I think it's worth anyone interested skimming that Verma & Pearl paper for the pictures :) especially fig 2

Mmm, I misinterpreted at first. It's only a v-structure if the two parent nodes are not themselves connected. So this is a property which needs to be maintained effectively 'at the boundary' of the fully-connected cluster which we're rewriting. I think that tallies with everything else, right?

ETA: both of our good proofs respect this rule; the first Reorder in my bad proof indeed violates it. I think this criterion is basically the generalised and corrected version of the fully-connected bookkeeping rule described in this post. I imagine if I/someone worked through it, this would clarify whether my handwave proof of Frankenstein Stitch is right or not.

That's concerning. It would appear to make both our proofs invalid.

But I think your earlier statement about incoming vs outgoing arrows makes sense. Maybe Verma & Pearl were asking for some other kind of equivalence? Grr, back to the semantics I suppose.

[This comment is no longer endorsed by its author]

Aha. Preserving v-structures (colliders whose two parents are not directly connected) is necessary and sufficient for equivalence[1]. So when rearranging fully-connected subgraphs, certainly we can't do it (cost-free) if it introduces or removes any v-structures.

Plausibly if we're willing to weaken by adding in additional arrows, there might be other sound ways to reorder fully-connected subgraphs - but they'd be non-invertible. Haven't thought about that.
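A quick way to sanity-check a proposed rewrite is to mechanise the criterion. This is a minimal sketch, assuming the usual statement of the Verma & Pearl result (two DAGs over the same nodes are equivalent iff they share a skeleton and a set of v-structures). The function names and the edge-list representation are my own; it assumes the inputs really are DAGs over the same node set, and it ignores isolated nodes.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of the graph: a set of unordered node pairs."""
    return {frozenset(edge) for edge in edges}


def v_structures(edges):
    """All colliders a -> c <- b where a and b are not adjacent.

    `edges` is an iterable of (parent, child) pairs with sortable node names.
    """
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, ps in parents.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found


def markov_equivalent(edges1, edges2):
    """Same skeleton and same v-structures."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))


chain = [("A", "B"), ("B", "C")]       # A -> B -> C
fork = [("B", "A"), ("B", "C")]        # A <- B -> C
collider = [("A", "B"), ("C", "B")]    # A -> B <- C
triangle = [("A", "B"), ("A", "C"), ("B", "C")]
triangle_reordered = [("C", "B"), ("C", "A"), ("B", "A")]
```

With these, `markov_equivalent(chain, fork)` holds while `markov_equivalent(chain, collider)` does not. The two orderings of the fully-connected triangle come out equivalent, since every pair of parents is adjacent and so no v-structure can appear inside the cluster; it is only arrows crossing the cluster's boundary that can create or destroy one.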

  1. Verma & Pearl, Equivalence and Synthesis of Causal Models 1990 ↩︎

