Economist at the Global Priorities Institute. Working and upskilling in AI strategy and AI safety. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism.

Jaime Sevilla from EpochAI wrote a literature review summarising the research and discussion on AI timelines. If you want to learn more, this might be a good place to start.

It seems like people disagree about how to think about the reasoning process within advanced AI/AGI, and this seems to be a crux for which research to focus on (please let me know if you disagree). E.g., one may argue AIs are (1) implicitly optimising a utility function (over states of the world), (2) explicitly optimising such a utility function, or (3) following a bunch of heuristics or something else entirely. 

What can I do to form better views on this? By default, I would read about Shard Theory and "consequentialism" (the AI safety term, not the moral philosophy one). 

Thanks for running this. 

My understanding is that reward misspecification and goal misgeneralisation are supposed to be synonyms for outer and inner alignment (?). I understand the problem of inner alignment to be about mesa-optimisation. I don't understand how the two papers on goal misgeneralisation fit in:


In both cases, there are no real mesa-optimisers. It is rather as if the base objective underdetermines a set of several possible goals and the agent latched onto one of them. This seems fairly obvious (especially as the optimisation pressure is not that high): if you underspecify the goal, then different goals can emerge. 

Why are these papers not called reward misspecification or an outer alignment failure?
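To make concrete what I mean by an underspecified goal, here is a toy Python sketch. It is entirely my own construction (loosely inspired by the coin-at-the-wall style examples; none of the names or numbers come from either paper): during training two candidate goals always coincide, so training reward cannot distinguish them, and they only come apart at deployment.

```python
import random

random.seed(0)

# Toy illustration (my own construction, not from either paper):
# in training, "get the coin" and "go right" always coincide,
# so an agent that internalises either proxy gets full reward.

def training_episode():
    # During training the coin always sits at the right wall.
    coin_x, wall_x = 10, 10
    return coin_x, wall_x

def deployment_episode():
    # At deployment the coin moves, and the two goals come apart.
    coin_x, wall_x = random.randint(0, 9), 10
    return coin_x, wall_x

def target_position(goal, coin_x, wall_x):
    # Where an agent with this internalised goal ends up.
    return wall_x if goal == "go right" else coin_x

# Behaviourally identical in training: no optimisation pressure
# distinguishes the two goals.
coin_x, wall_x = training_episode()
assert target_position("go right", coin_x, wall_x) == target_position("get coin", coin_x, wall_x)

# At deployment the "go right" agent still capably pursues *a* goal,
# just not the intended one.
coin_x, wall_x = deployment_episode()
assert target_position("go right", coin_x, wall_x) != coin_x
```

The point of the sketch is that nothing mesa-optimiser-like is needed: the training signal simply never separates the two goals.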

Thanks for writing this. An attempted extension: 

The likelihood of internally aligned models (IAM), corrigibly aligned models (CAM), and deceptively aligned models (DAM) also depends on the order in which different properties appear during the training process. 

A very hand-wavy example: 

  1. If the model starts to have a current objective when its understanding of the base objective is bad and its general reasoning/world modelling is also bad (so it can't do deceptive cognition), then SGD pushes the current objective towards the base objective -> IAM or CAM
  2. If the model starts to have a current objective when its understanding of the base objective is bad but its reasoning and general world modelling are good, then 
    1. SGD will either improve the understanding of the base objective (and will keep the unaligned current objective) -> DAM
      1. we would not see very clear warning shots (i.e. imperfect deception)
    2. or SGD will make the current objective closer to the real base objective -> IAM or CAM
    3. or it will do a combination
    4. I think DAM is more likely than IAM or CAM, but I feel confused.
  3. If the model starts to have a current objective when its understanding of the base objective is good but its general reasoning is bad, then SGD improves the general reasoning -> DAM
    1. we would see warning shots (imperfect deception)
  4. If the model starts to have a current objective when its understanding of the base objective is good and its general reasoning is good -> DAM (if deceptive reasoning is possible at all)

Knowing the order in which these properties develop during training would update me on the likelihood of deceptive alignment. 

Thanks for writing this. A question:

"Features as neurons is the more specific hypothesis that, not only do features correspond to directions, but that each neuron corresponds to a feature, and that the neuron's activation is the strength of that feature on that input."

Shouldn't it be "each feature corresponds to a neuron" rather than "each neuron corresponds to a feature"? 

Because some neurons could just be intermediate calculations on the way to higher-level features (part of a circuit). 
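To make my question concrete, a small NumPy sketch (my own toy example, not from the post): under features-as-neurons every feature sits on a basis direction, so a neuron's activation reads off one feature's strength; under the weaker features-as-directions hypothesis, a feature can be a non-basis direction, and reading neurons directly would then be misleading.

```python
import numpy as np

# Toy two-neuron activation space (my own example, not from the post).
# Features-as-neurons: each feature sits on a basis direction, so a
# neuron's activation directly reads off that feature's strength.
feature_a = np.array([1.0, 0.0])  # feature aligned with neuron 0
feature_b = np.array([0.0, 1.0])  # feature aligned with neuron 1

# Features-as-directions (the weaker hypothesis): a feature may be any
# unit direction, e.g. an intermediate quantity computed by a circuit.
feature_c = np.array([1.0, 1.0]) / np.sqrt(2)

# An activation caused purely by feature_c firing with strength 3.
activations = 3.0 * feature_c

# Reading neurons directly, both neurons look active even though only
# one feature is present; projecting onto the direction recovers it.
strength = activations @ feature_c
```

Here `strength` recovers 3.0, while both individual neuron activations are nonzero, which is why "each neuron corresponds to a feature" seems stronger than "each feature corresponds to a neuron".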

I just saw this now. I would be interested in joining a game if you run some in the coming weeks. Unfortunately, the Discord invite link has expired.