All of CharlotteS's Comments + Replies

How the frame of training stories improves my thinking: 

[Below are notes I wrote a few months ago after reading Evan Hubinger's training stories post. I reflected on how it helped me think more clearly about alignment. I list some ways it informs my thinking about how to evaluate and prioritise alignment proposals and other ways of reducing x-risk; these were not obvious to some people I talked with, so I want to share them here.]

In short, the training story post says that when thinking about the alignment of an AI system, you should have a t... (read more)

Jaime Sevilla from EpochAI wrote a literature review summarising the research and discussion on AI timelines. If you want to learn more, this might be a good place to start.

It seems like people disagree about how to think about the reasoning process within an advanced AI/AGI, and this is a crux for which research to focus on (please let me know if you disagree). E.g. one may argue that an AI is (1) implicitly optimising a utility function (over states of the world), (2) explicitly optimising such a utility function, or (3) following a bunch of heuristics or something else.

What can I do to form better views on this? By default, I would read about Shard Theory and "consequentialism" (the AI safety term, not the moral philosophy one).

My impression is that basically no one knows how reasoning works, so people either make vague statements (I don't know what shard theory is supposed to be; when I've looked briefly at it, it's either vague or obvious) or retreat to functional descriptions like "the AI follows a policy that achieves high reward", "the AI is efficient relative to humans", or "the AI pumps outcomes" (see e.g. here: []).

Thanks for running this. 

My understanding is that reward specification and goal misgeneralisation are supposed to be synonymous with outer and inner alignment (?). I understand the problem of inner alignment to be about mesa-optimisation. I don't understand how the two papers on goal misgeneralisation fit in:


(in both cases, there are no real mesa optimisers. It is just like the... (read more)

Thanks for writing this. An attempt of an extension: 

The likelihood of internally aligned models (IAM), CAM, and DAM also depends on the ordering in which different properties appear during the training process.

A very hand-wavy example: 

  1. If the model starts to have a current objective when its understanding of the base objective is bad and its general reasoning/world modelling is also bad (so it can't do deception cognition), then SGD pushes the current objective to change towards the base objective -> IAM or CAM
  2. If the model star
... (read more)

Thanks for writing this. A question:

Features as neurons is the more specific hypothesis that, not only do features correspond to directions, but that each neuron corresponds to a feature, and that the neuron’s activation is the strength of that feature on that input. 

Shouldn't it be "each feature corresponds to a neuron" rather than "each neuron corresponds to a feature"? 

Because some neurons could just be intermediate calculations on the way to higher-level features (part of a circuit).

Neel Nanda (4mo):
Fair point, corrected. IMO, the intermediate steps should mostly be counted as features in their own right, but it'd depend on the circuit. The main reason I agree is that neurons probably still do some other stuff, e.g. memory management or signal boosting earlier directions in the residual stream.

I just saw this now. I would be interested in joining a game if you run some in the next few weeks. Unfortunately, the Discord invite link has expired.

Daniel Kokotajlo (5mo):
Great! Email [], they'll know what's up. (I don't).