It seems like people disagree about how to think about the reasoning process within an advanced AI/AGI, and this is a crux for which research to focus on (please lmk if you disagree). E.g. one may argue AIs are (1) implicitly optimising a utility function (over states of the world), (2) explicitly optimising such a utility function, or (3) following a bunch of heuristics, or something else entirely.
What can I do to form better views on this? By default, I would read about Shard Theory and "consequentialism" (the AI safety term, not the moral philosophy one).
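To make the distinction concrete for myself, here is a toy sketch (purely illustrative, with an invented integer-state world and utility function, not anyone's actual proposal) contrasting an agent that explicitly optimises a utility function over world states with one that just follows heuristics:

```python
# Toy world: states are integers, actions shift the state by -1, 0, or +1.
ACTIONS = [-1, 0, 1]

def utility(state: int) -> float:
    """Invented utility over world states (peaks at state 10)."""
    return -abs(state - 10)

def explicit_optimiser(state: int) -> int:
    # (2) Explicit optimisation: evaluate the utility of each reachable
    # next state and pick the action that maximises it.
    return max(ACTIONS, key=lambda a: utility(state + a))

def heuristic_agent(state: int) -> int:
    # (3) A bundle of heuristics: if-then rules that happen to push towards
    # high-utility states without ever representing a utility function.
    if state < 10:
        return 1
    if state > 10:
        return -1
    return 0

# (1), implicit optimisation, would roughly be the claim that a learned policy
# behaves like explicit_optimiser even though no utility is computed inside it.
print(explicit_optimiser(3), heuristic_agent(3))  # both choose action +1
```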
Thanks for running this.
My understanding is that reward specification and goal misgeneralisation are supposed to be roughly synonymous with outer and inner alignment, respectively (?). I understand the problem of inner alignment to be about mesa-optimisation. I don't understand how the two papers on goal misgeneralisation fit in:
(in both cases, there are no real mesa-optimisers. It is just like the...
Thanks for writing this. An attempt at an extension:
The likelihood of internally aligned models (IAM), CAM, and DAM also depends on the order in which different properties appear during the training process.
A very hand-wavy example:
Thanks for writing this. A question:
Features as neurons is the more specific hypothesis that, not only do features correspond to directions, but that each neuron corresponds to a feature, and that the neuron’s activation is the strength of that feature on that input.
Shouldn't it be "each feature corresponds to a neuron" rather than "each neuron corresponds to a feature"?
Because some neurons could just be intermediate calculations on the way to higher-level features (part of a circuit).
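For my own clarity, here is a minimal sketch (in numpy, with made-up activation and direction values) of the difference between reading a feature off a single neuron versus off a direction in activation space:

```python
import numpy as np

# Hypothetical activations of a 4-neuron layer on one input (made-up numbers).
activations = np.array([0.2, 1.5, -0.3, 0.7])

# Features-as-directions: a feature is a direction in activation space, and its
# strength on this input is the projection of the activations onto that direction.
feature_direction = np.array([0.0, 0.8, 0.0, 0.6])  # made-up direction
strength_as_direction = activations @ feature_direction

# Features-as-neurons: the stronger claim that each neuron is itself a feature,
# so the feature's strength is simply that neuron's activation.
strength_as_neuron = activations[1]

print(strength_as_direction, strength_as_neuron)  # 1.62 vs 1.5
```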
I just saw this now. I would be interested in joining a game if you run any in the coming weeks. Unfortunately, the Discord invite link has expired.
How the frame of training stories improves my thinking:
[Below are notes I wrote a few months ago after reading the training stories post by Evan Hubinger. I reflected on how it helped me think more clearly about alignment. I list some ways it informs my thinking about how to evaluate and prioritise alignment proposals and other ways of reducing x-risk that were not obvious to some people I talked with, so I want to share them here.]
In short, the training stories post says that when thinking about the alignment of an AI system, you should have a t...