Economist at the Global Priorities Institute. Working and upskilling in AI strategy and AI safety. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism.

Jaime Sevilla from EpochAI wrote a literature review summarising the research and discussion on AI timelines. If you want to learn more, this might be a good place to start.

It seems like people disagree about how to think about the reasoning process within advanced AI/AGI, and this seems to be a crux for which research to focus on (please let me know if you disagree). E.g., one may argue AIs are (1) implicitly optimising a utility function (over states of the world), (2) explicitly optimising such a utility function, or (3) following a bunch of heuristics or something else entirely. 

What can I do to form better views on this? By default, I would read about Shard Theory and "consequentialism" (the AI safety term, not the moral philosophy one). 

Thanks for running this. 

My understanding is that reward misspecification and goal misgeneralisation are supposed to be synonyms for outer and inner alignment (?). I understand the problem of inner alignment to be about mesa-optimisation. I don't understand how the two papers on goal misgeneralisation fit in:


In both cases, there are no real mesa-optimisers. It is rather as if the base objective underdetermines a set of several possible goals and the agent latched onto one of them. This seems fairly obvious (especially as the optimisation pressure is not that high): if you underspecify the goal, then different goals can emerge. 

Why are these papers not called reward misspecification or an outer alignment failure?
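To make concrete what I mean by an underspecified goal, here is a toy Python sketch. It is entirely my own construction (loosely inspired by the coin-at-the-wall style examples; none of the names or numbers come from either paper): during training two candidate goals always coincide, so training reward cannot distinguish them, and they only come apart at deployment.

```python
import random

random.seed(0)

# Toy illustration (my own construction, not from either paper):
# in training, "get the coin" and "go right" always coincide,
# so an agent that internalises either proxy gets full reward.

def training_episode():
    # During training the coin always sits at the right wall.
    coin_x, wall_x = 10, 10
    return coin_x, wall_x

def deployment_episode():
    # At deployment the coin moves, and the two goals come apart.
    coin_x, wall_x = random.randint(0, 9), 10
    return coin_x, wall_x

def target_position(goal, coin_x, wall_x):
    # Where an agent with this internalised goal ends up.
    return wall_x if goal == "go right" else coin_x

# Behaviourally identical in training: no optimisation pressure
# distinguishes the two goals.
coin_x, wall_x = training_episode()
assert target_position("go right", coin_x, wall_x) == target_position("get coin", coin_x, wall_x)

# At deployment the "go right" agent still capably pursues *a* goal,
# just not the intended one.
coin_x, wall_x = deployment_episode()
assert target_position("go right", coin_x, wall_x) != coin_x
```

The point of the sketch is that nothing mesa-optimiser-like is needed: the training signal simply never separates the two goals.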

Thanks for writing this. An attempted extension: 

The likelihood of internally aligned models (IAM), corrigibly aligned models (CAM), and deceptively aligned models (DAM) also depends on the order in which different properties appear during the training process. 

A very hand-wavy example: 

  1. If the model starts to have a current objective when its understanding of the base objective is bad and its general reasoning/world modelling is also bad (so it can't do deceptive cognition), then SGD pushes the current objective towards the base objective -> IAM or CAM
  2. If the model starts to have a current objective when its understanding of the base objective is bad but its reasoning and general world modelling are good, then 
    1. SGD will either improve the understanding of the base objective (and will keep the unaligned current objective) -> DAM
      1. we would not see very clear warning shots (i.e. imperfect deception)
    2. or SGD will make the current objective closer to the real base objective -> IAM or CAM
    3. or it will do a combination
    4. I think DAM is more likely than IAM or CAM, but I feel confused.
  3. If the model starts to have a current objective when its understanding of the base objective is good but its general reasoning is bad, then SGD improves the general reasoning -> DAM
    1. we would see warning shots (imperfect deception)
  4. If the model starts to have a current objective when its understanding of the base objective is good and its general reasoning is good -> DAM (if deceptive reasoning is possible at all)

Knowing the order in which these properties develop during training would update me on the likelihood of deceptive alignment. 

Thanks for writing this. A question:

"Features as neurons is the more specific hypothesis that, not only do features correspond to directions, but that each neuron corresponds to a feature, and that the neuron's activation is the strength of that feature on that input."

Shouldn't it be "each feature corresponds to a neuron" rather than "each neuron corresponds to a feature"? 

Because some neurons could just be intermediate calculations on the way to higher-level features (part of a circuit). 
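To make my question concrete, a small NumPy sketch (my own toy example, not from the post): under features-as-neurons every feature sits on a basis direction, so a neuron's activation reads off one feature's strength; under the weaker features-as-directions hypothesis, a feature can be a non-basis direction, and reading neurons directly would then be misleading.

```python
import numpy as np

# Toy two-neuron activation space (my own example, not from the post).
# Features-as-neurons: each feature sits on a basis direction, so a
# neuron's activation directly reads off that feature's strength.
feature_a = np.array([1.0, 0.0])  # feature aligned with neuron 0
feature_b = np.array([0.0, 1.0])  # feature aligned with neuron 1

# Features-as-directions (the weaker hypothesis): a feature may be any
# unit direction, e.g. an intermediate quantity computed by a circuit.
feature_c = np.array([1.0, 1.0]) / np.sqrt(2)

# An activation caused purely by feature_c firing with strength 3.
activations = 3.0 * feature_c

# Reading neurons directly, both neurons look active even though only
# one feature is present; projecting onto the direction recovers it.
strength = activations @ feature_c
```

Here `strength` recovers 3.0, while both individual neuron activations are nonzero, which is why "each neuron corresponds to a feature" seems stronger than "each feature corresponds to a neuron".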

I just saw this now. I would be interested in joining a game if you run some in the coming weeks. Unfortunately, the Discord invite link has expired.