I've been thinking of Case 2. It seems harder to establish "capable of distinguishing between situations where the user wants A vs B" on individual examples since a random classifier would let you cherrypick some cases where this seems possible without the model really understanding. Though you could talk about individual cases as examples of Case 2. Agree that there's some implicit "all else being equal" condition, I'd expect currently it's not too likely to change conclusions. Ideally you'd just have the category A="best answer according to user" B="all answers that are worse than the best answer according to the user" but I think it's simpler to analyze more specific categories.
Link to contractor instructions implied in "You can read the instructions given to our contractors here" is missing.
I don't think all work of that form would measure misalignment, but some work of that form might, here's a description of some stuff in that space that would count as measuring misalignment.
Let A be some task (e.g. add 1 digit numbers), B be a task that is downstream of A (to do B, you need to be able to do A, e.g. add 3 digit numbers), M is the original model, M1 is the model after finetuning.
If the training on a downstream task was minimal, so we think it's revealing what the model knew before finetuning rather than adding knew knowledge, then better performance of M1 than M on A would demonstrated misalignment (don't have a precise definition of what would make finetuning minimal in this way, would be good to have a clearer criteria for that).
If M1 does better on B after finetuning in a way that implicitly demonstrates better knowledge of A, but does not do better on A when asked to do it explicitly, that would demonstrate that the finetuned M1 is misaligned (I think we might expect some version of this to happen by default though, since M1 might overfit to only doing tasks of type B. Maybe if you have a training procedure where M1 generally doesn't get worse at any tasks then I might hope that it would get better on A and be disappointed if it doesn't).
Even better than "Getting models to explain why they’re doing what they’re doing in simpler terms that connect to things the human overseers understand" would be getting models to actually do the task in ways that are simpler and connect to things that human overseers understand. E.g. if a model can solve a task in multiple steps by looking up relevant information by doing internet searches that are recorded and readable by the overseer instead of using knowledge opaquely measured in the weights, that seems like a step in the right direction.
One easy way to make people who can't solve the task for sandwiching is to take people who could solve the task and then give them insufficient time to solve it, or have them be uninformed of some relevant facts about the specific task they are trying to solve.
A simpler way to measure whether you are making progress towards sandwiching if you can't go there directly is to look at whether you can get people to provide better supervision with your tool than without your tool, that is accomplishing more on the task.
Both of these approaches feel like they aren't quite solving the whole problem, because ultimately we want systems that help humans supervise tasks where they haven't developed the right concepts, or couldn't understand them even with years of study.
Here is a regularly updated version of the vaccine chart https://covid19tracker.ca/vaccinegap.html
If the High-Rated Sentence Producer was restricted to output only single steps of a mathematical proof and the single steps were evaluated independently, with the human unable to look at previous steps, then I wouldn't expect this kind of reward hacking to occur. In math proofs, we can build proofs for more complex questions out of individual steps that don't need to increase in complexity.
As I see it, debate on arbitrary questions could work if we figured out how to do something similar, having arguments split into single steps and evaluated independently (as in the recent OpenAI debate work), such that the debate AI can tackle more complicated questions with steps that are restricted to the complexity that humans can currently work with. Hard to know if this is possible, but still seems worth trying to work on.
For the preference learning skepticism, does this extend to the research direction (that isn't yet a research area) of modelling long term preferences/preferences on reflection? This is more along the lines of the "AI-assisted deliberation" direction from ARCHES.To me it seems like AI alignment that can capture preferences on reflection could be used to find solutions to many of other problems. Though there are good reasons to expect that we'd still want to do other work (because we might need theoretical understanding and okay solutions before AI reaches the point where it can help on research, because we want to do work ourselves to be able to check solutions that AIs reach, etc.)It also seems like areas like FairML and Computational Social Choice will require preference learning as components - my guess is that people's exact preferences about fairness won't have a simple mathematical formulation, and will instead need to be learned. I could buy the position that the necessary progress in preference learning will happen by default because of other incentives.
One thing I'd like to see are some more fleshed out examples of the kinds of governance demands that you think might be important in the future and would be bottlenecked on research progress in these areas.
It seems that in principle a version of debate where only one agent makes statements and the other chooses which statements to expand could work, but it seems like it requires the judge to be very strict that the statement is 100% true. It seems hard to apply this kind of system to statements outside of formal mathematics.Systems where both agents can make statements seem like they might be less vulnerable to judges accepting statements that aren't 100% true. For one example, if both agents take turns being the arguer, then if both agents submit a path that is judged to be correct, you can stipulate that the agent with the shortest path wins (like imposing a simplicity prior).