Even better than "Getting models to explain why they’re doing what they’re doing in simpler terms that connect to things the human overseers understand" would be getting models to actually do the task in ways that are simpler and connect to things that human overseers understand. E.g. if a model can solve a task in multiple steps by looking up relevant information by doing internet searches that are recorded and readable by the overseer instead of using knowledge opaquely measured in the weights, that seems like a step in the right direction.
One easy way to make people who can't solve the task for sandwiching is to take people who could solve the task and then give them insufficient time to solve it, or have them be uninformed of some relevant facts about the specific task they are trying to solve.
A simpler way to measure whether you are making progress towards sandwiching if you can't go there directly is to look at whether you can get people to provide better supervision with your tool than without your tool, that is accomplishing more on the task.
Both of these approaches feel like they aren't quite solving the whole problem, because ultimately we want systems that help humans supervise tasks where they haven't developed the right concepts, or couldn't understand them even with years of study.
Here is a regularly updated version of the vaccine chart https://covid19tracker.ca/vaccinegap.html
If the High-Rated Sentence Producer was restricted to output only single steps of a mathematical proof and the single steps were evaluated independently, with the human unable to look at previous steps, then I wouldn't expect this kind of reward hacking to occur. In math proofs, we can build proofs for more complex questions out of individual steps that don't need to increase in complexity.
As I see it, debate on arbitrary questions could work if we figured out how to do something similar, having arguments split into single steps and evaluated independently (as in the recent OpenAI debate work), such that the debate AI can tackle more complicated questions with steps that are restricted to the complexity that humans can currently work with. Hard to know if this is possible, but still seems worth trying to work on.
For the preference learning skepticism, does this extend to the research direction (that isn't yet a research area) of modelling long term preferences/preferences on reflection? This is more along the lines of the "AI-assisted deliberation" direction from ARCHES.To me it seems like AI alignment that can capture preferences on reflection could be used to find solutions to many of other problems. Though there are good reasons to expect that we'd still want to do other work (because we might need theoretical understanding and okay solutions before AI reaches the point where it can help on research, because we want to do work ourselves to be able to check solutions that AIs reach, etc.)It also seems like areas like FairML and Computational Social Choice will require preference learning as components - my guess is that people's exact preferences about fairness won't have a simple mathematical formulation, and will instead need to be learned. I could buy the position that the necessary progress in preference learning will happen by default because of other incentives.
One thing I'd like to see are some more fleshed out examples of the kinds of governance demands that you think might be important in the future and would be bottlenecked on research progress in these areas.
It seems that in principle a version of debate where only one agent makes statements and the other chooses which statements to expand could work, but it seems like it requires the judge to be very strict that the statement is 100% true. It seems hard to apply this kind of system to statements outside of formal mathematics.Systems where both agents can make statements seem like they might be less vulnerable to judges accepting statements that aren't 100% true. For one example, if both agents take turns being the arguer, then if both agents submit a path that is judged to be correct, you can stipulate that the agent with the shortest path wins (like imposing a simplicity prior).
HCH could implement the decomposition oracle by searching over the space of all possible decompositions (it would just be quite expensive).
https://www.kialo.com/ lets people build debates on controversial topics in a heirarchical structure (more like stock debate, with both sides providing arguments), but doesn't seem to have been used for explanations/arguments. I'd also be pretty interested to see more attempts at heirarchical explanations.
I think there are situations where you can still have subproblems where the output of the subproblem is long. A contrived example: suppose you have a problem where you want to calculate XOR(f(a), f(b)), where f(a) and f(b) are long strings. It seems reasonable to decompose into x=f(a), y=f(b), z=XOR(x, y), despite x and y being long, because there's a simple way to combine them.If we had an AI system that could work on "making progress on a problem for an hour", then write down a complete description of everything it had figured out and pass that to another AI system, I'd count that as dividing the problem into subproblems, just in a way that's probably inefficient.I'd evaluate decompositions into subproblems by something like the total cost of solving a problem by dividing it into subproblems. Some decompositions would be efficent and others would be inefficient, sometimes this would be because the output is large but in other cases it could be because it takes a long time to write the input, or because there's a lot of work repeated between subproblems.