I still tentatively think the lexical update works, but it's been a while and I might be missing something.
I'll follow your convention that our universe is U-simple, our universal prior is U', and so the aliens' universe is U'-simple (I think---sorry if I got confused and that's not what you mean).
If we sample from a universe that is U'-simple, then:
Does that seem right?
(Note: this is a post from 2014 that I recently added to ai-alignment.com. I still think it's a very interesting scheme and I'm excited about people exploring better mechanisms for resolving arguments.)
I think the resulting odds won't reflect the probability of anything, because they depend a lot on whether Alice or Bob is more risk-tolerant (=rich).
If one of them is willing to tolerate risk equal to the value of Judy's time to hear out the argument, then you are fine. If the total willingness to risk of people who believe "Judy will believe X on reflection" is lower than the value of Judy's time, then I think you are basically inevitably stuck unless Judy is willing to risk her own attention. If she is willing to risk her own attention, then she can just give people a budget of "minutes" to spend making wagers, as discussed in the post, and as long as the budget is large enough relative to the size of the disagreement it seems like you are OK.
Also, it seems to me that your scheme works best for yes/no questions. For anything more complicated, Alice and Bob can cooperate to mislead Judy, which is especially scary in case of AIs. I'm not sure how to fix that problem: it seems to require a way for a non-expert to check the work of a malicious expert, not just adjudicate between two experts.
The scheme works if one of the experts advocates for the truth. If there are two options, and both players want to manipulate Judy into believing "yes," then you are similarly in trouble. I agree that if there are more options than experts then it becomes less likely that "by chance" someone wants to advocate for the right answer. But I think in general you are banking on there being some density of experts who want to argue for the truth because it is the truth.
For context, here's the one time in the interview I mention "AI risk" (quoting 2 earlier paragraphs for context):
Paul Christiano: I don’t know, the future is 10% worse than it would otherwise be in expectation by virtue of our failure to align AI. I made up 10%, it’s kind of a random number. I don’t know, it’s less than 50%. It’s more than 10% conditioned on AI soon I think.
Asya Bergal: I think my impression is that that 10% is lower than some large set of people. I don’t know if other people agree with that.
Paul Christiano: Certainly, 10% is lower than lots of people who care about AI risk. I mean it’s worth saying, that I have this slightly narrow conception of what is the alignment problem. I’m not including all AI risk in the 10%. I’m not including in some sense most of the things people normally worry about and just including the like ‘we tried to build an AI that was doing what we want but then it wasn’t even trying to do what we want’. I think it’s lower now or even after that caveat, than pessimistic people. It’s going to be lower than all the MIRI folks, it’s going to be higher than almost everyone in the world at large, especially after specializing in this problem, which is a problem almost no one cares about, which is precisely how a thousand full time people for 20 years can reduce the whole risk by half or something.
(But it's still the case that asked "Can you explain why it's valuable to work on AI risk?" I responded by almost entirely talking about AI alignment, since that's what I work on and the kind of work where I have a strong view about cost-effectiveness.)
E.g. if you have a broad distribution over possible worlds, some of which are "fragile" and have 100 things that cut value down by 10%, and some of which are "robust" and don't, then you get 10,000x more value from the robust worlds. So unless you are a priori pretty confident that you are in a fragile world (or they are 10,000x more valuable, or whatever), the robust worlds will tend to dominate.
Similar arguments work if we aggregate across possible paths to achieving value within a fixed, known world---if there are several ways things can go well, some of which are more robust, those will drive almost all of the EV. And similarly for moral uncertainty (if there are several plausible views, the ones that consider this world a lost cause will instead spend their influence on other worlds) and so forth. I think it's a reasonably robust conclusion across many different frameworks: your decision shouldn't end up being dominated by some hugely conjunctive event.
In the case of something like amplification or debate, I think the bet that you're making is that language modeling alone is sufficient to get you everything you need in a competitive way.
I'm skeptical of language modeling being enough to be competitive, in the sense of maximizing "log prob of some naturally occurring data or human demonstrations." I don't have a strong view about whether you can get away using only language data rather than e.g. taking images as input and producing motor torques as output.
I'm also not convinced that amplification or debate need to make this bet though. If we can do joint training / fine-tuning of a language model using whatever other objectives we need, then it seems like we could just as well do joint training / fine-tuning for a different kind of model. What's so bad if we use non-language data?
We could also ask: "Would AlphaStar remain as good as it is, if fine-tuned to answer questions?"
In either case it's an empirical question. I think the answer is probably yes if you do it carefully.
You could imagine separating this into two questions:
I normally imagine using joint training in these cases, rather than pre-training + fine-tuning. e.g., at every point in time we maintain an agent and a question-answerer, where the question-answerer "knows everything the agent knows." They get better together, with each gradient update affecting both of them, rather than first training a good agent and then adding a good question-answerer.
(Independently of concerns about mesa-optimization, I think the fine-tuning approach would have trouble because you couldn't use statistical regularities from the "main" objective to inform your answers to questions, and therefore your question answers will be dumber than the policy and so you couldn't get a good reward function or specification of catastrophically bad behavior.)
I don't have a big difference in my model of mid vs. final, they have very similar MMR, the difference between them is pretty small in the scheme of things (e..g probably smaller than the impact of doubling model size) and my picture isn't refined enough to appreciate those differences. For any particular dumb mistake I'd be surprised if the line between not making it and making it was in that particular doubling.