paulfchristiano's Comments

Clarifying The Malignity of the Universal Prior: The Lexical Update

I still tentatively think the lexical update works, but it's been a while and I might be missing something.

I'll follow your convention that our universe is U-simple, our universal prior is U', and so the aliens' universe is U'-simple (I think---sorry if I got confused and that's not what you mean).

If we sample from a universe that is U'-simple, then:

  • Assume the aliens care about U'-simplicity. They will preferentially sample from U', and so have U'(our world) mass on our world. Within that, they will correctly guess that the machine they are supposed to control is using U' as its prior. That is, they basically pay U'(our world) * P(us|someone using U' to predict).
  • But our universal prior was also U', wasn't it? So we are also paying U'(our world) to pick out our world. I.e. we pay U'(our world) * P(someone making important predictions | our world) * P(someone using U' to predict | someone making important predictions) * P(us|someone using U' to predict). (The two expressions are compared side by side just after this list.)
  • I don't see any program whose behavior depends on U(world) for the "real" simplicity prior U according to which our world is simple (and that concept seems slippery).
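
Writing those two bullets out side by side (the abbreviations "pred" for "someone making important predictions" and "U'-pred" for "someone using U' to predict" are mine):

```latex
\begin{align*}
\text{aliens' mass on us} &= U'(\text{our world}) \cdot P(\text{us} \mid U'\text{-pred}) \\
\text{our mass on us}     &= U'(\text{our world}) \cdot P(\text{pred} \mid \text{our world})
                             \cdot P(U'\text{-pred} \mid \text{pred})
                             \cdot P(\text{us} \mid U'\text{-pred})
\end{align*}
```

Both lines share the factors U'(our world) and P(us | U'-pred); the second line just carries the two extra conditioning factors.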

Does that seem right?

Of arguments and wagers

(Note: this is a post from 2014 that I recently added to ai-alignment.com. I still think it's a very interesting scheme and I'm excited about people exploring better mechanisms for resolving arguments.)

Of arguments and wagers

I think the resulting odds won't reflect the probability of anything, because they depend a lot on whether Alice or Bob is more risk-tolerant (=rich).

If one of them is willing to tolerate risk equal to the value of Judy's time to hear out the argument, then you are fine. If the total willingness to risk of people who believe "Judy will believe X on reflection" is lower than the value of Judy's time, then I think you are basically inevitably stuck unless Judy is willing to risk her own attention. If she is willing to risk her own attention, then she can just give people a budget of "minutes" to spend making wagers, as discussed in the post, and as long as the budget is large enough relative to the size of the disagreement it seems like you are OK.
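
As a toy version of that condition (the names and the all-or-nothing rule here are my own simplification, not the mechanism from the post):

```python
# Toy check: Judy hears the argument only if the backers of "Judy will believe X on
# reflection" are collectively willing to stake at least the value of her attention.
def argument_gets_heard(stakes_for_x, value_of_judys_time):
    return sum(stakes_for_x) >= value_of_judys_time

print(argument_gets_heard([4, 3], value_of_judys_time=10))     # False: stuck unless Judy risks her own attention
print(argument_gets_heard([4, 3, 5], value_of_judys_time=10))  # True: the stakes can cover her time
```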

Also, it seems to me that your scheme works best for yes/no questions. For anything more complicated, Alice and Bob can cooperate to mislead Judy, which is especially scary in case of AIs. I'm not sure how to fix that problem: it seems to require a way for a non-expert to check the work of a malicious expert, not just adjudicate between two experts.

The scheme works if one of the experts advocates for the truth. If there are two options, and both players want to manipulate Judy into believing "yes," then you are similarly in trouble. I agree that if there are more options than experts then it becomes less likely that "by chance" someone wants to advocate for the right answer. But I think in general you are banking on there being some density of experts who want to argue for the truth because it is the truth.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

For context, here's the one time in the interview where I mention "AI risk" (quoting the 2 preceding paragraphs as well):

Paul Christiano: I don’t know, the future is 10% worse than it would otherwise be in expectation by virtue of our failure to align AI. I made up 10%, it’s kind of a random number. I don’t know, it’s less than 50%. It’s more than 10% conditioned on AI soon I think.
[...]
Asya Bergal: I think my impression is that that 10% is lower than some large set of people. I don’t know if other people agree with that.
Paul Christiano: Certainly, 10% is lower than lots of people who care about AI risk. I mean it’s worth saying, that I have this slightly narrow conception of what is the alignment problem. I’m not including all AI risk in the 10%. I’m not including in some sense most of the things people normally worry about and just including the like ‘we tried to build an AI that was doing what we want but then it wasn’t even trying to do what we want’. I think it’s lower now or even after that caveat, than pessimistic people. It’s going to be lower than all the MIRI folks, it’s going to be higher than almost everyone in the world at large, especially after specializing in this problem, which is a problem almost no one cares about, which is precisely how a thousand full time people for 20 years can reduce the whole risk by half or something.

(But it's still the case that, when asked "Can you explain why it's valuable to work on AI risk?", I responded by almost entirely talking about AI alignment, since that's what I work on and the kind of work where I have a strong view about cost-effectiveness.)

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

E.g. if you have a broad distribution over possible worlds, some of which are "fragile" and have 100 things that cut value down by 10%, and some of which are "robust" and don't, then you get 10,000x more value from the robust worlds. So unless you are a priori pretty confident that you are in a fragile world (or they are 10,000x more valuable, or whatever), the robust worlds will tend to dominate.
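
As a back-of-envelope check on that arithmetic (a minimal sketch; the numbers just mirror the "100 things that cut value down by 10%" example above, and the exact factor isn't the point):

```python
# Toy arithmetic for the fragile vs. robust worlds comparison (illustrative numbers only).
fragile_value = 0.9 ** 100            # a world where 100 independent factors each cut value by 10%
robust_value = 1.0                    # a world with none of those cuts

print(robust_value / fragile_value)   # ~4e4: each robust world carries vastly more value

# Expected value is dominated by the robust worlds unless fragility is close to certain:
p_fragile = 0.9                       # even a strong prior on fragility...
ev_fragile = p_fragile * fragile_value        # ~2.4e-5
ev_robust = (1 - p_fragile) * robust_value    # 0.1
print(ev_robust / ev_fragile)         # ...still leaves the robust worlds dominating EV by ~4e3
```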

Similar arguments work if we aggregate across possible paths to achieving value within a fixed, known world---if there are several ways things can go well, some of which are more robust, those will drive almost all of the EV. And similarly for moral uncertainty (if there are several plausible views, the ones that consider this world a lost cause will instead spend their influence on other worlds) and so forth. I think it's a reasonably robust conclusion across many different frameworks: your decision shouldn't end up being dominated by some hugely conjunctive event.

A dilemma for prosaic AI alignment

In the case of something like amplification or debate, I think the bet that you're making is that language modeling alone is sufficient to get you everything you need in a competitive way.

I'm skeptical of language modeling being enough to be competitive, in the sense of maximizing "log prob of some naturally occurring data or human demonstrations." I don't have a strong view about whether you can get away using only language data rather than e.g. taking images as input and producing motor torques as output.

I'm also not convinced that amplification or debate need to make this bet though. If we can do joint training / fine-tuning of a language model using whatever other objectives we need, then it seems like we could just as well do joint training / fine-tuning for a different kind of model. What's so bad if we use non-language data?

A dilemma for prosaic AI alignment

We could also ask: "Would AlphaStar remain as good as it is, if fine-tuned to answer questions?"

In either case it's an empirical question. I think the answer is probably yes if you do it carefully.

You could imagine separating this into two questions:

  • Is there a policy that plays starcraft and answers questions, that is only slightly larger than a policy for playing starcraft alone? This is a key premise for the whole project. I think it's reasonably likely; the goal is only to answer questions the model "already knows," so it seems realistic to hope for only a constant amount of extra work to be able to use that knowledge to answer questions. I think most of the uncertainty here is about details of "know" and question-answering and so on.
  • Can you use joint optimization to find that policy with only slightly more training time? I think probably yes.

A dilemma for prosaic AI alignment

I normally imagine using joint training in these cases, rather than pre-training + fine-tuning. E.g., at every point in time we maintain an agent and a question-answerer, where the question-answerer "knows everything the agent knows." They get better together, with each gradient update affecting both of them, rather than first training a good agent and then adding a good question-answerer.
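
As a concrete (and oversimplified) sketch of what that could look like: a shared trunk with an agent head and a question-answering head, trained on a summed loss. The module names, sizes, and the use of supervised targets in place of an RL objective are all illustrative assumptions, not a description of any actual system.

```python
# Minimal sketch of joint training: one shared model with an "agent" head and a
# "question-answerer" head, updated together on a combined loss. Everything here
# (architecture, targets, loss weighting) is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointModel(nn.Module):
    def __init__(self, obs_dim=32, n_actions=10, n_answers=10, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())  # shared representation
        self.agent_head = nn.Linear(hidden, n_actions)   # picks actions
        self.qa_head = nn.Linear(hidden, n_answers)      # answers questions about what the agent "knows"

    def forward(self, obs):
        h = self.trunk(obs)
        return self.agent_head(h), self.qa_head(h)

model = JointModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(obs, action_target, answer_target, qa_weight=1.0):
    action_logits, answer_logits = model(obs)
    # One combined loss, so a single gradient update improves both heads (and the trunk) together.
    loss = F.cross_entropy(action_logits, action_target)
    loss = loss + qa_weight * F.cross_entropy(answer_logits, answer_target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Dummy data just to show the call pattern.
training_step(torch.randn(8, 32),
              torch.randint(0, 10, (8,)),
              torch.randint(0, 10, (8,)))
```

The shared trunk is what lets the question-answerer reuse whatever the main objective teaches the model, rather than bolting a question-answerer on afterwards.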

(Independently of concerns about mesa-optimization, I think the fine-tuning approach would have trouble because you couldn't use statistical regularities from the "main" objective to inform your answers to questions; your question answering would therefore be dumber than the policy, and so you couldn't get a good reward function or specification of catastrophically bad behavior.)

AlphaStar: Impressive for RL progress, not for AGI progress

I don't have a big difference in my model of mid vs. final: they have very similar MMR, the difference between them is pretty small in the scheme of things (e.g. probably smaller than the impact of doubling model size), and my picture isn't refined enough to appreciate those differences. For any particular dumb mistake, I'd be surprised if the line between not making it and making it fell in that particular doubling.
