VojtaKovarik

Background in mathematics (descriptive set theory, Banach spaces) and game theory (mostly zero-sum, imperfect-information games). CFAR mentor. Usually doing alignment research.

Comments

"Zero Sum" is a misnomer.

As a game theorist, I completely endorse the proposed terminology. Just don't tell other game theorists... Sometimes things get even worse: some people use the term "general sum games" to refer to games that are not constant-sum.

I like to imagine different games on a scale between completely adversarial and completely cooperative, with the things in the middle being called "mixed-motive games".
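To make this scale a bit more concrete, here is a small sketch (my own illustration, with made-up payoff matrices, not standard terminology): one crude way to place a two-player game on the adversarial-cooperative axis is to look at how correlated the two players' payoffs are across outcomes.

```python
# Rough proxy for "where on the scale" a game sits: the correlation between
# the two players' payoffs across action profiles. -1 for strictly adversarial
# (zero-sum), +1 for fully cooperative (common payoffs), in between otherwise.
import numpy as np

def payoff_correlation(payoffs_1, payoffs_2):
    """Correlation of the two players' payoffs over all action profiles."""
    return np.corrcoef(np.ravel(payoffs_1), np.ravel(payoffs_2))[0, 1]

# Matching pennies: completely adversarial (zero-sum).
mp_1 = np.array([[1, -1], [-1, 1]])
mp_2 = -mp_1

# Battle of the sexes: a mixed-motive game.
bos_1 = np.array([[2, 0], [0, 1]])
bos_2 = np.array([[1, 0], [0, 2]])

# Pure coordination: completely cooperative (common payoffs).
coord_1 = np.array([[2, 0], [0, 1]])
coord_2 = coord_1

for name, (a, b) in [("matching pennies", (mp_1, mp_2)),
                     ("battle of the sexes", (bos_1, bos_2)),
                     ("pure coordination", (coord_1, coord_2))]:
    print(f"{name}: payoff correlation = {payoff_correlation(a, b):+.2f}")
# matching pennies: -1.00, battle of the sexes: +0.64, pure coordination: +1.00
```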

Integrating Hidden Variables Improves Approximation

I am usually reasonably good at translating from math to non-abstract intuitive examples...but I didn't have much success here. Do you have an "in English, for simpletons" example to go with this? :-) (You know, something that uses apples and biscuits rather than English-but-abstract words like "there are many hidden variables mediating the interactions between observables" :D.)

Otherwise, my current abstract interpretation of this is something like: "There are detailed models, and those might vary a lot. And then there are very abstract models, which will be more similar to each other...well, except that they might also be totally useless." So I was hoping that a more specific example would clarify things a bit and tell me whether there is more to this (and also whether I got it all wrong or not :-).)

Noise Simplifies

I have a long list of randomly-chosen numbers between 1 and 10, and I want to know whether their sum is even or odd.

I find your example here somewhat misleading. Suppose your random numbers weren't drawn uniformly from 1-10, but from a set in which five out of every six numbers are even. If you don't know a single number, you still know that there is a 5:1 chance that it will be even (and hence not change the parity of the sum of the whole list). So if a single number is unknown, you will still want to take the sum of the ones you do know. In this light, your example seems like an exception rather than the norm. (My main issue is that, since the example feels very ad hoc, one might subconsciously come to the impression that the described behaviour is the norm.)

However, it might easily be that the class of these "exceptions" is important on its own. So I wouldn't want to shoot down the overall idea described in the post - I like it :-).
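To make the 5:1 point concrete, here is a quick simulation (my own sketch; the biased pool {1, 2, 4, 6, 8, 10} is just one choice with five evens per odd). With the uniform 1-10 pool, a single unknown number destroys all information about the parity; with the biased pool, summing the numbers you do know still predicts the parity about 5/6 of the time.

```python
# Predicting the parity of the full sum when exactly one number is hidden.
import random

def accuracy(pool, n_numbers=20, n_trials=100_000):
    """How often the parity of the known numbers matches the parity of the
    full sum, treating the last number in the list as the hidden one."""
    hits = 0
    for _ in range(n_trials):
        xs = [random.choice(pool) for _ in range(n_numbers)]
        guess = sum(xs[:-1]) % 2          # best guess: the hidden number is even
        hits += (guess == sum(xs) % 2)
    return hits / n_trials

print(accuracy(pool=range(1, 11)))         # uniform 1-10   -> ~0.50, known sum is useless
print(accuracy(pool=[1, 2, 4, 6, 8, 10]))  # 5:1 even bias  -> ~0.83, known sum still helps
```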

How should AI debate be judged?

Even if you keep the argumentation phase asymmetric, you might want to make the answering phase simultaneous, or at least allow the second AI to give the same answer as the first AI (which could then count as a draw by default).

This doesn't make for a very good training signal, but might have better equilibria.
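For concreteness, here is a minimal sketch of the payoff rule I have in mind (my own formalization; the payoff values are placeholders): identical answers are a draw by default, and otherwise the judged debate stays zero-sum.

```python
# Minimal sketch of the suggested answering-phase rule.
def answer_phase_payoffs(answer_1, answer_2, judge_prefers_first):
    """Returns (payoff of AI 1, payoff of AI 2)."""
    if answer_1 == answer_2:
        return (0.0, 0.0)                 # same answer -> draw by default
    # Different answers -> run the (possibly asymmetric) argumentation phase
    # and let the judge decide; payoffs remain zero-sum.
    return (1.0, -1.0) if judge_prefers_first else (-1.0, 1.0)

print(answer_phase_payoffs("X", "X", judge_prefers_first=True))   # (0.0, 0.0)
print(answer_phase_payoffs("X", "Y", judge_prefers_first=False))  # (-1.0, 1.0)
```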

AI Unsafety via Non-Zero-Sum Debate

I haven't yet thought about this in much detail, but here is what I have:

I will assume you can avoid getting "hacked" while overseeing the debate. If you don't assume that, then it might be important whether you can differentiate between arguments that are vs aren't relevant to the question at hand. (I suppose that it is much harder to get hacked when strictly sticking to a specific subject matter. And harder still if you are, e.g., restricted to answering in math proofs, which might be sufficient for some types of questions.)

As for the features of safe questions, I think that one axis is the potential impact of the answer, and an orthogonal one is the likelihood that the answer will be undesirable/misaligned/bad. My guess is that if you can avoid getting hacked, then the lower-impact-of-downstream-consequences questions are inherently safer (for the trivial reason of being less impactful). But this feels like a cheating answer, and the second axis seems more interesting.

My intuition about the "how likely are we to get an aligned answer" axis is this: There are questions where I am fairly confident in our judging skills (for example, math proofs). Many of those could fall into the "definitely safe" category. Then there is the other extreme of questions where our judgement might be very fallible - things that are too vague or that play into our biases. (For example, hard philosophical questions and problems whose solutions depend on answers to such questions. E.g., I wouldn't trust myself to be a good judge of "how should we decide on the future of the universe" or "what is the best place for me to go for a vacation".) I imagine these are "very likely unsafe". And as a general principle, where there are two extremes, there will often be a continuum in between. Maybe "what is a reasonable way of curing cancer?" could fall here? (Probably safe, but I wouldn't bet all my money on it.)

AI Unsafety via Non-Zero-Sum Debate

I agree with what Paul and Donald are saying, but the post was trying to make a different point.

Among various things needed to "make debate work", I see three separate sub-problems:

(A) Ensuring that "agents use words to get a human to select them as the winner; and that this is their only terminal goal" is a good abstraction. (Please read this in the intended meaning of the sentence. No, if there is a magical word that causes the human's head to explode and their body to fall on the reward button, this doesn't count.)

(B) Having already accomplished (A), ensure that "agents use words to convince the human that their answer is better" is a good abstraction. (Not sure how to operationalize this, but you want to, for example, ensure that: (i) Agents do not collaboratively convince the human to give reward to both of them. (ii) If the human could in principle be brainwashed, the other agent will be able and incentivized to prevent this. In particular, no brainwashing in a single claim.)

(C) Having already accomplished (A) and (B), ensure that AIs in debate only convince us of safe and useful things.

While somewhat related, I think these three problems should be tackled separately as much as possible. Indeed, (A) seems to not really be specific to debate, because a similar problem can be posed for any advanced AI. Moreover, I think that if you are relying on the presence of the other agent to help you with (A) (e.g., one AI producing signals to block the other AI's signals), you have already made a mistake. On the other hand, it seems fine to rely on the presence of the other agent for both (B) and (C). However, my intuition is that these problems are mostly orthogonal - most solutions to (B) will be compatible with most solutions to (C).

For (A), Michael Cohen's Boxed Myopic AI seems like a particularly relevant thing. (Not saying that what it proposes is enough, nor that it is required in all scenarios.) Michael's recent "AI Debate" Debate post seems to be primarily concerned about (B). Finally, this post could be rephrased as "When people talk about debate, they often focus on (C). And that seems fair. However, if you make debate non-zero-sum, your (B) will break.".

AI Unsafety via Non-Zero-Sum Debate

if you have 2 AI's that have entirely opposite utility functions, yet which assign different probabilities to events, they can work together in ways you don't want

That is a good point, and this can indeed happen. If I believe something is a piece of chocolate while you - hating me - believe it is poison, we will happily coordinate towards me eating it. I was assuming that the AIs are copies of each other, which would eliminate most of these cases. (The remaining cases would be when the two AIs somehow diverge during the debate. I totally don't see how this would happen, but that isn't a particularly strong argument.)
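For a worked version of the chocolate/poison example (with numbers I made up), suppose eating chocolate is worth +1 to me and eating poison is worth -10, with your utilities being exactly the negatives of mine; differing beliefs are enough for both of us to prefer that I eat the thing:

```python
# Two agents with exactly opposite utilities can still both prefer the same
# action, because they assign different probabilities to what the object is.
def expected_utility(p_chocolate, u_chocolate, u_poison):
    return p_chocolate * u_chocolate + (1 - p_chocolate) * u_poison

u_choc, u_pois = 1.0, -10.0   # my utilities for eating; not eating gives 0

eu_me  = expected_utility(p_chocolate=0.99, u_chocolate=u_choc,  u_poison=u_pois)   # I think it's chocolate
eu_you = expected_utility(p_chocolate=0.01, u_chocolate=-u_choc, u_poison=-u_pois)  # you think it's poison

print(eu_me)   # 0.89 > 0, so I want to eat it
print(eu_you)  # 9.89 > 0, so you also want me to eat it
```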

Also, the debaters better be comparably smart.

Yes, this seems like a necessary assumption in a symmetric debate. Once again, this is trivially satisfied if the debaters are copies of each other. It is interesting to note that this assumption might not be sufficient, because even if the debate has symmetric rules, the structure of claims might not be. (That is, some false claims are easier to argue for than against, and some attempted human-hacks might be easier to pull off than to prevent.)

AI Unsafety via Non-Zero-Sum Debate

I think I understood the first three paragraphs. The AI "ramming a button to the human" clearly is a problem and an important one at that. However, I would say it is one that you already need to address in any single-agent scenario --- by preventing the AI from doing this (boxing), ensuring it doesn't want to do it (???), or by using AI that is incapable of doing it (weak ML system). As a result, I view this issue (even in this two-agent case) as orthogonal to debate. In the post, this is one of the things that hides under the phrase "assume, for the sake of argument, that you have solved all the 'obvious' problems".

Or did you have something else in mind by the first three paragraphs?

I didn't understand the last paragraph. Or rather, I didn't understand how it relates to debate, what setting the AIs appear in, and why would they want to behave as you describe.

How can Interpretability help Alignment?

An important consideration is whether the interpretability research which seems useful for alignment is research which we expect the mainstream ML research community to work on and solve suitably.

Do you see a way of incentivizing the RL community to change this? (If possible, that would seem like a more effective approach than doing it "ourselves".)

There’s little research which focuses on interpreting reinforcement learning agents [...].

There is some work in DeepMind's safety team on this, isn't there? (Not to dispute the overall point though, "a part of DeepMind's safety team" is rather small compared to the RL community :-).)

Nitpicks and things I didn't get:

  • It was a bit hard to understand what you mean by the "research questions vs tasks" distinction. (And then I read the bullet point below it and came, perhaps falsely, to the conclusion that you are only after the "reusable piece of wisdom" vs "one-time thing" distinction.)
  • There is something funny going on in this sentence:

If we believe a particular proposal is more or less likely than others to produce aligned AI, then we would preferentially work on interpretability research which we believe will help this proposal other work which wouldn't, as it wouldn't be as useful.

Book report: Theory of Games and Economic Behavior (von Neumann & Morgenstern)

Related to that: an interesting take on (not only) cooperative game theory is Schelling's The Strategy of Conflict (from 1960, with a second edition in 1980; I am not aware of sufficient follow-up research on the ideas presented there). And there might be some useful references in CLR's sequence on Cooperation, Conflict, and Transformative AI.
