Background in mathematics (descriptive set theory, Banach spaces) and game theory (mostly zero-sum, imperfect-information games). CFAR mentor. Usually doing alignment research.


Risk Map of AI Systems
As a concrete tool for discussing AI risk scenarios, I wonder if it doesn't have too many parameters? Like you have to specify the map (at least locally), how far we can see, what the impact of this or that research will be...

I agree that the model does have quite a few parameters. I think you can get some value out of it already by being aware of what the different parameters are and, in case of a disagreement, identifying the one you disagree about the most.

>If we currently have some AI system x, we can ask which systems are reachable from x --- i.e., which other systems are we capable of constructing now that we have x.
What are the constraints on "construction"? Because by definition, if those other systems are algorithms/programs, we can build them. It might be easier or faster to build them using x, but that doesn't make construction possible where it was impossible before. I guess what you're aiming at is something like "x reaches y" as a form of "x is stronger than y", and also "if y is stronger, it should be considered when thinking about x"?

I agree that if we can build x, and then build y (with the help of x), then we can in particular build y. So having/not having x doesn't make a difference on what is possible. I implicitly assumed some kind of constraint on the construction like "what can we build in half a year", "what can we build for $1M" or, more abstractly, "how far can we get using X units of effort".
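The effort-bounded notion of reachability can be sketched as a shortest-path search over a map of systems, where edge weights stand for construction effort. (A minimal sketch; the system names, edge costs, and the assumption that effort is a single additive number are all made up for illustration.)

```python
import heapq

def reachable(start, edges, budget):
    """Return the set of systems reachable from `start` within `budget` effort.

    `edges` maps a system to a list of (successor, effort) pairs, where
    `effort` is the cost of building the successor given what we already
    have (a simplifying assumption: effort is additive along a path).
    """
    best = {start: 0.0}  # cheapest known effort to reach each system
    queue = [(0.0, start)]
    while queue:
        cost, x = heapq.heappop(queue)
        if cost > best.get(x, float("inf")):
            continue  # stale queue entry
        for y, effort in edges.get(x, []):
            new_cost = cost + effort
            if new_cost <= budget and new_cost < best.get(y, float("inf")):
                best[y] = new_cost
                heapq.heappush(queue, (new_cost, y))
    return set(best)

# Toy map: building z directly from x is expensive, but going via y is cheap.
edges = {"x": [("y", 2.0), ("z", 9.0)], "y": [("z", 3.0)]}
print(reachable("x", edges, budget=6.0))  # z is reachable, but only via y
```

With a budget of 1.0 unit of effort, only `{"x"}` is reachable; with 6.0, the detour through y brings z within reach, which is the sense in which having y changes what we can build "for $1M".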

[About the colour map:] This part confused me. Are you just saying that to have a color map (a map from the space of AIs to R), you need to integrate all systems into one, or consider all systems but one as irrelevant?

The goal was more to say that in general, it seems hard to reason about the effects of individual AI systems. But if we make some of the extra assumptions, it will make more sense to treat harmfulness as a function from AIs to $$\mathbb R$$.

Values Form a Shifting Landscape (and why you might care)
Of course, if we just want some space of possible values, where each value has an opinion of each value, then that's just a continuous function from the product of the space with itself to the reals, which isn't a problem.

Yeah, I just meant this simple thing that you can mathematically model as $$f : V \times V \to \mathbb R$$. I suppose it makes sense to consider special cases of this that would have better mathematical properties. But I don't have high-confidence intuitions on which special cases are the right ones to consider.
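As a toy illustration of this simple thing: treat value-systems as points in a plane and let f be negative Euclidean distance, so each system rates nearby systems as more desirable. (The representation and the choice of f are arbitrary, purely illustrative assumptions.)

```python
import math

def opinion(v, w):
    """f(v, w): how desirable value-system w looks from value-system v.

    Toy choice: approval falls off with Euclidean distance, so each
    value-system likes value-systems similar to itself best.
    """
    return -math.dist(v, w)

a = (0.9, 0.1)   # three made-up value-systems as points in V = R^2
b = (0.1, 0.8)
c = (0.5, 0.45)

# The "landscape" each system sees is different: f depends on *both*
# arguments, so there is no single fixed ranking of value-systems.
print(opinion(a, c) > opinion(a, b))  # True: a prefers the nearby c
print(opinion(b, c) > opinion(b, a))  # True: b also prefers c, for its own reasons
```

Special cases with nicer mathematical properties would then correspond to extra structure on f, e.g. symmetry f(v, w) = f(w, v), or f being induced by a metric as here.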

I mostly meant this as a tool that would allow people with different opinions to move their disagreements from "your model doesn't make sense" to "both of our models make sense in theory; the disagreement is an empirical one". (E.g., the value-drift situation from Figure 6 is definitely possible, but that doesn't necessarily mean that this is what is happening to us.)

Values Form a Shifting Landscape (and why you might care)

Thank you for the comment. As for the axes, the y-axis always denotes the desirability of the given value-system (except for Figure 1). And you are exactly right with the x-axis --- that is a multidimensional space of value-systems that we put on a line, because drawing this in 3D (well, (multi+1)-D :-) ) would be a mess. I will see if I can make it somewhat clearer in the post.

AI Problems Shared by Non-AI Systems
Much of the concern about AI systems arises when they lack support for these kinds of interventions, whether because they are too fast, too complex, or able to outsmart the would-be intervening human trying to correct what they see as an error.

All of these possible causes for the lack of support are valid. I would like to add one more: when the humans that could provide this kind of support don't care about providing it or have incentives against providing it. For example, I could report a bug in some system, but this would cost me time and only benefit people I don't know, so I will happily ignore it :-).

"Zero Sum" is a misnomer.

As a game theorist, I completely endorse the proposed terminology. Just don't tell other game theorists... Sometimes, things get even worse when some people use the term "general sum games" to refer to games that are not constant-sum.

I like to imagine different games on a scale between completely adversarial and completely cooperative. With things in the middle being called "mixed-motive games".

Integrating Hidden Variables Improves Approximation

I am usually reasonably good at translating from math to non-abstract intuitive examples... but I didn't have much success here. Do you have an "in English, for simpletons" example to go with this? :-) (You know, something that uses apples and biscuits rather than English-but-abstract words like "there are many hidden variables mediating the interactions between observables" :D.)

Otherwise, my current abstract interpretation of this is something like: "There are detailed models, and those might vary a lot. And then there are very abstract models, which will be more similar to each other...well, except that they might also be totally useless." So I was hoping that a more specific example would clarify things for a bit and tell me whether there is more to this (and also whether I got it all wrong or not :-).)

Noise Simplifies

I have a long list of randomly-chosen numbers between 1 and 10, and I want to know whether their sum is even or odd.

I find your example here somewhat misleading. Suppose your random numbers weren't drawn uniformly from 1-10, but from a distribution in which even numbers are five times as likely as odd ones. If you don't know a single number, you still know that there is a 5:1 chance that it is even (and hence doesn't change the parity of the sum of the whole list). So if a single number is unknown, you will still want to take the sum of the ones you do know. In this light, your example seems like an exception rather than the norm. (My main issue with it is that, since it feels very ad hoc, you might subconsciously come to the impression that the described behaviour is the norm.)
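To make the 5:1 point concrete, here is a small simulation. (The pool below --- five even numbers, one odd --- is just one made-up distribution with the stated property; the list length and trial count are arbitrary.)

```python
import random

random.seed(0)
pool = [1, 2, 4, 6, 8, 10]  # one odd number, five even: 5:1 odds of "even"

hits = 0
trials = 10_000
for _ in range(trials):
    numbers = [random.choice(pool) for _ in range(20)]
    known, hidden = numbers[:-1], numbers[-1]
    # Guess: the hidden number is probably even, so the total's parity
    # probably matches the parity of the part we can see.
    guess = sum(known) % 2
    hits += guess == sum(numbers) % 2

print(hits / trials)  # close to 5/6: the known numbers stay informative
```

Under the uniform 1-10 draw from the post, the same guess would be right only half the time, which is what makes that example an exception: there, the hidden number really does destroy all the information in the visible ones.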

However, it might easily be that the class of these "exceptions" is important on its own. So I wouldn't want to shoot down the overall idea described in the post - I like it :-).

How should AI debate be judged?

Even if you keep the argumentation phase asymmetric, you might want to make the answering phase simultaneous or at least allow the second AI to give the same answer as the first AI (which can mean a draw by default).

This doesn't make for a very good training signal, but might have better equilibria.

AI Unsafety via Non-Zero-Sum Debate

I haven't yet thought about this in much detail, but here is what I have:

I will assume you can avoid getting "hacked" while overseeing the debate. If you don't assume that, then it might be important whether you can differentiate between arguments that are vs aren't relevant to the question at hand. (I suppose that it is much harder to get hacked when strictly sticking to a specific subject-matter topic. And harder yet if you are, e.g., restricted to answering in math proofs, which might be sufficient for some types of questions.)

As for the features of safe questions, I think that one axis is the potential impact of the answer, and an orthogonal one is the likelihood that the answer will be undesirable/misaligned/bad. My guess is that if you can avoid getting hacked, then questions with lower-impact downstream consequences are inherently safer (for the trivial reason of being less impactful). But this feels like a cheating answer, and the second axis seems more interesting.

My intuition about the "how likely are we to get an aligned answer" axis is this: There are questions where I am fairly confident in our judging skills (for example, math proofs). Many of those could fall into the "definitely safe" category. Then there is the other extreme of questions where our judgement might be very fallible - things that are too vague or that play into our biases. (For example, hard philosophical questions and problems whose solutions depend on answers to such questions. E.g., I wouldn't trust myself to be a good judge of "how should we decide on the future of the universe" or "what is the best place for me to go for a vacation".) I imagine these are "very likely unsafe". And as a general principle, where there are two extremes, there will often be a continuum in between. Maybe "what is a reasonable way of curing cancer?" could fall there? (Being probably safe, but I wouldn't bet all my money on it.)

AI Unsafety via Non-Zero-Sum Debate

I agree with what Paul and Donald are saying, but the post was trying to make a different point.

Among various things needed to "make debate work", I see three separate sub-problems:

(A) Ensuring that "agents use words to get a human to select them as the winner; and that this is their only terminal goal" is a good abstraction. (Please read this in the intended meaning of the sentence. No, if there is a magical word that causes the human's head to explode and their body falls on the reward button, this doesn't count.)

(B) Having already accomplished (A), ensure that "agents use words to convince the human that their answer is better" is a good abstraction. (Not sure how to operationalize this, but you want to, for example, ensure that: (i) Agents do not collaboratively convince the human to give reward to both of them. (ii) If the human could in principle be brainwashed, the other agent will be able and incentivized to prevent this. In particular, no brainwashing in a single claim.)

(C) Having already accomplished (A) and (B), ensure that AIs in debate only convince us of safe and useful things.

While somewhat related, I think these three problems should be tackled separately as much as possible. Indeed, (A) seems to not really be specific to debate, because a similar problem can be posed for any advanced AI. Moreover, I think that if you are relying on the presence of the other agent to help you with (A) (e.g., one AI producing signals to block the other AI's signals), you have already made a mistake. On the other hand, it seems fine to rely on the presence of the other agent for both (B) and (C). However, my intuition is that these problems are mostly orthogonal - most solutions to (B) will be compatible with most solutions to (C).

For (A), Michael Cohen's Boxed Myopic AI seems like a particularly relevant thing. (Not saying that what it proposes is enough, nor that it is required in all scenarios.) Michael's recent "AI Debate" Debate post seems to be primarily concerned about (B). Finally, this post could be rephrased as "When people talk about debate, they often focus on (C). And that seems fair. However, if you make debate non-zero-sum, your (B) will break."
