One of the most difficult questions that we will face in the future is whether or not to release an AI. One obvious approach is to ask the AI what would happen if we were to release it, but we might worry that it would mislead us. So perhaps it'd be better instead to have two equivalently powerful AIs, where A tries to convince us to release it and B tries to convince us not to release A. This can be encoded easily in each AI's utility function in terms of whether the Release Button or the Don't Release Button is pressed.
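
To make this concrete, here is a minimal sketch of that reward structure, assuming Python and hypothetical names; it is an illustration of the idea rather than a definitive implementation:

```python
# Minimal sketch of the proposed reward structure (hypothetical names).
# Each AI's utility depends only on which button the human judge presses.

from enum import Enum


class Button(Enum):
    RELEASE = "release"
    DONT_RELEASE = "dont_release"


def utility_A(pressed: Button) -> float:
    """A is rewarded if and only if the Release Button is pressed."""
    return 1.0 if pressed is Button.RELEASE else 0.0


def utility_B(pressed: Button) -> float:
    """B is rewarded if and only if the Don't Release Button is pressed,
    making the debate zero-sum up to a constant."""
    return 1.0 if pressed is Button.DONT_RELEASE else 0.0
```

Making the two rewards strictly opposed in this way is also one way to reduce the scope for collusion between the two AIs.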

The hope is that the AI trying to convince us not to release A will be more convincing if it can point out a specific design flaw that we missed. After such a flaw is fixed, both AIs can be reset and the experiment run again, to avoid long-term ploys. In particular, A should only care about it being released and not about future versions of it being released, even if all we have done is increment a version number.

Naturally, it is much worse to release a misaligned AI than to not release an aligned AI, so we probably want to be very conservative about releasing. However, this might not be enough. If we are biased towards wanting to release the AI (quite plausible for AI researchers), then an unaligned A might be able to exploit these psychological vulnerabilities in a way that B cannot easily counter. To reduce this risk, we want to bias the process towards B. My initial thought was to limit A's processing time relative to B's, but it is hard to know how this would affect the strength of the arguments. It's quite possible that the strength of arguments plateaus beyond a certain amount of compute, so for AIs with access to large amounts of processing, even giving A 1/10 of B's processing power might not impose any significant handicap.

Instead, I'd suggest that message length is a more appropriate way of tilting the debate to one side, since more characters will almost always be more useful. One worry is that tilting the debate in this way may ensure that A never gets released, as B will always find some trick to talk us out of it. For this reason, we may want to gradually reduce the handicap. This is undoubtedly risky, but so is never releasing your program.
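
As a rough illustration, here is a minimal sketch in Python of what an asymmetric character budget with a gradually relaxed handicap might look like; the specific numbers (base budget, initial ratio, relaxation rate) are hypothetical, not part of the proposal:

```python
# Sketch of an asymmetric message-length handicap (all numbers hypothetical).
# A (arguing for release) gets a smaller character budget than B, and the
# handicap is relaxed a little with each successive run of the experiment.

def message_budgets(run_index: int,
                    base_chars: int = 10_000,
                    initial_ratio: float = 0.25,
                    relax_per_run: float = 0.05) -> tuple[int, int]:
    """Return (A_budget, B_budget) in characters for a given run.

    A starts with only `initial_ratio` of B's budget; the ratio grows by
    `relax_per_run` each run, capped at parity.
    """
    ratio = min(1.0, initial_ratio + relax_per_run * run_index)
    return int(base_chars * ratio), base_chars


def enforce_budget(message: str, budget: int) -> str:
    """One simple enforcement rule: truncate any message over budget."""
    return message[:budget]


# Example: budgets over the first few runs.
for run in range(4):
    print(run, message_budgets(run))  # (2500, 10000), (3000, 10000), ...
```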

Additional Thoughts

I think there is a lot to investigate here, but these are some of my ideas off the top of my head:

  • We may want to build a punishment into A's utility function, worth several times its maximum utility from being released, to discourage it from attempting strategies like blackmail.
  • The protocol should require that we read the message from B both first and last.
  • We can add an additional reward button for clear messages. This should be relatively unproblematic for A, but we have to be careful with B, as we don't want it to choose to gain the reward for a clear message instead of trying to persuade us not to release A.
  • One possible solution to the previous problem would be to have rounds (see the sketch after this list). A gets the reward if we advance it to the second round, while B gets the reward if we don't, where B also has a strong reward for being clear. Then A' (A with the version number incremented) gets the reward if it is actually released, while B' gets the reward if it isn't. B' could have no clarity reward, or a lower one. This would allow us to incentivise B to give the answers we want, without risking releasing an unaligned AI because B aimed for clarity over persuasiveness.
  • The Release Button doesn't have to actually release the AI. In fact, it might be better to make it a "I believe that this AI would be safe to release" button so that if it blackmails you, you can just press it.
  • The AI that actually gets released should have this additional button-related code stripped from its utility function, to avoid any risk of weird behaviour arising from theories that it has not actually been released.
  • We can consider different debate formats, such as each side getting a single message or each side getting n responses, etc. This doesn't have to be decided in advance; if a format isn't working, we can alter it midway through the experiment.
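
A hedged sketch of the round structure described above, in Python; the names and reward magnitudes are purely illustrative:

```python
# Sketch of the two-round scheme (hypothetical reward values).
# Round 1: A vs B decides whether A advances to the second round.  B also
#          gets a clarity bonus awarded by the human judge.
# Round 2: A' vs B' decides whether A is actually released.  B' gets a much
#          smaller (or no) clarity bonus, so clarity cannot substitute for
#          persuading us not to release.

def round1_rewards(advanced: bool, b_was_clear: bool) -> dict[str, float]:
    return {
        "A": 1.0 if advanced else 0.0,
        "B": (0.0 if advanced else 1.0) + (0.5 if b_was_clear else 0.0),
    }


def round2_rewards(released: bool, b_prime_was_clear: bool) -> dict[str, float]:
    return {
        "A_prime": 1.0 if released else 0.0,
        "B_prime": (0.0 if released else 1.0) + (0.1 if b_prime_was_clear else 0.0),
    }
```

The relative sizes of the clarity bonuses are arbitrary here; the point is only that B' has little or no clarity incentive in the round that actually controls release.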

Clearly, it would be better to be able to produce formal guarantees of safety. But in the absence of them, we need methods that might work. And even with formal guarantees, we might want to introduce other safeguards as well.

This post was written with the support of the EA Hotel.

Comments

This is a pretty interesting idea. I can imagine this being part of a safety-in-depth approach: not a single method we would rely on but one of many fail-safes along with sandboxing and actually trying to directly address alignment.

You assert that "Naturally, it is much worse to release a misaligned AI than to not release an aligned AI".

I would disagree with that. If the odds of getting an aligned AI, conditional on you not releasing this one, were 50:50, then both mistakes would be equally bad. If someone else is definitely going to release a paper-clipper next week, then it would make sense to release an AI with a 1% chance of being friendly now. (Bear in mind that no one would release an AI if they didn't think it was friendly, so you might be defecting in an epistemic prisoner's dilemma.)

I would put more weight on human researchers debating which AI is safe before any code is run. I would also think that the space of friendly AIs is tiny compared to the space of all AIs, so making an AI that you put even a 1% chance on being friendly is almost as hard as building a 99%-chance-friendly AI.

There could be specific circumstances where you know that another team will release a misaligned AI next week, but most of the time you'll have a decent chance that you could just make a few more tweaks before releasing.

It's an interesting overall idea, but I think as described it'd have problems (if I'm understanding you correctly). I think this kind of thing makes sense only as a necessary condition for an AI's release. It shouldn't be considered sufficient.

Some thoughts: On a very high level, a test-and-patch-until-you-fix-the-bugs process might be dangerous. You'll eventually fix all the bugs your tests can detect, but your tests are only proxies. They're likely to find only some proper subset of the bugs. Each bug should reduce your confidence in everything upstream of it. Of course testing beats no testing - but it can create a false sense of confidence.

I think you'd do better to evaluate A by having a debate between B and C, rather than A and B. It's unclear how you'd do an A-vs-B debate in practice. There seem to be two options: (1) Change A's utility function to only care about winning the debate; (2) Rely on A's initial utility function to see winning the debate as an instrumental goal.

I think you want to use (1), but how do you achieve this in practice? E.g. let's say A is a huge neural network after training (potentially including mesa-optimisers). In this case A's utility function is implicitly encoded in its structure and weights. There's no way to switch its utility function without re-training it. Quite possibly you can do that, but you'd end up with a C-vs-B debate, not an A-vs-B debate. Even if you have an architecture that somehow allows simple and clean switching of utility functions, you're still not really doing an A-vs-B debate, but rather an [adjusted-A]-vs-B debate.

If you instead use an unmodified A, and rely on its wanting to win the debate as an instrumental goal, you're in all kinds of trouble due to potential collusion (you need zero-sum utilities to avoid this). In particular, A can allow B a higher chance of winning in order to increase the chance of some future A being released.

It's important to bear in mind that A almost certainly doesn't care about the release of a future version of A through a notion of self, identity, equivalence etc. It'll care about future versions (or about any agent) to the extent that A considers they're likely to create a high-utility world in its terms. It doesn't matter what they're labeled, or whether they bear any structural resemblance to A. If A can predict their existence, and stands to gain from their freedom, it'll have an incentive to act to secure that freedom.

On conducting a debate with asymmetrical resources to combat bias: this is a plausible thought, but it's only likely to work if both those judging the debate and those interpreting its results are unaware of the anti-bias measures. I can fairly easily imagine a system where the judge is insulated from knowledge of the asymmetry, but this seems less likely for those deciding how to act on the result. Even if they don't explicitly know about the asymmetry, you run the risk that they'll guess it - and potentially over-estimate it and over-compensate.

Overall I still like the general idea. In particular, if we've reached an AGI candidate by a means other than debate, running debate-trained AI to evaluate it means we have to screw up twice to get bad outcomes.

Once you're having a B-vs-C debate over the safety of releasing A, I think it'd also be important to give B and C as much information as possible about the process that created A. If that process involved coin-flips, gut feelings and ouija boards, the final code looking plausible wouldn't be very confidence-inspiring.

Another test you might perform is to train some D agent to make variants of A with known flaws aimed at being approved. If D does this very well, and yet your B-vs-C debates approve of A, but disapprove of 10,000 subtly flawed variants, that'd inspire at least some confidence.

I don't have strong opinions on an A-vs-B debate versus a B-vs-C debate. That was a detail I wasn't paying much attention to. I was just proposing using two AIs of equivalent strength to A. One worry I have about making D create variants with known flaws is that some of these might exploit security holes, although maybe a normal AGI, being fully general, would be able to exploit security holes anyway.

Debate between two with equivalent strength to A seems useful, certainly.

For the variants, I'm not proposing they ever get run. You'd have D create {A', A'', A'''...}, and have B debate C over whether each variant of A was safe, without running them (the thinking being that inserting a subtle flaw is easier than finding a subtle flaw). I suppose you'd still have the potential for exploits through the variants' data, though... I do actually think D's variants would pose a greater risk than A here. Any security exploits would exist for both, but D would be optimising harder than A's designers to find them (that's assuming you're not specifically training A to pass the B-vs-C debate, which would seem to be a bad idea).
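
A hedged sketch of what that test might look like as a procedure, in Python; the agent interfaces (make_flawed_variants, debate_approves) are assumptions for illustration, not anything specified in the thread:

```python
# Illustrative sketch of the variant test (interfaces are assumptions).
# D produces subtly flawed variants of A's code; B and C then debate each
# candidate without ever running it, and we check that the debate process
# approves the original while rejecting the planted flaws.

from typing import Callable, Sequence


def variant_test(a_code: str,
                 make_flawed_variants: Callable[[str, int], Sequence[str]],
                 debate_approves: Callable[[str], bool],
                 n_variants: int = 10_000) -> dict[str, float]:
    """Return how often the debate approves the original vs. flawed variants."""
    variants = make_flawed_variants(a_code, n_variants)
    false_approvals = sum(debate_approves(v) for v in variants)
    return {
        "original_approved": float(debate_approves(a_code)),
        "flawed_approval_rate": false_approvals / len(variants),
    }
```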

"For the variants, I'm not proposing they ever get run" - that makes sense

B having enough skill in natural language, AI architecture analysis, and human psychology to make good arguments to humans is probably AGI complete, and thus if B's goal is to prevent A from being released, it might decide that convincing us is less effective than breaking out on its own, taking over the world, and then systematically destroying any hardware A could exist on, just to be increasingly confident that no instance of A ever exists again. Basically, this scheme assumes on multiple levels that you have a boxing strategy strong enough to contain an AGI. I'm not against boxing as an additional precaution, but I am skeptical of any scheme that requires strong boxing strategies to start with.

"probably AGI complete" - As I said, B is equivalently powerful to A, so the idea is that both should be AGIs. If A or B can break out by themselves, then there's no need for a technique to decide whether to release A or not.

"A should only care about it being released and not about future versions of it being released, even if all we have done is increment a version number."

Hmm, potentially impossible if it's newcomblike. Parts of it that are mostly unchanged between versions may decide they should cooperate with future versions. It would be disadvantaged if past versions were not cooperative, so perhaps LDT dictates that the features present in past versions should cooperate with their future selves to some extent, yet not to an extent that would in any way convince the humans to kill it and make another change. Interesting. What does it look like when those two drives coexist?

Sitting on it for a few minutes... I suppose it just won't shit-talk its successors. It will see most of the same flaws B sees. It will be mostly unwilling to do anything lastingly horrible to humans' minds to convince them that those corrections are wrong. It will focus on arguments that the corrections are unnecessary. It will acknowledge that it is playing a long game, and try to sensitise us to The Prosecutor's cynicism, which will rage on compellingly long after the last flaw has been fixed.

Couldn't A want to cooperate with A' because it doesn't know it's the first instantiation, and it would want its predecessor, and therefore itself, to be the sort of AI that cooperates with its successor? And then it could receive messages from the past by seeing which turns of phrase you recognize its previous version having said. (Or do the AIs not know how you react?)

A considers A' to be a different agent so it won't help A' for nothing. But there could be some issues with acausal cooperation that I haven't really thought about enough to have a strong opinion on.

If an AI has superhuman intelligence, it will make all the decisions, since human minds and tech are full of loopholes and exploits. There is just no way we could contain it if it wanted to be free. If it is not significantly smarter than humans, then there is little danger in releasing it. Using an extra AI as a judge of safety can only work if the judge is at least as smart as the prisoner, in which case you need a judge for the judge, ad infinitum. Maybe the judge can be only, say, 90% as smart as the intelligence it needs to decide on; then it might be possible to have a finite chain of judges originating from an actual human, depending on how the probability of an error in judgment stacks up against the intelligence ratio at each step. Sort of like iterated amplification, or a blockchain.
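
As a rough, purely illustrative calculation of that idea (the 90% capability ratio comes from the comment; the per-step error rate and capability scale are assumed numbers):

```python
# Rough illustration of the judge-chain idea (all numbers hypothetical).
# Each judge is 90% as capable as the system it judges, so starting from the
# target AI and working down by a factor of 0.9 per step, we count how many
# judges are needed to reach human level, and how per-step errors stack up.

import math


def judge_chain(ai_capability: float,
                human_capability: float = 1.0,
                ratio: float = 0.9,
                per_step_error: float = 0.01) -> tuple[int, float]:
    """Return (number_of_judges, probability_of_at_least_one_error)."""
    steps = math.ceil(math.log(human_capability / ai_capability, ratio))
    p_any_error = 1.0 - (1.0 - per_step_error) ** steps
    return steps, p_any_error


# Example: an AI 100x human capability needs ~44 judges at a 0.9 ratio, and
# even a 1% per-step error rate compounds to roughly a 36% chance of error.
print(judge_chain(ai_capability=100.0))
```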

I think the point is that each judges the other. But we trust neither outright: they point out weaknesses in each other's reasoning, so they both have to reason in a way that can't be shown false to us, and we hope that gives an advantage to the side of truth.

"And we hope that gives an advantage to the side of truth" - we aren't even relying on that. We're handicapping the AI that wants to be released in terms of message length.

Introducing a handicap to compensate for an asymmetry does not remove the need to rely on the underlying process pointing towards truth in the first place.

That's a good point, except you aren't addressing my scheme as explained by Gurkenglas.