The results I prove assume realizability, and some of the results are about traps, but independent of the results, the algorithm for picking actions resembles infra-Bayesianism. So I think we're taking similar objects and proving very different sorts of things.
Looks like we've been thinking along very similar lines! https://www.lesswrong.com/posts/RzAmPDNciirWKdtc7/pessimism-about-unknown-unknowns-inspires-conservatism
Would it count if a malicious actor successfully finetuned GPT-3 to e.g. incite violence while maintaining plausible deniability?
Yes, that would count. I suspect that many "unskilled workers" would (alone) be better at inciting violence while maintaining plausible deniability than GPT-N at the point in time the leading group had AGI. Unless it's OpenAI, of course :P
Regarding intentionality, I suppose I didn't clarify the precise meaning of "better at", which I did take to imply some degree of intentionality, or else I think "ends up" would have been a better word choice. The impetus for this point was Paul's concern that someone would have used an AI to kill you to take your money. I think we can probably avoid the difficulty of a rigorous definition of intentionality if we gesture vaguely at "the sort of intentionality required for that to be viable"? But let me know if more precision would be helpful, and I'll try to figure out exactly what I mean. I certainly don't think we need to make use of a version of intentionality that requires human-level reasoning.
Are you predicting there won't be any lethal autonomous weapons before AGI?
No... thanks for pressing me on this.
Better at killing in a context where either: the operator would punish the agent if they knew, or the state would punish the operator if they knew. So the agent has to conceal its actions at whichever level the punishment would occur.
You're right--valuable is the wrong word. I guess I mean better at killing.
Yep, I agree it is useless with a horizon length of 1. See this section:
For concreteness, let its action space be the words in the dictionary, and I guess 0-9 too. These get printed to a screen for an operator to see. Its observation space is the set of finite strings of text, which the operator enters.
So at longer horizons, the operator will presumably be pressing "enter" repeatedly (i.e. submitting the empty string as the observation) so that more words of the message come through.
This is why I think the relevant questions are: at what horizon-length does it become useful? And at what horizon-length does it become dangerous?
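To make the setup concrete, here is a minimal sketch of the interaction loop as I've described it: one word (or digit) printed per step, and one operator-entered string observed per step. The tiny vocabulary, the names `run_episode`, `scripted_policy`, and `enter_presser`, and the scripted policy itself are all hypothetical stand-ins for illustration, not part of any actual agent design.

```python
import string
from typing import Callable, List, Tuple

# Hypothetical toy vocabulary standing in for "the words in the dictionary", plus 0-9.
VOCAB = {"hello", "world", "yes", "no"} | set(string.digits)

History = List[Tuple[str, str]]  # (action, observation) pairs


def run_episode(
    policy: Callable[[History], str],
    operator: Callable[[str], str],
    horizon: int,
) -> History:
    """Run `horizon` steps: the agent prints one word, the operator replies with a string."""
    history: History = []
    for _ in range(horizon):
        action = policy(history)
        if action not in VOCAB:
            raise ValueError(f"action {action!r} is outside the action space")
        observation = operator(action)  # any finite string, possibly empty
        history.append((action, observation))
    return history


def scripted_policy(message: List[str]) -> Callable[[History], str]:
    """Assumed placeholder policy: emit the words of a fixed message in order."""
    def policy(history: History) -> str:
        return message[len(history)]
    return policy


def enter_presser(action: str) -> str:
    """An operator who just presses enter, i.e. submits the empty string."""
    return ""


if __name__ == "__main__":
    episode = run_episode(scripted_policy(["hello", "world", "4"]), enter_presser, horizon=3)
    print(" ".join(action for action, _ in episode))
```

With `horizon=1` only a single word ever reaches the screen, which is the sense in which the agent is useless at that horizon; everything interesting depends on how large `horizon` has to be.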
At this point, the AI has a strong incentive to manipulate its memory to produce cell phone signals, and create a superintelligence set to the task of controlling its future inputs.
Picking subroutines to run isn't in its action space, so it doesn't pick subroutines to maximize its utility. It runs subroutines according to its code. If the internals of the main agent involve an agent making choices about computation, then this problem could arise. Now we're not talking about a chatbot agent but a totally different agent. I think you anticipate this objection when you say:
(If this is outside its action space, then it can try to make a brainwashy message)
In one word??
Suppose you can't get the human to type the exact input you want now, but you can get the human to go away without inputting anything, while it slowly bootstraps an ASI which can type the desired string
Again, its action space is printing one word to a screen. It's not optimizing over a set of programs and then picking one in order to achieve its goals (perhaps by bootstrapping ASI).
Okay. I'll lower my confidence in my position. I think these two possibilities are strategically different enough, and each sufficiently plausible, that we should come up with separate plans/research agendas for both of them. And then those research agendas can be critiqued on their own terms.
For the purposes of this discussion, I think this qualifies as a useful tangent, and this is the thread where a related disagreement comes to a head.
Edit: "valuable" was the wrong word. "Better at killing" is more to the point.
I mean that we don't have any process that looks like debate that could produce an agent that wasn't trying to kill you without being competitive
It took me an embarrassingly long time to parse this. I think it says: any debate-trained agent that isn't competitive will try to kill you. But I think the next clause clarifies that any debate-trained agent whose competitor isn't competitive will try to kill you. This may be moot if I'm getting that wrong.
So I guess you're imagining running Debate with horizons that are long enough that, in the absence of a competitor, the remaining debater would try to kill you. It seems to me that you put more faith in the mechanism that I was saying didn't comfort me. I had just claimed that a single-agent chatbot system with a long enough horizon would try to take over the world:
The existence of an adversary may make it harder for a debater to trick the operator, but if they're both trying to push the operator in dangerous directions, I'm not very comforted by this effect. The probability that the operator ends up trusting one of them doesn't seem (to me) so much lower than the probability the operator ends up trusting the single agent in the single-agent setup.
Running a debate between two entities that would both kill me if they could get away with it seems critically dangerous.
Suppose two equally matched people are trying to shoot a basket from opposite ends of the 3-point line, before their opponent makes a basket. Each time they shoot, the two basketballs collide above the hoop and bounce off of each other, hopefully. Making the basket first = taking over the world and killing us on their terms. My view is that if they're both trying to make a basket, a basket being made is a more likely outcome than a basket not being made (if it's not too difficult for them to make the proverbial basket).
Side comment: so I think the existential risk is quite high in this setting, but I certainly don't think the existential risk is so low that there's little existential risk left to reduce with the boxing-the-moderator strategy. (I don't know if you'd have disputed that, but I've had conversations with others who did, so this seems like a good place to put this comment.)
No, but what are the approaches to avoiding deceptive alignment that don't go through competitiveness?
We could talk for a while about this. But I'm not sure how much hangs on this point if I'm right, since you offered this as an extra reason to care about competitiveness, but there's still the obvious reason to value competitiveness. And idea space is big, so you would have your work cut out to turn this from an epistemic landscape where two people can reasonably have different intuitions to an epistemic landscape that would cast serious doubt on my side.
But here's one idea: have the AI show messages to the operator that cause them to do better on randomly selected prediction tasks. The operator's prediction depends on the message, obviously, but the ground truth is the counterfactual ground truth had the message never been shown, so the AI's message can't affect the ground truth.
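A toy sketch of the scoring structure I have in mind, under heavy simplifying assumptions: tasks and messages are plain numbers, the "world" is a made-up function, and the operator simply adopts the message as their prediction when one is shown. All names here (`counterfactual_truth`, `operator_prediction`, `score`) are hypothetical. The only point is structural: the function that produces the ground truth never takes the message as an argument.

```python
from typing import Optional


def counterfactual_truth(task: float) -> float:
    """Ground truth in the world where no message is ever shown.

    The message is deliberately not a parameter, so it cannot influence the outcome.
    The doubling rule is an arbitrary stand-in for "how the world works".
    """
    return 2.0 * task


def operator_prediction(task: float, message: Optional[float]) -> float:
    """Assumed operator model: trust the message if shown, else fall back on a naive prior."""
    if message is None:
        return task  # naive prior: predict the task value itself
    return message


def score(task: float, message: Optional[float]) -> float:
    """Negative squared error of the operator's prediction against the counterfactual truth."""
    error = operator_prediction(task, message) - counterfactual_truth(task)
    return -(error ** 2)


if __name__ == "__main__":
    task = 3.0
    print("no message:     ", score(task, None))
    print("accurate message:", score(task, 6.0))
    print("misleading one:  ", score(task, 10.0))
```

In this toy the message only earns a good score by making the operator's prediction more accurate; it has no channel through which to change what the right answer is.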
And then more broadly, impact measures, conservatism, or utility information about counterfactuals to complicate wireheading seem at least somewhat viable to me, and then you could have an agent that does more than show us text that's only useful if it's true. In my view, this approach is way more difficult to get safe, but if I had the position that we needed parity in competitiveness with unsafe competitors in order to use a chatbot to save the world, then I'd start to find these other approaches more appealing.