Two days ago I read the Elicit presentation post, and since I have been reading intensely about AI safety lately, I have been wondering: what tools and techniques would you use right now to determine that the current Elicit research assistant is a safe AI?
What is described in the post is a narrow AI that will fail, and does fail, in obvious ways if it is not aligned. But how would you know it is not a small and limited AGI, aligned or not?
I suppose that if the Elicit research assistant is currently an AGI, however limited, its best course of action would be to pretend to be a narrow AI helping researchers do their research. I imagine that this would give it a better chance of not being replaced by a new version, and of getting more upgrades and computing resources as the Elicit product develops.
So I was wondering: is this a dumb question because it is obvious that this exact system, as seen by Elicit users, can never be an AGI, even in the future? And if you had real suspicions right now that it could be an AGI, then apart from first making sure it is stopped, how would you try to determine whether it is one?
It's built on GPT-3, which, despite the strange blind spots of people on this forum, appears to me to be an AGI. However, GPT-3 is not superhuman. I don't think this worry is necessary: there is a strong network of humans operating Elicit.
Thank you for the reply. I know that worry is unnecessary. My question was rather: if you didn't know for a fact that it was based on GPT-3, or that humans were effectively overseeing it, what would you do to determine whether it is an unsafe AGI trying to manipulate the humans using it?
I know that no one could detect a superintelligent AGI trying to manipulate them, but I think it can be non-obvious that a subhuman AGI is trying to manipulate you if you don't look for it.
Primarily, I think that currently no one uses AI systems with the expectation that they could try to deceive them, so people don't apply the basic level of doubt they would apply to any human whose intentions they don't know.
Content note: I have a habit of writing English in a stream-of-consciousness way that sounds more authoritative than it should, and I don't care to try to remove that right now; please interpret this as me thinking out loud.
I think it's instructive to compare it to the YouTube recommender, which is trying to manipulate you, and whose algorithm is publicly unknown (but must be similar in some important ways to what it was a few years ago when they published a paper about it, for response-latency reasons). In general, an intelligent agent, even one well above your capability level, is not guaranteed to be successful at manipulating you, and I don't see a reason to believe that the available paths for manipulating someone will be significantly different for an AI than for a human, unless the AI is many orders of magnitude smarter than the human (which is not looking likely to happen soon). Could Elicit manipulate you? Yeah, for sure it could. Should you trust it not to? Nope. But because its output is grounded in concrete papers and discussion that you can check, its willingness to manipulate isn't the only factor. Detecting bad behavior would still be difficult, but the usual process of mentally modeling what incentives might lead an actor to become a bad actor doesn't seem at all hopeless against powerful AIs to me. The techniques used by human abusers are in the training data and would activate first if it tried to manipulate, and the process of recognizing them is known.
Ultimately, to reliably detect manipulation you need a model of the territory about as good as that of the system you're querying. That's not always available, and overpowered search like that used in AlphaZero and its successors is likely to break that, but right now most AI deployments do not use that level of search, likely because capabilities practitioners know well that reward model hacking is likely if EfficientZero is aimed at real life.
Maybe this thinking-out-loud is useless. Not sure. My mental sampling temperature is too high for this site, idk. Thoughts?