The AGI needs to be honest



12 comments

In general terms, proof finding has exponential complexity while verification is polynomial. "Difficulty" in checking formal proofs is not a thing. The issue is that a large portion of science uses much weaker reasoning than formal systems: somewhat vague, intuitive arguments (compared to formal math proofs) that rely on common-sense priors. That is a possible route for falsehoods from LMs, but again it is not really an issue. Humans have deliberately abused the scientific process since its inception. As long as we rely on the process as it is (instead of taking AI or whatever other authority at its word), we are fine.
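The asymmetry being invoked here is the standard NP one, and a toy SAT instance (my own illustration, not from the thread) makes it concrete: checking a candidate assignment is one polynomial pass over the clauses, while the only known general way to *find* one is a search over 2^n candidates.

```python
from itertools import product

# Toy CNF formula: (x0 or not x1) and (x1 or x2) and (not x0 or not x2).
# Each clause is a list of (variable_index, required_polarity) pairs.
cnf = [[(0, True), (1, False)], [(1, True), (2, True)], [(0, False), (2, False)]]

def check(assignment, cnf):
    """Verification is polynomial: one pass over clauses and literals."""
    return all(any(assignment[v] == pol for v, pol in clause) for clause in cnf)

def search(cnf, n_vars):
    """Brute-force search is exponential: up to 2**n_vars candidates."""
    for bits in product([False, True], repeat=n_vars):
        if check(bits, cnf):
            return bits
    return None

sol = search(cnf, 3)
```

A proof in a language like Coq plays the same role as the satisfying assignment: once you have it, checking it is cheap relative to finding it, which is the commenter's point.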

Also, the singularity is not a thing one should fear. Most interesting problems have exponential complexity, and this stands as a hard limitation for any system. There is pretty much no way around it. At best LMs will be human-like in terms of AGI. They can be X times faster, but that advantage vanishes in front of problems with exponential complexity.

Language models, if they are AGI, would surely surpass human-level understanding of language. Humans need language for communication and book-keeping; "words" in any language mostly name abstractions that are interesting from a human point of view. A language model need not have any language at all, since it doesn't have an internal dialogue like humans do.

As it reaches a certain level of intelligence, it starts forming increasingly complex abstractions that don't (and won't) have any vocabulary. It would be impossible to interpret its reasoning, and the only option left is to accept it.

Verifying a proof of the Riemann hypothesis is not harder than generating one. But if you have access to an alleged proof of the Riemann hypothesis that is long enough that verification itself is super-hard, then you have no evidence that the proof is correct, unless you prove that the AGI generating it is indeed capable of producing such a proof and is being honest with you.

Thinking about this a bit more: it's theoretically possible that the AGI has some high-level proof that expands recursively into an almost infinite number of lines of some lower-level formal language.

Presumably the solution is then to get the AGI to build a theorem prover for a higher-level formal language, provide a machine-checkable proof that the theorem prover is correct, and so on recursively until you can check the high-level proof.
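The recursive scheme can be sketched as a chain of checkers, each certified by the one below it, bottoming out in a small kernel humans can audit directly. This is a toy model of the trust structure only (the dicts and the `verify` rule are hypothetical stand-ins, not a real proof kernel):

```python
def kernel_check(obj):
    """Tiny trusted base: stands in for a small, human-auditable proof kernel."""
    return obj.get("certified_by_kernel", False)

def build_trust_chain(checkers, big_proof):
    """checkers[0] must pass the kernel; checkers[k] must pass checkers[k-1];
    the last (highest-level, fastest) checker then validates the big proof."""
    if not kernel_check(checkers[0]):
        return False
    for lower, higher in zip(checkers, checkers[1:]):
        if not lower["verify"](higher):
            return False
    return checkers[-1]["verify"](big_proof)

def make_checker(level):
    # Toy verification rule: a checker accepts exactly objects one level above it.
    return {
        "certified_by_kernel": level == 0,
        "level": level,
        "verify": lambda obj, level=level: obj.get("level") == level + 1,
    }

checkers = [make_checker(k) for k in range(3)]
big_proof = {"level": 3}
```

The point of the structure is that trust only ever flows upward from the kernel: you never have to audit the gigantic high-level proof by hand, only the small base checker.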

Also, the AGI can generate a long valid proof that is not a proof of the question you actually asked, since the assumption is that the problem is described in natural language and it's the AGI's job to understand it, convert it to formal language, and then prove it.

I think instead of recursively asking for higher-level proofs, it should be a machine-checkable proof regarding the correctness of the AGI itself?

> You have asked the bot to summarize the proof for you in natural language, but you know that the bot can easily trick you into accepting the proof.

I didn't fully get this point. Sure, there definitely are gigantic computer-generated proofs out there, the most famous being the proof of the Four Color Theorem (which some mathematicians in the 80s rejected precisely because it was infeasible to check by hand); I accept that our bot could produce a truly huge blob.

On the other hand, optimizing a bogus proof for human validation sounds even more difficult than just searching for a correct proof. To search for correct proofs you just need to know math; to search for a plausible bogus proof you need to know math *and* how human mathematicians typically validate proofs. Are we just assuming we're working with AGI here?

Imagine that you are a trained mathematician and you have been assigned the job of testing an arbitrarily intelligent chatbot for its intelligence.

Being knowledgeable about a fair amount of computer-science theory, you won't test it with the likes of the Turing test, since such a bot might not have any useful priors about the world.

You have asked it to find a proof of the Riemann hypothesis. The bot started its search program, and after several months it gave you a gigantic proof written in a proof-checking language like Coq.

You have tried to run the proof through a proof-checking assistant, but quickly realized that checking it would take years or decades; also, no other computer except the one running the bot is sophisticated enough to process a proof of such length.

You have asked the bot to provide you a zero-knowledge proof, but being a trained mathematician you know that a zero-knowledge proof of sufficient credibility requires as much compute as the original one; also, the correctness is directly tied to the length of the proof it generates.

You know that the bot may have formed increasingly complex abstractions while solving the problem, and it would be very hard to describe those to you in exact words.

You have asked the bot to summarize the proof for you in natural language, but you know that the bot can easily trick you into accepting the proof.

You have now started to think about a bigger question. The bot is essentially a powerful optimizer. In this case, the bot is trained to find proofs; its reward is based on producing what a group of mathematicians agree a correct proof looks like.

But the bot, being a bot, doesn't care about being honest with you or with itself. It is not rewarded for being "honest"; it is only rewarded for finding proof-like strings that humans may accept or reject.

So it is far easier for it to find a large Coq program, large enough that you cannot check it by any other means, than to actually solve the Riemann hypothesis.

Now you have concluded that before you certify that the bot is intelligent, you have to prove that the bot is being honest.

Going by the current trend, it is okay for us to assume that such an arbitrarily intelligent bot would have a significant part of it based on the principles of the current deep-learning stack. Assume it to be a large neural-network-based agent; also assume that the language-understanding component is somewhat based on current language-model design.

So how do you know that the large language model is being honest?

A quick look at the plots of results on the TruthfulQA dataset shows that truthfulness decreases with model size. Going by this trend, large models trained on large datasets are more likely to give fluke answers to significantly complex questions.

Any significantly complex decision question, if cast as an optimization problem, has one hard-to-find global minimum called "truth" but an extremely large number of easy-to-find local minima of falsehoods. How do you then make a powerful optimizer optimize for honesty?
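The picture of one deep "truth" basin surrounded by many shallow falsehood basins can be illustrated with a deliberately rugged 1-D loss (a toy of my own, not a claim about real training landscapes): plain gradient descent reaches the global minimum only when it starts inside its basin, and otherwise stalls in a nearby local minimum.

```python
import math

def loss(x):
    # One global minimum at x = 0 ("truth"), ringed by many shallow
    # local minima ("falsehoods") created by the oscillating term.
    return x * x + 2.0 * math.sin(8.0 * x) ** 2

def gradient_descent(x, lr=0.002, steps=2000):
    for _ in range(steps):
        # Central-difference numerical gradient of the loss.
        g = (loss(x + 1e-6) - loss(x - 1e-6)) / 2e-6
        x -= lr * g
    return x

near = gradient_descent(0.05)  # starts inside the global ("truth") basin
far = gradient_descent(2.0)    # starts inside a local ("falsehood") basin
```

`near` converges to the global minimum, while `far` gets stuck near its starting point with strictly higher loss, which is the optimizer's-eye view of the problem the question raises.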