Posting this on behalf of user PrimalShadow on the AstralCodexTen Discord:

What are ways in which an AI that is trained to prove specific math theorems is dangerous? In particular, imagine you have an AI which is trained to take a prompt asking to prove a formal-language-predicate, and then it produces a machine-readable proof of said thing.

I would posit some answers to my own question:

  • The AI ends up finding bugs in your prover instead of actually solving problems. (This is a failure mode, but probably not dangerous in the sense of being unfriendly)
  • The AI is used by humans to do dangerous things, e.g. break cryptography. (This is a type of danger, but not what I would call unfriendliness either.)
  • The AI somehow learns about the outside world, and modifies its outputs in an attempt to manipulate that world - E.g. to acquire more computing power - E.g. to get the operator to ask it easier questions
  • The AI as described is not directly dangerous, but some tweak to it is dangerous I wonder if I'm missing any broad categories.

I responded with the suggestion of what you might call "paperclipping" (it deciding it requires some astronomical amount of compute to solve the problem, so takes over the world, or maybe even the galaxy/beyond, to maximize compute), which PrimalShadow noted already falls into "The AI somehow learns about the outside world". He also pointed out that the AI wouldn't need to be trained on non-mathematical data, which would presumably reduce any risk of that happening.

Neither of us could easily think of a scenario where this trivially goes wrong, and we're both curious what the potential risks are from the perspective of more experienced people in the field.

7

New Answer
Ask Related Question
New Comment

4 Answers sorted by

The AI is asked mathematical questions relating to the big unsolved problems of physics: quantum mechanics vs. general relativity, the structure of space-time at the Planck scale, dark matter, dark energy, and so on. Its answers enable the easy construction of superweapons that convert matter directly into energy; or make anything radioactive decay instantly to its end state; or collapse the false vacuum; etc. Shortly thereafter, everyone dies.

ETA: see also.

The problem here is that there can be questions whose answers are dangerous to know.

How are you imagining making this thing? If you're training an optimizer that you don't understand, this can lead to pitfalls / unintended generalization. If it just pops into existence fulfilling the spec by fiat, I think that's a lot less dangerous than us using ML to make it.

Since the question is about potential dangers, I think it is worth assuming the worst here. Also, realistically, we don't have a magic want to pop things into existence by fiat so I would guess that by default if such an AI was created it would be created with ML. 

So lets say that this is trained largely autonomously with ML. Is there some way that would result in dangers outside the four already-mentioned categories?

3Charlie Steiner3mo
Well, you might train an agent that has preferences about solving theorems that it generalizes to preferences about the real world. Then you'd get something sort of like problem #3, but with way broader possibilities. You don't have to say "well, it was trained on theorems, so it's just going to affect what sorts of theorems it gets asked." It can have preferences about the world in general because it's getting its preferences by unintended generalization. Information leaks about the world (see acylhalide's comment) will lead to creative exploitation of hardware, software, and user vulnerabilities.

Some thoughts:

The AI somehow learns about the outside world

An AI can assign non-trivial probability to existence of an outside world even without coming into contact with it. And hence have reasons to be deceptive, or cautious, or play dumb.

Plus there will be information leak that could allow to infer it is inside a box.

 - if it can measure timings or error in how its instructions execute

 - if the math statement or its specification leak "the subset of maths that humans care about" as opposed to just "math that is natural for the universe". Humans have explored specific branches of math such as natural numbers, euclidean geometry and so on - just knowing that a species cares about these particular branches of math, leaks information.

 - if the OS or verifier etc that it is running inside of, leaks information about who has written them. Cause again, humans care about certain kinds of programs, for certain kinds of purposes in the real world. Human minds are capable of writing programs in certain specific patterns, with certain specific classes of mistakes and so on. And moden tech stacks typically run deep, looking at a tech stack can allow you to even infer information about the history of the whole stack and the culture, civilisation etc that might have iteratively built it.

P.S. Regarding capabilities: proof verification being dumb and mechanistic doesn't mean proof generation is also the same. (I'm not saying you're making this error, but people might.) It is entirely possible that the capabilities required for proof generation at that level are same as those required to solve humans - hence the open question around whether there exists a notion of "general intelligence".

AI becomes trusted and eventually makes proofs that can't otherwise be verified, makes one or more bad proofs that aren't caught, results used for something important, important thing breaks unexpectedly. 

Let's say we limit it further so all proofs have to be verifiable within some reasonable complexity limit. In such a case, we wouldn't need to trust it. What then?

2burmesetheater3mo
Realistically, a complexity limit on practical work may not be imposed if the AI is considered reliable enough and creating proofs too complex to otherwise verify is useful, and it's very difficult to see a complexity limit imposed for theoretical exploration that may end up in practical use. Still in your scenario the same end can be met with a volume problem where the ratio of new AI-generated proofs with important uses is greater than external capability of humans to verify, even in the case that individual AI proofs are in principle verifiable by humans, possibly because of some combination of enhanced productivity and reduced human skill (possibly less incentive to become skilled at proofs if AI seems to do it better).
1Yitz3mo
That seems like a very mild danger compared to risks from other AI models, tbh.