[ Question ]

What's wrong with these analogies for understanding Informed Oversight and IDA?

by Wei_Dai, 20th Mar 2019, 3 comments


Ω 11

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In "Can HCH epistemically dominate Ramanujan?", Alex Zhu wrote:

If HCH is ascription universal, then it should be able to epistemically dominate an AI theorem-prover that reasons similarly to how Ramanujan reasoned. But I don’t currently have any intuitions as to why explicit verbal breakdowns of reasoning should be able to replicate the intuitions that generated Ramanujan’s results (or any style of reasoning employed by any mathematician since Ramanujan, for that matter).

And I answered:

My guess is that HCH has to reverse engineer the theorem prover, figure out how/why it works, and then reproduce the same kind of reasoning.

And then I followed up my own comment with:

It occurs to me that if the overseer understands everything that the ML model (that it’s training) is doing, and the training is via some kind of local optimization algorithm like gradient descent, the overseer is essentially manually programming the ML model by gradually nudging it from some initial (e.g., random) point in configuration space.

No one replied to confirm or deny whether these guesses about how to understand Universality / Informed Oversight and IDA are correct. I'm surfacing this question as a top-level post because if "Informed Oversight = reverse engineering" and "IDA = programming by nudging" are good analogies for understanding Informed Oversight and IDA, they seem to have pretty significant implications.

In particular, they seem to imply that there's not much hope for IDA to be competitive with ML in general. If IDA is analogous to a highly constrained method of "manual" programming, it seems unlikely to be competitive with less constrained methods of "manual" programming (i.e., AIs designing and programming more advanced AIs in more general ways, similar to how humans do most programming today). Those less constrained methods are themselves presumably not competitive with general (unconstrained-by-safety) ML; otherwise ML would not be the competitive benchmark.

If these are not good ways to understand IO and IDA, can someone please point out why?



1 Answer

A universal reasoner is allowed to use an intuition "because it works." They take on extra obligations only once that intuition reflects further facts about the world that can't be cashed out as predictions confirmable on the same historical data that led us to trust the intuition.

For example, you have an extra obligation if Ramanujan has some intuition about why theorem X is true, you come to trust such intuitions by verifying them against a proof of X, but the same intuitions also suggest a bunch of other facts which you can't verify.

In that case, you can still try to be a straightforward Bayesian about it, and say "our intuition supports the general claim that process P outputs true statements"; you can then apply that regularity to trust P on some new claim even if it's not the kind of claim you could verify, as long as "P outputs true statements" had a higher prior than "P outputs true statements just in the cases I can check." That's an argument that someone can give to support a conclusion, and "does process P output true statements historically?" is a subquestion you can ask during amplification.
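To make the Bayesian step concrete, here is a toy sketch. Everything in it is an illustrative assumption rather than part of the definition of universality: the three hypotheses, the prior weights, and the base rates `unreliable_rate` and `narrow_rate` are made up for the example. The point it shows is that a track record on checkable claims rules out "P is just unreliable," but cannot shift the odds between "P is true in general" and "P is true only where I can check," so the prior does the remaining work.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of hypotheses."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


def trust_after_track_record(priors, n_verified, unreliable_rate, narrow_rate):
    """Update on n_verified checkable claims that all turned out true.

    Hypotheses (all assumed for illustration):
      general    - P outputs true statements everywhere
      narrow     - P is true on checkable claims, and true only at
                   narrow_rate on uncheckable ones
      unreliable - P is true at unreliable_rate everywhere
    """
    likelihoods = {
        "general": Fraction(1),                    # predicts every success
        "narrow": Fraction(1),                     # predicts the same successes
        "unreliable": unreliable_rate ** n_verified,
    }
    post = posterior(priors, likelihoods)
    # Probability that a NEW, uncheckable claim from P is true.
    p_true = (
        post["general"] * 1
        + post["narrow"] * narrow_rate
        + post["unreliable"] * unreliable_rate
    )
    return post, p_true


# Hypothetical numbers: "general" starts with a higher prior than "narrow".
priors = {
    "general": Fraction(6, 10),
    "narrow": Fraction(1, 10),
    "unreliable": Fraction(3, 10),
}
post, p_true = trust_after_track_record(
    priors, n_verified=10,
    unreliable_rate=Fraction(1, 2), narrow_rate=Fraction(1, 2),
)
# The general-to-narrow odds are unchanged by the data.
odds_before = priors["general"] / priors["narrow"]
odds_after = post["general"] / post["narrow"]
```

In this toy model `odds_after` equals `odds_before`, which is why the argument only goes through when the general hypothesis had the higher prior to begin with.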

The problem becomes hard when there are further facts that can't be supported by this Bayesian reasoning (and therefore might undermine it). E.g. you have a problem if process P is itself a consequentialist, who outputs true statements in order to earn your trust but will eventually exploit that trust for their own advantage. In this case, the problem is that there is something going on internally inside process P that isn't surfaced by P's output. Epistemically dominating P requires knowing about that.

See the second and third examples in the post introducing ascription universality. There is definitely a lot of fuzziness here and it seems like one of the most important places to tighten up the definition / one of the big research questions for whether ascription universality is possible.