Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for https://arxiv.org/abs/2211.06738

(I did not have anything to do with this paper and these are just my own takes.)

The Alignment Research Center recently published their second report, Formalizing the presumption of independence. While it's not explicitly about AI alignment, it's probably still interesting for some people here.

Summary

The paper is about "heuristic arguments". These are similar to proofs, except that their conclusions are not guaranteed to be correct and can be overturned by counterarguments. Mathematicians often use these kinds of arguments, but in contrast to proofs, they haven't been formalized. The paper mainly describes the open problem of finding a good formalization of heuristic arguments. They do describe one attempt, "cumulant propagation", in Appendix D, but point out it can behave pathologically.

So what's the "presumption of independence" from the title? Lots of heuristic arguments work by assuming that some quantities are independent to simplify things, and that's what the paper focuses on. Such an argument can be overturned by showing that there’s actually some correlation we initially ignored, which should then lead to a more sophisticated heuristic argument with a potentially different conclusion.
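To make this concrete, here is the classic informal example of this style of reasoning (my own illustration, not taken from the paper): estimate how many twin prime pairs there are up to N by pretending that "n is prime" and "n + 2 is prime" are independent events, each holding with probability roughly 1/ln(n), then refine the estimate once you notice a correlation the naive argument ignored. A minimal Python sketch:

```python
import math

def sieve(n):
    """Sieve of Eratosthenes: is_prime[k] is True iff k is prime, for 0 <= k <= n."""
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            for m in range(p * p, n + 1, p):
                is_prime[m] = False
    return is_prime

N = 10**6
is_prime = sieve(N)

# Actual number of twin prime pairs (n, n + 2) with n + 2 <= N.
actual = sum(1 for n in range(3, N - 1) if is_prime[n] and is_prime[n + 2])

# Naive heuristic: presume "n is prime" and "n + 2 is prime" are independent,
# each holding with probability roughly 1 / ln(n).
naive = sum(1.0 / math.log(n) ** 2 for n in range(3, N - 1))

# The naive argument ignores a correlation: primality of n and n + 2 is
# correlated modulo every small prime (e.g. if n > 2 is prime, then n + 2 is
# automatically odd). Correcting for the primes below 100 multiplies the
# estimate by roughly the twin prime constant 2 * C_2 ~ 1.32.
correction = 2.0
for p in range(3, 100):
    if is_prime[p]:
        correction *= 1.0 - 1.0 / (p - 1) ** 2

print(f"actual twin prime pairs up to {N}: {actual}")                  # 8169
print(f"naive independence estimate:       {naive:.0f}")               # around 6,200
print(f"corrected heuristic estimate:      {correction * naive:.0f}")  # around 8,200
```

The refinement doesn't invalidate the style of reasoning; the naive independence assumption just gets replaced by a more careful argument that accounts for the correlation, which is exactly the kind of revision described above.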

What does this have to do with alignment?

The paper only very briefly mentions alignment (in Appendix F); more detailed discussion is planned for the future. But roughly:

Avoiding catastrophic failures. Heuristic arguments can let us better estimate the probability of rare failures, or failures which occur only on novel distributions where we cannot easily draw samples. This can be used during validation to estimate risk, or potentially during training to further reduce risk.
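As a toy illustration of this first application (my own sketch under made-up assumptions, not the paper's proposal): suppose a model fails only when two rare internal conditions A and B hold at once, so direct sampling essentially never surfaces the failure. Presuming independence turns two measurable marginal rates into an estimate of the joint failure rate. The run_model function and the 1e-3 rates below are entirely hypothetical:

```python
import random

random.seed(0)

def run_model():
    """Stand-in for one forward pass of a model; returns two rare internal
    conditions A and B. In this toy example they really are independent."""
    a = random.random() < 1e-3   # condition A fires at rate ~1e-3
    b = random.random() < 1e-3   # condition B fires at rate ~1e-3
    return a, b

samples = [run_model() for _ in range(10_000)]

# Direct Monte Carlo estimate of the failure rate P(A and B): with a true rate
# of ~1e-6, ten thousand samples will almost always contain zero failures.
direct = sum(1 for a, b in samples if a and b) / len(samples)

# Heuristic estimate under a presumption of independence:
# P(A and B) ~= P(A) * P(B), with both marginals estimated from the samples.
p_a = sum(1 for a, _ in samples if a) / len(samples)
p_b = sum(1 for _, b in samples if b) / len(samples)
heuristic = p_a * p_b

print(f"direct estimate:              {direct:.1e}")
print(f"independence-based estimate:  {heuristic:.1e}")
```

If A and B were in fact correlated in the real system, this estimate could be overturned by an argument exhibiting that correlation, which is the kind of defeasibility the paper wants a formal heuristic estimator to handle.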

Eliciting latent knowledge. Heuristic arguments may let us see “why” a model makes its predictions. We could potentially use them to distinguish cases where similar behaviors are produced by very different mechanisms—for example distinguishing cases where a model predicts that a smiling human face will show up on camera because it predicts there will actually be a smiling human in the room, from cases where it makes the same prediction because it predicts that the camera will be tampered with. [...]

Neither of these applications is straightforward, and it should not be obvious that heuristic arguments would allow us to achieve either goal. [...]

Heuristic arguments can be seen as somewhere between interpretability and formal verification: unlike interpretability, heuristic arguments are meant to be machine-checkable and don't have to be human-understandable. But unlike formal proofs, they don't require perfect certainty and might be much easier to find.

Readers here might also be reminded of Logical Induction. This paper is trying to do something somewhat different though:

[Approaches to logical uncertainty] have primarily focused on establishing coherence conditions and on capturing inductive reasoning, i.e. ensuring that a reasoner eventually successfully predicts φ(n) given observations of φ(1), φ(2), . . . φ(n − 1). These systems would not automatically recognize intuitively valid heuristic arguments [...], although they would eventually learn to trust these arguments after observing them producing good predictions in practice.

Indeed, we can view ourselves as reasoners in exactly this situation, trying to understand and formalize a type of reasoning that appears to often make good predictions in practice. Formalizations of inductive reasoning may help clarify the standards we should use for evaluating a proposed heuristic estimator, but do not constitute a good heuristic estimator themselves.

So should you read the paper?

Given it's a 60-page report (though most of that's appendices) with basically no explicit discussion of alignment, I don't think this is a "must-read" for everyone. For example, if you haven't read the ELK report, I would strongly recommend that over this new paper.

On the other hand, if you work on something related, such as formal verification, ELK, or conceptual interpretability research, I think it makes a lot of sense to at least look at the main paper and Appendix F (16 pages and quite readable).

Personally, I also think this is just really interesting independent of alignment. Appendices B and C were my favorite parts from that perspective (though also the most speculative ones).

Comments

Great work!

Stuart Armstrong gave one more example of a heuristic argument based on the presumption of independence here:

https://www.lesswrong.com/posts/iNFZG4d9W848zsgch/the-goldbach-conjecture-is-probably-correct-so-was-fermat-s

There are a huge number of examples like that floating around in the literature; we link to some of them in the writeup. I think Terence Tao's blog is the easiest place to get an overview of these arguments; see this post in particular, though he discusses this kind of reasoning often.

I think it's easy to give probabilistic heuristic arguments for about 80 of the ~100 conjectures in the Wikipedia category "unsolved problems in number theory".

About 30 of those (including the Goldbach conjecture) follow from the Cramér random model of the primes. Another 9 are slightly non-trivial applications of random models for the primes. About 8 of them are simple heuristics for Diophantine equations (like Fermat's last theorem). I estimate that another ~30 have arguments that are more diverse and interesting (I estimated the size of this set by randomly sampling some conjectures, stratified by difficulty, and seeing how often we could find an argument in an hour).

We'd guess that it's also possible to give arguments for the remaining ~20; it would just be too hard for us to do within an hour. Random representative examples for which we don't know heuristic arguments, sorted by apparent difficulty for a layperson:

  • The surprisingly low density of solutions to Lehmer's totient problem (link)
  • The existence of an incongruent covering system with odd moduli (link)
  • The Birch and Swinnerton-Dyer conjecture (link)
  • The Grothendieck-Katz p-curvature conjecture (link)
  • Serre's Conjecture II (link)

This category is interesting to us because:

  1. We are interested in the conjecture that all true statements have a probabilistic heuristic argument for their plausibility, and this is a possible source of counterexamples to that conjecture.
  2. This is a nice suite of test cases to see if a proposed formalization of probabilistic heuristic arguments captures the kinds of arguments that are intuitively valid.

That said, I think that many people believe that number theory is an unusual domain that is particularly amenable to probabilistic heuristic arguments, and so it's likely not the best place to search for counterexamples.