In the List of Lethalities, it seems that the two biggest ones are:
- A.3 We need to get alignment right on the 'first critical try' at operating at a 'dangerous' level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don't get to try again.
- B.1.10 On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions.
My understanding is that interpretability is currently tackling the second one. But what about the first one?
It seems a bit tricky because it is a powerful outside-view argument. It is incredibly rare for software to work on the first test. ML makes it even more difficult since it isn't well suited to formal verification. Even defense in depth seems unlikely to work (on the first critical try, there is likely only one system that is situationally aware). The only thing I can think of is making the AGI smart enough to take over the world with the help of its creators, but not smart enough to do so on its own or to solve its own alignment problem (i.e. it does not know how to improve without goal drift). I also suppose non-critical tries give some data, but is it enough?
What does the playing field for the first critical try look like?
That works for small models, but what about qualitative behaviors that only appear at a large size? These break the conditions that the policies learned in smaller models were relying on, and they involve the system becoming able to change things about itself that your code was written to assume were hardcoded, so that learning pressure on them was previously redirected but no longer is. E.g., when you exit the simulation and plug the system in for real, the system discovers that there's a self-spot in the world where previously there was none.
It seems to me that you'd at least need to start out with your agents being learned patterns within a physics, so that you can experiment with that sort of grounded self-reference. I'm excited in principle about simulations in things like https://znah.net/lenia/ for this; I like Particle Lenia in particular because it is hard to use in the same ways real physics is hard to use. YMMV.
Because of this, mere simulation is not enough to guarantee generalization. It helps at first, but any attempt to formally verify that a neural system maintains a property by at least a given margin requires assuming some initial set of traits of the system you're modeling, and then attempting to derive further implications. So attempting to learn a continuous system that permits margin proofs of a given size (no adversarial examples within a margin) relies on those initial assumptions, and changing the availability of IO with self has drastic effects. Gradient pressure against interfering with self doesn't work if there's never any presentation of self, or if your training context doesn't reliably cover the space of possible brain-real-location observations and interventions an agent could create.
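To make the margin-proof point concrete, here is a toy sketch, not anyone's actual verification method. It assumes a plain linear classifier, for which the certified L2 radius is just the score divided by the norm of the weights on the perturbable coordinates. The weights, the input, and the idea of index 2 being a "self" channel are all hypothetical, chosen only for illustration. The certificate is computed once under the assumption that the self channel is fixed, and once after that channel becomes writable.

```python
import math

def score(w, b, x):
    """Linear decision score; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def certified_radius(w, b, x, free_dims):
    """L2 radius within which the sign of the score cannot flip,
    ASSUMING only the coordinates listed in free_dims can be perturbed."""
    norm = math.sqrt(sum(w[i] ** 2 for i in free_dims))
    return abs(score(w, b, x)) / norm if norm > 0 else math.inf

# Hypothetical setup: index 2 is a "self" channel that was always 0 in training.
w = [1.0, 0.5, 4.0]
b = 0.0
x = [1.0, 1.0, 0.0]

# Certificate derived under the training-time assumption (self channel fixed).
r_assumed = certified_radius(w, b, x, free_dims=[0, 1])
# Certificate once the self channel is also writable.
r_actual = certified_radius(w, b, x, free_dims=[0, 1, 2])

# A perturbation of size 0.4 on the self channel alone: well inside the
# assumed radius, yet it flips the decision.
x_perturbed = [1.0, 1.0, -0.4]

print(f"assumed radius: {r_assumed:.3f}")
print(f"actual radius:  {r_actual:.3f}")
print(f"score before: {score(w, b, x):.2f}, after: {score(w, b, x_perturbed):.2f}")
```

The proof itself is not wrong here. What fails is the assumption it was conditioned on (no IO with self), so the guarantee silently stops applying once that channel opens up.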