A few Alignment questions: utility optimizers, SLT, sharp left turn and identifiability

1bionicles


Is Empowerment a great way to quantify alignment (as expected information gain in terms of the mutual information between actions and future states)? I’m not sure how to get from A to B, but presumably one can measure conditional empowerment of some trajectories of some set of agents in terms of the amount of extra self-control imparted to the empowered agents by virtue of their interaction with the empowering agent. Perhaps the CATE (Conditional Average Treatment Effect) for various specific interventions would be more bite-sized than trying to measure the whole enchilada!
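For concreteness, here is a minimal sketch of how empowerment could be computed in a tiny discrete setting, assuming the usual definition as the channel capacity from actions to next states (the maximum over action distributions of the mutual information). The function name `channel_capacity` and the two transition tables `ALONE` and `HELPED` are hypothetical illustrations, not taken from any library or paper; the capacity is estimated with the standard Blahut-Arimoto iteration.

```python
import math

def channel_capacity(p_s_given_a, iters=200, tol=1e-12):
    """Blahut-Arimoto estimate of max over p(a) of I(A; S'), in bits.

    p_s_given_a[a][s] is the probability of reaching state s after action a.
    """
    n_actions = len(p_s_given_a)
    n_states = len(p_s_given_a[0])
    p_a = [1.0 / n_actions] * n_actions  # start from a uniform action distribution
    lower = 0.0
    for _ in range(iters):
        # Marginal distribution over next states under the current p(a).
        p_s = [sum(p_a[a] * p_s_given_a[a][s] for a in range(n_actions))
               for s in range(n_states)]
        # d[a] = KL(p(s|a) || p(s)) in nats.
        d = []
        for a in range(n_actions):
            kl = 0.0
            for s in range(n_states):
                p = p_s_given_a[a][s]
                if p > 0.0:
                    kl += p * math.log(p / p_s[s])
            d.append(kl)
        # Reweight actions toward those that are more informative about s.
        weights = [p_a[a] * math.exp(d[a]) for a in range(n_actions)]
        z = sum(weights)
        p_a = [w / z for w in weights]
        lower, upper = math.log(z), max(d)  # bounds that bracket the capacity
        if upper - lower < tol:
            break
    return lower / math.log(2)

# Hypothetical example: the same agent with a noisy actuator, alone and with a
# helper that makes its actions more reliable.
ALONE = [[0.7, 0.3], [0.3, 0.7]]
HELPED = [[0.95, 0.05], [0.05, 0.95]]
empowerment_gain = channel_capacity(HELPED) - channel_capacity(ALONE)
```

Here `empowerment_gain` plays the role of the "extra self-control" attributed to the helper's intervention. Whether that difference is a good proxy for alignment is exactly the open question; the code only shows that the quantity is computable in small cases.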

I am trying to find an Alignment direction that would be interesting for me to learn and work on. I've already received a couple of answers, but I want more opinions, and I get more motivation from discussion and cooperation.

1. As I understand it, almost any agent can be modeled as a utility optimizer, but I am still suspicious of the concept. I am not sure that utility optimization is the most compressed representation of any agent, because an arbitrary utility function could be arbitrarily complex to encode, and there may be another model of the same agent that is more compressed. Humans do not seem to be explicit utility optimizers (see Shard Theory). Can we formally distinguish between an explicit utility optimizer and an agent whose behavior can merely be interpreted as optimizing some function? Which is more efficient? Can you recommend something that answers these questions, or that is just a good starting point for learning about utility optimizers through a mathematical approach? For example, I want to understand more about existing theorems on the topic, like the Von Neumann–Morgenstern utility theorem.

2. I am trying to find approaches to AI Alignment that use rigorous math about deep learning. But I am struggling to find even good books about the math behind deep learning as a whole: not about how to do deep learning in practice, but with theorems and proofs. One example I found interesting is "Deep Learning Architectures: A Mathematical Approach", in particular the chapters on "Approximation Theorems", "Information Processing" and "Geometric Theory". In the Alignment field, the only approach I see taking this path is SLT (Singular Learning Theory). Does anyone know other approaches or books in that direction? If SLT is the only one, is that because it covers all of learning theory?

3. If I understand correctly, many results show that neural networks lack identifiability of their weights, or even of their latent representations, as in VAEs. What, then, is the point of Mech Interp, which, as I understand it, tries to understand the function of specific neurons/features? What should I read to understand this? Does it contradict the SLT view that all sophisticated models are singular?

4. I read "Advances in Identifiability of Nonlinear Probabilistic Models" by Ilyes Khemakhem, "On Linear Identifiability of Learned Representations" by Roeder, Metz, and Kingma, and similar papers, and immediately wondered: why is there no Alignment direction that tries to use these techniques to get more identifiability in neural networks?

5. Is the inner misalignment problem based on the assumption that capabilities are learned faster, are more robust, and generalize better than values? In the sharp left turn post, the evolution analogy is not convincing to me, because humans didn't develop IGF (inclusive genetic fitness) values simply because they were not able to understand them. We understand them now because of the iterative development of technology over centuries. For an ASI, human values should be comprehensible, but there is the problem that a power-seeking ASI is indistinguishable from an aligned ASI. Yet this is also based on the assumption that human values were somehow not learned during training before the AI became sophisticated enough to understand how to deceive humans. Could we train an AI that is not smarter than us, so that it will not deceive us, but that is capable of comprehending our values? Why should these values change under future capability improvements if they perfectly fit our training distribution?

6. Are there any good attempts to formalize which functions are optimizers and which are not? Is it possible in principle to distinguish between a function that optimizes something and one that does not, for example when both are given in the form of a neural network? Or at least to understand what a function should look like to be a specific type of optimizer, and what types of optimizers we can represent as a function?
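For reference on question 1, this is the statement of the Von Neumann–Morgenstern theorem as I understand it, restricted to a finite outcome set $X$ for simplicity (that restriction and the symbols $L$, $M$, $u$ are my own choice of notation):

```latex
\textbf{Theorem (von Neumann--Morgenstern).}
Let $\succsim$ be a preference relation over lotteries (probability
distributions) on a finite outcome set $X$. Then $\succsim$ satisfies
completeness, transitivity, continuity, and independence if and only if
there exists a function $u : X \to \mathbb{R}$ such that, for all
lotteries $L$ and $M$,
\[
  L \succsim M
  \iff
  \sum_{x \in X} L(x)\, u(x) \;\ge\; \sum_{x \in X} M(x)\, u(x).
\]
Moreover, $u$ is unique up to positive affine transformation:
$u'$ represents the same preferences iff $u' = a\,u + b$ with $a > 0$.
```

Note that the theorem is about preferences over lotteries that satisfy the four axioms. It says nothing about how compactly $u$ can be encoded, which is the part of the question I am unsure about.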
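To make the weight non-identifiability in question 3 concrete, here is a minimal sketch of two well-known symmetries of a one-hidden-layer tanh network: permuting the hidden units, and flipping the sign of one unit's incoming and outgoing weights. All names and numbers (`mlp`, `permute_hidden`, `flip_hidden_sign`, the weight values) are hypothetical and chosen only for illustration.

```python
import math

def mlp(x, w1, b1, w2, b2):
    """One-hidden-layer tanh network with a scalar output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sum(v * h for v, h in zip(w2, hidden)) + b2

def permute_hidden(w1, b1, w2, perm):
    """Reorder the hidden units; the computed function is unchanged."""
    return [w1[i] for i in perm], [b1[i] for i in perm], [w2[i] for i in perm]

def flip_hidden_sign(w1, b1, w2, unit):
    """Negate one unit's incoming weights, bias, and outgoing weight.

    Because tanh(-z) = -tanh(z), the two sign changes cancel.
    """
    w1f = [[-w for w in row] if i == unit else list(row) for i, row in enumerate(w1)]
    b1f = [-b if i == unit else b for i, b in enumerate(b1)]
    w2f = [-v if i == unit else v for i, v in enumerate(w2)]
    return w1f, b1f, w2f

# Arbitrary illustrative parameters: 2 inputs, 3 hidden units.
W1 = [[0.5, -1.0], [2.0, 0.3], [-0.7, 0.8]]
B1 = [0.1, -0.2, 0.05]
W2 = [1.5, -0.4, 0.9]
B2 = 0.25
INPUTS = [[0.0, 0.0], [1.0, -2.0], [-0.5, 0.75]]

W1p, B1p, W2p = permute_hidden(W1, B1, W2, [2, 0, 1])
W1f, B1f, W2f = flip_hidden_sign(W1, B1, W2, unit=1)

same_after_permutation = all(
    math.isclose(mlp(x, W1, B1, W2, B2), mlp(x, W1p, B1p, W2p, B2), abs_tol=1e-12)
    for x in INPUTS)
same_after_sign_flip = all(
    math.isclose(mlp(x, W1, B1, W2, B2), mlp(x, W1f, B1f, W2f, B2), abs_tol=1e-12)
    for x in INPUTS)
```

Different parameter vectors give the same function here, so the weights cannot be recovered from input-output behavior alone. This only demonstrates that the parameter-to-function map is not one-to-one; it does not settle whether features, as opposed to raw weights, are identifiable.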
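A toy illustration of why questions 1 and 6 seem hard to me: below, an explicit optimizer that searches over actions at run time and a lookup table that does no search implement exactly the same input-output function. The problem setup (`STATES`, `ACTIONS`, `toy_utility`) is a hypothetical example, not a proposed formalization.

```python
def explicit_optimizer(state, actions, utility):
    """Searches over actions at call time and returns a utility-maximizing one."""
    return max(actions, key=lambda a: utility(state, a))

def make_lookup_policy(states, actions, utility):
    """Precomputes the same choices; the returned policy does no search when called."""
    table = {s: explicit_optimizer(s, actions, utility) for s in states}
    return lambda state: table[state]

# Hypothetical toy problem: choose the action closest to the current state.
STATES = range(5)
ACTIONS = range(5)

def toy_utility(state, action):
    return -abs(state - action)

lookup_policy = make_lookup_policy(STATES, ACTIONS, toy_utility)

# The two policies agree on every state, although only one of them searches.
same_behavior = all(
    explicit_optimizer(s, ACTIONS, toy_utility) == lookup_policy(s)
    for s in STATES)
```

On this finite domain the two are indistinguishable from behavior alone, so any formal distinction would presumably have to look at internal structure, or at how the function generalizes to states outside the table.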