Happy 2024! 🎇

To kick the year off, I thought it would be nice to (very informally) outline the research avenue I will be exploring for at least the next 8 weeks and the concrete goals of the project.

A major contributing factor to the existential danger of AGI systems is that there will not be any second chances. If a deployed AI is sufficiently powerful but is not aligned with humanity's interests, there will be very little we can do. John Wentworth has spoken about the lack of a "wind tunnel" for alignment, meaning we are limited in our ability to test and iterate on designs for an aligned AGI.

I’d like to explore whether it is possible to construct such a wind tunnel. The idea is quite simple: use tools from physics to bound an agent's maximum impact on its environment. If this works, it would enable the safe partial deployment of an AGI without needing to have completely solved the alignment problem.


So where to begin? Here are some of the threads I’m exploring:

It has been known since the 1960s that exorcising Maxwell’s Demon involves recognising that the erasure of information produces heat (Landauer's principle). However, the thermodynamic analysis of computation in the previous century was limited by its focus on systems at thermal equilibrium. I’m excited about recent developments in the Stochastic Thermodynamics of Computation (Wolpert, 2019), which specifically address systems far from equilibrium. This is an extremely important distinction because humans, modern computers, and AGIs are all far-from-equilibrium systems.
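For reference, the quantitative version of that statement is Landauer's bound (a standard result, included here for context rather than anything specific to this project): erasing one bit of information in an environment at temperature $T$ requires dissipating at least

$$Q \geq k_B T \ln 2$$

of heat, where $k_B$ is Boltzmann's constant. At room temperature this is roughly $3 \times 10^{-21}$ J per bit — tiny, but strictly positive, which is what saves the Second Law from the Demon.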

I am very interested in this paper (Evans, Milburn, Shrapnel, 2021), which discusses causal perspectives using a thermodynamic model of an agent, and I am in contact with one of the authors. I suspect their model of a physical causal agent will be a useful starting point for my work. The paper builds on philosophy papers by Jenann Ismael, who has written some insightful work on causality and time, although I’m unsure how directly relevant it is to me.

Jeremy England also studies far-from-equilibrium mechanics. He has multiple papers discussing various “life-like” phenomena and Dissipative Adaptation. Many of his results hint at being alignment-relevant, for example “Design of conditions for self-replication” (2019).

Finally, all of the above relies heavily on the work of Gavin Crooks and the famous Crooks Fluctuation Theorem (1999), which ties entropy production to the relative probability of forward and reverse processes. Very loosely, it explains why we encounter many processes that only occur in one direction through time, despite the microscopic laws of physics being time-reversible.[1]  For more details see these notes (Liang 2018).
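To make that precise, the theorem in its standard form (this is textbook material, not anything specific to the notes linked above) states that for a driven system,

$$\frac{P_F(+\omega)}{P_R(-\omega)} = e^{\omega},$$

where $P_F$ is the probability of producing entropy $\omega$ (in units of $k_B$) along the forward process, and $P_R$ is the probability of producing $-\omega$ along the time-reversed process. Entropy-producing trajectories are exponentially more likely than their reversals — exactly the one-directionality mentioned above. A useful corollary relates work to free energy differences: $P_F(W)/P_R(-W) = e^{\beta(W - \Delta F)}$.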


Concrete Goals for the next 2 months:
I believe the most important step is to build this into a formal research agenda which I can refer to in the future.

In the next few days, I need to complete my second read-through of Wolpert’s paper. I would also like to get through this series of lectures over the next 2 weeks.

Meta:
A weakness of the current plan is that I don't yet have a single, manageable research chunk to focus on. I could, for example, be taking a result from the literature and proving it in a more general form.

Even if the bounding/wind tunnel idea proves fruitless, I'm confident the work will achieve substantial deconfusion.


 

  1. ^

    Go onto Amazon and order a frictionless billiard ball table plus a set of perfectly elastic balls. Randomly scatter them on the table, and then set them in motion. Video the result.

    If you show your friend the forward and backward versions of the film, they will be able to see that the two are different, but they won't be able to tell which one was the original.

    This is what is happening at microscopic scales. Microscopic physical laws exhibit time-reversal symmetry. 
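    For the curious, here is a minimal simulation sketch of the point (my own toy code, not taken from any of the papers above): non-interacting balls bouncing elastically off the walls of a box. Run the dynamics forward, flip every velocity, run the same dynamics again, and the system retraces its history back to the initial state.

```python
import numpy as np

# Toy demonstration of time-reversal symmetry: non-interacting "billiard
# balls" on a frictionless table with perfectly elastic walls. Ball-ball
# collisions are omitted for brevity; elastic collisions are also
# time-reversal symmetric, so including them wouldn't change the conclusion.

rng = np.random.default_rng(0)
L = 1.0                        # table side length
n, dt, steps = 16, 1e-3, 5000  # number of balls, timestep, steps each way

pos = rng.uniform(0.1, 0.9, size=(n, 2))  # random initial positions
vel = rng.normal(0.0, 1.0, size=(n, 2))   # random initial velocities
start = pos.copy()

def step(pos, vel):
    """Advance one timestep with specular (perfectly elastic) wall bounces."""
    pos = pos + vel * dt
    # Fold any ball that left the box back inside, flipping its velocity.
    out_high, out_low = pos > L, pos < 0.0
    pos = np.where(out_high, 2.0 * L - pos, pos)
    pos = np.where(out_low, -pos, pos)
    vel = np.where(out_high | out_low, -vel, vel)
    return pos, vel

for _ in range(steps):   # run forwards...
    pos, vel = step(pos, vel)
vel = -vel               # ...flip every velocity ("play the film backwards")...
for _ in range(steps):   # ...and run the exact same dynamics again.
    pos, vel = step(pos, vel)

# The balls retrace their paths back to where they started
# (up to floating-point rounding).
print("max deviation from initial positions:", np.abs(pos - start).max())
```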
