Research Reflections — LessWrong

Over the decade I've spent working on AI safety, I've felt an overall trend of divergence: research partnerships start out with a sense of a common project, then slowly drift apart. It is often said that AI safety is a pre-paradigmatic field. This (perhaps along with other factors) means researchers have to optimize for their own personal sense of progress, based on their own research taste. In my experience, the tails come apart; eventually, two researchers will have some deep disagreement in matters of taste, which sends them down different paths.

Until the spring of this year, that is.

At the Agent Foundations conference at CMU,^[1] something seemed to shift, subtly at first. After I gave a talk -- roughly the same talk I had been giving for the past year -- I had an excited discussion about it with Scott Garrabrant. Looking back, it wasn't so different from our previous chats, but its impact was; it felt more concrete, more actionable, something that really touched my research rather than remaining hypothetical. In the following weeks, discussions with my usual circle of colleagues^[2] took on a different character -- it seemed that, after all our diverse explorations, we had arrived at a shared space. (This is my own sense, which others may not share.)

I wrote a paper for ILIAD over the summer, developing the ideas I got from that discussion with Scott.^[3] Writing it surprised me by bringing together several ideas I had not expected to connect. Suddenly Scott's work on Finite Factored Sets and Cartesian Frames seemed not merely theoretically interesting -- not merely a great piece of work to observe from a slight distance -- but urgently interesting, like the beginning of a calculation I wanted to complete. I also surprised myself by bringing in some ideas from Critch's work on agent boundaries. The paper leaves much to be desired, but it is an improvement on my paper for last year's ILIAD.

My new paper also has some similarities to UDT 1.01. Diffractor's notion of "plannables vs observables" seems somewhat related to my notion of "internal observations vs external observations".

Sam Eisenstat also wrote a paper for this year's ILIAD: Condensation. I think this is an important paper. Sam's thinking on the nature of concepts has remained murky to me for several years; this paper brings some of those ideas to light, and I find them quite interesting. More importantly for the narrative of this essay, the technical work is an extension of John Wentworth's work on Natural Abstractions. Sam has some philosophical disagreements with John and considers Condensation to be reaching in a different direction, but on a mathematical level it is (from my limited perspective, at least) a leap forward in John's program. Again I have this feeling: I've abstractly considered John's work "interesting" for some years, but Sam's paper has made John's work urgently interesting, actionable, compelling, and intimately related to other things I want to do.

Sam's ILIAD paper and mine have some similarities. We both work in a sort of algebra of random variables. Sam defines morphisms over random variables, forming a category. I instead took inspiration from finite factored sets and some of Scott's earlier (unpublished) work leading up to them, and modeled random variables as partitions rather than using the more standard definition that Sam uses. I think Sam's choice was the better one, and I'm interested in trying to reformulate (and improve) my ideas in his formalism. This seems quite exciting: representing agents in a framework that also has tools for representing reasons for choosing between ontologies. Optimistically, this could lead to a rich picture of when-it-makes-sense-to-model-something-as-an-agent.

Steve Petersen also seemed excited about paradigm convergence at ILIAD, expressing interest in bringing together all the theories of abstraction (Sam's "Condensation", Daniel Dennett's "Real Patterns", John's "Natural Abstractions"/"Natural Latents", and Steve's own formalization of abstraction). (Hopefully I'm representing Steve fairly here.)

It is difficult to talk about an idea that has so far only been glimpsed murkily, a vague pattern in the convergence of several lines of thinking. I have spoken mainly of my experiences, not of the conjectured point of convergence. A proper development of these ideas would take many more pages, and perhaps years. But this is the season of Inkhaven, and I am writing short posts like this to get ideas out. Hopefully I will write about more pieces of the developing picture as the month goes on.

^[1] Another related experience I had at CMU was several discussions with Cole Wyeth, which brought us closer to a shared perspective. We've kept in contact since then.
^[2] Here I refer to Scott Garrabrant, Sam Eisenstat, and TJ. While Sahil is also in my usual circle, what he is up to still feels distinct, not (yet) part of this feeling of convergence. On the other hand, Sahil's thinking does overlap significantly with TJ's and with Steve Petersen's.
^[3] I should note that I did not develop his ideas in the way he intended them; he has since explained how his intended idea differs significantly. I continue to be interested in both versions of the idea.