*This post is motivated by research intuitions that better formalisms in consciousness research contribute to agent foundations in more ways than just the value loading problem. Epistemic status: speculative.*

David Marr's levels of analysis is the idea that any analysis of a system involves analyzing it at multiple, distinct levels of abstraction: the computational level, which describes what the system is trying to do; the algorithmic level, which describes which algorithms the system instantiates to accomplish that goal; and the implementation level, which describes the hardware or substrate on which the system runs. Each level underdetermines the others. Many different algorithms can serve a given goal, and a given algorithm doesn't restrict which goals it can be used for. A concrete example Marr uses is that you'd have a very hard time figuring out what a feather was for if you'd never seen a bird fly, and if you had only seen a bird flying you might have a very hard time coming up with something like the design of a feather.
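To make the underdetermination claim concrete, here is a minimal sketch in Python. The computational-level goal is "return the items in ascending order", and two quite different algorithms satisfy it. The function names are my own illustrative choices, not anything from Marr.

```python
def insertion_sort(items):
    """Algorithm 1: grow a sorted list by inserting one item at a time."""
    result = []
    for item in items:
        i = len(result)
        # Walk left past every element larger than the new item.
        while i > 0 and result[i - 1] > item:
            i -= 1
        result.insert(i, item)
    return result


def merge_sort(items):
    """Algorithm 2: split in half, sort each half, merge the halves."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


data = [3, 1, 2, 5, 4]
# Same goal, different algorithms: the outputs agree.
print(insertion_sort(data) == merge_sort(data))
```

Observing only the outputs tells you nothing about which algorithm ran, and reading only the merge step tells you little about why anyone wanted sorted data in the first place.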

Imagine a world that has recently invented computers. The early examples are very primitive, but people can extrapolate and see that these things will be very powerful, likely transformative to society. They're pretty concerned about the potential for these changes to be harmful, maybe even catastrophic. Although people have done a bit of theoretical work on algorithms, it isn't all that sophisticated. But since the stakes are high, they try their best to start figuring out what it would mean for there to be such a thing as harmful algorithms, or how to bound general-purpose algorithms such that they can only be used for certain things. They even make some good progress, coming up with the concept of ASICs so that they can maybe hard-code the good algorithms and make it impossible to run the bad ones. They're still concerned that a sufficiently clever or sufficiently incentivized agent could somehow use ASICs for bad ends.

If this situation seems a bit absurd to you, it's because you intuitively recognize that the hardware level underdetermines the algorithmic level. I want to raise the possibility that we're making the same error now. The algorithmic level underdetermines the computational level, and no matter how many combinations of cleverly constructed algorithms you stack on top of one another, you won't be able to bound the space of possible goals in a way that gets you much more than weak guarantees. By contrast, a system constructed with the right intentional formalism should actively want to avoid being Goodharted, just as a human does. Such an agent should have Knightian uncertainty and therefore also (potentially) avoid maximizing.

In physics (the implementation level) there are notions of smallest units, and counting up the different ways these units can be combined gives the notion of thermodynamic entropy. We can also easily define distance functions. In information theory (the algorithmic level) there are bits, and counting up the different ways these bits could be arranged gives the notion of information-theoretic entropy. Again, we can define distance functions. I think we need to build a notion of units of intentionality (at the computational level), and a measure over the ways these units can be arranged, to give a notion of intentional (computational) entropy. Along with it we would get what could turn out to be a key insight for aligning AI: a distance function between intentions.
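As a reminder of what the two existing constructions look like, here is a minimal Python sketch. It counts arrangements of two-state units to get a Boltzmann-style entropy (in units of the Boltzmann constant), computes Shannon entropy from symbol frequencies, and uses Hamming distance as one example of a distance function over bit strings. The toy two-state system and the function names are assumptions chosen for illustration. Nothing here defines the intentional analogue, which is the open question.

```python
import math
from collections import Counter


def boltzmann_entropy(n_units, n_up):
    """Log of the number of microstates with n_up of n_units in the 'up' state.

    Expressed in units of the Boltzmann constant (natural log).
    """
    return math.log(math.comb(n_units, n_up))


def shannon_entropy(message):
    """Shannon entropy in bits per symbol, from empirical symbol frequencies."""
    counts = Counter(message)
    total = len(message)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


print(boltzmann_entropy(4, 2))           # log of the 6 ways to pick 2 of 4
print(shannon_entropy("0101"))           # two equally frequent symbols
print(hamming_distance("1011", "1001"))  # differs at one position
```

In both cases the recipe is the same: fix a unit, count configurations, take a logarithm, and define a distance over configurations. The proposal in this post is that a third instance of that recipe is missing.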

In the same way that trying to build complex information processing systems without a concrete notion of information would be quite confusing, I claim that trying to build complex intentional systems without a concrete notion of intention is confusing. This may sound a bit far-fetched, but I claim that it is exactly as hard to think about as information theory was before Shannon found a formalism that worked.

I think there are already several beachheads for this problem that are suggestive:

Predictive processing (relation to smallest units of intention).

In particular, one candidate for the smallest unit is the smallest difference that a given feedback circuit (like a thermostat) can actually distinguish. We humans get around this limit by translating from systems in which we can make fewer distinctions (like, say, heat) into systems in which we can make more (like, say, our symbolic processing of visual information in the form of numbers).
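A toy sketch of that candidate unit, in Python: a thermostat whose sensor has a fixed resolution can only tell two temperatures apart if they fall in different quantization bins. The `resolution` parameter and the function names are hypothetical, and real sensors are noisier than this, but it shows how a "smallest distinguishable difference" bounds how many states the circuit can be about.

```python
import math


def quantize(reading, resolution):
    """Index of the sensor bin that a reading falls into."""
    return math.floor(reading / resolution)


def distinguishable(a, b, resolution):
    """True if the feedback circuit can tell readings a and b apart."""
    return quantize(a, resolution) != quantize(b, resolution)


def capacity_bits(low, high, resolution):
    """Log2 of the number of bins the sensor can report between low and high."""
    n_bins = quantize(high, resolution) - quantize(low, resolution) + 1
    return math.log2(n_bins)


# With 0.5-degree resolution, 20.1 and 20.4 land in the same bin,
# while 20.4 and 20.6 land in different bins.
print(distinguishable(20.1, 20.4, 0.5))
print(distinguishable(20.4, 20.6, 0.5))
print(capacity_bits(0.0, 7.5, 0.5))
```

Reading the same temperature off a numeric display amounts to swapping in a much smaller `resolution`, which is the translation trick described above.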

Convergent instrumental goals (structural invariants in goal systems).

In particular, I think it would be worth investigating differing intuitions about just how strong a forcing function convergent instrumental goals are. Do we expect a universe optimized by a capability-boosted Gandhi and one optimized by Clippy to be 10% similar, 50%, 90%, or perhaps 99.9999+% similar?

Modal logic (relation to counterfactuals, and as a semantics for the intentionality of beliefs).

Goodhart's taxonomy begins to parameterize divergence of intent, and therefore to suggest distance functions for it.

Some other questions:

How do simple intentions get combined to form more complex intentions? I think this is tractable via experimentation with simple circuits. This could also suggest approaches to pre-rationality via explaining (rigorously) how complex priors arise from homeostatic priors.

In Buddhism, intention is considered synonymous with consciousness, while in the West this is considered a contentious claim. What simple facts, if known, would collapse the seeming complexity here?

Can we consider intentions as a query language? If so, what useful ideas or results can we port over from database science? Is the apparent complexity of human values a side effect of the dimensionality of the space more so than the degree of resolution on any particular dimension?

Note:

When I read vague posts like this myself, I sometimes have vague objections but don't write them up, both because of the effort needed to bridge the inferential distance to the author and because of the sense that the author will interpret attempts to bridge that distance as harsher criticism than I intend. Please feel free to give half-formed criticism and leave me to fill in the blanks. It might poke my own half-formed thoughts in this area in an interesting way.