This post is motivated by research intuitions that better formalisms in consciousness research contribute to agent foundations in more ways than just the value loading problem. Epistemic status: speculative.
David Marr's levels of analysis is the idea that any analysis of a system involves analyzing it at multiple, distinct levels of abstraction: the computational level, which describes what the system is trying to do; the algorithmic level, which describes which algorithms the system instantiates in order to accomplish that goal; and the implementation level, which describes the hardware or substrate the system runs on. Each level underdetermines the others. You can choose lots of different algorithms for a given goal, and an algorithm doesn't restrict which goals can use it. A concrete example Marr uses is that you'd have a very hard time figuring out what a feather was for if you'd never seen a bird flying, and if you'd only seen a bird flying you might have a very difficult time coming up with something like the design of a feather.
Imagine a world that had recently invented computers. The early examples are very primitive, but people can extrapolate and see that these things will be very powerful, likely transformative to society. They're pretty concerned about the potential for these changes to be harmful, maybe even catastrophic. Although people have done a bit of theoretical work on algorithms, it isn't all that sophisticated. But since the stakes are high, they try their best to start figuring out what it would mean for there to be such a thing as harmful algorithms, or how to bound general-purpose algorithms such that they can only be used for certain things. They even make some good progress, coming up with the concept of ASICs so that they can maybe hard-code the good algorithms and make it impossible to run the bad ones. They're still concerned that a sufficiently clever or sufficiently incentivized agent could use ASICs for bad ends somehow.
If this situation seems a bit absurd to you, it's because you intuitively recognize that the hardware level underdetermines the algorithmic level. I argue that we may be making the same error now. The algorithmic level underdetermines the computational level, and no matter how many cleverly constructed algorithms you stack on top of one another, you won't be able to bound the space of possible goals in a way that gets you much more than weak guarantees. In particular, a system constructed with the right intentional formalism should actively want to avoid being Goodharted, just as a human does. Such an agent should have Knightian uncertainty and therefore also (potentially) avoid maximizing.
In physics (the implementation level) there are notions of smallest units; counting up the different ways these units can be combined gives the notion of thermodynamic entropy, and we can easily define distance functions. In information theory (the algorithmic level) there are notions of bits; counting up the different ways these bits can be set gives the notion of information-theoretic entropy, and we can likewise define distance functions. I think we need to build a notion of units of intentionality (on the computational level), and measures over the permutations of ways these units can be arranged, to give a notion of intentional (computational) entropy. That would also get us what could turn out to be a key insight for aligning AI: a distance function between intentions.
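To make the analogy concrete, here is a minimal sketch (my own illustration, not anything proposed in the post) of the algorithmic-level versions of the two objects I'm asking for at the computational level: entropy as a count over configurations, and a distance-like function between distributions (here, KL divergence):

```python
import math

def shannon_entropy(p):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL divergence D(p || q) in bits: a (non-symmetric) 'distance' between distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A fair coin has maximal entropy for two outcomes; a biased coin has less.
fair = [0.5, 0.5]
biased = [0.9, 0.1]
print(shannon_entropy(fair))            # 1.0 bit
print(shannon_entropy(biased))          # ~0.469 bits
print(kl_divergence(biased, fair))      # ~0.531 bits from fair
```

The claim of the paragraph is that nothing with this degree of concreteness currently exists one level up, for intentions.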
In the same way that trying to build complex information processing systems without a concrete notion of information would be quite confusing, I claim that trying to build complex intentional systems without a concrete notion of intention is confusing. This may sound a bit far-fetched, but I claim it is exactly as hard to think about as information theory was before Shannon found a formalism that worked.
I think there are already several beachheads for this problem that are suggestive:
Predictive processing (relation to smallest units of intention).
In particular, one candidate for the smallest unit is the smallest distinction that a given feedback circuit (like a thermostat) can actually make. We humans get around this by translating from systems in which we can make fewer distinctions (say, heat) into systems in which we can make more (say, our symbolic processing of visual information in the form of numbers).
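A toy sketch of the contrast (function names are mine, purely illustrative): the thermostat's whole "intentional" repertoire is one binary distinction over temperature, while a symbolic representation of the same variable preserves many more distinctions.

```python
def thermostat_step(temp, setpoint):
    """A thermostat's entire repertoire: one binary distinction.
    It cannot represent how far temp is from setpoint, only which side it's on."""
    return "heat_on" if temp < setpoint else "heat_off"

def symbolic_reading(temp):
    """A symbolic system distinguishes many more states of the same variable:
    here, the numeric reading itself."""
    return f"{temp:.1f} degrees"

# The thermostat collapses 18.0 and 5.0 into the same state;
# the symbolic representation keeps them distinct.
print(thermostat_step(18.0, 20.0))   # heat_on
print(thermostat_step(5.0, 20.0))    # heat_on (same distinction)
print(symbolic_reading(18.0))        # 18.0 degrees
print(symbolic_reading(5.0))         # 5.0 degrees
```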
Convergent instrumental goals (structural invariants in goal systems).
In particular, I think it would be worth investigating differing intuitions about just how strong a forcing function convergent instrumental goals are. Do we expect a universe optimized by a capability-boosted Gandhi and one optimized by Clippy to be 10% similar, 50%, 90%, or perhaps 99.9999+% similar?
Modal Logic (relation to counterfactuals and as semantics for the intentionality of beliefs).
Goodhart's taxonomy begins to parameterize, and therefore define, distance functions for the divergence of intent.
Some other questions:
How do simple intentions get combined to form more complex intentions? I think this is tractable via experimentation with simple circuits. This could also suggest approaches to pre-rationality via explaining (rigorously) how complex priors arise from homeostatic priors.
In Buddhism, intention is considered synonymous with consciousness, while in the west this is considered a contentious claim. What simple facts, if known, would collapse the seeming complexity here?
Can we consider intentions as a query language? If so, what useful ideas or results can we port over from database science? Is the apparent complexity of human values a side effect of the dimensionality of the space more so than the degree of resolution on any particular dimension?
When I read vague posts like this myself, I sometimes have vague objections but don't write them up, both because of the effort needed to bridge the inferential distance to the author and because of the sense that the author will interpret attempts to bridge that distance as harsher criticism than I intend. Please feel free to give half-formed criticism and leave me to fill in the blanks. It might poke my own half-formed thoughts in this area in an interesting way.
Since some readers are probably unaware of this, it's worth noting explicitly that the sense of "intention" used in philosophy is different from the common meaning of the term. From the linked article:
On the subject of intentionality/reference/objectivity/etc, On the Origin of Objects is excellent. My thinking about reference has a kind of discontinuity from before reading this book to after reading it. Seriously, the majority of analytic philosophy discussion of indexicality, qualia, reductionism, etc seems hopelessly confused in comparison.
Reading this now, thanks.
Figure it's worth saying this is also very much where I'm trying to take my own research in AI safety. I'd phrase it as: we are currently so hopelessly confused about what we even think we want to do to approach alignment that we can't yet state the problem formally (although I've made a start at it, and I view Stuart Armstrong's research as doing the same, even if I disagree with him on the specifics of his approach). I agree with you that these things seem to be pointing us toward something, and I think we even have tastes of what the needed formalism looks like already, but there's also a lot to be done to get there.
Another direction that has stubbornly resisted crystallizing is the idea that Goodharting is a positive feature in adversarial environments, via something like granting ε-differential privacy while still allowing you to coordinate with others by taking advantage of one-way functions, i.e., hashing shared intents to avoid adversarial pressure. This would make this sort of work part of a billion-year arms race in which one side attempts to reverse engineer signaling mechanisms while the other tries to obfuscate them, to prevent the current signaling frontier from becoming saturated and therefore worthless. Similar to parasite arms races.
(I don't have much experience thinking in these terms, so maybe the question is dumb/already answered in the post. But anyway: )
Do you have some more-detailed (and stupidly explicit) examples of the intentional and algorithmic views on the same thing, and how to translate between them?
This is a good question. I think ways of thinking about Marr's levels itself might be underdetermined and therefore worth trying to crux on. Let's take the example of birds again. On the implementation level we can talk about the physical systems of a bird interacting with its environment. On the algorithmic level we can talk about patterns of behavior supported by the physical environment that allow the bird to do certain tasks. On the computational (intentional) level we can talk about why those tasks are useful in terms of some goal architecture like survival and sexual selection. We can think about underdetermination when we have any notion that different goals might in theory be instantiated by two otherwise similar birds, when we have a notion of following different strategies to achieve the same goal, or when we think about having the same goals and strategies instantiated on a different substrate (simulations).
One of the reasons I think this topic is confusing is that in reality we only ever have access to the algorithmic level. We don't have direct access to the implementation level (physics); we just have algorithms that more or less reliably return physics-shaped invariances. Likewise, we don't have direct access to our goals; we just have algorithms that return Goodharted proxies that we use to triangulate on inferred goals. We improve the accuracy of these algorithms over time through a pendulum swing between modifying the representation and modifying the traversal (this split is also from Marr).
Edited into post.