I think it would help a lot to provide people with examples. For example, here:
Many machine learning research agendas for safety are investigating issues identified years earlier by foundational research, and are at least partly informed by that research.
You say that, but then don't provide any examples. I imagine readers just not thinking of any, and then moving on without feeling any more convinced.
Overall, I think that it's hard for people to believe agent foundations will be useful because they're not visualizing any compelling concrete path where it makes a big difference.
The first is what Garrett points out, that probabilities are map things, and it’s a bit… weird for our measure of a (presumably) territory thing to be dependent on them. It’s the same sort of trickiness that I don’t feel we’ve properly sorted out in thermodynamics—namely, that if we take the existence of macrostates to be reflections of our uncertainty (as Jaynes does), then it seems we are stuck saying something to the effect of “ice cubes melt because we become more uncertain of their state,” which seems… wrong.
For this part, my answer is Kolmogorov complexity. An ice cube has lower K-complexity than the same amount of liquid water, which is a fact about the territory and not our maps. (And if a state has lower K-complexity, it's more knowable; you can observe fewer bits, and predict more of the state.)
One of my ongoing threads is trying to extend this to optimization. I think a system is being objectively optimized if the state's K-complexity is being reduced. But I'm still working through the math.
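To gesture at what I mean (this isn't my actual formalism, and everything named below is just my own toy illustration): K-complexity is uncomputable, but a general-purpose compressor gives an upper bound on it, so you can at least watch a crude proxy go down as a system gets more ordered.

```python
import random
import zlib

def complexity_proxy(state: bytes) -> int:
    """Crude upper bound on K-complexity: length of the zlib-compressed state."""
    return len(zlib.compress(state))

rng = random.Random(0)
structured = bytes([0, 1] * 5000)                             # highly regular ("ice cube")
disordered = bytes(rng.randrange(256) for _ in range(10000))  # near-random ("liquid water")

print(complexity_proxy(structured))   # small: the pattern compresses well
print(complexity_proxy(disordered))   # close to 10000: barely compresses at all

# Toy "optimization": a process that imposes more order on the state each step.
# The signature I'd want a formal notion of optimization to pick up on is that
# this complexity estimate is (robustly) decreasing along the trajectory.
state = list(disordered)
trajectory = [complexity_proxy(bytes(state))]
for k in range(1, 6):
    state[: k * 2000] = sorted(state[: k * 2000])   # stand-in for the system's dynamics
    trajectory.append(complexity_proxy(bytes(state)))
print(trajectory)   # roughly decreasing sequence of complexity estimates
```

A compressor is obviously a bad stand-in for true K-complexity, and whether the ice/water comparison survives a more careful treatment is exactly what the sequence would have to argue; this is just to show the shape of the claim.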
Yeah... so these are reasonable thoughts of the kind that I thought through a bunch when working on this project, and I do think they're resolvable, but to do so I'd basically be writing out my optimization sequence.
I agree with Alexander below, though: a key part of optimization is that it is not about utility functions; it is only about a preference ordering. Utility functions are about choosing between lotteries, which is a thing that agents do, whereas optimization is just about going up an ordering. Optimization is a thing that a whole system does, which is why there's no agent/environment distinction. Sometimes, only a part of the system is responsible for the optimization, and in that case you can start to talk about separating them, and then you can ask questions about what that part would do if it were placed in other environments.
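To illustrate the distinction (a throwaway sketch of my own, not anything from the sequence; the function names are made up): optimization in this sense only needs a way to compare two states, whereas a utility function assigns numbers, which is the extra structure you need once you start ranking lotteries.

```python
from typing import Callable, Sequence, Tuple, TypeVar

State = TypeVar("State")

def goes_up_the_ordering(trajectory: Sequence[State],
                         better: Callable[[State, State], bool]) -> bool:
    """True if no step of the trajectory moves to a strictly worse state.

    All this needs is the comparison `better`; no numbers are involved."""
    return all(not better(earlier, later)
               for earlier, later in zip(trajectory, trajectory[1:]))

def expected_utility(lottery: Sequence[Tuple[float, State]],
                     utility: Callable[[State], float]) -> float:
    """Utility functions earn their keep when ranking *lotteries* over outcomes,
    which is a thing agents do, not something optimization as such requires."""
    return sum(p * utility(s) for p, s in lottery)

# Example: states are integers, "better" means strictly larger.
print(goes_up_the_ordering([1, 3, 3, 7], better=lambda a, b: a > b))  # True
print(expected_utility([(0.5, 0), (0.5, 10)], utility=float))          # 5.0
```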
Yeah, this is why we need a better explainer for agent foundations. I won't do it justice in this comment but I'll try to say some helpful words. (Have you read the Rocket Alignment Problem?)
Do you expect there will be a whole new paradigm, and that current neural networks will be nothing like future AIs?
I can give an easy "no" to this question. I do not necessarily expect future AIs to work in a whole new paradigm.
My understanding is that you're trying to build a model of what actual AI agents will be like.
This doesn't really describe what I'm doing. I'm trying to help figure out what AIs we should build, so I'm hoping to affect what actual AI agents will be like.
But more of what I'm doing is trying to understand what the space of possible agents looks like at all. I can see how that could sound like someone saying, "it seems like we don't know how to build a safe bridge, so I'm going to start by trying to understand what the space of possible configurations of matter looks like at all" but I do think it's different than that.
Let me try putting it this way. The arguments that AI could be an existential risk were formed before neural networks were obviously useful for anything. So the inherent danger of AIs does not come from anything particular to current systems. These arguments use specific properties about the general nature of intelligence and agency. But they are ultimately intuitive arguments. The intuition is good enough for us to know that the arguments are correct, but not good enough to help us understand how to build safe AIs. I'm trying to find the formalization behind those intuitions, so that we can have any chance at building a safe thing. Once we get some formal results about how powerful AIs could be safe even in principle, then we can start thinking about how to build versions of existing systems that have those properties. (And yes, that's a really long feedback loop, so I try to regularly check that my trains of thought could still in principle apply to ML systems.)
I'd agree that the bits of output are not independent in some physical sense. But they're definitely independent in my mind! If I hear that the 100th binary digit of pi is 1, then my subjective probability over the 101st digit does not update at all, and remains at 0.5/0.5. So this still feels like a frequentism/Bayesianism thing to me.
Re: the modified experiment about random strings, you say that "To get the string of random bits we have to sample a coin flip, and then make two copies of the outcome". But there's nothing preventing the universe from simply containing two copies of the same random string, created causally independently. But that's also vanishingly unlikely as the string gets longer.
Yeah, I think I agree that the resolution here is something about how we should use these words. In practice I don't find myself having to distinguish between "statistics" and "probability" and "uncertainty" all that often. But in this case I'd be happy to agree that "all statistical correlations are due to causal influences" given that we mean "statistical" in a more limited way than I usually think of it.
But I don't think we know how to properly formalise or talk about that yet.
A group of LessWrong contributors has made a lot of progress on these ideas of logical uncertainty and (what I think they're now calling) functional decision theory over the last 15ish years, although I don't really follow it myself, so I'm not sure how close they'd say we are to having it properly formalized.
Thanks for writing that out! I've enjoyed thinking this through some more.
I agree that, if you instantiated many copies of the program across the universe as your sampling method, or somehow otherwise "ran them many times", then their outputs would be independent in the sense that P(A, B) = P(A)P(B). This also holds true if, on each run, there was some "local" error in the program's otherwise deterministic output.
I had intended to use the programs' outputs as time series of bits, where the bits at each timestep are "samples" of A and B. Let's say each is a program that outputs the binary digits of pi. I have no idea what the bits are (after the first few), but there is a sense in which P(A) = 0.5 for either A = 0 or A = 1, at any timestep. The same is true for P(B). So P(A)P(B) = 0.25. But clearly P(A = 0, B = 0) = 0.5, and P(A = 0, B = 1) = 0, et cetera. So in that case, they're not probabilistically independent, and therefore there is a correlation not due to a causal influence.
But this is in a Bayesian framing, where the probability isn't a physical thing about the programs, it's a thing inside my mind. So, while there is a common source of the correlation (my uncertainty over what the digits of pi are) it's certainly not a "causal influence" on A and B.
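For concreteness, here's a toy numerical version of that (my own sketch; I've substituted a deterministic hash-based bit stream for the digits of pi, since all the argument needs is a fixed sequence I'm uncertain about):

```python
import hashlib
from collections import Counter

def program_bits(n: int) -> list:
    """A deterministic 'program': bits read off from hashing a counter.
    (Stands in for the binary digits of pi; any fixed sequence works.)"""
    bits = []
    counter = 0
    while len(bits) < n:
        for byte in hashlib.sha256(str(counter).encode()).digest():
            for shift in range(8):
                bits.append((byte >> shift) & 1)
        counter += 1
    return bits[:n]

# Two copies of the same program, as causally separated as you like.
A = program_bits(10_000)
B = program_bits(10_000)

# Marginally each stream looks like a fair coin, so P(A=1) * P(B=1) is about 0.25 ...
print(sum(A) / len(A), sum(B) / len(B))

# ... but jointly only the pairs (0,0) and (1,1) ever occur, each with frequency
# about 0.5, so the bits are maximally correlated with no causal influence between
# the two copies. The correlation lives in my uncertainty about the shared sequence.
joint = Counter(zip(A, B))
print({pair: count / len(A) for pair, count in sorted(joint.items())})
```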
This matters to me because, in the context of agent foundations and AI alignment, I want my probabilities to be representing my state of belief (or the agent's state of belief).
Ultimately, all statistical correlations are due to causal influences.
As a regular LW reader who has never been that into causality, this reads as a blisteringly hot take to me. My first thought is, what about acausal correlations? You could have two instances of the same program running on opposite sides of the universe, and their outputs would be the same, but there is clearly no causal influence there. The next example that comes to mind is two planets orbiting their respective stars which just so happen to have the same orbital period; their angular positions over time will correlate, and again there is no common cause.
(In both cases you could say that the common cause is something like the laws of physics allowing two copies of similar systems to come into existence, but I would say that stretches the concept of causality beyond usefulness.)
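As a sanity check on the planets version (again just my own toy calculation): give two planets around different stars the same period but unrelated phases, and the angle each has swept out over time is perfectly linearly correlated, with nothing causally connecting them.

```python
import math

# Same orbital period (hence same angular velocity), arbitrary unrelated phases.
omega = 2 * math.pi / 687.0          # radians per day, e.g. a Mars-length year
phase_1, phase_2 = 0.3, 4.1          # radians; nothing relates these two numbers

days = range(0, 3000, 10)
theta_1 = [omega * t + phase_1 for t in days]   # total angle swept by planet 1
theta_2 = [omega * t + phase_2 for t in days]   # total angle swept by planet 2

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson(theta_1, theta_2))   # 1.0 (up to floating point): perfectly correlated
```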
I also notice that there's no Wikipedia page for "Reichenbach’s Common Cause Principle", which makes me think it's not a particularly widely accepted idea. (In any case I don't think this has an effect on the value of the rest of this sequence.)
Unrelatedly, why not make this a cross-post rather than a link-post?