Requisite Background: Embedded Agency Sequence
Fast forward a few years, and imagine that we have a complete physical model of an E. coli bacterium. We know every function of every gene, the kinetics of every reaction, the physics of every membrane and motor. Computational models of the entire bacterium are able to accurately predict responses to every experiment we run.
Biologists say things like “the bacterium takes in information from its environment, processes that information, and makes decisions which approximately maximize fitness within its ancestral environment.” We have strong outside-view reasons to expect that the information processing in question probably approximates Bayesian reasoning (for some model of the environment), and the decision-making process approximately maximizes some expected utility function (which itself approximates fitness within the ancestral environment).
So presumably, given a complete specification of the bacterium’s physics, we ought to be able to back out its embedded world-model and utility function. How exactly do we do that, mathematically? What equations do we even need to solve?
As a computational biology professor I used to work with said, “Isn’t that, like, the entire problem of biology?”
Economists say things like “financial market prices provide the best publicly available estimates for the probabilities of future events.” Prediction markets are an easy case, but let’s go beyond that: we have massive amounts of price data and transaction data from a wide range of financial markets - futures, stocks, options, bonds, forex... We also have some general background economic data, e.g. Fed open-market operations and the IOER rate, the tax code, the regulatory code, and the like. How can we back out the markets’ implicit model of the economy as a whole? What equations do we need to solve to figure out, not just what markets expect, but markets’ implicit beliefs about how the world works?
Then the other half: aside from what markets expect, what do markets want? Can we map out the (approximate, local) utility functions of the component market participants, given only market data?
Imagine we have a complete model of the human connectome. We’ve mapped every connection in one human brain, we know the dynamics of every cell type. We can simulate it all accurately enough to predict experimental outcomes.
Psychologists (among others) expect that human brains approximate Bayesian reasoning and utility maximization, at least within some bounds. Given a complete model of the brain, presumably we could back out the human’s beliefs, their ontology, and what they want. How do we do that? What equations would we need to solve?
Pull up the specifications for a trained generative adversarial network (GAN). We have all the parameters, we know all the governing equations of the network.
We expect the network to approximate Bayesian reasoning (for some model). Indeed, GAN training is specifically set up to mimic the environment of decision-theoretic agents. If anything is going to precisely approximate mathematical ideal agency, this is it. So, given the specification, how can we back out the network’s implied probabilistic model? How can we decode its internal ontology - and under what conditions do we expect it to develop nontrivial ontological structure at all?
"We have strong outside-view reasons to expect that the information processing in question probably approximates Bayesian reasoning (for some model of the environment), and the decision-making process approximately maximizes some expected utility function (which itself approximates fitness within the ancestral environment)."
The use of "approximates" in this sentence (and in the post as a whole) is so loose as to be deeply misleading - for the same reasons that the "blue-minimising robot" shouldn't be described as maximising some expected utility function, and the information processing done by a single neuron shouldn't be described as Bayesian reasoning (even approximately!).
See also: coherent behaviour in the real world is an incoherent concept.
I think the idea that real-world coherence can't work mainly stems from everybody relying on the VNM utility theorem, and then trying to make it work directly without first formulating the agent's world-model as a separate step. If we just forget about the VNM utility theorem and come at the problem from a more principled Bayesian angle instead, things work out just fine.
Here's the difference: the VNM utility theorem postulates "lotteries" as something already present in the ontology. Agents have preferences over lotteries directly, and agents' preferences must take probabilities as inputs. There's no built-in notion of what exactly "randomness" means, what exactly a "probability" physically corresponds to, or anything like that. If we formulate those notions correctly, then things work, but VNM utility does not itself provide the formulation, so everybody gets confused.
Contrast that with e.g. FTAP (the fundamental theorem of asset pricing) + Dutch book arguments: these provide a similar-looking conclusion to VNM utility theory (i.e. maximize expected utility), but the assumptions are quite different. In particular, they do not start with any inherent notion of "probability" - assuming inexploitability, they show that some (not necessarily unique) probability distribution exists, under which the agent can be interpreted as maximizing utility. This puts focus on the real issue: what exactly is the agent's world-model?
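To make the Dutch book side of this concrete, here is a minimal sketch of the simplest case only (it is not the FTAP in general). It assumes an agent who quotes a price for a $1 bet on each of a set of mutually exclusive, exhaustive outcomes, and who will buy or sell at that price. All names and numbers are illustrative assumptions, not taken from any library. If the agent cannot be exploited, the prices can be read as a probability distribution; if it can, the code constructs the trade that exploits it.

```python
def dutch_book(prices, tol=1e-9):
    """Return (positions, sure_profit) for a trade against the agent, or None.

    prices: {outcome: price the agent quotes for a bet paying $1 on outcome}.
    positions: {outcome: +1 to buy one bet from the agent, -1 to sell one}.
    Assumes outcomes are mutually exclusive and exhaustive (exactly one occurs).
    """
    for outcome, p in prices.items():
        if p < -tol:
            return ({outcome: 1}, -p)        # paid to hold a bet worth >= 0
        if p > 1 + tol:
            return ({outcome: -1}, p - 1)    # collect p, owe at most $1
    total = sum(prices.values())
    if total < 1 - tol:
        return ({o: 1 for o in prices}, 1 - total)   # pay total, collect $1
    if total > 1 + tol:
        return ({o: -1 for o in prices}, total - 1)  # collect total, pay $1
    return None


def worst_case_profit(prices, positions):
    """Minimum profit of `positions` over all outcomes that could occur."""
    return min(
        sum(n * ((1.0 if o == winner else 0.0) - prices[o])
            for o, n in positions.items())
        for winner in prices
    )


def implied_distribution(prices):
    """Prices read as probabilities, if (and only if) no Dutch book exists."""
    return dict(prices) if dutch_book(prices) is None else None


coherent = {"rain": 0.3, "sun": 0.7}
incoherent = {"rain": 0.3, "sun": 0.6}
print(implied_distribution(coherent))
print(dutch_book(incoherent))
```

In this toy setup every outcome is priced, so the distribution is pinned down. When only some bets are priced, several distributions can be consistent with the same prices, which is where the "not necessarily unique" caveat comes from.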
As you say in the post you linked:
"those hypothetical choices are always between known lotteries with fixed probabilities, rather than being based on our subjective probability estimates as they are in the real world... VNM coherence is not well-defined in this setup, so if we want to formulate a rigorous version of this argument, we’ll need to specify a new definition of coherence which extends the standard instantaneous-hypothetical one."
... which is exactly right. That's why I consider VNM coherence a bad starting point for this sort of thing.
Getting more into the particulars of that post...
I would summarize the main argument in your post as roughly: "we can't observe counterfactual behavior, and without that we can't map the utility function, unless the utility function is completely static and depends only on the current state of the world." So we can't map utilities over trajectories, we can't map off-equilibrium strategies, we can't map time-dependent utilities, etc.
The problem with that line of argument is that it treats the agent as a black box. Breaking open the black box is what embedded agency is all about, including all the examples in the OP. Once the black box is open, we do not need to rely on observed behavior - we know what the internal gears are, so we can talk about counterfactual behavior directly. In particular, once the black box is open, we can (in principle) talk about the agent's internal ontology. Once the agent's internal ontology is known, possibilities like "the agent prefers to travel in circles" are hypotheses we can meaningfully check - not by observing the agent's behavior, but by seeing what computation it performs with its internal notion of "travelling in circles".
What's the connection to "Embedded Agency", and what do you mean by using the term?
(This piece sounds like it's about extracting utility functions and probability distributions, and it's not clear how that's related (in the framework this post outlines).)
[As a side note, I notice that the habit of "pepper things with hyperlinks whenever possible" seems to be less common on modern LW than it was on old LW, but I think it was actually a pretty great habit and I'd like to see more of it.]
Thanks for bringing up the hyperlink thing; I will use them more liberally in the future. When writing for a LW audience, I tend to lean toward fewer links to avoid sounding patronizing. But actually thinking about it for a second, that seems like a very questionable gain with a significant cost.
Yeah, this seems true. Might be subtle UI things. We could probably also push towards this by making searching for links easier, for example by having a GitHub-style search that shows up when you start typing some character (like / or #).
Let me know if you've read the link Vaniver gave and the connection still isn't clear. If that's the case, then there's an inferential gap I've failed to notice, and I'll probably write a whole additional post to flesh out that connection.
One way that comes to mind is to use the constructive VNM utility theorem proof. The construction is going to be approximate because the system's rationality is. So next things to study include in what way the rationality is approximate, and how well this and other constructions preserve this (and other?) approximations.
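As a rough sketch of what the constructive proof does: fix a best and a worst outcome, and define the utility of any other outcome as the probability p at which the agent is indifferent between that outcome for sure and a lottery giving the best outcome with probability p and the worst otherwise. The code below assumes a hypothetical `prefers_lottery(outcome, p)` query that the agent answers consistently; the names, the toy agent, and the tolerance are all illustrative assumptions.

```python
def elicit_utility(prefers_lottery, outcome, tol=1e-6):
    """Bisect for the indifference probability p in [0, 1].

    prefers_lottery(outcome, p) is True when the agent prefers the lottery
    (best outcome with probability p, worst otherwise) to getting `outcome`
    for sure. Assumes the answer flips from False to True at most once as
    p increases, which is what idealized VNM rationality would give.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prefers_lottery(outcome, mid):
            hi = mid   # lottery already preferred: indifference point is lower
        else:
            lo = mid   # sure outcome still preferred: indifference point is higher
    return (lo + hi) / 2


# Toy agent with a hidden utility, used only to check the reconstruction.
hidden_utility = {"worst": 0.0, "apple": 0.25, "banana": 0.6, "best": 1.0}


def toy_agent(outcome, p):
    # Expected utility of the lottery is p * 1 + (1 - p) * 0 = p.
    return p > hidden_utility[outcome]


recovered = {o: elicit_utility(toy_agent, o) for o in hidden_utility}
print(recovered)
```

For an agent that is only approximately rational, the answers may be noisy or may flip more than once, and the bisection then returns something that depends on which queries happened to be asked. Note also that the query takes the probability p as a given input.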
Oh, and isn't inverse reinforcement learning about this?
See my reply to ricraz's comment for my thoughts on using the VNM utility theorem in general. The use you suggest could work, but if we lean on VNM then the hard part of the problem is backing out the agent's internal probabilistic model.
IRL is about this, but the key difference is that it black-boxes the agent. It doesn't know what the agent's internal governing equations look like, it just sees the outputs.
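One illustration of why seeing only the outputs is limiting: if utility is allowed to depend on the state of the world, then beliefs and utilities cannot be separated from choices alone. The sketch below uses made-up numbers and hypothetical names. It builds two agents with different beliefs and different utilities that nonetheless rank every action identically, so behavioral data cannot tell them apart.

```python
def expected_utilities(belief, utility):
    """belief: {state: prob}; utility: {action: {state: value}}."""
    return {a: sum(belief[s] * u[s] for s in belief) for a, u in utility.items()}


def ranking(belief, utility):
    """Actions sorted from most to least preferred by expected utility."""
    eu = expected_utilities(belief, utility)
    return sorted(eu, key=eu.get, reverse=True)


def reweight(belief, utility, scale):
    """Multiply each state's probability by scale[s] > 0 (then renormalize)
    and divide the same factor back out of the utilities. Every expected
    utility is divided by the same positive constant, so rankings are kept."""
    z = sum(belief[s] * scale[s] for s in belief)
    new_belief = {s: belief[s] * scale[s] / z for s in belief}
    new_utility = {a: {s: v / scale[s] for s, v in u.items()}
                   for a, u in utility.items()}
    return new_belief, new_utility


belief_a = {"rain": 0.2, "sun": 0.8}
utility_a = {
    "umbrella": {"rain": 5.0, "sun": 1.0},
    "no_umbrella": {"rain": -10.0, "sun": 3.0},
    "stay_home": {"rain": 1.0, "sun": 1.0},
}
belief_b, utility_b = reweight(belief_a, utility_a, {"rain": 4.0, "sun": 1.0})

print(belief_a, ranking(belief_a, utility_a))
print(belief_b, ranking(belief_b, utility_b))
```

Agent B believes rain is much more likely than agent A does, yet the two behave the same. Telling them apart requires looking at the internal computation rather than the choices.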