Martín Soto

Mathematical Logic grad student, doing AI Safety research for ethical reasons.

Working on conceptual alignment, decision theory, cooperative AI and cause prioritization.

My webpage.

Leave me anonymous feedback.


Counterfactuals and Updatelessness
Quantitative cruxes and evidence in Alignment


This is pure gold, thanks for sharing!

Didn't know about ruliad, thanks!

I think a central point here is that "what counts as an observer (an agent)" is observer-dependent (more here), even if under our particular laws of physics there are some pressures towards agents having a certain shape, etc. (more here). And then it's immediate that each ruliad has an agent, for the right observer (or similarly, for a certain decryption of it).

I'm not yet convinced "the mapping function/decryption might be so complex it doesn't fit our universe" is relevant. If you want to philosophically defend "functionalism with functions up to complexity C" instead of "functionalism", you can, but C starts seeming arbitrary?

Also, a Ramsey-theory argument would be very cool.

Yep! Although I think the philosophical point goes deeper. The algorithm our brains themselves use to find a pattern is part of the picture. It is a kind of "fixed (de/)encryption".

Thank you, habryka!

As mentioned in my answer to Eliezer, my arguments were made with that correct version of updatelessness in mind (not "being scared to learn information", but "ex ante deciding whether to let this action depend on this information"), so they hold, according to me.
But it might be true I should have stressed this point more in the main text.

Yep! I hadn't included pure randomization in the formalism, but it can be done and will yield some interesting insights.

As you mention, we can also include pseudo-randomization. And taking these bounded rationality considerations into account also makes our reasoning richer and more complex: it's unclear exactly when an agent wants to obfuscate its reasoning from others, etc.

First off, that  was supposed to be , sorry.

The agent might commit to "only updating on those things accepted by program ", even when it still doesn't have the complete infinite list of "exactly in which things does  update" (in fact, this is always the case, since we can't hold an infinite list in our head). It will, at the time of committing, know that  updates on certain things, doesn't update on others... and it is uncertain about exactly what it does in all other situations. But that's okay, that's what we do all the time: decide on an endorsed deliberation mechanism based on its structural properties, without yet being completely sure of what it does (otherwise, we wouldn't need the deliberation). But it does advise against committing while being too ignorant.

Is it possible for all possible priors to converge on optimal behavior, even given unlimited observations?

Certainly not, in the most general case, as you correctly point out.

Here I was studying a particular case: updateless agents in a world remotely looking like the real world. And even more particular: thinking about the kinds of priors that superintelligences created in the real world might actually have.

Eliezer believes that, in these particular cases, it's very likely we will get optimal behavior (we won't get trapped priors, nor commitment races). I disagree, and that's what I argue in the post.

I'm also surprised that dynamic stability leads to suboptimal outcomes that are predictable in advance. Intuitively, it seems like this should never happen.

If by "predictable in advance" you mean "from the updateless agent's prior", then nope! Updatelessness maximizes EV from the prior, so it will do whatever looks best from this perspective. If that's what you want, then updatelessness is for you! The problem is, we have many pro tanto reasons to think this is not a good representation of rational decision-making in reality, nor the kind of cognition that survives for long in reality, because of considerations about "the world being so complex that your prior will be missing a lot of stuff". And in particular, multi-agentic scenarios make this complexity skyrocket.
Of course, you can say "but that consideration will also be included in your prior". And that does make the situation better. But eventually your prior needs to end. And I argue that this happens well before you have all the necessary information to confidently commit to something forever (but other people might disagree with this).

Is this consistent with the way you're describing decision-making procedures as updateful and updateless?

Absolutely. A good implementation of UDT can, from its prior, decide on an updateful strategy. It's just that it won't be able to change its mind about which updateful strategy seems best. See this comment for more.

"flinching away from true information"

As mentioned also in that comment, correct implementations of UDT don't actually flinch away from information: they just decide ex ante (when still not having access to that information) whether or not they will let their future actions depend on it.

The problem remains though: you make the ex ante call about which information to "decision-relevantly update on", and this can be a wrong call, and this creates commitment races, etc.
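To make the ex ante / ex post distinction concrete, here is a minimal sketch in Python. The setup is my own toy assumption (a counterfactual-mugging-style game with a fair coin and made-up payoffs), not something taken from the post, and the names `PRIOR`, `utility`, `updateless_choice` and `updateful_action` are hypothetical. The updateless agent scores whole policies from the prior; the updateful agent scores single actions after seeing the observation.

```python
from itertools import product

# Hypothetical toy setup (an assumption for illustration): a fair coin, and a
# predictor that pays out on tails only to agents whose policy pays on heads.
PRIOR = {"heads": 0.5, "tails": 0.5}
ACTIONS = ("pay", "refuse")


def utility(policy, observation):
    """Payoff in one world; it depends on the whole policy, not just one act."""
    if observation == "heads":
        return -100 if policy["heads"] == "pay" else 0
    # On tails, the predictor rewards the disposition to pay on heads.
    return 10_000 if policy["heads"] == "pay" else 0


def all_policies():
    """Enumerate every mapping from observation to action."""
    observations = sorted(PRIOR)
    for choice in product(ACTIONS, repeat=len(observations)):
        yield dict(zip(observations, choice))


def prior_expected_value(policy):
    """Expected utility of a policy, evaluated from the prior."""
    return sum(p * utility(policy, obs) for obs, p in PRIOR.items())


def updateless_choice():
    """Pick, ex ante, the policy with the highest expected value under the prior."""
    return max(all_policies(), key=prior_expected_value)


def updateful_action(observation):
    """After seeing the observation, pick the best act in that world only."""
    def value(action):
        policy = {obs: action for obs in PRIOR}
        return utility(policy, observation)
    return max(ACTIONS, key=value)
```

In this toy game the policy chosen from the prior pays on heads, while the agent that conditions on having seen heads refuses. Note that the updateless agent still "sees" the coin: it simply fixed in advance how its action would depend on it, and that advance call is exactly what can turn out wrong if the prior is missing something.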

I'm not sure we are in disagreement. No one is denying that the territory shapes the maps (which are part of the territory). The central point is just that our perception of the territory is shaped by our perceptors, etc., and need not be the same. It is still conceivable that, due to how the territory shapes this process (due to the most likely perceptors to be found in evolved creatures, etc.), there ends up being a strong convergence, so that all maps represent certain properties of the territory isomorphically. But this is not a given, and needs further argumentation. After all, it is conceivable for a territory to exist that incentivizes the creation of two very different and non-isomorphic types of maps. But of course, you can argue our territory is not such, by looking at its details.

Where “joint carvy-ness” will end up being, I suspect, related to “gears that move the world,” i.e., the bits of the territory that can do surprisingly much, have surprisingly much reach, etc.

I think this falls into the same circularity I point at in the post: you are defining "naturalness of a partition" as "usefulness to efficiently affect / control certain other partitions", so you already need to care about the latter. You could try to say something like "this one partition is useful for many partitions", but I think that's physically false, by combinatorics (in all cases you can always build as many partitions that are affected by another one). More on these philosophical subtleties here: Why does generalization work?
