Thane Ruthenis

I'm not sure of this.  It seems at least possible that we could get an equilibrium where everyone does use the unfiltered UP (in some part of their reasoning process), trusting that no one will manipulate them because (a) manipulative behavior is costly and (b) no one has any reason to expect anyone else will reason differently from them, so if you choose to manipulate someone else you're effectively choosing that someone else will manipulate you.

Fair point! I agree.

When you say "Tegmark IV," I assume you mean the computable version -- right?

Yep.

We have this sort of symmetry-breaker in the version of the argument that postulates, by fiat, a "UP-using dupe" somewhere, for some reason

Correction: on my model, the dupe is also using an approximation of the UP, not the UP itself. I. e., it doesn't need to be uncomputable. The difference between it and the con men is just the naivety of the design. It generates guesses regarding what universes it's most likely to be in (potentially using abstract reasoning), but then doesn't "filter" these universes; doesn't actually "look inside" and determine if it's a good idea to use a specific universe as a model. It doesn't consider the possibility of being manipulated through it; doesn't consider the possibility that it contains daemons.
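
To make the "filtering" distinction concrete, here's a toy sketch. Everything in it – the hypothesis names, the complexities, and the `looks_adversarial` flag – is invented for illustration; actually implementing that flag (i. e., looking inside a hypothesis and spotting daemons/manipulation) is the hard part the naive design skips:

```python
# Toy contrast between a "naive" and a "filtered" agent reasoning about
# which hypothesis/universe it's in. All names and numbers are made up.
hypotheses = [
    {"name": "ordinary physics",          "complexity": 40, "looks_adversarial": False},
    {"name": "simulation run by con men", "complexity": 35, "looks_adversarial": True},
    {"name": "weird but benign physics",  "complexity": 50, "looks_adversarial": False},
]

def normalize(weights):
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Naive agent: weights hypotheses purely by simplicity (2^-complexity),
# never looks inside them before acting on the resulting distribution.
naive = normalize({h["name"]: 2.0 ** -h["complexity"] for h in hypotheses})

# Filtered agent: same simplicity weighting, but hypotheses that look like
# manipulation attempts are discarded (or down-weighted) before acting.
filtered = normalize({h["name"]: 2.0 ** -h["complexity"]
                      for h in hypotheses if not h["looks_adversarial"]})

print(naive)     # the manipulative hypothesis dominates, since it's "simplest" here
print(filtered)  # it gets filtered out before the agent acts
```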

I. e.: the real difference is that the "dupe" is using causal decision theory, not functional decision theory.

We can just notice that we'd all be better off if no one did the malign thing, and then no one will do it

I think that's plausible: that there aren't actually that many "UP-using dupes" in existence, so the con men don't actually care to stage these acausal attacks.

But: if that is the case, it's because the entities designing/becoming powerful agents considered the possibility of con men manipulating the UP, and so made sure that they're not just naively using the unfiltered (approximation of the) UP.

That is: yes, it seems likely that the equilibrium state of affairs here is "nobody is actually messing with the UP". But it's because everyone knows the UP could be messed with in this manner, so no-one is using it (nor its computationally tractable approximations).

It might also not be the case, however. Maybe there are large swathes of reality populated by powerful yet naive agents, such that whatever process constructs them (some alien evolution analogue?) doesn't teach them good decision theory at all. So when they figure out Tegmark IV and the possibility of acausal attacks/being simulation-captured, they give in to whatever "demands" are posed to them. (I. e., there might be entire "worlds of dupes", somewhere out there among the mathematically possible.)

That said, the "dupe" label actually does apply to a lot of humans, I think. I expect that a lot of people, if they ended up believing that they're in a simulation and that the simulators would do bad things to them unless they do X, would do X. The acausal con men would only care to actually do it, however, if a given person is (1) in the position where they could do something with large-scale consequences, (2) smart enough to consider the possibility of simulation-capture, (3) not smart enough to ignore blackmail.

Consider a different problem: a group of people are posed some technical or mathematical challenge. Each individual person is given a different subset of the information about the problem, and each person knows what type of information every other participant gets.

Trivial example: you're supposed to find the volume of a pyramid; you (participant 1) are given its height and the apex angles of two triangular faces; participant 2 is given the radius of the sphere on which all of the pyramid's vertices lie and all angles of the triangular faces; participant 3 is given the areas of all faces; et cetera.
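
A minimal sketch of participant 1's computation – assuming, purely for illustration (the original setup doesn't pin this down), a right pyramid with a rectangular base, apex above the base's center, and the two given apex angles belonging to two adjacent faces:

```python
import math

def volume_from_height_and_apex_angles(h, alpha_a, alpha_b):
    """Volume of a right rectangular pyramid (apex above the base's center),
    given its height h and the apex angles (in radians) of two adjacent
    triangular faces. Uses tan(alpha_a/2) = (a/2) / sqrt(h^2 + (b/2)^2)
    and the symmetric relation for b, then V = a*b*h/3."""
    t_a = math.tan(alpha_a / 2) ** 2
    t_b = math.tan(alpha_b / 2) ** 2
    denom = 1 - t_a * t_b
    x = t_a * h**2 * (1 + t_b) / denom  # (a/2)^2
    y = t_b * h**2 * (1 + t_a) / denom  # (b/2)^2
    a, b = 2 * math.sqrt(x), 2 * math.sqrt(y)
    return a * b * h / 3

# Sanity check: build a known 3 x 4 base pyramid of height 5, read off its
# apex angles, and recover its volume (should be 3*4*5/3 = 20).
a, b, h = 3.0, 4.0, 5.0
alpha_a = 2 * math.atan((a / 2) / math.sqrt(h**2 + (b / 2) ** 2))
alpha_b = 2 * math.atan((b / 2) / math.sqrt(h**2 + (a / 2) ** 2))
print(volume_from_height_and_apex_angles(h, alpha_a, alpha_b))  # ~20.0
```

The point being: under these assumptions participant 1 has enough information for an exact answer, and anyone who knows what information participant 1 has can verify that without modeling participant 1's mind.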

Given this setup, if you're skilled at geometry, you can likely figure out which of the participants can solve the problem exactly, which can only put upper and lower bounds on the volume, and what those upper/lower bounds are for each participant. You don't need to model your competitors' mental states: all you need to do is reason about the object-level domain, plus take into account what information they have. No infinite recursion happens, because you can abstract out the particulars of how others' minds work.

This works assuming that everyone involved is perfectly skilled at geometry: that you don't need to predict what mistakes the others would make (which would depend on the messy details of their minds).

Speculatively, this would apply to deception as well. You don't necessarily need to model others' brain states directly. If they're all perfectly skilled at deception, you can predict what deceptions they'd try to use and how effective they'd be based on purely objective information: the sociopolitical landscape, their individual skills and comparative advantages, et cetera. You can "skip to the end": predict everyone playing their best-move-in-circumstances-where-everyone-else-plays-their-best-move-too.

Objectively, the distribution of comparative advantages is likely very uneven, so even if everyone makes their best move, some would hopelessly lose. (E. g., imagine if one of the experts is a close friend of a government official and the other is a controversial figure who'd previously been found guilty of fraud.)
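
A toy sketch of that "skip to the end" move: given only publicly known payoffs (the pitch names and numbers below are invented for illustration, with expert A given an objective credibility advantage), you can compute the everyone-best-responds outcome directly from the payoff structure, without modeling anyone's mental states – and see that B loses even when playing optimally:

```python
import itertools

# Payoffs depend only on public information (skills, credibility, politics),
# encoded here as a payoff table: payoff[(a, b)] = (utility to A, utility to B).
pitches = ["honest", "spin", "smear"]
payoff = {
    ("honest", "honest"): (3, 1), ("honest", "spin"): (2, 2), ("honest", "smear"): (4, 0),
    ("spin",   "honest"): (4, 0), ("spin",   "spin"): (3, 1), ("spin",   "smear"): (2, 1),
    ("smear",  "honest"): (1, 2), ("smear",  "spin"): (1, 3), ("smear",  "smear"): (0, 0),
}

def is_equilibrium(a, b):
    """(a, b) is a pure equilibrium if neither expert gains by unilaterally deviating."""
    ua, ub = payoff[(a, b)]
    return (all(payoff[(a2, b)][0] <= ua for a2 in pitches)
            and all(payoff[(a, b2)][1] <= ub for b2 in pitches))

print([ab for ab in itertools.product(pitches, pitches) if is_equilibrium(*ab)])
# -> [('spin', 'spin')]: both play their best move, and B still ends up far behind.
```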

Speculatively, something similar works for the MUP stuff. You don't actually need to model the individual details of other universes. You can just use abstract reasoning to figure out what kinds of universes are dense across Tegmark IV, figure out what (distributions over) entities inhabit them, figure out (distributions over) how they'd reason, and what (distributions over) simulations they'd run, and to what (distribution over the) output this process converges given the objective material constraints involved. Then take actions that skew said distribution-over-the-output in a way you want.

Again, this is speculative: I don't know that there are any math proofs that this is possible. But it seems plausible enough that something-like-this might work, and my understanding is that the MUP argument (and other kinds of acausal-trade setups) indeed uses this as a foundational assumption. (I. e., it assumes that the problem is isomorphic (in a relevant sense) to my pyramid challenge above.)

(IIRC, the Acausal Normalcy post outlines some of the relevant insights, though I think it doesn't precisely focus on the topic at hand.)

Why?

It was already known that the AGI Labs were experimenting with synthetic data and that OpenAI is training GPT-5, and the article is light on new details:

  • It's not really true that modern AIs "can't reliably solve math problems they haven't seen before": this depends on the operationalization of "a math problem" and "seen before". All this statement says is "Strawberry is better at math than the SOTA models", which in turn means "nonzero AI progress".
  • Similar for hallucinations.
  • The one concrete example is solving NYT Connections, but Claude 3.5 can already do it on a good day.

I mean, the state of affairs is by no means not worrying, but I don't really see what in this article would prompt a meaningful update?

Answer by Thane Ruthenis

Here's my understanding of the whole thing:

  • "Malign universal prior" arguments basically assume a setup in which we have an agent with a big dumb hard-coded module whose goal is to find this agent's location in Tegmark IV. (Or maybe perform some other important task that requires reasoning about Tegmark IV, but let's run with that as the example.)
  • The agent might be generally intelligent, the Solomonoff-induction-approximating module might be sophisticated in all kinds of ways, but it's "dumb" or "naive" in an important sense: it's just trying to generate the best-guess distribution over the universes the agent is in, no matter their contents, and the agent then blindly acts on that distribution.
  • Importantly, this process doesn't necessarily involve actually running any low-level simulations of other universes. Generally intelligent/abstract reasoning, some steps of which might literally replicate the reasoning steps of Paul's post, would also fit the bill.
  • The MUP argument is that this is sufficient for alien consequentialists to take advantage. The agent is asking, "where am I most likely to be?", and the alien consequentialists are skewing the distribution such that the most likely correct answer is "simulation-captured by acausal aliens" or whatever.
    • (And then the malign output is producing "predictions" about the future of the agent's universe like "the false vacuum collapse is going to spontaneously trigger in the next five minutes unless you perform this specific sequence of actions that happen to rewrite your utility function in such-and-such ways", and our big dumb agent is gormlessly buying this, and its "real" non-simulation-captured instance rewrites itself accordingly.)
  • Speed prior vs. complexity prior: a common guess regarding the structure of Tegmark IV is that it penalizes K-complexity but doesn't care how much memory/compute needs to be allocated to run a universe (see the sketch right below this list). If that is true, then any sufficiently good approximation of Solomonoff induction – any sufficiently good procedure for getting an answer to "where am I most likely to be?", including abstract reasoning – would take this principle into account, and bump up the probability of being in low-complexity universes.
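
(Schematically, with the usual Solomonoff-style prior on the left and a rough speed-prior-style variant on the right – this is a sketch of the contrast, not Schmidhuber's exact construction:

$$M(x) \;=\; \sum_{p\,:\,U(p)\,=\,x*} 2^{-\ell(p)}, \qquad S(x) \;\propto\; \sum_{p\,:\,U(p)\,=\,x*} \frac{2^{-\ell(p)}}{t(p)},$$

where $U$ is a universal machine, the sums range over programs whose output starts with $x$, $\ell(p)$ is a program's length, and $t(p)$ is its runtime. The complexity prior $M$ cares only about description length; a speed prior additionally penalizes the compute needed to actually run the universe.)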

This all seems to check out to me. Admittedly I didn't actually confirm this with any proponents of the argument, though.

(Probably also worth stating that I don't think the MUP is in any way relevant to real life. AI progress doesn't seem to be on the track where it features AGIs that use big dumb "where am I?" modules. E. g., if an AGI is born of anything like an RL-trained LLM, it seems unlikely that its "where am I?" reasoning would be naive in the relevant sense. It'd be able to "manually" filter out universes with malign consequentialists, given good decision theory. You know, like we can.

The MUP specifically applies to highly abstract agent-foundations designs where we hand-code each piece – designs that currently don't seem practically tractable at all.)

... is that why this post has had unusually many downvotes?

Nah, I'd expect it's more broadly because it's making an arrogant-feeling claim that the organization-design market is inefficient and that you think you can beat it. Explicitly suggesting that a world in which the median person were you wouldn't be like this maybe put that in sharper relief, but I don't think this specific framing was a leading cause at all.

There might also be some backlash from people who'd previously interfaced with relatively more benign big organizations, who are wary of these sorts of models and view them as overly cynical/edgy.

So how on Earth (as opposed to Wentworld) are large organizations stable?

My model, prior to reading further:

Well, what do such structures maximize? By what goals are they optimal?

The answer that seems obvious to me is "the perceived power of those at the top of such organizations". The people constituting the-organization-as-a-whole are obviously in a suboptimal configuration for maximizing their collective economic efficiency, or their collective power. But the people in charge find themselves in nominal command of an enormous pile of resources. They are poorer, in an absolute sense, than they would've been if the world were in the equilibrium state in which people are arranged in more efficient economic structures. But they are relatively much more rich and influential than their underlings, which parses as optimal by their power- and status-maximizing heuristics, so it's in their individual interests to preserve this status quo.

At the cultural level, this equilibrium is maintained by such organizations instinctively cooperating with each other to preserve it (as Zvi outlines e. g. here, I think). This is probably also culturally supported by upper managers essentially wanting to feel like "the king of their kingdom", using historical kingdoms/governments as reference points. Said kingdoms/governments, in turn, have historically ended up in big-dumb-organization configurations to maximize the power of those at their helm. (Except they spread their influence by more obvious conquest-and-pillaging, not by hiring people.)

Everyone outside mazes is either not well-versed enough in the subject to understand what they're signing up for when joining, or unable to imagine superior alternative configurations, or culturally indoctrinated to view such structures as inherently valuable as well. So they don't demand larger pay, aren't unwilling to work with large organizations, don't vote to dissolve them, et cetera.

… so e.g. large companies or government agencies are basically runaway human monuments of dominance and submission, limited mainly by their budget.

Yeah, that seems to agree with my model.

I'd only add that this setup increases upper managers' status outside the organizations as well, and in a fairly robust manner. It's not quite like with the rhesus monkeys: the CEO of a Fortune 500 company, if they leave their organization and seek to join a new one, wouldn't start at the very bottom at all. They'd often end up at the executive level as well, or even straight-up be re-hired as a CEO.

You know how the org structure is always hierarchical for some reason? And managers end up being massive communication-bottlenecks across the org, because horizontal coordination usually has to route through them? That’s not a bug, that’s a feature of an organization whose managers optimize first for dominance-status, and only secondarily for making money.

Hm, I don't know if that's necessarily a bug from the perspective of efficient organization design either. Inasmuch as organizations are control systems reflecting the domain they control, the hierarchy reflects the hierarchy of natural abstractions, with the managers/subsystems at higher levels "acting as natural latents", inducing conditional independence between lower-level subsystems. This has productivity-boosting results as well: see e. g. Joel's "development abstraction layer" argument (the people within the subsystem can focus their optimization-power solely on the domain they're responsible for, with no distractions).
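
For reference, by "inducing conditional independence" I mean the mediation condition from the natural-latents framework: conditioning on the latent $\Lambda$ – here, the coordinating manager/subsystem – renders the lower-level parts $X_1, \dots, X_n$ approximately independent of one another,

$$P(X_1, \dots, X_n \mid \Lambda) \;\approx\; \prod_i P(X_i \mid \Lambda).$$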

Obviously this is horribly mismanaged in practice, but it feels like a warped feature of a fundamentally productive structure.

I'd expect that if the natural-abstractions theory gets to the point where it's theoretically applicable to fluid dynamics, then demonstrating said applicability would just be a matter of devoting some amount of raw compute to the task; it wouldn't be bottlenecked on human cognitive resources. You'd be able to do things like setting up a large-scale fluid simulation, pointing the pragmascope at it, and seeing it derive natural abstractions that match the abstractions human scientists and engineers derived for modeling fluids. And in the case of fluids specifically, I expect you wouldn't need that much compute.

(Pure mathematical domains might end up a different matter. Roughly speaking, because of the vast gulf of computational complexity between solving some problems approximately (BPP) vs. exactly. "Deriving approximately-correct abstractions for fluids" maps to the former, "deriving exact mathematical abstractions" to the latter.)

 Let's see if I get this right...

  • Let's interpret the set $\mathcal{X}$ as the set of all possible visual sensory experiences $X = (x_1, \dots, x_k)$, where $x_i$ defines the color of the $i$-th pixel.
  • Different distributions over elements of this set correspond to observing different objects; for example, we can have $P_{\mathrm{car}}(X)$ and $P_{\mathrm{apple}}(X)$, corresponding to us predicting different sensory experiences when looking at cars vs. apples.
  • Let's take some specific set of observations $D = \{X^{(1)}, \dots, X^{(m)}\}$, from which we'd be trying to derive a latent.
  • We assume uncertainty regarding what objects generated the training-set observations, getting a mixture of distributions $P_\alpha(X) = \alpha P_{\mathrm{car}}(X) + (1-\alpha) P_{\mathrm{apple}}(X)$.
  • We derive a natural latent $\Lambda$ for $P_\alpha$ such that $P_\alpha(X^{(1)}, \dots, X^{(m)} \mid \Lambda) \approx \prod_j P_\alpha(X^{(j)} \mid \Lambda)$ for all allowed $\alpha$ (the condition is also written out after this list).
  • This necessarily implies that $\Lambda$ also induces independence between different sensory experiences for each individual distribution in the mixture: $P_{\mathrm{car}}(X^{(1)}, \dots, X^{(m)} \mid \Lambda) \approx \prod_j P_{\mathrm{car}}(X^{(j)} \mid \Lambda)$, and likewise for $P_{\mathrm{apple}}$.
  • If the set $D$ contains some observations generated by cars and some observations generated by apples, yet a nontrivial latent over the entire set nonetheless exists, then this latent must summarize information about some feature shared by both objects.
    • For example, perhaps it transpired that all cars depicted in this dataset are red, and all apples in this dataset are red, so $\Lambda$ ends up as "the concept of redness".
  • This latent then could, prospectively, be applied to new objects. If we later learn of the existence of $P_{\mathrm{ink}}$ – an object seeing which predicts yet another distribution over visual experiences – then $\Lambda$ would "know" how to handle this "out of the box". For example, if we have a set of observations $D'$ such that it contains some red cars and some red ink, then $\Lambda$ would be natural over this set under both distributions, without us needing to recompute it.
  • This trick could be applied for learning new "features" of objects. Suppose we have some established observation-sets $D_{\mathrm{car}}$ and $D_{\mathrm{apple}}$, which have nontrivial natural latents $\Lambda_{\mathrm{car}}$ and $\Lambda_{\mathrm{apple}}$. To find new "object-agnostic" latents, we can try to form new sets of observations from subsets of those observations, define corresponding distributions, and see if mixtures of distributions over those subsets have nontrivial latents.
    • Formally: take $D_{\mathrm{red}} = D_{\mathrm{car}}^{\mathrm{red}} \cup D_{\mathrm{apple}}^{\mathrm{red}}$, where $D_{\mathrm{car}}^{\mathrm{red}} \subset D_{\mathrm{car}}$ and $D_{\mathrm{apple}}^{\mathrm{red}} \subset D_{\mathrm{apple}}$, then define the mixture $P_\alpha$ over it, and we want to see if we have a new $\Lambda_{\mathrm{red}}$ that induces (approximate) independence between all the observations in $D_{\mathrm{red}}$ both under the "apple" and the "car" distributions.
    • Though note that it could be done the other way around as well: we could first learn the latents of "redness" and e. g. "greenness" by grouping all red-having and green-having observations, then try to find some subsets of those sets which also have nontrivial natural latents, and end up deriving the latent of "car" by grouping all red and green objects that happen to be cars.
      • (Which is to say, I'm not necessarily sure there's a sharp divide between "adjectives" and "nouns" in this formulation. "The property of car-ness" is interpretable as an adjective here, and "greenery" is interpretable as a noun.)
    • I'd also expect that the latent over $D_{\mathrm{car}}^{\mathrm{red}}$, i. e. $\Lambda_{\mathrm{car}}^{\mathrm{red}}$, could be constructed out of $\Lambda_{\mathrm{car}}$ and $\Lambda_{\mathrm{red}}$ (derived, respectively, from a pure-cars dataset and an all-red dataset)? In other words, if we simultaneously condition a dataset of red cars on a latent derived from a dataset of any-colored cars and a latent derived from a dataset of red-colored objects, then this combined latent $(\Lambda_{\mathrm{car}}, \Lambda_{\mathrm{red}})$ would induce independence across $D_{\mathrm{car}}^{\mathrm{red}}$ (which $\Lambda_{\mathrm{car}}$ wouldn't be able to do on its own, due to the instances sharing color-related information in addition to car-ness)?
  • All of this is interesting mostly in the approximate-latent regime (this allows us to avoid the nonrobust-to-tiny-mixtures trap), and in situations in which we already have some established latents which we want to break down into interoperable features.
    1. In principle, if we have e. g. two sets of observations that we already know correspond to nontrivial latents, e. g. $D_{\mathrm{car}}$ and $D_{\mathrm{apple}}$, we could directly try to find subsets of their union that correspond to new nontrivial latents, in the hopes of recovering some features that'd correspond to grouping observations along some other dimension.
    2. But if we already have established "object-typed" probability distributions $P_{\mathrm{car}}$ and $P_{\mathrm{apple}}$, then hypothesizing that the observations are generated by an arbitrary mixture of these distributions allows us to "wash out" any information that doesn't actually correspond to some robustly shared features of cars-or-apples.
    3. That is: consider if $D$ is 99% cars, 1% apples. Then an approximately correct natural latent over it is basically just $\Lambda_{\mathrm{car}}$, maybe with some additional noise from apples thrown in. This is what we'd get if we used the "naive" procedure in (1) above. But if we're allowed to mix up the distributions, then "ramping" up the "apple" distribution (defining $\alpha$ such that $P_{\mathrm{apple}}$ gets nearly all the weight, say) would end up with low probabilities assigned to all observations corresponding to cars, and now the approximately correct natural latent over this dataset would have more apple-like qualities. The demand for the latent to be valid on arbitrary $\alpha$ then "washes out" all traces of car-ness and apple-ness, leaving only redness.
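
To put the key pieces of the above in one place (in my notation, which may not exactly match the OP's): the observations are modeled as drawn from the mixture, and the latent is required to mediate between them for every mixture weight at once,

$$P_\alpha(X) \;=\; \alpha\, P_{\mathrm{car}}(X) + (1-\alpha)\, P_{\mathrm{apple}}(X), \qquad P_\alpha(X^{(1)}, \dots, X^{(m)} \mid \Lambda) \;\approx\; \prod_j P_\alpha(X^{(j)} \mid \Lambda) \quad \text{for all allowed } \alpha.$$

Requiring this for arbitrary $\alpha$, rather than for one fixed dataset composition, is what forces $\Lambda$ to keep only the information shared across both distributions – e.g. redness – and to drop anything specific to cars or to apples.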

Is this about right? I'm getting a vague sense of some disconnect between this formulation and the OP...
