The world would have the wisest minds cower in fear, and the fair of heart despair. Human nature, once reflected upon and made to cohere, is revealed as near unbearable to share. Even so, it seems plain from here -- the latent harmony makes it clear! -- we tend to converge when we start to care.

Wiki Contributions


I think this is brilliant as a direction to think in, but I'm object-level skeptical. I could be missing important details.

Summary of what I think I understand

  1. A superintelligent AI is built[1] to optimise for .
  2. That function effectively tells the AI to figure out: "If you extrapolate from the assumption that uniquely-identifiable- was actually  (ceteris paribus), what would uniquely-identifiable- have been?" And then take its own best guess about that as its objective function.
  3. In practice, that may look something like this: The AI starts looking for referents of  and , and learns that the only knowable instances are in the past. Assuming it does counterfactual reasoning somewhat like humans, it will then try to reconstruct/simulate the situation in as much heuristically relevant detail as possible. Finally, it runs the experiment forwards from the earliest possible time the counterfactual assumption can be inserted (i.e. when Cindy ran the program that produced ).
    1. (Depending on how it's built, the AI has already acquired a probability distribution over what its utility function could be, and this includes some expectancy over Cindy's values. Therefore, it plausibly tries to minimise side-effects.)
  4. At some point in this process, it notices that the contents of  are extremely sensitive to what went on in Cindy's brain at time T. So brainscanning her is obvious for accuracy and repeatability.
  5. In the simulation, Cindy sees with joy that  is  so she gets to work. Not willing to risk delaying more than 24-ish hours (for reasons), she finally writes into .
  6. As long as that is the AI's best guess, it is now motivated to repeat the experiment with the new message. This allows successive Cindys to work[2] on the problem until one of them declares success and writes a function plainly into .

Implementation details

  • The AI might be uncertain about how precise its simulations are, in which case it could want to run a series of these experiments with varying seeds before adopting whatever function the plurality converges to. The uncertainty compounds, however, so simulation-batches which output answers in fewer iterations (for whatever reason) will weigh more.
  • I'm not sure  will be interpreted as transitive between simulations by default. I think it depends on preferences regarding degrees of freedom in the logic used to interpret , if both the inner and outer function depend on mutually exclusive counterfactuals over the same state of reality (or variable). Each step is locally coherent, but you could get stuck in a repeating loop.
  • We can't take for granted that arbitrary superintelligences will have human heuristics for what counts as "correct" counterfactual reasoning. It seems shaky to rely on it. (I notice you discuss this in the comments.)

Why I don't think it works

  • It does not seem to do anything to inner alignment afaict, and it seems too demanding and leaky to solve outer alignment.
  • I don't see how to feasibly translate QACI() into actual code that causes an AI to use it as a target for all its goal-seeking abilities.
  • Even if it were made into a loss function, you can't train a transformer on it without relevant training data.
  • If the plan is to first train a transformer-ish AI on normal data and only later swap its objective function (assuming it were made into one), then the network will already have encoded proxies for its old function, and its influence will (optimistically) see long-tailed exponential decay with training time.
  • If instead the plan is to first train an instruction-executing language model with sufficient understanding of human-symbolic math or English, this seems risky for traditional reasons but might be the most feasible way to try to implement it. I think this direction is worth exploring.
  • A mathematically precise (though I disagree the term is applicable here) objective function doesn't matter when you have a neural net trying its best to translate it into imprecise proxies which actually work in its environment.
  1. ^

    I recommend going all-in on building one. I suspect this is the bottleneck, and going full speed at your current course is often the best way to discover that you need to find a better course--or, indeed, win. Uncertainty does not imply slowing down.

  2. ^

    This ends up looking like a sort of serial processor constrained by a sharp information bottleneck between iterations.

I don't buy the anthropic interpretation for the same reason I don't buy quantum immortality or grabby aliens, so I'm still weakly leaning towards thinking that decoherence matters. Weirdly I haven't seen this dilemma discussed before, and I've not brought it up because I think it's ifonharazdous--for the same reasons you point out in the post. I also tried to think of ways to exploit this for moral gain two years ago! So I'm happy to see I'm not the only one (e.g., you mention entropy control).

I was going to ask a question, but I went looking instead. Here's my new take:

After some thinking, I now have a very shoddy informal mental model in my head, and I'm calling it the "Quantum Shuffle Theory" ⋆★°. If (de)coherence is based on a local measure of similarity between states, then you'd likely see spontaneous recoherence in regions that line up. It could reach equilibrium if the rate of recoherence is a function of decoherence (at some saturation point for decohered configurations).

Additionally, the model (as it works in my head at least) would predict that reality fluid isn't both ergodic and infinite, because we'd observe maxed out effects of interference literally everywhere all the time (although maybe that's what we observe, idk). Or, at least, it puts some limits on how the multiverse could be infinite.

Some cherry-picked support I found for this idea:

"Next, we show that the evolution of the pure-basis states reveals an interesting phenomenon as the system, after decoherence, evolves toward the equilibrium: the spontaneous recoherence of quantum states. ... This phenomenon reveals that the reservoir only shuffle the original information carried out by the initial state of the system instead of erasing it. ... Therefore, spontaneous recoherence is not a property associated only with coherent-state superpositions." (2010)

Also, Bose-Einstein condensates provide some evidence for the shuffling interpretation, but my qm is weak enough that idk whether bog standard MWI predicts it too.

"a Bose–Einstein condensate (BEC) is a state of matter that is typically formed when a gas of bosons at very low densities is cooled to temperatures very close to absolute zero (−273.15 °C or −459.67 °F). Under such conditions, a large fraction of bosons occupy the lowest quantum state, at which point microscopic quantum mechanical phenomena, particularly wavefunction interference, become apparent macroscopically."

fwiw I think stealing money from mostly-rich-people in order to donate it isn't obviously crazy. Decouple this claim from anything FTX did in particular, since I know next to nothing about the details of what happened there. From my perspective, it could be they were definite villains or super-ethical risk-takers (low prior).

Thought I'd say because I definitely feel reluctance to say so. I don't like this feeling, and it seems like good anti-bandwagon policy to say a thing when one feels even slight social pressure to shut up.

Thanks! ChatGPT was unable to answer my questions, so I resorted to google, and to my surprise found a really high-quality LW post on the issue. All roads lead to LessRome it seems.

It is a great irony that the introductory post has 10 comments. This crowd has wisely tried to protect their minds against the viruses of worship, but perhaps we could be a little less scared of simple acts of gratitude.

Thank you. That's all.

I don't know that much about the field or what top researchers are thinking about, so I know I'm naive about most of my independent models. But I think it's good for my research trajectory to act on my inside views anyway. And to talk about them with people who may sooner show me how naive I am. :)

Tbc, my understanding of FF is "I watched him explain it on YT". My scary-feeling is just based on feeling like it could get close to mimicking what the brain does during sleep, and that plays a big part of autonomous learning. Sleeping is not just about cycles of encoding and consolidation, it's also about mysterious tricks for internally reorganising and generalising knowledge. And/or maybe it's about confabulating sensory input as adversarial training data for learning to discern between real and imagined input. Either way, I expect there to be untapped potential for ANN innovation at the bottom, and "sleep" is part of it.

One the other hand, if they don't end up cracking the algorithms behind sleep and the like, this could be good wrt safety, given that I'm tentatively pessimistic about the potential of the leading paradigm to generalise far and learn to be "deeply" coherent.

My point is that we couldn't tell if it were genius. If it's incredibly smart in domains we don't understand or care about, it wouldn't be recognisably genius.

Thanks for link! Doing factor analysis is a step above just eyeballing it, but even that's anthropomorphic if the factors are derived from performance on very human tasks. The more objective (but fuzzy) notion of intelligence I have in mind is something about efficiently halving some mathematical term for "weighted size of search space".

Oh, and also... This post and the comment thread is full of ideas that people can use to fuel their interest in novel capabilities research. Seems risky. Quinton's points about DNA and evolution can be extrapolated to the hypothesis that "information bottlenecks" could be a cost-effective way of increasing the rate at which networks generalise, and that may or may not be something we want. (This is a known thing, however, so it's not the riskiest thing to say.)

Load More