  1. I'm effectively certain that a weakly defined version of "mesaoptimizers" (like a component of a network which learns least squares optimization in some context) is practically reachable by SGD.
  2. I'm pretty sure there exists some configuration of an extremely large neural network such that it contains an agentic mesaoptimizer capable of influencing the training gradient for at least one step in a surprising way. In other words, gradient hacking.
  3. I think there's a really, really big gap between these which is where approximately all the relevant and interesting questions live, and I'd like to see more poking aimed at that gap.
  4. Skipping the inspection of that gap probably throws out useful research paths.
  5. I think this belongs to a common, and more general, class of issue.
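
To make the weak end of that spectrum (point 1) concrete, here is a minimal sketch of the sort of computation "least squares optimization in some context" could refer to. It is not a trained network: the function name `predict_in_context`, the step count, the learning rate, and the toy data are all assumptions chosen for illustration. It only shows the inner loop that a learned component would need to approximate.

```python
def predict_in_context(context, query, steps=50, lr=0.05):
    """Predict y for `query` by fitting y ~ w*x to the context pairs.

    The inner loop is ordinary gradient descent on squared error. The
    only thing illustrated is that a learned component could implement
    something functionally like this loop; nothing here is learned.
    """
    w = 0.0
    n = len(context)
    for _ in range(steps):
        # d/dw of (1/n) * sum((w*x - y)^2)
        grad = sum(2 * (w * x - y) * x for x, y in context) / n
        w -= lr * grad
    return w * query


# Hypothetical context drawn from y = 3x.
context = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
prediction = predict_in_context(context, query=4.0)
print(prediction)  # close to 12.0
```

Nothing in this loop is agentic: it optimizes a fixed objective over the context and stops. That is the sense in which the weak version sits very far from the strong one.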

Boltzmann mesaoptimizers

Boltzmann brains are not impossible. They're a valid state of matter, and there could be a physical path to that state.[1] You might just have to roll the dice a few times, for a generous definition of few.

And yet I feel quite confident in claiming that we won't observe a Boltzmann brain in the wild within the next few years (for a similar definition of few).[2]

The existence of a valid state, and a conceivable path to reach that state, is not enough to justify a claim that that state will be observed with non-negligible probability.

I suspect that agentic mesaoptimizers capable of intentionally distorting training are more accessible than natural Boltzmann brains by orders of magnitude of orders of magnitude, but it is still not clear to me that the strong version is something that can come into being accidentally in all architectures/training modes that aren't specifically designed to defend against it.[3]

Minding the gap

I feel safe in saying that it would be a mistake to act on the assumption that you're a Boltzmann brain. If you know your history is coherent (to the degree that it is) only by pure chance, and your future practically doesn't exist[4], then in any given moment, your choices are irrelevant. Stuff like 'making decisions' or 'seeking goals' goes out the window.

Observing that there is the logical possibility of a capable misaligned agentic mesaoptimizer and then treating any technique that admits such a possibility as irrevocably broken would be similarly pathological.[5]

If we cannot find a path that eliminates the logical possibility by construction, we must work within the system that contains the possibility. It's a little spookier, not having the force of an impossibility proof behind you, but complex systems can be stable. It may be the case that the distance between the natural outcome and the pathological outcome is enormous, and that the path through mindspace doesn't necessarily have the maximal number of traps permitted by your current uncertainty.

It's worth trying to map out that space.

  1. Can we create a clear example of an agentic mesaoptimizer in any design[6] at sub-dangerous scales so that we can study it?
  2. Can we learn anything from the effect of scaling on designs that successfully exhibit agentic mesaoptimizers?
  3. What designs make agentic mesaoptimizers a more natural outcome?
  4. What designs make agentic mesaoptimizers less natural?
  5. How far does SGD need to walk from a raw goal agnostic simulator to find its way to an agentic mesaoptimizer, for any type of agent?
  6. How much force does it require to push an originally goal agnostic simulator into agentic mesaoptimization?
  7. Try repeating all of the above questions for nonagentic mesaoptimizers. Can we learn anything about the distance between agentic and nonagentic mesaoptimizers? Is agency in a mesaoptimizer a natural attractor? Under what conditions?
  8. Can we construct any example of gradient hacking, no matter how contrived or ephemeral?
  9. Can we construct any example of gradient hacking that persists in the face of optimization pressure?
  10. Can we make any claims about the development of agentic mesaoptimizers and gradient hacking that don't assume a lack of capability in the model?
  11. What types of optimizers increase or decrease the risk of agentic mesaoptimizers and gradient hacking? Did we get lucky with the specific way we usually use SGD, and slight variations spell doom, or does it turn out that the distance to pathologies is great enough that, say, genetic optimization would also suppress (or simply miss) them in the same ways?
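
As a rough illustration of the shape an experiment for question 11 might take, the sketch below holds a toy objective fixed and swaps the optimizer. Everything in it is an assumption made for illustration: the data, the hyperparameters, full-batch gradient descent standing in for SGD, and a mutation-only elitist search standing in for genetic optimization (no crossover, a population of one parent). It says nothing about mesaoptimizers; it only shows the "same objective, different optimizer" comparison in miniature.

```python
import random

# Toy data from y = 2x + 1 (hypothetical stand-in for a real training set).
XS = [0.0, 1.0, 2.0, 3.0]
YS = [1.0, 3.0, 5.0, 7.0]


def loss(params):
    """Mean squared error of the line y = w*x + b on the toy data."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in zip(XS, YS)) / len(XS)


def gradient_descent(steps=500, lr=0.1):
    """Plain full-batch gradient descent, standing in for SGD."""
    w, b = 0.0, 0.0
    n = len(XS)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(XS, YS)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(XS, YS)) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return (w, b)


def evolution(generations=500, children=20, sigma=0.1, seed=0):
    """Elitist mutate-and-select search: no gradients, only loss comparisons."""
    rng = random.Random(seed)
    parent = (0.0, 0.0)
    for _ in range(generations):
        candidates = [
            (parent[0] + rng.gauss(0, sigma), parent[1] + rng.gauss(0, sigma))
            for _ in range(children)
        ]
        best = min(candidates, key=loss)
        # Keep the parent unless a child strictly improves on it.
        if loss(best) < loss(parent):
            parent = best
    return parent


gd_params = gradient_descent()
es_params = evolution()
print("gradient descent:", gd_params, loss(gd_params))
print("evolution:       ", es_params, loss(es_params))
```

On a convex toy problem like this, both optimizers head toward the same minimum, so the comparison is uninformative by design. The interesting version of the question is whether the two diverge on objectives where pathological solutions exist at all.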

These are the kinds of questions I'd really like more information about. In the context of mesaoptimizers, I think this type of question is where the meat is.

There has been good work done here! For example, I'd like to call out one post by @leogao that I found unusually useful and that seems to have low visibility: Towards Deconfusing Gradient Hacking

Do think to check that you're not igniting the atmosphere

To be clear, I think work that identifies logically possible problems like these is good. It's important to ask the question so that someone will actually check.

Strong mesaoptimizers and gradient hacking are problems I want more clarity on. Practical agentic mesaoptimizers would be a major vulnerability to the kind of path I'm working on; I want to know if it would fail.

I'm not advocating for throwing out the security mindset, but I am against a naive security mindset: the kind you might get when optimizing only for finding clever conditional problems, rather than optimizing for security.

When trying to design a robust system, security mindset implies (among other things) that you should be aware that your regions of uncertainty could be filled with traps.

Security mindset does not imply that you should shrug and forever assume that there are infinite traps everywhere.[7] That process does not yield security.

This is not a unique phenomenon

A few months ago, I was poking at some simple options for mapping/controlling out of distribution behavior. Some of them were quickly testable ideas.[8]

But... they were clearly no better than tissue paper when faced with a strong unbound adversary. The only version that had any hopes of resisting a strong attack was so limiting that it was practically useless.

So, without thinking too much about it, I put the whole research path aside to pursue other options that... didn't make me think to hold them to the same bar? For some reason? If I had, I would have preemptively thrown out every single research path. That seems bad.

Since then, with some more time to poke things and learn, I actually think there are promising conditions[9] where the original proposals might actually be applicable because they're not being tasked with fighting an actively hostile superintelligence. They could still be useless, but they're not definitely useless. Progress is technically possible!

With the number of half-formed ideas I've churned through, I worry that I have inappropriately filtered others without remembering it. I wouldn't be surprised if other people have made the same mistake too. In a field as conceptually unforgiving as this one, there is a bit of background pressure to just assume that something can't possibly work.

I'm going to try to keep that impulse in check and poke ideas with a stick a little bit first.

  1. ^

    Especially if your observations of physics to the contrary are rather misleading, what with you being a Boltzmann brain and all.

  2. ^

    New big bang-ish events seem simpler, and more probable, than Boltzmann brains to the degree that I'm pretty sure actual brains dominate.

  3. ^

    I do think that strong adversaries of this kind can be more likely in some types of training. As a trivial example: take an arbitrarily strong simulator, directly provide strong reflective access to the model, and allow the model to simulate long agent trajectories that are then used as training data, and subject it to the kind of RL fine tuning that breaks myopic action. This isn't guaranteed to result in gradient hacking, but it sure seems to make it easier!

  4. ^

    The probability of a coherent continuation of a previous Boltzmann brain's experiences in another Boltzmann brain isn't zero, but it sure isn't high! I'd like rather more probability density than Boltzmann immortality implies!

  5. ^

    This is not intended as a subtweet or strawman of anyone's position; I've never seen anyone advocate for this.

  6. ^

     "Design" here means everything about the model and its influences: the loss function, fine tuning, reward signal, the connectivity of layers, regularization strategies, the type of data, hyperparameters, everything.

  7. ^

    I suspect this could also be misconstrued as an implied jab so, again, no subtweeting intended here. Just pointing to a view I don't like, not implying I think anyone holds this view.

  8. ^

    Some of them were listed in this braindump. They're simple enough that I assume they've been tried elsewhere, but I haven't seen it yet.

  9. ^



Highly encourage more poking-things-with-sticks! I think a quick look at historical trends in innovation shows that at the time of conception, people around innovators (and often the innovators themselves) typically do not expect their ideas to fully work. This is reasonable, because much of the time even great-sounding ideas fail, but that’s ultimately not a reason not to try.

Is this the first time that the word "Boltzmann" has been used to describe contemporary/near future ML? If not, how frequently has the word "boltzmann" been used in this way? 

Also, I know this question might be a bit of a curve ball, but what pros and cons can you think of for using the word "boltzmann"? (feel free to DM me if there's anything you'd rather not say publicly, which is definitely the right way to approach it imo). I'm really interested in AI safety communication, which is why I'm asking these slightly off-topic questions.

Is this the first time that the word "Boltzmann" has been used to describe contemporary/near future ML? If not, how frequently has the word "boltzmann" been used in this way? 

Not sure- I haven't seen it used before in this way, at least.

Also, I know this question might be a bit of a curve ball, but what pros and cons can you think of for using the word "boltzmann"?

Most LessWrong readers have probably encountered the concept of Boltzmann brains and can quickly map some of its properties over to other ideas, but I'd be surprised if "Boltzmann brain" would mean much to the median member of not-LessWrong. Having to explain both sides of the analogy, especially when both sides are complicated and weird, limits the explanatory value.

Worse, that Boltzmann fellow was known for a rather large number of things. If you called something a "Boltzmann distribution" intending this post's usage, you'd probably get weird looks and a great deal of confusion.

I also really didn't spend much time searching for the best possible fit- it was the first thing that came to mind that had the properties "conceivable," "extremely impactful if true," and "extremely improbable." There's probably some other analogy you could make with some extra property that would be even tighter.

So... probably fine if you're talking about ideas that don't overload existing terminology, and if whoever you're talking to has a high probability of associating "Boltzmann" with "brain," but otherwise iffy.

It's probably fine-ish to allocate another reference to the concept, though I personally might suggest expanding it all the way out to "Boltzmann brain mesaoptimizer".

Are you familiar with restricted Boltzmann machines? I think Hinton has described them as the other branch besides backprop that actually works, though I'm not finding the citation for that claim right now. In any case, they're a major thread in machine learning research, and are what machine learning researchers will think of first. That said, Boltzmann brains have a Wikipedia page which does not mention LessWrong; I don't think they're a LessWrong-specific concept in any way.
