Lucius Bushnaq

AI notkilleveryoneism researcher at Apollo, focused on interpretability.



If actually enforcing the charter leads to them being immediately disempowered, it's not worth anything in the first place. We were already in the “worst case scenario”. Better to be honest about it. Then at least, the rest of the organisation doesn't get to keep pointing to the charter and the board as approving their actions when they don't.

The charter that it is the board's duty to enforce doesn't say anything about how the rest of the document doesn't count if investors and employees make dire enough threats, I'm pretty sure.

IIRC this is probably the case for a broad range of non-NN models. I think the original Double Descent paper showed it for random Fourier features.

My current guess is that NN architectures are just especially affected by this, due to having even more degenerate behavioral manifolds, ranging very widely from tiny to large RLCTs.

I am not a fan of the current state of the universe. Mostly the part where people keep dying and hurting all the time. Humans I know, humans I don't know, other animals that might or might not have qualia, possibly aliens in distant places and Everett branches. It's all quite the mood killer for me, to put it mildly. 

So if we pull off not dying and turning Earth into the nucleus of an expanding zero utility stable state, superhuman AI seems great to me.

However, there are mostly no such constraints in ANN training (by default), so it doesn't seem destined to me that LLM behaviour should "compress" very much

The point of the Singular Learning Theory digression was to help make legible why I think this is importantly false. NN training has a strong simplicity bias, basically regardless of the optimizer used for training, and even in the absence of any explicit regularisation. This bias towards compression is a result of the particular degenerate structure of NN loss landscapes, which are in turn a result of the NN architectures themselves. Simpler solutions in these loss landscapes have a lower "learning coefficient," which you might conceptualize as an "effective" parameter count, meaning they occupy more (or higher dimensional, in the idealized case) volume in the loss landscape than more complicated solutions with higher learning coefficients.
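For readers who want the formal statement behind "occupy more volume", here is a rough sketch of the standard Singular Learning Theory results. The notation is Watanabe's and is not taken from the comment above: $K(w)$ is the excess loss relative to the best solution $w_0$, $\varphi$ is a prior over parameters, $\lambda$ is the learning coefficient (RLCT), $m$ is its multiplicity, $d$ is the raw parameter count, and $n$ is the number of data points.

```latex
% Volume of the near-optimal region as the loss tolerance \epsilon shrinks:
V(\epsilon) \;=\; \int_{K(w) < \epsilon} \varphi(w)\, dw
\;\sim\; c\, \epsilon^{\lambda} \left(-\log \epsilon\right)^{m-1}
\qquad (\epsilon \to 0)

% Bayesian free energy, where \lambda plays the role of an effective parameter count:
F_n \;=\; n L_n(w_0) \;+\; \lambda \log n \;-\; (m-1) \log\log n \;+\; O_p(1)

% Regular (non-degenerate) models recover the usual BIC penalty:
\lambda \;=\; \tfrac{d}{2}, \qquad m = 1
```

Since $\epsilon < 1$, a smaller $\lambda$ means $\epsilon^{\lambda}$ shrinks more slowly, so solutions with a lower learning coefficient keep more volume at any given loss tolerance.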

This bias in the loss landscapes isn't quite about simplicity alone. It might perhaps be thought of as a particular mix of a simplicity prior, and a peculiar kind of speed prior. 

That is why Deep Learning works in the first place. That is why NN training can readily yield solutions that generalize far past the training data, even when you have substantially more parameters than data points to fit on. That is why, with a bit of fiddling around, training a transformer can get you a language model, whereas training a giant polynomial on predicting internet text will not get you a program that can talk. SGD or no SGD, momentum or no momentum, weight regularisation or no weight regularisation. Because polynomial loss landscapes do not look like NN loss landscapes.

I have now seen this post cited in other spaces, so I am taking the time to go back and write out why I do not think it holds water.

I do not find the argument against the applicability of the Complete Class theorem convincing.

See Charlie Steiner's comment: 

You just have to separate "how the agent internally represents its preferences" from "what it looks like the agent is doing." You describe an agent that dodges the money-pump by simply acting consistently with past choices. Internally this agent has an incomplete representation of preferences, plus a memory. But externally it looks like this agent is acting like it assigns equal value to whatever indifferent things it thought of choosing between first.

Decision theory is concerned with external behaviour, not internal representations. All of these theorems are talking about whether the agent's actions can be consistently described as maximising a utility function. They are not concerned whatsoever with how the agent actually mechanically represents and thinks about its preferences and actions on the inside. To decision theory, agents are black boxes. Information goes in, decision comes out. Whatever processes may go on in between are beyond the scope of what the theorems are trying to talk about.


Money-pump arguments for Completeness (understood as the claim that sufficiently-advanced artificial agents will have complete preferences) assume that such agents will not act in accordance with policies like ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ But that assumption is doubtful. Agents with incomplete preferences have good reasons to act in accordance with this kind of policy: (1) it never requires them to change or act against their preferences, and (2) it makes them immune to all possible money-pumps for Completeness. 

As far as decision theory is concerned, this is a complete set of preferences. Whether the agent makes up its mind as it goes along or has everything it wants written up in a database ahead of time matters not a peep to decision theory. The only thing that matters is whether the agent's resulting behaviour can be coherently described as maximising a utility function. If it quacks like a duck, it's a duck.
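To make the black-box point concrete, here is a minimal toy sketch, not taken from the post or the comments. The option names `A`, `A-`, `B`, the class `IncompleteAgent`, and the rule that the agent always accepts a trade between incomparable options are all illustrative assumptions. Internally the agent has an incomplete strict preference relation plus a memory of turned-down options. From the outside, its choices are consistent with an ordinary utility function.

```python
class IncompleteAgent:
    """Toy agent: a strict partial order over options, plus optional memory."""

    def __init__(self, strictly_prefers, use_memory=True):
        # Set of (better, worse) pairs; any pair not listed is incomparable.
        self.prefers = set(strictly_prefers)
        self.use_memory = use_memory
        self.turned_down = []  # options the agent has given up or refused
        self.choices = []      # (chosen, rejected) pairs, as seen from outside

    def offer(self, current, offered):
        # Never trade down to something strictly worse than what we hold.
        worse_than_current = (current, offered) in self.prefers
        # The quoted policy: never pick an option strictly dispreferred
        # to something previously turned down.
        worse_than_past = self.use_memory and any(
            (x, offered) in self.prefers for x in self.turned_down
        )
        if worse_than_current or worse_than_past:
            chosen, rejected = current, offered
        else:
            # Assumption: trades between incomparable options are accepted.
            chosen, rejected = offered, current
        self.turned_down.append(rejected)
        self.choices.append((chosen, rejected))
        return chosen


def run_pump(agent):
    """Classic pump attempt: start at A, offer B, then offer A-."""
    holding = "A"
    for offered in ["B", "A-"]:
        holding = agent.offer(holding, offered)
    return holding


PREFS = {("A", "A-")}  # A beats A-; B is incomparable to both

naive = IncompleteAgent(PREFS, use_memory=False)
careful = IncompleteAgent(PREFS, use_memory=True)
naive_end = run_pump(naive)      # ends holding "A-": pumped
careful_end = run_pump(careful)  # ends holding "B": not pumped

# An outside observer can fit a utility function to the careful agent's choices.
utility = {"A": 1, "B": 1, "A-": 0}
consistent = all(utility[c] >= utility[r] for c, r in careful.choices)
print(naive_end, careful_end, consistent)
```

The fitted `utility` treats `A` and `B` as equally good. That is the sense in which the memory-using agent behaves, externally, like one with complete preferences.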

... No?

I don't see what part of the graphs would lead to that conclusion. As the paper says, there's a memorization, circuit formation and cleanup phase. Everywhere along these lines in the three phases, the network is building up or removing pieces of internal circuitry. Every time an elementary piece of circuitry is added or removed, that corresponds to moving into a different basin (convex subset?). 

Points in the same basin are related by internal symmetries. They correspond to the same algorithm, not just in the sense of having the same input-output behavior on the training data (all points on the zero loss manifold have that in common), but also in sharing common intermediate representations. If one solution has a piece of circuitry another doesn't, they can't be part of the same basin. Because you can't transform them into each other through internal symmetries.
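As a small illustration of what an "internal symmetry" can look like, here is a sketch using the rescaling symmetry of ReLU units. This particular symmetry and the tiny one-hidden-layer network are illustrative choices, not something the comment specifies. Multiplying a hidden unit's incoming weight and bias by a positive factor and dividing its outgoing weight by the same factor moves the parameters without changing the function or the (rescaled) hidden representations.

```python
import random


def relu(x):
    return max(0.0, x)


def mlp(x, w_in, b, w_out):
    """One-hidden-layer ReLU network with scalar input and scalar output."""
    return sum(wo * relu(wi * x + bi) for wi, bi, wo in zip(w_in, b, w_out))


def rescale(w_in, b, w_out, scales):
    """Scale each unit's incoming weight and bias by a > 0, outgoing by 1/a."""
    new_in = [wi * a for wi, a in zip(w_in, scales)]
    new_b = [bi * a for bi, a in zip(b, scales)]
    new_out = [wo / a for wo, a in zip(w_out, scales)]
    return new_in, new_b, new_out


random.seed(0)
hidden = 5
params = tuple(
    [random.uniform(-1, 1) for _ in range(hidden)] for _ in range(3)
)  # (w_in, b, w_out)
scales = [random.uniform(0.5, 2.0) for _ in range(hidden)]
moved = rescale(*params, scales)

xs = [random.uniform(-3, 3) for _ in range(100)]
max_diff = max(abs(mlp(x, *params) - mlp(x, *moved)) for x in xs)
print(max_diff)  # tiny: only floating-point rounding error
```

The two parameter settings differ, yet they compute the same function through the same circuitry. Adding or deleting a hidden unit that actually does something is not reachable by this kind of transformation.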

So the network is moving through different basins all along those graphs.

I don't see how the mechanistic interpretability of grokking analysis is evidence against this.

At the start of training, the modular addition network is quickly evolving to get increasingly better training loss by overfitting on the training data. Every time it gets an answer in the training set right that it didn't before, it has to have moved from one behavioural manifold in the loss landscape to another. It's evolved a new tiny piece of circuitry, making it no longer the same algorithm it was a couple of batch updates ago.

Eventually, it reaches the zero loss manifold. This is a mostly fully connected subset of parameter space. I currently like to visualise it like a canyon landscape, though in truth it is much more high dimensional. It is made of many basins, some broad (high dimensional), some narrow (low dimensional), connected by paths, some straight, some winding. 

A path through the loss landscape visible in 3D doesn't correspond to how and what the neural network is actually learning. Almost all of the changes to the loss are due to the increasingly good implementation of Algorithm 1; but apparently, the entire time, the gradient also points towards some faraway implementation of Algorithm 2.

In the broad basin picture, there aren't just two algorithms here, but many. Every time the neural network constructs a new internal elementary piece of circuitry, that corresponds to moving from one basin in this canyon landscape to another. Between the point where the loss flatlines and the point where grokking happens, the network is moving through dozens of different basins or more. Eventually, it arrives at the largest, most high dimensional basin in the landscape, and there it stays.

the entire time the neural network's parameters visibly move down the wider basin

I think this might be the source of confusion here. Until grokking finishes, the network isn't even in that basin yet. You can't be in multiple basins simultaneously.

At the time the network is learning the pieces of what you refer to as algorithm 2, it is not yet in the basin of algorithm 2. Likewise, if you went into the finished network sitting in the basin of algorithm 2 and added some additional internal piece of circuitry into it by changing the parameters, that would take it out of the basin of algorithm 2 and into a different, narrower one. Because it's not the same algorithm any more. It's got a higher effective parameter count now, a bigger Real Log Canonical Threshold.

Points in the same basin correspond to the same algorithm. But it really does have to be the same algorithm. The definition is quite strict here. What you refer to as superpositions of algorithm 1 and algorithm 2 are all various different basins in parameter space. Basins are regions where every point maps to the same algorithm, and all of those superpositions are different algorithms. 

I also have this impression, except it seems to me that it's been like this for several months at least. 

The Open Philanthropy people I asked at EAG said they think the bottleneck is that they currently don't have enough qualified AI Safety grantmakers to hand out money fast enough. And right now, the bulk of almost everyone's funding seems to ultimately come from Open Philanthropy, directly or indirectly.

You can easily get a draw against any AI in the world at Tic-Tac-Toe. In fact, provided the game actually stays confined to the actions on the board, you can draw AIXI at Tic-Tac-Toe. That's because Tic-Tac-Toe is a very small game with very few states and very few possible actions, and so intelligence, the ability to pick good actions, doesn't grant any further advantage in it past a certain pretty low threshold. 
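The Tic-Tac-Toe claim is easy to check by brute force. Below is a minimal sketch of an exhaustive minimax search. The board encoding (a nine-character string with `.` for empty squares) and the function names are illustrative choices. It computes the value of the game under perfect play by both sides.

```python
from functools import lru_cache

LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def value(board, player):
    """Value for X under perfect play: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if "." not in board:
        return 0
    nxt = "O" if player == "X" else "X"
    results = [
        value(board[:i] + player + board[i + 1:], nxt)
        for i, cell in enumerate(board)
        if cell == "."
    ]
    # X maximises the value, O minimises it.
    return max(results) if player == "X" else min(results)


EMPTY = "." * 9
game_value = value(EMPTY, "X")
print(game_value)  # 0: perfect play by both sides is a draw
```

Because the full game tree is small enough to search exhaustively, no amount of extra intelligence on the opponent's side can push the result below a draw for a player who follows the minimax policy.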

Chess has more actions and more states, so intelligence matters more. But probably still not all that much compared to the vastness of the state and action space the physical universe has. If there's some intelligence threshold past which minds pretty much always draw against each other in chess even if there is a giant intelligence gap between them, I wouldn't be that surprised. Though I don't have much knowledge of the game.

In the game of Real Life, I very much expect that "human level" is more the equivalent of a four year old kid who is currently playing their third ever game of chess, and still keeps forgetting half the rules every minute. The state and action space is vast, and we get to observe humans navigating it poorly on a daily basis. Though usually only with the benefit of hindsight. In many domains, vast resource mismatches between humans do not outweigh skill gaps between humans. The Chinese government has far more money than OpenAI, but cannot currently beat OpenAI at making powerful language models. All the usual comparisons between humans and other animals also apply. This vast difference in achieved outcomes from small intelligence gaps even in the face of large resource gaps does not seem to me to be indicative of us being anywhere close to the intelligence saturation threshold of the Real Life game.

There is no one theory of time in physics.

There are many popular hypotheses with all kinds of different implications related to time in some way, but those aren't part of standard textbook physics. They're proposed extensions of our current models. I'm talking about plain old general relativity+Standard Model QFT here. Spacetime is a four-dimensional manifold, fields in the SM Lagrangian have support on that manifold, all of those fields have CPT symmetry. Don't go asking for quantum gravity or other matters related to UV-completion.[1]

All that gives you is an asymmetry, a distinction between the past and future, within a static block universe. It doesn't get you away from stasis to give you a dynamic "moving cursor" kind of present moment.

Combined with locality, the rule that things in spacetime can only affect things immediately adjacent to them, yeah, it does. Computations can only act on bits that are next to them in spacetime. To act on bits that are not adjacent, "channels" in spacetime have to connect those bits to the computation, carrying the information. So processing bits far removed from t at t is usually hard, due to thermodynamics, and takes place by proxy, using inference on bits near t that have mutual information with the past or future bits of interest. Thus computations at t effectively operate primarily on information near t, with everything else grasped from that local information. From the perspective of such a computation, that's a "moving cursor".

(I'd note though that asymmetry due to thermodynamics on its own could presumably already serve fine for distinguishing a "present", even if there was no locality. In that case, the "cursor" would be a boundary to one side of which the computation loses a lot of its ability to act on bits. From the inside perspective, computations at t would be distinguishable from computations at earlier and later times in such a universe, by what algorithms are used to calculate on specific bits, with algorithms that act on bits "after" t being more expensive at t. I don't think self-aware algorithms in that world would have quite the same experience of "present" we do, but I'd guess they would have some "cursor-y" concept/sensation.

I'm not sure how hard constructing a universe without even approximate locality, but with thermodynamics-like behaviour and the possibility of Turing-complete computation would be though. Not sure if it is actually a coherent set-up. Maybe coupling to non-local points that hard just inevitably makes everything max-entropic everywhere and always.)

  1. ^

    I mean, do ask, by all means, but the answer probably won't be relevant for this discussion, because you can get planet earth and the human brains on it thinking and perceiving a present moment from a plain old SM lattice QFT simulation. Everyone in that simulation quickly dies because the planet has no gravity and spins itself apart, but they sure are experiencing a present until then.[2]

  2. ^

    Except there also might not be a Born rule in the simulation, but let's also ignore that, and just say we read off what's happening in the high amplitude parts of the simulated earth wave-function without caring that the amplitude is pretty much a superfluous pre-factor that doesn't do anything in the computation.
