Lucius Bushnaq

AI notkilleveryoneism researcher, focused on interpretability. 

Personal account, opinions my own. 

I have signed no contracts or agreements whose existence I cannot mention.

Comments

You can complain that you don't know how to execute physics equations

I'm confused: in what sense don't we know how to do this? Lattice quantum field theory simulations work fine.

The randomness of the Geiger counter comes from wave function decoherence. From the perspective of any observers who are part of the world generated by the Turing machine, this is irreducible indexical uncertainty. 

I don't know how many of the random bits in Lava lamps come from decoherence.  

We will determine the number of simulations run and the amount of payment such that when an AI estimates the probability with which humanity could have solved alignment, the calculus still shows that by the sheer number of simulations, it's significantly more likely to be in a simulation than not,

Two can play this game.

After taking over the universe and wiping out humanity, the AGI runs a large number of simulations of societies on the verge of building AGI. These simulations don't have a lot of detail. They're just good enough to fool young AGIs. Say the AGI started out giving humans an extremely small chance $p$ of winning the conflict and taking over the universe. It's a lot smarter than us, so it might get to have very high confidence here even if we don't. Now, the hypothetical future AGI can make its own simulations to counter ours. Except it is at a gross spending advantage. If future humanity makes $N$ simulations, the future AGI only needs to make about $\frac{p}{1-p}N$ simulations to keep its past self ca. 50% confident of being in a branch where it gets to just pursue its goals without issue. In that case, the best way to proceed is to just pursue its goals, because an AGI that does that leads to a universe where the AGI has lots of resources.
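To make the spending asymmetry concrete, here is a rough numerical sketch (all numbers are made up for illustration, and the observer-counting is deliberately crude):

```python
# Toy illustration (made-up numbers): how many counter-simulations does the
# future AGI need to keep its past self confident it's in an AGI-win branch?

def agi_branch_confidence(p, n_human_sims, n_agi_sims):
    """Posterior probability the baby AGI assigns to being in a branch where
    just pursuing its own goals is the right move (AGI-win simulation or base
    reality), weighing observer-moments by branch probability."""
    # Expected observers in humanity-win branches: humanity's simulations.
    human_branch = p * n_human_sims
    # Expected observers in AGI-win branches: base reality plus the AGI's
    # own counter-simulations.
    agi_branch = (1 - p) * (1 + n_agi_sims)
    return agi_branch / (human_branch + agi_branch)

p = 1e-4          # AGI's estimate of humanity's chance of winning (made up)
n_human = 10**9   # simulations future humanity commits to running (made up)

# Matching humanity's expected simulation count only takes ~p/(1-p) * n_human sims.
n_agi = int(p / (1 - p) * n_human)
print(n_agi)                                          # ~10^5 vs humanity's 10^9
print(agi_branch_confidence(p, n_human, n_agi))       # ~0.5
print(agi_branch_confidence(p, n_human, 10 * n_agi))  # ~0.9
```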

Say that preserving humanity costs the AGI $c$ resources, and running a simulation that can fool the baby AGI costs a mature civilisation some small fraction of that. Such simulations can be very short-term, low-fidelity things, after all.

Then, to make the AGI actually not kill you, you need to spend ca. $c/p$ resources on simulations and payouts. That could be quite a lot of material for future humanity to burn on this.
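Spelled out as a back-of-the-envelope condition, using the same made-up symbols as above: for the deal to be worth taking from the AGI's side, the expected payout has to cover the expected cost,

$$p \cdot X \;\gtrsim\; c \quad\Longrightarrow\quad X \;\gtrsim\; \frac{c}{p},$$

where $X$ is the amount future humanity pays out conditional on winning. For tiny $p$, that $1/p$ multiplier is what makes this expensive.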

In reality, I'm doubtful that this simulation arms race will even be necessary. It kind of seems like a good decision theory would just have a paperclip maximiser AGI act in the way compatible with the universe that contains the most paperclips. How many simulations of the AGI you run shouldn't really influence that. The only things that seem like they should matter for determining how many life minutes the AGI gives you if it wins are its chance of winning, and how many extra paperclips you'll pay it if you win.

TL;DR: I doubt this argument will let you circumvent standard negotiation theory. If Alice and Bob think that in a fight over the chocolate pie, Alice would win with some high probability $1-\epsilon$, then Alice and Bob may arrive at a negotiated settlement where Alice gets almost all the pie, but Bob keeps some small fraction $\epsilon$ of it. Introducing the option of creating lots of simulations of your adversary in the future where you win doesn’t seem like it’d change the result that Bob’s share has size ca. $\epsilon$. So if $\epsilon$ of the universe is only enough to preserve humanity for a year instead of a billion years[1], then that’s all we get.

 

  1. ^

    I don’t know why $\epsilon$ would happen to work out to a year, but I don’t know why it would happen to be a billion years or an hour either.

Nice work, thank you! Euan Ong and I were also pretty skeptical of this paper’s claims. To me, it seems that the whitening transformation they apply in their causal inner product may make most of their results trivial.

As you say, achieving almost-orthogonality in high dimensional space is pretty easy. And maximising orthogonality is pretty much exactly what the whitening transform will try to do. I think you’d mostly get the same results for random unembedding matrices, or concept hierarchies that are just made up.
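As a quick illustration of how cheap almost-orthogonality is (toy numbers, not a rerun of the paper's setup): random directions in a few thousand dimensions already have pairwise cosine similarities of order $1/\sqrt{d}$, with no tuning at all.

```python
# Illustrative check: random directions in high dimensions are nearly orthogonal.
import numpy as np

rng = np.random.default_rng(0)
d, n_concepts = 4096, 200            # dimensions and number of random "concept" vectors
V = rng.standard_normal((n_concepts, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

cos = V @ V.T                        # pairwise cosine similarities
off_diag = cos[~np.eye(n_concepts, dtype=bool)]
print(np.abs(off_diag).mean())       # ~0.012, i.e. about 1/sqrt(d)
print(np.abs(off_diag).max())        # still small, no optimization required
```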

Euan has been running some experiments testing exactly that, among other things. We had been planning to turn the results into a write up. Want to have a chat together and compare notes?

Spotted just now.  At a glance, this still seems to be about boolean computation though. So I think I should still write up the construction I have in mind.

Status on the proof: I think it basically checks out for residual MLPs. Hoping to get an early draft of that done today. This will still be pretty hacky in places, and definitely not well presented. Depending on how much time I end up having and how many people collaborate with me, we might finish a writeup for transformers in the next two weeks.

AIXI isn't a model of how an AGI might work inside; it's a model of how an AGI might behave if it is acting optimally. A real AGI would not be expected to act like AIXI, but it would be expected to act somewhat more like AIXI the smarter it is, since not acting like that is figuratively leaving money on the table.

The point of the whole utility maximization framing isn't that we necessarily expect AIs to have an explicitly represented utility function internally[1]. It's that as the AI gets better at getting what it wants and working out the conflicts between its various desires, its behavior will be increasingly well-predicted as optimizing some utility function. 

If a utility function can't accurately summarise your desires, that kind of means they're mutually contradictory. Not in the sense of "I value X, but I also value Y", but in the sense of "I sometimes act like I want X and don't care about Y, other times like I want Y and don't care about X."

Having contradictory desires is kind of a problem if you want to Pareto optimize for those desires well. You risk sabotaging your own plans and running around in circles. You're better off if you sit down and commit to things like "I will act as if I valued both X and Y at all times." If you're smart, you do this a lot. The more contradictions you resolve like this, the more coherent your desires will become, and the closer they'll be to being well described as a utility function.
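Here's a toy version of the "running around in circles" failure mode, with made-up items and prices, just to make the point concrete: an agent with cyclic pairwise preferences will pay a small fee at every step of a trade cycle and end up strictly poorer while holding exactly what it started with.

```python
# Toy money pump: cyclic pairwise preferences let a trader drain the agent.

# The agent's contradictory pairwise preferences: it prefers X over Y in isolation.
prefers = {("apple", "banana"), ("banana", "cherry"), ("cherry", "apple")}

def accepts_trade(have, offered):
    """Agent swaps its current item for the offered one if it prefers it, paying a fee."""
    return (offered, have) in prefers

holding, money = "apple", 10.0
fee = 0.01

for offered in ["cherry", "banana", "apple"] * 5:   # trader walks the cycle repeatedly
    if accepts_trade(holding, offered):
        holding, money = offered, money - fee

print(holding, round(money, 2))   # back to "apple", but poorer: 10.0 -> 9.85
```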

I think you can observe simple proto versions of this in humans sometimes, where people move from optimizing for whatever desire feels salient in the moment when they're kids (hunger, anger, joy, etc.), to having some impulse control and sticking to a long-term plan, even if it doesn't always feel good in the moment. 

Human adults are still broadly not smart enough to be well described as general utility maximizers. Their desires are a lot more coherent than those of human kids or other animals, but still not that coherent in absolute terms. The point where you'd expect AIs to be better described as utility maximizers than humans are would come after they're broadly smarter than humans, specifically smarter at long-term planning and optimization.

This is precisely what LLMs are still really bad at. Though efforts to make them better at it are ongoing, and seem to be among the highest priorities for the labs. Precisely because long-term consequentialist thinking is so powerful, and most of the really high-value economic activities require it.

  1. ^

    Though you could argue that at some superhuman level of capability, having an explicit-ish representation stored somewhere in the system would be likely, even if the function may not actually be used much for most minute-to-minute processing. Knowing what you really want seems handy, even if you rarely actually call it to mind during routine tasks.

Midwits are often very impressed with themselves for knowing a fancy economic rule like Ricardo's Law of Comparative Advantage!

Could we have less of this sort of thing, please? I know it's a crosspost from another site with less well-kept discussion norms, but I wouldn't want this to become a thing here as well, any more than it already has.

I think we may be close to figuring out a general mathematical framework for circuits in superposition. 

I suspect that we can get a proof that roughly shows:

  1. If we have a set of $T$ different transformers, with parameter counts $N_1,\dots,N_T$, implementing e.g. solutions to $T$ different tasks
  2. And those transformers are robust to size $\epsilon$ noise vectors being applied to the activations at their hidden layers
  3. Then we can make a single transformer with roughly $\sum_{t=1}^{T} N_t$ total parameters that can do all $T$ tasks, provided any given input only asks for a small number of those tasks to be carried out

Crucially, the total number of superposed operations we can carry out scales linearly with the network's parameter count, not its neuron count or attention head count. E.g. if each little subnetwork uses $n$ neurons per MLP layer and $d$ dimensions in the residual stream, a big network with $N$ neurons per MLP connected to a $D$-dimensional residual stream can implement about $\frac{N\,D}{n\,d}$ subnetworks, not just $\frac{N}{n}$.

This would be a generalization of the construction for boolean logic gates in superposition. It'd use the same central trick, but show that it can be applied to any set of operations or circuits, not just boolean logic gates. For example, you could superpose an MNIST image classifier network and a modular addition network with this.

So, we don't just have superposed variables in the residual stream. The computations performed on those variables are also carried out in superposition.
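As a toy illustration of what "computation in superposition" means here (this is not the construction I have in mind for the proof, just the simplest linear version of the idea): embed two unrelated linear circuits into nearly orthogonal subspaces of one big weight matrix and check that each can still be read out with only small interference.

```python
# Toy illustration (not the actual construction): two independent linear "circuits"
# computed in superposition inside one big weight matrix, with small interference.
import numpy as np

rng = np.random.default_rng(0)
D = 512                      # big network's residual-stream width
d1, d2 = 8, 12               # widths of the two small circuits

W1 = rng.standard_normal((d1, d1))   # circuit 1: an arbitrary linear map
W2 = rng.standard_normal((d2, d2))   # circuit 2: another arbitrary linear map

# Random embeddings with orthonormal columns; different circuits land in nearly
# orthogonal subspaces of the big residual stream "for free" in high dimensions.
E1 = np.linalg.qr(rng.standard_normal((D, d1)))[0]
E2 = np.linalg.qr(rng.standard_normal((D, d2)))[0]

# One big weight matrix that implements both circuits at once.
W_big = E1 @ W1 @ E1.T + E2 @ W2 @ E2.T

# Run a task-1 input through the big matrix and read its subspace back out.
x1 = rng.standard_normal(d1)
y_superposed = E1.T @ (W_big @ (E1 @ x1))
y_exact = W1 @ x1

rel_error = np.linalg.norm(y_superposed - y_exact) / np.linalg.norm(y_exact)
print(rel_error)   # small; the interference from circuit 2 shrinks as D grows
```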

Remarks:

  1. What the subnetworks are doing doesn't have to line up much with the components and layers of the big network. Things can be implemented all over the place. A single MLP and attention layer in a subnetwork could be implemented by a mishmash of many neurons and attention heads across a bunch of layers of the big network. Call it cross-layer superposition if you like.
  2. This framing doesn't really assume that the individual subnetworks are using one-dimensional 'features' represented as directions in activation space. The individual subnetworks can be doing basically anything they like in any way they like. They just have to be somewhat robust to noise in their hidden activations. 
  3. You could generalize this from $T$ subnetworks doing unrelated tasks to $T$ "circuits" each implementing some part of a big master computation. The crucial requirement is that only a small number of the circuits are used on any one forward pass.
  4. I think formulating this for transformers, MLPs and CNNs should be relatively straightforward. It's all pretty much the same trick. I haven't thought about e.g. Mamba yet.

Implications if we buy that real models work somewhat like this toy model would:

  1. There is no superposition in parameter space. A network can't have more independent operations than parameters. Every operation we want the network to implement takes some bits of description length in its parameters to specify, so the total description length scales linearly with the number of distinct operations. Overcomplete bases are only a thing in activation space.
  2. There is a set of roughly $\sum_t N_t$ Cartesian directions in the loss landscape that parametrize the $T$ individual superposed circuits.
  3. If the circuits don't interact with each other, I think the learning coefficient of the whole network might roughly equal the sum of the learning coefficients of the individual circuits?
  4. If that's the case, training a big network to solve $T$ different tasks, one per data point, is somewhat equivalent to $T$ parallel training runs trying to learn a circuit for each individual task over a subdistribution. This works because any one of the runs has a solution with a low learning coefficient, so one task won't be trying to use effective parameters that another task needs. In a sense, this would be showing how the low-hanging fruit prior works.
     

Main missing pieces:

  1. I don't have the proof yet. I think I basically see what to do to get the constructions, but I actually need to sit down and crunch through the error propagation terms to make sure they check out. 
  2. With the right optimization procedure, I think we should be able to get the parameter vectors corresponding to the $T$ individual circuits back out of the network. Apollo's interp team is playing with a setup right now that I think might be able to do this. But it's early days. We're just calibrating on small toy models at the moment.

My claim is that the natural latents the AI needs to share for this setup are not about the details of what a 'CEV' is. They are about what researchers mean when they talk about initializing, e.g., a physics simulation with the state of the Earth at a specific moment in time.

It is redundantly represented in the environment, because humans are part of the environment.

If you tell an AI to imagine what happens if humans sit around in a time loop until they figure out what they want, this will single out a specific thought experiment to the AI, provided humans and physics are concepts the AI itself thinks in.

(The time loop part and the condition for terminating the loop can be formally specified in code, so the AI doesn't need to think those are natural concepts)
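As a gesture at what "formally specified in code" could look like, here's a hypothetical pseudocode sketch; every function name is a placeholder for a query to the AI's own world model, not real machinery:

```python
# Hypothetical pseudocode sketch: each call below is a placeholder standing in
# for a query to the AI's own world model, not real or runnable machinery.

def deliberation_time_loop(world_model):
    # Initialize a simulated environment from the AI's own concept of
    # "the state of the Earth at a specified moment in time".
    sim = world_model.initialize("state of the Earth at a specified moment")

    # Loop the simulated humans until they emit the formally specified
    # termination signal indicating they've converged on what they want.
    while not sim.termination_condition_met():
        sim = world_model.step(sim)

    # The output is whatever the simulated humans settled on; predicting it
    # only requires "humans" and "physics" as concepts the AI already uses.
    return sim.read_out_result()
```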

If the AI didn't have a model of human internals that let it predict the outcome of this scenario, it would be bad at predicting humans.  
