All of beren's Comments + Replies

Yes, I guess I am overstating the possible speedup if I call it 'much much faster', but there ought to at least be a noticeable speedup by cutting out the early steps if it's basically just wasting time/data/compute to fix the distributions. It might also converge to a better and different optimum.

I think we agree here. Testing whether it converges to a better optimum would also be interesting. 

Perhaps more interesting are the consequences for the training and arch: a lot of stuff with Transformers, like special burn-in schedules or heavy (ab)use of n

... (read more)

Also, I meant to ask you, what does the learning rate schedule of these models look like? In a lot of the summary statistics plots we see either peaks and asymptotes and sometimes clear phase transitions between checkpoints 20 and 40, and I was wondering if this is related to the learning rate schedule somehow (end of warmup?)

Linear warm-up over the first 10% of training, then cosine decay to a minimum of one-tenth the peak LR, which is set to occur at the end of training (300B tokens). Peak LRs vary by model but are roughly consistent with GPT-3 and OPT values. You can find all the config details on GitHub. The main divergence from mainstream approaches relevant to this conversation is that we use a constant batch size (2M) throughout scaling. Prior work uses batch sizes up to 10x smaller for the smallest models, but we find that we can train small models with large batches without any problems. This enables us to achieve a substantial wall-clock speed-up for small models by throwing more GPUs at them. We continue to use this batch size for the 11B model for consistency, although the standard progression of batch sizes would encourage one of 3M or 4M by that point. Checkpoints 20 and 40 are at 20k and 40k iterations respectively, and the entire training run lasts 143k iterations. So they occur relatively shortly after the LR peaks, but don't coincide with anything I know to be particularly special.
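To make the schedule described above concrete, here is a minimal sketch of linear warm-up followed by cosine decay to one-tenth of the peak. The function name and the peak value of 1e-3 are placeholders of mine (the comment only says peak LRs vary by model), and the actual training configs on GitHub may differ in details such as rounding.

```python
import math

def learning_rate(step, total_steps=143_000, peak_lr=1e-3,
                  warmup_frac=0.10, min_ratio=0.10):
    """Linear warm-up, then cosine decay to min_ratio * peak_lr.

    peak_lr is a placeholder value, not taken from any real config.
    """
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = peak_lr * min_ratio
    return min_lr + (peak_lr - min_lr) * cosine

# Where checkpoints 20 and 40 sit relative to the peak at step 14,300.
for step in (14_300, 20_000, 40_000, 143_000):
    print(step, learning_rate(step))
```

Under these assumptions the peak falls at iteration 14.3k, so the 20k and 40k checkpoints both lie early in the cosine decay.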

That was my own immediate response: "if these distributions are so universal, why doesn't this show that standard initializations suck, and that you should reverse-engineer the final distribution and initialize that way?"

It might show this. As far as I know nobody has done this experiment. Either way results would be interesting.

Either the model won't train or will train much slower, which suggests that the understanding or training setup here is totally wrong in some way; or it will train at the same speed, suggesting that the distributions are misleading

... (read more)
Yes, I guess I am overstating the possible speedup if I call it 'much much faster', but there ought to at least be a noticeable speedup by cutting out the early steps if it's basically just wasting time/data/compute to fix the distributions. It might also converge to a better and different optimum. Perhaps more interesting are the consequences for the training and arch: a lot of stuff with Transformers, like special burn-in schedules or heavy (ab)use of normalization, has long struck me as potentially just hacks around bad initializations that are trying to cause divergence. I've long been impressed by how it can be possible to remove normalization entirely or train stable vanilla NNs 10,000 layers deep just by improving the initialization/distribution. Reverse-engineering the final distribution may be a helpful method. If you use the final distribution, you may be able to drop some complexity from the overall Transformer recipe.

It looks like you're experimenting with the 5 smallest models, but haven't done analysis on the 2.8B, 6.9B, or 12B models. Is that something you're planning on adding, or no?

We have done some preliminary analyses on these as well. Primary issue is just that these experiments take longer since the larger models take longer to instantiate from checkpoint (which adds up when there are 142 checkpoints). Am planning to run the same experiments on the larger models and update the post with them at some point however.

I am really very surprised that the distributi

... (read more)
1 Mark Goodhead 1mo
Have you tried fitting a Student's t distribution? The nice thing about that distribution is that the nu parameter completely controls the shape of the tails, and it is equivalent to the Gaussian when nu is infinite; this would allow you to plot a cool graph of nu against checkpoint steps to get an easy visualisation of exactly how the shape of the tails changes over time.
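As a rough illustration of tracking nu, here is a sketch that estimates it from sample kurtosis. This is a method-of-moments stand-in, not a proper maximum-likelihood fit: it relies on the fact that a Student's t distribution with nu > 4 has excess kurtosis 6 / (nu - 4), so it cannot report tails heavier than nu = 4. The function names and the toy data are my own.

```python
import math

def excess_kurtosis(samples):
    """Population excess kurtosis (0 for a Gaussian)."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    return m4 / (m2 * m2) - 3.0

def estimate_nu(samples):
    """Moment-based estimate of the Student's t degrees of freedom.

    Inverts excess_kurtosis = 6 / (nu - 4), which only holds for nu > 4.
    Returns math.inf when the tails are no heavier than a Gaussian's.
    """
    kurt = excess_kurtosis(samples)
    if kurt <= 0.0:
        return math.inf
    return 4.0 + 6.0 / kurt

# Toy data standing in for a flattened weight tensor at one checkpoint.
toy_weights = [-3.0] + [0.0] * 8 + [3.0]
print(estimate_nu(toy_weights))
```

Applying `estimate_nu` to the weights at each checkpoint would give the nu-versus-step curve suggested above; a maximum-likelihood fit would be the more careful choice for heavy tails.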

I understand that these are working with public checkpoints but I'd be interested if you have internal models to see similar statistics for the size of weight updates, both across the training run, and within short periods, to see if there are correlations between which weights are updated. Do you get quite consistent, smooth updates, or can you find little clusters where connected weights all change substantially in just a few steps?

We do have internal models and we have run similar analyses on them. For obvious reasons I can't say too much about this, bu... (read more)

My understanding is that after a lot of simplifications, policy gradients just take a noisy gradient step in the direction of minimising Bellman error, and so in the limit of infinite data/computation/visiting all states in the world, they are 'guaranteed' to converge to an optimal policy for the MDP. Q-learning and other model-free algorithms have similar guarantees. In practice, with function approximation and PPO's regularisation bits, these guarantees no longer hold, but the fundamental RL they are built on does have them. The place to go deeper into this is Sutton and Barto's textbook and also Bertsekas' dynamic programming textbook.

Yeah, I've read those books, although I admit to heavily skimming Bertsekas. 

I agree that this will probably wash out with strong optimization against it, and that such confusions become less likely the more the world models of yourself and the other agent that you are trying to simulate differ -- this is exactly what we see with empathy in humans! This is definitely not proposed as a full 'solution' to alignment. My thinking is that this effect may be useful for us in providing a natural hook to 'caring' about others, which we can then design training objectives and regimens around, allowing us to extend and optimise this value shard to a much greater extent than it occurs naturally.

1 Ben Amitay 2mo
We agree 😀 What do you think about some brainstorming in the chat about how to use that hook?

(i) It will actually be possible to (measure and) limit the amount of "optimization pressure" that an advanced A(G)I exerts (towards a given goal). If so, how likely do you think (i) is to be true?

If you have any ideas (even very rough/vague ones) for how to realize (i), I'd be curious to read them.

For this, it is not clear to me that it is impossible or even extremely difficult to do this, at least in a heuristic way. I think that managing to successfully limit the optimization power applied against our defences is fundamental to coming up with... (read more)

I suspect that {the cognitive process that produced the above sentence} is completely devoid of security mindset. If so, might be worth trying to develop security mindset? And/or recognize that one is liable to (i.a.) be wildly over-optimistic about various alignment approaches. (I notice that that sounded unkind; sorry, not meaning to be unkind.)

Yep this is definitely not proposed as some kind of secure solution to alignment (if only the world were so nice!). The primary point is that if this mechanism exists it might provide some kind of base signa... (read more)

Hmm. I wonder if you'd agree that the above relies on at least the following assumptions being true:
* (i) It will actually be possible to (measure and) limit the amount of "optimization pressure" that an advanced A(G)I exerts (towards a given goal).
* (ii) It will be possible to end the acute risk period using an A(G)I that is limited in the above way.
If so, how likely do you think (i) is to be true? If you have any ideas (even very rough/vague ones) for how to realize (i), I'd be curious to read them. I think realizing (i) would probably be at least nearly as hard as the whole alignment problem. Possibly harder. (I don't see how one would in actual practice even measure "optimization pressure".)

I think this is a mechanism that actually happens a lot. People generally do lose a lot of empathy with experience and age. People definitely get de-sensitized to both strongly negative and strongly positive experiences after viewing them a lot. I actually think that this is more likely than the RL story -- especially with positive-valence empathy which under the RL story people would be driven to seek out. 

But then every time that empathy thing happens, I obviously don’t then immediately eat chocolate. So the reward model would get an error signal—th

... (read more)
2 Steven Byrnes 2mo
Thanks! In that case I’d be interested in the “some other region to detect whether the empathy triggered for a friend or enemy”. How is that region doing that? Specifically, (1) what exactly is the “low level substrate”, (2) what are the exact recipes for turning those things into the full complex repertoire of social emotions? Those are major research interests of mine. Happy for you & anyone else to join / share ideas :)

It seems like the AIs we build will be very different from us, at least in terms of basic drives. I can definitely empathize when there's some common currency to the experience (for ex. they're feeling pain, and I've also experienced pain), but probably less so when there's a greater gap. Since AIs won't share any of our physiology or evolutionary history, I worry that that common currency will be missing, which would seemingly incentivize the AI having separate circuits for modeling humans and for modeling itself

Yes, this depends a lot on the self model o... (read more)

My prediction is that there really is an evolved nudge towards empathy in the human motivational system, and that human psychology - like usually being empathetic but sometimes modulating it and often justifying self-serving actions - is sculpted by such evolved nudges, and wouldn't be recapitulated in an AI lacking those nudges.

I agree -- this is partly what I am trying to say in the contextual modulation section. The important thing is that the base capability for empathy might exist as a substrate to then get sculpted by gradient descent / evolution to imp... (read more)

Goal misgeneralization (the global min might also be malign OOD). The thing you're talking about here I would basically describe as the first thing.

Is there a way this is different from standard Goodharting concerns? I totally agree that this is a problem, but it seems importantly conceptually distinct to me from gradient hackers/mesaoptimization.

Strongly upvoted this post. I agree very strongly with every point here. The biggest consideration for me is that alignment seems like the kind of problem which is primarily bottlenecked on serial conceptual insights rather than parallel compute. If we already had alignment methods that we know would work if we just scaled them up, the same way we have with capabilities, then racing to endgame might make sense given the opportunity costs of delaying aligned AGI. Given that a.) we don't have such techniques and b.) even if we did it would be hard to be so c... (read more)

The local minima point is interesting. My initial question is how this squares with both the theoretical and empirical findings that networks generally don't seem to get stuck in local minima, and the many hints that the general loss landscape in which they operate is fairly benign?

2 Donald Hobson 2mo
All those results show is that the local minima usually do similarly well. There are many local minima of a large neural net. Most of them get similarly high scores on the test data. Some of those could still be malign. (Ie it gets 99% of the answers correct, but the set it gets wrong is maliciously chosen.)

I think this is only possible if the coupling between the gradient hacker's implementation of its malign behaviour and the good performance is extremely strong and essentially the correlation has to be 1. It is not like gradient descent has only one knob to turn for 'more gradient hacker' or 'less gradient hacker'. Instead, it has access to all of the internal weights of the gradient hacker and will change them to both a.) strengthen the positive aspects of the gradient hacker wrt the outer loss and b.) weaken the negative aspects. I.e. so if the gradient ... (read more)

I broadly agree with a lot of shard theory claims. However, the important thing to realise is that 'human values' do not really come from inner misalignment wrt our innate reward circuitry but rather are the result of a very long process of social construction influenced both by our innate drives but also by the game-theoretic social considerations needed to create and maintain large social groups, and that these value constructs have been distilled into webs of linguistic associations learnt through unsupervised text-prediction-like objectives which is ho... (read more)

FWIW I don't consider myself to be arguing against planning over a world model.
Can you give me some examples here? I don't know that I follow what you're pointing at. 

The convergence theorems basically say that optimizing for P[t] converges to optimizing for T[t+d] for some sufficient timespan d.

The idea of a convergence theorem showing that optimizing any objective leads to empowerment has been brought up a bunch of times in these discussions, as in this quote. Is there some well-known proof/paper where this is shown? AFAICT the original empowerment papers do not show any proof like this (I may have missed it). Is this based off of Alex Turner's work (which results in a different, though intuitively related, measure than information-theoretic empowerment), or something else?

Excellent post btw.

Yes they have. There's quite a large literature on animal emotion and cognition and my general synthesis is that animals (at least mammals) have at least the same basic emotions as humans and often quite subtle ones such as empathy and a sense of fairness. It seems pretty likely to me whatever the set of base reward functions encoded in the mammalian basal ganglia and hypothalamus is, it can quite robustly generate expressed behavioural 'values' that fall within some broadly humanly recognisable set.

This is definitely the case. My prior is relatively strong that intelligence is compact, at least for complex and general tasks and behaviours. Evidence for this comes from ML -- the fact that the modern ML paradigm of huge network + lots of data + general optimiser is able to solve a large number of tasks is a fair bit of evidence for this. Other evidence is the existence of g and cortical uniformity in general, as well as our flexibility at learning skills like chess, mathematics etc which we clearly do not have any evolutionarily innate specialisation fo... (read more)

What if we learn new domains by rewiring/specialising/developing new neural circuitry for them? That is, we have a general optimiser that does dedicated cross-domain optimisation by developing narrow optimisers?

I'm basing my thinking here primarily off of Herculano-Houzel's work. If you have reasons you think this is wrong or counterarguments, I would be very interested in them as this is a moderately important part of my general model of AI.

the brain imaging studies also show predicting intelligence taps into a lot more aspects of static neuroanatomy or dynamic patterns than simply brain volume

Do you have links for these studies? Would love to have a read about what the static and dynamic correlates of g are from brain imaging!

This is a good idea! I hadn't thought that much about specific synaptic efficiency metrics. If we think about this in a bit more detail, these would effectively correspond to some kind of change in hyperparameters for an ML model. I.e. more rapid synaptic changes = potential for a higher effective learning rate. The more rapid synaptic formation (and potentially pruning?) is harder to model in ML, but I guess it would be an increase in effective parameter count.

1 Kyle Walsh 3mo
Thinking about these as changes in hyperparameters is probably the closest analogy from a ML perspective. I should note that my own area of expertise is genetic epidemiology and neuroscience, not ML, so I am less fluent discussing the computational domain than human-adjacent biological structures. At the risk of speaking outside my depth, I offer the following from the perspective of a geneticist/neuroscientist: My intuition (FWIW) is that all human brains are largely running extremely similar models, and that the large IQ differences observed are either due to 1) inter-individual variability in neuronal performance (the cycling aspect I reference above), or 2) the number of parameters that can be quickly called from storage. The former seems analogous to two machines running the same software but with an underlying difference in hardware (eg, clock rate), while the latter seems more analogous to two machines running the same software but with vastly different levels of RAM. I can’t decide whether having better functionality at the level of individual neurons is more likely to generate benefit in the “clock rate” or the “RAM” domain. Both seem plausible, and again, my apologies for jettisoning LLM analogies for more historical ones drawn from the PC era. At least I didn’t say some folks were still running vacuum tubes instead of transistors!

To some extent, yes, speed can compensate for intelligence, but this isn't really related to the question of FOOM.

In theory, if we have an AGI which is human level but 1000x faster, it might be able to perform at the level of 1000 humans rather than a human from the year 3000. If we have a giant population of AGIs such that we can replicate the entire edifice of human science but running at 1000x faster, then sure. In practice though by Amdahl's law such a speed increase would just move the bottleneck to something else (probably running experiments/gath... (read more)
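The Amdahl's law point can be sketched numerically. In the snippet below, the 90% figure for the share of research time that is "thinking" (and so accelerated) is purely illustrative, not a number from the comment; the formula is the standard one, speedup = 1 / ((1 - p) + p / s).

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when only part of the work is accelerated.

    accelerated_fraction: share of total time that benefits (p).
    factor: how much faster that share becomes (s).
    """
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)

# Illustrative assumption: 90% of the work is cognition sped up 1000x,
# while 10% (e.g. running experiments) stays at its original speed.
print(amdahl_speedup(0.9, 1000))
```

Under that assumption the overall speedup stays below 10x however fast the cognition gets, which is the sense in which the bottleneck moves elsewhere.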

That's what I meant, serial speedup of 1000x, and separately from that a sufficient population. Assuming 6 hours a day of intensive work for humans, 5 days a week, there is a 5.6x speedup from not needing to rest. With 3/4 words per token, a 1000x speedup given no need to rest requires generation speed of 240 tokens/s. LLMs can do about 20-100 tokens/s when continuing a single prompt. Response latency is already a problem in practice, so it's likely to improve.
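The arithmetic in this estimate can be checked with a short sketch. The comment does not state a human baseline rate; the value of one word per second below is my assumption, chosen because it roughly reproduces the quoted 240 tokens/s figure.

```python
HOURS_PER_WEEK = 24 * 7            # 168
HUMAN_WORK_HOURS_PER_WEEK = 6 * 5  # 6 hours a day, 5 days a week
WORDS_PER_TOKEN = 3 / 4
TARGET_SPEEDUP = 1000
HUMAN_WORDS_PER_SECOND = 1.0       # assumption, not stated in the comment

# Speedup gained purely from never needing to rest.
rest_speedup = HOURS_PER_WEEK / HUMAN_WORK_HOURS_PER_WEEK

# Remaining serial speedup needed to reach the 1000x target.
serial_speedup = TARGET_SPEEDUP / rest_speedup

# Generation speed required to think that much faster than a human.
required_tokens_per_second = serial_speedup * HUMAN_WORDS_PER_SECOND / WORDS_PER_TOKEN

print(rest_speedup, serial_speedup, required_tokens_per_second)
```

With these inputs the no-rest factor is 5.6x and the required generation speed comes out a little under 240 tokens/s.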

I largely disagree about the intrinsic motivation/reward function points. There is a lot of evidence that there is at least some amount of general intelligence which is independent of interest in particular fields/topics. Of course, if you have a high level of intelligence + interest then your dataset will be heavily oriented towards that topic and you will gain a lot of skill in it, but the underlying aptitude/intelligence can be factored out of this. 

How exactly specific interests are encoded is a different and also super fascinating question! It de... (read more)

Meant to comment on this a while back but forgot. I have thought about this also and broadly agree that early AGI with 'thoughts' at GHz levels is highly unlikely. Originally this was because pre-ML EY and the community broadly associated thoughts with CPU ops but in practice thoughts are more like forward passes through the model. 

As Connor Sullivan says, the reason brains can have low clock rates is that our intelligence algorithms are embarrassingly parallel, as is current ML. Funnily enough, for large models (and definitely if we were to run forw... (read more)

Strongly upvoted, I found this very valuable/enlightening. I think you should make this a top level answer.
I'll check your post out. I've found Cannell's post very dense/hard to read the times I've attempted it. I guess there's a large inferential distance in some aspects, so lots of it goes over my head.

Yes definitely. Pretty much the main regions of interest to us are from par-human up. Returns are almost definitely not consistent across scales. But what really matters for x-risk is whether they are positive or negative around current or near-future ML models -- i.e. can existing models or AGIs we create in the next few decades self-improve to superintelligence or not?

I'm curious what you think about my post expressing scepticism of the relevance of recursive self improvement to the deep learning paradigm.

Yes definitely. Based on my own estimates of approximate brain scale, it is likely that the current largest ML projects (GPT-4) are already within an OOM or so of the brain's effective parameter count (+- 1-2 OOM), and we will definitely have brain-scale ML systems being quite common within a decade and probably less -- hence short timelines. Strong agree that it is much easier to add compute/energy to ML models vs brains.

Have you written your estimates of brain scale up anywhere?

I've had similar questions to this before in terms of how human individual differences appear so great when the actual seeming differences in neurophysiology between +3 and -3 SD humans are so small. My current view on this is that:

a.)  General 'peak' human cognition is pretty advanced and the human brain is large even by current ML standards so by the scaling laws we should be pretty good vs existing ML systems at general tasks. This means that human intelligence is pretty 'far out' compared to current ML often and that scaling ML tasks much beyond h... (read more)

I think that this is correct with one caveat: 1. We are closing the gap between human brains and ML models, and I think this will probably happen a decade or so away from now. I think that ML and human brains will converge to the same or similar performance this century, and the big difference is more energy can be added in pretty reliably to the ML model while humans don't enjoy this advantage.

Unfortunately our code is tied too closely to our internal infrastructure for it to be worth disentangling for this post. I am considering putting together a repo containing all the plots we made though, since in the post we only publish a few exemplars and ask people to trust that the rest look similar. Most of the experiments are fairly simple and involve just gathering activations or weight data and plotting them.

Whoops! Thanks for spotting. Fixed!

Very fair point! I somehow forgot to add a counterpoint like this in there as I intended. Updated now.

Sure the limbic system evolved over millions of years, but that doesn't mean we need to evolve it as well -- we could just study it and reimplement it directly without (much) iteration. I am not necessarily saying that this is a good approach to alignment -- I personally would prefer a more theoretically grounded one also. But I think it is an interesting existence proof that imprinting fairly robust drives into agents through a very low bandwidth channel even after a lot of experience and without much RL is possible in practice.

I agree that fictional/cultural evidence is important for how people generalise their innate responses to new stimuli. Specifically, I think something similar to Steven Byrnes' proxy matching is going on. 

The idea is that we have certain hardwired instincts such as fear of death that are triggered in specific scenarios and we also independently learn a general world-model based on unsupervised learning which learns an independent and potentially un-emotive concept of death. Then we associate our instinctive reactions with this concept such that eventu... (read more)

I do want to note that it can also hijack instrumental convergence in order to achieve alignment.

Yes to some extent. Humans are definitely not completely robust to RSI / at a reflectively stable equilibrium. I do suspect though that sexual desire is at least partially reflectively stable. If people could arbitrarily rewrite their psychology I doubt that most would completely remove their sex drive or transmute it into some completely alien type of desire (some definitely would and I also think there'd be a fair bit of experimentation around the margin as well as removing/tweaking some things due to social desirability biases).

The main point though is ... (read more)

Personally, I am broadly in agreement with most of these points and especially 2, which seems very understudied given its likely importance to our survival. Would love to chat privately about your thoughts and hunches if you'd be up for it. 

Yes, this is the few-shot alignment world described in the post. I agree that in principle if boxing could completely halt RSI then that would be fantastic but that especially with each iteration of RSI there is some probability that the box will fail and we would then get unbounded RSI. This means we would get effectively a few 'shots' to align our boxed AGI before we die.

i.e. if evolution has precisely titrated the percentage of homosexuality etc so as to maximise IGF taking into account benefits of bonding, additional childcare, group selection etc, then this is actually evidence for evolution achieving a much greater level of alignment than otherwise!

What is evolution's true goal? If it's genetic fitness, then I don't see how this demonstrates alignment. Human sexuality is still just an imperfect proxy, and doesn't point at the base objective at all.  I agree that it's very interesting how robust this is to the environment we grow up in, and I would expect there to be valuable lessons here for how value formation happens (and how we can control this process in machines).

Indeed, this is exactly the kind of thing I am gesturing at. Certainly, all our repertoires of sexual behaviour are significantly shaped by RL. My point is that evolution has somehow in this case mostly solved some pointers-like problem to get the reward model to suddenly include rewards for sexual behaviour, can do so robustly, and can do so a long time after birth after a decade or so of unsupervised learning and RL has already occurred. Moreover, this reward model leads to people robustly pursuing this goal even fairly off-distribution from the ancestral environment.

Indeed, but insofar as this bonding function enhances IGF then this actually makes it an even more impressive example of alignment to evolution's true goal. I know that there are a bunch of potential evolutionary rationales proposed for homosexuality but I personally haven't studied it in depth nor are any super convincing to me so I'm just assuming the worst-case scenario for evolution here.

i.e. if evolution has precisely titrated the percentage of homosexuality etc so as to maximise IGF taking into account benefits of bonding, additional childcare, group selection etc, then this is actually evidence for evolution achieving a much greater level of alignment than otherwise!

I feel like this is a good point in general but I think there is an important but subtle distinction between the two examples. In the first case, the GAN, there is a distinction between the inner optimization loop of the ML algorithm and the outer loop of humans performing an evolutionary search process to get papers/make pretty pictures.


In the wire-heading case this feels different in that you have essentially two separate value functions -- a cortical LM based one which can extrapolate values in linguistic/concept space and a cl... (read more)

2 Steven Byrnes 3mo
I guess I want to call the second one “the actual value function defined in the agent’s source code” and the first one “the agent’s learned concept of ‘value function’” (or relatedly, “the agent’s learned concept of ‘pleasure’” / “the agent’s learned concept of ‘satisfaction’” / whatever). Other than that, I don’t think we’re in disagreement about anything, AFAICT.

I want to add to the general agreement here that I broadly agree with this strategy and think that in practice this kind of thing (multiple alignment solutions strapped onto a system in parallel) will be what alignment success looks like, and have written about this before. This agreement is primarily with the swiss-cheese/assemblage strategy.

It is important to note that I think that hodgepodge alignment strategies only really apply to near-term proto-AGI systems that we will build and won't work against a system that has strongly recurs... (read more)

Define a new type that only allows ints < 9000 if needed? For your broader question, I think there might be some safety relevant properties of some alignment primitives.
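One way to sketch such a type in Python is below. The class name `BoundedInt` and the choice to raise `ValueError` are my own illustrative choices; the comment only specifies the bound of 9000.

```python
class BoundedInt(int):
    """An int that refuses to be constructed at or above LIMIT."""

    LIMIT = 9000  # the bound mentioned in the comment

    def __new__(cls, value):
        value = int(value)
        if value >= cls.LIMIT:
            raise ValueError(f"{value} is not below {cls.LIMIT}")
        return super().__new__(cls, value)

    def __add__(self, other):
        # Re-validate on addition so the bound survives arithmetic.
        # Other operators would need the same treatment.
        return BoundedInt(int(self) + int(other))


power = BoundedInt(8000)
print(power + 999)  # still within bounds

try:
    power + 1000  # would reach 9000, so construction fails
except ValueError as err:
    print("rejected:", err)
```

Only construction and addition are guarded here; a fuller version would override the remaining arithmetic operators, since by default they return plain `int` values that escape the check.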

So the quantitative experiment you propose is a good idea -- and we will be working along these lines, extending the very preliminary experiments in the post about how big of an effect edits like this will have.

In terms of the polytopes, you are right that this doesn't really fit in with that framework but assumes a pure linear directions framework. We aren't really wedded to any specific viewpoint and are trying a lot of different perspectives to try to figure out what the correct ontology to understand neural network internals is.

Broadly, I agree with this. We are never going to have a full mechanistic understanding of literally every circuit in a TAI model in time for it to be alignment relevant (we may have fully reverse-engineered some much smaller 'model organisms' by this time though). Nor are individual humans ever going to understand all the details of exactly how such models function (even small models).

However, the arguments for mechanistic interpretability in my view are as follows:

1.) Model capacities probably follow some kind of Pareto principle -- 20% or the cir... (read more)

6 Charlie Steiner 4mo
Yeah, I think these are good points. However, I think that #1 is actually misleading. If we measure "work" in loss or in bits, then yes absolutely we can probably figure out the components that reduce loss the most. But lots of very important cognition goes into getting the last 0.01 bits of loss in LLMs, which can have big impacts on the capabilities of the model and the semantics of the outputs. I'm pessimistic on human-understanding based approaches to auditing such low-loss-high-complexity capabilities.

This seems like a super interesting result! Thanks for linking; I wasn't aware of it. I haven't specifically looked for this pattern in GPT2-medium but I will now! Interestingly, we have also been thinking along similar lines of a 3 phase sequence for processing in residual nets like transformers where the first few layers do some kind of 'large scale' reshaping process of the geometry of the data while the later layers mostly do some kind of smaller refinements which don't change the basic geometry of the representation much, and then the final layer does... (read more)

This seems like an important point but I am not sure I completely follow. How do rays differ from directions here? I agree that the SVD directions won't recover any JL kind of dense packing of directions since it is constrained to, at maximum, the dimension of the matrix. The thinking here is then that if the model tends to pack semantically similar directions into closely related dimensions, then the SVD would pick up on at least an average of this and represent it.

I also think something to keep in mind is that we are doing the SVDs over the OV and MLP we... (read more)

Great point. I agree that the singular vectors become unstable when the singular values are very close (and meaningless within the span when identical). However I don't think this is the main driver of the effect in the post. The graph of the singular vectors shown is quite misleading about the gap (this was my bad!). Because the OV matrix is effectively of rank 64, there is the sudden jump down to almost 0 which dominates the log-scale plotting. I was originally using that graph to try to show that effect, but in retrospect it is kind of an obvious one an... (read more)

Yes, this is correct. SVD necessarily won't recover the full JL packing. Given that we don't know the extent to which the network uses the full JL capacity, SVD might still get a reasonable fraction of the relevant directions. Also, if the network packs semantically similar vectors close to one another, then the SVD direction might also represent some kind of useful average of them.


Indeed, we are looking at sparse coding to try to construct an overcomplete basis, as a parallel project. Stay tuned for this.
