Good idea -- will run this experiment!
Also, I meant to ask you, what does the learning rate schedule of these models look like? In a lot of the summary statistics plots we see peaks, asymptotes, and sometimes clear phase transitions between checkpoints 20 and 40, and I was wondering if this is related to the learning rate schedule somehow (end of warmup?)
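(For illustration of what I mean, a minimal sketch of a generic linear-warmup + cosine-decay schedule; the warmup length, peak LR, total steps, and checkpoint spacing here are all assumptions for the sake of the example, not the actual training configuration of these models:)

```python
import math

def lr_at_step(step, peak_lr=1e-4, warmup_steps=3000, total_steps=143000, min_lr_frac=0.1):
    """Linear warmup followed by cosine decay to a fraction of the peak LR.
    All hyperparameters are illustrative assumptions only."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_frac + (1 - min_lr_frac) * cosine)

# If checkpoints were spaced roughly every 1000 steps (an assumption),
# checkpoints 20-40 would correspond to steps ~20k-40k, i.e. well after
# a ~3k-step warmup under these assumed hyperparameters.
print([round(lr_at_step(s), 6) for s in (0, 1500, 3000, 20000, 40000)])
```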
That was my own immediate response: "if these distributions are so universal, why doesn't this show that standard initializations suck, and that you should reverse-engineer the final distribution and initialize that way?"
It might show this. As far as I know nobody has done this experiment. Either way, the results would be interesting.
...Either the model won't train or will train much slower, which suggests that the understanding or training setup here is totally wrong in some way; or it will train at the same speed, suggesting that the distributions are misleading
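(A minimal sketch of what that experiment could look like, assuming a PyTorch model and a fully trained reference checkpoint of the same architecture; resampling each tensor's values with replacement is just one possible way to "reverse-engineer" the final distribution:)

```python
import torch

@torch.no_grad()
def init_from_final_distribution(fresh_model, trained_model):
    """Re-initialise each parameter of `fresh_model` by sampling with
    replacement from the empirical distribution of the corresponding
    trained parameter tensor. This matches the marginal per-tensor
    distribution but destroys any learned structure/correlations."""
    for (name, p_new), (_, p_trained) in zip(
        fresh_model.named_parameters(), trained_model.named_parameters()
    ):
        flat = p_trained.detach().flatten()
        idx = torch.randint(0, flat.numel(), (p_new.numel(),))
        p_new.copy_(flat[idx].reshape(p_new.shape))

# One would then train `fresh_model` from this init and compare the loss
# curve against a standard initialisation.
```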
It looks like you're experimenting with the 5 smallest models, but haven't done analysis on the 2.8B, 6.9B, or 12B models. Is that something you're planning on adding, or no?
We have done some preliminary analyses on these as well. The primary issue is just that these experiments take longer, since the larger models take longer to instantiate from checkpoint (which adds up when there are 142 checkpoints). I am planning to run the same experiments on the larger models and update the post with them at some point, however.
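(For anyone wanting to reproduce this on the public models: the intermediate checkpoints are published as revisions on Hugging Face, so, as a rough sketch assuming the standard Pythia model names and revision naming, the sweep looks something like this:)

```python
from transformers import AutoModelForCausalLM

# Illustrative sketch: load a few intermediate Pythia checkpoints by revision.
# Instantiating the larger models (2.8B+) this way is what makes the sweep slow.
steps = [1000, 20000, 40000, 143000]  # a subset of the available checkpoints
for step in steps:
    model = AutoModelForCausalLM.from_pretrained(
        "EleutherAI/pythia-160m-deduped", revision=f"step{step}"
    )
    # ... gather weight/activation statistics for this checkpoint here ...
```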
...I am really very surprised that the distributi
I understand that these analyses use public checkpoints, but if you have internal models I'd be interested to see similar statistics for the size of weight updates, both across the training run and within short periods, to see if there are correlations between which weights are updated. Do you get quite consistent, smooth updates, or can you find little clusters where connected weights all change substantially in just a few steps?
We do have internal models and we have run similar analyses on them. For obvious reasons I can't say too much about this, bu...
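(On the public checkpoints at least, the kind of statistic being asked about is straightforward to compute; a rough sketch, with purely illustrative model and step choices, again assuming the standard Pythia revision naming:)

```python
from transformers import AutoModelForCausalLM

def update_sizes(name, step_a, step_b):
    """RMS size of the weight update between two checkpoints, per parameter tensor."""
    m_a = AutoModelForCausalLM.from_pretrained(name, revision=f"step{step_a}")
    m_b = AutoModelForCausalLM.from_pretrained(name, revision=f"step{step_b}")
    sizes = {}
    for (pname, pa), (_, pb) in zip(m_a.named_parameters(), m_b.named_parameters()):
        delta = pb.detach() - pa.detach()
        sizes[pname] = delta.norm().item() / pa.numel() ** 0.5
    return sizes

# e.g. compare a short window early in training vs late in training
early = update_sizes("EleutherAI/pythia-160m-deduped", 1000, 2000)
late = update_sizes("EleutherAI/pythia-160m-deduped", 100000, 101000)
```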
My understanding is that after a lot of simplifications, policy gradients just takes a noisy gradient step in the direction of minimising Bellman error, and so in the limit of infinite data/computation/visiting all states in the world, it is 'guaranteed' to converge to an optimal policy for the MDP. Q-learning and other model-free algorithms have similar guarantees. In practice, with function approximation and PPO's regularisation bits, these guarantees do not hold anymore, but the fundamental RL they are built off of does have them. The place to go deeper into this is Sutton and Barto's textbook and also Bertsekas' dynamic programming textbook.
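(For concreteness, the tabular results being referenced take roughly the following standard forms, as in Sutton & Barto; the notation here is mine:)

```latex
% Tabular Q-learning update; converges to the optimal Q* under the usual
% conditions (every state-action pair visited infinitely often,
% Robbins-Monro step sizes):
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]

% Policy gradient theorem: the REINFORCE-style estimator is an unbiased
% (noisy) estimate of the gradient of the expected return J(\theta):
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]
```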
I agree that this will probably wash out with strong optimization against it, and that such confusions become less likely the more different the world models of yourself and the other agent you are trying to simulate are -- this is exactly what we see with empathy in humans! This is definitely not proposed as a full 'solution' to alignment. My thinking is that this effect may be useful for us in providing a natural hook to 'caring' about others, which we can then design training objectives and regimens around to extend and optimise this value shard to a much greater extent than it occurs naturally.
(i) It will actually be possible to (measure and) limit the amount of "optimization pressure" that an advanced A(G)I exerts (towards a given goal). If so, how likely do you think (i) is to be true?
If you have any ideas (even very rough/vague ones) for how to realize (i), I'd be curious to read them.
It is not clear to me that this is impossible, or even extremely difficult, at least in a heuristic way. I think that managing to successfully limit the optimization power applied against our defences is fundamental to coming up with...
I suspect that {the cognitive process that produced the above sentence} is completely devoid of security mindset. If so, might be worth trying to develop security mindset? And/or recognize that one is liable to (i.a.) be wildly over-optimistic about various alignment approaches. (I notice that that sounded unkind; sorry, not meaning to be unkind.)
Yep this is definitely not proposed as some kind of secure solution to alignment (if only the world were so nice!). The primary point is that if this mechanism exists it might provide some kind of base signa...
I think this is a mechanism that actually happens a lot. People generally do lose a lot of empathy with experience and age. People definitely get de-sensitized to both strongly negative and strongly positive experiences after viewing them a lot. I actually think that this is more likely than the RL story -- especially with positive-valence empathy which under the RL story people would be driven to seek out.
...But then every time that empathy thing happens, I obviously don’t then immediately eat chocolate. So the reward model would get an error signal—th
It seems like the AIs we build will be very different from us, at least in terms of basic drives. I can definitely empathize when there's some common currency to the experience (for ex. they're feeling pain, and I've also experienced pain), but probably less so when there's a greater gap. Since AIs won't share any of our physiology or evolutionary history, I worry that that common currency will be missing, which would seemingly incentivize the AI having separate circuits for modeling humans and for modeling itself.
Yes, this depends a lot on the self model o...
My prediction is that there really is an evolved nudge towards empathy in the human motivational system, and that human psychology - like usually being empathetic but sometimes modulating it and often justifying self-serving actions - is sculpted by such evolved nudges, and wouldn't be recapitulated in an AI lacking those nudges.
I agree -- this is partly what I am trying to say in the contextual modulation section. The important thing is that the base capability for empathy might exist as a substrate to then get sculpted by gradient descent / evolution to imp...
Goal misgeneralization (the global min might also be malign OOD). The thing you're talking about here I would basically describe as the first thing.
Is there a way this is different from standard Goodharting concerns? I totally agree that this is a problem, but it seems importantly conceptually distinct to me from gradient hackers/mesaoptimization.
Strongly upvoted this post. I agree very strongly with every point here. The biggest consideration for me is that alignment seems like the kind of problem which is primarily bottlenecked on serial conceptual insights rather than parallel compute. If we already had alignment methods that we know would work if we just scaled them up, the same way we have with capabilities, then racing to endgame might make sense given the opportunity costs of delaying aligned AGI. Given that a.) we don't have such techniques and b.) even if we did it would be hard to be so c...
The local minima point is interesting. My initial question is how this squares with both the theoretical and empirical findings that networks generally don't seem to get stuck in local minima, and the many hints that the general loss landscape in which they operate is fairly benign?
I think this is only possible if the coupling between the gradient hacker's implementation of its malign behaviour and the good performance is extremely strong and essentially the correlation has to be 1. It is not like gradient descent has only one knob to turn for 'more gradient hacker' or 'less gradient hacker'. Instead, it has access to all of the internal weights of the gradient hacker and will change them to both a.) strengthen the positive aspects of the gradient hacker wrt the outer loss and b.) weaken the negative aspects. I.e. so if the gradient ...
I broadly agree with a lot of shard theory claims. However, the important thing to realise is that 'human values' do not really come from inner misalignment wrt our innate reward circuitry but rather are the result of a very long process of social construction influenced both by our innate drives but also by the game-theoretic social considerations needed to create and maintain large social groups, and that these value constructs have been distilled into webs of linguistic associations learnt through unsupervised text-prediction-like objectives which is ho...
The convergence theorems basically say that optimizing for P[t] converges to optimizing for T[t+d] for some sufficient timespan d.
The idea of a convergence theorem showing that optimizing any objective leads to empowerment has been brought up a bunch of times in these discussions, as in this quote. Is there some well-known proof/paper where this is shown? AFAICT the original empowerment papers do not include a proof like this (I may have missed it). Is this based off of Alex Turner's work (https://arxiv.org/pdf/1912.01683.pdf), which results in a different measure than information-theoretic empowerment (but one intuitively related to it), or something else?
Excellent post btw.
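(For reference, the information-theoretic empowerment I mean here, following Klyubin et al., is the channel capacity from an agent's n-step action sequences to its resulting sensor state; the quoted convergence claim would need a theorem connecting arbitrary objectives to maximisation of this quantity, which is what I'm asking for a source on:)

```latex
% n-step empowerment of a state s: the channel capacity from the agent's
% action sequence A^n = (a_t, ..., a_{t+n-1}) to the resulting sensor state S_{t+n}
\mathcal{E}_n(s) \;=\; \max_{p(a^n)} \; I\!\left(A^n ;\, S_{t+n} \,\middle|\, s_t = s\right)
```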
Yes they have. There's quite a large literature on animal emotion and cognition, and my general synthesis is that animals (at least mammals) have at least the same basic emotions as humans and often quite subtle ones such as empathy and a sense of fairness. It seems pretty likely to me that whatever the set of base reward functions encoded in the mammalian basal ganglia and hypothalamus is, it can quite robustly generate expressed behavioural 'values' that fall within some broadly humanly recognisable set.
This is definitely the case. My prior is relatively strong that intelligence is compact, at least for complex and general tasks and behaviours. Evidence for this comes from ML -- the fact that the modern ML paradigm of huge network + lots of data + general optimiser can solve a large number of tasks. Other evidence is the existence of g and cortical uniformity in general, as well as our flexibility at learning skills like chess, mathematics, etc., which we clearly do not have any evolutionarily innate specialisation fo...
I'm basing my thinking here primarily off of Herculano-Houzel's work. If you have reasons you think this is wrong or counterarguments, I would be very interested in them as this is a moderately important part of my general model of AI.
the brain imaging studies also show predicting intelligence taps into a lot more aspects of static neuroanatomy or dynamic patterns than simply brain volume
Do you have links for these studies? Would love to have a read about what the static and dynamic correlates of g are from brain imaging!
This is a good idea! I hadn't thought that much about specific synaptic efficiency metrics. If we think about this in a bit more detail, these would effectively correspond to changes in hyperparameters for an ML model, i.e. more rapid synaptic changes = potential for an effectively higher learning rate. More rapid synaptic formation (and potentially pruning?) is harder to model in ML, but I guess it would be an increase in effective parameter count.
To some extent, yes, speed can compensate for intelligence, but this isn't really related to the question of FOOM.
In theory, if we have an AGI which is human level but 1000x faster, it might be able to perform at the level of 1000 humans rather than a human from the year 3000. If we have a giant population of AGIs such that we can replicate the entire edifice of human science but running at 1000x faster, then sure. In practice though by Amdahl's law such a speed increase would just move the bottleneck to something else (probably running experiments/gath...
I largely disagree about the intrinsic motivation/reward function points. There is a lot of evidence that there is at least some amount of general intelligence which is independent of interest in particular fields/topics. Of course, if you have a high level of intelligence + interest then your dataset will be heavily oriented towards that topic and you will gain a lot of skill in it, but the underlying aptitude/intelligence can be factored out of this.
How exactly specific interests are encoded is a different and also super fascinating question! It de...
Meant to comment on this a while back but forgot. I have thought about this also and broadly agree that early AGI with 'thoughts' at GHz levels is highly unlikely. Originally this was because pre-ML EY and the community broadly associated thoughts with CPU ops but in practice thoughts are more like forward passes through the model.
As Connor Sullivan says, the reason brains can have low clock rates is that our intelligence algorithms are embarrassingly parallel, as is current ML. Funnily enough, for large models (and definitely if we were to run forw...
I've written up some of my preliminary thoughts and estimates here: https://www.beren.io/2022-08-06-The-scale-of-the-brain-vs-machine-learning/.
Jacob Cannell's post on brain efficiency https://www.lesswrong.com/posts/xwBuoE9p8GE7RAuhd/brain-efficiency-much-more-than-you-wanted-to-know is also very good
Yes definitely. Pretty much the main regions of interest to us are from par-human up. Returns are almost definitely not consistent across scales. But what really matters for X-risk is whether they are positive or negative around current or near-future ML models -- i.e. can existing models, or AGIs we create in the next few decades, self-improve to superintelligence or not?
Yes definitely. Based on my own estimates of approximate brain scale, it is likely that the current largest ML projects (GPT-4) are within an OOM or so of effective parameter count already (+- 1-2 OOM), and we will definitely have brain-scale ML systems being quite common within a decade and probably less -- hence short timelines. Strong agree that it is much easier to add compute/energy to ML models vs brains.
I've had similar questions to this before in terms of how human individual differences appear so great when the apparent differences in neurophysiology between +3 and -3 SD humans are so small. My current view on this is that:
a.) General 'peak' human cognition is pretty advanced, and the human brain is large even by current ML standards, so by the scaling laws we should be pretty good vs existing ML systems at general tasks. This means that human intelligence is often pretty 'far out' compared to current ML, and that scaling ML tasks much beyond h...
Unfortunately our code is tied too closely to our internal infrastructure for it to be worth disentangling for this post. I am considering putting together a repo containing all the plots we made, though, since in the post we only publish a few exemplars and ask people to trust that the rest look similar. Most of the experiments are fairly simple and involve just gathering activation or weight data and plotting it.
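(As a rough illustration of what "gathering activations and plotting them" amounts to, here is a minimal sketch using forward hooks; the model choice, layer index, and prompt are placeholders, not our actual setup:)

```python
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m-deduped"  # placeholder choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

activations = []

def hook(_module, _inp, out):
    # GPT-NeoX blocks return a tuple whose first element is the hidden states
    hidden = out[0] if isinstance(out, tuple) else out
    activations.append(hidden.detach().flatten())

# attach to one residual-stream block (the index is arbitrary here)
handle = model.gpt_neox.layers[6].register_forward_hook(hook)
with torch.no_grad():
    model(**tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt"))
handle.remove()

plt.hist(torch.cat(activations).numpy(), bins=200, log=True)
plt.xlabel("activation value")
plt.ylabel("count")
plt.show()
```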
Whoops! Thanks for spotting. Fixed!
Very fair point! I somehow forgot to add a counterpoint like this in there as I intended. Updated now.
Sure the limbic system evolved over millions of years, but that doesn't mean we need to evolve it as well -- we could just study it and reimplement it directly without (much) iteration. I am not necessarily saying that this is a good approach to alignment -- I personally would prefer a more theoretically grounded one also. But I think it is an interesting existence proof that imprinting fairly robust drives into agents through a very low bandwidth channel even after a lot of experience and without much RL is possible in practice.
I agree that fictional/cultural evidence is important for how people generalise their innate responses to new stimuli. Specifically, I think something similar to Steven Byrnes' proxy matching is going on.
The idea is that we have certain hardwired instincts such as fear of death that are triggered in specific scenarios and we also independently learn a general world-model based on unsupervised learning which learns an independent and potentially un-emotive concept of death. Then we associate our instinctive reactions with this concept such that eventu...
Yes to some extent. Humans are definitely not completely robust to RSI / at a reflectively stable equilibrium. I do suspect though that sexual desire is at least partially reflectively stable. If people could arbitrarily rewrite their psychology I doubt that most would completely remove their sex drive or transmute it into some completely alien type of desire (some definitely would and I also think there'd be a fair bit of experimentation around the margin as well as removing/tweaking some things due to social desirability biases).
The main point though is ...
Personally, I am broadly in agreement with most of these points and especially 2, which seems very understudied given its likely importance to our survival. Would love to chat privately about your thoughts and hunches if you'd be up for it.
Yes, this is the few-shot alignment world described in the post. I agree that in principle, if boxing could completely halt RSI, that would be fantastic, but with each iteration of RSI there is some probability that the box will fail and we would then get unbounded RSI. This means we would get effectively a few 'shots' to align our boxed AGI before we die.
i.e. if evolution has precisely titrated the percentage of homosexuality etc so as to maximise IGF taking into account benefits of bonding, additional childcare, group selection etc, then this is actually evidence for evolution achieving a much greater level of alignment than otherwise!
Indeed, this is exactly the kind of thing I am gesturing at. Certainly, all our repertoires of sexual behaviour are significantly shaped by RL. My point is that evolution has somehow in this case mostly solved some pointers-like problem to get the reward model to suddenly include rewards for sexual behaviour, can do so robustly, and can do so a long time after birth after a decade or so of unsupervised learning and RL has already occurred. Moreover, this reward model leads to people robustly pursuing this goal even fairly off-distribution from the ancestral environment.
Indeed, but insofar as this bonding function enhances IGF then this actually makes it an even more impressive example of alignment to evolution's true goal. I know that there are a bunch of potential evolutionary rationales proposed for homosexuality but I personally haven't studied it in depth nor are any super convincing to me so I'm just assuming the worst-case scenario for evolution here.
I feel like this is a good point in general, but I think there is an important but subtle distinction between the two examples. In the first case, of the GAN, the distinction is between the inner optimization loop of the ML algorithm and the outer loop of humans performing an evolutionary search process to get papers/make pretty pictures.
In the wire-heading case this feels different in that you have essentially two separate value functions -- a cortical LM based one which can extrapolate values in linguistic/concept space and a cl...
I want to add to the general agreement here that I broadly agree with this strategy and think that in practice this kind of thing (multiple alignment solutions strapped onto a system in parallel) will be what alignment success looks like, and have written about this before. This agreement is primarily with the swiss-cheese/assemblage strategy.
It is important to note that I think that hodgepodge alignment strategies only really apply to near-term proto-AGI systems that we will build and won't work against a system that has strongly recurs...
So the quantitative experiment you propose is a good idea -- and we will be working along these lines, extending the very preliminary experiments in the post about how big of an effect edits like this will have.
In terms of the polytopes, you are right that this doesn't really fit in with that framework but assumes a pure linear directions framework. We aren't really wedded to any specific viewpoint and are trying a lot of different perspectives to try to figure out what the correct ontology to understand neural network internals is.
Broadly, I agree with this. We are never going to have a full mechanistic understanding of literally every circuit in a TAI model in time for it to be alignment relevant (we may have fully reverse-engineered some much smaller 'model organisms' by this time, though). Nor are individual humans ever going to understand all the details of exactly how such models function (even small models).
However, the arguments for mechanistic interpretability in my view are as follows:
1.) Model capacities probably follow some kind of Pareto principle -- 20% of the cir...
This seems like a super interesting result! Thanks for linking; I wasn't aware of it. I haven't specifically looked for this pattern in GPT2-medium, but I will now! Interestingly, we have also been thinking along similar lines of a 3-phase sequence for processing in residual nets like transformers, where the first few layers do some kind of 'large-scale' reshaping of the geometry of the data, the later layers mostly do smaller refinements which don't change the basic geometry of the representation much, and then the final layer does...
This seems like an important point, but I am not sure I completely follow. How do rays differ from directions here? I agree that the SVD directions won't recover any JL kind of dense packing of directions, since it is constrained to, at maximum, the dimension of the matrix. The thinking here is then that if the model tends to pack semantically similar directions into closely related dimensions, then the SVD would pick up on at least an average of this and represent it.
I also think something to keep in mind is that we are doing the SVDs over the OV and MLP we...
Great point. I agree that the singular vectors become unstable when the singular values are very close (and meaningless within the span when identical). However, I don't think this is the main driver of the effect in the post. The graph of the singular values shown is quite misleading about the gap (this was my bad!). Because the OV matrix is effectively of rank 64, there is a sudden jump down to almost 0, which dominates the log-scale plotting. I was originally using that graph to try to show that effect, but in retrospect it is kind of an obvious one an...
Yes, this is correct. SVD necessarily won't recover the full JL packing. Given that we don't know the extent to which the network uses the full JL capacity, SVD might still get a reasonable fraction of the relevant directions. Also, if the network packs semantically similar vectors close to one another, then the SVD direction might also represent some kind of useful average of them.
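(For concreteness, a rough sketch of the kind of SVD analysis under discussion: forming the OV matrix for a single head and projecting its output-side singular vectors through the unembedding to read off nearest tokens. The attribute layout assumed here is GPT-2's in transformers, and the layer/head indices are arbitrary; this is not the exact code from the post.)

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
tok = GPT2Tokenizer.from_pretrained("gpt2-medium")

layer, head = 10, 5                      # arbitrary choices
d_model = model.config.n_embd
d_head = d_model // model.config.n_head

attn = model.transformer.h[layer].attn
# c_attn.weight is [d_model, 3*d_model] (Q, K, V stacked); take this head's V slice
W_V = attn.c_attn.weight[:, 2 * d_model + head * d_head : 2 * d_model + (head + 1) * d_head]
# c_proj.weight is [d_model, d_model]; take the rows this head writes through
W_O = attn.c_proj.weight[head * d_head : (head + 1) * d_head, :]

OV = (W_V @ W_O).detach()                # [d_model, d_model], rank <= d_head (64ish)
U, S, Vh = torch.linalg.svd(OV)

# Interpret the top output-side singular directions via the unembedding.
# Note the sign of each singular vector is arbitrary, so -Vh[i] is also worth checking.
for i in range(3):
    logits = model.lm_head.weight.detach() @ Vh[i]   # [vocab]
    top = logits.topk(10).indices
    print(i, tok.convert_ids_to_tokens(top.tolist()))
```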
Indeed, we are looking at sparse coding to try to construct an overcomplete basis, as a parallel project. Stay tuned for this.
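(As a very rough sketch of the sparse-coding direction, using sklearn's MiniBatchDictionaryLearning to learn an overcomplete dictionary over a matrix of collected activations; every detail here, including the random placeholder data and the sizes, is illustrative rather than our actual setup:)

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# `acts` would be an [n_samples, d_model] matrix of residual-stream activations
# gathered from the model; random data stands in here purely as a placeholder.
d_model, n_samples, n_atoms = 256, 10_000, 1024   # overcomplete: n_atoms >> d_model
acts = np.random.randn(n_samples, d_model).astype(np.float32)

dico = MiniBatchDictionaryLearning(
    n_components=n_atoms,          # size of the overcomplete dictionary
    alpha=1.0,                     # sparsity penalty on the codes
    batch_size=256,
    transform_algorithm="lasso_lars",
)
codes = dico.fit(acts).transform(acts)   # sparse codes: [n_samples, n_atoms]
dictionary = dico.components_            # learned directions: [n_atoms, d_model]
print((codes != 0).mean())               # average fraction of active atoms
```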
I think we agree here. Testing whether it converges to a better optimum would also be interesting.