All of peterbarnett's Comments + Replies

Your link to Import AI 337 currently links to the email; it should be this:

I was previously pretty dubious about interpretability results leading to capabilities advances. I've only really seen two papers which did this for LMs and they came from the same lab in the past few months. 
It seemed to me like most of the advances in modern ML (other than scale) came from people tinkering with architectures and seeing which modifications increased performance.[1]

But in a conversation with Oliver Habryka and others, it was brought up that as AI models are getting larger and more expensive, this tinkering will get more difficult and ...

They did run the tests for all models, from Table 1:

(the columns are GPT-4, GPT-4 (no vision), GPT-3.5)

So could an AI engineer create an AI blob of compute the same size as the brain, with its same structural parameters, feed it the same training data, and get the same result ("don't steal" rather than "don't get caught")?

There is a disconnect with this question. 

I think Scott is asking “Supposing an AI engineer could create something that was effectively a copy of a human brain and the same training data, then could this thing learn the “don’t steal” instinct over the “don’t get caught” instinct?” 
Eliezer is answering “Is an AI engineer able to create a copy of the human brain, provide it with the same training data a human got, and get the “don’t steal” instinct?”

Yeah, this read really bizarrely to me. This is a good way of making sense of that section, maybe. But then I'm still confused why Scott concluded "oh I was just confused in this way" and then EY said "yup that's why you were confused", and I'm still like "nope Scott's question seems correctly placed; evolutionary history is indeed screened off by the runtime hyperparameterization and dataset." 

Are you offering productivity/performance coaching for new alignment researchers, or coaching + research mentorship for alignment researchers? 

If you're offering research mentorship, it might be helpful to give some examples of the type of research you do

Yeah, I totally agree. My motivation for writing the first section was that people use the word 'deception' to refer to both things, and then make what seem like incorrect inferences. For example, current ML systems do the 'Goodhart deception' thing, but then I've heard people use this to imply that they might be doing 'consequentialist deception'.

These two things seem close to unrelated, except for the fact that 'Goodhart deception' shows us that AI systems are capable of 'tricking' humans. 

Okay I see, yep that makes sense to me (-:

This phrase is mainly used and known by the euthanasia advocacy movement. This might also be bad, but not as bad as the association with Jim Jones.

My initial thought is that I don't see why this powerful optimizer would attempt to optimize things in the world, rather than just do some search thing internally. 

I agree with your framing of "how did this thing get made, since we're not allowed to just postulate it into existence?". I can imagine a language model which manages to output words which cause strokes in whoever reads its outputs, but I think you'd need a pretty strong case for why this would be made in practice by the training process.

Say you have some powerful optimizer language ...

This list of email scripts from 80,000 Hours also seems useful here

My Favourite Slate Star Codex Posts

This is what I send people when they tell me they haven't read Slate Star Codex and don't know where to start.

Here are a bunch of lists of top ssc posts:

These lists are vaguely ranked in the order of how confident I am that they are good (if interested in psychology almost all the stuff here is good ...

Hmm, this might come down to how independent the parameters are. I think in large networks there will generally be enough independence between the parameters for local minima to be rare (although possible). 

As a toy example of moving from one algorithm to another, if the network is large enough we can just have the output being a linear combination of the two algorithms and up-regulate one, and down regulate the other:

And the mixing coefficient is changed from 1 to 0.
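A minimal sketch of the toy example described above, assuming the output has the form `alpha * A + (1 - alpha) * B` with `alpha` moved from 1 to 0. The symbol name, the lookup-table "algorithm" and the squaring rule are all hypothetical stand-ins, not taken from the original comment:

```python
# Toy sketch: blend a memorising "algorithm" with a general one and slide
# the weight from one to the other. All names here are hypothetical.

train_xs = [0, 1, 2, 3]
test_xs = [4, 5, 6]


def target(x):
    """The function both algorithms are trying to compute."""
    return x * x


# Algorithm A: a lookup table that only knows the training points.
table = {x: target(x) for x in train_xs}


def memorised(x):
    return table.get(x, 0)  # outputs 0 for anything it has not seen


# Algorithm B: the general rule.
def general(x):
    return x * x


def blended(x, alpha):
    """Linear combination of the two algorithms; alpha weights algorithm A."""
    return alpha * memorised(x) + (1 - alpha) * general(x)


def mse(xs, alpha):
    return sum((blended(x, alpha) - target(x)) ** 2 for x in xs) / len(xs)


# Move alpha from 1 (pure memorisation) to 0 (pure general rule).
alphas = [1.0, 0.75, 0.5, 0.25, 0.0]
train_losses = [mse(train_xs, a) for a in alphas]
test_losses = [mse(test_xs, a) for a in alphas]

for a, tr, te in zip(alphas, train_losses, test_losses):
    print(f"alpha={a:.2f}  train_loss={tr:.3f}  test_loss={te:.3f}")
```

In this sketch the training loss stays flat along the whole path because both algorithms agree on the training points, so nothing resists moving the weight. Whether a trained network actually has such a decoupled path is exactly what is in question in this thread.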

The network needs to be large enough so the ...

I can see how the initial parameters are independent. After a significant amount of training though...?
Sure, but that's not moving from A to B. That's pruning from A+B to B. ...which, now that I think about it, is effectively just a restatement of the Lottery Ticket Hypothesis[1]. Hm. I wonder if the Lottery Ticket Hypothesis holds for grok'd networks?
[1] etc.

I think there is likely to be a path from a model with shallow circuits to a model with deeper circuits which doesn't need any 'activation energy' (it's not stuck in a local minimum). For a model with many parameters, there are unlikely to be many places where all the derivatives of the loss with respect to all the parameters are zero. There will almost always be at least one direction to move in which decreases the shallow circuit while increasing the general one, and hence doesn't really hurt the loss.
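A rough version of the counting heuristic behind this kind of claim, written out under an assumption that is not stated in the comment itself: that at a critical point of an n-parameter loss, the signs of the Hessian eigenvalues behave like independent fair coin flips.

```latex
\nabla_\theta L(\theta^*) = 0
\quad\text{and}\quad
\theta^* \text{ is a local minimum only if }
\lambda_i\!\left(\nabla_\theta^2 L(\theta^*)\right) \ge 0
\ \text{ for all } i = 1, \dots, n,
\qquad
\Pr\!\left[\lambda_1, \dots, \lambda_n \ge 0\right] \approx 2^{-n}.
```

Under that assumption almost every critical point is a saddle, and a saddle has at least one direction along which the loss decreases. The heuristic leans entirely on the independence assumption, so it says nothing about parameters that have become strongly correlated through training.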

Hm. This may be a case where this domain is very different than the one I know, and my heuristics are all wrong. In RTL I can see incrementally moving from one implementation of an algorithm to another implementation of the same algorithm, sure. I don't see how you could incrementally move from, say, a large cascaded-adder multiplier to a Karatsuba multiplier without passing through an intermediate state of higher loss.
In other words: this is a strong assertion. I see the statistical argument that it holds for independent parameters, but the very fact that you've been training the network rather implies the parameters are no longer independent.
The connectivity of the transistors in the storage array of a 1MiB SRAM has millions[1] of parameters. Nevertheless, changing any one connection decreases the fitness. (And hence the derivative is ~0, with a negative 2nd derivative.) Changing a bunch of connections, say by adding a row and removing a column, may improve fitness. But there's a definite activation energy there.
[1] Or, depending on how exactly you count, significantly more than millions of parameters.

I'm from Dunedin and went to high school there (in the 2010s), so I guess I can speak to this a bit.

Co-ed schools were generally lower decile (=lower socio-economic backgrounds) than the single sex schools (here is data taken from Wikipedia on this). The selection based on 'ease of walking to school' is still a (small) factor, but I expect this would have been a larger factor in the 70s when there was worse public transport. In some parts of NZ school zoning is a huge deal, with people buying houses specifically to get into a good zone (especially in Auckla...

I think that this is a possible route to take, but I don't think we currently have a good enough understanding of how to control/align mesa-optimizers to be able to do this.

I worry a bit that even if we correctly align a mesa-optimizer (make its mesa-objective aligned with the base-objective), its actual optimization process might be faulty/misaligned and so it would mistakenly spawn a misaligned optimizer. I think this faulty optimization process is most likely to happen sometime in the middle of training, where the mesa-optimizer is able to make anothe...

Jalex Stark (2y):
I agree with you that "hereditary mesa-alignment" is hard. I just don't think that "avoid all mesa-optimization" is a priori much easier.

So do you think that the only way to get to AGI is via a learned optimizer? 
I think that the definitions of AGI (and probably optimizer) here are maybe a bit fuzzy.

I think it's pretty likely that it is possible to develop AI systems which are more competent than humans in a variety of important domains, which don't perform some kind of optimization process as part of their computation. 

Sam Ringer (2y):
I think the failure case identified in this post is plausible (and likely) and is very clearly explained, so props for that! However, I agree with Jacob's criticism here. Any AGI success story basically has to have "the safest model" also be "the most powerful" model, because of incentives and coordination problems. Models that are themselves optimizers are going to be significantly more powerful and useful than "optimizer free" models. So the suggestion of trying to avoid mesa-optimization altogether is a bit of a fabricated option. There is an interesting parallel here with the suggestion of just "not building agents". So from where I am sitting, we have no option but to tackle aligning the mesa-optimizer cascade head-on.
AGI will require both learning and planning, the latter of which is already then a learned mesa optimizer. And AGI may help create new AGI, which is also a form of mesa-optimization. Yes it's unavoidable. To create friendly but powerful AGI, we need to actually align it to human values. Creating friendly but weak AI doesn't matter.

Thanks for your reply! I agree that I might be a little overly dismissive of the loss landscape frame. I mostly agree with your point about convergent gradient hackers existing at minima of the base-objective; I briefly commented on this (although not in the section on coupling):

Here we can think of the strong coupling regime as being a local minimum in the loss landscape, such that any step to remove the gradient hacker leads locally to worse performance. In the weak coupling regime, the loss landscape will still increase in the directions which directly hurt

...
From my view, the determinism isn't actually the main takeaway from realizing that the loss landscape is stationary; the categorization is. Also, I would argue that there's a huge practical difference between having mesaobjectives that happen to be local minima of the base objective and counting on quirks/stop-gradients. For one, the former is severely constrained in the kinds of objectives it can have (i.e. we can imagine that lots of mesaobjectives, especially more blatant ones, might be harder to couple), which totally rules out some more extreme conceptions of gradient hackers that completely ignore the base objective. Also, convergent gradient hackers are only dangerous out of distribution: in theory, if you had a training distribution that covered 100% of the input space, then convergent gradient hackers would cease to be a thing (being really suboptimal on distribution is bad for the base objective).

There's a drug called Orlistat for treating obesity which works by preventing you from absorbing fats when you eat them. I've heard (somewhat anecdotally) that one of the main effects is forcing you to eat a low fat diet, because otherwise there are quite unpleasant 'gastrointestinal side effects' if you eat a lot of fat. 

Oh you're right! Thanks for catching that. I think I was led astray because I wanted there to be a big payoff for averting the bad event, but I guess the benefit is just not having to pay D.
I'll have a look and see how much this changes things.

Edit: Fixed it up now, none of the conclusions seem to change (which is good because they seemed like common sense!). Thanks for reading this and pointing that out!

Thanks! Yeah, I definitely think that "it's okay to slack today if I pull up the average later on" is a pretty common way people lose productivity. I think one framing could be that if you do have an off day, that doesn't have to put you off track forever, and you can make up for it in the future. 

I make the graphs using the matplotlib xkcd mode. It's super easy to use: you just put your plotting in a "with plt.xkcd():" block.
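A minimal sketch of what that looks like, assuming matplotlib is installed. The data, labels and function name are made up for illustration and are not the ones used in the post:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def make_xkcd_plot():
    """Draw a hand-drawn-style line plot of hypothetical productivity data."""
    days = list(range(10))
    productivity = [5, 6, 5, 2, 6, 7, 6, 7, 8, 7]  # made-up numbers

    # Everything created inside this block gets the xkcd sketch style.
    with plt.xkcd():
        fig, ax = plt.subplots()
        ax.plot(days, productivity)
        ax.set_xlabel("day")
        ax.set_ylabel("productivity")
        ax.set_title("one off day")
    return fig


fig = make_xkcd_plot()
```

Calling `fig.savefig("xkcd_example.png")` afterwards writes the image to disk. matplotlib may warn about a missing comic-style font and fall back to a default one, but the wobbly line style still applies.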

My read of Russell's position is that if we can successfully make the agent uncertain about its model for human preferences then it will defer to the human when it might do something bad, which hopefully solves (or helps with) making it corrigible.

I do agree that this doesn't seem to help with inner-alignment stuff, but I'm still trying to wrap my head around this area.