All of Quintin Pope's Comments + Replies

I don't think this is a strawman. E.g., in How likely is deceptive alignment?, Evan Hubinger says:

We're going to start with simplicity. Simplicity is about specifying the thing that you want in the space of all possible things. You can think about simplicity as “How much do you have to aim to hit the exact thing in the space of all possible models?” How many bits does it take to find the thing that you want in the model space? And so, as a first pass, we can understand simplicity by doing a counting argument, which is just asking, how many models are in ea

... (read more)
[Low importance aside] I think this is equivalent to a well known approximation from algorithmic information theory. I think this approximation might be too lossy in practice in the case of actual neural nets though.
I'm sympathetic to pushing back on counting arguments on the grounds that 'it's hard to know what the exact measure should be, so maybe the measure on the goal of "directly pursue high performance/anything nearly perfectly correlated with the outcome that was reinforced (aka reward)" is comparable/bigger than the measure on "literally any long run outcome"'. So I appreciate the pushback here. I just think the exact argument and the comparison to overfitting is a strawman.

(Note that above I'm assuming a specific goal slot, that the AI's predictions are aware of what its goal slot contains, and that in order for the AI to perform sufficiently well as to be a plausible result of training it has to explicitly "play the training game" (e.g. explicitly reason about and try to get high performance). It also seems reasonable to contest these assumptions, but this is a different thing than the counting argument.)

(Also, if we imagine an RL'd neural network computing a bunch of predictions, then it does seem plausible that it will have a bunch of long horizon predictions with higher aggregate measure than predicting things that perfectly correlate with the outcome that was reinforced (aka reward)! As in, if we imagine randomly sampling a linear probe, it will be far more likely to sample a probe where most of the variance is driven by long run outcomes than to sample a linear probe which is almost perfectly correlated with reward (e.g. a near perfect predictor of reward up to monotone regression). Neural networks are likely to compute a bunch of long range predictions at least as intermediates, but they only need to compute things that nearly perfectly correlate with reward once! (With some important caveats about transfer from other distributions.))

I also think Evan's arguments are pretty sloppy in this presentation and he makes a bunch of object level errors/egregious simplifications FWIW, but he is actually trying to talk about models represented in weight space and how many bit
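
A toy sketch of the random-probe point above, under strong assumptions that are not in the original comment: the network's features are modeled as one "reward" feature plus `n_long_run` independent, unit-variance long-horizon prediction features, and a "random linear probe" is a Gaussian random direction over them. The feature count and the 0.98 threshold (roughly |correlation| > 0.99 with reward) are arbitrary illustrative choices, so this only shows the shape of the measure argument, not anything about real networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_long_run = 50    # assumed number of independent long-horizon features
n_probes = 20_000  # number of random linear probes to sample

# Column 0 is the weight on the reward feature; the rest are long-run features.
weights = rng.standard_normal((n_probes, 1 + n_long_run))

# With independent unit-variance features, the share of a probe's output
# variance explained by the reward feature is w_0^2 / sum_i w_i^2.
var_share_reward = weights[:, 0] ** 2 / np.sum(weights ** 2, axis=1)

# Probes whose variance is mostly driven by long-run features.
frac_mostly_long_run = np.mean(var_share_reward < 0.5)
# Probes that are near-perfect linear predictors of reward (corr^2 > 0.98).
frac_near_perfect_reward = np.mean(var_share_reward > 0.98)

print(frac_mostly_long_run, frac_near_perfect_reward)
```

In this toy setup almost every sampled probe is dominated by long-run features, simply because there are many more of them than there are reward-tracking directions. The result is driven entirely by the assumed feature counts, which is exactly the kind of measure assumption the surrounding discussion is disputing.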

We argue against the counting argument in general (more specifically, against the presumption of a uniform prior as a "safe default" to adopt in the absence of better information). This applies to the hazy counting argument as well. 

We also don't really think there's that much difference between the structure of the hazy argument and the strict one. Both are trying to introduce some form of ~uniformish prior over the outputs of a stochastic AI generating process. The strict counting argument at least has the virtue of being precise about which stochas... (read more)

I agree that you can't adopt a uniform prior. (By uniform prior, I assume you mean something like, we represent goals as functions from world states to a (real) number where the number says how good the world state is, then we take a uniform distribution over this function space. (Uniform sampling from function space is extremely, extremely cursed for analysis related reasons without imposing some additional constraints, so it's not clear uniform sampling even makes sense!))

Separately, I'm also skeptical that any serious historical arguments were actually ... (read more)

How many times has someone expressed "I'm worried about 'goal-directed optimizers', but I'm not sure what exactly they are, so I'm going to work on deconfusion."? There's something weird about this sentiment, don't you think?

IMO, the weird/off thing is that the people saying this don't have sufficient evidence to highlight this specific vibe bundle as being a "real / natural thing that just needs to be properly formalized", rather than there being no "True Name" for this concept, and it turns out to be just another situationally useful high level abstracti... (read more)

Are you claiming that future powerful AIs won't be well described as pursuing goals (aka being goal-directed)? This is the read I get from the "dragon" analogy you mention, but this can't possibly be right because AI agents are already obviously well described as pursuing goals (perhaps rather stupidly). TBC the goals that current AI agents end up pursuing are instructions in natural language, not something more exotic.

(As far as I can tell the word "optimizer" in "goal-directed optimizer" is either meaningless or redundant, so I'm ignoring that.)

Perhaps ... (read more)

The "alignment techniques generalise across human contributions to architectures" point isn't about the SLT threat model. It's about the "AIs do AI capabilities research" threat model.

Thane Ruthenis (karma 4, 2mo):
Not sure what the relevance is? I don't believe that "we possess innate (and presumably God-given) concepts that are independent of the senses", to be clear. "Children won't be able to instantly understand how to parse a new sense and map its feedback to the sensory modalities they've previously been familiar with, but they'll grok it really fast with just a few examples" was my instant prediction upon reading the titular question.

(Didn't consult Nora on this; I speak for myself)

I only briefly skimmed this response, and will respond even more briefly.

Re "Re: "AIs are white boxes""

You apparently completely misunderstood the point we were making with the white box thing. It has ~nothing to do with mech interp. It's entirely about whitebox optimization being better at controlling stuff than blackbox optimization. This is true even if the person using the optimizers has no idea how the system functions internally. 

Re: "Re: "Black box methods are sufficient"" (and the other stuff ab... (read more)

So it seems both "sides" are symmetrically claiming misunderstanding/miscommunication from the other side, after some textual efforts to bridge the gap have been made. Perhaps an actual realtime convo would help? Disagreement is one thing, but symmetric miscommunication and increasing tones of annoyance seem avoidable here.

Perhaps Nora's/your planned future posts going into more detail regarding counters to pessimistic arguments will be able to overcome these miscommunications, but this pattern suggests not.

Also I'm not so sure this pattern of "it's better to skim and say something half-baked rather than not read or react at all" is helpful, rather than actively harmful in this case. At least, maybe 3/4 baked or something might be better? Miscommunications and unwillingness to thoroughly engage are only snowballing.

I also could be wrong in thinking such a realtime convo hasn't happened.
Yes, but you were arguing for that using examples of "morally evaluating" and "grokking the underlying simple moral rule", not of caring.

You apparently completely misunderstood the point we were making with the white box thing.


I think you need to taboo the term "white box" and come up with a new term that will result in less confusion and fewer people talking past each other.

the gears to ascension (karma 5, 3mo):
One major intuition pump I think is important: evolution doesn't get to evaluate everything locally. Gradient descent does. As a result, evolution is slow to eliminate useless junk, though it does do so eventually. Gradient descent is so eager to do it that we call it catastrophic forgetting. Gradient descent wants to use everything in the system for whatever it's doing, right now. I disagree with the optimists that this makes it trivial, because to me it appears that the dynamics that make short term misalignment likely are primarily organizational among humans: the incentives of competition between organizations and individual humans. Also, RL-first AIs will inline those dynamics much faster than RLHF can get them out.

I think training such an AI to be really good at chess would be fine. Unless "Then apply extreme optimization pressure for never losing at chess." means something like "deliberately train it to use a bunch of non-chess strategies to win more chess games, like threatening opponents, actively seeking out more chess games in real life, etc", then it seems like you just get GPT-5 which is also really good at chess. 

In retrospect, the example I used was poorly specified. It wouldn't surprise me if the result of the literal interpretation was "the AI refuses to play chess" rather than any kind of worldeating. The intent was to pick a sparse/distant reward that doesn't significantly constrain the kind of strategies that could develop, and then run an extreme optimization process on it. In other words, while intermediate optimization may result in improvements to chess playing, being better at chess isn't actually the most reliable accessible strategy to "never lose at chess" for that broader type of system and I'd expect superior strategies to be found in the limit of optimization.

I really don't like that you've taken this discussion to Twitter. I think Twitter is really a much worse forum for talking about complex issues like this than LW/AF.

I haven't "taken this discussion to Twitter". Joe Carlsmith posted about the paper on Twitter. I saw that post, and wrote my response on Twitter. I didn't even know it was also posted on LW until later, and decided to repost the stuff I'd written on Twitter here. If anything, I've taken my part of the discussion from Twitter to LW. I'm slightly baffled and offended that you seem to be platform-... (read more)


If anything, I've taken my part of the discussion from Twitter to LW.

Good point. I think I'm misdirecting my annoyance here; I really dislike that there's so much alignment discussion moving from LW to Twitter, but I shouldn't have implied that you were responsible for that—and in fact I appreciate that you took the time to move this discussion back here. Sorry about that—I edited my comment.

And my response is that I think the model pays a complexity penalty for runtime computations (since they translate into constraints on parameter values which are

... (read more)

Reposting my response on Twitter (To clarify, the following was originally written as a Tweet in response to Joe Carlsmith's Tweet about the paper, which I am now reposting here):

I just skimmed the section headers and a small amount of the content, but I'm extremely skeptical. E.g., the "counting argument" seems incredibly dubious to me because you can just as easily argue that text to image generators will internally create images of llamas in their early layers, which they then delete, before creating the actual asked for image in the later layers. There

... (read more)

(Partly re-hashing my response from twitter.)

I'm seeing your main argument here as a version of what I call, in section 4.4, a "speed argument against schemers" -- e.g., basically, that SGD will punish the extra reasoning that schemers need to perform. 

(I’m generally happy to talk about this reasoning as a complexity penalty, and/or about the params it requires, and/or about circuit-depth -- what matters is the overall "preference" that SGD ends up with. And thinking of this consideration as a different kind of counting argument *against* schemers see... (read more)


I really don't like all this discussion happening on Twitter, and I appreciate that you took the time to move this back to LW/AF instead. I think Twitter is really a much worse forum for talking about complex issues like this than LW/AF.

Regardless, some quick thoughts:

[have some internal goal x] [backchain from wanting x to the stuff needed to get x (doing well at training)] [figure out how to do well at training] [actually do well at training]

and in comparison, the "honest" / direct solution looks like:

[figure out how to do well at training] [actually d

... (read more)

This is a great post! Thank you for writing it.

There's a huge amount of ontological confusion about how to think of "objectives" for optimization processes. I think people tend to take an inappropriate intentional stance and treat something like "deliberately steering towards certain abstract notions" as a simple primitive (because it feels introspectively simple to them). This background assumption casts a shadow over all future analysis, since people try to abstract the dynamics of optimization processes in terms of their "true objectives", when there re... (read more)

I think this is really lucid and helpful:
Jonas Hallgren (karma -4, 4mo):
Well, I'm happy to be a European citizen in that case, lol. I really walked into that one.

I strong downvoted and strong disagree voted. The reason I did both is because I think what you're describing is a genuinely insane standard to take for liability. Holding organizations liable for any action they take which they do not prove is safe is an absolutely terrible idea. It would either introduce enormous costs for doing anything, or allow anyone to be sued for anything they've previously done.

I think you're talking past each other. I interpret Nathan as saying that he could prove that everyone on earth has been harmed, but that he couldn't do that in a safe manner.
Nathan Helm-Burger (karma 4, 4mo):
Thanks Quintin. That's useful. I think the general standard of holding organizations liable for any action which they do not prove to be safe is indeed a terrible idea. I do think that certain actions may carry higher implicit harms, and should be held to a higher standard of caution. Perhaps you, or others, will give me your opinions on the following list of actions. Where is a good point, in your opinion, to 'draw the line'? Starting from what I would consider 'highly dangerous and much worse than Llama2' and going down to 'less dangerous than Llama2', here are some related actions.

1. Releasing for free a software product onto the internet explicitly engineered to help create bio-weapons. Advertising this product as containing not only necessary gene sequences and lab protocols, but also explicit instructions for avoiding government and organization safety screens. Advertising that this product shows multiple ways to create your desired bio-weapon, including using your own lab equipment or deceiving Contract Research Organizations into unwittingly assisting you.
2. Releasing the same software product with the same information, but not mentioning to anyone what it is intended for. Because it is a software product, rather than an ML model, the information is all correct and not mixed in with hallucinations. The user only needs to navigate to the appropriate part of the app to get the information, rather than querying a language model.
3. Releasing a software product not actively intended for the above purposes, but which does happen to contain all that information and can incidentally be used for those purposes.
4. Releasing an LLM which was trained on a dataset containing this information, and can regurgitate the information accurately. Furthermore, tuning this LLM before release to be happy to help users with requests to carry out a bio-weapons project, and double-checking to make sure that the information given is accurate.
5. Releasing an LLM that happens
Also given the rest of the replies, I think he means that it would be challenging for a plaintiff to safely prove that Llama 2 enables terrorists to make bioweapons, not that the alleged harm is "making open-source AI without proving it safe" or such.
FWIW the heavy negative karma means that his answer is hidden by default, so that readers can't easily see why OP thinks it might make sense to bring a lawsuit against Meta, which seems bad.

I really don't want to spend even more time arguing over my evolution post, so I'll just copy over our interactions from the previous times you criticized it, since that seems like context readers may appreciate.

In the comment sections of the original post:

Your comment

[very long, but mainly about your "many other animals also transmit information via non-genetic means" objection + some other mechanisms you think might have caused human takeoff]

My response

I don't think this objection matters for the argument I'm making. All the cross-generational informatio

... (read more)

I'll try to keep it short

All the cross-generational information channels you highlight are at rough saturation, so they're not able to contribute to the cross-generational accumulation of capabilities-promoting information.

This seems clearly contradicted by empirical evidence. Mirror neurons would likely be able to saturate what you assume is the brain's learning rate, so not transferring more learned bits is much more likely because the marginal cost of doing so is higher than that of other sensible options. Which is a different reason than "saturated, at capac... (read more)

Jonas Hallgren (karma 3, 5mo):
Isn't there an alternative story here where we care about the sharp left turn, but in the cultural sense, similar to Drexler's CAIS where we have similar types of experimentation as happened during the cultural evolution phase?  You've convinced me that the sharp left turn will not happen in the classical way that people have thought about it, but are you that certain that there isn't that much free energy available in cultural style processes? If so, why? I can imagine that there is something to say about SGD already being pretty algorithmically efficient, but I guess I would say that determining how much available free energy there is in improving optimisation processes is an open question. If the error bars are high here, how can we then know that the AI won't spin up something similar internally?  I also want to add something about genetic fitness becoming twisted as a consequence of cultural evolutionary pressure on individuals. Culture in itself changed the optimal survival behaviour of humans, which then meant that the meta-level optimisation loop changed the underlying optimisation loop. Isn't the culture changing the objective function still a problem that we have to potentially contend with, even though it might not be as difficult as the normal sharp left turn? For example, let's say that we deploy GPT-6 and it figures out that in order to solve the loosely defined objective that we have determined for it using (Constitutional AI)^2 should be discussed by many different iterations of itself to create a democratic process of multiple COT reasoners. This meta-process seems, in my opinion, like something that the cultural evolution hypothesis would predict is more optimal than just one GPT-6, and it also seems a lot harder to align than normal? 

I think this post greatly misunderstands mine. 

Firstly, I'd like to address the question of epistemics. 

When I said "there's no reason to reference evolution at all when forecasting AI development rates", I was referring to two patterns of argument that I think are incorrect: (1) using the human sharp left turn as evidence for an AI sharp left turn, and (2) attempting to "rescue" human evolution as an informative analogy for other aspects of AI development.

(Note: I think Zvi did follow my argument for not drawing inferences about the odds of the ... (read more)

I realize this accidentally sounds like it's saying two things at once (that autonomous learning relies on the generator-discriminator gap of the domain, and then that it relies on the gap for the specific agent (or system in general)). I think it's the agent's capabilities that matter, that the domain determines how likely the agent is to have a persistent gap between generation and discrimination, and I don't think the (basic) dynamics are too difficult.

You start with a model M and initial data distribution D. You train M on D such that M is now a mod

... (read more)
On the additional commentary section: On the first section, we disagree on the degree of similarity in the metaphors. I agree with you that we shouldn't care about 'degree of similarity' and instead build causal models. I think our actual disagreements here lie mostly in those causal models, the unpacking of which goes beyond comment scope. I agree with the very non-groundbreaking insights listed, of course, but that's not what I'm getting out of it. It is possible that some of this is that a lot of what I'm thinking of as evolutionary evidence, you're thinking of as coming from another source, or is already in your model in another form to the extent you buy the argument (which often I am guessing you don't).  On the difference in SLT meanings, what I meant to say was: I think this is sufficient to cause our alignment properties to break.  In case it is not clear: My expectation is that sufficiently large capabilities/intelligence/affordances advances inherently break our desired alignment properties under all known techniques.  On the passage you find baffling: Ah, I do think we had confusion about what we meant by inner optimizer, and I'm likely still conflating the two somewhat. That doesn't change me not finding this heartening, though? As in, we're going to see rapid big changes in both the inner optimizer's power (in all senses) and also in the nature and amount of training data, where we agree that changing the training data details changes alignment outcomes dramatically.   On the impossible-to-you world: This doesn't seem so weird or impossible to me? And I think I can tell a pretty easy cultural story slash write an alternative universe novel where we honor those who maximize genetic fitness and all that, and have for a long time - and that this could help explain why civilization and our intelligence developed so damn slowly and all that. Although to truly make the full evidential point that world then has to be weirder still where humans are much
On concrete example 2: I see four bolded claims in 'fast takeoff is still possible.' Collectively, to me, in my lexicon and way of thinking about such things, they add up to something very close to 'alignment is easy.'  The first subsection says human misalignment does not provide evidence for AI misalignment, which isn't one of the two mechanisms (as I understand this?), and is instead arguing against an alignment difficulty. The bulk of the second subsection, starting with 'Let’s consider eight specific alignment techniques,' looks to me like an explicit argument that alignment looks easy based on your understanding of the history from AI capabilities and alignment developments so far?  The third subsection seems to also spend most of its space on arguing its scenario would involve manageable risks (e.g. alignment being easy), although you also argue that evolution/culture still isn't 'close enough' to teach us anything here?  I can totally see how these sections could have been written out with the core intention to explain how distinct-from-evolution mechanisms could cause fast takeoffs. From my perspective as a reader, I think my response and general takeaway that this is mostly an argument for easy alignment is reasonable on reflection, even if that's not the core purpose it serves in the underlying structure, and it's perhaps not a fully general argument. On concrete example 3: I agree that what I said was a generalization of what you said, and you instead said something more specific. And that your later caveats make it clear you are not so confident that things will go smoothly in the future. So yes I read this wrong and I'm sorry about that. But also I notice I am confused here - if you didn't mean for the reader to make this generalization, if you don't think that failure of current capabilities advances to break current alignment techniques isn't strong evidence for future capabilities advances not breaking then-optimal alignment techniques, then w

On Quintin's secondly's concrete example 1 from above:

I think the core disagreement here is that Quintin thinks that you need very close parallels in order for the evolutionary example to be meaningful, and I don't think that at all. And neither of us can fully comprehend why the other person is going with as extreme a position as we are on that question? 

Thus, he says, yes of course you do not need all those extra things to get misalignment, I wasn't claiming that, all I was saying was this would break the parallel. And I'm saying both (1) that misal... (read more)

(Writing at comment-speed, rather than carefully-considered speed, apologies for errors and potential repetitions, etc)

On the Evo-Clown thing and related questions in the Firstly section only.

I think we understand each other on the purpose of the Evo-Clown analogy, and I think it is clear what our disagreement is here in the broader question?

I put in the paragraph Quintin quoted in order to illustrate that, even in an intentionally-absurd example intended to illustrate that A and B share no causal factors, A and B still share clear causal factors, and the ... (read more)

In general one should not try to rescue intuitions, and the frequency of doing this is a sign of serious cognitive distortions. You should only try to rescue intuitions when they have a clear and validated predictive or pragmatic track record. The reason for this is very simple: most intuitions or predictions one could make are wrong, and you need a lot of positive evidence to privilege any particular hypothesis about how or what to think. In the absence of evidence, you should stop relying on an intuition, or at least hold it very lightly.

Thank you for the very detailed and concrete response. I need to step through this slowly to process it properly and see the extent to which I did misunderstand things, or places where we disagree.

Addressing this objection is why I emphasized the relatively low information content that architecture / optimizers provide for minds, as compared to training data. We've gotten very far in instantiating human-like behaviors by training networks on human-like data. I'm saying the primacy of data for determining minds means you can get surprisingly close in mindspace, as compared to if you thought architecture / optimizer / etc were the most important.

Obviously, there are still huge gaps between the sorts of data that an LLM is trained on versus the implici... (read more)

I believe the human visual cortex is actually the more relevant comparison point for estimating the level of danger we face due to mesaoptimization. Its training process is more similar to the self-supervised / offline way in which we train (base) LLMs. In contrast, 'most abstract / "psychological"' are more entangled in future decision-making. They're more "online", with greater ability to influence their future training data.

I think it's not too controversial that online learning processes can have self-reinforcing loops in them. Crucially however, such ... (read more)

I'm guessing you misunderstand what I meant when I referred to "the human learning process" as the thing that was a ~ 1 billion X stronger optimizer than evolution and responsible for the human SLT. I wasn't referring to human intelligence or what we might call human "in-context learning". I was referring to the human brain's update rules / optimizer: i.e., whatever quasi-Hebbian process the brain uses to minimize sensory prediction error, maximize reward, and whatever else factors into the human "base objective". I was not referring to the intelligences t... (read more)

Thomas Larsen (karma 5, 5mo):
Thanks for the response!

I think I understand these points, and I don't see how this contradicts what I'm saying. I'll try rewording.

Consider the following Gaussian process (figure not reproduced here):

Each blue line represents a possible fit of the training data (the red points), and so which one of these is selected by a learning process is a question of inductive bias. I don't have a formalization, but I claim: if your data-distribution is sufficiently complicated, by default, OOD generalization will be poor.

Now, you might ask, how is this consistent with capabilities generalizing? I note that they haven't generalized all that well so far, but once they do, it will be because the learned algorithm has found exploitable patterns in the world and methods of reasoning that generalize far OOD.

You've argued that there are different parameter-function maps, so evolution and NNs will generalize differently. This is of course true, but I think it's beside the point. My claim is that doing selection over a dataset with sufficiently many proxies that fail OOD without a particularly benign inductive bias leads (with high probability) to the selection of a function that fails OOD. Since most generalizations are bad, we should expect that we get bad behavior from NN behavior as well as evolution. I continue to think evolution is valid evidence for this claim, and the specific inductive bias isn't load bearing on this point -- the related load bearing assumption is the lack of an inductive bias that is benign.

If we had reasons to think that NNs were particularly benign and that once NNs became sufficiently capable, their alignment would also generalize correctly, then you could make an argument that we don't have to worry about this, but as yet, I don't see a reason to think that a NN parameter function map is more likely to lead to inductive biases that pick a good generalization by default than any other set of inductive biases.

It feels to me as if your argument is that we unders
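
The Gaussian-process picture in the comment above can be sketched numerically. This is a minimal illustration under assumptions that are not in the original: an RBF kernel with unit length scale, five noiseless training points on a sine curve, and three query points far from the training range standing in for "OOD". All names (`rbf_kernel`, `sample_posterior`, `x_train`, `x_ood`) are hypothetical. It shows only that many functions can agree on the training data while disagreeing away from it, and says nothing about which one a given learning process would select.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel matrix between two 1-D input arrays."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def sample_posterior(x_train, y_train, x_query, n_samples=20, jitter=1e-6, seed=0):
    """Draw functions from a zero-mean GP conditioned on the training points."""
    rng = np.random.default_rng(seed)
    k_tt = rbf_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    k_qq = rbf_kernel(x_query, x_query)
    mean = k_qt @ np.linalg.solve(k_tt, y_train)
    cov = k_qq - k_qt @ np.linalg.solve(k_tt, k_qt.T)
    # Sample via an eigendecomposition, clipping tiny negative eigenvalues
    # that arise from round-off in the near-singular posterior covariance.
    eigvals, eigvecs = np.linalg.eigh(0.5 * (cov + cov.T))
    scale = np.sqrt(np.clip(eigvals, 0.0, None))
    noise = rng.standard_normal((n_samples, len(x_query)))
    return mean + (noise * scale) @ eigvecs.T

x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # the "red points"
y_train = np.sin(x_train)
x_ood = np.array([6.0, 7.0, 8.0])                # far from the training range

# Each row of `samples` is one "blue line" evaluated at train and OOD inputs.
samples = sample_posterior(x_train, y_train, np.concatenate([x_train, x_ood]))
in_spread = samples[:, :5].std(axis=0).mean()   # disagreement on training inputs
ood_spread = samples[:, 5:].std(axis=0).mean()  # disagreement far from the data

print(in_spread, ood_spread)
```

The sampled functions nearly coincide at the training inputs and spread out to roughly the prior's scale at the distant inputs. Which of them a real training process picks is the inductive-bias question the comment is raising.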

It doesn't mention the literal string "gradient descent", but it clearly makes reference to the current methodology of training AI systems (which is gradient descent). E.g., here:

The techniques OpenMind used to train it away from the error where it convinces itself that bad situations are unlikely? Those generalize fine. The techniques you used to train it to allow the operators to shut it down? Those fall apart, and the AGI starts wanting to avoid shutdown, including wanting to deceive you if it’s useful to do so.

The implication is that the dangerous beha... (read more)

1Max H5mo
That's not what I'm criticizing you for. I elaborated a bit more here; my criticism is that this post sets up a straw version of the SLT argument to knock down, of which assuming it applies narrowly to "spiky" capability gains via SGD is one example. The actual SLT argument is about a capabilities regime (human-level+), not a specific method for reaching it or how many OOM of optimization power are applied before or after. The reasons to expect a phase shift in such a regime are because (by definition) a human-level AI is capable of reflection, deception, having insights that no other human has had before (as current humans sometimes do), etc. Note, I'm not saying that setting up a strawman automatically invalidates all the rest of your claims, nor that you're obligated to address every possible kind of criticism. But I am claiming that you aren't passing the ITT of someone who accepts the original SLT argument (and probably its author). But it does mean you can't point to support for your claim that evolution provides no evidence for the Pope!SLT, as support for the claim that evolution provides no evidence for the Soares!SLT, and expect that to be convincing to anyone who doesn't already accept that  Pope!SLT == Soares!SLT.

I've recently decided to revisit this post. I'll try to address all un-responded to comments in the next ~2 weeks.

Part of this is just straight disagreement, I think; see So8res's Sharp Left Turn and follow-on discussion.

Evolution provides no evidence for the sharp left turn

But for the rest of it, I don't see this as addressing the case for pessimism, which is not problems from the reference class that contains "the LLM sometimes outputs naughty sentences" but instead problems from the reference class that contains "we don't know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a di

... (read more)

There was an entire thread about Yudkowsky's past opinions on neural networks, and I agree with Alex Turner's evidence that Yudkowsky was dubious. 

I also think people who used brain analogies as the basis for optimism about neural networks were right to do so.

Roughly, the core distinction between software engineering and computer security is whether the system is thinking back.

Yes, and my point in that section is that the fundamental laws governing how AI training processes work are not "thinking back". They're not adversaries. If you created a misaligned AI, then it would be "thinking back", and you'd be in an adversarial position where security mindset is appropriate.

What's your story for specification gaming?

"Building an AI that doesn't game your specifications" is the actual "alignment question" we should b... (read more)

Ok, it sounds to me like you're saying: "When you train ML systems, they game your specifications because the training dynamics are too dumb to infer what you actually want. We just need One Weird Trick to get the training dynamics to Do What You Mean Not What You Say, and then it will all work out, and there's not a demon that will create another obstacle given that you surmounted this one." That is, training processes are not neutral; there's the bad training processes that we have now (or had before the recent positive developments) and eventually will be good training processes that create aligned-by-default systems. Is this roughly right, or am I misunderstanding you?

If you created a misaligned AI, then it would be "thinking back", and you'd be in an adversarial position where security mindset is appropriate.

Cool, we agree on this point.

my point in that section is that the fundamental laws governing how AI training processes work are not "thinking back". They're not adversaries.

I think we agree here on the local point but disagree on its significance to the broader argument. [I'm not sure how much we agree-I think of training dynamics as 'neutral', but also I think of them as searching over program-space in order to fi... (read more)

It can be induced on MNIST by deliberately choosing worse initializations for the model, as Omnigrok demonstrated.

Got it, thanks!

Re empirical evidence for influence functions:

Didn't the Anthropic influence functions work pick up on LLMs not generalising across lexical ordering? E.g., training on "A is B" doesn't raise the model's credence in "Bs include A"?

Which is apparently true:

2Fabien Roger5mo
That's an exciting experimental confirmation! I'm looking forward to more predictions like those. (I'll edit the post to add it, as well as future external validation results.)

I think you're missing something regarding David's contribution:

Yes, this reasoning was for capabilities benchmarks specifically. Data goes further with future algorithmic progress, so I thought a narrower criterion for that one was reasonable.

So, you are deliberately targeting models such as LLama-2, then? Searching HuggingFace for "Llama-2" currently brings up 3276 models. As I understand the legislation you're proposing, each of these models would have to undergo government review, and the government would have the perpetual capacity to arbitrarily pull the plug on any of them.

I expect future small, open-source... (read more)

(ETA: these are my personal opinions) 


  1. We're going to make sure to exempt existing open source models. We're trying to avoid pushing the frontier of open source AI, not trying to put the models that are already out there back in the box, which I agree is intractable. 
  2. These are good points, and I decided to remove the data criteria for now in response to these considerations. 
  3. The definition of frontier AI is wide because it describes the set of models that the administration has legal authority over, not the set of models that would be r
... (read more)
5Nora Belrose6mo
It's more than misleading, it's simply a lie, at least insofar as developers outside of Google, OpenAI, and co. use the Llama 2 models.

Your current threshold does include all Llama models (other than llama-1 6.7/13 B sizes), since they were trained with > 1 trillion tokens. 

I also think 70% on MMLU is extremely low, since that's about the level of ChatGPT 3.5, and that system is very far from posing a risk of catastrophe. 

The cutoffs also don't differentiate between sparse and dense models, so there's a fair bit of non-SOTA-pushing academic / corporate work that would fall under these cutoffs.

Very far in qualitative capability or very far in effective flop? I agree on the qualitative capability, but disagree on the effective flop. It seems quite plausible (say 5%) that models with only 1,000x more training compute than GPT-3.5 pose a risk of catastrophe. This would be GPT-5.

Your current threshold does include all Llama models (other than llama-1 6.7/13 B sizes), since they were trained with > 1 trillion tokens. 

Yes, this reasoning was for capabilities benchmarks specifically. Data goes further with future algorithmic progress, so I thought a narrower criterion for that one was reasonable.

I also think 70% on MMLU is extremely low, since that's about the level of ChatGPT 3.5, and that system is very far from posing a risk of catastrophe. 

This is the threshold for what the government has the ability to say no to, an... (read more)

Out of curiosity, I skimmed the Ted Gioia linked article and encountered this absolutely wild sentence:

AI is getting more sycophantic and willing to agree with false statements over time.

which is just such a complete misunderstanding of the results from Discovering Language Model Behaviors with Model-Written Evaluations. Instantly disqualified the author from being someone I'd pay attention to for AI-related analysis. 

Perhaps, but that's not the literal meaning of the text.

Here’s what we now know about AI:

  • [...]
  • AI potentially creates a situation where millions of people can be fired and replaced with bots [...]
Yes, but that "generative AI can potentially replace millions of jobs" is not contradictory to the statement that it eventually "may turn out to be a dud". I initially reacted in the same way as you to the exact same passage but came to the conclusion that it was not illogical. Maybe I’m wrong but I don’t think so.
2Bill Benzon6mo
You're right, and I don't know what Gioia would say if pressed. But it might be something like: "Millions of people will be replaced by bots and then the businesses will fall apart because the bots don't behave as advertised. So now millions are out of jobs and the businesses that used to employ them are in trouble."

Seems contradictory to argue both that generative AI is useless and that it could replace millions of jobs.

1Bill Benzon6mo
You're probably right. I note, however, that this is territory that's not been well-charted. So it's not obvious to me just what to make of the inconsistency. It doesn't (strongly) contradict Gioia's main point, which is that LLMs seem to be in trouble in the commercial sphere.
I think the author meant that there was a perception that it could replace millions of jobs, and so an incentive for businesses to press forward with their implementation plans, but that this would eventually backfire if the hallucination problem is insoluble.

My assumption is that GPT-4 has a repetition penalty, so if you make it predict all the same phrase over and over again, it puts almost all its probability on a token that the repetition penalty prevents it from saying, with the leftover probability acting similarly to a max entropy distribution over the rest of the vocab.

This happens with GPT-3.5 too, BTW.
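A minimal sketch of the hypothesis above, assuming the penalty is an explicit subtractive bias on the logits of recently emitted tokens. GPT-4's actual sampling setup is not public, so the penalty form, the toy vocabulary, and every name below are illustrative assumptions. It shows how a near-certain prediction, once suppressed, leaves a distribution over the remaining tokens that is close to uniform.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_repetition_penalty(logits, recent_tokens, penalty):
    """Subtract a fixed penalty from the logit of every recently emitted token.
    (Assumed form of the penalty; real systems may differ.)"""
    banned = set(recent_tokens)
    return [x - penalty if i in banned else x for i, x in enumerate(logits)]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy vocabulary of 6 tokens; token 0 is the phrase being repeated,
# so the model puts almost all of its probability there.
logits = [12.0, 0.3, 0.1, 0.0, -0.1, -0.2]

before = softmax(logits)
after = softmax(apply_repetition_penalty(logits, recent_tokens=[0], penalty=30.0))

print("P(token 0) before penalty:", before[0])
print("P(token 0) after penalty: ", after[0])
print("entropy after penalty:", entropy(after), "vs. uniform over 5 tokens:", math.log(5))
```

In this toy setting the post-penalty distribution is whatever the model assigned to its low-probability alternatives, which is why it looks nearly maximum-entropy over the rest of the vocabulary.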
1Kshitij Sachan7mo
By repetition penalty do you mean an explicit logit bias when sampling or internally it's generalized to avoiding repeated tokens?

Here's the sycophancy graph from Discovering Language Model Behaviors with Model-Written Evaluations:

For some reason, the LW memesphere seems to have interpreted this graph as indicating that RLHF increases sycophancy, even though that's not at all clear from the graph. E.g., for the largest model size, the base model and the preference model are the least sycophantic, while the RLHF'd models show no particular trend among themselves. And if anything, the 22B models show decreasing sycophancy with RLHF steps.

What this graph actually shows is increasing syc... (read more)

Actually, Towards Understanding Sycophancy in Language Models presents data supporting the claim that RL training can intensify sycophancy. E.g., from Figure 6:

I mean (1). You can see as much in the figure displayed in the linked notebook:

Note the lack of decrease in the val loss.

I only train for 3e4 steps because that's sufficient to reach generalization with implicit regularization. E.g., here's the loss graph I get if I set the batch size down to 50:

Setting the learning rate to 7e-2 also allows for generalization within 3e4 steps (though not as stably):

The slingshot effect does take longer than 3e4 steps to generalize:

1Eric J. Michaud7mo
Huh those batch size and learning rate experiments are pretty interesting!

I don't think that explicitly aiming for grokking is a very efficient way to improve the training of realistic ML systems. Partially, this is because grokking definitionally requires that the model first memorize the data, before then generalizing. But if you want actual performance, then you should aim for immediate generalization.

Further, methods of hastening grokking generalization largely amount to standard ML practices such as tuning the hyperparameters, initialization distribution, or training on more data.  

This post mainly argues that evolution does not provide evidence for the sharp left turn. Sudden capabilities jumps from other sources, such as those you mention, are more likely, IMO. My first reply to your comment is arguing that the mechanisms behind the human sharp left turn wrt evolution probably still won't arise in AI development, even if you go up an abstraction level. One of those mechanisms is a 5 - 9 OOM jump in usable optimization power, which I think is unlikely.

Am I missing something here, or is this just describing memetics?

It is not describing memetics, which I regard as a mostly confused framework that primes people to misattribute the products of human intelligence to "evolution". However, even if evolution meaningfully operates on the level of memes, the "Evolution" I'm referring to when I say "Evolution applies very little direct optimization power to the middle level" is strictly biological evolution over the genome, not memetic at all. 

Memetic evolution in this context would not have inclusive geneti... (read more)

1Cornelius Dybdahl10mo
New memes may arise either by being mutated from other memes or by invention ex nihilo - either of which involves some degree of human intelligence. However, if a meme becomes prevalent, it is not because all of its holders have invented it independently. It has rather spread because it is adapted both to the existing memetic ecosystem as well as to human intelligence. Of course, if certain memes reduce the likelihood of reproduction, that provides an evolutionary pressure for human intelligence to change to be more resistant to that particular kind of meme, so there are very complex interactions. It is not a confused framework - at least not inherently - and it does not require us to ignore the role of human intelligence. My argument is that evolution selects simultaneously for genetic and memetic fitness, and that both genes and memes tend to be passed on from parent to child. Thus, evolution operates at a combined genetic-memetic level where it optimizes for inclusive genetic-memetic fitness. Though genes and memes correspond to entirely different mediums, they interact in complex ways when it comes to evolutionary fitness, so the mechanisms are not that straightforwardly separable. In addition, there are social network effects and geographic localization influencing what skills people are likely to acquire, such that skills have a tendency to be heritable in a manner that is not easily reducible to genetics, but which nevertheless influences evolutionary fitness. If we look aside from the fact that memes and skills can be transferred in manners other than heredity, then we can sorta model them as an extended genome. But the reason we can say that it is bad for humans to become addicted to ice cream is because we have an existing paradigm that provides us with a deep understanding of nutrition, and even here, subtle failures in the paradigm have notoriously done serious harms. Do you regard our understanding of morality as more reliable than our understanding
-1Ben Pace10mo
These don’t seem like very relevant counterarguments; I think literally all are from people who believe that AGI is an extinction-level threat soon facing our civilization. Perhaps you mean “>50% of extinction-level bad outcomes” but I think that the relevant alternative viewpoint that would calm someone is not that the probability is only 20% or something, but is “this is not an extinction-level threat and we don’t need to be worried about it”, for which I have seen no good argument (that engages seriously with any misalignment concerns).
The strongest argument against AI doom I can imagine runs as follows: AI can kill all humans for two main reasons: (a) to prevent a threat to itself and (b) to get humans' atoms. But: (a) AI will not kill humans as a threat before it creates powerful human-independent infrastructure (nanotech), as in that case it will run out of electricity, etc. AI will also not kill humans after it creates nanotech, as we can't destroy nanotech (even with nukes). Thus, AI will not kill humans to prevent a threat either before or after nanotech, so it will never happen for this reason. (b) Human atoms constitute 10E-24 of all atoms in the Solar System. Humans may have small instrumental value for trade with aliens, for some kinds of work, or as training data sources. Even a small instrumental value of humans will be larger than the value of their atoms, as the value of atoms is very, very small. Humans will not be killed for atoms. Thus humans will not be killed either as a threat or for atoms. But there are other ways an AI catastrophe could kill everybody: a wrongly aligned AI performs wireheading, a Singleton halts, or there is a war between several AIs. None of these risks is a necessary outcome, but together they have high probability mass.
3Christopher King10mo
To clarify, I'm thinking mostly about the strength of the strongest counter-argument, not the quantity of counter-arguments. But yes, what counts as a strong argument is a bit subjective and a continuum. I wrote this post because none of the counter-arguments I know of are strong enough to be "strong" by my standards. Personally my strongest counter-argument is "humanity actually will recognize the x-risk in time to take alignment seriously, delaying the development of ASI if necessary", but even that isn't backed up by too much evidence (the only previous example I know of is when we avoided nuclear holocaust).

Some counter evidence:

  • Kernelized Concept Erasure: concept encodings do have nonlinear components. Nonlinear kernels can erase certain parts of those encodings, but they cannot prevent other types of nonlinear kernels from extracting concept info from other parts of the embedding space.
  • Limitations of the NTK for Understanding Generalization in Deep Learning: the neural tangent kernels of realistic neural networks continuously change throughout their training. Further, neither the initial kernels nor any of the empirical kernels from mid-training can reprodu
... (read more)
Thanks for these links! This is exactly what I was looking for as per Cunningham's law. For the mechanistic mode connectivity, I still need to read the paper, but there is definitely a more complex story relating to the symmetries rendering things non-connected by default but once you account for symmetries and project things into an isometric space where all the symmetries are collapsed things become connected and linear again. Is this different to that?   I agree about the NTK. I think this explanation is bad in its specifics although I think the NTK does give useful explanations at a very coarse level of granularity. In general, to put a completely uncalibrated number on it, I feel like NNs are probably '90% linear' in their feature representations. Of course they have to have somewhat nonlinear representations as well. But otoh if we could get 90% of the way to features that would be massive progress and might be relatively easy. 

The description complexity of hypotheses AIXI considers is dominated by the bridge rules which translate from 'physical laws of universes' to 'what am I actually seeing?'. To conclude Newtonian gravity, AIXI must not only infer the law of gravity, but also that there is a camera, that it's taking a photo, that this is happening on an Earth-sized planet, that this planet has apples, etc. These beliefs are much more complex than the laws of physics. 

One issue with AIXI is that it applies a uniform complexity penalty to both physical laws and bridge rule... (read more)

That is a good point. But bridging laws probably aren't that complex. At least, not for inferring the basic laws of physics. How many things on the order of Newtonian physics do you need? A hundred? A thousand? That could plausibly fit into a few megabytes. So it seems plausible that you could have GR + QFT and a megabyte of bridging laws plus some other data to specify local conditions and so on. And if you disagree with that, then how much data do you think AIXI would need? Let's say you're talking about a video of an apple falling in a forest with the sky and ground visible. How much data would you need, then? 1GB? 1TB? 1PB? I think 1GB is also plausible, and I'd be confused if you said 1TB.

Autonomous learning basically requires there to be a generator-discriminator gap in the domain in question, i.e., that the agent trying to improve its capabilities in said domain has to be better able to tell the difference between its own good and bad outputs. If it can do so, it can just produce a bunch of outputs, score their goodness, and train / reward itself on its better outputs. In both situations you note (AZ and human mathematicians) there's such a gap, because game victories and math results can both be verified relatively more easily than they ... (read more)
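The loop described above (produce a bunch of outputs, score them, train on the better ones) can be sketched in a toy setting where checking an answer is much cheaper than producing it. This is only an illustration under stated assumptions: the "generator" is a Gaussian guesser, the "discriminator" checks how close a guess squared is to 2, and all names are hypothetical.

```python
import random

def verify(candidate, target=2.0):
    """Cheap discriminator: score a guess for sqrt(target). Higher is better."""
    return -abs(candidate * candidate - target)

def self_improve(mean, spread, rounds, samples_per_round, keep, rng):
    """Generator = Gaussian around `mean`. Each round it samples guesses,
    keeps the best-scoring ones, and 'trains' by moving its mean onto them."""
    for _ in range(rounds):
        candidates = [rng.gauss(mean, spread) for _ in range(samples_per_round)]
        best = sorted(candidates, key=verify, reverse=True)[:keep]
        mean = sum(best) / len(best)
        spread = max(1e-3, spread * 0.8)  # narrow the search as guesses improve
    return mean

rng = random.Random(0)
improved = self_improve(mean=1.0, spread=0.5, rounds=10,
                        samples_per_round=64, keep=8, rng=rng)
print("initial guess: 1.0, improved guess:", improved)
```

The generator never needs to know how to compute a square root; it only needs the verifier to rank its own outputs, which is the gap the argument relies on.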

I don't think I'm concerned by moving up a level in abstraction. For one, I don't expect any specific developer to suddenly get access to 5 - 9 OOMs more compute than any previous developer. For another, it seems clear that we'd want the AIs being built to be misaligned with whatever "values" correspond to the outer selection signals associated with the outer optimizer in question (i.e., "the people doing the best on benchmarks will get their approaches copied, get more funding, etc"). Seems like an AI being aligned to, like, impressing its developers? doi... (read more)

7Steven Byrnes10mo
I wrote: And then you wrote: Isn’t that kinda a strawman? I can imagine a lot of scenarios where a training run results in a qualitatively better trained model than any that came before—I mentioned three of them—and I think “5-9OOM more compute than any previous developer” is a much much less plausible scenario than any of the three I mentioned.

I don't think this objection matters for the argument I'm making. All the cross-generational information channels you highlight are at rough saturation, so they're not able to contribute to the cross-generational accumulation of capabilities-promoting information. Thus, the enormous disparity between the brain's within-lifetime learning versus evolution cannot lead to a multiple OOM faster accumulation of capabilities as compared to evolution.

When non-genetic cross-generational channels are at saturation, the plot of capabilities-related info versus generati... (read more)

Hey Quintin thanks for the diagram. Have you tried comparing the cumulative amount of genetic info over 3.5B years? Isn't it a big coincidence that the time of brains that process info quickly / increase information rapidly, is also the time where those brains are much more powerful than all other products of evolution? (The obvious explanation in my view is that brains are vastly better optimizers/searchers per computation step, but I'm trying to make sure I understand your view.)

That's not at all clear to me. Inductive biases clearly differ between humans, yet we are not all terminally misaligned with each other. E.g., split brain patients are not all weird value aliens, despite a significant difference in architecture. Also, training on human-originated data causes networks to learn human-like inductive biases (at least somewhat).

Thanks for weighing in Quintin! I think I basically agree with dxu here. I think this discussion shows that Rob should probably rephrase his argument as something like "When humans make plans, the distribution they sample from has all sorts of unique and interesting properties that arise from various features of human biology and culture and the interaction between them. Big artificial neural nets will lack these features, so the distribution they draw from will be significantly different -- much bigger than the difference between any two humans, for examp... (read more)

I feel like there's a significant distance between what's being said formally versus the conclusions being drawn. From Rob:

If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language)

From you:

the simplicity bias of SGD on NNs is different than some people think -- it is weighted towards broad basins / connected regions. It's still randomly sampling from the set of all low loss NN parameter configurations, but with a different bias/prior.

The issue is that literally any plan generation / NN training pr... (read more)

Isn't it enough that they do differ? Why do we need to be able to accurately/precisely characterize the nature of the difference, to conclude that an arbitrary inductive bias different from our own is unlikely to sample the same kinds of plans we do?

These kinds of 'twist on known optimizers' papers are pretty common, and they mostly don't amount to too much. E.g., the only difference between Adam and "SafeRate[Adam direction]" is that they used their second-order method to automatically tune the learning rate of the Adam optimizer. Such automatic hyperparameter tuning has been a thing for a long time. E.g., here's a paper from ~30 years ago.

Also note that Adam pretty much keeps up with SafeRate in the above plot until the loss drops to ~, which is extremely low, and very far beyond what any plausib... (read more)
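For readers unfamiliar with the idea, here is a minimal sketch of automatic learning-rate tuning from second-order information. It is not the SafeRate method from the paper: it assumes a toy diagonal quadratic loss, uses the plain negative gradient as a stand-in for the Adam direction, and picks the step size that exactly minimizes the loss along that direction. All names are hypothetical.

```python
def loss(curvatures, x):
    """Toy quadratic loss 0.5 * sum(a_i * x_i^2) with known diagonal curvature."""
    return 0.5 * sum(a * xi * xi for a, xi in zip(curvatures, x))

def grad(curvatures, x):
    return [a * xi for a, xi in zip(curvatures, x)]

def tuned_step(curvatures, x, direction):
    """Take one step along `direction`, with the learning rate chosen from
    curvature: lr = -(g . d) / (d^T H d), the exact minimizer on a quadratic."""
    g = grad(curvatures, x)
    g_dot_d = sum(gi * di for gi, di in zip(g, direction))
    d_h_d = sum(a * di * di for a, di in zip(curvatures, direction))
    lr = -g_dot_d / d_h_d if d_h_d > 0 else 0.0
    return [xi + lr * di for xi, di in zip(x, direction)], lr

curvatures = [1.0, 10.0]  # badly conditioned on purpose
x = [1.0, 1.0]
losses = [loss(curvatures, x)]
for _ in range(20):
    # Stand-in for "whatever direction the base optimizer proposes".
    direction = [-gi for gi in grad(curvatures, x)]
    x, lr = tuned_step(curvatures, x, direction)
    losses.append(loss(curvatures, x))

print("first loss:", losses[0], "last loss:", losses[-1])
```

The base optimizer still supplies the direction; the second-order information only replaces a hand-tuned learning rate, which is the sense in which this is hyperparameter tuning rather than a new optimizer.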

One can always reparameterize any given input / output mapping as a search for the minima of some internal energy function, without changing the mapping at all. 

The main criterion to think about is whether an agent will use creative, original strategies to maximize inner objectives, strategies which are more easily predicted by assuming the agent is "deliberately" looking for extremes of the inner objectives, as opposed to basing such predictions on the agent's past actions, e.g., "gather more computational resources so I can find a high maximum".
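The reparameterization point above can be made concrete with a small sketch. Assuming a finite set of candidate outputs, any fixed mapping can be rewritten as "search for the output that minimizes an energy function" without changing its behavior at all. The specific mapping and names below are hypothetical.

```python
def mapping(x):
    """Any fixed input/output mapping; this one is an arbitrary example."""
    return (3 * x + 1) % 7

def energy(x, y):
    """Zero exactly when y is the mapping's output for x, positive otherwise."""
    return (y - mapping(x)) ** 2

def mapping_as_search(x, candidate_outputs):
    """The same mapping, expressed as minimizing an internal energy function."""
    return min(candidate_outputs, key=lambda y: energy(x, y))

candidate_outputs = range(7)
for x in range(5):
    print(x, mapping(x), mapping_as_search(x, candidate_outputs))
```

Because the two functions agree on every input, "it performs internal search" cannot by itself distinguish systems; the useful question is the one above, whether predictions based on the objective beat predictions based on past behavior.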

1Fergus Fettes1y
This is the closest thing yet! Thank you. Maybe that is it.

Pretty much. Though I'd call it a "fast takeoff" instead of "sharp left turn" because I think "sharp left turn" is supposed to have connotations beyond "fast takeoff", e.g., "capabilities end up generalizing further than alignment".

Right, you are saying evolution doesn't provide evidence for AI capabilities generalizing further than alignment, but then only consider the fast takeoff part of the SLT to be the concern. I know you have stated reasons why alignment would generalize further than capabilities, but do you not think an SLT-like scenario could occur in the two capability jump scenarios you listed?