All of Rohin Shah's Comments + Replies

I feel like a lot of these arguments could be pretty easily made of individual AI safety researchers. E.g.

Misaligned Incentives

In much the same way that AI systems may have perverse incentives, so do the [AI safety researchers]. They are [humans]. They need to make money, [feed themselves, and attract partners]. [Redacted and redacted even just got married.] This type of accountability to [personal] interests is not perfectly in line with doing what is good for human interests. Moreover, [AI safety researchers are often] technocrats whose values and demogr

...
Thanks. I agree that the points apply to individual researchers. But I don't think that it applies in a comparably worrisome way because individual researchers do not have comparable intelligence, money, and power compared to the labs. This is me stressing the "when put under great optimization pressure" of Goodhart's Law. Subtle misalignments are much less dangerous when there is a weak optimization force behind the proxy than when there is a strong one. 
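As a rough illustration of the "great optimization pressure" point (a toy sketch, not anything from the post): assume the proxy is the true value plus independent Gaussian noise, and that optimization pressure is modelled as picking the best of `n_candidates` by proxy score. The function name and all parameters here are hypothetical.

```python
import numpy as np

def selected_gap(n_candidates, n_trials=2000, noise_std=1.0, seed=0):
    """Average (proxy - true value) of the candidate with the highest proxy score."""
    rng = np.random.default_rng(seed)
    true_value = rng.normal(size=(n_trials, n_candidates))
    # Assumed proxy: the true value plus independent Gaussian noise.
    proxy = true_value + rng.normal(scale=noise_std, size=(n_trials, n_candidates))
    best = proxy.argmax(axis=1)  # optimize the proxy, not the true value
    rows = np.arange(n_trials)
    return float((proxy[rows, best] - true_value[rows, best]).mean())

weak_gap = selected_gap(n_candidates=2)       # weak optimization pressure
strong_gap = selected_gap(n_candidates=1000)  # strong optimization pressure
print(f"proxy overestimate under weak selection:   {weak_gap:.2f}")
print(f"proxy overestimate under strong selection: {strong_gap:.2f}")
```

In this toy model the same small misalignment (the noise term) costs more the harder the proxy is optimized, which is the sense in which a weak optimizer is less worrying than a strong one.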

Sounds reasonable, though idk what you think realistic values of N are (my wild guess with hardly any thought is 15 minutes - 1 day).

EDIT: Tbc in the 1 day case I'm imagining that most of the time goes towards running the experiment -- it's more a claim about what experiments we want to run. If we just talk about the time to write the code and launch the experiment I'm thinking of N in the range of 5 minutes to 1 hour.

Cool, that all roughly makes sense to me :)

I was certainly imagining at least some amount of multi-tasking (e.g. 4 projects at once each of which runs 8x faster). This doesn't feel that crazy to me, I already do a moderate amount of multi-tasking.

Multi-tasking where you are responsible for the entire design of the project? (Designing the algorithm, choosing an experimental setting and associated metrics, knowing the related work, interpreting the results of the experiments, figuring out what the next experiment should be, ...)

Suppose today I gave you a dev...

Probably yes for realistic values of N? Assuming the box is pretty smart at understanding instructions (and has an understanding of my typical ontology to the extent that you would get after working with me a few weeks and reading various posts) and the box will ask follow-up questions in cases where the instructions are unclear. (And we can do small diffs with reduced latency, like asking for the results to be plotted in a different way.) My main concern is running out of ideas after a while despite copies of myself with more thinking time having more time to generate ideas.

I agree it helps to run experiments at small scales first, but I'd be pretty surprised if that helped to the point of enabling a 30x speedup -- that means that the AI labor allows you to get a 30x improvement in compute needed beyond what would be done by default by humans (though the 30x can include e.g. improving utilization; it's not limited just to making individual experiments take less time).

I think the most plausible case for your position would be that the compute costs for ML research scale much less than quadratically with the size of the pretrained m...

Overall, this has updated me to some extent and it seems less plausible to me that ML research can achieve 30x speedups while having human researchers do all of the high level ideas. (I think the picture looks importantly better when AIs are removing this bottleneck.)

The situation I was imagining is where most experiments use some combination of:

* A relatively small amount of finetuning/inference on the biggest models (including for the actual AI researcher)
* Larger (possibly full) training runs, but at much smaller scale (e.g. GPT-3 level performance models)

Then, we can in total afford ~training dataset sized amounts of finetuning/inference for the biggest models (by the inference availability argument). And GPT-3 performance experiments will be pretty cheap. So assuming our base model looks like GPT-6 with the expected compute requirement and model size, this is a huge amount of possible inference availability.

So, the implicit claim is that compute costs scale much less than quadratically. It's certainly not obvious ML research can be progressed fast enough with this little compute.

I was certainly imagining at least some amount of multi-tasking (e.g. 4 projects at once each of which runs 8x faster). This doesn't feel that crazy to me, I already do a moderate amount of multi-tasking. Note that this often involves multiple people working on the same paper. In the AI case, the division of labor might look at least somewhat different. (Though I don't think this changes the picture very much from what you're describing because most people now aren't the "ideas" people.)

I think ML research in particular can plausibly be accelerated by maybe 30x by only making it extremely fast and cheap to go from high level ideas to implemented experiments (rather than needing to generate these high level ideas)

Why doesn't compute become the bottleneck well before the 30x mark? It seems like the AIs have to be superhuman at something to overcome that bottleneck (rather than just making it fast and cheap to implement experiments). Indeed the AIs make the problem somewhat worse, since you have to spend compute to run the AIs.
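One way to make the bottleneck question concrete is Amdahl's law. The sketch below assumes, purely for illustration, that research wall-clock time splits into a labor share that AI can accelerate and a compute-wait share that it cannot; the thread itself notes that labor can sometimes be traded for compute, which this simple model ignores. `overall_speedup` and its parameters are hypothetical names.

```python
def overall_speedup(compute_fraction, labor_speedup):
    """Amdahl's law: only the labor share of wall-clock time is accelerated."""
    labor_fraction = 1.0 - compute_fraction
    return 1.0 / (compute_fraction + labor_fraction / labor_speedup)

# Even with effectively unlimited labor speedup, the cap is 1 / compute_fraction.
for compute_fraction in (0.01, 0.03, 0.10, 0.30):
    cap = overall_speedup(compute_fraction, labor_speedup=1e9)
    print(f"compute share {compute_fraction:.0%}: at most {cap:.1f}x overall")
```

Under this assumption, a 30x overall speedup requires that waiting on compute made up less than about 1/30 of the original time, or that the compute needed per unit of progress also falls.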

I guess I'm not that sold that compute will actually be that much of a key bottleneck for experiments in the future in a way that can't be overcome with 2x additional labor and/or a 2x slowdown. Like in many cases you can spend additional labor to reduce compute usage of experiments. (E.g., first run the experiments on smaller models.) And, we're conditioning on having really powerful AIs which correlates with a high baseline level of compute and that will help whenever we can run experiments at small scales and then use scaling laws etc. Further, the current scaling laws imply huge inference availability if huge amounts of compute are used for training. This might depend on what type of ML research we're talking about.
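A back-of-the-envelope sketch of the inference availability point, assuming the standard rough approximations of about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per generated token. The model size and token count below are made-up placeholders, not estimates for any real model.

```python
def inference_tokens_affordable(n_params, n_train_tokens):
    """Tokens of inference that a training-sized compute budget would buy."""
    train_flops = 6 * n_params * n_train_tokens  # assumed: forward + backward pass
    flops_per_generated_token = 2 * n_params     # assumed: forward pass only
    return train_flops / flops_per_generated_token

n_params = 10**12            # hypothetical parameter count
n_train_tokens = 2 * 10**13  # hypothetical training set size
tokens = inference_tokens_affordable(n_params, n_train_tokens)
print(f"{tokens:.1e} inference tokens, i.e. {tokens / n_train_tokens:.0f}x the training set")
```

The parameter count cancels, so under these approximations a training-sized budget always buys about three training sets' worth of inference tokens on the same model.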

Come on, the claim "the evidence suggests that if the current ML systems were trying to deceive us, we wouldn't be able to change them not to" absent any other qualifiers seems pretty clearly false. It is pretty important to qualify that you are talking about deceptive alignment or backdoors specifically (e.g. I'm on board with Ryan's phrasing).

There's a huge disanalogy between your paper's setup and deception-in-general, which is that in your paper's setup there is no behavioral impact at training time. Deception-in-general (e.g. sycophancy) often has beh...

Ah, sure—I agree that we don't say anything about sycophancy-style deception. I interpreted "deception" there in context to refer to deceptive alignment specifically. The word deception is unfortunately a bit overloaded.

I think you mostly need to hope that it doesn't matter (because the crazy XOR directions aren't too salient) or come up with some new idea.

Yeah certainly I'd expect the crazy XOR directions aren't too salient.

I'll note that if it ends up these XOR directions don't matter for generalization in practice, then I start to feel better about CCS (along with other linear probing techniques). I know that for CCS you're more worried about issues around correlations with features like true_according_to_Alice, but my feeling is that we might be able to handle spuriou

...
4 · Sam Marks · 1mo
I agree with this! (And it's what I was trying to say; sorry if I was unclear.) My point is that {features which are as crazy as "true according to Alice" (i.e., not too crazy)} seems potentially manageable, whereas {features which are as crazy as arbitrary boolean functions of other features} seems totally unmanageable. Thanks, as always, for the thoughtful replies.

Yeah, agreed that's a clear overclaim.

In general I believe that many (most?) people take it too far and make incorrect inferences -- partly on priors about popular posts, and partly because many people including you believe this, and those people engage more with the Simulators crowd than I do.

Fwiw I was sympathetic to nostalgebraist's positive review saying:

sometimes putting a name to what you "already know" makes a whole world of difference. [...] I see these takes, and I uniformly respond with some version of the sentiment "it seems like you aren't thin

...

Yeah, I would be surprised if this is a good first-order approximation of what is going on inside an LLM. Or maybe you mean this in a non-mechanistic way?

Yes, I definitely meant this in the non-mechanistic way. Any mechanistic claims that sound simulator-flavored based just on the evidence in this post sound clearly overconfident and probably wrong. I didn't reread this post carefully but I don't remember seeing mechanistic claims in it.

I agree that in a non-mechanistic way, the above will produce reasonable predictions, but that's because that's basicall

...
Hmm, yeah, this perspective makes more sense to me, and I don't currently believe you ended up making any of the wrong inferences I've seen others make on the basis of the post.  I do sure see many other people make inferences of this type. See for example the tag page for Simulator Theory which says:  This also directly claims that the physics the system learned are "the mechanics underlying our world", which I think isn't totally false (they have probably learned a good chunk of the mechanics of our world) but is inaccurate as something trying to describe most of what is going on in a base model's cognition.

The thing that's confusing here is that the two-way XORs that my experiments are looking at just seem clearly not useful for anything.

Idk, I think it's pretty hard to know what things are and aren't useful for predicting the next token. For example, some of your features involve XORing with a "has_not" feature -- XORing with an indicator for "not" might be exactly what you want to do to capture the effect of the "not".

(Tbc here the hypothesis could be "the model computes XORs with has_not all the time, and then uses only some of them", so it does have some...

2 · Sam Marks · 1mo
I agree that "the model has learned the algorithm 'always compute XORs with has_not'" is a pretty sensible hypothesis. (And might be useful to know, if true!) FWIW, the stronger example of "clearly not useful XORs" I was thinking of is has_true XOR has_banana, where I'm guessing you're anticipating that this XOR exists incidentally. Focusing again on the Monster gridworld setting, here are two different ways that your goals could misgeneralize: 1. player_has_shield is spuriously correlated with high_score during training, so the agent comes to value both 2. monster_present XOR high_score is spuriously correlated with high_score during training, so the agent comes to value both. These are pretty different things that could go wrong. Before realizing that these crazy XOR features existed, I would only have worried about (1); now that I know these crazy XOR features exist ... I think I mostly don't need to worry about (2), but I'm not certain and it might come down to details about the setting. (Indeed, your CCS challenges work has shown that sometimes these crazy XOR features really can get in the way!) I agree that you can think of this issue as just being the consequence of the two issues "there are lots of crazy XOR features" and "linear probes can pick up on spurious correlations." I guess this issue feels qualitatively new to me because it just seems pretty intractable to deal with it on the data augmentation level (how do you control for spurious correlations with arbitrary boolean functions of undesired features?). I think you mostly need to hope that it doesn't matter (because the crazy XOR directions aren't too salient) or come up with some new idea. I'll note that if it ends up these XOR directions don't matter for generalization in practice, then I start to feel better about CCS (along with other linear probing techniques).[1] If I had to articulate my reason for being surprised here, it'd be something like: 1. I didn't expect LLMs to compute many XO

I think the main thing I'd point to is this section (where I've changed bullet points to numbers for easier reference):

I can’t convey all that experiential data here, so here are some rationalizations of why I’m partial to the term, inspired by the context of this post:

  1. The word “simulator” evokes a model of real processes which can be used to run virtual processes in virtual reality.
  2. It suggests an ontological distinction between the simulator and things that are simulated, and avoids the fallacy of attributing contingent properties of the latter to the for
...
Yeah, I would be surprised if this is a good first-order approximation of what is going on inside an LLM. Or maybe you mean this in a non-mechanistic way? I agree that in a non-mechanistic way, the above will produce reasonable predictions, but that's because that's basically a description of the task the LLM is trained on.

Like, the above sounds similar to me to "in order to predict what AlphaZero will do, choose some promising moves, then play forward the game and predict after which moves AlphaZero is most likely to win, then adopt the move that most increases the probability of winning as your prediction of what AlphaZero does". Of course, that is approximately useless advice, since basically all you've done is describe the training setup of AlphaZero.

As a mechanistic explanation, I would be surprised if even with amazing mechanistic interpretability you will find some part of the LLM whose internal structure corresponds in a lot of detail to the mind or brain of the kind of person it is trying to "simulate". I expect the way you get low loss here will involve an enormous amount of non-simulating cognition (see again my above analogy about how when humans engage in roleplay, we engage in a lot of non-simulating cognition).

To maybe go into a bit more depth on what wrong predictions I've seen people make on the basis of this post:

* I've seen people make strong assertions about what kind of cognition is going on inside of LLMs, ruling out things like situational awareness for base models (it's quite hard to know whether base models have any situational awareness, though RLHF'd models clearly have some level. I also think what situational awareness would mean for base models is a bit confusing, but not that confusing: it would just mean that as you scale up the model its behavior would become quite sensitive to the context in which it is run)
* I've seen people make strong predictions that LLM performance can't become superhuman on various tasks, s

Nice post, and glad this got settled experimentally! I think it isn't quite as counterintuitive as you make it out to be -- the observations seem like they have reasonable explanations.

I feel pretty confident that there's a systematic difference between basic features and derived features, where the basic features are more "salient" -- I'll be assuming such a distinction in the rest of the comment.

(I'm saying "derived" rather than "XOR" because it seems plausible that some XOR features are better thought of as "basic", e.g. if they were very useful for the...

4 · Sam Marks · 1mo
I agree with a lot of this, but some notes: The thing that's confusing here is that the two-way XORs that my experiments are looking at just seem clearly not useful for anything. So I think any utility explanation that's going to be correct needs to be a somewhat subtle one of the form "the model doesn't initially know which XORs will be useful, so it just dumbly computes way more XORs than it needs, including XORs which are never used in any example in training." Or in other words "the model has learned the algorithm 'compute lots of XORs' rather than having learned specific XORs which it's useful to compute." I think this subtlety changes the story a bit. One way that it changes the story is that you can't just say "the model won't compute multi-way XORs because they're not useful" -- the two-way XORs were already not useful! You instead need to argue that the model is implementing an algorithm which computed all the two-way XORs but didn't compute XORs of XORs; it seems like this algorithm might need to encode somewhere information about which directions correspond to basic features and which don't. Even though on a surface level this resembles the failure discussed in the post (because one feature is held fixed during training), I strongly expect that the sorts of failures you cite here are really generalization failure for "the usual reasons" of spurious correlations during training. For example, during training (because monsters are present), "get a high score" and "pick up shields" are correlated, so the agents learn to value picking up shields. I predict that if you modified the train set so that it's no longer useful to pick up shields (but monsters are still present), then the agent would no longer pick up shields, and so would no longer misgeneralize in this particular way. In contrast, the point I'm trying to make in the post is that RAX can cause problems even in the absence of spurious correlations like this.[1] As you noted, it will sometimes be

Are you saying that this claim is supported by PCA visualizations you've done?

Yes, but they're not in the paper. (I also don't remember if these visualizations were specifically on banana/shed or one of the many other distractor experiments we did.)

I'll say that I've done a lot of visualizing true/false datasets with PCA, and I've never noticed anything like this, though I never had as clean a distractor feature as banana/shed.

It is important for the distractor to be clean (otherwise PCA might pick up on other sources of variance in the activations as the ...

4 · Sam Marks · 2mo
Thanks for the detailed replies!

(To summarize the parallel thread)

The claim is that the learned probe is . As shown in Theorem 1, if you chug through the math with this probe, it gets low CCS loss and leads to an induced classifier .*

You might be surprised that this is possible, because the CCS normalization is supposed to eliminate  -- but what the normalization does is remove linearly-accessible information about . However,  is not linearly accessible, and...

The point is that while the normalization eliminates , it does not eliminate , and it turns out that LLMs really do encode the XOR linearly in the residual stream.

Why does the LLM do this? Suppose you have two boolean variables a and b. If the neural net uses three dimensions to represent a, b, and a⊕b, I believe that allows it to recover arbitrary boolean functions of a and b linearly from the residual stream. So you might expect the LLM to do this "by default"...
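A quick numerical check of the linear-recoverability claim, reading the three represented quantities as a, b, and a⊕b (the features named later in the thread). This is a sketch of the arithmetic only: it adds a constant (bias) term as an assumption and says nothing about what real models do.

```python
import itertools
import numpy as np

inputs = list(itertools.product([0, 1], repeat=2))
# Design matrix: a constant term plus the three represented features.
features = np.array([[1, a, b, a ^ b] for a, b in inputs], dtype=float)

max_error = 0.0
for truth_table in itertools.product([0, 1], repeat=4):  # all 16 functions of (a, b)
    target = np.array(truth_table, dtype=float)
    weights, *_ = np.linalg.lstsq(features, target, rcond=None)
    max_error = max(max_error, float(np.abs(features @ weights - target).max()))

print(f"rank of feature matrix: {np.linalg.matrix_rank(features)}")
print(f"largest reconstruction error over all 16 functions: {max_error:.1e}")
```

The feature matrix has full rank, so every truth table on two bits is an exact affine function of (a, b, a⊕b). For instance, AND is (a + b - a⊕b) / 2. Without the XOR direction the rank drops to 3, and functions such as XOR itself cannot be read off linearly.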

3 · Sam Marks · 2mo
Thanks! I'm still pretty confused though. It sounds like you're making an empirical claim that in this banana/shed example, the model is representing the features has_banana(x), has_true(x), and has_banana(x)⊕has_true(x) along linearly independent directions. Are you saying that this claim is supported by PCA visualizations you've done? Maybe I'm missing something, but none of the PCA visualizations I'm seeing in the paper seem to touch on this. E.g. the visualization in figure 2(b) (reproduced below) is colored by is_true(x), not has_true(x). Are there other visualizations showing linear structure to the feature has_banana(x)⊕has_true(x) independent of the features has_banana(x) and has_true(x)? (I'll say that I've done a lot of visualizing true/false datasets with PCA, and I've never noticed anything like this, though I never had as clean a distractor feature as banana/shed.) More broadly, it seems like you're saying that you think in general, when LLMs have linearly-represented features a and b they will also tend to linearly represent the feature a⊕b. Taking this as an empirical claim about current models, this would be shocking. (If this was meant to be a claim about a possible worst-case world, then it seems fine.) For example, if I've done my geometry right, this would predict that if you train a supervised probe (e.g. with logistic regression) to classify a=0 vs 1 on a dataset where b=0, the resulting probe should get ~50% accuracy on a test dataset where b=1. And this should apply for any features a,b. But this is certainly not the typical case, at least as far as I can tell! Concretely, if we were to prepare a dataset of 2-token prompts where the first word is always "true" or "false" and the second word is always "banana" or "shed," do you predict that a probe trained with logistic regression on the dataset {(true banana,1),(false banana,0)} will have poor accuracy when tested on {(true shed,1),(false shed,0)}?
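The geometry behind the ~50% prediction can be checked on synthetic data. The sketch below assumes the worst case under discussion: a, b, and a⊕b each get their own orthogonal direction of equal magnitude, plus Gaussian noise. It is a toy stand-in for activations, not a measurement on any model, and every name and parameter in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def activations(a, b, noise_std=0.3):
    """Toy 'residual stream': one orthogonal direction each for a, b, and a XOR b."""
    clean = np.stack([a, b, a ^ b], axis=1).astype(float)
    return clean + rng.normal(scale=noise_std, size=clean.shape)

def fit_logistic(x, y, steps=1000, lr=0.5):
    """Plain gradient descent on the logistic loss, starting from zero weights."""
    w, bias = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + bias)))
        w -= lr * (x.T @ (p - y)) / len(y)
        bias -= lr * float(np.mean(p - y))
    return w, bias

def accuracy(w, bias, x, y):
    return float(np.mean(((x @ w + bias) > 0) == (y == 1)))

n = 20_000
a_train, a_test = rng.integers(0, 2, n), rng.integers(0, 2, n)
x_train = activations(a_train, np.zeros(n, dtype=int))  # train only where b = 0
x_test = activations(a_test, np.ones(n, dtype=int))     # test only where b = 1

w, bias = fit_logistic(x_train, a_train)
train_acc = accuracy(w, bias, x_train, a_train)
test_acc = accuracy(w, bias, x_test, a_test)
print(f"train accuracy (b=0): {train_acc:.2f}")
print(f"test accuracy  (b=1): {test_acc:.2f}")
```

When b=0 the a direction and the a⊕b direction carry the same signal, so the probe weights them about equally; when b=1 the XOR direction flips and the two contributions cancel. How far real models sit from this equal-magnitude assumption is exactly the empirical question being asked.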

Thanks for the edit :)

As I mentioned elsewhere (not this website) I don't agree with "will reliably lead people to false beliefs", if we're talking about ML people rather than LW people (as was my audience for that blog post).

I do think that it's a reasonable hypothesis to have, and I assign it more likelihood than I would have a year ago (in large part from you pushing some ML people on this point, and them not getting it as fast as I would have expected).

Like, just look at this quote from the post you mentioned:

Unfortunately, AI systems trained with reinforcement learning only optimize features specified in the reward function and are indifferent to anything we might’ve inadvertently left out. 

And you probably didn't even select that post for this particular misunderstanding.

(Presumably you are talking about how reward is not the optimization target.)

While I agree that the statement is not literally true, I am still basically on board with that sentence and think it's a reasonable shorthand for the tr...

I didn't mean to imply that you in particular didn't understand the reward point, and I apologize for not writing my original comment more clearly in that respect. Out of nearly everyone on the site, I am most persuaded that you understood this "back in the day." I meant to communicate something like "I think the quoted segment from Rohin and Dmitrii's post is incorrect and will reliably lead people to false beliefs."

(I didn't follow this argument at the time, so I might be missing key context.)

The blog post "Reward is not the optimization target" gives the following summary of its thesis,

  1. Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
  2. Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into t
...

Some thoughts on my journey in particular:

  1. When I joined AI safety in late 2017 (having read approximately nothing in the field), I thought of the problem as "construct a utility function for an AI system to optimize", with a key challenge being the fragility of value. In hindsight this was clearly wrong.
    1. The Value Learning sequence was in large part a result of my journey away from the utility function framing.
    2. That being said, I suspect I continued to think that fragility-of-value type issues were a significant problem, probably until around mid-2019 (see n
...

Good point on the rotational symmetry, that makes sense now.

I still think that this assumption is fairly realistic because in practice, most pairs of unrelated features would co-occur only very rarely, and I expect the winner-take-all dynamic to dominate most of the time. But I agree that it would be nice to quantify this and test it out.

Agreed that's a plausible hypothesis. I mostly wish that in this toy model you had a hyperparameter for the frequency of co-occurrence of features, and identified how it affects the rate of incidental polysemanticity.
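A minimal sketch of what such a hyperparameter could look like on the data side, assuming each feature is switched on independently with probability `p_active` and that active features take magnitudes uniform in [0, 1). Both choices are assumptions for illustration, as are the function names; this is not the post's code.

```python
import numpy as np

def sample_inputs(n_features, batch_size, p_active, rng):
    """Each feature is switched on independently with probability p_active."""
    mask = rng.random((batch_size, n_features)) < p_active
    values = rng.random((batch_size, n_features))  # assumed magnitudes in [0, 1)
    return mask * values

def co_occurrence_rate(inputs):
    """Fraction of inputs on which two or more features are active together."""
    return float(np.mean((inputs != 0).sum(axis=1) >= 2))

rng = np.random.default_rng(0)
print("one-hot inputs:", co_occurrence_rate(np.eye(8)))  # features never co-occur
for p_active in (0.01, 0.05, 0.2):
    batch = sample_inputs(n_features=8, batch_size=50_000, p_active=p_active, rng=rng)
    print(f"p_active={p_active}: co-occurrence rate {co_occurrence_rate(batch):.3f}")
```

Sweeping `p_active` and recording how often a neuron ends up representing more than one feature would give the dependence asked for here. The training loop itself is left out because the thread does not specify it.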

I think I agree with all of that (with the caveat that it's been months and I only briefly skimmed the past context, so further thinking is unusually likely to change my mind).

My guess is that this result is very sensitive to the design of the training dataset:

the input/output data pairs are (e_i, e_i) for i = 1, ..., n, where e_i is the i-th basis vector.

In particular, I think it is likely very sensitive to the implicit assumption that feature i and feature j never co-occur on a single input. I'd be interested to see experiments where each feature is turned on with some (not too small) probability, independently of all other features, similarly to the original toy models setting. This would result in so...

4 · Victor Lecomte · 3mo
Thanks for the feedback! Definitely! I still think that this assumption is fairly realistic because in practice, most pairs of unrelated features would co-occur only very rarely, and I expect the winner-take-all dynamic to dominate most of the time. But I agree that it would be nice to quantify this and test it out. If there is no L1 regularization on activations, then every hidden neuron would indeed be highly "polysemantic" in the sense that it has nonzero weights for each input feature. But on the other hand, the whole encoding space would become rotationally symmetric, and when that's the case it feels like polysemanticity shouldn't be about individual neurons (since the canonical basis is not special anymore) and instead about the angles that different encodings form. In particular, as long as m ≥ n, the space of optimal solutions for this setup requires the encodings W_i to form angles of at least 90° with each other, and it's unclear whether we should call this polysemantic. So one of the reasons why we need L1 regularization is to break the rotational symmetry and create a privileged basis: that way, it's actually meaningful to ask whether a particular hidden neuron is representing more than one feature.
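The symmetry argument can be checked numerically. The sketch below assumes one-hot inputs, so that the hidden activation for feature i is just the encoding column W_i and an L1 penalty on activations becomes the entrywise L1 norm of W. The sizes are arbitrary and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4                                   # hidden neurons, input features (arbitrary sizes)
W = rng.normal(size=(m, n))                   # column i is the encoding W_i of feature i
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))  # a random orthogonal map on the hidden space

W_rot = Q @ W
# Inner products (and hence angles) between encodings survive the rotation...
gram_unchanged = bool(np.allclose(W.T @ W, W_rot.T @ W_rot))
l2_unchanged = bool(np.isclose(np.sum(W**2), np.sum(W_rot**2)))
# ...but the L1 norm of the activations depends on the choice of basis.
l1_before, l1_after = float(np.abs(W).sum()), float(np.abs(W_rot).sum())

print("Gram matrix unchanged:", gram_unchanged)
print("L2 norm unchanged:    ", l2_unchanged)
print(f"L1 norm before / after rotation: {l1_before:.3f} / {l1_after:.3f}")
```

Any loss that depends on W only through inner products cannot tell W from QW, so individual neurons carry no special meaning. The L1 term does distinguish them, which is what makes the neuron basis privileged.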

One piece missing here, insofar as current methods don't get to 99% of loss recovered, is repeatedly drilling into the residual until they do get to 99%.

When you do that using existing methods, you lose the sparsity (e.g. for circuit finding you have to include a large fraction of the model to get to 99% loss recovered).

It's of course possible that this is because the methods are bad, though my guess is that at the 99% standard this is reflecting non-sparsity / messiness in the territory (and so isn't going to go away with better methods). I do expect we c...

I have the opposite expectation there; I think it's just that current methods are pretty primitive.

This seems like exactly what mech interp is doing? Circuit finding is all about finding sparse subgraphs. It continues to work with large models, when trying to explain a piece of the behavior of the large model. SAE stands for sparse autoencoder: the whole point is to find the basis in which you get sparsity. I feel like a lot of mech interp has been almost entirely organized around the principle of modularity / sparsity, and the main challenge is that it's hard (you don't get to 99% of loss recovered, even on pieces of behavior, while still being meaningfully sparse).

SAEs are almost the opposite of the principle John is advocating for here. They deliver sparsity in the sense that the dictionary you get only has a few neurons in a nonzero state at the same time; they do not deliver sparsity in the sense of a low dimensional summary of the relevant information in the layer, or whatever other causal cut you deploy them on. Instead, the dimensionality of the representation gets blown up to be even larger

One piece missing here, insofar as current methods don't get to 99% of loss recovered, is repeatedly drilling into the residual until they do get to 99%. That's a pretty core part of what makes science work, in general. And yeah, that's hard (at least in the sense of being a lot of work; more arguable whether it's hard in a stronger sense than that).
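The thread does not define "loss recovered". One common convention, assumed here, compares the loss with the explanation spliced in against the clean model's loss and a fully ablated baseline. The numbers below are made-up placeholders, not results.

```python
def fraction_of_loss_recovered(clean_loss, ablated_loss, explained_loss):
    """1.0 means the explanation matches the clean model; 0.0 means no better than ablation."""
    return (ablated_loss - explained_loss) / (ablated_loss - clean_loss)

clean_loss = 2.0      # hypothetical: original model
ablated_loss = 5.0    # hypothetical: component fully ablated
explained_loss = 2.3  # hypothetical: component replaced by the explanation

recovered = fraction_of_loss_recovered(clean_loss, ablated_loss, explained_loss)
residual = 1.0 - recovered  # the part an explanation would still have to drill into
print(f"recovered: {recovered:.0%}, unexplained residual: {residual:.0%}")
```

On this reading, "drilling into the residual" means repeatedly explaining what remains until the recovered fraction reaches the 99% standard.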

This has been stated many times before (I believe I heard it in Chris Olah’s 80k episode first) but worth reiterating.

The reference I like best is

Unless by "shrugs" you mean the details of what the partial hypothesis says in this particular case are still being worked out.

Yes, that's what I mean.

I do agree that it's useful to know whether a partial hypothesis says anything or not; overall I think this is good info to know / ask for. I think I came off as disagreeing more strongly than I actually did, sorry about that.

Do you have any plans to do this?

No, we're moving on to other work: this took longer than we expected, and was less useful for alignment than we hoped (though that part wasn't that unex...

Which of these theories [...] can predict the same "four novel predictions about grokking" yours did? The relative likelihoods are what matters for updates after all.

I disagree with the implicit view on how science works. When you are a computationally bounded reasoner, you work with partial hypotheses, i.e. hypotheses that only make predictions on a small subset of possible questions, and just shrug at other questions. This is mostly what happens with the other theories:

  1. Difficulty of representation learning: Shrugs at our prediction about  /
...
Implicitly, I thought that if you have a partial hypothesis of grokking, then it should be penalized if it shrugs at a grokking-related phenomenon. Unless by "shrugs" you mean the details of what the partial hypothesis says in this particular case are still being worked out. But in that case, confirming that the partial hypothesis doesn't say anything yet about some phenomena is still useful info. I'm fairly sure this belief was what generated my question. Thank you for going through the theories and checking what they have to say. That was helpful to me. Do you have any plans to do this? How much time do you think it would take? And do you have any predictions for what should happen in these cases?

From page 6 of the paper:

Ungrokking can be seen as a special case of catastrophic forgetting (McCloskey and Cohen, 1989; Ratcliff, 1990), where we can make much more precise predictions. First, since ungrokking should only be expected once , if we vary  we predict that there will be a sharp transition from very strong to near-random test accuracy (around ). Second, we predict that ungrokking would arise even if we only remove examples from the training dataset, whereas catastrophic forgetting typically involves trainin

...

I think that post has a lot of good ideas, e.g. the idea that generalizing circuits get reinforced by SGD more than memorizing circuits at least rhymes with what we claim is actually going on (that generalizing circuits are more efficient at producing strong logits with small param norm). We probably should have cited it; I forgot that it existed.

But it is ultimately a different take and one that I think ends up being wrong (e.g. I think it would struggle to explain semi-grokking).

I also think my early explanation, which that post compares to, is basically...

I agree -- the point is that if you train on addition examples without any modular wraparound (whether you think of that as regular addition or modular addition with a large prime, doesn't super matter), then there is at least some evidence that you get a different representation than the one Nanda et al found.

I think I would particularly critique DeepMind and OpenAI's interpretability works, as I don't see how this reduces risks more than other works that they could be doing, and I'd appreciate a written plan of what they expect to achieve.

I can't speak on behalf of Google DeepMind or even just the interpretability team (individual researchers have pretty different views), but I personally think of our interpretability work as primarily a bet on creating new affordances upon which new alignment techniques can be built, or existing alignment techniques can be en...

To give props to your last paragraphs, you are right about my concern that most alignment work is less important than governance work. Most of the funding in AI safety goes to alignment, AI governance is comparatively neglected, and I'm not sure that's the best allocation of resources. I decided to write this post specifically on interpretability as a comparatively narrow target to train my writing. 

I hope to work on a more constructive post, detailing constructive strategic considerations and suggesting areas of work and theories of impact that I thi... (read more)

Makes sense, but I think this is not the optimal resource allocation. I explain why below:

Yes, the pause is my secondary goal (edit: conditional on no significant alignment progress, otherwise smart scaling and regulations are my priorities). My primary goal remains coordination and safety culture. Mainly, I believe that one of the main pivotal processes goes through governance and coordination. A quote that explains my reasoning well is the following:

* "That is why focusing on coordination is crucial! There is a level of coordination above which we don’t die - there is no such threshold for interpretability. We currently live in a world where coordination is way more valuable than interpretability techniques. So let’s not forget that non-alignment aspects of AI safety are key! AI alignment is only a subset of AI safety! (I’m planning to deep-dive more into this in a following post)."

That's why I really appreciate Dan Hendrycks's work on coordination. And I think DeepMind and OpenAI could make a huge contribution by doing technical work that is useful for governance.

We've talked a bit during the EAG, and I understood that there's something like a numerus clausus in DeepMind's safety team. In that case, since interpretability doesn't require a lot of computing power/prestige and as DeepMind has a very high level of prestige, you should use it to write papers that help with coordination. Interpretability could be done outside the labs. For example, some of your works, like Model evaluation for extreme risks or Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals, are great for such a purpose!
I agree with this perspective if we can afford the time to perform interpretability work on all model setups, which our head count is too low to do. Given the urgency to address the alignment challenge quickly, it's better to encourage (or even prioritize) conceptually sound interpretability work rather than speculative approaches.

Yeah, that seems like a reasonable operationalization of "capable of doing X". So my understanding is that (1), (3), (6) and (7) would not falsify the hypothesis under your operationalization, (5) would falsify it, (2) depends on details, and (4) is kinda ambiguous but I tend to think it would falsify it.

2 · Lukas Finnveden · 3mo
I think (5) also depends on further details. As you have written it, both the 2023 and 2033 attempts use similar data and similar compute. But in my proposed operationalization, "you can get it to do X" is allowed to use a much greater amount of resources ("say, 1% of the pre-training budget") than the test for whether the model is "capable of doing X" ("Say, at most 1000 data points"). I think that's important:

* If both the 2023 and the 2033 attempts are really cheap low-effort attempts, then I don't think that the experiment is very relevant for whether "you can get it to do X" in the sort of high-stakes, high-effort situations that I'm imagining that we'll be in when we're trying to eval/align AI models to avoid takeover.
  * It seems super plausible that a low-effort attempt could fail, and then succeed later on with 10 more years' knowledge of best practices. I wouldn't learn much from that happening once.
* If both the 2023 and the 2033 attempts are really expensive and high-effort (e.g. 1% of pre-training budget), then I think it's very plausible that the 2033 training run gave the model new capabilities that it didn't have before.
  * And in particular: capabilities that the model wouldn't have been able to utilize in a takeover attempt that it was very motivated to do its best at. (Which is ultimately what we care about.)

By a similar argument, I would think that (4) wouldn't falsify the hypothesis as-written, but would falsify the hypothesis if the first run was a much more high-effort attempt, with lots of iteration by a competent team and more like a $1,000,000 budget, while the 2nd run, with a much more curated and high-quality dataset, still just used $1,000 of training compute.

One thing that I'm noticing while writing this is something like: The argument that "elicitation efforts would get to use ≥1% of the training budget" makes sense if we're eliciting all the capabilities at once, or if there's only a few important capabilities to e

Which of (1)-(7) above would falsify the hypothesis if observed? Or if there isn't enough information, what additional information do you need to tell whether the hypothesis has been falsified or not?

The “no sandbagging on checkable tasks” hypothesis: With rare exceptions, if a not-wildly-superhuman ML model is capable of doing some task X, and you can check whether it has done X, then you can get it to do X using already-available training techniques (e.g., fine-tuning it using gradient descent).[1]

I think as phrased this is either not true, or tautological, or otherwise imprecisely specified (in particular I'm not sure what it means for a model to be "capable of" doing some task X -- so far papers define that to be "can you quickly finetune the model... (read more)

4 · Lukas Finnveden · 7mo
Here's a proposed operationalization.

For models that can't gradient hack: The model is "capable of doing X" if it would start doing X upon being fine-tuned to do it using a hypothetical, small finetuning dataset that demonstrated how to do the task. (Say, at most 1000 data points.) (The hypothetical fine-tuning dataset should be a reasonable dataset constructed by a hypothetical team of humans who know how to do the task but aren't optimizing the dataset hard for ideal gradient updates to this particular model, or anything like that.)

For models that might be able to gradient-hack, but are well-modelled as having certain goals: The model is "capable of doing X" if it would start doing X if doing X was a valuable instrumental goal for it.

For both kinds: "you can get it to do X" if you could make it do X with some large amount of research+compute budget (say, 1% of the pre-training budget), no-holds-barred.

Edit: Though I think your operationalization also looks fine. I mainly wanted to point out that the "finetuning" definition of "capable of doing X" might be ok if you include the possibility of finetuning on hypothetical datasets that we don't have access to. (Since we only know how to check the task -- not perform it.)
4 · Tom Davidson · 7mo
I read "capable of X" as meaning something like "if the model was actively trying to do X then it would do X". I.e. a misaligned model doesn't reveal the vulnerability to humans during testing because it doesn't want them to patch it, but then later it exploits that same vulnerability during deployment because it's trying to hack the computer system.

I expect a delay even in the infinite data case, I think?

Although I'm not quite sure what you mean by "infinite data" here -- if the argument is that every data point will have been seen during training, then I agree that there won't be any delay. But yes training on the test set (even via "we train on everything so there is no possible test set") counts as cheating for this purpose.

Honestly I'd be surprised if you could achieve (2) even with explicit regularization, specifically on the modular addition task.

(You can achieve it by initializing the token embeddings to those of a grokked network so that the representations are appropriately structured; I'm not allowing things like that.)

EDIT: Actually, Omnigrok does this by constraining the parameter norm. I suspect this is mostly making it very difficult for the network to strongly memorize the data -- given the weight decay parameter the network "tries" to learn a high-param norm memo... (read more)

If you train on infinite data, I assume you'd not see a delay between training and testing, but you'd expect a non-monotonic accuracy curve that looks kind of like the test accuracy curve in the finite-data regime? So I assume infinite data is also cheating?

To add on my thinking in particular: my view for at least a couple of years was that alignment would go mainstream at some point and discourse quality would then fall. I didn't really see a good way for me to make the public discourse much better -- I am not as gifted at persuasive writing as (say) Eliezer, nor are my views as memetically fit. As a result, my plan has been to have more detailed / nuanced conversations with individuals and/or small groups, and especially to advise people making important decisions (and/or make those decisions myself), and t... (read more)

In particular, this point of view further (and perhaps almost completely) demystifies the use of the Fourier basis. 

I disagree at least with the "almost completely" version of this claim:

Notice that the operation you want to learn is manifestly a convolution operation, i.e.

This also applies to the non-modular addition operation, but I think it's pretty plausible that if you train on non-modular addition (to the point of ~perfect generalization), the network would learn an embedding that converts the "tokenized" ... (read more)
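As a small aside on the quoted claim that the operation is "manifestly a convolution": here is a minimal sketch, not taken from the post or from any paper, that checks it numerically. The modulus `P = 7` and all function names are assumptions chosen for illustration. It shows that circularly convolving the one-hot encodings of x and y yields the one-hot encoding of (x + y) mod P, and that the discrete Fourier transform turns that convolution into a pointwise product. It says nothing about whether a trained network must use this representation, which is the point under dispute.

```python
import cmath

P = 7  # small modulus, chosen only for illustration (an assumption)


def one_hot(k, n=P):
    """One-hot vector of length n with a 1 at position k."""
    return [1.0 if i == k else 0.0 for i in range(n)]


def circular_convolve(a, b):
    """Circular convolution: out[k] = sum_j a[j] * b[(k - j) mod n]."""
    n = len(a)
    return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]


def dft(v):
    """Naive discrete Fourier transform (O(n^2), fine for tiny n)."""
    n = len(v)
    return [
        sum(v[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def add_via_convolution(x, y):
    """Recover (x + y) mod P as the peak of the convolved one-hot vectors."""
    conv = circular_convolve(one_hot(x), one_hot(y))
    return max(range(P), key=lambda k: conv[k])


def convolution_theorem_holds(x, y, tol=1e-9):
    """Check DFT(a * b) == DFT(a) . DFT(b) elementwise for one-hot inputs."""
    lhs = dft(circular_convolve(one_hot(x), one_hot(y)))
    fx, fy = dft(one_hot(x)), dft(one_hot(y))
    return all(abs(l - a * b) < tol for l, a, b in zip(lhs, fx, fy))
```

In the Fourier basis the convolution becomes a per-frequency multiplication, which is the sense in which the basis is "natural" for this operation.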

1 · Spencer Becker-Kahn · 6mo
Thanks for the comment Rohin, that's interesting (though I haven't looked at the paper you linked). I'll just record some confusion I had after reading your comment that stopped me replying initially: I was confused by the distinction between modular and non-modular because I kept thinking: If I add a bunch of numbers x and y and don't do any modding, then it is equivalent to doing modular addition modulo some large number (i.e. at least as large as the largest sum you get). And otoh if I tell you I'm doing 'addition modulo 113', but I only ever use inputs that add up to 112 or less, then you never see the fact that I was secretly intending to do modular addition. And these thoughts sort of stopped me having anything more interesting to add.  
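The two observations in this comment can be checked directly. The following is only an illustrative sketch; the input range `MAX_INPUT = 50` is an arbitrary assumption, and 113 is the modulus mentioned in the comment. It confirms that addition without wraparound coincides with addition modulo any number larger than the largest sum, and that "addition modulo 113" restricted to inputs summing to 112 or less is indistinguishable from plain addition.

```python
def add_mod(x, y, modulus):
    """Modular addition."""
    return (x + y) % modulus


# Case 1: plain addition equals modular addition with a large enough modulus.
MAX_INPUT = 50                      # hypothetical input range, for illustration
LARGE_MODULUS = 2 * MAX_INPUT + 1   # strictly larger than the largest sum
no_wraparound_matches = all(
    add_mod(x, y, LARGE_MODULUS) == x + y
    for x in range(MAX_INPUT + 1)
    for y in range(MAX_INPUT + 1)
)

# Case 2: modulo 113, but only pairs whose sum never reaches 113.
P = 113
small_pairs = [(x, y) for x in range(P) for y in range(P) if x + y <= P - 1]
wrap_never_seen = all(add_mod(x, y, P) == x + y for x, y in small_pairs)

# The remaining pairs are exactly the ones where the modulus is visible.
wrap_pairs = [(x, y) for x in range(P) for y in range(P) if x + y >= P]
wrap_changes_answer = all(add_mod(x, y, P) != x + y for x, y in wrap_pairs)
```

So a training set drawn only from `small_pairs` carries no information about the wraparound, which is the sense in which the two tasks are the same on that data.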

Since I'm an author on that paper, I wanted to clarify my position here. My perspective is basically the same as Steven's: there's a straightforward conceptual argument that goal-directedness leads to convergent instrumental subgoals, this is an important part of the AI risk argument, and the argument gains much more legitimacy and slightly more confidence in correctness by being formalized in a peer-reviewed paper.

I also think this has basically always been my attitude towards this paper. In particular, I don't think I ever thought of this paper as provid... (read more)

You're right, I incorrectly interpreted the sup as an inf, because I thought that they wanted to assume that there exists a prompt creating an adversarial example, rather than saying that every prompt can lead to an adversarial example.

I'm still not very compelled by the theorem -- it's saying that if adversarial examples are always possible (the sup condition you mention) and you can always provide evidence for or against adversarial examples (Definition 2) then you can make the adversarial example probable (presumably by continually providing evidence for adversarial examples). I don't really feel like I've learned anything from this theorem.
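A toy version of that reading, as a minimal sketch under stated assumptions rather than the paper's actual construction: suppose the model is a two-component mixture with prior weight alpha on a "bad" component, each sentence is simply labeled bad or good, and sentences are independent given the component. The per-sentence probabilities (0.9 and 0.1), the prior (0.01), and all names below are hypothetical. Each bad sentence in the prompt is evidence for the bad component, so the probability that the next sentence is bad climbs toward the bad component's own rate.

```python
def posterior_bad_component(alpha, p_bad_if_bad, p_bad_if_good, n_bad_sentences):
    """Posterior weight on the bad component after a prompt of n bad sentences.

    alpha: prior weight on the bad component.
    p_bad_if_bad / p_bad_if_good: chance that each component emits a bad sentence.
    """
    w_bad = alpha * p_bad_if_bad ** n_bad_sentences
    w_good = (1 - alpha) * p_bad_if_good ** n_bad_sentences
    return w_bad / (w_bad + w_good)


def prob_next_sentence_bad(alpha, p_bad_if_bad, p_bad_if_good, n_bad_sentences):
    """Predictive probability that the next sentence is bad, by Bayes' rule."""
    post = posterior_bad_component(
        alpha, p_bad_if_bad, p_bad_if_good, n_bad_sentences
    )
    return post * p_bad_if_bad + (1 - post) * p_bad_if_good


# Hypothetical numbers: a rare bad component that is easy to tell apart.
ALPHA, P_BAD_IF_BAD, P_BAD_IF_GOOD = 0.01, 0.9, 0.1
curve = [
    prob_next_sentence_bad(ALPHA, P_BAD_IF_BAD, P_BAD_IF_GOOD, n)
    for n in range(11)
]
```

With these numbers the curve starts near the good component's rate and approaches the bad component's rate within a handful of sentences, which is all the "continually providing evidence" step needs.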

3 · Johannes Treutlein · 8mo
My takeaway from looking at the paper is that the main work is being done by the assumption that you can split up the joint distribution implied by the model as a mixture distribution P = αP₀ + (1−α)P₁, such that the model does Bayesian inference in this mixture model to compute the next sentence given a prompt, i.e., we have P(s∣s₀) = P(s⊗s₀) / P(s₀). Together with the assumption that P₀ is always bad (the sup condition you talk about), this makes the whole approach with giving more and more evidence for P₀ by stringing together bad sentences in the prompt work.

To see why this assumption is doing the work, consider an LLM that completely ignores the prompt and always outputs sentences from a bad distribution with α probability and from a good distribution with (1−α) probability. Here, adversarial examples are always possible. Moreover, the bad and good sentences can be distinguishable, so Definition 2 could be satisfied. However, the result clearly does not apply (since you just cannot up- or downweigh anything with the prompt, no matter how long).

The reason for this is that there is no way to split up the model into two components P₀ and P₁, where one of the components always samples from the bad distribution. This assumption implies that there is some latent binary variable of whether the model is predicting a bad distribution, and the model is doing Bayesian inference to infer a distribution over this variable and then sample from the posterior. It would be violated, for instance, if the model is able to ignore some of the sentences in the prompt, or if it is more like a hidden Markov model that can also allow for the possibility of switching characters within a sequence of sentences (then either P₀ has to be able to also output good sentences sometimes, or the assumption P = αP₀ + (1−α)P₁ is violated).

I do think there is something to the paper, though. It seems that when talking e.g. about the Waluigi effect people often take the stance that the model is doing this
2 · Lukas Finnveden · 8mo
Yeah, I also don't feel like it teaches me anything interesting.

Did you ever look at the math in the paper? Theorem 1 looks straight up false to me (attempted counterexample) but possibly I'm misunderstanding their assumptions.

1 · Fabien Roger · 8mo
I hadn't looked at their math too deeply and it indeed seems that an assumption like "the bad behavior can be achieved after an arbitrarily long prompt" is missing (though maybe I missed a hypothesis where this implicitly shows up). Something along these lines is definitely needed for the summary I made to make any sense.

(cross-posted from EAF, thanks Richard for suggesting. There's more back-and-forth later.)

I'm not very compelled by this response.

It seems to me you have two points on the content of this critique. The first point:

I think it's bad to criticize labs that do hits-based research approaches for their early output (I also think this applies to your critique of Redwood) because the entire point is that you don't find a lot until you hit.

I'm pretty confused here. How exactly do you propose that funding decisions get made? If some random person says they are pursu... (read more)

8 · Marius Hobbhahn · 8mo
I'm not going to crosspost our entire discussion from the EAF.  I just want to quickly mention that Rohin and I were able to understand where we have different opinions and he changed my mind about an important fact. Rohin convinced me that anti-recommendations should not have a higher bar than pro-recommendations even if they are conventionally treated this way. This felt like an important update for me and how I view the post. 

Biggest disagreement I have with him is that, from my perspective, he understands that the optimization deck stacks against us, but he does not understand the degree to which the optimization deck is stacked against us or the extent to which this ramps up and diversifies and changes its sources as capabilities ramp up, and thus thinks many things could work that I don’t see as having much chance of working. I also don’t think he’s thinking enough about what type and degree of alignment would ‘be enough’ to survive the resulting Phase 2.

I'm not committing to it, but if you wrote up concrete details here, I expect I'd engage with it.

Thanks! I've seen various attempts to articulate different aspects of this but I may be able to find a better way to put it, and hopefully it would at least help you understand the position better.

Sounds plausible! I haven't played much with Conway's Life.

(Btw, you may want to make this comment on the original post if you'd like the original author to see it.)

I agree "information loss" seems kinda sketchy as a description of this phenomenon, it's not what I would have chosen.

I forget if I already mentioned this to you, but another example where you can interpret randomization as worst-case reasoning is MaxEnt RL, see this paper. (I reviewed an earlier version of this paper here (review #3).)

That does seem like better terminology! I'll go change it now.

Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)

Okay, I understand how that addresses my edit.

I'm still not quite sure why the lightcone theorem is a "foundation" for natural abstraction (it looks to me like a nice concrete example on which you could apply techniques) but I think I should just wait for future posts, since I don't really have any concrete questions at the moment.

4 · Thane Ruthenis · 9mo
My impression is that it being a concrete example is the why. "What is the right framework to use?" and "what is the environment-structure in which natural abstractions can be defined?" are core questions of this research agenda, and this sort of multi-layer locality-including causal model is one potential answer. The fact that it loops-in the speed of causal influence is also suggestive — it seems fundamental to the structure of our universe, crops up in a lot of places, so the proposition that natural abstractions are somehow downstream of it is interesting.

Okay, that mostly makes sense.

note that the resampler itself throws away a ton of information about  while going from  to . And that is indeed information which "could have" been relevant, but almost always gets wiped out by noise. That's the information we're looking to throw away, for abstraction purposes.

I agree this is true, but why does the Lightcone theorem matter for it?

It is also a theorem that a Gibbs resampler initialized at equilibrium will produce  distributed according to , and as you say it's c... (read more)

Sounds like we need to unpack what "viewing X0 as a latent which generates X" is supposed to mean.

I start with a distribution P[X]. Let's say X is a bunch of rolls of a biased die, of unknown bias. But I don't know that's what X is; I just have the joint distribution of all these die-rolls. What I want to do is look at that distribution and somehow "recover" the underlying latent variable (bias of the die) and factorization, i.e. notice that I can write the distribution as P[X] = ∑_Λ ∏_i P[X_i|Λ] P[Λ], where Λ is the bias in this case. Then when reasoning/updating, we can usually just think about how an individual die-roll interacts with Λ, rather than all the other rolls, which is useful insofar as Λ is much smaller than all the rolls.

Note that P[X|Λ] is not supposed to match P[X]; then the representation would be useless. It's the marginal ∑_Λ ∏_i P[X_i|Λ] P[Λ] which is supposed to match P[X].

The lightcone theorem lets us do something similar. Rather than all the X_i's being independent given Λ, only those X_i's sufficiently far apart are independent, but the concept is otherwise similar. We express P[X] as ∑_{X0} P[X|X0] P[X0] (or, really, ∑_Λ P[X|Λ] P[Λ], where Λ summarizes info in X0 relevant to X, which is hopefully much smaller than all of X).
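The die example can be made concrete with a minimal sketch. Everything specific here is an assumption for illustration: a latent with only two values ("fair" and "loaded"), a 50/50 prior, and a loaded die that shows 6 half the time. The joint over rolls is built by summing over the latent the prior times a product of per-roll terms, and the last lines show that the rolls are dependent marginally even though they are independent given the latent.

```python
from itertools import product

FACES = range(1, 7)

# Hypothetical latent: the die is either fair or loaded toward 6.
PRIOR = {"fair": 0.5, "loaded": 0.5}
ROLL_GIVEN_BIAS = {
    "fair": {face: 1 / 6 for face in FACES},
    "loaded": {face: (0.5 if face == 6 else 0.1) for face in FACES},
}


def joint(rolls):
    """P[X] for a tuple of rolls: sum over the latent of prior * product of rolls."""
    total = 0.0
    for bias, p_bias in PRIOR.items():
        likelihood = 1.0
        for roll in rolls:
            likelihood *= ROLL_GIVEN_BIAS[bias][roll]
        total += p_bias * likelihood
    return total


def posterior_over_bias(rolls):
    """P[bias | rolls], the only thing a new roll needs to 'know' about old rolls."""
    weights = {}
    for bias, p_bias in PRIOR.items():
        likelihood = 1.0
        for roll in rolls:
            likelihood *= ROLL_GIVEN_BIAS[bias][roll]
        weights[bias] = p_bias * likelihood
    normalizer = sum(weights.values())
    return {bias: w / normalizer for bias, w in weights.items()}


# The joint over two rolls is a proper distribution.
total_mass = sum(joint(pair) for pair in product(FACES, repeat=2))

# Marginally the rolls are NOT independent: seeing a 6 makes another 6 likelier.
p_six = joint((6,))
p_two_sixes = joint((6, 6))
```

Here `p_two_sixes` exceeds `p_six ** 2`, so the rolls carry information about each other, but all of that information flows through the small latent, which is what makes the factorization useful.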