All of Dan H's Comments + Replies

I asked for permission via Intercom to post this series on March 29th. Later, I asked for permission to use the [Draft] indicator and said it was written by others. I got permission for both of these, but the same person didn't give permission for both of these requests. Apologies this was not consolidated into one big ask with lots of context. (Feel free to get rid of any undue karma.)

Dan H22dΩ61313

It's a good observation that it's more efficient; does it trade off performance? (These sorts of comparisons would probably be demanded if it was submitted to any other truth-seeking ML venue, and I apologize for consistently being the person applying the pressures that generic academics provide. It would be nice if authors would provide these comparisons.)


Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabili

... (read more)
Some direct quantitative comparison between activation-steering and task-vector-steering (at, say, reducing toxicity) is indeed a very sensible experiment for a peer reviewer to ask for and I would like to see it as well.
Dan H22dΩ23-3

steering the model using directions in activation space is more valuable than doing the same with weights, because in the future the consequences of cognition might be far-removed from its weights (deep deceptiveness)

(You linked to "deep deceptiveness," and I'm going to assume it is related to self-deception (discussed in the academic literature and in the AI and evolution paper). If it isn't, then this point is still relevant for alignment since self-deception is another internal hazard.)

I think one could argue that self-deception could in some instances be ... (read more)

I personally don't "dismiss" the task vector work. I didn't read Thomas as dismissing it by not calling it the concrete work he is most excited about -- that seems like a slightly uncharitable read? I, personally, think the task vector work is exciting. Back in Understanding and controlling a maze-solving policy network, I wrote (emphasis added): I'm highly uncertain about the promise of activation additions. I think their promise ranges from pessimistic "superficial stylistic edits" to optimistic "easy activation/deactivation of the model's priorities at inference time." In the optimistic worlds, activation additions do enjoy extreme advantages over task vectors, like accessibility of internal model properties which aren't accessible to finetuning (see the speculation portion of the post). In the very pessimistic worlds, activation additions are probably less directly important than task vectors. I don't know what world we're in yet.
4Thomas Kwa22d
* Deep deceptiveness is not quite self-deception. I agree that there are some circumstances where defending from self-deception advantages weight methods, but these seem uncommon.
* I thought briefly about the Ilharco et al paper and am very impressed by it as well.
* Thanks for linking to the resources. I don't have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.
Note that task vectors require finetuning. From the newly updated related work section:
Dan H23dΩ7120

Page 4 of this paper compares negative vectors with fine-tuning for reducing toxic text:

In Table 3, they show in some cases task vectors can improve fine-tuned models.

Insofar as you mean to imply that "negative vectors" are obviously comparable to our technique, I disagree. Those are not activation additions, and I would guess it's not particularly similar to our approach. These "task vectors" involve subtracting weight vectors, not activation vectors. See also footnote 39 (EDIT: and the related work appendix now talks about this directly).

Yes, I'll tend to write up comments quickly so that I don't feel as inclined to get in detailed back-and-forths and use up time, but here we are. When I wrote it, I thought there were only 2 things mentioned in the related works until Daniel pointed out the formatting choice, and when I skimmed the post I didn't easily see comparisons or discussion that I expected to see, hence I gestured at needing more detailed comparisons. After posting, I found a one-sentence comparison of the work I was looking for, so I edited to include that I found it, but it was oddly not emphasized. A more ideal comment would have been "It would be helpful to me if this work would more thoroughly compare to (apparently) very related works such as ..."

I'm also not able to evaluate the object-level of "was this post missing obvious stuff it'd have been good to improve", but, something I want to note about my own guess of how an ideal process would go from my current perspective: I think it makes more sense to think of posting on LessWrong as "submitting to a journal", than "publishing a finished paper." So, the part where some people then comment "hey, this is missing X" is more analogous to the thing where you submit to peer review and they say "hey, you missed X", than publishing a finished paper in a journal and it missing X. I do think a thing LessWrong is missing (or, doesn't do a good enough job at) is a "here is the actually finished stuff". I think the things that end up in the Best of LessWrong, after being subjected to review, are closer to that, but I think there's room to improve that more, and/or have some kind of filter for stuff that's optimized to meet academic-expectations-in-particular.

In many of my papers, there aren't fairly similar works (I strongly prefer to work in areas before they're popular), so there's a lower expectation for comparison depth, though breadth is always standard. In other works of mine, such as this paper on learning the right thing in the presence of extremely bad supervision/extremely bad training objectives, we contrast with the two main related works for two paragraphs, and compare to these two methods for around half of the entire paper.

The extent of an adequate comparison depends on the relatedness. I'm ... (read more)

Yeah, it's totally possible that, as I said, there is a specific other paper that is important to mention or where the existing comparison seems inaccurate. This seems quite different from a generic "please have more thorough related work sections" request like the one you make in the top-level comment (which my guess is was mostly based on your misreading of the post and thinking the related work section only spans two paragraphs). 
Dan H23dΩ3131

Yes, I was--good catch. Earlier and now, unusual formatting and a nonstandard related works section are causing confusion. Even so, the work after the break is much older. The comparison to works such as is not in the related works and gets a sentence in a footnote: "That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors."

Is this a big difference? I really don't know; it'd be helpful if they'd contrast more. Is this work very novel and useful, an... (read more)


On the object-level, deriving task vectors in weight-space from deltas in fine-tuned checkpoints is really different from what was done here, because it requires doing a lot of backward passes on a lot of data. Deriving task vectors in activation-space, as done in this new work, requires only a single forward pass on a truly tiny amount of data. So the data-efficiency and compute-efficiency of the steering power gained with this new method is orders of magnitude better, in my view.
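To make the contrast above concrete, here is a minimal toy sketch in pure Python. It is not the method from the post or from the task-vector paper: the one-layer "model", the hand-picked numbers, and every function name are hypothetical, chosen only to show where each kind of vector comes from. The weight-space vector assumes a fine-tuned checkpoint already exists (the expensive part), while the activation-space vector is built from forward passes on two contrasting inputs.

```python
# Toy illustration only: a one-layer linear "model" with hand-picked numbers.
# All names and values are hypothetical; nothing here is taken from the post.

def matvec(W, x):
    """Forward pass of the toy layer: hidden = W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(a, b, scale=1.0):
    """Elementwise a + scale * b for two vectors."""
    return [ai + scale * bi for ai, bi in zip(a, b)]

# --- Weight-space task vector: requires a fine-tuned checkpoint ---
W_pre = [[1.0, 0.0], [0.0, 1.0]]   # pretrained weights
W_ft = [[1.0, 0.5], [0.0, 1.0]]    # weights after finetuning (assumed given)

# Task vector = fine-tuned weights minus pretrained weights.
task_vector = [[f - p for f, p in zip(rf, rp)] for rf, rp in zip(W_ft, W_pre)]

def apply_task_vector(W, tv, coeff):
    """Add (coeff > 0) or subtract (coeff < 0) the task vector in weight space."""
    return [[w + coeff * t for w, t in zip(rw, rt)] for rw, rt in zip(W, tv)]

W_negated = apply_task_vector(W_pre, task_vector, -1.0)

# --- Activation-space steering vector: forward passes only, no training ---
x_pos = [1.0, 0.0]   # stand-in for one embedded contrast input
x_neg = [0.0, 1.0]   # stand-in for the opposite contrast input

# Steering vector = difference of hidden activations on the two inputs.
steering_vector = add(matvec(W_pre, x_pos), matvec(W_pre, x_neg), scale=-1.0)

def steered_hidden(W, x, vec, coeff):
    """Run the layer, then add the scaled steering vector to its activations."""
    return add(matvec(W, x), vec, scale=coeff)

print(steered_hidden(W_pre, [1.0, 1.0], steering_vector, 2.0))
```

The design point is only that `task_vector` cannot be computed without `W_ft`, whereas `steering_vector` needs nothing beyond the pretrained weights and two inputs; how the two compare in steering quality is exactly the open empirical question discussed in this thread.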

Also, taking affine combinations in weight-space is not novel to Schmidt et ... (read more)

The level of comparison between the present paper and this paper seems about the same as I see in papers you have been a co-author on. E.g. in one of them the Related Works section is basically just a list of papers, with maybe half a sentence describing their relation to the paper. This seems normal and fine, and I don't see even papers you are a co-author on doing something substantively different here (this is again separate from whether there are any important papers omitted from the list of related works, or whether any specific comparisons are inaccurate, it's just making a claim about the usual level of detail that related works sections tend to go into).

Background for people who understandably don't habitually read full empirical papers:
Related Works sections in empirical papers tend to include many comparisons in a coherent place. This helps contextualize the work and helps busy readers quickly identify if this work is meaningfully novel relative to the literature. Related works must therefore also give a good account of the literature. This helps us more easily understand how much of an advance this is. I've seen a good number of papers steering with latent arithmetic in the past year, but I would be su... (read more)

I think you might be interpreting the break after the sentence "Their results are further evidence for feature linearity and internal activation robustness in these models." as the end of the related work section? I'm not sure why that break is there, but the section continues with them citing Mikolov et al (2013), Larsen et al (2015), White (2016), Radford et al (2016), and Upchurch et al (2016) in the main text, as well as a few more papers in footnotes.
Dan H24dΩ2174

Could these sorts of posts have more thorough related works sections? It's usually standard for related works in empirical papers to mention 10+ works. Update: I was looking for a discussion of, assumed it wasn't included in this post, and many minutes later finally found a brief sentence about it in a footnote.

2Bogdan Ionut Cirstea11d
The (overlapping) evidence from Deep learning models might be secretly (almost) linear could also be useful / relevant, as well as these 2 papers on 'semantic differentials' and (contextual) word embeddings: SensePOLAR: Word sense aware interpretability for pre-trained contextual word embeddings, Semantic projection recovers rich human knowledge of multiple object features from word embeddings.
Thanks for the feedback. Some related work was "hidden" in footnotes because, in an earlier version of the post, the related work was in the body and I wanted to decrease the time it took a reader to get to our results. The related work section is now basically consolidated into the appendix. I also added another paragraph:

I don't understand this comment. I did a quick count of related works that are mentioned in the "Related Works" section (and the footnotes of that section) and got around 10 works, so seems like this is meeting your pretty arbitrarily established bar, and there are also lots of footnotes and references to related work sprinkled all over the post, which seems like the better place to discuss related work anyways.

I am not familiar enough with the literature to know whether this post is omitting any crucial pieces of related work, but the relevant section of ... (read more)

Maybe also [1607.06520] Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings is relevant as early (2016) work concerning embedding arithmetic.

Answer by Dan HMar 07, 202360

Open Problems in AI X-Risk:

Thermodynamics theories of life can be viewed as a generalization of Darwinism, though in my opinion the abstraction ends up being looser/less productive, and I think it's more fruitful just to talk in evolutionary terms directly.

You might find these useful:

God's Utility Function

A New Physics Theory of Life

Entropy and Life (Wikipedia)

AI and Evolution

1Jonas Hallgren3mo
I understand how that is generally the case, especially when considering evolutionary systems' properties. My underlying reason for developing this is that I predict using ML methods on entropy-based descriptions of chaos in NNs will be easier than looking at pure utility functions when it comes to power-seeking.  I imagine that there is a lot more work on existing methods for measuring causal effects and entropy descriptions of the internal dynamics of a system. I will give an example as the above seems like I'm saying "emergence" as an answer to why consciousness exists, it's non-specific.  If I'm looking at how deception will develop inside an agent, I can think of putting internal agents or shards against each other in some evolutionary tournament. I don't know how to set up an arbitrary utility for these shards, so I don't know how to use the evolutionary theory here. I do know how to set up a potential space of the deception system landscape based on a linear space of the significant predictive variables. I can then look at how much each shard is affecting the predictive variables and then get a prediction of what shard/inner agent will dominate the deception system through the level of power-seeking it has. Now I'm uncertain whether I would need to care about the free energy minimisation part of it or not. Still, it seems to me that it is more useful to describe power-seeking and what shard/inner agent ends up on top in terms of information entropy. (I might be wrong and if so I would be happy to be told so.)
Dan H4moΩ8129

"AI Safety" which often in practice means "self driving cars"

This may have been true four years ago, but ML researchers at leading labs rarely directly work on self-driving cars (e.g., research on sensor fusion). AV has not been hot in quite a while. Fortunately now that AGI-like chatbots are popular, we're moving out of the realm of talking about making very narrow systems safer. The association with AV was not that bad since it was about getting many nines of reliability/extreme reliability, which was a useful subgoal. Unfortunately the world has not ... (read more)

2David Scott Krueger (formerly: capybaralet)4mo
Unfortunately, I think even "catastrophic risk" has a high potential to be watered down and be applied to situations where dozens as opposed to millions/billions die.  Even existential risk has this potential, actually, but I think it's a safer bet.

When ML models get more competent, ML capabilities researchers will have strong incentives to build superhuman models. Finding superhuman training techniques would be the main thing they'd work on. Consequently, when the problem is more tractable, I don't see why it'd be neglected by the capabilities community--it'd be unreasonable for profit maximizers not to have it as a top priority when it becomes tractable. I don't see why alignment researchers have to work in this area with high externalities now and ignore other safe alignment research areas (in pra... (read more)

Empiricists think the problem is hard, AGI will show up soon, and if we want to have any hope of solving it, then we need to iterate and take some necessary risk by making progress in capabilities while we go.

This may be so for the OpenAI alignment team's empirical researchers, but other empirical researchers note we can work on several topics to reduce risk without substantially advancing general capabilities. (As far as I can tell, they are not working on any of the following topics, rather focusing on an avenue to scalable oversight which, as instantiat... (read more)

1Shoshannah Tekofsky4mo
Thank you! I appreciate the in-depth comment. Do you think any of these groups hold that all of the alignment problem can be solved without advancing capabilities?

For a discussion of capabilities vs safety, I made a video about it here, and a longer discussion is available here.

Sorry, I am just now seeing this since I'm on here irregularly.

So any robustness work that actually improves the robustness of practical ML systems is going to have "capabilities externalities" in the sense of making ML products more valuable.

Yes, though I do not equate general capabilities with making something more valuable. As written elsewhere,

It’s worth noting that safety is commercially valuable: systems viewed as safe are more likely to be deployed. As a result, even improving safety without improving capabilities could hasten the onset of x-risks.

... (read more)

making them have non-causal decision theories

How does it distinctly do that?

It's from the post: Discovering Language Model Behaviors with Model-Written Evaluations, where they have this to say about it: Basically, the AI is intending to one-box on Newcomb's problem, which is a sure sign of non-causal decision theories, since causal decision theory chooses to two-box on Newcomb's problem.
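The one-box versus two-box distinction can be shown with a short worked calculation. This is a sketch under stated assumptions: the payoffs are the usual textbook values for Newcomb's problem, the predictor accuracy of 0.99 is an illustrative number, and the function names are hypothetical. None of it comes from the linked evaluation paper.

```python
# Textbook payoffs for Newcomb's problem (assumed, not taken from the paper).
BIG, SMALL = 1_000_000, 1_000

def evidential_value(action, accuracy):
    """Expected payoff if the choice is treated as evidence about the prediction."""
    if action == "one-box":
        # The opaque box is full whenever the predictor was right.
        return accuracy * BIG
    # Two-boxing: the opaque box is full only when the predictor was wrong.
    return (1 - accuracy) * (BIG + SMALL) + accuracy * SMALL

def causal_value(action, p_box_full):
    """Expected payoff if the box contents are held fixed whatever is chosen."""
    bonus = SMALL if action == "two-box" else 0
    return p_box_full * BIG + bonus

accuracy = 0.99  # illustrative assumption
print(evidential_value("one-box", accuracy), evidential_value("two-box", accuracy))

# Under the causal calculation, two-boxing is ahead by SMALL for any fixed contents.
for p in (0.0, 0.5, 1.0):
    print(p, causal_value("two-box", p) - causal_value("one-box", p))
```

Treating the choice as evidence favors one-boxing when the predictor is accurate, while holding the contents fixed makes two-boxing dominate, which is why a stated intention to one-box is read as a sign of a non-causal decision theory.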

Salient examples are robustness and RLHF. I think following the implied strategy---of avoiding any safety work that improves capabilities ("capability externalities")---would be a bad idea.

There are plenty of topics in robustness, monitoring, and alignment that improve safety differentially without improving vanilla upstream accuracy: most adversarial robustness research does not have general capabilities externalities; topics such as transparency, trojans, and anomaly detection do not; honesty efforts so far do not have externalities either. Here is analy... (read more)

I agree that some forms of robustness research don't have capabilities externalities, but the unreliability of ML systems is a major blocker to many applications. So any robustness work that actually improves the robustness of practical ML systems is going to have "capabilities externalities" in the sense of making ML products more valuable.

I disagree even more strongly with "honesty efforts don't have externalities:" AI systems confidently saying false statements is a major roadblock to lots of applications (e.g. any kind of deployment by Google), so this... (read more)

Much more importantly, I think RLHF has backfired in general, due to breaking myopia and making them have non-causal decision theories, and only one condition is necessary to make this alignment scheme net negative.
Dan H9moΩ235222

I am strongly in favor of our very best content going on arXiv. Both communities should engage more with each other.

The following are suggestions for posting to arXiv. As a rule of thumb, if the content of a blogpost didn't take >300 hours of labor to create, then it probably should not go on arXiv. Maintaining a basic quality bar prevents arXiv from being overrun by people who like writing up many of their inchoate thoughts; publication standards are different for LW/AF than for arXiv. Even if a researcher spent many hours on the project, arXiv moderato... (read more)

As an explanation, because this just took me 5 minutes of search: This is the section "Computers and Society (cs.CY)"

Strongly agree. Here are three examples of work I've put on arXiv which originated from the forum, which might be helpful as a touchstone. The first was cited 7 times the first year, and 50 more times since. The latter two were posted last year, and have not been indexed by Google as having been cited yet.

As an example of a technical but fairly conceptual paper, there is the Categorizing Goodhart's law paper. I pushed for this to be a paper rather than just a post, and I think that the resulting exposure was very worthwhile. Scott wrote the original pos... (read more)

Dan H9moΩ5110

Here's a continual stream of related arXiv papers available through reddit and twitter.

Dan H9moΩ51310

I should say formatting is likely a large contributing factor for this outcome. Tom Dietterich, an arXiv moderator, apparently had a positive impression of the content of your grokking analysis. However, research on arXiv will be more likely to go live if it conforms to standard (ICLR, NeurIPS, ICML) formatting and isn't a blogpost automatically exported into a TeX file.

I agree that formatting is the most likely issue. The content of Neel's grokking work is clearly suitable for arXiv (just very solid ML work). And the style of presentation of the blog post is already fairly similar to a standard paper (e.g. it has an Introduction section, lists contributions in bullet points, ...).

So yeah, I agree that formatting/layout probably will do the trick (including stuff like academic citation style).

This is why we introduced X-Risk Sheets, a questionnaire that researchers should include in their paper if they're claiming that their paper reduces AI x-risk. This way researchers need to explain their thinking and collect evidence that they're not just advancing capabilities.

We now include these x-risk sheets in our papers. For example, here is an example x-risk sheet included in an arXiv paper we put up yesterday.

At first glance of seeing this, I'm reminded of the safety questionnaires I had to fill out as part of running a study when taking experimental psychology classes in undergrad. It was a lot of annoyance and mostly a box ticking exercise. Everyone mostly did what they wanted to do anyway, and then hurriedly gerrymandered that questionnaire right before the deadline, so the faculty would allow them to proceed. Except the very conscientious students, who saw this as an excellent opportunity to prove their box ticking diligence. 

As a case in point, I migh... (read more)

Note I'm mainly using this as an opportunity to talk about ideas and compute in NLP.

I don't know how big an improvement DeBERTaV2 is over SoTA.

DeBERTaV2 is pretty solid and mainly got its performance from an architectural change. Note the DeBERTa paper was initially uploaded in 2020, but it was updated early this year to include DeBERTa V2. The previous main popular SOTA on SuperGLUE was T5 (which beat RoBERTa). DeBERTaV2 uses 8x fewer parameters and 4x less compute than T5. DeBERTa's high performance isn't an artifact of SuperGLUE; in downstream tasks suc... (read more)

RE: "like I'm surprised if a clever innovation does more good than spending 4x more compute"

Earlier this year, DeBERTaV2 did better on SuperGLUE than models 10x the size and got state of the art.

Models such as DeBERTaV3 can do better on commonsense question answering tasks than models that are tens or several hundreds of times larger.


Accuracy: 84.6, Parameters: 0.4B

Accuracy: 83.5, Parameters: 11B

Fine-tuned GPT-3: Accuracy: 73.0, Parameters: 175B
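The "tens or several hundreds of times larger" claim can be checked directly against the figures listed above. The short sketch below only does that arithmetic; the row labels for the first two entries are placeholders, because the model names for those rows are not preserved in this excerpt.

```python
# Accuracy and parameter counts (in millions) copied from the rows above.
# The first two labels are placeholders; the model names are not preserved here.
rows = [
    ("0.4B-parameter model", 84.6, 400),
    ("11B-parameter model", 83.5, 11_000),
    ("Fine-tuned GPT-3", 73.0, 175_000),
]

smallest_params = rows[0][2]

# Size of each model relative to the smallest one.
size_ratio = {name: params / smallest_params for name, _, params in rows}

for name, accuracy, _ in rows:
    print(f"{name}: accuracy {accuracy}, {size_ratio[name]}x the smallest model")
```

On these numbers the 11B model is 27.5 times larger and the 175B model is 437.5 times larger than the 0.4B model, while both score lower on the reported accuracy.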

Bidirectional models + train... (read more)

* ETA: I'm talking about the comparison to SOTA from a new clever trick. I'm not saying that "the cumulative impacts of all clever ideas is <4x," that would be obviously insane. (I don't know how big an improvement DeBERTaV2 is over SoTA. But isn't RoBERTa from August 2019, basically contemporary with SuperGLUE, and gets 84.6% accuracy with many fewer parameters than T5? So I don't think I care at all about the comparison to T5.)
* I said I would be surprised in a couple years, not that I would be surprised now.
* I'm less surprised on SuperGLUE than downstream applications.
* Much of the reason for the gap seems to be that none of the models you are comparing DeBERTaV2 against seem to be particularly optimized for SuperGLUE performance (in part because it's a new-ish benchmark that doesn't track downstream usefulness that well, so it's not going to be stable until people try on it). (ETA: actually isn't this just because you aren't comparing to SOTA? I think this was probably just a misunderstanding.)
* Similarly, I expect people to get giant model size gains on many more recent datasets for a while at the beginning (if people try on them), but I think the gains from small projects or single ideas will be small by the time that a larger effort has been made.

In safety research labs in academe, we do not have a resource edge compared to the rest of the field.

We do not have large GPU clusters, so we cannot train GPT-2 from scratch or fine-tune large language models in a reasonable amount of time.

We also do not have many research engineers (currently zero) to help us execute projects. Some of us have safety projects from over a year ago on the backlog because there are not enough reliable people to help execute the projects.

These are substantial bottlenecks that more resources could resolve.