Recent Discussion

Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort.

What if I told you that in just one weekend you can get up to speed doing practical Mechanistic Interpretability research on Transformers? Surprised? Then this is your tutorial!

I'll give you a view into how I research Transformer circuits in practice, show you the tools you need, and explain my thought process along the way. I focus on the practical side of getting started with interventions; for more background see point 2 below.


  1. Understanding the Transformer architecture: Know what the residual stream is, how attention layers and MLPs work, and how logits & predictions work. For future sections familiarity with multi-head attention is useful. Here’s a link to Neel’s glossary which provides excellent explanations for most
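To make the residual-stream picture concrete, here is a minimal single attention head in numpy. This is my own sketch (dimensions, weight names, and the causal mask are illustrative, not from the post): the head reads queries, keys, and values from the residual stream and writes its output back additively.

```python
# Minimal sketch of one attention head reading from and writing to the
# residual stream. Illustrative only: real Transformers add layer norm,
# multiple heads, biases, and learned embeddings.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(resid, W_Q, W_K, W_V, W_O):
    """resid: (seq, d_model) residual stream; returns the head's additive update."""
    Q = resid @ W_Q                                # (seq, d_head)
    K = resid @ W_K
    V = resid @ W_V
    scores = Q @ K.T / np.sqrt(W_Q.shape[1])       # scaled dot-product scores
    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[mask] = -1e9
    pattern = softmax(scores, axis=-1)             # (seq, seq) attention pattern
    return pattern @ V @ W_O                       # (seq, d_model) update

rng = np.random.default_rng(0)
d_model, d_head, seq = 8, 4, 5
resid = rng.normal(size=(seq, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(d_head, d_model))
out = attention_head(resid, W_Q, W_K, W_V, W_O)
new_resid = resid + out  # the head writes additively into the residual stream
```

The additive update is the key point for circuit-style interventions: you can ablate or patch a head's contribution by editing `out` before it is added back in.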

Awesome, updated!

(Probably somebody else has said most of this. But I personally haven't read it, and felt like writing it down myself, so here we go.)



I think that EA burnout usually results from prolonged dedication to satisfying the values you think you should have, while neglecting the values you actually have.

Setting aside for the moment what “values” are and what it means to “actually” have one, suppose that I actually value these things (among others):

True Values

  • Abundance
  • Power
  • Novelty
  • Social Harmony
  • Beauty
  • Growth
  • Comfort
  • The Wellbeing Of Others
  • Excitement
  • Personal Longevity
  • Accuracy

One day I learn about “global catastrophic risk”: Perhaps we’ll all die in a nuclear war, or an AI apocalypse, or a bioengineered global pandemic, and perhaps one of these things will happen quite soon. 

I recognize that GCR is a direct threat to The Wellbeing Of Others and...

This post crystallized some thoughts that have been floating in my head, inchoate, since I read Zvi's stuff on slack and Valentine's "Here's the Exit."

Part of the reason that it's so hard to update on these 'creative slack' ideas is that we make deals among our momentary mindsets to work hard when it's work-time. (And when it's literally the end of the world at stake, it's always work-time.) "Being lazy" is our label for someone who hasn't established that internal deal between their varying mindsets, and so is flighty and hasn't precommitted to getting st...

It seems like a major issue here is that people often have limited introspective access to what their "true values" are. And it's not enough to know some of your true values; in the example you give the fact that you missed one or two causes problems even if most of what you're doing is pretty closely related to other things you truly value. (And "just introspect harder" increases the risk of getting answers that are the results of confabulation and confirmation bias rather than true values, which can cause other problems.)
A quote I find relevant:
I'm still mulling this over and may continue doing so for a while. I really appreciate this comment though, and I do expect to respond to it. :)

Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort

A big thank you to all of the people who gave me feedback on this post: Edmund Lao, Dan Murfet, Alexander Gietelink Oldenziel, Lucius Bushnaq, Rob Krzyzanowski, Alexandre Variengen, Jiri Hoogland, and Russell Goyder.

Statistical learning theory is lying to you: "overparametrized" models actually aren't overparametrized, and generalization is not just a question of broad basins.

The standard explanation thrown around here for why neural networks generalize well is that gradient descent settles in flat basins of the loss function. On the left, in a sharp minimum, the updates bounce the model around. Performance varies considerably with new examples. On the right, in a flat minimum, the updates settle to zero. Performance is stabler under perturbations.
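The sharp-vs-flat intuition above can be checked with a one-dimensional toy (the quadratics below are my own stand-ins, not from the post): perturb the minimizer w* = 0 of two losses with the same noise and compare how much the loss degrades.

```python
# Toy illustration of the flat-vs-sharp-basin intuition: the same parameter
# perturbations hurt a high-curvature minimum far more than a flat one.
import numpy as np

def sharp_loss(w):
    return 50.0 * w**2   # high curvature: sharp minimum

def flat_loss(w):
    return 0.5 * w**2    # low curvature: flat minimum

rng = np.random.default_rng(0)
perturbations = rng.normal(scale=0.1, size=1000)  # stand-in for update noise
sharp_degradation = sharp_loss(perturbations).mean()
flat_degradation = flat_loss(perturbations).mean()
# The 100x curvature ratio shows up directly as ~100x worse average loss
# under identical perturbations.
```

This is exactly the picture in the paragraph above: in the sharp basin, performance varies considerably under perturbation; in the flat basin it is much stabler.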

To first...

I'm still thinking about this (unsuccessfully). Maybe my missing piece is that the examples I'm considering here still do not have any of the singularities that this topic focuses on! What are the simplest examples with singularities? Say again we're fitting y = f(x) over some parameters, and specifically let's consider the points (0,0) and (1,0) as our only training data. Then f1(x) = ab + cx has minimal loss set {(a=0 or b=0) and c=0}. That has a singularity at (0,0,0). I don't really see why it would generalize better than f2(x) = a + cx or f3(x) = a + b + cx, neither of which has singularities in its minimal loss set. These still are only examples of the type B behavior where they already are effectively just two parameters, so maybe there's no further improvement for a singularity to give?

Consider instead f4(x) = a + bx + cdx^2. Here the minimal loss set has a singularity at (0,0,0,0). But maybe now if we're at that point, the model has effectively reduced down to f4(x) = a + bx + 0, since perturbing either c or d away from zero would still keep the last term zero. So maybe this is a case where f4 has type A behavior in general (since the x^2 term can throw off generalizability compared to a linear fit) but approximates type B behavior near the singularity (since the x^2 term stays negligible even if perturbed)? That seems to be the best picture of this argument that I've been able to convince myself of so far! Singularities are (sometimes) points where type A behavior becomes type B behavior.
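A quick numeric check of the comment's claim about f4(x) = a + b*x + c*d*x^2 at the candidate singular point (a, b, c, d) = (0, 0, 0, 0), using the two training points (0,0) and (1,0) from the comment (the code is mine, just mechanizing the argument):

```python
# Check: at the origin, perturbing c or d alone leaves the fit (and loss)
# unchanged because the c*d coefficient stays zero, while perturbing b
# immediately moves the fit off the training data.
import numpy as np

X = np.array([0.0, 1.0])
Y = np.array([0.0, 0.0])

def loss(a, b, c, d):
    pred = a + b * X + c * d * X**2
    return float(np.mean((pred - Y) ** 2))

eps = 1e-2
zero_loss = loss(0, 0, 0, 0)   # loss at the singular point
c_only = loss(0, 0, eps, 0)    # perturb c alone: c*d term still zero
d_only = loss(0, 0, 0, eps)    # perturb d alone: c*d term still zero
b_only = loss(0, eps, 0, 0)    # perturb b alone: fit leaves the data
```

So near the singularity the x^2 term is robustly negligible under single-parameter perturbations, matching the "approximates type B behavior" reading above.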

I wrote a follow-up that should be helpful to see an example in more detail. The example I mention is the loss function (=potential energy) . There's a singularity at the origin. 

This does seem like an important point to emphasize: symmetries in the model  (or  if you're making deterministic predictions) and the true distribution  lead to singularities in the loss landscape . There's an important distinction between  and .

In this post I’m going to describe my basic justification for working on RLHF in 2017-2020, which I still stand behind. I’ll discuss various arguments that RLHF research had an overall negative impact and explain why I don’t find them persuasive.

I'll also clarify that I don't think research on RLHF is automatically net positive; alignment research should address real alignment problems, and we should reject a vague association between "RLHF progress" and "alignment progress."

Background on my involvement in RLHF work

Here are some background views about alignment I held in 2015 and still hold today. I expect disagreements about RLHF will come down to disagreements about this background:

  • The simplest plausible strategies for alignment involve humans (maybe with the assistance of AI systems) evaluating a model’s actions based on

Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.


Has ARC got a written policy for if/when similar experiments generate inconclusive but plausible evidence of dangerous behaviour?

If so, would you consider sharing it (or a non-confidential version) for other organisations to use?

It depends a lot on the use case. When it comes to what I'm doing with ChatGPT, I care more about the quality of the best answer when I generate five answers to a prompt than I care about the quality of the worst answer. I can choose the best answer myself and ignore the others. Many use cases have ways to filter for valuable results either automatically or by letting a human filter.
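The filtering strategy described here is best-of-n selection. A minimal sketch (the `generate` and `score` callables are placeholders of my own for an LLM call and a quality metric):

```python
# Best-of-n filtering: generate several candidate answers and keep the one a
# scoring function likes best, discarding the rest.
import itertools
from typing import Callable, List

def best_of_n(generate: Callable[[], str],
              score: Callable[[str], float],
              n: int = 5) -> str:
    candidates: List[str] = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins so the sketch runs: cycle through canned outputs and score
# by length (a real setup would call a model and use a learned or human score).
fake_outputs = itertools.cycle(["weak answer", "ok answer", "great answer detailed"])
result = best_of_n(lambda: next(fake_outputs), score=len, n=5)
```

The point in the comment is that this selection step means the worst of the five generations barely matters for the user experience.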
Fair. I think the crucial question in Ajeya & Matthew's discussion of "Why the hype now?" is exactly how much worse the non-RLHF models that had been available since at least last March (davinci, code-davinci-002, text-davinci-002) actually were than the RLHF models made available just recently (text-davinci-003 and ChatGPT's underlying model). I stand by the opinion that, besides the new chat stuff, most of the improvement happened within the old cohort rather than between cohorts, so I attribute the recent hype to the convenient and free chat interface.
Would love to learn more about the model(s) behind CharacterAI. Anyone know if there's publicly available information on them?

I want to thank Sebastian Farquhar, Laurence Midgley and Johan van den Heuvel, for feedback and discussion on this post. 

Some time ago I asked the question “What is the role of Bayesian ML in AI safety/alignment?”. The response of the EA and Bayesian ML community was very helpful. Thus, I decided to collect and distill the answers and provide more context for current and future AI safety researchers.

Clarification: I don’t think many people (<1% of the alignment community) should work on Bayesian ML or that it is even the most promising path to alignment. I just want to provide a perspective and give an overview. I personally am not that bullish on Bayesian ML anymore (see shortcomings) but I’m in a relatively unique position where I have a decent...

Lots of Bayes fans, but they can't seem to define what Bayes is. Since Bayes' theorem is a reformulation of the chain rule, anything probabilistic "uses Bayes' theorem" somewhere, including all frequentist methods. Frequentists quantify uncertainty also, via confidence sets and other ways. Continuous updating has to do with "online learning algorithms," not Bayes. --- Bayes is when the target of inference is a posterior distribution. Bonus Bayes points: you don't care about frequentist properties like consistency of the estimator.
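A minimal example of the commenter's definition ("Bayes is when the target of inference is a posterior distribution"): conjugate Beta-Binomial updating, where the output of inference is the posterior's parameters rather than a point estimate.

```python
# Beta-Binomial conjugate update: prior Beta(alpha, beta) plus coin-flip data
# yields posterior Beta(alpha + heads, beta + tails). The posterior
# distribution itself is the target of inference.
from math import isclose

def beta_binomial_posterior(alpha: float, beta: float,
                            heads: int, tails: int) -> tuple:
    """Return the posterior Beta parameters after observing the data."""
    return alpha + heads, beta + tails

# Uniform prior Beta(1, 1), then observe 7 heads and 3 tails.
a, b = beta_binomial_posterior(1, 1, heads=7, tails=3)
posterior_mean = a / (a + b)   # mean of Beta(8, 4) = 8/12 = 2/3
```

A frequentist analysis of the same data would instead report a point estimate (7/10) with a confidence interval, which is the contrast the comment is drawing.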

Also, people on Less Wrong are normally interested in a different kind of Bayesianism: Bayesian epistemology. The type philosophers talk about. But Bayesian ML is based on the other type of Bayesianism: Bayesian statistics. The two have little to do with each other.

What’s the type signature of an agent?

For instance, what kind-of-thing is a “goal”? What data structures can represent “goals”? Utility functions are a common choice among theorists, but they don’t seem quite right. And what are the inputs to “goals”? Even when using utility functions, different models use different inputs - Coherence Theorems imply that utilities take in predefined “bet outcomes”, whereas AI researchers often define utilities over “world states” or “world state trajectories”, and human goals seem to be over latent variables in humans’ world models.

And that’s just goals. What about “world models”? Or “agents” in general? What data structures can represent these things, how do they interface with each other and the world, and how do they embed in their low-level world? These are all...
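One hypothetical way to render the "type signature" question in code (these Protocols are my illustration, not the post's proposal): a utility over world states and a utility over trajectories are genuinely different types, and the coherence-theorem and RL framings in the paragraph above pick different ones.

```python
# Two candidate type signatures for a "goal". The distinction the post draws
# (utilities over bet outcomes / world states vs. over trajectories) shows up
# as incompatible callable types.
from typing import Protocol, Sequence, TypeVar

WorldState = TypeVar("WorldState", contravariant=True)

class StateUtility(Protocol[WorldState]):
    """A goal evaluated on a single world state."""
    def __call__(self, state: WorldState) -> float: ...

class TrajectoryUtility(Protocol[WorldState]):
    """A goal evaluated on a whole history of world states."""
    def __call__(self, trajectory: Sequence[WorldState]) -> float: ...

# Toy instances (world states are just ints here):
total_money: StateUtility[int] = lambda state: float(state)
smooth_ride: TrajectoryUtility[int] = lambda traj: -float(max(traj) - min(traj))
```

Neither type covers goals over latent variables in a world model, which is part of why the post says utility functions "don't seem quite right."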

Review

[This review is currently a work in progress[1].]

Epistemic status: I am an aspiring selection theorist and I have thoughts[2].

Introduction

I explain why I'm excited about the post, then cover where I think it falls short and offer a few suggestions for improvement.

Why Selection Theorems?

Learning about selection theorems was very exciting. It's one of those concepts that felt so obviously right: a missing component in my alignment ontology that just clicked and made everything stronger. I think that selection theorems provide a robust framework with which to formulate (and prove) safety desiderata/guarantees for AI systems that are robust to arbitrary capability amplification. Furthermore, selection theorems seem to be very robust to paradigm shifts in the development of artificial intelligence. That is, regardless of what changes in architecture or training methodology subsequent paradigms may bring, I expect selection theoretic results to still apply[3].

Digression: Asymptotic Analysis

My preferred analogy for selection theorems is asymptotic complexity in computer science. Using asymptotic analysis we can make highly non-trivial statements about the performance of particular algorithms that abstract away the underlying architecture, hardware, and other implementation details. As long as the implementation of the algorithm is amenable to our (very general) models of computation, the complexity theoretic guarantee will generally still apply. For example, we have a very robust proof that no comparison based sorting algorithm can attain better time complexity than O(n log n). The model behind the lower bound of comparis...
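An empirical companion to the O(n log n) point (my own toy, not from the review): count the comparisons merge sort makes on a random input and set them against the information-theoretic lower bound log2(n!) for comparison sorts.

```python
# Count element comparisons in merge sort and compare against the
# comparison-sort lower bound log2(n!) and merge sort's n*ceil(log2 n) ceiling.
import math
import random

def merge_sort(xs, counter):
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left = merge_sort(xs[:mid], counter)
    right = merge_sort(xs[mid:], counter)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        counter[0] += 1                      # one element comparison
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

random.seed(0)
n = 128
data = random.sample(range(10_000), n)
counter = [0]
result = merge_sort(data, counter)
worst_case = n * math.ceil(math.log2(n))     # comparison ceiling for merge sort
info_bound = math.log2(math.factorial(n))    # ~716: lower bound for any comparison sort
```

On random inputs the observed count lands close to both bounds, which is the sense in which the guarantee abstracts away implementation details: it constrains any comparison-based algorithm on any hardware.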

To: @johnswentworth 


Senpai notice me. 🥺

@Raemon: here's the review I mentioned wanting to write. I'm wiped for the current writing session, but may extend it further later today or over the coming week? [When does the review session end?]
This seems wrong to me about academia - I'd say it's driven by "learning cool things you can summarize in a talk". Also in general I feel like this logic would also work for why we shouldn't work inside buildings, or with computers.
Garrett Baker:
Hm. Good points. I guess what I really mean with the academia points is that academia seems to have many blockers and inefficiencies, set up in such a way that capabilities progress finds them far easier to jump through than alignment progress does, and extra-so for capabilities labs. Like, right now it seems like a lot of alignment work is just playing with a bunch of different reframings of the problems to see what sticks or makes problems easier. You have more experience here, but my impression of a lot of academia was that it was very focused on publishing lots of papers with very legible results (and also a meaningless theory section). In such a world, playing around with different framings of problems doesn't succeed, and you end up pushed towards framings which score better on the currently used metrics. Most currently used metrics for AI stuff are capabilities oriented, so that means doing capabilities work, or work that helps push capabilities.
I think it's true that the easiest thing to do is legibly improve on currently used metrics. I guess my take is that in academia you want to write a short paper that people can see is valuable, which biases towards "I did thing X and now the number is bigger". But, for example, if you reframe the alignment problem and show some interesting thing about your reframing, that can work pretty well as a paper (see The Off-Switch Game, Optimal Policies Tend to Seek Power). My guess is that the bigger deal is that there's some social pressure to publish frequently (in part because that's a sign that you've done something, and a thing that closes a feedback loop).

Maybe a bigger deal is that by the nature of a paper, you can't get too many inferential steps away from the field.


This post is meant to be a linkable resource. Its core is a short list of guidelines that are intended to be fairly straightforward and uncontroversial, for the purpose of nurturing and strengthening a culture of clear thinking, clear communication, and collaborative truth-seeking.

"Alas," said Dumbledore, "we all know that what should be, and what is, are two different things.  Thank you for keeping this in mind."

There is also (for those who want to read past the simple list) substantial expansion/clarification of each specific guideline, along with justification for the overall philosophy behind the set.

Prelude: On Shorthand

Once someone has a deep, rich understanding of a complex topic, they are often able to refer to that topic with short, simple sentences that correctly convey the intended meaning to other...

Vaughn Papenhausen:
My model of gears to ascension, based on their first 2 posts, is that they're not complaining about the length for their own sake, but rather for the sake of people that they link this post to who then bounce off because it looks too long. A basics post shouldn't have the property that someone with zero context is likely to bounce off it, and I think gears to ascension is saying that the nominal length (reflected in the "43 minutes") is likely to have the effect of making people who get linked to this post bounce off it, even though the length for practical purposes is much shorter.
Yes, agreed. I think that people who are actually going to link this to someone with zero context are going to say "just look at the bulleted list" and that's going to 100% solve the problem for 90% of the people. I think that the set of people who bounce for the reason of "deterred by the stated length and didn't read the first paragraph to catch the context" but who would otherwise have gotten value out of my writing is very very very very very small, and wrong to optimize for. I separately think that the world in general and LW in particular already bend farther over backwards than is optimal to reach out to what I think of in my brain as "the tl;dr crowd." I'm default skeptical of "but you could reach these people better if you X;" I already kinda don't want to reach them and am not making plans which depend upon them.
[Thought experiment meant to illustrate potential dangers of discourse policing] Imagine 2 online forums devoted to discussing creationism.

Forum #1 is about 95% creationists, 5% evolutionists. It has a lengthy document, "Basics of Scientific Discourse", which runs to about 30 printed pages. The guidelines in the document are fairly reasonable. People who post to Forum #1 are expected to have read and internalized this document. It's common for users to receive warnings or bans for violating guidelines in the "Basics of Scientific Discourse" document. These warnings and bans fall disproportionately on evolutionists, for a couple reasons: (a) evolutionist users are less likely to read and internalize the guidelines (evolutionist accounts tend to be newly registered, and not very invested in forum discussion norms) and (b) forum moderators are all creationists, and they're far more motivated to find guideline violations in the posts of evolutionist users than creationist users (with ~30 pages of guidelines, there's often something to be found). The mods are usually not very interested in discussing a warning or a ban.

Forum #2 is about 80% creationists, 20% evolutionists. The mods at Forum #2 are more freewheeling and fun. Rather than moderating harshly, the mods at Forum #2 focus on setting a positive example of friendly, productive discourse. The ideological split among the mods at Forum #2 is the same as that of the forum as a whole: 80% creationists, 20% evolutionists. It's common for creationist mods to check with evolutionist mods before modding an evolutionist post, and vice versa. When a user at Forum #2 is misbehaving, the mods at Forum #2 favor a Hacker News-like approach of sending the misbehaving user a private message and having a discussion about their posts.

Which forum do you think would be quicker to reach a 50% creationists / 50% evolutionists split?

I think this thought experiment isn't relevant, because I think there are sufficient strong disanalogies between [your imagined document] and [this actual document], and [the imagined forum trying to gain members] and [the existing LessWrong].

i.e. I think the conclusion of the thought experiment is indeed as you are implying, and also that this fact doesn't mean much here.

Making the rounds.

User: When should we expect AI to take over?

ChatGPT: 10

User: 10?  10 what?

ChatGPT: 9 

ChatGPT: 8