Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort.
What if I told you that in just one weekend you can get up to speed doing practical Mechanistic Interpretability research on Transformers? Surprised? Then this is your tutorial!
I'll give you a view into how I research Transformer circuits in practice, show you the tools you need, and explain my thought process along the way. I focus on the practical side of getting started with interventions; for more background see point 2 below.
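To give a taste of what "getting started with interventions" can look like in code, here is a minimal sketch assuming the TransformerLens library; the model, prompt, hook point, and head index are illustrative choices of mine, not prescriptions from the post.

```python
# A minimal sketch of a causal intervention on a Transformer, assuming the
# TransformerLens library (pip install transformer_lens). Model, prompt, layer,
# and head index below are illustrative examples only.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # small model for quick experiments

prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)

# Baseline forward pass, caching all intermediate activations.
baseline_logits, cache = model.run_with_cache(tokens)

# Intervention: zero-ablate the output of attention head 5 in layer 0.
def zero_head_5(z, hook):
    # z has shape [batch, seq_pos, head_index, d_head]
    z[:, :, 5, :] = 0.0
    return z

patched_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("z", 0), zero_head_5)],
)

# Compare how the intervention changed the prediction at the final position.
print(baseline_logits[0, -1].argmax(), patched_logits[0, -1].argmax())
```

If the ablated head matters for the prediction at the final position, the argmax of the patched logits will differ from the baseline; comparing such before/after differences is the basic move behind this kind of intervention.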
Prerequisites:
(Probably somebody else has said most of this. But I personally haven't read it, and felt like writing it down myself, so here we go.)
I think that EA burnout usually results from prolonged dedication to satisfying the values you think you should have, while neglecting the values you actually have.
Setting aside for the moment what “values” are and what it means to “actually” have one, suppose that I actually value these things (among others):
True Values
One day I learn about “global catastrophic risk”: Perhaps we’ll all die in a nuclear war, or an AI apocalypse, or a bioengineered global pandemic, and perhaps one of these things will happen quite soon.
I recognize that GCR is a direct threat to The Wellbeing Of Others and...
This post crystallized some thoughts that have been floating in my head, inchoate, since I read Zvi's stuff on slack and Valentine's "Here's the Exit."
Part of the reason that it's so hard to update on these 'creative slack' ideas is that we make deals among our momentary mindsets to work hard when it's work-time. (And when it's literally the end of the world at stake, it's always work-time.) "Being lazy" is our label for someone who hasn't established that internal deal between their varying mindsets, and so is flighty and hasn't precommitted to getting st...
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort
A big thank you to all of the people who gave me feedback on this post: Edmund Lao, Dan Murfet, Alexander Gietelink Oldenziel, Lucius Bushnaq, Rob Krzyzanowski, Alexandre Variengen, Jiri Hoogland, and Russell Goyder.
Statistical learning theory is lying to you: "overparametrized" models actually aren't overparametrized, and generalization is not just a question of broad basins.
To first...
I wrote a follow-up that should be helpful if you want to see an example in more detail. The example I mention there is a loss function (= potential energy) with a singularity at the origin.
This does seem like an important point to emphasize: symmetries in the model (or, if you're making deterministic predictions, in the prediction function) together with the true distribution lead to singularities in the loss landscape. There's an important distinction between symmetries that come from the model alone and those that only arise relative to the true distribution.
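The inline formulas in the two comments above were lost in transcription. As an illustrative stand-in (a standard toy example of my own choosing, not necessarily the one used in the follow-up), consider the potential

$$L(w_1, w_2) = w_1^2 w_2^2.$$

Its set of minima $\{w : w_1 w_2 = 0\}$ is the union of the two coordinate axes, and at the origin the Hessian vanishes entirely, so the minimum is a degenerate (singular) point rather than a quadratic basin. This is exactly the sense in which generalization is "not just a question of broad basins": near a singularity the usual Hessian-based volume intuitions stop applying.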
In this post I’m going to describe my basic justification for working on RLHF in 2017-2020, which I still stand behind. I’ll discuss various arguments that RLHF research had an overall negative impact and explain why I don’t find them persuasive.
I'll also clarify that I don't think research on RLHF is automatically net positive; alignment research should address real alignment problems, and we should reject a vague association between "RLHF progress" and "alignment progress."
Here are some background views about alignment I held in 2015 and still hold today. I expect disagreements about RLHF will come down to disagreements about this background:
Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.
Has ARC got a written policy for if/when similar experiments generate inconclusive but possible evidence of dangerous behaviour?
If so, would you consider sharing it (or a non-confidential version) for other organisations to use?
I want to thank Sebastian Farquhar, Laurence Midgley, and Johan van den Heuvel for feedback and discussion on this post.
Some time ago I asked the question “What is the role of Bayesian ML in AI safety/alignment?”. The response of the EA and Bayesian ML community was very helpful. Thus, I decided to collect and distill the answers and provide more context for current and future AI safety researchers.
Clarification: I don’t think many people (<1% of the alignment community) should work on Bayesian ML or that it is even the most promising path to alignment. I just want to provide a perspective and give an overview. I personally am not that bullish on Bayesian ML anymore (see shortcomings) but I’m in a relatively unique position where I have a decent...
Also, people on Less Wrong are normally interested in a different kind of Bayesianism: Bayesian epistemology, the type philosophers talk about. But Bayesian ML is based on the other type of Bayesianism: Bayesian statistics. The two have little to do with each other.
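To make the "Bayesian statistics" half of that distinction concrete, here is a minimal sketch of the kind of object Bayesian ML works with: a posterior distribution over a model parameter rather than a point estimate. The toy Beta-Bernoulli example is mine, not from the post.

```python
# Minimal illustration of Bayesian statistics in the ML sense: maintain a
# posterior over a parameter instead of a point estimate. Toy example only.
from scipy import stats

# Beta(1, 1) prior over the unknown heads-probability theta of a coin.
alpha, beta = 1.0, 1.0

flips = [1, 1, 0, 1, 0, 1, 1, 1]  # observed data (1 = heads)

# Conjugate update: the posterior is Beta(alpha + #heads, beta + #tails).
alpha += sum(flips)
beta += len(flips) - sum(flips)

posterior = stats.beta(alpha, beta)
low, high = posterior.interval(0.95)
print(f"posterior mean = {posterior.mean():.3f}, 95% interval = ({low:.3f}, {high:.3f})")
```

The point of the sketch is only to show what "Bayesian" means here: distributions over model parameters, not a general theory of rational belief.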
What’s the type signature of an agent?
For instance, what kind-of-thing is a “goal”? What data structures can represent “goals”? Utility functions are a common choice among theorists, but they don’t seem quite right. And what are the inputs to “goals”? Even when using utility functions, different models use different inputs - Coherence Theorems imply that utilities take in predefined “bet outcomes”, whereas AI researchers often define utilities over “world states” or “world state trajectories”, and human goals seem to be over latent variables in humans’ world models.
And that’s just goals. What about “world models”? Or “agents” in general? What data structures can represent these things, how do they interface with each other and the world, and how do they embed in their low-level world? These are all...
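To make "type signature" concrete, here is a sketch of how the candidate inputs to a utility function mentioned above could be written down as types; all the names are invented for illustration and none of this is from the post.

```python
# Illustrative only: candidate type signatures for a "goal", mirroring the
# options discussed above. All names here are made up for the sketch.
from typing import Callable, Sequence


class BetOutcome: ...        # a predefined bet outcome, as in coherence theorems
class WorldState: ...        # a full state of the world
class LatentVariable: ...    # a variable inside an agent's own world model


# Utility over predefined bet outcomes (the coherence-theorem picture).
UtilityOverBets = Callable[[BetOutcome], float]

# Utility over world states, or over whole trajectories (common AI framings).
UtilityOverStates = Callable[[WorldState], float]
UtilityOverTrajectories = Callable[[Sequence[WorldState]], float]

# Human-like goals: utility over latent variables in the agent's world model.
UtilityOverLatents = Callable[[LatentVariable], float]
```

Each of these signatures commits you to a different answer about what a goal even takes as input, which is the crux the post is pointing at.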
Maybe a bigger deal is that by the nature of a paper, you can't get too many inferential steps away from the field.
This post is meant to be a linkable resource. Its core is a short list of guidelines that are intended to be fairly straightforward and uncontroversial, for the purpose of nurturing and strengthening a culture of clear thinking, clear communication, and collaborative truth-seeking.
"Alas," said Dumbledore, "we all know that what should be, and what is, are two different things. Thank you for keeping this in mind."
There is also (for those who want to read past the simple list) substantial expansion/clarification of each specific guideline, along with justification for the overall philosophy behind the set.
Once someone has a deep, rich understanding of a complex topic, they are often able to refer to that topic with short, simple sentences that correctly convey the intended meaning to other...
I think this thought experiment isn't relevant, because I think there are sufficiently strong disanalogies between [your imagined document] and [this actual document], and between [the imagined forum trying to gain members] and [the existing LessWrong].
i.e. I think the conclusion of the thought experiment is indeed as you are implying, and also that this fact doesn't mean much here.
Making the rounds.
User: When should we expect AI to take over?
ChatGPT: 10
User: 10? 10 what?
ChatGPT: 9
ChatGPT: 8
...
Awesome, updated!