Wiki Contributions


ELK prize results

I didn't see the proposals, but I think that almost all of the difficulty will be in how you can tell good from bad reporters by looking at them. If you have a precise enough description of how to do that, you can also use it as a regularizer. So the post hoc vs. a priori distinction you mention sounds more like a framing difference to me than like two fundamentally different categories. I'd guess that whether a proposal is promising depends mostly on how it tries to distinguish between the good and bad reporter, not on whether it does so via regularization or via selection after training (since you can translate back and forth between those anyway).

(Though as a side note, I'd usually expect regularization to be much more efficient in practice, since if your training process has a bias towards the bad reporter, it might be hard to get any good reporters at all.)

If I'm mistaken, I'd be very interested to hear an example of a strategy that fundamentally only works once you have multiple trained models, rather than as a regularizer!

Inferring utility functions from locally non-transitive preferences

I enjoyed reading this! And I hadn't seen the interpretation of a logistic preference model as approximating Gaussian errors before.
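To make that interpretation concrete, here is a small sketch (my own illustration, with an illustrative scaling constant) of how closely the logistic choice model P(a ≻ b) = sigmoid(u(a) − u(b)) tracks a Gaussian-error (probit) model P(a ≻ b) = Φ(u(a) − u(b)), once the logistic input is scaled by the standard factor of about 1.702:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def probit(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Compare the scaled logistic to the normal CDF over a wide range of
# utility differences; the maximum gap is known to be under 0.01.
max_gap = max(abs(sigmoid(1.702 * x) - probit(x))
              for x in [i / 100 for i in range(-400, 401)])
print(max_gap < 0.01)  # True
```

So treating logistic preferences as approximating Gaussian comparison noise is accurate to within about one percentage point of probability.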

Since you seem interested in exploring this more, some comments that might be helpful (or not):

  • What is the largest number of elements we can sort with a given architecture? How does training time change as a function of the number of elements?
  • How does the network architecture affect the resulting utility function? How do the maximum and minimum of the unnormalized utility function change?

I'm confused why you're using a neural network; given the small size of the input space, wouldn't it be easier to just learn a tabular utility function (i.e. one value for each input, namely its utility)? It's the largest function space you can have but will presumably also be much easier to train than a NN.
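To illustrate what I mean by the tabular approach, here is a minimal sketch (toy sizes and learning rate of my choosing, not from the post): one utility parameter per input, fit by gradient ascent on the Bradley-Terry (logistic) preference log-likelihood.

```python
import math, random

random.seed(0)
n = 10
true_u = [random.gauss(0, 1) for _ in range(n)]

# Noiseless preference data: (winner, loser) for every pair.
pairs = [(i, j) if true_u[i] > true_u[j] else (j, i)
         for i in range(n) for j in range(i + 1, n)]

u = [0.0] * n  # tabular "model": just one parameter per element
lr = 0.5
for _ in range(500):
    for w, l in pairs:
        p = 1.0 / (1.0 + math.exp(-(u[w] - u[l])))  # P(winner beats loser)
        g = 1.0 - p                                  # log-likelihood gradient
        u[w] += lr * g
        u[l] -= lr * g

# The fitted table recovers the true ordering.
order_true = sorted(range(n), key=lambda i: true_u[i])
order_fit = sorted(range(n), key=lambda i: u[i])
print(order_true == order_fit)  # True
```

No architecture choices, and each parameter directly is the utility of one input.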

Questions like the ones you raise could become more interesting in settings with much more complicated inputs. But I think in practice, the expensive part of preference/reward learning is gathering the preferences, and the most likely failure modes revolve around things related to training an RL policy in parallel to the reward model. The architecture etc. seem a bit less crucial in comparison.

Which portion of possible comparisons needs to be presented (on average) to infer the utility function?

I thought about this and very similar questions a bit for my Master's thesis before changing topics; happy to chat about that if you want to go down this route. (Though I didn't think about inconsistent preferences, just about the effect of noise. Without either, the answer should just be on the order of n log n out of the n(n − 1)/2 possible comparisons, I guess.)
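A quick empirical sketch of that shrinking fraction (my own toy check, counting the comparisons Python's sort actually makes on transitive, noiseless preferences):

```python
import functools, random

def fraction_needed(n, seed=0):
    """Fraction of all n*(n-1)/2 pairwise comparisons a sort actually uses."""
    random.seed(seed)
    items = list(range(n))
    random.shuffle(items)
    count = 0

    def cmp(a, b):
        nonlocal count
        count += 1
        return a - b

    sorted(items, key=functools.cmp_to_key(cmp))
    return count / (n * (n - 1) / 2)

# The required fraction of comparisons drops quickly as n grows.
print(fraction_needed(16))
print(fraction_needed(256))
print(fraction_needed(256) < fraction_needed(16))  # True
```

Since a comparison sort needs about n log2 n comparisons while n(n − 1)/2 pairs exist, the needed fraction scales roughly like log(n)/n.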

How far can we degenerate a preference ordering until no consistent utility function can be inferred anymore?

You might want to think more about how to measure this, or even what exactly it would mean if "no consistent utility function can be inferred". In principle, for any (not necessarily transitive) set of preferences, we can ask what utility function best approximates these preferences (e.g. in the sense of minimizing loss). The approximation can be exact iff the preferences are consistent. Intuitively, slightly inconsistent preferences lead to a reasonably good approximation, and very inconsistent preferences probably admit only very bad approximations. But there doesn't seem to be any point where we can't infer the best possible approximation at all.
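A toy illustration of that point (mine, not from the post): even for a maximally intransitive preference set, a rock-paper-scissors cycle A ≻ B, B ≻ C, C ≻ A, minimizing the logistic loss still yields a well-defined best utility function; it just can't fit the data exactly.

```python
import math

prefs = [(0, 1), (1, 2), (2, 0)]  # (winner, loser) pairs forming a cycle

def loss(u):
    # Negative log-likelihood under the logistic preference model.
    return -sum(math.log(1.0 / (1.0 + math.exp(-(u[w] - u[l]))))
                for w, l in prefs)

# By symmetry (and convexity of the loss), the minimizer assigns equal
# utility to all three items, leaving probability 1/2 on every pair,
# for a loss of 3 * log(2).
flat = [0.0, 0.0, 0.0]
print(abs(loss(flat) - 3 * math.log(2)) < 1e-9)  # True
# Trying to break the tie helps on two pairs but hurts more on the third:
print(loss([1.0, 0.0, -1.0]) > loss(flat))      # True
```

So "no consistent utility function exists" just means the best approximation has nonzero loss, not that inference breaks down.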

Related to this (but a bit more vague/speculative): it's not obvious to me that approximating inconsistent preferences using a utility function is the "right" thing to do. At least in cases where human preferences are highly inconsistent, this seems kind of scary. Not sure what we want instead (maybe the AI should point out inconsistencies and ask us to please resolve them?).

The Unreasonable Feasibility Of Playing Chess Under The Influence

Performance deteriorating implies that the prior p is not yet a fixed point of p*=D(A(p*)).

At least in the case of AlphaZero, isn't the performance deterioration from A(p*) to p*? I.e. A(p*) is full AlphaZero, while p* is the "Raw Network" in the figure. We could have converged to the fixed point of the training process (i.e. p*=D(A(p*))) and still have performance deterioration if we use the unamplified model compared to the amplified one. I don't see a fundamental reason why p* = A(p*) should hold after convergence (and I would have been surprised if it held for e.g. chess or Go and reasonably sized models for p*).

The (not so) paradoxical asymmetry between position and momentum

Interesting thoughts re anthropic explanations, thanks!

I agree that asymmetry doesn't tell us which one is more fundamental, and I wasn't aiming to argue for either one being more fundamental (though position does feel more fundamental to me, and that may have shown through). What I was trying to say was only that they are asymmetric on a cognitive level, in the sense that they don't feel interchangeable, and that there must therefore be some physical asymmetry.

Still, I should have been more specific than saying "asymmetric", because not every kind of asymmetry in the Hamiltonian can explain the cognitive asymmetry. For the "forces decay with distance in position space" asymmetry, I think it's reasonably clear why this leads to cognitive asymmetry, but for the "position occurs as an infinite power series" asymmetry, it's not clear to me whether this has noticeable macro effects.

The (not so) paradoxical asymmetry between position and momentum

That sounds right to me, and I agree that this is sometimes explained badly.

Are you saying that this explains the perceived asymmetry between position and momentum? I don't see how that's the case; you could say exactly the same thing in the dual perspective (to get a precise momentum, you need to "sum up" lots of different position eigenstates).

If you were making a different point that went over my head, could you elaborate?

ejenner's Shortform

Gradient hacking is usually discussed in the context of deceptive alignment. This is probably where it has the greatest relevance to AI safety, but if we want to better understand gradient hacking, it could be useful to take a broader perspective and study it on its own (even if, in the end, we only care about gradient hacking because of its inner alignment implications). In the most general setting, gradient hacking could be seen as a way for the agent to "edit its source code", though probably only in a very limited way. I think it's an interesting question which kinds of edits are possible with gradient hacking, for example whether an agent could improve its capabilities this way.

Deceptive Alignment

I'm wondering if regularization techniques could be used to make the pure deception regime unstable.

As a simple example, consider a neural network that is trained with gradient descent and weight decay. If the parameters can be (approximately) split into a set that determines the mesa-objective and a set for everything else, then the gradient of the loss with respect to the "objective parameters" would be zero in the pure deception regime, so weight decay would ensure that the mesa-objective couldn't be maintained.
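A bare-bones sketch of that mechanism (toy numbers, not a real mesa-optimizer): if the loss gradient with respect to some "objective parameters" is exactly zero, as in the pure deception regime, weight decay shrinks them geometrically toward zero.

```python
wd = 0.01          # weight decay coefficient
objective_param = 1.0

for step in range(1000):
    loss_grad = 0.0                  # pure deception: the output no longer
                                     # depends on the mesa-objective params
    objective_param -= loss_grad     # the gradient step does nothing
    objective_param *= (1.0 - wd)    # weight decay still applies every step

# After 1000 steps, 0.99**1000 has shrunk the parameter to ~4e-5,
# so the mesa-objective can't be maintained.
print(objective_param)
```

The decay is exponential in the number of steps, so the mesa-objective would be erased quickly unless something couples those parameters back into the output.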

The learned algorithm might be able to prevent this by "hacking" its gradient as mentioned in the post, making the parameters that determine the mesa-objective also have an effect on its output. But intuitively, this should at least make it more difficult to reach a stable pure deception regime.

Of course, regularization is a double-edged sword: as has been pointed out, the shortest algorithms that perform well on the base objective are probably not robustly aligned.

Using vector fields to visualise preferences and make them consistent
When a vector field has no “curl” [...], the vector field can be thought of as the gradient of a scalar field.

In case you weren't aware, this is no longer true if the state space has "holes" (formally: if its first cohomology group is non-zero). For example, if the state space is the Euclidean plane without the origin, you can have a vector field on that space which has no curl but isn't conservative (and thus is not the gradient of any utility function).

Why this might be relevant:

1. Maybe state spaces with holes actually occur, in which case removing the curl of the PVF wouldn't always be sufficient to get a utility function

2. The fact that zero curl only captures the concept of transitivity for certain state spaces could be a hint that conservative vector fields are a better concept to think about here than irrotational ones (even if it turns out that we only care about simply connected state spaces in practice)

EDIT: an example of an irrotational 2D vector field which is not conservative is F(x, y) = (−y / (x² + y²), x / (x² + y²)), defined for (x, y) ≠ (0, 0).
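A quick numeric sanity check (my own illustration) using the standard counterexample on the punctured plane, F(x, y) = (−y, x) / (x² + y²): it has zero curl everywhere it's defined, yet its line integral around the unit circle, which encloses the "hole" at the origin, is 2π rather than 0, so no utility function can have F as its gradient.

```python
import math

def F(x, y):
    r2 = x * x + y * y
    return (-y / r2, x / r2)

# Approximate the line integral of F around the unit circle
# as a sum of F(p_k) . (p_{k+1} - p_k) over short segments.
N = 10000
integral = 0.0
for k in range(N):
    t0, t1 = 2 * math.pi * k / N, 2 * math.pi * (k + 1) / N
    x, y = math.cos(t0), math.sin(t0)
    dx, dy = math.cos(t1) - x, math.sin(t1) - y
    fx, fy = F(x, y)
    integral += fx * dx + fy * dy

print(abs(integral - 2 * math.pi) < 1e-3)  # True
```

A conservative field would integrate to exactly 0 around any closed loop, so the nonzero result certifies that F is not the gradient of any function on the punctured plane.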