
Erik Jenner

Research Scientist on the Google DeepMind AGI Safety & Alignment team

Sequences

Empirical mechanistic anomaly detection

Posts

2 · ejenner's Shortform · 5y · 36

Comments
ejenner's Shortform
Erik Jenner · 10d · 244

Tips for writing (MATS) applications.

The common theme of these is to make it very easy for reviewers to notice the strengths of your application. My selfish motivation for writing this list is that this makes it easier to review applications, but it will also make your application better.

See end of the quick take for caveats.

  • Assume that readers will initially only spend a few minutes on your application, so make it very skimmable.
    • If there's something you really want to highlight, don't be afraid to put it in multiple places (e.g. your CV as well as some free-form response)
    • You can use bold font for highlights in your CV (just don't bold so much that bolding loses its impact)
    • The longer your responses, the more reviewers will skim them. This isn't super bad because reviewers probably have a lot of practice skimming applications. But if you want to have control over what reviewers actually see, either optimize for density or structure your long responses very clearly (topic sentences, maybe even bolded headings).
  • Make it clear what the full range of topics is you'd be excited to work on.
    • For example, if you discuss a specific interest of yours, also make clear whether you're mainly looking for projects in that area or are also open to very different ones!
    • Giving a concrete example of what you'd be excited to work on can be a great way to demonstrate that you know the area! I'm not saying not to do that, just to also make clear how broad your interests are beyond that.
  • If you've done ML or coding projects that could support the application, make that clear!
    • Putting projects on GitHub is a good idea! It can be hard to judge whether a project is actually impressive just based on a 1-paragraph description in a CV. By default, I'll often assume that CV descriptions give an overly rosy view of the project. If there's code to look at, that can be much more legibly impressive.
    • Link the GitHub in your application!
    • Skimmability applies here too: it's useful if the README makes it clear what the project is about and why it's impressive. E.g. if you have an ML project, put your main plot in the README; this makes it easy to tell that you actually ran experiments.
    • Blog posts or project web pages are also great (personally, I think just writing a nice README is a good 80/20, but I'm not sure how much attention other reviewers pay to GitHub).
    • Generally go for quality over quantity. One very clearly impressive project can already have a lot of weight. If you list 7 different projects, I'll probably just pick one or two to really look at anyway. So you might as well spend more space on your most impressive projects and then list the rest more briefly; that way I can focus on the most important ones instead of a random one.
  • If you have code or papers that you want reviewers to see but that aren't public yet, don't say "available on request." Attach a link instead (which can be to a drive file etc.)
    • Personally, I'm pretty unlikely to send a request unless I'm seriously thinking about making an offer.
    • You can totally ask in the application that reviewers don't share the paper (though that should be the default expectation anyway).

Caveats:

  • These are mainly based on my experience reviewing applications to MATS or CHAI internships. (I expect many generalize beyond that to e.g. full-time positions but don't have experience reviewing for those.)
  • Obviously I can't speak for other mentors and I'm sure some of them value other things in applications and would actively disagree with parts of this list.
  • More important than the tips above is having the right skills and legible evidence of those in the first place. The list above is just about comparatively easy-to-do things during the actual application process.
I am worried about near-term non-LLM AI developments
Erik Jenner · 1mo · 120

There are literal interpretations of these predictions that aren't very strong:

  1. I expect a new model to be released, one which does not rely on adapting pretrained transformers or distilling a larger pretrained model
  2. It will be inspired by the line of research I have outlined above, or a direct continuation of one of the listed architectures
  3. It will have language capabilities equal to or surpassing GPT-4
  4. It will have a smaller parameter count (by 1-2+ OOMs) compared to GPT-4

GPT-4 was rumored to have 1.8T parameters, so <180B parameters would technically satisfy 4. My impression is that current ~70B open-weight models (e.g. Qwen 2.5) are already roughly as good as the original GPT-4 was. (Of course that's not a fair comparison since the 1.8T parameter rumor is for an MoE model.)

So the load-bearing part is arguably "inspired by [this] line of research," but I'm not sure what would or wouldn't count for that. E.g. a broad interpretation could argue that any test-time training / continual learning approach would count, even if most of the capabilities still come from pretraining similar to current approaches. (Still a non-trivial prediction to be clear!)

My impression was that you're intending to make stronger claims than this broad interpretation. If so, you could consider picking slightly different concretizations to make the predictions more impressive if you end up being right. For example, I'd consider 2 OOMs fewer parameters than GPT-4 noticeably more impressive than 1 OOM (and my guess would be that the divergence between your view and the broader community would be even larger at 3 OOMs fewer parameters). Might be even better to tie it to compute and/or training data instead of parameters. You could also try to make the "inspired by this research" claim more concrete (e.g. "<10% of training compute before model release is spent on training on offline/IID data", if you believe some claim of that form).
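To make the thresholds in this comment concrete, here is a minimal sketch of the arithmetic. It takes the rumored 1.8T figure at face value (it is a rumor, not a confirmed number), and the function name is made up for illustration.

```python
# Rumored GPT-4 parameter count quoted in the comment above (unconfirmed).
GPT4_RUMORED_PARAMS = 1.8e12


def params_at_oom_reduction(ooms: int, baseline: float = GPT4_RUMORED_PARAMS) -> float:
    """Return the parameter count that is `ooms` orders of magnitude below `baseline`."""
    return baseline / 10 ** ooms


if __name__ == "__main__":
    for ooms in (1, 2, 3):
        billions = params_at_oom_reduction(ooms) / 1e9
        print(f"{ooms} OOM(s) fewer than the rumored count: {billions:g}B parameters")
```

Under that assumption, 1 OOM fewer is the 180B threshold mentioned above, 2 OOMs is 18B, and 3 OOMs is 1.8B.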

Efficiently Detecting Hidden Reasoning with a Small Predictor Model
Erik Jenner · 2mo* · 41

Nice post, I think detecting hidden reasoning is an important direction!

I find it helpful in the context of this post to distinguish between (1) "hidden reasoning" in the sense that the model is deliberately confining some reasoning to forward passes and doesn't include any traces of it in the CoT except for the conclusion, vs (2) in the steganography sense, where the tokens in the CoT are carrying important information (but in some encoded way).

IIUC, your datasets are meant to measure (1) (though the methods could in principle be applicable to both (1) and (2)), does that sound right? That could be one motivation for your approach over paraphrasing.

If you do want to study (1), maybe a more realistic proxy would be cases where current models hide reasoning in this sense (whether naturally or after prompting). I.e. pick a task where models need to do some reasoning, but it's easy enough that they can do it in a single forward pass. Make the task such that models want to hide this reasoning (e.g. it's about bias), or prompt them to hide it.

My guess is that this proxy of hidden reasoning would be harder to detect than the "removing reasoning chunks" proxy. And I think this is likely also true for the real case of (1): if a scheming model was already deliberately not mentioning certain things in its CoT, it could probably also take care to make the CoT still appear more or less logical.

"At 20% ablation, the monitor guessed that there were skipped steps only 31% of the time (true-positives)"

Minor note: do all CoTs have at least one step removed in the 20% ablation setting? Earlier on, it sounded like you're randomly deciding whether to remove each line of the CoT. If CoTs are short, some of them might by chance not have any lines removed, unless I'm misunderstanding the dataset creation? (Maybe this effect basically doesn't matter though because all CoTs are much longer than 5 lines?)
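A rough sketch of how large this effect could be, assuming each line is removed independently with probability 0.2. That is my reading of the setup as described in this comment, not a confirmed detail of the post's dataset creation, and the function name is hypothetical.

```python
def prob_no_lines_removed(num_lines: int, p_remove: float = 0.2) -> float:
    """Probability that a CoT with `num_lines` lines loses no lines at all,
    ASSUMING each line is dropped independently with probability `p_remove`."""
    return (1 - p_remove) ** num_lines


if __name__ == "__main__":
    for n in (3, 5, 10, 20, 50):
        print(f"{n:>2} lines: P(no line removed) = {prob_no_lines_removed(n):.4f}")
```

Under that assumption, a 5-line CoT is left untouched about 33% of the time (0.8^5 ≈ 0.328), while a 20-line CoT is left untouched only about 1.2% of the time, so the effect matters only if a meaningful share of CoTs are short.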

A Bear Case: My Predictions Regarding AI Progress
Erik Jenner · 6mo · 3418

"There were no continuous language model scaling laws before the transformer architecture"

https://arxiv.org/abs/1712.00409 was technically published half a year after transformers, but it shows power-law language model scaling laws for LSTMs (several years before the Kaplan et al. paper, and without citing the transformer paper). It's possible that transformer scaling laws are much better; I haven't checked (and perhaps more importantly, transformer training lets you parallelize across tokens). I'm just mentioning this because it seems relevant for the overall discussion of continuity in research.
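For readers unfamiliar with what a "power-law scaling law" means operationally, here is a minimal sketch: loss is modeled as `a * size**(-b)`, which becomes a straight line in log-log space and can be fit by ordinary least squares. The functional form, the names, and the data points are assumptions for illustration only; the points are synthetic and are not taken from the linked paper or from Kaplan et al.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b) in log-log space; returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope and intercept of the best-fit line log(loss) = intercept + slope * log(size).
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noise-free points generated from a known power law (NOT real data).
true_a, true_b = 10.0, 0.1
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [true_a * s ** (-true_b) for s in sizes]

a, b = fit_power_law(sizes, losses)
print(f"fitted a = {a:.3f}, fitted exponent b = {b:.3f}")
```

Comparing how well a fit like this holds for different architectures, and how the fitted exponents differ, is one way to make the "how continuous was the transition" question more quantitative.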

I also agree with Thomas Kwa's sibling comment that transformers weren't a single huge step. Fully-connected neural networks seem like a very strange comparison to make; I think the interesting question is whether transformers were a sudden single step relative to LSTMs. But I'd disagree even with that: Attention was introduced three years before transformers and was a big deal for machine translation. Self-attention was introduced somewhere between the first attention papers and transformers. And the transformer paper itself isn't atomic; it consists of multiple ideas. Replacing RNNs/LSTMs with self-attention is clearly the big one, but my impression is that multi-head attention, scaled dot product attention, and the specific architecture were pretty important to actually get their impressive results.

To be clear, I agree that there are sometimes new technologies that are very different from the previous state of the art, but I think it's a very relevant question just how common this is, in particular within AI. IMO the most recent great example is neural machine translation (NMT) replacing complex hand-designed systems starting in 2014: NMT worked very differently than the previous best machine translation systems, and surpassed them very quickly (by 2014 standards for "quick"). I expect something like this to happen again eventually, but it seems important to note that this was 10 years ago, and how much progress has been driven since then by many different innovations (+ scaling).

ETA: maybe a crux is just how impressive progress over the past 10 years has been, and what it would look like to have "equivalent" progress before the next big shift. But I feel like in that case, you wouldn't count transformers as a big important step either? My main claim here is that to the extent to which there's been meaningful progress over the past 10 years, it was mostly driven by a large set of small-ish improvements, and gradual shifts of the paradigm.

ejenner's Shortform
Erik Jenner · 8mo · 20

Yeah. I think there's a broader phenomenon where it's way harder to learn from other people's mistakes than from your own. E.g. see my first bullet point on being too attached to a cool idea. Obviously, I knew in theory that this was a common failure mode (from the Sequences/LW and from common research advice), and someone even told me I might be making the mistake in this specific instance. But my experience up until that point had been that most of the research ideas I'd been similarly excited about ended up ~working (or at least the ones I put serious time into).

ejenner's Shortform
Erik Jenner · 8mo · 30

Some heuristics (not hard rules):

  • ~All code should start as a hacky Jupyter notebook (like your first point)
  • As my codebase grows larger and messier, I usually hit a point where it becomes more aversive to work with because I don't know where things are, there's too much duplication, etc. Refactor at that point.
  • When refactoring, don't add abstractions just because they might become useful in the future. (You still want to think about future plans somewhat of course, maybe the heuristic is to not write code that's much more verbose than necessary right now, in the hope that it will pay off in the future.)

These are probably geared toward people like me who tend to over-engineer; someone who's currently unhappy that their code is always a mess might need different ones.

I don't know whether functional programming is fundamentally better in this respect than object-oriented.

ejenner's Shortform
Erik Jenner · 8mo · 1346

Research mistakes I made over the last 2 years.

Listing these in part so that I hopefully learn from them, but also because I think some of these are common among junior researchers, so maybe it's helpful for someone else.

  • I had an idea I liked and stayed attached to it too heavily.
    • (The idea is using abstractions of computation for mechanistic anomaly detection. I still think there could be something there, but I wasted a lot of time on it.)
    • What I should have done was focus more on simpler baselines, and be more scared when I couldn't beat those simple baselines.
    • (By "simpler baselines," I don't just mean comparing against what other people are using, I also mean ablations where you remove some parts of the method and see if it still works.)
    • Notably, several researchers I respect told me I shouldn't be too attached to that idea and should consider using simpler methods.
  • I was too focused initially on growing the field of empirical mechanistic anomaly detection; I should have just tried to get interesting results first.
  • Relatedly, I spent too much time making a nice library for mechanistic anomaly detection (though in part this was for my own use, not just because of field-building).
    • Apart from being a time sink, this also had bad effects on my research prioritization. It nudged me toward doing experiments that were easy to do with the library or that would be natural additions to the library when implemented. It made it aversive to do experiments that wouldn't fit naturally into the library at all.
    • It also made fast iteration more difficult, because I'd often be tempted to quickly integrate some new method/tool into the library instead of doing a hacky version first.
    • I do still think clean research code and infrastructure are really valuable, so this is difficult to balance. Consider reversing advice especially on this point.
  • I worked on purely conceptual/theoretical work without collaborators or good mentorship, and with only ~2.5 years of research experience. I expected beforehand that I'd be unusually good at this type of research (and I still think that's plausibly true), but even so, I don't think it was time well spent in expectation.
    • I'm very sympathetic to this kind of work being important. But I think it's really brutal (and won't work) without at least one (and ideally more) of (1) strong collaborators and ideally mentors, (2) good empirical feedback loops, (3) a lot of research experience. Maybe there are a small handful of junior researchers who can do useful things there on their own, but I've been growing more and more skeptical.
  • I looked into related work too late several times, and didn't think about how my work was going to be different early enough.
    • Drafting a paper outline was great for making me confront that question (and is also a useful exercise for noticing other mistakes).
  • I didn't work on control. I was convinced that control was great pretty quickly, and then ... assumed that everyone else would also be convinced and lots of people would switch to working on it, and it would end up being non-neglected. This sounds really silly in hindsight, and I suspect I was also doing motivated reasoning to avoid changing my research.
    • "Working on control" is a messy concept. Mechanistic anomaly detection (which I was working on) seems at least as applicable to control as to alignment. But what I mean by "working on control" includes an associated cluster of research taste aesthetics, such as pessimistic assumptions about inductive biases, meta-level adversarial evals, and thinking about the best protocols a blue team could use given clearly specified (and realistic) rules. I think I've been slowly moving closer toward that cluster, but could have gotten there sooner by making a deliberate sudden switch.
    • I didn't explicitly realize how "working on control" has this research taste cluster associated with it until more recently, otherwise the "everyone will work on it" argument would've had less force anyway.
The Plan - 2024 Update
Erik Jenner · 8mo · 291

2 years ago, you wrote:

"theoretical progress has been considerably faster than expected, while crossing the theory-practice gap has been mildly slower than expected. (Note that “theory progressing faster than expected, practice slower” is a potential red flag for theory coming decoupled from reality, though in this case the difference from expectations is small enough that I’m not too worried. Yet.)"

I don't know how much the difficulty of crossing the theory-practice gap has deviated from your expectations since then. But I would indeed be worried that a lot of the difficulty is going to be in getting any good results for deep learning, and that finding additional theoretical/conceptual results in other settings doesn't constitute much progress on that. (But kudos for apparently working on image generator nets again!)

As a sidenote, your update from 2 years ago also mentioned:

"I tried to calculate “local” natural abstractions (in a certain sense) in a generative image net, and that worked quite well."

I assume that was some other type of experiment involving image generators? (and the notion of "working well" there isn't directly comparable to what you tried now?)

Natural Abstractions: Key Claims, Theorems, and Critiques
Erik Jenner · 8mo · 80 · Review for 2023 Review

I think this was a very good summary/distillation and a good critique of work on natural abstractions; I'm less sure it has been particularly useful or impactful.

I'm quite proud of our breakdown into key claims; I think it's much clearer than any previous writing (and in particular makes it easier to notice which sub-claims are obviously true, which are daring, which are or aren't supported by theorems, ...). It also seems that John was mostly on board with it.

I still stand by our critiques. I think the gaps we point out are important and might not be obvious to readers at first. That said, I regret somewhat that we didn't focus more on communicating an overall feeling about work on natural abstractions, and our core disagreements. I had some brief back-and-forth with John in the comments, where it seemed like we didn't even disagree that much, but at the same time, I still think John's writing about the agenda was wildly more optimistic than my views, and I don't think we made that crisp enough.

My impression is that natural abstractions are discussed much less than they were when we wrote the post (and this is the main reason why I think the usefulness of our post has been limited). An important part of the reason I wanted to write this was that many junior AI safety researchers or people getting into AI safety research seemed excited about John's research on natural abstractions, but I felt that some of them had a rosy picture of how much progress there'd been/how promising the direction was. So writing a summary of the current status combined with a critique made a lot of sense, to both let others form an accurate picture of the agenda's progress while also making it easier for them to get started if they wanted to work on it. Since there's (I think) less attention on natural abstractions now, it's unsurprising that those goals are less important.

As for why there's been less focus on natural abstractions, my guess is a combination of at least:

  • John has been writing somewhat less about it than during his peak-NAH-writing.
  • Other directions have gotten off the ground and have captured a lot of excitement (e.g. evals, control, and model organisms).
  • John isn't mentoring for MATS anymore, so junior researchers don't get exposure to his ideas through that.

It's also possible that many became more pessimistic about the agenda without public fanfare, or maybe my impression of relative popularity now vs then is just off.

I still think very high effort distillations and critiques can be a very good use of time (and writing this one still seems reasonable ex ante, though I'd focus more on nailing a few key points and less on being super comprehensive).

evhub's Shortform
Erik Jenner · 8mo · 106

One more: It seems plausible to me that the alignment stress-testing team won't really challenge core beliefs that underlie Anthropic's strategy.

For example, Sleeper Agents showed that standard finetuning might not suffice given a scheming model, but Anthropic had already been pretty invested in interp anyway (and I think you and probably others had been planning for methods other than standard finetuning to be needed). "Simple probes can catch sleeper agents" (I'm not sure whether I should think of this as work by the stress-testing team?) then showed positive results using model internals methods, which I think probably don't hold up to stress-testing in the sense of somewhat adversarial model organisms.

Examples of things that I'd count as "challenge core beliefs that underlie Anthropic's strategy":

  • Demonstrating serious limitations of SAEs or current mech interp (e.g., for dealing with model organisms of scheming)
  • Demonstrating issues with hopes related to automated alignment research (maybe model organisms of subtle mistakes in research that seriously affect results but are systematically hard to catch)

To be clear, I think the work by the stress-testing team so far has been really great (mainly for demonstrating issues to people outside Anthropic), I definitely wouldn't want that to stop! Just highlighting a part that I'm not yet sure will be covered.

52 · Evaluating and monitoring for AI scheming · Ω · 2mo · 9
121 · Evidence of Learned Look-Ahead in a Chess-Playing Neural Network · Ω · 1y · 14
43 · Concrete empirical research projects in mechanistic anomaly detection · 1y · 3
73 · A gentle introduction to mechanistic anomaly detection · 1y · 2
34 · CHAI internship applications are open (due Nov 13) · 2y · 0
73 · A comparison of causal scrubbing, causal abstractions, and related methods · Ω · 2y · 3
48 · [Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques · Ω · 2y · 0
246 · Natural Abstractions: Key Claims, Theorems, and Critiques · Ω · 2y · 26
64 · Sydney can play chess and kind of keep track of the board state · 3y · 19
93 · Research agenda: Formalizing abstractions of computations · Ω · 3y · 10