Erik Jenner

PhD student in AI safety at CHAI (UC Berkeley)

Thanks for writing this! On the point of how to get information, mentors themselves should also be able to say a lot of useful things (though especially for more subjective points, I'd put more weight on what previous mentees say!).

So since I'm going to be mentoring for MATS and for CHAI internships, I'll list my best guesses as to what working with me will be like; maybe this helps someone decide:

  • In terms of both research experience and mentoring experience, I'm one of the most junior mentors in MATS.
    • Concretely, I've been doing ML research for ~4 years and AI safety research for a bit over 2 of those. I've co-mentored two bigger projects (CHAI internships) and mentored ~5 people for smaller projects or more informally.
  • This naturally has disadvantages. Depending on what you're looking for, it can also have advantages: for example, it might help create a more collaborative atmosphere (as opposed to a "boss" dynamic like the post mentioned). I'm also happy to spend time on things that some senior mentors might be too busy for (like code reviews, ...).
  • Your role as a mentee: I'm mainly looking for either collaborators on existing projects, or for mentees who'll start new projects that are pretty close to topics I'm thinking about (likely based on a mix of ideas I already have and your ideas). I also have a lot of engineering work to be done, but that will only happen if it's explicitly what you want---by default, I'm hoping to help mentees on a path to developing their own alignment ideas. That said, if you're planning to be very independent and just develop your own ideas from scratch, I'm probably not the best mentor for you.
  • I live in Berkeley and am planning to be in the MATS office regularly (e.g. just working there and being available once/week in addition to in-person meetings). For (in-person) CHAI internships, we'd be in the same office anyway.

If you have concrete questions about other things whose answers would make a difference for whether you want to apply, then definitely feel free to ask!

Thanks! Mostly agree with your comments.

I actually think this is reasonably relevant, and is related to treeification.

I think any combination of {rewriting, using some canonical form} and {treeification, no treeification} is at least possible, and they all seem sort of reasonable. Do you mean the relation is that both rewriting and treeification give you more expressiveness/more precise hypotheses? If so, I agree for treeification, not sure for rewriting. If we allow literally arbitrary extensional rewrites, then that does increase the number of different hypotheses we can make, but these hypotheses can't be understood as making precise claims about the original computation anymore. I could even see an argument that allowing rewrites in some sense always makes hypotheses less precise, but I feel pretty confused about what rewrites even are given that there might be no canonical topology for the original computation.

My guess would be they're mostly at capacity in terms of mentorship, otherwise they'd presumably just admit more PhD students. Also not sure they'd want to play grantmaker (and I could imagine that would also be really hard from a regulatory perspective---spending money from grants that go through the university can come with a lot of bureaucracy, and you can't just do whatever you want with that money).

Connecting people who want to give money with non-profits, grantmakers, or independent researchers who could use it seems much lower-hanging fruit. (Though I don't know any specifics about who these people who want to donate are and whether they'd be open to giving money to non-academics.)

Have you seen [the linked post] and any of the other recent posts on [this topic]? I don't think they make it obvious that formalizing the presumption of independence would lead to alignment solutions, but they do give a much more detailed explanation of why you might hope so than the paper does.

We do not consider Conjecture at the same level of expertise as other organizations such as Redwood, ARC, researchers at academic labs like CHAI, and the alignment teams at Anthropic, OpenAI and DeepMind. This is primarily because we believe their research quality is low.

This isn't quite the right thing to look at IMO. In the context of talking to governments, an "AI safety expert" should have thought deeply about the problem, have intelligent things to say about it, know the range of opinions in the AI safety community, have a good understanding of AI more generally, etc. Based mostly on his talks and podcast appearances, I'd say Connor does decently well along these axes. (If I had to make things more concrete, there are a few people I'd personally call more "expert-y", but closer to 10 than 100. The AIS community just isn't that big and the field doesn't have that much existing content, so it seems right that the bar for being an "AIS expert" is lower than for a string theory expert.)

I also think it's weird to split this so strongly along organizational lines. As an extreme case, researchers at CHAI range on a spectrum from "fully focused on existential safety" to "not really thinking about safety at all". Clearly the latter group aren't better AI safety experts than most people at Conjecture. (And FWIW, I belong to the former group and I still don't think you should defer to me over someone from Conjecture just because I'm at CHAI.)

One thing that would be bad is presenting views that are very controversial within the AIS community as commonly agreed-upon truths. I have no special insight into whether Conjecture does that when talking to governments, but it doesn't sound like that's your critique at least?

I only very recently noticed that you can put \newcommand definitions in equations in LW posts and they'll apply to all the equations in that post. This is an enormous help for writing long technical posts, so I think it'd be nice if it were (a) more discoverable and (b) easier to use. For (b), the annoying thing right now is that I have to put the newcommands into one of the equations, so either I need to make a dedicated one, or I need to remember which equation I put them in. Also, the field for entering equations isn't great for multi-line input.

Feature suggestion to improve this: in the options section below the post editor, have a multiline text field where you can put LaTeX, and then inject that LaTeX code into MathJax as a preamble (or just add an otherwise empty equation to the page, I don't know to what extent MathJax supports preambles).
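For concreteness, such a preamble field could hold ordinary macro definitions like the following (these particular macros are only illustrative), which would then be usable in every equation of the post:

```latex
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
```

MathJax does support something like this: its configuration has a macros option that serves as a preamble, so the field's contents wouldn't necessarily need to be injected as a hidden equation.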

for all $a$ such that $a$ has an outgoing arrow, there exists $a'$ such that $a \to a'$ and $\tau(a) \to \tau(b)$

Should it be $\tau(a')$ at the end instead? Otherwise I'm not sure what $b$ is.

I think this could be a reasonable definition, but I haven't thought about it deeply. One potentially bad thing is that $\tau$ would also have to map any of the intermediate steps between $a$ and $a'$ to $\tau(a')$. I could imagine you can't do that for some computations and abstractions (of course you could always rewrite the computation and abstraction to make it work, but ideally we'd have a definition that just works).

What I've been imagining instead is that the abstraction can specify a function that determines which steps are the "high-level steps", i.e. where $\tau$ should be applied. I think that's very flexible and should support everything.

But also, in practice the more important question may just be how to optimize over this choice of high-level steps efficiently, even just in the simple setting of circuits.
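As a toy sketch of the commutation condition under discussion (all names here are hypothetical, not from the original post): an abstraction `tau` maps concrete states to abstract ones, and we check that applying `tau` after a low-level step matches applying a high-level step after `tau`.

```python
def commutes(low_step, high_step, tau, states):
    """Check tau(low_step(s)) == high_step(tau(s)) for every state s."""
    return all(high_step(tau(s)) == tau(low_step(s)) for s in states)

# Toy example: concrete states are integers, the abstraction is parity.
low_step = lambda s: s + 3         # one concrete computation step
tau = lambda s: s % 2              # abstraction: keep only the parity
high_step = lambda p: (p + 1) % 2  # abstract step: adding 3 flips parity

print(commutes(low_step, high_step, tau, range(10)))  # True
```

The subtlety raised above is what to do when several low-level steps correspond to one high-level step: the intermediate concrete states still need images under `tau`, and a naive pointwise check like this one no longer directly applies.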

Yeah, that seems to be the most important remaining difference now that Atticus is also using multiple interventions at once. Though I think the metrics are also still different? (ofc that's pretty orthogonal to the main methods)

My sense now is that the types of interventions are a bigger difference than I thought when writing that comment. In particular, as far as I can tell, causal scrubbing shouldn't be thought of as just doing a subset of the interventions; it also does some additional things (basically because causal abstractions don't treeify and so are more limited in that regard). And there's a closely related difference in that causal scrubbing never compares to the output of the hypothesis, just to different outputs of G.

But it also seems plausible that this still turns out not to matter too much in terms of which hypotheses are accepted/rejected. (There are definitely some examples of disagreements between the two methods, but I'm pretty unsure how severe and widespread they are.)

I'm interested in characterizing functions which are "insensitive" to subsets of their input variables, especially in high-dimensional spaces.

There's a field called "Analysis of boolean functions" (essentially Fourier analysis of functions $f\colon \{-1,1\}^n \to \{-1,1\}$) that seems relevant to this question and perhaps to your specific problem statement. In particular, the notion of "total influence" of a boolean function is meant to capture its sensitivity (e.g. the XOR function on all inputs has maximal total influence). This is the standard reference, see section 2.3 for total influence. Boolean functions with low influence (i.e. "insensitive" functions) are an important topic in this field, so I expect there are some relevant results (see e.g. tribes functions and the KKL theorem, though those specifically address a somewhat different question than your problem statement).
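For small $n$, total influence can be computed by brute force directly from its definition: the influence of coordinate $i$ is the probability that flipping bit $i$ of a uniformly random input changes the output, and total influence is the sum over coordinates. A minimal sketch (function names are mine, not from the reference):

```python
from itertools import product

def influence(f, n, i):
    """Inf_i[f] = Pr over uniform x of f(x) != f(x with bit i flipped)."""
    count = 0
    for x in product([0, 1], repeat=n):
        y = list(x)
        y[i] ^= 1  # flip coordinate i
        if f(x) != f(tuple(y)):
            count += 1
    return count / 2**n

def total_influence(f, n):
    return sum(influence(f, n, i) for i in range(n))

n = 3
xor = lambda x: sum(x) % 2           # parity: every bit flip changes output
majority = lambda x: int(sum(x) > n / 2)

print(total_influence(xor, n))       # 3.0 (maximal: n)
print(total_influence(majority, n))  # 1.5
```

Parity is maximally sensitive (total influence $n$), while majority's influence per coordinate is the probability the other two bits disagree, i.e. $1/2$ for $n=3$.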

That-Which-Predicts will not, not ever, not even if scaled up to be trained and run on a Matrioshka brain for a million years, step out of character to deviate from next token prediction.

I read this as claiming that such a scaled-up LLM would not itself become a mesa-optimizer with some goal that's consistent between invocations (so if you prompt it with "This is a poem about apples:", it's not going to give you a poem that subtly manipulates you, such that at some future point it can take over the world). Even if that's true (I'm unsure), how do you know? This post confidently asserts things like this but the only explanation I see is "it's been really heavily optimized", which doesn't engage at all with existing arguments about the possibility of deceptive alignment.

As a second (probably related) point, I think it's not clear what "the mask" is or what it means to "just predict tokens", and that this can confuse the discussion.

  • A very weak claim would be that for an input that occurred often during training, the model will predict a distribution over next tokens that roughly matches the empirical distribution of next tokens for that input sequence during training. As far as I can tell, this is the only interpretation that you straightforwardly get from saying "it's been optimized really heavily".
  • We could reasonably extend this to "in-distribution inputs", though I'm already unsure how exactly that works. We could talk about inputs that are semantically similar to inputs encountered during training, but really we probably want much more interpolation than that for any interesting claim. The fundamental problem is: what's the "right mask" or "right next token" once the input sequence isn't one that has ever occurred during training, not even in slightly modified form? The "more off-distribution" we go, the less clear this becomes.
  • One way of specifying "the right mask" would be to say: text on the internet is generated by various real-world processes, mostly humans. We could imagine counterfactual versions of these processes producing an infinite amount of counterfactual additional text, so we actually get a nice distribution with full support over strings. Then maybe the claim is that the model predicts next tokens from this distribution. First, this seems really vague, but more importantly, I think this would be a pretty crazy claim that's clearly not true literally. So I'm back to being confused about what exactly "it's just predicting tokens" means off distribution.

Specifically, I'd like to know: are you making any claims about off-distribution behavior beyond the claim that the LLM isn't itself a goal-directed mesa-optimizer? If so, what are they?
