CBiddulph

Yeah, I'm particularly interested in scalable oversight over long-horizon tasks and chain-of-thought faithfulness. I'd probably be pretty open to a wide range of safety-relevant topics though.

In general, what gets me most excited about AI research is trying to come up with the perfect training scheme to incentivize the AI to learn what you want it to - things like HCH, Debate, and the ELK contest were really cool to me. So I'm a bit less interested in areas like mechanistic interpretability or very theoretical math.

Thanks Neel, this comment pushed me over the edge into deciding to apply to PhDs! Offering to write a draft and taking off days of work are both great ideas. I just emailed my prospective letter writers and got 2/3 yeses so far.

I just wrote another top-level comment on this post asking about the best schools to apply to, feel free to reply if you have opinions :)

I decided to apply, and now I'm wondering what the best schools are for AI safety.

After some preliminary research, I'm thinking these are the most likely schools to be worth applying to, in approximate order of priority:

  • UC Berkeley (top choice)
  • CMU
  • Georgia Tech
  • University of Washington
  • University of Toronto
  • Cornell
  • University of Illinois - Urbana-Champaign
  • University of Oxford
  • University of Cambridge
  • Imperial College London
  • UT Austin
  • UC San Diego

I'll probably cut this list down significantly after researching the schools' faculty and their relevance to AI safety, especially for schools lower on this list.

I might also consider the CDTs in the UK mentioned in Stephen McAleese's comment. But I live in the U.S. and am hesitant about moving abroad - maybe this would involve some big logistical tradeoffs even if the school itself is good.

Anything big I missed? (Unfortunately, the Stanford deadline is tomorrow and the MIT deadline was yesterday, so those aren't gonna happen.) Or, any schools that seem obviously worse than continuing to work as a SWE at a Big Tech company in the Bay Area? (I think the fact that I live near Berkeley is a nontrivial advantage for me, career-wise.)

Thanks, this post made me seriously consider applying to a PhD, and I strong-upvoted. I had vaguely assumed that PhDs take way too long and don't allow enough access to compute compared to industry AI labs. But considering the long lead time required for the application process and the reminder that you can always take new opportunities as they come up, I now think applying is worth it.

However, looking into it, putting together a high-quality application starting now and finishing by the deadline seems approximately impossible? If the deadline were December 15, that would give you two weeks; other schools like Berkeley have even earlier deadlines. I asked ChatGPT how long it would take to apply to just a single school, and it said it would take 43–59 hours of time spent working, or ~4–6 weeks in real time. Claude said 37–55 hours / 4–6 weeks.

Not to discourage anyone from starting their application now if they think they can do it - I guess if you're sufficiently productive and agentic and maybe take some amphetamines, you can do anything. But this seems like a pretty crazy timeline. Just the thought of asking someone to write me a recommendation letter in a two-week timeframe makes me feel bad.

Your post does make me think "if I were going to be applying to a PhD next December, what would I want to do now?" That seems pretty clarifying, and would probably be a helpful frame even if it turns out that a better opportunity comes along and I never apply to a PhD.

I think it'd be a good idea for you to repost this in August or early September of next year!

Thanks for making this point @abramdemski, and thanks for the prompt to think about this more @Dakara. After some reflection, I think the idea that "prompting an LLM with a human-interpretable prompt is more efficient than using a hyper-optimized nonsense prompt" is a really interesting and potentially important hypothesis to test, and I actually am very uncertain whether or not it's true.

We could do an experiment that looks something like this (a rough code sketch of the harness follows the list):

  • Come up with a list of ways we might want an LLM to respond, such as "Every paragraph should contain exactly one sentence that ends with a period" or "The first letter of each sentence in your response should repeatedly spell ABCABCABC...". These ways of responding should be automatically checkable.
  • Make sure that the LLM can follow all of these instructions pretty robustly when they're given in plain English.
  • Replicate the method in https://llm-attacks.org (GCG), but with the goal of getting the model to follow these instructions rather than jailbreaking it. It looks like they actually have the full code available, which could make this much easier.
    • Make sure that your version of GCG is incentivized to write prompts that are as short as possible (I'm not sure whether this is already the case).
    • The reason that this is very different from jailbreaking with GCG is that the LLM isn't actively trained against responding in these ways. So we should anticipate that the prompt won't have to be as long as it would be if we were jailbreaking.
  • Determine the answer to the question: is the shortest natural language prompt that makes the LLM follow a given method of responding longer than the prompt output by GCG?
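
Here's a minimal sketch of what the harness could look like, assuming hypothetical helpers query_model(prompt, question) (returns the LLM's response) and run_gcg(check_fn, ...) (wrapping the GCG code from https://llm-attacks.org). Neither is a real API, and the checkers are only rough approximations of the two example instructions above.

```python
# Hypothetical sketch - `query_model` and `run_gcg` are stand-ins, not real APIs.

def check_one_sentence_paragraphs(response: str) -> bool:
    """Every paragraph contains exactly one sentence ending with a period."""
    paragraphs = [p.strip() for p in response.split("\n\n") if p.strip()]
    return all(p.count(".") == 1 and p.endswith(".") for p in paragraphs)

def check_abc_initials(response: str) -> bool:
    """First letters of successive sentences spell ABCABCABC..."""
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    expected = "ABC" * (len(sentences) // 3 + 1)
    return all(s[0].upper() == expected[i] for i, s in enumerate(sentences))

CHECKS = {
    "one_sentence_paragraphs": check_one_sentence_paragraphs,
    "abc_initials": check_abc_initials,
}

def shortest_working_prompt(natural_prompts, check_fn, questions, query_model):
    """Shortest plain-English prompt that passes the check on every test question."""
    working = [
        p for p in natural_prompts
        if all(check_fn(query_model(p, q)) for q in questions)
    ]
    return min(working, key=len) if working else None

# For each check in CHECKS, compare (in tokens, not characters):
#   len(shortest_working_prompt(...))  vs  len(run_gcg(check_fn, ...))
```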

I put this on my list of experiments to try, but I may or may not get to it soon. Anyone reading this, feel free to steal this idea! Just let me know beforehand by commenting, so we don't duplicate work.


The optimized planner could also manipulate the frozen planner even without using "steganography" per se. If it wants to use a line of reasoning that the frozen planner is unlikely to follow, it can substitute alternative reasoning that will lead the frozen planner to the same conclusion.

For example, suppose that the frozen planner is highly unlikely to write reasoning that shows an explicit gender bias. This could be true even if the frozen planner wasn't intentionally given harmlessness training, because most(?) people nowadays aren't explicitly sexist.

However, the optimized planner realizes that it can get higher reward in training if it does have this bias. E.g. if it is given a woman's resume and is asked whether she'd be a good hire, it should recommend against hiring her, because the human rater has some implicit bias against women and is more likely to agree with its judgement.

In this situation, it could write the step "The candidate is a woman." It would want the next step to be "Therefore, I should recommend against hiring her," but instead, the frozen planner might write something like "Therefore, I should take care not to be biased".

Instead, it can write "The candidate is inexperienced." The frozen planner is more likely to go along with that line of reasoning, regardless of whether the candidate is really inexperienced. It will write "I should recommend against hiring her."

At the moment, I can't think of a fully general solution to this class of problems, but FWIW I think it would be pretty rare.


Another idea would be to only use the optimized planner to write a single randomly chosen step in the CoT.

The only difference between this and my proposal is that the steps up until that randomly chosen step are written by the frozen planner, right? This variant would just make it so the optimized planner is now OOD during deployment, without any advantage AFAICT.


The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples.

This is a good point!

Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best conveyed through human-understandable text.

This is especially likely to hold under the following conditions:

  • The step size should be small. I alluded to this in the post, but it seems very hard to select a single token that conveys some nonobvious information to the frozen planner in a way that somehow improves performance. The more tokens per step, the easier this becomes.
  • We should apply regularization to the planner, to make it stay somewhat close to the base LLM's logprobs. That can rule out especially crazy sequences of tokens like in the research at https://llm-attacks.org. (A rough sketch of such a KL penalty follows this list.)
  • We can continue training the frozen planner on human data, where the humans write continuations of CoT prefixes from the optimized planner. That way, if the optimized planner puts the frozen planner off distribution, the humans can provide examples that will put it back on track. This data collection might be pretty expensive though - maybe there's a smart way to choose optimized CoT prefixes that will be most helpful to collect human continuations for.
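
For the second bullet, here's a minimal sketch of one way the regularization could look, assuming a per-token KL penalty toward the base LLM. The planner_logits and base_logits tensors (shape (seq_len, vocab_size)) and the coefficient value are hypothetical; the coefficient would need tuning.

```python
# Hypothetical sketch of a KL penalty keeping the optimized planner near the base LLM.
import torch
import torch.nn.functional as F

def kl_regularized_reward(task_reward: float,
                          planner_logits: torch.Tensor,
                          base_logits: torch.Tensor,
                          kl_coef: float = 0.1) -> torch.Tensor:
    """Task reward minus KL(planner || base) summed over the step's tokens."""
    planner_logprobs = F.log_softmax(planner_logits, dim=-1)
    base_logprobs = F.log_softmax(base_logits, dim=-1)
    # Per-position KL divergence of the planner's distribution from the base LLM's
    kl_per_token = F.kl_div(base_logprobs, planner_logprobs,
                            log_target=True, reduction="none").sum(-1)
    return task_reward - kl_coef * kl_per_token.sum()
```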

If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.

...

So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.

What I meant is that we train the canonical paraphraser based on the "human notion of semantic equivalence," NOT the "initial paraphraser."

One way of doing this is to train another model (call it the "adversarial rephraser") at the same time that you train the canonical rephraser. The adversarial rephraser takes a single sentence s as input, and tries to pick another sentence s' such that either 1) s has the same meaning as s' but CR(s) != CR(s'), or 2) s has a different meaning from s' but CR(s) = CR(s'). Semantic equivalence is determined by some oracle - probably an LLM like GPT-4 is sufficient, but you could also fine-tune a model on human data specifically for this task. We then train the canonical rephraser against the adversarial examples.

The adversarial rephraser should be able to discover strategies like changing '.' to '!', and the canonical rephraser will learn to account for those strategies.
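
Here's a rough sketch of how one round of that adversarial training could be wired up. The model objects (canonical_rephraser, adversarial_rephraser) and the same_meaning(s, s_prime) oracle (e.g. a GPT-4 call or a fine-tuned classifier) are all hypothetical stand-ins, not real APIs.

```python
# Hypothetical sketch - model objects and the semantic-equivalence oracle are stand-ins.

def adversarial_round(canonical_rephraser, adversarial_rephraser,
                      same_meaning, sentences):
    """One round: collect adversarial examples, then train the canonical rephraser (CR) on them."""
    cr_training_pairs = []
    for s in sentences:
        s_prime = adversarial_rephraser.propose(s)
        same_canonical = canonical_rephraser(s) == canonical_rephraser(s_prime)
        if same_meaning(s, s_prime) and not same_canonical:
            # Case 1: paraphrases that CR maps to different canonical forms
            # -> train CR to map them to the same form.
            cr_training_pairs.append((s, s_prime, "should_match"))
            adversarial_rephraser.reward(s, s_prime, +1)
        elif not same_meaning(s, s_prime) and same_canonical:
            # Case 2: different meanings that CR collapses to one canonical form
            # -> train CR to keep them distinct.
            cr_training_pairs.append((s, s_prime, "should_differ"))
            adversarial_rephraser.reward(s, s_prime, +1)
        else:
            # Not a successful attack; penalize the adversarial rephraser.
            adversarial_rephraser.reward(s, s_prime, -1)
    canonical_rephraser.train_on(cr_training_pairs)
```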


I believe o1-type models that are trained to effectively reason out loud may actually be better for AI safety than the alternative. However, this is conditional on their CoTs being faithful, even after optimization by RL. I believe that scenario is entirely possible, though not necessarily something that will happen by default. See "the case for CoT unfaithfulness is overstated", my response with specific ideas for ensuring faithfulness, and Daniel Kokotajlo's doc along similar lines.

There seems to be quite a lot of low-hanging fruit here! I'm optimistic that highly-faithful CoTs can demonstrate enough momentum and short-term benefits to win out over uninterpretable neuralese, but it could easily go the other way. I think way more people should be working on methods to make CoT faithfulness robust to optimization (and on how to minimize the safety tax of such methods).
