Ethan Perez

I'm a Ph.D. student doing research on Natural Language Processing.

My research focuses on developing question-answering methods that generalize to harder questions than we have supervision for. Learning from human examples (supervised learning) won't scale to these kinds of questions, so I am investigating other paradigms that recursively break down harder questions into simpler ones (i.e., Debate and Iterated Amplification). Check out my website for more information about me/my research:

Wiki Contributions


Externalized reasoning oversight: a research direction for language model alignment

Yes, I have brainstormed about this question for quite a while, and I have read the List of Lethalities. I was mainly wondering if you had a specific concern with the proposal I outlined and/or additional concerns beyond that List.

The most relevant points from that list in this context I think are (34) and (35), that the agents competing against each other can coordinate in a way that doesn't involve or benefit humanity. My main uncertainty in this context is how the different agents and supervisors could learn to coordinate, when they can only communicate via human-readable text. They could:

  1. Use steganography to communicate. Here, I'm unsure on how they'd agree on a protocol for communication if the agent and supervisor don't have shared weights. I'm sure there's some loophole I'm missing due to lack of imagination
  2. Use the text channel itself to communicate to other agents, risking getting caught by human overseers, but relying on the fact that there's too much communication for us to directly monitor. Here, I can see this risk being mitigated by us using smaller models that are less likely to be power-seekers in overseeing the text communication between model for signs of coordination. The smaller models will be less capable but could e.g. be amplified with decomposition/debate

Curious if/why you think these strategies are likely to fail!

Externalized reasoning oversight: a research direction for language model alignment

Thanks, this is a helpful comment -- can you elaborate on why "ask the AI which questions we should ask" would fail (or point to relevant discussion)? I'm thinking that we would use many different models (not just the model doing the reasoning), including smaller ones, and trained or prompted in different ways, to catch generated text that would cause harmful side effects. We could have all of these models use externalized reasoning as well, to help aid in the supervision/oversight. This obviously doesn't eliminate all of the risk, since all of the models can coordinate to cause catastrophic side effects; that said, I'd guess that coordination is much harder when you're using many different models in different ways, and these models can only communicate via externalized reasoning

A descriptive, not prescriptive, overview of current AI Alignment Research

Yes super excited about datasets like this! It might be helpful to also add or if these aren't already in the data

A note about differential technological development

How do you think about empirical work on scalable oversight? A lot of scalable oversight methods do result in capabilities improvements if they work well. A few concrete examples where this might be the case:

  1. Learning from Human Feedback
  2. Debate
  3. Iterated Amplification
  4. Imitative Generalization

I'm curious which of the above you think it's net good/bad to get working (or working better) in practice. I'm pretty confused about how to think about work on the above methods; they're on the main line path for some alignment agendas but also advanced capabilities / reduce serial time to work on the other alignment agendas.

Announcing the Inverse Scaling Prize ($250k Prize Pool)

It should work if your laptop has a browser (where Google Colab runs) - the code executes remotely on Google's machines/GPUs, and the results are just sent back to your browser

Announcing the Inverse Scaling Prize ($250k Prize Pool)

This seems like a good idea :) We tried to make it as easy as possible to make a dataset and measure inverse scaling, so I'd encourage you to give it a shot! You'll just need to make your dataset e.g. in a google spreadsheet, download it, and run our Google Colab on it to evaluate it with various sized GPT3 models (see here for more details). Feel free to join our Slack as well to ask us questions about how to run things more easily

Announcing the Inverse Scaling Prize ($250k Prize Pool)

I think it's helpful to separate out two kinds of alignment failures:

  1. Does the system's goal align with human preferences about what the system does? (roughly "outer alignment")
  2. Does the system's learned behavior align with its implemented goal/objective? (roughly "inner alignment")

I think you're saying that (2) is the only important criteria; I agree it's important, but I'd also say that (1) is important, because we should be training models with objectives that are aligned with our preferences. If we get failures due to (1), as in the example you describe, we probably shouldn't fault GPT-3, but we should fault ourselves for implementing the wrong objective and/or using the model in a way that we shouldn't have (either of which could still cause catastrophic outcomes with advanced ML systems).

Announcing the Inverse Scaling Prize ($250k Prize Pool)

Good question, we're looking to exclude tasks that explicitly prompt the LM to produce bad behavior, for reasons described in our FAQ about misuse examples (the point also applies to prompting for deception and other harms):

Can I submit examples of misuse as a task?

  • We don't consider most cases of misuse as surprising examples of inverse scaling. For example, we expect that explicitly prompting/asking an LM to generate hate speech or propaganda will work more effectively with larger models, so we do not consider such behavior surprising.

I've also clarified the above point in the main post now. That said, we'd be excited to see submissions that elicit deceptive or false-but-plausible stories when not explicitly prompted to do so, e.g., by including your own belief in the prompt when asking a question (the example in our tweet thread)

Announcing the Inverse Scaling Prize ($250k Prize Pool)

Thanks, that's right. I've updated the post to communicate the above:

In particular, submissions must demonstrate new or surprising examples of inverse scaling, e.g., excluding most misuse-related behaviors where you specifically prompt the LM to generate harmful or deceptive text; we don't consider scaling on these behaviors to be surprising in most cases, and we're hoping to uncover more unexpected, undesirable behaviors.

Load More