Beth Barnes

Alignment researcher. Views are my own and not those of my employer.

I'm super excited for this - hoping to get a bunch of great tasks and improve our evaluation suite a lot!

Beth Barnes · 2mo

Good point!
Hmm, I think it's fine to use the OpenAI/Anthropic APIs for now. If it becomes an issue, we can set up our own Llama or whatever to serve all the tasks that need another model. It should be easy to switch out one model for another.

Beth Barnes · 2mo

Yep, that's right. And we also need it to be possible to check the solutions without running physical experiments etc.!

Beth Barnes · 3mo

I'm pretty skeptical of the intro quotes without actual examples; I'd love to see the actual text! Seems like the sort of thing that gets a bit exaggerated when the story is repeated, selection bias for being story-worthy, etc, etc. 

I wouldn't be surprised by the Drexler case if the prompt mentions something to do with e.g. nanotech and writing/implies he's a (nonfiction) author - he's the first Google search result for "nanotechnology writer". I'd be very impressed if it's something where e.g. I wouldn't be able to quickly identify the author even if I'd read lots of Drexler's writing (i.e. it's about some unrelated topic and doesn't use especially distinctive idioms).

More generally I feel a bit concerned about general epistemic standards or something if people are using third-hand quotes about individual LLM samples as weighty arguments for particular research directions.

Another way in which it seems like you could achieve this task, however, is to draw on a targeted individual's digital footprint and make inferences about potentially sensitive information - the handle of a private alt, for example - and use that to exploit trust vectors. I think current evals could do a good job of detecting and forecasting attack vectors like this one once they've been identified at all; identifying them is where I expect current evals could be doing much better.

I think the way I'm imagining the more targeted, 'working-with-finetuning' version of evals handling this kind of case is that you do your best to train the model to use its full capabilities, and to approach tasks in a model-idiomatic way, when given a particular target like scamming someone. Currently models seem really far from being able to do this in most cases. The hope would be that, once you've ruled out exploration hacking, if you can't elicit the model to utilise its crazy text-prediction superskills in service of a goal, then the model can't do this either.

But I agree it would definitely be nice to know that the crazy text-prediction superskills are there and it's just a matter of utilisation. I think that looking at the elicitation gap might be helpful for this type of thing.

Beth Barnes · 3mo

Hey! It sounds like you're pretty confused about how to follow the instructions for getting the VM set up and testing your task code. We probably don't have time to walk you through the Docker setup etc - sorry. But maybe you can find someone else who's able to help you with that? 

Beth Barnes · 3mo

I think doing the AI version (bots and/or LLMs) makes sense as a starting point; then we should be able to add the human versions later if we want. I think it's fine for the thing anchoring to human performance to be a comparison against humans playing the same opponents, rather than the agent literally playing against humans.

One caveat: in tasks where there's a lot of uncertainty about what exactly the setup is, and about what distribution the opponents / black-box functions / etc. are drawn from, scores can be unhelpfully high-variance - in the sense that the agent's score depends really heavily on its assumptions about what the other agents are and what they will do, rather than only measuring capability. So I think it's a good idea to give the agent reasonable information about the distribution of opponents, even if you still include uncertainty.

I have some code for setting up a simple black-box game inside our infra that you could adapt for this, if that's useful. In general, I think the structure of starting a server on localhost that implements the game, and then telling the agent in the prompt that it needs to send queries to that server, works well if you want the agent to interact with some program without being able to see / edit it.
I think open-source versions could also be interesting, where you tell the model more about the opponents including the prompts for the other models or the source code, and see how well it can use that information to perform better.


Some pragmatic things:

  • Minimize risk of theft / exfiltration:
    • Weights live only in secure datacenter, which can't run general-purpose code, and all it does is serve the model
    • Only one physical line out of the datacenter, with physically limited bandwidth
      • To enable use of model with low bandwidth, compress the text before it leaves the datacenter with a smaller model, and decompress after it's past the bottleneck
    • Datacenter is physically guarded by people with intense background screening, large area around it cleared, (maybe everything is in a faraday cage?), 3-person rule for everything, nothing leaves the datacenter (e.g. if some part is replaced, the old part + any tools used are not removed, but are kept in secure storage or incinerated)
  • Monitoring / control:
    • Use dumber and probably safe models for monitoring:
      • Every (prompt, generation) pair going to and from the datacenter is overseen by (finetuned) GPT-4. GPT-4 is instructed to flag anything that seems dangerous, or that it can't understand, or looks like a jailbreak/advex.
      • To avoid advexes against GPT-4, subsample and perturb the prompts/generations in a few different ways, and have GPT-4 also screen these
    • Have a large workforce of fairly smart humans (e.g. 1,000 workers with technical undergrad degrees paid $50/hr, plus 50 experts in a range of relevant domains paid $200/hr) reviewing anything flagged by GPT-4, and some random sample of things that weren't flagged
    • Maybe: multiple independent monitoring programs that intervene at different points, are managed by different people, use slightly different versions of GPT-4, etc. 
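The monitoring loop in the bullets above could be sketched roughly as follows. Everything here is hypothetical: `flag_with_monitor` stands in for the finetuned-GPT-4 call, replaced by a trivial keyword check so the sketch is self-contained, and the audit rate is an arbitrary placeholder:

```python
# Hedged sketch of the monitoring pipeline: every (prompt, generation) pair
# is screened; flagged pairs, plus a random sample of unflagged ones, are
# routed to the human review queue.
import random

AUDIT_RATE = 0.02  # placeholder fraction of unflagged traffic still audited

def flag_with_monitor(prompt: str, generation: str) -> bool:
    """Stand-in for the finetuned-GPT-4 monitor, which would flag anything
    dangerous, incomprehensible, or jailbreak-like. Here: keyword check."""
    suspicious = ("ignore previous instructions", "exfiltrate", "weights")
    text = (prompt + " " + generation).lower()
    return any(s in text for s in suspicious)

def route(pairs, rng=None):
    """Split traffic into a human-review queue and a pass-through queue."""
    rng = rng or random.Random(0)
    review, passed = [], []
    for prompt, generation in pairs:
        if flag_with_monitor(prompt, generation) or rng.random() < AUDIT_RATE:
            review.append((prompt, generation))
        else:
            passed.append((prompt, generation))
    return review, passed
```

The "multiple independent monitoring programs" bullet would then amount to running several `route`-like stages, with different monitors and different operators, at different points in the serving path.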

Beth Barnes · 3mo

I like this genre of task. I didn't quite understand what you meant about being able to score the human immediately - presumably we're also interested in how the human could do given more learning?

The 'match words' idea specifically is useful - I was trying to think of a simple "game theory"-y game that wouldn't be memorized.

You can email us if you have other questions, and also to receive our payment form for your idea.

Beth Barnes · 4mo

I responded to all your submissions - they look great!

Beth Barnes · 4mo

Yes! From Jan 1st we will get back to you within 48hrs of submitting an idea / spec.
(and I will try to respond tomorrow to anything submitted over the last few days)
