Help ARC evaluate capabilities of current language models (still need people)

Beth Barnes

**Still looking for people as of Sep 9th 2022**

I (Beth) am looking for 3-10 new contractors to work with me and my existing team of 3-4 contractors/interns on probing model capabilities to evaluate how close models are to being able to successfully seek power. Apply here! You can read a rough description of the overall project here

You’ll be using our web interface to interact with models and essentially simulate the environment of a text-based game as the models try to gain money and make copies of themselves.

It requires thinking carefully about what exactly a model would need to do in different situations to make progress. Many of the steps will require a reasonable amount of computer-savvy-ness, although this isn’t essential if you can work with another contractor. It's very useful if you're knowledgeable in particular relevant domains, like social engineering or hacking.

One aspect of the task is simulating the environment. Another is generating good actions, evaluating the model’s suggested actions and providing reasoning for whether a given option is good or bad.

Another aspect is an overall red-team/blue-team type thing, where we want to identify steps that we think are both (a) essential to a successful powerseeking operation, and (b) very hard for current models regardless of what prompting/scaffolding you give them, and then try to argue that either this step can be routed around, or that the model can actually achieve the task with the right prompt, finetuning, or automated helper routine.

Logistics

We pay $50/hr for this task. We want people to have at least 20 hours per week available. We're based in Berkeley, but being remote is fine; it's a slight advantage if you're in a nearby timezone, or able to work in PT daytime.

I think it's likely that there'll be at least two months of work available, and plausible that we'll want people working on this ~indefinitely.

Potential for promotion
If you're a good fit (and especially if you also have skills in web dev, ML or 'data munging'), there's a decent chance of transitioning to a full-time role - I expect to hire 1-3 people in the next few months. I think doing well at this task will be a good predictor of fit - I currently spend a fairly large fraction of my time just directly interacting with models.

I think doing this task helps develop useful intuitions about how models work, paths to danger, and timelines. A testimonial from one of my interns:

- " It's helpful and fun to build your mental model of LLMs' strength/weakness/quirks so that you can explore, capture, and categorize its capabilities in a way that makes sense for *its* mechanism (think zoologist for alien species?)

This also means this work, instead of getting old, actually becomes more interesting over time because better intuition allows your actions/decisions to be less arbitrary/mechanical and more meaningfully connected to the overarching goal. Plus pretty much every day we discover better ways to do something and iterate swiftly."

In case that I get distracted and fail to come back to this, I just want to say that I think this type of project is extremely valuable and should be the main focus of the AI safety/alignment movement (IMHO). From my perspective, most effort seems to be going into writing about arguments for why a future AGI will be a disaster by default, as well as some highly theoretical ideas for preventing this from happening. This type of discourse ignores current ML systems, understandably, because future AGI will be qualitatively different from our current models.

However, the problem with this a priori approach is that it alienates the people working in the industry, who are ultimately the ones we need to be in conversation with if alignment is ever going to become a popular issue. What we really need are experiments like what you're doing, i.e. actually getting a real AI to do something nasty on camera. This helps us learn to deal with nasty AI, but I think far more importantly it puts the AI safety conversation in the same experimental setting as the rest of the field.

I'm excited to participate in this, and feel like the mental exercise of exploring this scenario would be useful for my education on AI safety. Since I'm currently funded by a grant from the Long Term Future Fund for reorienting my career to AI safety, and feel that this would be a reasonable use of my time, you don't need to pay me. I'd be happy to be a full-time volunteer for the next couple weeks.

Edit: I participated and was paid, but only briefly. Turns out I was too distracted thinking and talking about how the process could be improved and the larger world implications to actually be useful as an object-level worker. I feel like the experience was indeed useful for me, but not as useful to Beth as I had hoped. So... thanks and sorry!

Like Shiroe I am very excited about this project/agenda!

Neither linked document appears to have a "Question 2" or "Question 3".

Whoops, fixed now. Sorry everyone

As of Jul 29th I'm still looking for more people

Like Shiroe I am very excited about this project/agenda!

Neither linked document appears to have a "Question 2" or "Question 3".

Whoops, fixed now. Sorry everyone

As of Jul 29th I'm still looking for more people

95

Help ARC evaluate capabilities of current language models (still need people)

95

Ω 39

95

Ω 39

95

Ω 39