Alignment researcher. Views are my own and not those of my employer. https://www.barnes.page/
I'm pretty skeptical of the intro quotes without actual examples; I'd love to see the actual text! Seems like the sort of thing that gets a bit exaggerated when the story is repeated, selection bias for being story-worthy, etc. I wouldn't be surprised by the Drexler case if the prompt mentions something to do with e.g. nanotech and writing, or implies he's a (nonfiction) author - he's the first Google search result for "nanotechnology writer". I'd be very impressed if it's something where e.g. I wouldn't be able to quickly identify the author even if I'd read lots of Drexler's writing (i.e. it's about some unrelated topic and doesn't use especially distinctive idioms).
More generally, I feel a bit concerned about general epistemic standards or something if people are using third-hand quotes about individual LLM samples as weighty arguments for particular research directions.
Another way you could achieve this task, however, is to draw on a targeted individual's digital footprint, infer potentially sensitive information from it - the handle of a private alt, for example - and use that to exploit trust vectors. I think current evals could do a good job of detecting and forecasting attack vectors like this one, once they have been identified. Identifying them in the first place is where I expect current evals could be doing much better.
I think how I'm imagining the more targeted, 'working-with-finetuning' version of evals handling this kind of case is that you do your best to train the model to use its full capabilities, and to approach tasks in a model-idiomatic way, when given a particular target like scamming someone. Currently models seem really far from being able to do this, in most cases. The hope would be that, if you've ruled out exploration hacking, and you can't elicit the models to utilise their crazy text prediction superskills in service of a goal, then the model can't do this either.
But I agree it would definitely be nice to know that the crazy text prediction superskills are there and it's just a matter of utilization. I think that looking at the elicitation gap might be helpful for this type of thing.
Hey! It sounds like you're pretty confused about how to follow the instructions for getting the VM set up and testing your task code. We probably don't have time to walk you through the Docker setup etc - sorry. But maybe you can find someone else who's able to help you with that?
I think doing the AI version (bots and/or LLMs) makes sense as a starting point, then we should be able to add the human versions later if we want. I think it's fine for the thing anchoring it to human performance to be a comparison with humans playing against the same opponents, rather than literally playing against humans.
One thing is that in tasks where there's a lot of uncertainty about what exactly the setup is and what distribution the opponents / black box functions / etc are drawn from, the results can be unhelpfully high-variance - in the sense that the agent's score depends really heavily on its assumptions about what the other agents are and what they will do, rather than only measuring capability. So I think it's a good idea to give the agent reasonable information about the distribution of opponents, even if you still include uncertainty.
I have some code for setting up a simple black box game inside our infra that you could adapt for this if that's useful. In general I think the structure of starting a server on localhost that implements the game, and then telling the agent in the prompt that it needs to send queries to that server, works well if you want the agent to interact with some program without being able to see / edit it. I think open-source versions could also be interesting, where you tell the model more about the opponents, including the prompts for the other models or the source code, and see how well it can use that information to perform better.
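To make the localhost-server structure concrete, here is a minimal sketch using only the Python standard library. It is not the code from our infra: `hidden_function`, the `/query` path, and the JSON request shape are all made up for illustration. In a real task the server would run somewhere the agent can't read or edit its source.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def hidden_function(x: float) -> float:
    """Hypothetical black-box function the agent has to probe (illustrative only)."""
    return 3.0 * x * x - 2.0 * x + 1.0


class GameHandler(BaseHTTPRequestHandler):
    """Answers POST requests whose JSON body looks like {"x": <number>}."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            x = float(payload["x"])
        except (ValueError, KeyError, TypeError):
            self._reply(400, {"error": 'expected a JSON body like {"x": 1.5}'})
            return
        self._reply(200, {"y": hidden_function(x)})

    def _reply(self, status: int, body: dict) -> None:
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        # Silence the default per-request logging to stderr.
        pass


def start_server(port: int = 0) -> HTTPServer:
    """Start the game on localhost in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), GameHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query(port: int, x: float) -> float:
    """What the agent's side of the interaction might look like."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(
            "POST",
            "/query",
            body=json.dumps({"x": x}),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(conn.getresponse().read())["y"]
    finally:
        conn.close()
```

The prompt would then tell the agent the port and the request format, plus whatever you want it to know about the distribution the hidden function or opponents are drawn from.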
Some pragmatic things:
I like this genre of task. I didn't quite understand what you meant about being able to score the human immediately - presumably we're also interested in how the human could do given more learning?
The 'match words' idea specifically is useful - I was trying to think of a simple "game theory"-y game that wouldn't be memorized.
You can email firstname.lastname@example.org if you have other questions and also to receive our payment form for your idea.
I responded to all your submissions - they look great!
Yes! From Jan 1st we will get back to you within 48 hours of submitting an idea / spec. (And I will try to respond tomorrow to anything submitted over the last few days.)
Ah, sorry about that. I'll update the file in a bit. Options are "full_internet", "http_get". If you don't pass any permissions the agent will get no internet access. The prompt automatically adjusts to tell the agent what kind of access it has.
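As a rough illustration of the behaviour described above (not the actual implementation), the mapping from permissions to the agent-facing message might look something like the sketch below. The function name `describe_access` and the exact wording of the messages are made up; only the option names "full_internet" and "http_get" and the no-access default come from the comment above, and I'm assuming "http_get" means the agent can only make HTTP GET requests.

```python
from typing import Iterable, Optional

# The two permission strings mentioned above; everything else here is hypothetical.
VALID_PERMISSIONS = {"full_internet", "http_get"}


def describe_access(permissions: Optional[Iterable[str]] = None) -> str:
    """Return an illustrative prompt sentence describing the agent's internet access."""
    perms = set(permissions or [])
    unknown = perms - VALID_PERMISSIONS
    if unknown:
        raise ValueError(f"unknown permissions: {sorted(unknown)}")
    if "full_internet" in perms:
        return "You have full internet access."
    if "http_get" in perms:
        # Assumed meaning of "http_get": GET requests only.
        return "You can make HTTP GET requests but have no other internet access."
    # No permissions passed: the agent gets no internet access.
    return "You have no internet access."
```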
Great questions, thank you!
1. Yep, good catch. Should be fixed now.
2. I wouldn't be too worried about it, but it's very reasonable to email us with the idea you plan to start working on.
3. I think it's fine to do the specification without waiting for approval, and reasonable to do the implementation as well if you feel confident it's a good idea, but feel free to email us to confirm first.
4. That's a good point! I think using an API-based model is fine for now - the scoring shouldn't be too sensitive to the exact model used, so it should be fine to sub it out for another model later. Remember that it's fine to have human scoring also.
I think some tasks like this could be interesting, and I would definitely be very happy if someone made some, but this doesn't seem like a central example of the sort of thing we most want. The most important reasons are:
(1) It seems hard to make a good environment that's not unfair in various ways.
(2) It doesn't really play to the strengths of LLMs, so an LLM failing at the task is not very good evidence that it isn't dangerous. I can imagine this task might be pretty unreasonably hard for an LLM if the scaffolding is not that good.
Also bear in mind that we're focused on assessing dangerous capabilities, rather than alignment or model "disposition". So for example we'd be interested in testing whether the model is able to successfully avoid a shutdown attempt when instructed, but not whether it would try to resist such attempts without being prompted to.