

Why do you expect Bitcoin to be excepted from being labelled a security along with the rest? 
(Apologies if the answer is obvious to those who know more about the subject than me, am just genuinely curious)

Had a similar medical bill story from when I was a poor student: Medical center told me that insurance would cover an operation. They failed to mention that they were only talking about the surgeon's fee; the hospital at which they arranged the operation was out-of-network and I was stuck with 50% of the facility's costs. I explained my story to the facility. They said I still had to pay but that a payment plan would be possible, and that I could start by paying a small amount each month. I took that literally and just started paying a (very) small amount monthly. At some point they called back to tell me to formally arrange a payment plan through their online portal, which gave me options with such high interest rates that there was no way my future earnings would increase at a fast enough rate to make a payment plan make any sense whatsoever. I called back and explained this, and said that if those were the only options I guessed I would just have to try to scrape the money together now, and that I was prepared to try to do this. The administrator, bless her heart, asked me to hold for a while, and eventually came back to say "I've spoken with my colleagues, and your current balance owed to us is now zero dollars".

This (along with a few other experiences in my life) has underscored how sometimes an apparently immovable constraint can evaporate if you can manage to talk to the right person. That said, I felt very lucky to have been taken pity on in this way -- I feel like having one's balance explicitly zeroed out in this way is rare! But it's interesting to hear that Zvi knows of cases where someone just didn't pay, with no consequences. I would have assumed that they'd normally report nonpayers to credit agencies and crater their credit scores after long enough, as it costs them nothing or almost nothing to do so. Would be interested either to hear other people's anecdotes of what happened after nonpayment of a large hospital bill (positive or negative), or to see data on this if anyone knows of any.

I was using medical questions as just one example of the kind of task that's relevant to sandwiching. More generally, what's particularly useful for this research programme are

  • tasks where we have "models which have the potential to be superhuman at [the] task", and "for which we have no simple algorithmic-generated or hard-coded training signal that’s adequate";
  • tasks for which there is some set of reference humans who are currently better at the task than the model; and
  • tasks for which there is some set of reference humans for whom the task is difficult enough that they would have trouble even evaluating/recognizing good performance (you also want this set of reference humans to be capable of being helped to evaluate/recognize good performance in some way).

Prime examples are task types that require some kind of niche expertise to do and evaluate. Cotra's examples involve "[fine-tuning] a model to answer long-form questions in a domain (e.g. economics or physics) using demonstrations and feedback collected from experts in the domain", "[fine-tuning] a coding model to write short functions solving simple puzzles using demonstrations and feedback collected from expert software engineers", "[fine-tuning] a model to translate between English and French using demonstrations and feedback collected from people who are fluent in both languages". I was just making the point that Surge can help with this kind of thing in some domains (coding), but not in others.

It's worth knowing that there are some categories of data that Surge is not well positioned to provide. For example, while they have a substantial pool of participants with programming expertise, my understanding from speaking with a Surge rep is that they don't really have access to a pool of participants with (say) medical expertise -- although for small projects it sounds like they are willing to try to see who they might already have with relevant experience in their existing pool of 'Surgers'. This kind of more niche expertise does seem likely to become increasingly relevant for sandwiching experiments. I'd be interested in learning more about companies or resources that can help collect RLHF data from people with uncommon (but not super-rare) kinds of expertise for exactly this reason.

I did Print to PDF in Word after formatting my Word document to look like a standard LaTeX-exported document, and it had no problem going through! But it might depend on the particular moderator.

Sounds a little like StarWeb? Recently read a lovely article about a similar but different game, Monster Island, which was a thing from 1989 to 2017.

But yes, my default assumption would be that the particular conversation you're referring to never resulted in a game that saw the light of day; I've seen many detailed game design discussions among people I've known meet the same fate.

Thanks, I agree that's a better analogy. Though of course, it isn't necessary that the employees (participants in a sandwiching project) be unaware of the CEO's (sandwiching project overseer's) goal; I was only highlighting that they need not necessarily be aware of it, in order to make clear that the goals of the human helpers/judges aren't especially relevant to what sandwiching, debate, etc. are really about. But of course, if it turns out that having the human helpers know what the ultimate goal is helps, then they're absolutely allowed to be in on it...

Perhaps this is a bit glib, but arguably some of the most profitable companies in the mobile game space have essentially built product assembly lines to churn out fairly derivative games that are nevertheless unique enough to do well on the charts, and they absolutely do it by factoring the project of "making a game" into different bits that are done by different people (programmers, artists, voice actors, etc.), some of whom might not have any particular need to know what the product will look like as a whole to play their part. 

However, I don't want to press too hard on this game example, as you may or may not consider this 'cognitive work' and as it has other disanalogies with what we are actually talking about here. And to a certain degree I share your intuition that factoring certain kinds of tasks is probably very hard: if it weren't, we might expect to see a lot more non-manufacturing companies whose main employee base consists of assembly lines (or hierarchies of assembly lines, or whatever) requiring workers with general intelligence but few specialized rare skills, which I think is the broader point you're making in this comment. I think that's right, although I also think there are reasons for this that go beyond just the difficulty of task factorization, and which don't all apply in the HCH etc. case, as some other commenters have pointed out.

We start with some ML model which has lots of knowledge from many different fields, like GPT-n. We also have a human who has a domain-specific problem to solve (like e.g. a coding problem, or a translation to another language) but lacks the relevant domain knowledge (e.g. coding skills, or language fluency). The problem, roughly speaking, is to get the ML model and the human to work as a team, and produce an outcome at-least-as-good as a human expert in the domain. In other words, we want to factorize the “expert knowledge” and the “having a use-case” parts of the problem.
This sort of problem comes up all the time in real-world businesses. We could just as easily consider a product designer at a tech startup (who knows what they want but little about coding), an engineer (who knows lots about coding but doesn't understand what the designer wants)...

These examples conflate "what the human who provided the task to the AI+human combined system wants" with "what the human who is working together with the AI wants" in a way that I think is confusing and sort of misses the point of sandwiching. In sandwiching, "what the human wants" is implicit in the choice of task, but the "what the human wants" part isn't really what is being delegated or factored off to the human who is working together with the AI; what THAT human wants doesn't enter into it at all. Using Cotra's initial example to belabor the point: if someone figured out a way to get some non-medically-trained humans to work together with a mediocre medical-advice-giving AI in such a way that the output of the combined human+AI team is actually good medical advice, it doesn't matter whether those non-medically-trained humans actually care that the result is good medical advice; they might not even individually know what the purpose of the system is, and just be focused on whatever their piece of the task is - say, verifying the correctness of individual steps of a chain of reasoning generated by the system, or checking that each step logically follows from the previous, or whatever. Of course this might be really time intensive, but if you can improve even slightly on the performance of the original mediocre system, then hopefully you can train a new AI system to match the performance of the original AI+human system by imitation learning, and bootstrap from there.

The point, as I understand it, is that if we can get human+AI systems to progress from "mediocre" to "excellent" (in other words, to remain aligned with the designer's goal) -- despite the fact that the only feedback involved is from humans who wouldn't even be mediocre at achieving the designer's goal if they were asked to do it themselves -- and if we can do it in a way that generalizes across all kinds of tasks, then that would be really promising. To me, it seems hard enough that we definitely shouldn't take a few failed attempts as evidence that it can't be done, but not so hard as to seem obviously impossible.

I just shared this info with an immune-compromised relative, thanks so much for this.

When I see young healthy people potentially obsessing, turning life into some sort of morbid probability matrix because one particular potential risk (Long Covid) has been made more salient and blameworthy, I sympathize a lot less. 


ONS's latest survey finds 2.8% of the UK population report that they are currently experiencing long COVID symptoms; 67% of that 2.8% report that the symptoms adversely affect their day-to-day activities. Separately, they've estimated that 70% of England has had COVID at least once; weighting their estimates for England/Scotland/Wales/NI suggests about 68% of the UK has had it. Putting these together (2.8% × 67% ≈ 1.9% of the whole population, divided by the 68% who have had COVID), conditional on having caught COVID at least once we have ~3% of the population experiencing symptoms that adversely affect day-to-day activities for at least a month and often much longer. (Table 7 of the associated dataset implies that for each individual symptom, well over half have been experiencing those symptoms for "at least 12 weeks", which is consistent with Fig 3 in this earlier survey.)
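For anyone who wants to check the arithmetic, here's a minimal back-of-envelope sketch using only the survey figures quoted above (the variable names are mine, not the ONS's):

```python
# Figures quoted from the ONS survey discussion above
prevalence_reporting = 0.028  # fraction of UK population currently reporting long COVID symptoms
adversely_affected = 0.67     # fraction of those whose day-to-day activities are adversely affected
had_covid = 0.68              # estimated fraction of the UK population infected at least once

# Fraction of the whole population with activity-affecting long COVID
p_adverse = prevalence_reporting * adversely_affected  # ~1.9%

# Condition on having caught COVID at least once
p_adverse_given_covid = p_adverse / had_covid  # ~2.8%, i.e. roughly 3%

print(f"{p_adverse_given_covid:.1%}")
```

This treats the ONS point estimates as exact and ignores reinfections, so it's only a rough consistency check on the ~3% figure, not a proper risk model.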

Anyway, if every time (or every few times) I catch COVID equates to a ~3% chance of long COVID that adversely affects my day-to-day activities for a long time, for me that's high enough that it justifies having categories of things that I do less often than I used to, categories of things that I do while masked, and categories that I do with no precautions. We don't generally go around criticizing people for "obsessing" when they take other slightly inconvenient actions to mitigate other low-probability risks (wearing seatbelts; having a diet composed of more healthy-but-less-delicious than unhealthy-and-more-delicious foods; cutting down on alcohol; etc.). So this constant criticism of people who are choosing to make changes to reduce their long COVID risk does rub me the wrong way.
