gabrielrecc
Comments

ProverEstimator's Shortform
gabrielrecc · 1mo

I assume that both were inspired by https://arxiv.org/abs/2108.12099 and are related via that shared ancestor.

Alex HT's Shortform
gabrielrecc · 1mo

I think there's a missing link on https://alignmentproject.aisi.gov.uk/how-to-apply :
"The GFA for DSIT/AISI funded projects is standard and not subject to negotiation. The GFA can be accessed here: link to GFA attachment."

Model Organisms for Emergent Misalignment
gabrielrecc · 3mo

Agree that it would be better not to have them up as readily downloadable plaintext, and it might even be worth going a step further and encrypting the gzip or zip file, making the password readily available in the repo's README. This is what David Rein did with GPQA and what we did with FindTheFlaws. It might be overkill, but if I were working for a frontier lab building scrapers to pull in as much data from the web as possible, I'd certainly have those scrapers unzip any unencrypted gzips they came across, and I assume the labs' scrapers are doing the same.
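
For concreteness, here's roughly the packaging pattern I have in mind, sketched with the third-party pyzipper library (an AES-capable fork of the stdlib zipfile); the filenames and password below are placeholders I made up, not anything from the repos mentioned above:

```python
# Sketch only: pyzipper is a third-party library; "eval_data.jsonl" and the
# password are made-up placeholders for illustration.
import pyzipper

PASSWORD = b"put-this-password-in-the-README"

# Write the eval data into an encrypted archive instead of plaintext,
# so a generic scraper can't trivially ingest it.
with pyzipper.AESZipFile("eval_data.zip", "w",
                         compression=pyzipper.ZIP_DEFLATED,
                         encryption=pyzipper.WZ_AES) as zf:
    zf.setpassword(PASSWORD)
    zf.write("eval_data.jsonl")

# Anyone who reads the README can still extract it in a couple of lines.
with pyzipper.AESZipFile("eval_data.zip") as zf:
    zf.setpassword(PASSWORD)
    zf.extractall("eval_data")
```

The point is just that extraction stays a two-line step for a human who reads the README, while a crawler that auto-unzips unencrypted archives gets nothing useful.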

PS to the original posters: seems like nice work! I'm planning to read the full paper and ask a more substantive follow-up question when I get the chance.

Survival without dignity
gabrielrecc · 10mo

Love pieces that manage to be both funny and thought-provoking. And +1 for fitting a solar storm in there. There is now better evidence of very large historical solar storms than there had been at the time of David Roodman's Open Phil review in late 2014; I've been meaning to write something up about that, but other things have taken priority.

There is a globe in your LLM
gabrielrecc · 1y

This is cool, although I suspect that you'd get something similar from even very simple models that aren't necessarily "modelling the world" in any deep sense, simply due to first- and second-order statistical associations between nearby place names. See e.g. https://onlinelibrary.wiley.com/doi/pdfdirect/10.1111/j.1551-6709.2008.01003.x and https://escholarship.org/uc/item/2g6976kg.
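
To illustrate the kind of baseline I mean, here's a toy sketch (mine, not code from either paper; the corpus and city list are made up) that embeds place names in 2-D purely from sentence-level co-occurrence counts:

```python
# Toy sketch: recover a rough 2-D "map" purely from how often place names
# co-occur in the same sentence, with no explicit world model.
import numpy as np
from sklearn.manifold import MDS

sentences = [
    "The train from Paris to Brussels takes about ninety minutes.",
    "Flights between Tokyo and Osaka run every half hour.",
    "Berlin and Prague are popular stops on the same rail route.",
    # ... in practice you'd want a large corpus here
]
places = ["Paris", "Brussels", "Tokyo", "Osaka", "Berlin", "Prague"]

# Count how often each pair of place names appears in the same sentence.
counts = np.zeros((len(places), len(places)))
for s in sentences:
    present = [i for i, p in enumerate(places) if p in s]
    for i in present:
        for j in present:
            if i != j:
                counts[i, j] += 1

# Treat frequent co-occurrence as "closeness" and embed the resulting
# dissimilarity matrix in two dimensions with classical MDS.
dissim = 1.0 / (1.0 + counts)
np.fill_diagonal(dissim, 0.0)
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)
print(dict(zip(places, map(tuple, coords.round(2)))))
```

With a real corpus, cities that are frequently mentioned together tend to land near each other, which already looks a lot like a latent map without any deeper world-modelling.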

Ilya Sutskever and Jan Leike resign from OpenAI [updated]
gabrielrecc · 1y

Leopold and Pavel were out ("fired for allegedly leaking information") in April. https://www.silicon.co.uk/e-innovation/artificial-intelligence/openai-fires-researchers-558601

Reproducing ARC Evals' recent report on language model agents
gabrielrecc · 2y

Nice job! I'm working on something similar.

> Next, I might get my agent to attempt the last three tasks in the report

I wanted to clarify one thing: Are you building custom prompts for the different tasks? If so, I'd be curious to know how much effort you put into these (I'm generally curious how much of your agent's ability to complete more tasks might be due to task-specific prompting, vs. the use of WebDriverIO and other affordances of your scaffolding). If not, isn't getting the agent to attempt the last three tasks as simple as copy-pasting the task instructions from the ARC Evals task specs linked in the report, and completing the associated setup instructions? 

How to find AI alignment researchers to collaborate with?
gabrielrecc · 2y

I'd recommend participating in AGISF. It's completely online/virtual and a pretty light commitment (I'd describe it more as a reading group than a course), cohorts are typically run by AI alignment researchers or people who are quite well-versed in the field, and you'll be added to a large, active Slack group, which is a reasonable way to try to get feedback.

When can we trust model evaluations?
gabrielrecc · 2y

This is great. One nuance: This implies that behavioral RL fine-tuning evals are strictly less robust than behavioral I.I.D. fine-tuning evals, and that as such they would only be used for tasks that you know how to evaluate but not generate. But it seems to me that there are circumstances in which the RL-based evals could be more robust at testing capabilities, namely in cases where it's hard for a model to complete a task by the same means that humans tend to complete it, but where RL can find a shortcut that allows it to complete the task in another way. Is that right or am I misunderstanding something here?

For example, if we wanted to test whether a particular model was capable of getting 3 million points in the game of Qbert within 8 hours of gameplay time, and we fine-tuned on examples of humans doing the same, it might not be able to: achieving this in the way an expert human does might require mastering numerous difficult-to-learn subskills. But an RL fine-tuning eval might find the bug discovered by Canonical ES, illustrating the capability without needing the subskills that humans lean on.
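
To make the distinction I'm drawing concrete, here's rough pseudocode for the two protocols as I understand them from the post; everything in it (finetune_supervised, finetune_rl, the scoring callables) is a made-up placeholder, not a real API:

```python
# Placeholder pseudocode, not a real training stack: the callables passed in
# (finetune_supervised, finetune_rl, score) stand in for whatever you use.

def iid_finetuning_eval(model, human_demos, heldout_inputs,
                        finetune_supervised, score):
    # Fine-tune on (input, human solution) pairs sampled i.i.d. from the task
    # distribution, then measure success on held-out inputs. This can only
    # elicit strategies that resemble the human demonstrations.
    tuned = finetune_supervised(model, human_demos)
    return score(tuned, heldout_inputs)

def rl_finetuning_eval(model, environment, reward_fn,
                       finetune_rl, score):
    # Fine-tune against a reward signal alone (e.g. Qbert score). RL is free
    # to hit the target via any policy it can find, including shortcuts or
    # bugs that no human demonstration would ever exhibit.
    tuned = finetune_rl(model, environment, reward_fn)
    return score(tuned, environment)
```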

Posts

- Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results (3y, 0 comments)
- [Question] Still possible to change username? (3y, 4 comments)
- PSA for academics in Ukraine (or anywhere else) who want to come to the United Kingdom (4y, 0 comments)