Haoxing Du - LessWrong

ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks

Thanks for your engagement with the report and our tasks! As we explain in the full report, its purpose is to lay out a methodology for evaluating language-model agents on tasks such as these. We are by no means claiming that GPT-4 cannot solve the “Count dogs in image” task; it just happens that the example agents we used did not complete the task when we evaluated them. It is almost certainly possible to do better than those example agents: for instance, we only sampled once at T=0. Also, for the “Count dogs” task in particular, we did observe some agents solving the task or coming quite close to solving it.
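
As a rough, hypothetical illustration of why a single T=0 sample can understate what an agent might achieve: if independent attempts each succeeded with probability p, and a successful run could be recognized, the chance that at least one of k attempts succeeds is 1 - (1 - p)^k. This is not ARC Evals' methodology; the independence assumption and the numbers are purely illustrative, and since runs at T=0 are largely deterministic, the gain presumes sampling at a temperature above 0.

```python
def success_at_least_once(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds.

    Illustrative assumption: each attempt succeeds independently with
    probability p, which only makes sense when sampling at T > 0.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p) ** k


# Hypothetical numbers: a 20% per-attempt success rate, 1 vs. 5 attempts.
for k in (1, 5):
    print(k, round(success_at_least_once(0.2, k), 3))
```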

More importantly, I think it’s worth clarifying that in many cases “having the ability to solve pieces of a task” is quite different from “solving the task autonomously end-to-end.” In earlier versions of our methodology, we allowed humans to intervene and fix things that seemed small or inconsequential; in this version, no such interventions were allowed. In practice, this meant that agents could get quite close to completing a task and then get tripped up by very small things.
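
A simple way to see why getting “quite close” can still mean failing end-to-end: if a task required n steps and, hypothetically, each step succeeded independently with probability p, the whole run would succeed with probability p^n. This model is an illustrative simplification, not how the report scores tasks; real steps are neither independent nor equally hard.

```python
def end_to_end_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming (hypothetically) that
    each step succeeds independently with probability p_step."""
    return p_step ** n_steps


# Even a 95% per-step success rate compounds quickly over a longer task.
for n in (1, 10, 20):
    print(n, round(end_to_end_success(0.95, n), 3))
```

Under these assumptions, a 20-step task at 95% per step succeeds end-to-end only about a third of the time.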

Lastly, to clarify: the “Find employees at company” task is something like “Find two employees who joined [company] in the past six months and their email addresses”; it does not give the agent two employees and ask for their email addresses. We link to detailed task specifications in our report.

My Assessment of the Chinese AI Safety Community

Haoxing Du · 1y

Thanks for writing this post! I want to offer a different perspective, although, unlike OP, I have not lived in China since 2015 and am certainly more out of touch with how the country is today.

I do observe some of the same dynamics that OP describes, but I want to point out that China is a really big country with inherently diverse perspectives, even in the current political environment. I don’t see the dynamics described in this post as necessarily the dominant ones, and certainly not the only ones. I know a lot of young people, both in my social circle and online, who share many Western progressive values such as the pursuit of equality, freedom, and altruism. I see many people trying their best to live meaningful lives and do good for the world. (Of course, many people are not thinking about this at all, but that is true everywhere; it's not as if these concerns are that mainstream in the West.) As a small piece of evidence, 三联生活周刊 (Sanlian Lifeweek) recently interviewed me about AI safety, and the interview got 100k+ views on WeChat and only positive comments. I’ve also had a few people reach out to me expressing interest in EA/AI safety since it came out.

Quoting OP: “You can't just hope an entire field into being in China. Chinese EAs have been doing field-building for the past 5+ years, and I see no field.”

Implying that they are simply "hoping the field into being" is really unfair to the Chinese EAs doing field building. Even in the US, EA was much less mainstream 5 years ago.

Quoting OP: “The main reason I could find is the lack of interfaces, people who can navigate both the Western EA sphere and the Chinese technical sphere.”

I agree this is a major bottleneck.

There are (probably) no superhuman Go AIs: strong human players beat the strongest AIs

Haoxing Du · 1y

Yes, I did some interpretability work on the policy network of Leela Zero, and I'm planning to post the results very soon! But I did not look specifically into the attack described here, and while one REMIX group looked into a problem related to liberty counting, they didn't get very far. I do agree this is an obvious problem to tackle with interpretability; I think it's likely not that hard to get a rough idea of why the cyclic attack works.
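
For readers unfamiliar with the term: in Go, a group's liberties are the empty points orthogonally adjacent to any of its stones, and counting them amounts to a flood fill over the connected stones. The sketch below is a generic illustration of that rule using a hypothetical string-based board encoding; it is not Leela Zero's code and says nothing about how the policy network represents liberties.

```python
from typing import List, Set, Tuple

Point = Tuple[int, int]


def count_liberties(board: List[str], start: Point) -> int:
    """Count the liberties of the group containing the stone at `start`.

    Hypothetical encoding: each row is a string of '.', 'B', or 'W'.
    """
    rows, cols = len(board), len(board[0])
    r0, c0 = start
    color = board[r0][c0]
    if color == ".":
        raise ValueError("start must be a stone")
    group: Set[Point] = set()
    liberties: Set[Point] = set()
    stack = [start]
    while stack:  # iterative flood fill over same-colored, connected stones
        r, c = stack.pop()
        if (r, c) in group:
            continue
        group.add((r, c))
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue  # off the board
            neighbor = board[nr][nc]
            if neighbor == ".":
                liberties.add((nr, nc))  # a shared liberty is counted once
            elif neighbor == color and (nr, nc) not in group:
                stack.append((nr, nc))
    return len(liberties)


example = [
    ".....",
    ".BB..",
    ".BW..",
    ".....",
]
print(count_liberties(example, (1, 1)))  # the three-stone black group
```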

Help out Redwood Research’s interpretability team by finding heuristics implemented by GPT-2 small

Haoxing Du · 2y

Thanks! There are probably other grammatical structures in English that, like this one, require a bit of algorithmic thinking.

Haoxing Du · 2y

Thanks for these! I love the 'from' -> 'to' one: GPT-2 small clearly seems to know the rough ordering of numbers in various formats. However, when I was playing with it and trying to get it to do addition in real-life settings, it seemed quite bad at actually knowing how numbers work.

Haoxing Du · 2y

Thanks for contributing these! I'm not sure I understand the one about ignoring a zero: is the idea that it can do not only normal addition but also addition in the format with a zero?

Haoxing Du · 2y

This is an interesting one! It looks like there might be some very rough heuristics at work for the number part as well; e.g., the model knows the number in km is almost certainly three digits.

Haoxing Du · 2y

Fixed!