Hjalmar_Wijk

Comments

Things that distinguish an "RSP" or "RSP-type commitment" for me (though as with most concepts, something could lack a few of the items below and still overall seem like an "RSP" or a "draft aiming to eventually be an RSP")

  • Scaling commitments: Commitments are not just about deployment but about creating/containing/scaling a model in the first place (in fact, for most RSPs I think the scaling/containment commitments are the focus, much more so than the deployment commitments, which are usually a bit vague and to-be-determined).
  • Red lines: The commitment spells out red lines in advance where even creating a model past that line would be unsafe given the developer's current practices/security/alignment research, likely including some red lines where the developer admits they do not know how to ensure safety anymore, and commits to not reaching that point until the situation changes.
  • Iterative policy updates: For red lines where the developer doesn't know what exact mitigations would be sufficient to ensure safety, they identify an earlier point where they do think they can mitigate risks, and commit to not scaling further than that until they have published a new commitment and given the public a chance to scrutinize it.
  • Evaluations and accountability: I think this is an area where many RSPs do poorly; the idea is that the developer presents clear, externally accountable evidence that:
    • They will not suddenly cross any of their red lines before the mitigations are implemented/a new RSP version has been published and given scrutiny, by pointing at specific evaluation procedures and policies
    • Their mitigations/procedures/evaluations etc. will be implemented faithfully and in the spirit of the document, e.g. through audits/external oversight/whistleblowing etc.

Voluntary commitments that wouldn't be RSPs:

  • We commit to various deployment mitigations, incident sharing etc. (these are not about scaling)

  • We commit to [amazing set of safety practices, including state-proof security and massive spending on alignment etc.] by 2026 (this would be amazing, but doesn't identify red lines for when those mitigations would no longer be enough, and doesn't make any commitment about what they would do if they hit concerning capabilities before 2026)

... many more obviously, I think RSPs are actually quite specific.
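
To make that concrete, here is a deliberately over-simplified schematic of the structure I have in mind. Everything in it (the class names, fields, and the compute-based eval trigger) is made up for illustration and is not taken from any actual lab's policy.

```python
# Schematic sketch of an RSP-style commitment: red lines tied to capability
# evaluations, mitigations required before crossing them, and a pause rule
# when sufficient mitigations aren't yet known. Illustrative only.

from dataclasses import dataclass, field
from typing import List

@dataclass
class RedLine:
    capability: str                  # e.g. "autonomous replication"
    required_mitigations: List[str]  # what must be in place before crossing it
    mitigations_known: bool          # does the developer even know what would suffice?

@dataclass
class ResponsibleScalingPolicy:
    red_lines: List[RedLine] = field(default_factory=list)
    eval_trigger: str = "run capability evals at every N-fold increase in training compute"

    def may_scale_past(self, red_line: RedLine, mitigations_in_place: List[str]) -> bool:
        """Scaling past a red line is allowed only if the required mitigations are
        both known and already implemented; otherwise the policy says to pause and
        publish an updated commitment for public scrutiny first."""
        if not red_line.mitigations_known:
            return False  # pause until a new, scrutinized policy version exists
        return all(m in mitigations_in_place for m in red_line.required_mitigations)
```

The point of the sketch is just that an RSP couples capability evaluations to scaling decisions, rather than describing safety practices in isolation.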

Hjalmar_Wijk

Yeah, I agree that lack of agency skills is an important part of the remaining human<>AI gap, and that it's possible that this won't be too difficult to solve (and that this could then lead to rapid further recursive improvements). I was just pointing toward evidence that there is a gap at the moment, and that current systems are poorly described as AGI.

Hjalmar_Wijk

I agree the term AGI is rough and might be more misleading than it's worth in some cases. But I do quite strongly disagree that current models are 'AGI' in the sense most people intend.

Examples of very important areas where 'average humans' plausibly do way better than current transformers:

  • Most humans succeed in making money autonomously. Even if they might not come up with a great idea to quickly 10x $100 through entrepreneurship, they are able to find and execute jobs that people are willing to pay a lot of money for. And many of these jobs are digital and could in theory be done just as well by AIs. Certainly there is a ton of infrastructure built up around humans that helps them accomplish this which doesn't really exist for AI systems yet, but if this situation were somehow equalized I would very strongly bet on the average human doing better than the average GPT-4-based agent. It seems clear to me that humans are just way more resourceful, agentic, able to learn and adapt, etc. than current transformers are in key ways.
  • Many humans currently do drastically better on the METR task suite (https://github.com/METR/public-tasks) than any AI agents, and I think this captures some important missing capabilities that I would expect an 'AGI' system to possess. This is complicated somewhat by the human subjects not being 'average' in many ways, e.g. we've mostly tried this with US tech professionals and the tasks include a lot of SWE, so most people would likely fail due to lack of coding experience.
  • Take enough randomly sampled humans and set them up with the right incentives and they will form societies, invent incredible technologies, build productive companies, etc., whereas I don't think you'll get anything close to this with a bunch of GPT-4 copies at the moment.

I think AGI for most people evokes something that would do as well as humans on real-world things like the above, not just something that does as well as humans on standardized tests.

They do, as far as I can tell, commit to a fairly strong sort of "timeline" for implementing these things: before they scale to ASL-3-capable models (i.e. ones that pass their evals for autonomous capabilities or misuse potential).

ARC evals has only existed since last fall, so for obvious reasons we have not evaluated very early versions. Going forward I think it would be valuable and important to evaluate models during training or to scale up models in incremental steps.

I work at ARC evals, and mostly the answer is that this was sort of a trial run.

For some context, ARC evals spun up as a project last fall, and collaborations with labs are very much in their infancy. We tried to be very open about limitations in the system card, and I think the takeaway from our evaluations so far should mostly be that "it seems urgent to figure out the techniques and processes for evaluating dangerous capabilities of models, and several leading scaling labs (including OpenAI and Anthropic) have been working with ARC on early attempts to do this".

I would not consider the evaluation we did of GPT-4 to be nearly good enough for future models where the prior of existential danger is higher, but I hope we can make quick progress and I have certainly learned a lot from doing this initial evaluation. I think you should hold us and the labs to a much higher standard going forward though, in light of the capabilities of GPT-4.

I might be missing something but are they not just giving the number of parameters (in millions of parameters) on a log10 scale? Scaling laws are usually by log-parameters, and I suppose they felt that it was cleaner to subtract the constant log(10^6) from all the results (e.g. taking log(1300) instead of log(1.3B)).

The B they put at the end is a bit weird though.
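
A quick arithmetic check of that guess (my own toy calculation, not anything from the source being discussed):

```python
# Reporting log10(parameters in millions) just shifts log10(parameters) down
# by log10(10^6) = 6, so a 1.3B-parameter model shows up as ~3.11.

import math

params = 1.3e9               # 1.3B parameters
millions = params / 1e6      # 1300 million parameters

print(math.log10(millions))  # ~3.11 (the kind of number reported)
print(math.log10(params))    # ~9.11 (the same value plus 6)
```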

Hjalmar_Wijk

As someone who has been feeling increasingly skeptical of working in academia, I really appreciate this post and the discussion on it for challenging some of my thinking here.

I do want to respond especially to this part though, which seems cruxy to me:

Furthermore, it is a mistake to simply focus on efforts on whatever timelines seem most likely; one should also consider tractability and neglectedness of strategies that target different timelines. It seems plausible that we are just screwed on short timelines, and somewhat longer timelines are more tractable. Also, people seem to be making this mistake a lot and thus short timelines seem potentially less neglected.

I suspect this argument pushes in the other direction. On longer timelines the amount of effort which will eventually get put toward the problem is much greater. If the community continues to grow at the current pace, then 20 year timeline worlds might end up seeing almost 1000x as much effort put toward the problem in total than 5 year timeline worlds. So neglectedness considerations might tell us that impacts on 5 year timeline worlds are 1000x more important than impacts on 20 year timeline worlds. This is of course mitigated by the potential for your actions to accrue more positive knock-on effects over 20 years, for instance very effective field building efforts could probably overcome this neglectedness penalty in some cases. But in terms of direct impacts on different timeline scenarios this seems like a very strong effect.
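
To spell out the rough arithmetic behind the 1000x figure (the ~60% annual growth rate below is purely an illustrative assumption of mine, not a measured number):

```python
# Toy calculation: if yearly effort on alignment keeps growing by ~60% per year,
# cumulative effort over 20 years is roughly 1000x cumulative effort over 5 years.

growth = 1.6  # assumed annual growth factor in yearly effort (illustrative)

def total_effort(years, growth=growth):
    """Cumulative effort over `years`, with year-1 effort normalized to 1."""
    return sum(growth ** t for t in range(years))

print(total_effort(20) / total_effort(5))  # ~1275, i.e. roughly 1000x
```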

On the tractability point, I suspect you need some overly confident model of how difficult alignment turns out to be for this to overcome the neglectedness penalty. E.g. Owen Cotton-Barratt suggests here using a log-uniform prior for the difficulty of unknown problems, which (unless you think alignment success in short timelines is essentially impossible) would indicate that tractability is constant. Using a less crude approximation we might use something like a log-normal distribution for the difficulty of solving alignment, where we see overall decreasing returns to effort unless you have extremely low variance (implying you know almost exactly which OOM of effort is enough to solve alignment) or extremely low probability of success by default (<< 1%).
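
Here is a minimal numerical sketch of that last point. It is entirely my own toy model with made-up parameters: a log-normal prior over how much total effort alignment requires, with a median of 10^4 effort-units and a spread of 1.5 orders of magnitude.

```python
# Toy model: treat the total effort required to solve alignment ("difficulty")
# as log-normally distributed, and look at how the success probability and its
# marginal return change as cumulative effort grows.

import numpy as np
from scipy.stats import norm

MEDIAN_OOM = 4.0  # log10 of the median effort required (assumed)
SIGMA_OOM = 1.5   # spread of the prior, in orders of magnitude (assumed)

def p_solved(effort):
    """P(required effort <= actual effort) under the log-normal prior."""
    return norm.cdf(np.log10(effort), loc=MEDIAN_OOM, scale=SIGMA_OOM)

def marginal_return(effort):
    """Derivative of the success probability with respect to (linear) effort."""
    density = norm.pdf(np.log10(effort), loc=MEDIAN_OOM, scale=SIGMA_OOM)
    return density / (effort * np.log(10))

for effort in [1e2, 1e3, 1e4, 1e5]:
    print(f"effort={effort:.0e}  P(solved)={p_solved(effort):.3f}  "
          f"marginal return={marginal_return(effort):.2e}")
```

With these numbers the marginal return per unit of effort falls monotonically as effort grows, i.e. the decreasing-returns regime; getting increasing returns out of this kind of model requires either shrinking the variance a lot or starting from a success probability far below 1%, which is the condition described above.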

Overall my current guess is that tractability/neglectedness pushes toward working on short timelines, and gives a penalty to delayed impact of perhaps 10x per decade (20x penalty from neglectedness, compensated by a 2x increase in tractability). 

If you think that neglectedness/tractability overall pushes toward targeting impact toward long timelines then I'd be curious to see that spelled out more clearly (e.g. as a distribution over the difficulty of solving alignment that implies some domain of increasing returns to effort, or some alternative way to model this). This seems very important if true.  

Strongly agree with this, I think this seems very important.

These sorts of problems are what caused me to want a presentation which didn't assume well-defined agents and boundaries in the ontology, but I'm not sure how it applies to the above - I am not looking for optimization as a behavioral pattern but as a concrete type of computation, which involves storing world-models and goals and doing active search for actions which further the goals. Neither a thermostat nor the world outside seems to do this from what I can see? I think I'm likely missing your point.
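
A minimal toy example of the kind of computation I have in mind (purely illustrative code, not a serious model of anything): the optimizer stores an explicit world model and an explicit goal, and actively searches over action sequences, which is the thing I don't see a thermostat or the outside world doing.

```python
# Toy "optimizer as a concrete computation": stored world model + stored goal
# + explicit search over action sequences that further the goal.

from itertools import product

class ToyWorldModel:
    """A stored model of a 1-D world: the state is a position, actions move it."""
    def initial_state(self):
        return 0
    def predict(self, state, action):
        return state + action  # internal simulation of the action's effect

def goal_score(state):
    return -abs(state - 5)  # explicit goal: end up close to position 5

def search_for_plan(world_model, score, actions=(-1, 0, 1), depth=5):
    """Enumerate action sequences, simulate each with the stored world model,
    and return the plan the model predicts will best satisfy the goal."""
    best_plan, best_score = None, float("-inf")
    for plan in product(actions, repeat=depth):
        state = world_model.initial_state()
        for action in plan:
            state = world_model.predict(state, action)
        if score(state) > best_score:
            best_plan, best_score = plan, score(state)
    return best_plan

print(search_for_plan(ToyWorldModel(), goal_score))  # (1, 1, 1, 1, 1)
```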
