Hjalmar_Wijk - LessWrong

We might be dropping the ball on Autonomous Replication and Adaptation.

Yeah that's right, I made too broad a claim and only meant to say it was an argument against their ability to pose a threat as rogue independent agents.

We might be dropping the ball on Autonomous Replication and Adaptation.

Hjalmar_Wijk2moΩ792

I think in our current situation shutting down all rogue AI operations might be quite tough (though limiting their scale is certainly possible, and I agree with the critique that absent slowdowns or alignment problems or regulatory obstacles etc. it would be surprising if these rogue agents could compete with legitimate actors with many more GPUs).

Assuming the AI agents have money there are maybe three remaining constraints the AI agents need to deal with:

purchasing/renting the GPUs,
(if not rented through a cloud provider) setting them up and maintaining them, and
evading law enforcement or other groups trying to locate them and shut them down

For acquiring the GPUs, there are currently a ton of untracked GPUs spread out around the world, both gaming GPUs like 4090s and datacenter GPUs like H100s. I can buy or rent them with extremely minimal KYC from tons of different places. If we assume ARA agents are able to do all of the following:

Recruit people to physically buy GPUs from retail stores
Take delivery of online ordered GPUs in anonymized ways using intermediaries
Recruit people to set up fake AI startups and buy GPUs through them

then the effort required to prevent them from acquiring any GPUs in a world anything like today's seems crazy. And even if the US + allies started confiscating or buying up 4090s en masse, it would be enough for some other countries to adopt laxer standards (perhaps to attract AI talent that is frustrated by the new draconian GPU-control regime).

As for setting up acquired GPUs, I think the AI agents will probably be able to find colocation operators that don't ask too many questions in many parts of the world, but even if the world were to coordinate on extremely stringent controls, the AI agents could set up datacenters of their own in random office buildings or similar - each inference setup wouldn't need that much power and I think it would be very hard to track down.

As for not being shut down by law enforcement, I think this might take some skill in cybersecurity and opsec etc. but if they operate as dozens or hundreds of separate "cells", and each has enough labor to figure out decorrelated security and operations etc. then it seems quite plausible that they wouldn't be shut down. Historically "insurgency groups" or hacker networks etc. seem to be very difficult to fully eliminate, even when a ton of resources are thrown at the problem.

I don't think any of the above would require superhuman abilities, though many parts are challenging, which is part of why evals targeting these skills could provide a useful safety case - e.g. if it's clear that the AI could not pull off the cybersecurity operation required to not be easily shut down then this is a fairly strong argument that this AI agent wouldn't be able to pose a threat [Edit: from rogue agents operating independently, not including other threat models like sabotaging things from within the lab etc.].

Though again I am not defending any very strong claim here, e.g. I'm not saying:

that rogue AIs will be able to claim 5+% of all GPUs or an amount competitive with a well-resourced legitimate actor (I think the world could shut them out of most of the GPU supply, and that over time the situation would worsen for the AI agents as the production of more/better GPUs is handled with increasing care),
that these skills alone mean it poses a risk of takeover or that it could cause significant damage (I agree that this would likely require significant further capabilities, or many more resources, or already being deployed aggressively in key areas of the military etc.)
that "somewhat dumb AI agents self-replicate their way to a massive disaster" is a key threat model we should be focusing our energy on

I'm just defending the claim that ~human-level rogue AIs in a world similar to the world of today might be difficult to fully shut down, even if the world made a concerted effort to do so.

We might be dropping the ball on Autonomous Replication and Adaptation.

Hjalmar_Wijk2moΩ790

I did some BOTECs on this and think 1 GB/s is sort of borderline, probably works but not obviously.

E.g. I assumed an MoE model of ~10TB at fp8 (so ~10T parameters) with a sparsity factor of 4 and a hidden size of 32768.

With 32kB per token (32768 activations at one byte each) you could send at most 30k tokens/second over a 1 GB/s interconnect. Not quite sure what a realistic utilization would be, but maybe we halve that to 15k?

If the model was split across 20 8xH100 boxes, then each box might do ~250 GFLOP/token (2 * 10T parameters / (4*20)), so each box would do at most 3.75 PFLOP/second, which might be roughly 20-25% utilization.
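The arithmetic above can be replayed as a rough sketch. The model and interconnect figures are the ones from the comment; the per-GPU peak throughput (`H100_PEAK_FLOPS`) is an assumption added here for illustration, as are all the names.

```python
# Back-of-the-envelope check of the interconnect estimate.
HIDDEN_SIZE = 32768            # activations sent per token
BYTES_PER_ACTIVATION = 1       # fp8
LINK_BYTES_PER_S = 1e9         # 1 GB/s interconnect
LINK_UTILIZATION = 0.5         # "maybe we halve that"
TOTAL_PARAMS = 10e12           # ~10TB at fp8
SPARSITY = 4                   # 1/4 of parameters active per token
NUM_BOXES = 20                 # 8xH100 boxes
# ASSUMPTION (not stated in the comment): ~2 PFLOP/s dense fp8 per H100.
H100_PEAK_FLOPS = 2e15
BOX_PEAK_FLOPS = 8 * H100_PEAK_FLOPS


def botec():
    bytes_per_token = HIDDEN_SIZE * BYTES_PER_ACTIVATION
    max_tokens_per_s = LINK_BYTES_PER_S / bytes_per_token
    tokens_per_s = max_tokens_per_s * LINK_UTILIZATION
    # 2 FLOP per active parameter per token, split evenly across boxes.
    flop_per_token_per_box = 2 * TOTAL_PARAMS / (SPARSITY * NUM_BOXES)
    box_flops = tokens_per_s * flop_per_token_per_box
    return {
        "max_tokens_per_s": max_tokens_per_s,
        "tokens_per_s": tokens_per_s,
        "flop_per_token_per_box": flop_per_token_per_box,
        "box_flops": box_flops,
        "utilization": box_flops / BOX_PEAK_FLOPS,
    }


if __name__ == "__main__":
    for name, value in botec().items():
        print(f"{name}: {value:.4g}")
```

Without rounding the token rate down to 15k, the per-box figure comes out slightly above 3.75 PFLOP/s, and under the assumed peak throughput the utilization still lands in the 20-25% range.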

This is not bad, but for a model with much more sparsity or GPUs with a different FLOP/s : VRAM ratio or spottier connection etc. the bandwidth constraint might become quite harsh.

(the above is somewhat hastily reconstructed from some old sheets, might have messed something up)

New voluntary commitments (AI Seoul Summit)

Hjalmar_Wijk2mo10

Things that distinguish an "RSP" or "RSP-type commitment" for me (though as with most concepts, something could lack a few of the items below and still overall seem like an "RSP" or a "draft aiming to eventually be an RSP"):

Scaling commitments: Commitments are not just about deployment but about creating/containing/scaling a model in the first place (in fact for most RSPs I think the scaling/containment commitments are the focus, much more so than the deployment commitments which are usually a bit vague and to-be-determined)
Red lines: The commitment spells out red lines in advance where even creating a model past that line would be unsafe given their current practices/security/alignment research, likely including some red lines where the developer admits they do not know how to ensure safety anymore, and commits to not reaching that point until the situation changes.
Iterative policy updates: For red lines where the developer doesn't know what exact mitigations would be sufficient to ensure safety, they identify a previous point where they do think they can mitigate risks, and commit to not scaling further than that until they have published a new commitment and given the public a chance to scrutinize it.
Evaluations and accountability: I think this is an area that many RSPs do poorly at, where the developer presents clear externally accountable evidence that:
- They will not suddenly cross any of their red lines before the mitigations are implemented/a new RSP version has been published and given scrutiny, by pointing at specific evaluation procedures and policies
- Their mitigations/procedures/evaluations etc. will be implemented faithfully and in the spirit of the document, e.g. through audits/external oversight/whistleblowing etc.

Voluntary commitments that wouldn't be RSPs:

We commit to various deployment mitigations, incident sharing etc. (these are not about scaling)
We commit to [amazing set of safety practices, including state-proof security and massive spending on alignment etc.] by 2026 (this would be amazing, but doesn't identify red lines for when those mitigations would no longer be enough, and doesn't make any commitment about what they would do if they hit concerning capabilities before 2026)

... many more obviously, I think RSPs are actually quite specific.

Modern Transformers are AGI, and Human-Level

Hjalmar_Wijk4moΩ332

Yeah, I agree that lack of agency skills is an important part of the remaining human<>AI gap, and that it's possible that this won't be too difficult to solve (and that this could then lead to rapid further recursive improvements). I was just pointing toward evidence that there is a gap at the moment, and that current systems are poorly described as AGI.

Modern Transformers are AGI, and Human-Level

Hjalmar_Wijk4moΩ183013

I agree the term AGI is rough and might be more misleading than it's worth in some cases. But I do quite strongly disagree that current models are 'AGI' in the sense most people intend.

Examples of very important areas where 'average humans' plausibly do way better than current transformers:

Most humans succeed in making money autonomously. Even if they might not come up with a great idea to quickly 10x $100 through entrepreneurship, they are able to find and execute jobs that people are willing to pay a lot of money for. And many of these jobs are digital and could in theory be done just as well by AIs. Certainly there is a ton of infrastructure built up around humans that helps them accomplish this which doesn't really exist for AI systems yet, but if this situation was somehow equalized I would very strongly bet on the average human doing better than the average GPT-4-based agent. It seems clear to me that humans are just way more resourceful, agentic, able to learn and adapt etc. than current transformers are in key ways.
Many humans currently do drastically better on the METR task suite (https://github.com/METR/public-tasks) than any AI agents, and I think this captures some important missing capabilities that I would expect an 'AGI' system to possess. This is complicated somewhat by the human subjects not being 'average' in many ways, e.g. we've mostly tried this with US tech professionals and the tasks include a lot of SWE, so most people would likely fail due to lack of coding experience.
Take enough randomly sampled humans and set them up with the right incentives and they will form societies, invent incredible technologies, build productive companies etc. whereas I don't think you'll get anything close to this with a bunch of GPT-4 copies at the moment.

I think AGI for most people evokes something that would do as well as humans on real-world things like the above, not just something that does as well as humans on standardized tests.

Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust

Hjalmar_Wijk10mo92

They do as far as I can tell commit to a fairly strong sort of "timeline" for implementing these things: before they scale to ASL-3 capable models (i.e. ones that pass their evals for autonomous capabilities or misuse potential).

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Hjalmar_Wijk1yΩ242

ARC evals has only existed since last fall, so for obvious reasons we have not evaluated very early versions. Going forward I think it would be valuable and important to evaluate models during training or to scale up models in incremental steps.

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Hjalmar_Wijk1y228

I work at ARC evals, and mostly the answer is that this was sort of a trial run.

For some context, ARC evals spun up as a project last fall, and collaborations with labs are very much in their infancy. We tried to be very open about limitations in the system card, and I think the takeaways from our evaluations so far should mostly be that "It seems urgent to figure out the techniques and processes for evaluating dangerous capabilities of models, several leading scaling labs (including OpenAI and Anthropic) have been working with ARC on early attempts to do this".

I would not consider the evaluation we did of GPT-4 to be nearly good enough for future models where the prior of existential danger is higher, but I hope we can make quick progress and I have certainly learned a lot from doing this initial evaluation. I think you should hold us and the labs to a much higher standard going forward though, in light of the capabilities of GPT-4.

They gave LLMs access to physics simulators

Hjalmar_Wijk2y21

I might be missing something but are they not just giving the number of parameters (in millions of parameters) on a log10 scale? Scaling laws are usually by log-parameters, and I suppose they felt that it was cleaner to subtract the constant log(10^6) from all the results (e.g. taking log(1300) instead of log(1.3B)).

The B they put at the end is a bit weird though.
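A quick sketch of the reading suggested above, using the 1.3B example from the comment: taking log10 of the parameter count in millions is the same as subtracting the constant 6 from log10 of the raw count. The variable names are only illustrative.

```python
import math

params = 1.3e9                               # 1.3B parameters
log_raw = math.log10(params)                 # log10 of the raw count
log_millions = math.log10(params / 1e6)      # log10(1300), count in millions

# The two differ only by the constant log10(10^6) = 6.
print(round(log_raw, 2), round(log_millions, 2))
```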
