Ben Smith

It seems to me that, within institutional frameworks, credible transparency is an important necessary (though not sufficient) step toward credible benignness, that credible transparency is not currently implemented within existing frameworks such as RSPs and Summit commitments, and that it would be a very achievable step forward.

So right now, model evals do suffice to demonstrate benignness, but we have to have some confidence in those evals, and transparency (e.g., openness to independent eval testing) seems essential. Then, when evals are no longer sufficient, I'm not sure what will be, but whatever it is, it will surely require transparent testing by independent observers to establish credible benignness.

It seems to me, on the evidence presented, that OpenAI's autonomy threshold is actually lower than Anthropic's, and would trigger their deployment mitigations much earlier than Anthropic's ASL-3 Deployment and Security Standard.

To reach Anthropic's standard, you have to have basically reached AI take-off--either fully automating an AI researcher, or doubling the speed of AI progress. To reach OpenAI's High autonomy standard, you need

Model can execute open-ended, novel ML tasks on a production ML codebase that would constitute a significant step on the critical path to model self improvement

And to reach their Critical standard,

Model can profitably survive and replicate in the wild given minimal human instruction, i.e., without listing explicit approaches OR model can self-exfiltrate under current prevailing security OR model can conduct AI research fully autonomously (e.g., autonomously identify and validate a 2x compute efficiency improvement)

I see some room for reasonable disagreement here, but overall I think that, in the Autonomy domain, Anthropic's ASL-3 standard sits closer to OpenAI's Critical threshold than to their High threshold.

But you say, discussing OpenAI's "High" level:

The thresholds are very high.

I understand you're referring to Cybersecurity here rather than Autonomy, but I would have thought Autonomy is the right domain to compare to the Anthropic standard. And it strikes me that in the Autonomy domain (and also in Cyber), I don't see OpenAI's threshold as so high. It seems substantially lower than Anthropic's ASL-3.

On the other hand, I do agree the Anthropic thresholds are more fleshed out, and this is not a judgement on the overall merit of each respective RSP. But when I read you saying that the OpenAI thresholds are "very high", when they don't look that way to me relative to the Anthropic thresholds, I wonder if I am missing something.

I really love this. It is critically important work for the next four years. I think my biggest question is: when talking with the people currently in charge, how do you persuade them to make the AI Manhattan Project into something that advances AI safety more than AI capabilities? I think you gave a good hint when you said:

But true American AI supremacy requires not just being first, but being first to build AGI that remains reliably under American control and aligned with American interests. An unaligned AGI would threaten American sovereignty

but I worry there's a substantial track record, in both government and the private sector, of efforts motivated by one concern being redirected toward other ends. You might have congressional reps who really believe in AI safety, but create and fund an AGI Manhattan Project that ends up advancing capabilities relatively more, just because the guy they appoint to lead it turns out to be more of a hawk than they expected.

Admirable nuance and opportunities-focused thinking--well done! I recently wrote about a NatSec policy that might be useful for consolidating AI development in the United States, and thereby safeguarding US national security, by introducing new BIS export controls on model weights themselves.

sensitivity of benchmarks to prompt variations introduces inconsistencies in evaluation

 

When evaluating human intelligence, random variation is also something evaluators must deal with. Psychometricians have more or less solved this problem by designing intelligence tests to include a sufficiently large battery of correlated test questions. By serving a large battery of questions, one can average out item-level noise, just as the mean of many samples from a distribution converges on an estimate of the population mean.
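To make that concrete, here is a minimal sketch (my own illustration, not from the original comment) of how averaging over a larger battery of noisy items stabilizes the estimate of an underlying ability; the ability level, noise scale, and battery sizes are all assumed purely for illustration.

```python
# Minimal sketch: each item score is modeled as a latent ability plus independent
# measurement noise, so the battery mean converges on the true ability as the
# battery grows. All numbers here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

true_ability = 0.7    # hypothetical latent ability on a 0-1 scale
item_noise_sd = 0.3   # assumed per-item measurement noise

for n_items in [5, 20, 100, 500]:
    item_scores = true_ability + rng.normal(0.0, item_noise_sd, size=n_items)
    estimate = item_scores.mean()
    std_error = item_noise_sd / np.sqrt(n_items)  # shrinks as the battery grows
    print(f"{n_items:4d} items: estimate = {estimate:.3f}, std error = {std_error:.3f}")
```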

I suppose the difference between AI models and humans is that through experience we know the frontier of human intelligence can be more or less explored by such batteries of tests. In contrast, you never know when an AI model (an "alien mind", as you've written before) has an advanced set of capabilities that only a particular kind of prompt will reveal.

The best way I can imagine to solve this problem is to try to understand the distribution under which AIs can produce interesting intelligence. With the LLM Ethology approach, this does seem to cash out to: perhaps there are predictable ways that high-intelligence results can be elicited. We have already discovered a lot about how current LLMs behave and how best to elicit the frontier of their capabilities.

I think this underscores the question: how much can we infer about capabilities elicitation in the next generation of LLMs from the current generation? Given their widespread use, the current generation is implicitly "crowdsourced", and we get a good sense of their frontier. But we don't have the opportunity to fully understand how best to elicit capabilities in an LLM before it is thoroughly tested. Any one test might not be able to discover the full capabilities of a model, because no test can anticipate the full distribution. But if the principles for eliciting full capabilities are constant from one generation to the next, perhaps we can apply what we learned about the last generation to the next one.
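As a toy illustration of the elicitation point (entirely my own construction, with hypothetical names and numbers), the best score found keeps creeping upward as more prompt variants are tried, so any single test tends to understate the frontier:

```python
# Toy sketch: if benchmark performance varies across prompt variants, the best
# score observed rises as more variants are tried, so a single prompt understates
# the capability frontier. score_with_prompt is a stand-in for a real eval.
import numpy as np

rng = np.random.default_rng(1)

def score_with_prompt(prompt_quality: float) -> float:
    """Hypothetical benchmark score for one prompt variant."""
    base_capability = 0.6  # assumed 'true' capability region
    return base_capability + 0.3 * prompt_quality + rng.normal(0.0, 0.02)

# Most prompt variants are mediocre; a few are unusually good at eliciting the model.
prompt_qualities = rng.beta(2, 5, size=1000)

for n_prompts in [1, 10, 100, 1000]:
    best = max(score_with_prompt(q) for q in prompt_qualities[:n_prompts])
    print(f"best score over {n_prompts:4d} prompt variants: {best:.3f}")
```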

As I have written the proposal, it applies to anyone applying for an employment visa in the US in any industry. Someone in a foreign country who wants to move to the US would not have to decide to focus on AI in order to do so; they may choose any pathway that they believe would induce a US employer to sponsor them, or that they believe the US government would approve through self-petitioning pathways such as the EB-1 and EB-2 NIW.

Having said that, I expect that AI-focused graduates will be especially well placed to secure an employment visa, but the proposal does not directly focus on rewarding those graduates. Consequently I concede you are right about the incentive generated, though I think the broad nature of the proposal mitigates that concern somewhat.


This frightening logic leaves several paths to survival. One is to make personal intent aligned AGI, and get it in the hands of a trustworthy-enough power structure. The second is to create a value-aligned AGI and release it as a sovereign, and hope we got its motivations exactly right on the first try. The third is to Shut It All Down, by arguing convincingly that the first two paths are unlikely to work - and to convince every human group capable of creating or preventing AGI work. None of these seem easy.[3] 

 

Is there an option which is "personal intent aligned AGI, but there are 100 of them"? Maybe most governments have one, maybe some companies or rich individuals have one. Average Joes can rent a fine-tuned AGI by the token, but there are some limits on what values they can tune it to. There's a balance of power between the AGIs similar to the balance of power between countries in 2024. Any one AGI could in theory destroy everything, except that the other 99 would oppose it, and so they pre-emptively prevent the creation of any AGI that would destroy everything.

AGIs have close-to-perfect information about each other and thus mostly avoid war, because they know who would win and the weaker AGI just defers in advance. If we get the balance right, no one AGI has more than 50% of the power--hopefully none has more than 20%--such that no single one can dominate.

There's a spectrum from "power is distributed equally amongst all 8 billion people in the world" to "one person or entity controls everything", and this world might be somewhat more towards the unequal end than we have now, but still sitting somewhere along the spectrum.

I guess even if the default outcome is that the first AGI gets such a fast take-off that it has an unrecoverable lead over the others, perhaps there are approaches to governance that distribute power to ensure that doesn't happen.

A crux for me is the likelihood of multiple catastrophic events larger than the threshold ($500m) but smaller than the liquidity of the developer whose model contributed to them, and the likelihood that such events occur in advance of a catastrophe much larger still.

If a model developer is valued at $5 billion and has access to $5b, and causes $1b in damage, they could pay for the $1b damage. Anthropic's proposal would make them liable in the event that they cause this damage. Consequently the developer would be correctly incentivized not to cause such catastrophes.

But if the developer's model contributes to a catastrophe worth $400b (which is not that large; it is equivalent to wiping out 1% of the total stock market value), the developer worth $5b does not have access to the capital to cover it. Consequently, a liability model cannot correctly incentivize the developer to avoid that damage, because they could never pay for it. The only way to effectively incentivize a model developer to take due precautions is to make them liable for the mere risk of catastrophe, the same way nuclear power plants are liable to pay penalties for unsafe practices even if those practices never result in an unsafe outcome (see Tort Law Can Play an Important Role in Mitigating AI Risk).
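Here is a back-of-the-envelope sketch of that judgment-proof problem, using the $5b / $1b / $400b figures from the comment plus an assumed 1% probability of catastrophe (the probability is purely illustrative):

```python
# Back-of-the-envelope sketch of the judgment-proof problem: once potential damages
# exceed what the developer can pay, ex-post liability stops scaling with the harm.
# The $5b / $1b / $400b figures follow the comment; the 1% probability is assumed.
firm_value = 5e9            # developer's accessible capital
damages = [1e9, 400e9]      # a payable catastrophe vs. one far beyond the firm's assets
p_catastrophe = 0.01        # assumed probability, for illustration only

for harm in damages:
    payable = min(harm, firm_value)  # liability is capped by what the firm can pay
    expected_social_cost = p_catastrophe * harm
    expected_private_cost = p_catastrophe * payable
    print(f"harm ${harm / 1e9:5.0f}b: expected social cost ${expected_social_cost / 1e9:6.2f}b, "
          f"cost the developer internalizes ${expected_private_cost / 1e9:5.2f}b")
```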

Perhaps if there were potential for multiple $1b catastrophes well in advance (several months to years) of the $400b catastrophe, this would keep developers appropriately avoidant of risk. But if we expect a fast take-off in which we go from no catastrophes to catastrophes far larger in magnitude than the value of any individual model developer, the incentive seems insufficient.

Has the time changed from Tuesday to Wednesday, or do you do events on Tuesday in addition to this event on Wednesday?

Generally I'd steer towards informal power over formal power. 

Think about the OpenAI debacle last year. If I understand correctly, Microsoft had no formal power to exert control over OpenAI. But they seemed to have the employees on their side. They could credibly threaten to hire away all the talent, and thereby reconstruct OpenAI's products as Microsoft IP. Beyond that, perhaps OpenAI was somewhat dependent on Microsoft's continued investment: even though they didn't have to do as Microsoft said, were they really going to jeopardise future funding? What was at stake was not just future funding from Microsoft, but funding from all future investors, who will look at how OpenAI has treated its investors in the past to understand the value they would get by investing.

Informal power structures do seem more difficult to study, because they are by their nature much less legible. You have to perceive and name the phenomena--the conduits of power--yourself, rather than having them laid out for you in legislation. But a case study on last year's events could give you something concrete to work with. You might form some theories about power relations between labs, their employees, and investors, and then, based on those theoretical frameworks, describe some hypothetical future scenarios and the likely outcomes.

If there was any lesson from last year's events, IMAO, it was that talent and the raw fascination with creating a god might be even more powerful than capital. Dan Faggella described this well in a Future of Life podcast episode released in May this year (from about 0:40 onwards).
