Some observations:

  1. ML-relevant hardware supply is bottlenecked at several points.
  2. One company, NVIDIA, is currently responsible for most of the purchasable ML hardware.[1]
  3. NVIDIA already implements driver licensing to force data center customers to buy into the more expensive product line.[2]
  4. NVIDIA would likely not oppose even onerous regulation on the use of its ML hardware if that regulation gave it more of a competitive moat.

Where can you stick a regulatory lever to prevent improper use of GPUs at scale?

Here's one option![3]

Align greed with oversight

  1. Buff up driver licensing with some form of hardware authentication so that only appropriately signed drivers can run. (NVIDIA may do this already; it would make sense!)
  2. Modify ML-focused products to brick themselves if they do not receive a periodic green light signal from a regulatory source with proper authentication. (A minimal sketch of what such a signed-token check could look like follows this list.)
  3. Require chain of custody for ML-focused products. If an auditable chain of custody cannot be produced for a particular piece of hardware, stop sending a green light.
  4. Use both targeted and randomized audits to confirm that hardware is being used for the stated purpose. Audits are not primarily automatic: regulators should have on-demand access to the physical hardware and verifiable detailed information on how the system is being used.
  5. If audits are blocked or refused by a product owner, stop sending a green light.
  6. While not a primary form of security, the hardware/driver could self-report on some kinds of usage patterns. It does not seem realistic to collect invasive data[4], but the combination of chain of custody and high-level samples of hardware utilization over time could help corroborate client claims or flag installations for audits.
  7. I imagine NVIDIA would be happy to offer Certified ML-MOAT-2023-B Compliant hardware and drivers. Given the opportunity, they may even push regulators to implement progressively more difficult forms of verification to make it more expensive for smaller hardware providers to break into the market.[5]
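
To make this a bit more concrete, here's a minimal sketch (Python, using the cryptography package's Ed25519 primitives) of what the driver-side check of a signed green-light token could look like. Everything here (the token format, field names, validity window) is invented for illustration; a real scheme would want hardware-backed key storage, anti-rollback protection, and so on, and the regulator would presumably only issue tokens when chain-of-custody checks pass.

```python
# Minimal sketch of a signed "green light" token check. The token format
# (hardware_id|expiry_timestamp) and all names here are invented for
# illustration; this is not a real protocol or any vendor's actual API.
import time

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519


def issue_token(regulator_key: ed25519.Ed25519PrivateKey,
                hardware_id: str, valid_until: int) -> tuple[bytes, bytes]:
    """Regulator side: sign a token for one device (presumably only after chain-of-custody checks pass)."""
    message = f"{hardware_id}|{valid_until}".encode()
    return message, regulator_key.sign(message)


def green_light_ok(regulator_public_key: ed25519.Ed25519PublicKey,
                   message: bytes, signature: bytes,
                   expected_hardware_id: str) -> bool:
    """Driver side: refuse to run large workloads unless a fresh, valid token is present."""
    try:
        regulator_public_key.verify(signature, message)  # raises if forged or tampered with
    except InvalidSignature:
        return False
    hardware_id, valid_until = message.decode().split("|")
    return hardware_id == expected_hardware_id and time.time() < int(valid_until)


# Example: issue a 30-day token for one device and verify it.
key = ed25519.Ed25519PrivateKey.generate()
msg, sig = issue_token(key, "GPU-SN-0001", int(time.time()) + 30 * 86400)
assert green_light_ok(key.public_key(), msg, sig, "GPU-SN-0001")
```

Note that verifying such a token needs only the regulator's public key and the current time, not a network connection.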

Between the cost of the hardware involved and the chain of custody information, it is unlikely that many small businesses would suffer excessive burdens as a result of the regulation. A company with 16 H100s is far less of a risk than a company with 140,000 H100s; you can probably skip the audit unless you have some reason to suspect the ostensibly small installations are attempting to exploit a loophole.

Clouds

GPU cloud service providers (CSPs) obscure chain of custody and complicate audits. Getting sufficient regulatory oversight here will likely have the most consumer-visible impacts.

One option would be to require any compute requests above a certain threshold (either in terms of simultaneous compute or integrated compute over time) to meet KYC-style requirements.
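
To make "simultaneous compute or integrated compute over time" a bit more concrete, here's a toy sketch of how a CSP might screen a rental request. The per-GPU FLOP/s rating and both thresholds are placeholders, not proposed values.

```python
# Toy screening of a cloud compute request against hypothetical KYC thresholds.
# All numbers here are made-up placeholders, not proposed regulatory values.

ASSUMED_PEAK_FLOP_PER_SEC_PER_GPU = 1e15   # rough order of magnitude for a current datacenter GPU
SIMULTANEOUS_THRESHOLD = 1e18              # FLOP/s available at once
INTEGRATED_THRESHOLD = 1e24                # total FLOP over the rental period

def requires_kyc(gpu_count: int, rental_seconds: float) -> bool:
    simultaneous = gpu_count * ASSUMED_PEAK_FLOP_PER_SEC_PER_GPU  # FLOP/s
    integrated = simultaneous * rental_seconds                     # FLOP
    return simultaneous >= SIMULTANEOUS_THRESHOLD or integrated >= INTEGRATED_THRESHOLD

print(requires_kyc(gpu_count=16, rental_seconds=7 * 86400))      # small job: False
print(requires_kyc(gpu_count=4096, rental_seconds=30 * 86400))   # large job: True
```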

Regulators could lean on the CSP to extract information from large-scale customers, and those large-scale customers could be directly required to report or submit to audits as if they owned the hardware.

Given the profit margins for GPU CSPs, I suspect most major AI labs would prefer to build out their own infrastructure or make special deals (like OpenAI and Microsoft). The main value of having requirements on CSPs would be to plug an obvious loophole that bad actors might otherwise try to abuse because it's still cheaper than their other options for getting access to tons of compute.

It is unlikely that any small research group would ever actually encounter the regulatory system. That helps keep regulatory overheads low and focuses the effort on the more dangerous massive deployments.

Threat model exclusions

This doesn't do anything against an extremely capable hostile actor in the limit. The hardware and software are in the hands of the enemy, and no amount of obfuscation can protect you forever.

In reality, though:

  1. With a decent hardware-supported implementation, extracting the secrets needed to bypass the protections can be pretty annoying.
  2. Robust chain of custody makes getting hundreds of thousands of GPUs into the hands of those who would misuse them more difficult.
  3. The main contenders in the current race dynamic operate in countries where large companies mostly obey laws. The main value of the extra security features in the hardware/software is to make it impossible to accidentally violate regulatory intent, and to make that intent legible.

This also doesn't do anything to stop fully decentralized execution across consumer-grade hardware, but efficiently scaling large training runs over the internet with current approaches seems far more challenging than in the case of, say, Folding@Home.

Regulatory targets

I don't think it's reasonable to expect the regulatory agencies to hire an army of ML experts, so it's probably going to be mostly based on metrics that can be consumed by a bureaucracy:

  1. Compute:
    How large is the training run in floating point operations? (A rough worked estimate follows this list.) What is the training data, and how much of it is there? How large are the models? If you have a reasonable suspicion that parameter counts are not a useful indicator, or if you expect the system's capabilities not to scale with parameter counts in a way similar to the published Chinchilla scaling results, see Form 27-B Reporting Aberrant Scaling Behaviors for an instruction sheet on how to report nonstandard scaling behavior in detail...
  2. Training data:
    Where is the training data from? Fill out Form 95-P Private Information Use and Form 95-C/IP Inclusion Of Copyrighted Or Protected Work.
  3. Capability:
    What is the expected capability of the system? What is the MLCCS code for the type of model you're training? If no MLCCS code matches, please fill out Form 511-47. If the model matches more than one MLCCS code, make sure to include the Expected Achievement scores for every MLCCS code. Failure to specify accurate Expected Achievement scores will result in an audit. Note that significant divergence between registered capability estimates and the deployed product may be grounds for penalties specified in Form 2210-M.
  4. Risk assessment:
    What types of negative impact may the model have? Fill out Form 98-ES Ethical and Social Impacts and Form 98-EX Is Your Model Going To Kill And/Or Torture Everyone. What factors affect the estimate of these negative effects? Could any of these negative effects be amplified by estimation errors? Fill out Form 648-Q Risk Estimates Robustness Estimate. Note that failure to completely report factors which may predictably reduce the predictability of risk may be grounds for penalties specified in Form 2210-M.
  5. Deployment target:
    How will the model be used? What customers will be served? Is it for in-house use, B2B services, government use, or consumer use?
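
To give a sense of the compute metric in item 1, here's a back-of-the-envelope sketch using the common FLOPs ≈ 6 × parameters × training tokens approximation for dense transformers. The reporting threshold is a made-up placeholder, not a real regulatory number.

```python
# Back-of-the-envelope training compute estimate using the common
# FLOPs ≈ 6 * parameters * training_tokens approximation for dense transformers.
# The reporting threshold is a made-up placeholder, not a real regulatory value.

REPORTING_THRESHOLD_FLOP = 1e25   # hypothetical "you must file paperwork" line

def training_flops(parameters: float, tokens: float) -> float:
    return 6.0 * parameters * tokens

# A Chinchilla-style run trains on roughly 20 tokens per parameter.
params = 70e9
tokens = 20 * params                      # ~1.4 trillion tokens
flops = training_flops(params, tokens)
print(f"{flops:.2e} FLOP")                # ~5.9e23 FLOP
print(flops >= REPORTING_THRESHOLD_FLOP)  # False under this placeholder threshold
```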

Physical audits would presumably collect documentation and evidence (including through access to physical systems) that reported architectures and training schemes are actually being used. The relatively scarce ML experts involved in the auditing process would likely be fed a bunch of the collected information to try to confirm that things add up. Realistically, an actively malicious corporation could slip some things past the regulators, but that's not really the threat model or goal of the regulation.

The main themes are:

  1. Collect a bunch of information,
  2. Constrain models to be less harmful in expectation by default,
  3. Provide a lever to intervene when things seem clearly bad,
  4. Provide a big stick to hit companies with if they fail to accurately report expected impacts of the model (intentionally or not; probably want something close to strict liability here). This won't prevent all oopsies, and the existence of a penalty only matters if anyone exists after the oopsy, but it does give the company a more direct incentive to not oopsy.

In other words, collect rope with which to later hang a bad actor so as to disincentivize bad actors from bad acting in the first place.

Conclusion

Using signed drivers to brick rogue GPUs seems like an easy lever for regulation to use. This isn't a complete solution, nor does it preclude other options, but it seems relatively easy to implement and helpful for avoiding the most egregious types of driving off a cliff.

It's conceivable that extended regulations could be built on a similar foundation. For example, it could serve as an enforcement mechanism for restricting the amount of compute per entity to directly combat resource races.[6]

The details could use a lot of refinement. I have no background in policy or governance, and I don't know what is legally realistic.

  1. ^

    There are a variety of would-be competitors (plenty of smaller companies like Cerebras, but also major players from other markets like AMD and Intel), but they aren't yet taking market share from NVIDIA in the ML space.

    Google's in-house TPUs are another option, and other megacorps like Microsoft, Apple, Amazon and friends have the resources to dump on making their own designs. If NVIDIA's dominance becomes a little too extractive, the biggest customers may end up becoming competitors. Fortunately, this would still be a pretty narrow field from a regulatory perspective.

  2. ^

    The data center hardware does have other advantages for ML on a per-unit basis and isn't just vastly more expensive, but there is a reason why NVIDIA restricts the use of gaming-class hardware in data centers.

  3. ^

    I'm pretty certain this is not a novel idea, but I think it deserves some signal boosting.

  4. ^

    Regulation that required sending off automatic detailed technical reports of all code being executed and all data shuffled in and out would probably run into a lot of resistance, security concerns, and practical difficulties, with limited upside. I could be wrong (maybe there's some use case for this kind of reporting), but it seems like "regulators physically walk into your offices/data centers and audit the whole system at random times" is both stronger and easier to get agreement on.

  5. ^

    NVIDIA has done this before in other fields. They've developed extensions that they knew their competitors didn't handle well, pushed for those extensions to be added to standardized graphics APIs like DirectX, and then strongly incentivized game developers to use those features aggressively. "Wow! Look at those tessellation benchmarks, NVIDIA is sooo much better!" and such.

  6. ^

    NVIDIA probably wouldn't be on board with this part!

Comments

A hardware protection mechanism that needs to confirm permission to run by periodically dialing home would, even if restricted to large GPU installations, brick any large scientific computing system or NN deployment that needs to be air-gapped (e.g. because it deals with sensitive personal data, or particularly sensitive commercial secrets, or with classified data). Such regulation also provides whoever controls the green light a kill switch against any large GPU application that runs critical infrastructure. Both points would severely damage national security interests.

On the other hand, the doom scenarios this is supposed to protect against would, at least as of the time of writing, probably be viewed by most cybersecurity professionals as an example of poor threat modelling (in this case, assuming the adversary is essentially almighty and that everything they do will succeed on their first try, whereas anything we try will fail because it is our first try).

In summary, I don't think this would (or should) fly, but obviously I might be wrong. For a point of reference, techniques similar in spirit have been seriously proposed to regulate use of cryptography (for instance, via adoption of the Clipper chip), but I think it's fair to say they have not been very successful.

porby:

A hardware protection mechanism that needs to confirm permission to run by periodically dialing home would, even if restricted to large GPU installations, brick any large scientific computing system or NN deployment that needs to be air-gapped (e.g. because it deals with sensitive personal data, or particularly sensitive commercial secrets, or with classified data). Such regulation also provides whoever controls the green light a kill switch against any large GPU application that runs critical infrastructure. Both points would severely damage national security interests.

Yup! Probably don't rely on a completely automated system that only works over the internet for those use cases. There are fairly simple (for bureaucratic definitions of simple) workarounds. The driver doesn't actually need to send a message anywhere; it just needs a token. Airgapped systems can still be given those small cryptographic tokens in a reasonably secure way (if it is possible to use the system in a secure way at all), and for systems where this kind of feature is simply not an option, it's probably worth having a separate regulatory path. I bet NVIDIA would be happy to set up some additional market segmentation at the right price.

The unstated assumption was that the green light would be controlled by US regulatory entities for hardware sold to US entities. Other countries could have their own agencies, and there would need to be international agreements to stop "jailbroken" hardware from being the default, but I'm primarily concerned about companies under the influence of the US government and its allies anyway (for now, at least).

techniques similar in spirit have been seriously proposed to regulate use of cryptography (for instance, via adoption of the Clipper chip), but I think it's fair to say they have not been very successful.

I think there's a meaningful difference between regulating cryptography and regulating large machine learning deployments; here, consumers will never interact with the regulatory infrastructure, and the negative externalities are extremely small compared to those of compromised or banned cryptography.

The regulation is intended to encourage a stable equilibrium among labs that may willingly follow that regulation for profit-motivated reasons.

Extreme threat modeling doesn't suggest ruling out plans that fail against almighty adversaries; it suggests using security mindset: reduce unnecessary load-bearing assumptions in the story you tell about why your system is secure. The proposal mostly relies on standard cryptographic assumptions and doesn't seem likely to do worse in expectation than no regulation.

There is no problem with air gaps. Public key cryptography is a wonderful thing. Let there be a license file, which is a signed statement of the hardware ID and the duration for which the license is valid. You need the private key to produce a license file, but the public key can be used to verify it. Publish a license server which can verify license files and can be run inside air-gapped networks. Done.

I think another regulatory target, particularly around the distribution of individual GPUs, would be limiting enthusiast-grade hardware (as opposed to enterprise-grade) to something like x GB of memory, where x is mandatorily readjusted every year based on risk assessments.

porby:

Something like this may be useful, but I do struggle to come up with workable versions that try to get specific about hardware details. Most options yield Goodhart problems: e.g., shift the architecture a little bit so that real-world ML performance per watt/dollar is unaffected, but the product falls below the threshold because "it's not one GPU, see!" or whatever else. Throwing enough requirements at it might work, but it seems weaker as a category than "used in a datacenter" given how ML works at the moment.

It could be that we have to bite the bullet and try for this kind of extra restriction anyway if ML architectures shift in such a way that internet-distributed ML becomes competitive, but I'm wary of pushing for it before that point because the restrictions would be far more visible to consumers.

In summary, maybeshrugidunno!