Something like this may be useful, but I do struggle to come up with workable versions that try to get specific about hardware details. Most options yield Goodhart problems- e.g. shift the architecture a little bit so that real world ML performance per watt/dollar is unaffected, but it falls below the threshold because "it's not one GPU, see!" or whatever else. Throwing enough requirements at it might work, but it seems weaker as a category than "used in a datacenter" given how ML works at the moment.
It could be that we have to bite the bullet and try for this kind of extra restriction anyway if ML architectures shift in such a way that internet-distributed ML becomes competitive, but I'm wary of pushing for it before that point because the restrictions would be far more visible to consumers.
In summary, maybeshrugidunno!
A hardware protection mechanism that needs to confirm permission to run by periodically dialing home would, even if restricted to large GPU installations, brick any large scientific computing system or NN deployment that needs to be air-gapped (e.g. because it deals with sensitive personal data, or particularly sensitive commercial secrets, or with classified data). Such regulation also provides whoever controls the green light a kill switch against any large GPU application that runs critical infrastructure. Both points would severely damage national security interests.
Yup! Probably don't rely on a completely automated system that only works over the internet for those use cases. There are fairly simple (for bureaucratic definitions of simple) workarounds. The driver doesn't actually need to send a message anywhere, it just needs a token. Airgapped systems can still be given those small cryptographic tokens in a reasonably secure way (if it is possible to use the system in secure way at all), and for systems where this kind of feature is simply not an option, it's probably worth having a separate regulatory path. I bet NVIDIA would be happy to set up some additional market segmentation at the right price.
The unstated assumption was that the green light would be controlled by US regulatory entities for hardware sold to US entities. Other countries could have their own agencies, and there would need to be international agreements to stop "jailbroken" hardware from being the default, but I'm primarily concerned about companies under the influence of the US government and its allies anyway (for now, at least).
techniques similar in spirit have been seriously proposed to regulate use of cryptography (for instance, via adoption of the Clipper chip), but I think it's fair to say they have not been very successful.
I think there's a meaningful difference between attempts to regulate cryptography and regulating large machine learning deployments; consumers will never interact with the regulatory infrastructure, and the negative externalities are extremely small compared to compromised or banned cryptography.
I'm generally on board with attempts to have more precise options for referring to these concepts, and in this context I agree that policy as a term is more appropriate and that gradients from RL training don't magically include more agent juice.
That said, I do think there is an important distinction between the tendencies of systems built with RL versus supervised learning that arises from reward sparsity.
In traditional RL, individual policy outputs aren't judged in as much detail as in supervised learning. Even when comparing against RL with reward shaping, it is still likely going to be far less densely defined and constrained than, say, per-output predictive loss.
Since the target is smaller and more distant, traditional RL gives the optimizer more room to roam. I think it's correct to say that most RL implementations will have a lot of reactive bits and pieces that are selected to form the final policy, but because learning instrumental behavior is effectively required for traditional RL to get anywhere at all, it's more likely (than in predictive loss) that nonmyopic internal goal-like representations will be learned as a part of those instrumental behaviors.
Training on purely predictive loss, in contrast, is both densely informative and extremely constraining. Goals are less obviously convergently useful, and any internal goal representations that are learned need to fit within the bounds enforced by the predictive loss and should tend to be more local in nature as a result. Learned values that overstep their narrowly-defined usefulness get directly slapped by other predictive samples.
I think the greater freedom RL training tends to have, and the greater tendency to learn more broadly applicable internal goals to drive the required instrumental behavior, do make RL-trained systems feel more "agentic" even if it is not absolutely fundamental to the training process, nor even really related to the model's coherence.
Thanks! Just updated the edited version link to the latest version too. Unfortunately, the main new content is redacted, so it'll be pretty familiar.
This is a project I'd like to see succeed!
For what it's worth, I talked to Alexandra around EAG London a couple of times (I'm Ross, hi again!) and I think she has a good handle on important coordination problems. I encourage people to apply.
Bit of a welp:
NVIDIA Q1 FY24 filings just came out. In the May 9th edit, I wrote:
I suspect that NVIDIA’s data center revenue will recover in the next year or so.
In reality, it had already recovered and was in the process of setting a new record.
If the number of tokens in the input sentence is the input size of its time complexity, which I'm sure you can agree is the obvious choice
Yeah, you're not alone in thinking that- I think several people have been tripped up by that in the post. Without making it clear, my analysis just assumed that the context window was bounded by some constant, so scaling with respect to token counts went out the window. So:
Correct me if I'm wrong but it seems like you are saying that for each token generated, the transformer is only allowed to process for a constant amount of time ... Additionally assuming it is only generating one token.
Yup.
This is one of the things I'm clarifying for the openphil submission version of the post, along with a section trying to better tie together why it matters. (More than one person has come away thinking something like "but every algorithm bottoms out at individual constant time steps, this isn't interesting, CoT etc.")
Quarter-baked ideas for potential future baking:
I'm using the word "shard" here to just mean "a blob of conditionally activated preferences." It's probably importing some other nuances that might be confusing because I haven't read enough of shard theory things to catch where it doesn't work.
This idea popped into my head during a conversation with someone working on how inconsistent utilities might be pushed towards coherence. It was at the Newspeak House the evening of the day after EAG London 2023. Unfortunately, I promptly forgot their name! (If you see this, hi, nice talking to you, and sorry!)
The openphil contest is approaching, so I'm working on an edited version. Keeping this original version as-is seems like a good idea- both as a historical record and because there's such a nice voiceover!
I've posted the current version over on manifund with a pdf version. If you aren't familiar with manifund, I'd recommend poking around. Impact certificates are neat, and I'd like them to become more of a thing!
The main changes are:
Overall, I'm pretty happy with how the post has fared in the last several months. The largest miss is probably the revenue forecasts- I didn't anticipate massive semiconductor export restrictions. Given the complexity, I'm not sure how to interpret this in terms of AI timelines yet. It's notable that hyperscalers are a large and rapidly growing customer base for NVIDIA that already managed to mitigate temporary losses, and I doubt the recently strengthened race dynamics are going to change that (until those companies decide to push alternatives for ML hardware).
My timelines haven't noticeably changed. GPT-4 is around the median of my previous vaguely-gut-defined capability distribution. I anticipate the next generation of applications that build some infrastructure around GPT-4 level systems (like the next version of github copilot) will surprise a few more people, just because the full capability of GPT-4 isn't immediately apparent in a pure dialogue setting.
My P(doom) has actually decreased since I wrote the post: I'm down to around 30-35% ish. I had only recently gotten into serious technical safety research when I wrote the post, so some volatility isn't surprising, but I'm glad it went the direction it did. That reduction is mostly related to some potential implications of predictor/simulator research efforts (not necessarily complete solutions, but rather certain things being easier than expected) and positive news about the nature of the problem and interpretability. (Worth noting that number expects Effort and I do not expect default flailing to work out, and that my estimate should still be treated as relatively volatile.)
I'd like to offer further reinforcement on this point:
In other words: when it comes to exercise, doing anything really does help! (Just don't hurt yourself.)