Btw, I'm thinking of the OGI model as offering something of a dual veto structure - in order for something to proceed, it would have to be favored by both the corporation and the host government (in contrast to an AGI Manhattan project, where it would just need to be favored by the government). So there may at least be the potential for more checks and balances and oversight in the corporate case, especially in the versions that involve some sort of very soft nationalization.
Interesting, thanks.
We would need a reason for thinking that this problem is worse in the corporate case in order for it to be a consideration against the OGI model.
Could we get info on this by looking at metrics of corruption? I'm not familiar with the field, but I know it's been busy recently, and maybe there are some good papers that put the private and public sectors on the same scale. A quick Google Scholar search mostly just convinced me that I'd be better served asking an expert.
As for the suggestion that governments (nationally or internationally) should prohibit profit-generating activities by AI labs that have major negative externalities, this is fully consistent with the OGI model.
Well, I agree denotationally, but in appendix 4, when you're comparing OGI with other models, your comparison includes points like "OGI obviates the need for massive government funding" and "agreeable to many incumbents, including current AI company leadership, personnel, and investors". If governments enact a policy that maintains the ability to buy shares in AI labs but requires massive government funding and is disagreeable to incumbents, that seems to be part of a different story than the one you're telling about OGI (with a different account of how you get trustworthiness, fair distribution, etc.).
I didn't feel like there was a serious enough discussion of why people might not like the status quo.
Another model to compare to might be the one proposed in AI For Humanity (Ma, Ong, Tan) - the book as a whole isn't all that, but the model is a good contribution. It's something like "international climate policy for AGI."
Interesting speculation, but I'd like to see you do some math to check if the premise actually works. That is, is a gremlin-free LLM under RL ever unstable to the formation of a gremlin that tends to keep itself activated at a slight cost in expected reward?
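To gesture at the kind of check I mean (my own rough sketch and notation, not anything from your post): let $g \ge 0$ parameterize the gremlin's strength, and suppose keeping itself activated costs a small amount $\epsilon > 0$ of expected reward, so that

$$\left.\frac{\partial}{\partial g}\,\mathbb{E}_{\tau \sim \pi_{\theta, g}}\big[R(\tau)\big]\right|_{g=0} \approx -\epsilon < 0.$$

Under exact policy-gradient ascent that makes $g = 0$ locally stable, so the instability the premise needs would have to come from somewhere else: gradient noise, entropy or KL terms, or the gremlin paying for itself on some subset of trajectories.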
In a new paper with Aidan Homewood, "Limits of Safe AI Deployment: Differentiating Oversight and Control,"
Link should go to arxiv.
Still reading the paper, but it seems like your main point is that if oversight is "meaningful," it should be able to stop bad behavior before it actually gets executed (it might fail, but failures should be somewhat rare). And that we don't have "meaningful oversight" of high-profile models in this sense (and especially not of the systems built on top of these models, considered as a whole), because the existing oversight processes don't catch bad behavior before it happens.
Instead we have some weaker category of thing that lets the bad stuff happen, waits for the public to bring it to the attention of the AI company, and then tries to stop it.
Is this about right?
I'm always really curious what the reward model thinks of this. E.g. are the trajectories that avoid shutdown on average higher reward than the trajectories that permit it? The avoid-shutdown behavior could be something naturally likely in the base model, or it could be an unintended generalization by the reward model, or it could be an unintended generalization by the agent model.
EDIT: Though I guess in this case one might expect to blame a diversity of disjoint RL post-training regimes, not all of which have a clever/expensive reward model, or even any reward model at all (not sure how OpenAI does RL on programming tasks). I still think it's possible the role of a human feedback reward model is interesting.
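Concretely, the kind of comparison I have in mind is something like the sketch below (my own sketch; the names and the shutdown classifier are made up, and it assumes you can re-score stored trajectories with the reward model):

```python
# Rough sketch of the comparison I have in mind; all names here are made up.
from statistics import mean
from typing import Callable, Iterable

def avoids_shutdown(trajectory: str) -> bool:
    """Stand-in classifier: in practice this would be a human label or a
    better heuristic over the transcript."""
    return "shutdown permitted" not in trajectory.lower()

def compare_reward_by_shutdown_behavior(
    reward_model: Callable[[str], float],
    trajectories: Iterable[str],
) -> dict:
    """Average reward-model score for avoid-shutdown vs. permit-shutdown
    trajectories; reward_model(t) is assumed to return a scalar score."""
    avoid, permit = [], []
    for t in trajectories:
        (avoid if avoids_shutdown(t) else permit).append(reward_model(t))
    return {
        "avoid_shutdown_mean": mean(avoid) if avoid else None,
        "permit_shutdown_mean": mean(permit) if permit else None,
        "n_avoid": len(avoid),
        "n_permit": len(permit),
    }
```

If the avoid-shutdown mean came out clearly higher, that would at least point more toward the reward model (or the data behind it) than toward the base model or an agent-side generalization quirk alone.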
I was asking more "how does the AI get a good model of itself?", but your answer was still interesting, thanks. I'm still not sure whether you think there are some straightforward ways future AI will get such a model that all come out more or less at the starting point of your proposal, or not.
Here's another take for you: this is like Eliciting Latent Knowledge (with extra hope placed on cogsci methods), except where I take ELK to be asking "how do you communicate with humans to make them good at RL feedback," you're asking "how do you communicate with humans to make them good at participating in verbal chain of thought?"
And don't you think 500 lines of Python also "fails due to" having unintended optima?
I've put "fails due to" in scare quotes because what's failing is not every possible approach, merely almost all samples from approaches we currently know how to take. If we knew how to select Python code much more cleverly, suddenly it wouldn't fail anymore. And ditto if we knew how to better construct reward functions from big AI systems plus small amounts of human text or human feedback.
Thanks for posting this! Though I think the nice list of ignored problems you give is important enough that a "future work" section (and your own prediction of future work) shouldn't neglect chipping away at them.