I agree with you this is a potential problem, but at this point, we are no longer dealing with adversarial forces or deception, and thus this experiment doesn't work anymore.
Also, a point to keep in mind here is that once we assume away deception/adversarial forces, existential risk from AI, especially in Evan Hubinger's models, goes way down, as we can now use more normal methods of alignment to at least avoid X-risk.
I tend to view the events of OpenAI's firing of Sam Altman much more ambiguously than others, and IMO, it probably balances out to nothing in the end, so I don't care as much as some other people here.
To respond more substantially:
From johnswentworth:
Here's the high-gloss version of my take. The main outcomes are:
- The leadership who were relatively most focused on racing to AGI and least focused on safety are moving from OpenAI to Microsoft. Lots of employees who are relatively more interested in racing to AGI than in safety will probably follow.
- Microsoft is the sort of corporate bureaucracy where dynamic orgs/founders/researchers go to die. My median expectation is that whatever former OpenAI group ends up there will be far less productive than they were at OpenAI.
- It's an open question whether OpenAI will stick around at all. Insofar as they do, they're much less likely to push state-of-the-art in capabilities, and much more likely to focus on safety research. Insofar as they shut down, the main net result will be a bunch of people who were relatively more interested in racing to AGI and less focused on safety moving to Microsoft, which is great.
I agree with a rough version of the claim that they might be absorbed into Microsoft, making them less likely to advance capabilities, and this is plausibly at least somewhat important.
My main disagreement here is that I don't think capabilities advances matter as much for AI doom as LWers think, and slowing them down may even be anti-helpful, depending on the circumstances. This probably comes down to very different views on stuff like how strong the priors need to be, etc.
From johnswentworth:
There's apparently been a lot of EA-hate on twitter as a result. I personally expect this to matter very little, if at all, in the long run, but I'd expect it to be extremely disproportionately salient to rationalists/EAs/alignment folk.
I actually think this partially matters. The trickiness is that on the one hand Twitter can be important, but on the other I agree that people here overrate it a lot.
My main disagreement tends to be that I don't think OpenAI actually matters too much in the capabilities race, and I think that social stuff matters more than John Wentworth thinks. Also, given my optimistic world-model on alignment, corporate drama like this mostly doesn't matter.
One final thought: I feel like the AGI clauses in OpenAI's charter were extremely terrible, because AGI is very ill-defined, and in a corporate or court setting that is a very bad basis to build upon. They need to use objective, verifiable metrics if they want to deal with dangerous AI. More generally, I kind of hate the AGI concept, for lots of reasons.
I'd say my major takeaways, assuming this research scales (it was only done on GPT-2, and we already knew it couldn't generalize), are:
Gary Marcus was right that LLMs mostly don't reason outside the training distribution, and this updates me more towards "LLMs probably aren't going to be godlike, or nearly as impactful as LW says they will be."
Be more skeptical of AI progress leading to big things; in general, unless reality can simply be memorized, scaling probably won't be enough to automate the economy. More generally, this updates me towards longer timelines, and longer tails on those timelines.
Be slightly more pessimistic on AI safety, since LLMs have a bunch of nice properties and future AI will probably have fewer of them, though my alignment optimism mostly doesn't depend on LLMs.
AI governance gets a lucky break, since it only has to regulate misuse, and even though that threat model isn't likely to be realized, it's still nice that we don't have to deal with the disruptive effects of AI now.
The question becomes: do you expect "If you change random bits and try to run it, it mostly just breaks" to hold up?
My suspicion is that the answer is likely no, and this is actually a partial crux on why I'm less doomy than others on AI risk, especially from misalignment.
My general expectation is that most of the difficulty is hardware plus ethics. In particular, the hardware for running a human brain just does not exist right now, primarily because of the memory/Von Neumann bottleneck on GPUs; at the current state of affairs, you'd have to delete a lot of a human brain's memory to make it fit.
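To make the memory gap concrete, here's a rough back-of-the-envelope sketch. The synapse count, bytes-per-synapse, and GPU figures are my own order-of-magnitude assumptions for illustration, not numbers from the original discussion:

```python
# Back-of-the-envelope for the memory bottleneck (all figures are rough,
# order-of-magnitude assumptions, not claims from the parent comment).

SYNAPSES = 1e14                   # commonly cited estimate for the human brain
BYTES_PER_SYNAPSE = 1             # very optimistic: one byte of state per synapse
GPU_HBM_BYTES = 80e9              # ~80 GB of on-package memory on a current high-end GPU
GPU_BANDWIDTH_BYTES_PER_S = 3e12  # ~3 TB/s of memory bandwidth

brain_state_bytes = SYNAPSES * BYTES_PER_SYNAPSE
print(f"Brain state: ~{brain_state_bytes / 1e12:.0f} TB")                        # ~100 TB
print(f"GPUs needed just to hold it: ~{brain_state_bytes / GPU_HBM_BYTES:.0f}")  # ~1250
print(f"Seconds to stream it once through one GPU: "
      f"~{brain_state_bytes / GPU_BANDWIDTH_BYTES_PER_S:.0f}")                   # ~33
```

Even at a single byte per synapse, the state doesn't come close to fitting in one GPU's memory, and just streaming it once through the memory system takes tens of seconds, which is the Von Neumann-style bottleneck in miniature.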
It seems like Occam's Razor just logically follows from the basic premises of probability theory.
This turns out to be the case for countable probability spaces, like the space of Turing machines, as Qiaochu Yuan points out in this comment:
https://www.lesswrong.com/posts/fpRN5bg5asJDZTaCj/against-occam-s-razor?commentId=xFKD5hZZ68QXttHqq
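A minimal sketch of the argument as I understand it (my reconstruction, not a quote from the linked comment): over a countable hypothesis space, any coherent prior is forced to shrink along the enumeration, which is an Occam-like bias whenever the enumeration orders hypotheses roughly by complexity.

```latex
% Countable hypothesis space H = {h_1, h_2, ...}, e.g. Turing machines ordered by description length.
\sum_{i=1}^{\infty} P(h_i) = 1
\;\Longrightarrow\;
\forall \varepsilon > 0:\ \bigl|\{\, i : P(h_i) \ge \varepsilon \,\}\bigr| \le \tfrac{1}{\varepsilon}
\;\Longrightarrow\;
P(h_i) \to 0 \text{ as } i \to \infty .
```

So all but finitely many hypotheses must receive less probability than any fixed simple hypothesis you care to name; a prior that ignores complexity entirely (e.g. a uniform prior over all Turing machines) can't even be normalized.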
I know I reacted to this comment, but I want to emphasize that this:
Well, if someone originally started worrying based on strident predictions of sophisticated internal reasoning with goals independent of external behavior,
is, to first order, arguably the entire AI risk argument. That is, if we assume that external behavior gives strong evidence about internal structure, then there is no reason to elevate the AI risk argument at all, given the probably-aligned behavior of GPTs trained with RLHF.
More generally, the stronger the connection between external behavior and internal goals, the less worried you should be about AI safety. This is a partial disagreement with people who are more pessimistic, though I have other disagreements there.
In one sense, I no longer endorse the previous comment. In another sense, I sort of endorse the previous comment.
I was basically wrong that alignment requires human values to be game-theoretically optimal. I now think cooperation is doable without relying on game-theory tools like enforcement, because the situation with AI alignment is very different from the situation with human alignment: we have access to the AI's brain and can directly reward good things and penalize bad things, and we have a very powerful optimizer, SGD, that lets us straightforwardly select over minds and directly edit the AI's brain. We don't have those tools for aligning humans, partly for ethical reasons and partly for technological ones.
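As a concrete illustration of what "directly reward good things with SGD" means mechanically, here is a minimal sketch. The toy policy, the choice of which action counts as "good", and the REINFORCE-style loss are all my own illustrative assumptions, not a description of how any lab actually trains its models:

```python
import torch

torch.manual_seed(0)
policy = torch.nn.Linear(4, 2)                     # stand-in for "the AI's brain"
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)

obs = torch.randn(8, 4)                            # a batch of toy observations
dist = torch.distributions.Categorical(logits=policy(obs))
actions = dist.sample()

# Hypothetical reward signal: +1 for the behavior we like (action 0 here), -1 otherwise.
reward = (actions == 0).float() * 2 - 1

# Push up the log-probability of rewarded actions and push down penalized ones,
# letting SGD edit the weights directly.
loss = -(reward * dist.log_prob(actions)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

The point of the sketch is just that the reward signal reaches the optimizer directly, which is something we have no analogue of when trying to align humans.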
I also think my human-animal alignment analogy is actually almost as bad as the human-evolution analogy, which is worthless. The better analogy for predicting the likelihood of AI alignment is the prefrontal cortex's alignment to survival values, or to innate reward, which is very impressive alignment.
However, even assuming aligned AIs, I think democracy is likely to decay pretty quickly under AI, especially because of the likely power imbalances, and especially hundreds of years into the future. We will likely retain some freedoms under aligned AI rule, but I expect far fewer than we're used to today, and I expect it to transition into a form of benevolent totalitarianism.
I think this is a potential scenario, and if we remove existential risk from the equation, a somewhat probable one: we basically solve alignment, and yet AI governance craps out in various ways.
I think this way primarily because I tend to think that value alignment is really easy, much easier than LWers generally think, and I think this because most of the complexity of value learning is offloadable to the general learning process, with only very weak priors being required.
Putting it another way, I basically disagree with the implicit premise on LW that becoming capable of learning is easier than becoming aligned to values; at most, alignment is comparably difficult, or a little more difficult.
More generally, I think it's way easier to be aligned with say, not killing humans, than to actually have non-trivial capabilities, at least for a given level of compute, especially at the lower end of compute.
In essence, I believe there are simple tricks to aligning AIs, while I see no reason to expect a simple trick to make governments competent at regulating AI.
This is such a good comment, and quite a lot of this will probably end up in my new post, especially the sections about solving the misgeneralization problem in practice, as well as solutions to a lot of misalignment problems in general.
I especially like it because I can actually crib parts of this comment to show other people how misalignment in AI gets solved in practice, and to point out that misalignment is, in fact, a solvable problem in current AI.
My biggest counterargument to the case that AI progress should be slowed down comes from an observation porby made about a property we theorize AI systems have but which they fundamentally lack, the one foundational assumption behind AI risk:
Instrumental convergence, and its corollaries like power-seeking.
The important point is that current AI systems, and most plausible future ones, don't have incentives to learn instrumental goals. The type of AI that has enough room and few enough constraints to learn instrumental goals, like RL with sufficiently unconstrained action spaces, is essentially useless for capabilities today, and the strongest RL agents use non-instrumental world models.
Thus, instrumental convergence for AI systems is fundamentally wrong. Given that it is the foundational assumption behind the claim that superhuman AI systems pose risks we couldn't handle, a lot of other arguments (for slowing down AI, for why the alignment problem is hard, and much of the rest of the AI governance and technical safety discussion, especially on LW) become unsound, because they're reasoning from an uncertain foundation, and at worst they're reasoning from a false premise to many false conclusions, like the argument that we should reduce AI progress.
Fundamentally, instrumental convergence being wrong would demand pretty vast changes to how we approach the AI topic, from alignment to safety and much more.
To be clear, the fact that the only flaw I could find in the AI risk arguments is that they were founded on false premises is actually better than many other failure modes, because it at least shows fundamentally strong, locally valid reasoning on LW, rather than motivated reasoning or other biases that transform true statements into false ones.
One particular consequence of this insight is that OpenAI and Anthropic were fundamentally right in their AI alignment plans, because they have managed to avoid incentivizing instrumental convergence; in particular, LLMs can be extremely capable without being arbitrarily capable or having instrumental world models, given resources.
I learned about the observation from this post below:
https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty
Porby talks about why AI isn't incentivized to learn instrumental goals, but given how much this assumption gets used in AI discourse, sometimes implicitly, I think it's of great importance that instrumental convergence is likely wrong.
I have other disagreements, but this is my deepest disagreement with your model (and with other models on which AI is especially dangerous).
EDIT: A new post on instrumental convergence came out, and it showed that many of the inferences made were not just unsound but invalid; in particular, Nick Bostrom's Superintelligence was wildly invalid in applying instrumental convergence to reach strong conclusions on AI risk.