My understanding of Shard Theory is that what you said is true, except that shards sometimes make bids "directly" for outputs (particularly when they are more "reflexive," e.g. the "lick lollipop" shard is activated when you see a lollipop), and sometimes make bids for control of a local optimization module, which then implements the output that scores best according to the various competing shards. You could also imagine shards which do a combination of both behaviors. TurnTrout can correct me if I'm wrong.
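To make the two bidding modes concrete, here is a toy sketch in Python. Everything in it is my own assumption for illustration, not something taken from Shard Theory itself: the `Shard` class, the `choose_action` function, the numeric activations, and especially the rule that a direct bid wins when it is at least as strong as the summed activation of the shards bidding through the optimizer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Shard:
    """Hypothetical toy shard; all fields are illustrative assumptions."""
    name: str
    # How strongly the shard is activated by the current context (0 = inactive).
    activation: Callable[[str], float]
    # "Reflexive" mode: the shard bids directly for this one output.
    direct_action: Optional[str] = None
    # "Optimizer" mode: the shard scores candidate outputs for the optimizer.
    score: Optional[Callable[[str], float]] = None


def choose_action(context: str, shards: List[Shard],
                  candidates: List[str]) -> Optional[str]:
    direct_bids: Dict[str, float] = {}
    deliberative: List[Tuple[float, Callable[[str], float]]] = []
    for shard in shards:
        strength = shard.activation(context)
        if strength <= 0:
            continue  # inactive shards make no bids
        if shard.direct_action is not None:
            # Direct bid for a specific output.
            previous = direct_bids.get(shard.direct_action, 0.0)
            direct_bids[shard.direct_action] = previous + strength
        if shard.score is not None:
            # Bid for control of the local optimization module.
            deliberative.append((strength, shard.score))

    # Assumed arbitration rule: strongest direct bid vs. total optimizer bid.
    optimizer_bid = sum(strength for strength, _ in deliberative)
    best_direct = max(direct_bids, key=direct_bids.get, default=None)
    if best_direct is not None and direct_bids[best_direct] >= optimizer_bid:
        return best_direct
    if not deliberative or not candidates:
        return best_direct

    # The optimizer picks the candidate that scores best across the
    # active shards, weighted by how strongly each one is activated.
    return max(candidates,
               key=lambda action: sum(s * f(action) for s, f in deliberative))


# Illustrative shards: one reflexive, two that bid via the optimizer.
lollipop = Shard("lick lollipop",
                 activation=lambda ctx: 1.0 if "lollipop" in ctx else 0.0,
                 direct_action="lick lollipop")
health = Shard("health",
               activation=lambda ctx: 0.8 if "dinner" in ctx else 0.3,
               score=lambda a: {"cook vegetables": 1.0, "order pizza": 0.2}.get(a, 0.0))
frugality = Shard("frugality",
                  activation=lambda ctx: 0.5 if "dinner" in ctx else 0.0,
                  score=lambda a: {"cook vegetables": 0.8, "order pizza": 0.1}.get(a, 0.0))

SHARDS = [lollipop, health, frugality]
DINNER_OPTIONS = ["cook vegetables", "order pizza"]
```

With these made-up numbers, `choose_action("lollipop in view", SHARDS, DINNER_OPTIONS)` goes to the reflexive shard's direct bid, while `choose_action("planning dinner", SHARDS, DINNER_OPTIONS)` is settled by the optimizer. A shard that sets both `direct_action` and `score` would be the "combination of both behaviors" case.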
My takeoff speeds are on the somewhat faster end, probably ~a year or two from “we basically don’t have crazy systems” to “AI (or whoever controls AI) controls the world.”
I think I'm like >95th percentile on verbal-ness of thoughts. I feel like almost all of my thoughts that aren't about extremely concrete things in front of me or certain abstract systems that are best thought of visually are verbal, and even in those cases I sometimes think verbally. Almost all of the time, at least some words are going through my head, even if it's just random noise or song lyrics or something like that. I struggle to imagine what it would be like to not think this way, as I feel like many propositions can't be easily represented in an image. For example, if I think of an image of a dog in my home, this could correspond to the proposition "there is a dog in my home" or "I wish there were a dog in my home" or "I wish there weren't a dog in my home" or "This is the kind of dog that I would have if I had a dog."
Hmm... I guess I'm skeptical that we can train very specialized "planning" systems? Making superhuman plans of the sort that could counter those of an agentic superintelligence seems like it requires both a very accurate and domain-general model of the world and a search algorithm to figure out which plans actually accomplish a given goal under that model. This seems extremely close in design space to a more general agent. While I think we could have narrow systems which outperform the misaligned superintelligence in other domains such as coding or social manipulation, general long-term planning seems likely to me to be the most important skill involved in taking over the world or countering an attempt to do so.
This makes sense, but it seems to be a fundamental difficulty of the alignment problem itself as opposed to the ability of any particular system to solve it. If the language model is superintelligent and knows everything we know, I would expect it to be able to evaluate its own alignment research as well as, if not better than, we can. The problem is that it can't get any feedback from empirical reality about whether its ideas actually work, given the issues with testing alignment problems, not that it can't get feedback from another intelligent grader/assessor reasoning in a ~a priori way.
I think this is a very good critique of OpenAI's plan. However, to steelman the plan, I think you could argue that advanced language models will be sufficiently "generally intelligent" that they won't need very specialized feedback in order to produce high-quality alignment research. As e.g. Nate Soares has pointed out repeatedly, the case of humans suggests that in some cases, a system's capabilities can generalize way past the kinds of problems that it was explicitly trained to do. If we assume that sufficiently powerful language models will therefore have, in some sense, the capabilities to do alignment research, the question then becomes how easy it will be for us to elicit these capabilities from the model. The success of RLHF at eliciting capabilities from models suggests that by default, language models do not output their "beliefs", even if they are generally intelligent enough to in some way "know" the correct answer. However, addressing this issue involves solving a different and I think probably easier problem (ELK/creating language models which are honest), rather than the problem of how to provide good feedback in domains where we are not very capable.
I agree with most of these claims. However, I disagree about the level of intelligence required to take over the world, which makes me overall much more scared of AI/doomy than it seems like you are. I think there is at least a 20% chance that a superintelligence with +12 SD capabilities across all relevant domains (esp. planning and social manipulation) could take over the world.
I think human history provides mixed evidence for the ability of such agents to take over the world. While almost every human in history has failed to accumulate massive amounts of power, relatively few have tried. Moreover, when people have succeeded at quickly accumulating lots of power/taking over societies, they often did so with surprisingly small strategic advantages. See e.g. this post; I think that an AI that was +12 SD at both planning/general intelligence and social manipulation could, like the conquistadors, achieve a decisive strategic advantage without having to have some kind of crazy OP military technology/direct force advantage. Consider also Hitler's rise to power and the French Revolution as cases where one actor/a small group of actors was able to take over a country surprisingly rapidly.
While these examples provide some evidence in favor of it being easier than expected to take over the world, overall, I would not be too scared of a +12 SD human taking over the world. However, I think that the AI would have some major advantages over an equivalently capable human. Most importantly, the AI could download itself onto other computers. This seems like a massive advantage, allowing the AI to do basically everything much faster and more effectively. While individually extremely capable humans would probably greatly struggle to achieve a decisive strategic advantage, large groups of extremely intelligent, motivated, and competent humans seem obviously much scarier. Moreover, as compared to an equivalently sized group of equivalently capable humans, a group of AIs sharing their source code would be able to coordinate among themselves far better, making them even more capable than the humans.
Finally, it is much easier for AIs to self-modify/self-improve than it is for humans to do so. While I am skeptical of foom for the same reasons you are, I suspect that over a period of years, a group of AIs could accumulate enough financial and other resources that they could translate these resources into significant cognitive improvements, if only by acquiring more compute.
While the AI has the disadvantage relative to an equivalently capable human of not immediately having access to a direct way to affect the "external" world, I think this is much less important than the AI's advantages in self-replication, coordination, and self-improvement.
You write that even if the mechanistic model is wrong, if it “has some plausible relationship to reality, the predictions that it makes can still be quite accurate.” I think that this is often true, and true in particular in the case at hand (explicit search vs not). However, I think there are many domains where this is false, where there is a large range of mechanistic models which are plausible but make very false predictions. This depends roughly on how much the details of the prediction vary depending on the details of the mechanistic model. In the explicit search case, it seems like many other plausible models for how RL agents might mechanistically function imply agent-ish behavior, even if the model is not primarily using explicit search. However, this is because the agent must accomplish the training objective, which heavily constrains the space of possible behaviors. In questions where the prediction space is less constrained to begin with (e.g. questions about how the far future will go), different “mechanistic” explanations (for example, thinking that the far future will be controlled by a human superintelligence vs an alien superintelligence vs evolutionary dynamics) imply significantly different predictions.
I think the NAH does a lot of work for interpretability of an AI's beliefs about things that aren't values, but I'm pretty skeptical about the "human values" natural abstraction. I think the points made in this post are good, and relatedly, I don't want the AI to be aligned to "human values"; I want it to be aligned to my values. I think there’s a pretty big gap between my values and those of the average human even subjected to something like CEV, and that this is probably true for other LW/EA types as well. Human values as they exist in nature contain fundamental terms for the in-group, disgust-based values, etc.
Human bureaucracies are mostly misaligned because the actual bureaucratic actors are also misaligned. I think a “bureaucracy” of perfectly aligned humans (like EA but better) would be well aligned. RLHF is obviously not a solution in the limit, but I don’t think it’s extremely implausible that it is outer aligned enough to work, though I am much more enthusiastic about IDA.