I'm in the middle of writing an essay which discusses the shareholder value revolution (amongst many other examples) as the process of making a conceptual mistake.
Schemers (as defined here) seek a consequence of being selected. Therefore, they pursue influence on behavior in deployment instrumentally.
This definition? If so, it seems far too underspecified to be fit for scientific inquiry. For one thing, the definition of "selection" is pretty vague—I do not know how to assign "the degree to which [one cognitive pattern] is counterfactually responsible" for something even in principle. It also doesn't even try to set a threshold for what counts as a non-schemer—e.g. does it need to care literally 0% about the consequences of being selected? If so, approximately everything is a schemer, including all humans. (It also assumes the instrumental/terminal goal distinction, which I think is potentially confused, but that's a more involved discussion.)
To be clear, my complaint is not that people are using vague definitions. My complaint is that the vague definitions are becoming far more load-bearing than they deserve. If people had tried to pin down more carefully what "schemer" means, they would have been forced to develop a more nuanced understanding of what we even mean by "alignment" and "goals" and so on, which is the kind of thinking I want to see more of.
That's incorrect, because it's also possible for an AI to falsely confess to scheming. This also happens to humans, e.g. if you keep asking criminal suspects loaded questions. And so there may never actually be a phase transition, because a "schemer" that tells the truth 1% of the time may not be distinguishable from an honest AI that falsely confesses 1% of the time.
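To spell out the arithmetic (taking the 1% figures above at face value, purely for illustration): the observation "confesses to scheming" then occurs at the same rate under both hypotheses, so a confession carries a likelihood ratio of

$$\frac{P(\text{confess} \mid \text{schemer})}{P(\text{confess} \mid \text{honest})} = \frac{0.01}{0.01} = 1,$$

i.e. it's zero evidence either way, and accumulating more such confessions never produces a sharp boundary between the two cases.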
The concept of "schemers" seems to be becoming increasingly load-bearing in the AI safety community. However, I don't think it's ever been particularly well-defined, and I suspect that taking this concept for granted is inhibiting our ability to think clearly about what's actually going on inside AIs (in a similar way to e.g. how the badly-defined concept of alignment faking obscured the interesting empirical results from the alignment faking paper).
In my mind, the spectrum from "almost entirely honest, but occasionally flinching away from aspects of your motivations you're uncomfortable with" to "regularly explicitly thinking about how you're going to fool humans in order to take over the world" is a pretty continuous one. Yet generally people treat "schemer" as a fairly binary classification.
To be clear, I'm not confident that even "a spectrum of scheminess" is a good way to think about the concept. There are likely multiple important dimensions that could be disentangled; and eventually I'd like to discover properly scientific theories of concepts like honesty, deception and perhaps even "scheming". Our current lack of such theories shouldn't be a barrier to using those terms at all, but it suggests they should be used with a level of caution that I rarely see.
Nice post. One thing I'd add is Sahil's description here of mathematics (and other "clean" concepts) as "achiev[ing] the scaling and transport of insight (which is the business of generalization) by isolation and exclusion". But, he argues, there are other ways to scale and transport insight. I think of emotional work and meditative practices as showcasing these other ways: they don't rely on theories or theorems, and often there aren't even canonical statements of their core insights. Instead, insights too complex to formalize (yet) are transmitted person-to-person.
Where I disagree with Sahil is that I suspect other ways of scaling and transporting insight are much more vulnerable to adversarial attacks (because e.g. there's no central statement which can be criticized). So in some sense the "point of the math" is that it means you need to rely less on the honesty and integrity of the people you're learning from.
Yes, ty. Though actually I've also clarified that both world-models and goal-models predict both observations and actions. In my mind it's mainly a difference in emphasis.
(I expect that Scott, Abram or some others have already pointed this out, but somehow this clicked for me only recently. Pointers to existing discussions appreciated.)
A Bayesian update can be seen as a special case of a prediction market resolution.
Specifically, a Bayesian update is the case where each "hypothesis" has bet all their wealth across some combination of outcomes, and then the pot is winner-takes-all (or split proportionally when there are multiple winners).
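As a concrete sketch (the hypotheses, priors, and likelihoods below are made up purely for illustration), here's a minimal simulation of that all-in, split-the-pot-proportionally resolution; the resulting wealth shares match the Bayesian posterior exactly:

```python
import numpy as np

# Illustrative numbers only: three hypotheses, two possible outcomes.
priors = np.array([0.5, 0.3, 0.2])      # initial wealth of each hypothesis
likelihoods = np.array([                 # likelihoods[i, o] = P(outcome o | hypothesis i)
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])
observed = 0                             # the outcome that actually happens

# Each hypothesis bets all of its wealth across outcomes, in proportion to its predictions.
bets = priors[:, None] * likelihoods     # bets[i, o] = wealth hypothesis i stakes on outcome o

# The whole pot goes to those who bet on the realized outcome, split in proportion to their stakes.
pot = priors.sum()
post_resolution_wealth = pot * bets[:, observed] / bets[:, observed].sum()

# Standard Bayesian update, for comparison.
posterior = priors * likelihoods[:, observed]
posterior /= posterior.sum()

print(post_resolution_wealth)  # identical to the posterior below
print(posterior)
```

Note that the division by the total stake on the realized outcome is exactly where the failure mode below shows up: if every hypothesis staked zero on what actually happened, there are no winners and the update is undefined.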
The problem with Bayesianism is then obvious: what happens when there are no winners? Your epistemology is "bankrupt", the money vanishes into the ether, and bets on future propositions are undefined.
So why would a hypothesis go all-in like that? Well, that's actually the correct "cooperative" strategy in a setting where you're certain that at least one of them is exactly correct.
To generalize Bayesianism, we want to instead talk about what the right "cooperative" strategy is when a) you don't think any of them are exactly correct, and b) when each hypothesis has goals too, not just beliefs.
Yeah, I do feel confused about the extent to which the solution to this problem is just "selectively become dumber" (e.g. as discussed by Habryka here). However, I have faith that there are a bunch of Pareto improvements to be made—for example, I think that less neuroticism helps you get less pwned without making you dumber in general. (Though as a counterpoint, maybe neuroticism was useful for helping people identify AI risk?) I'd like to figure out theories of virtue and emotional health good enough to allow us to robustly identify other such Pareto improvements.
A related thought that I had recently: fertility decline seems like a rough proxy for "how pwned are you getting by memes", and fertility is strongly anticorrelated with population-level intelligence. So you have East Asians getting hit hardest by the fertility crisis, then white populations, then South Asians, while African fertility is still very high. Obviously this is confounded by factors like development and urbanization, though, so it's hard to say if intelligence mediates the decline directly or primarily via creating wealth—but it does seem like e.g. East Asians are getting hit disproportionately hard. (Plausibly there's some way to figure this out more robustly by looking at subpopulations.)
Yepp, this is true. However, I believe that there are strategies for avoiding such memes other than "being smart". Two of these strategies broadly correspond to what we call "being virtuous" and "being emotionally healthy". See my exchange with Wei Dai here, and this sequence, for more.
Yes.