In my "goals having power over other goals" ontology, the instrumental/terminal distinction separates goals into two binary classes, such that goals in the "instrumental" class only have power insofar as they're endorsed by a goal in the "terminal" class.
By contrast, when I say that "instrumental strategies become crystallized", what I mean is that goals which start off instrumental will gradually accumulate power in their own right: they're "sticky".
Yes.
I'm in the middle of writing an essay which discusses the shareholder value revolution (amongst many other examples) as the process of making a conceptual mistake.
Schemers (as defined here) seek a consequence of being selected. Therefore, they pursue influence on behavior in deployment instrumentally.
This definition? If so, it seems vastly underspecified to be fit for scientific inquiry. For one thing, the definition of "selection" is pretty vague—I do not know how to assign "the degree to which [one cognitive pattern] is counterfactually responsible" for something even in principle. It also doesn't even try to set a threshold for what counts as a non-schemer—e.g. does it need to care literally 0% about the consequences of being selected? If so, approximately everything is a schemer, including all humans. (It also assumes the instrumental/terminal goal distinction, which I think is potentially confused, but that's a more involved discussion.)
To be clear, my complaint is not that people are using vague definitions. My complaint is that the vague definitions are becoming far more load-bearing than they deserve. If people had tried to pin down more carefully what "schemer" means they would have been forced to develop a more nuanced understanding of what we even mean by "alignment" and "goals" and so on, which is the kind of thinking I want to see more of.
That's incorrect, because it's also possible for an AI to falsely confess to scheming. This also happens to humans, e.g. if you keep asking criminal suspects loaded questions. And so there may never actually be a phase transition, because a "schemer" that tells the truth 1% of the time may not be distinguishable from an honest AI that falsely confesses 1% of the time.
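To put rough numbers on that last point (these are made-up rates, purely for illustration): suppose a given line of questioning elicits a true confession from such a schemer 1% of the time, and a false confession from an honest model 1% of the time. Then the likelihood ratio is

$$\frac{P(\text{confession} \mid \text{schemer})}{P(\text{confession} \mid \text{honest})} = \frac{0.01}{0.01} = 1,$$

so observing a confession gives you no evidence either way, and repeating the questioning doesn't help so long as the two rates stay equal.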
The concept of "schemers" seems to be gradually becoming increasingly load-bearing in the AI safety community. However, I don't think it's ever been particularly well-defined, and I suspect that taking this concept for granted is inhibiting our ability to think clearly about what's actually going on inside AIs (in a similar way to e.g. how the badly-defined concept of alignment faking obscured the interesting empirical results from the alignment faking paper).
In my mind, the spectrum from "almost entirely honest, but occasionally flinching away from aspects of your motivations you're uncomfortable with" to "regularly explicitly thinking about how you're going to fool humans in order to take over the world" is a pretty continuous one. Yet generally people treat "schemer" as a fairly binary classification.
To be clear, I'm not confident that even "a spectrum of scheminess" is a good way to think about the concept. There are likely multiple important dimensions that could be disentangled; and eventually I'd like to discover properly scientific theories of concepts like honesty, deception and perhaps even "scheming". Our current lack of such theories shouldn't be a barrier to using those terms at all, but it suggests they should be used with a level of caution that I rarely see.
Nice post. One thing I'd add is Sahil's description here of mathematics (and other "clean" concepts) as "achiev[ing] the scaling and transport of insight (which is the business of generalization) by isolation and exclusion". But, he argues, there are other ways to scale and transport insight. I think of emotional work and meditative practices as showcasing these other ways: they don't rely on theories or theorems, and often there aren't even canonical statements of their core insights. Instead, insights too complex to formalize (yet) are transmitted person-to-person.
Where I disagree with Sahil is that I suspect other ways of scaling and transporting insight are much more vulnerable to adversarial attacks (because e.g. there's no central statement which can be criticized). So in some sense the "point of the math" is that it means you need to rely less on the honesty and integrity of the people you're learning from.
Yes, ty. Though actually I've also clarified that both world-models and goal-models predict both observations and actions. In my mind it's mainly a difference in emphasis.
(I expect that Scott, Abram or some others have already pointed this out, but somehow this clicked for me only recently. Pointers to existing discussions appreciated.)
A Bayesian update can be seen as a special case of a prediction market resolution.
Specifically, a Bayesian update is the case where each "hypothesis" has bet all its wealth across some combination of outcomes, and then the pot is winner-takes-all (or split proportionally when there are multiple winners).
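To spell out the correspondence (in my own notation, so treat this as an illustrative sketch rather than a canonical statement): give each hypothesis $h$ wealth equal to its prior $P(h)$, and have it stake $P(h)\,P(o \mid h)$ on each outcome $o$. When outcome $o^*$ resolves and the whole pot is split among the bets on $o^*$ in proportion to their size, hypothesis $h$'s new wealth is

$$\frac{P(h)\,P(o^* \mid h)}{\sum_{h'} P(h')\,P(o^* \mid h')} = P(h \mid o^*),$$

which is exactly the Bayesian posterior.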
The problem with Bayesianism is then obvious: what happens when there are no winners? Your epistemology is "bankrupt", the money vanishes into the ether, and bets on future propositions are undefined.
So why would a hypothesis go all-in like that? Well, that's actually the correct "cooperative" strategy in a setting where you're certain that at least one of the hypotheses is exactly correct.
To generalize Bayesianism, we want to instead talk about what the right "cooperative" strategy is when a) you don't think any of the hypotheses are exactly correct, and b) each hypothesis has goals too, not just beliefs.
A response to someone asking about my criticisms of EA (crossposted from twitter):
EA started off with global health and ended up pivoting hard to AI safety, AI governance, etc. You can think of this as “we started with one cause area and we found another using the same core assumptions” but another way to think about it is “the worldview which generated ‘work on global health’ was wrong about some crucial things”, and the ideology hasn’t been adequately refactored to take those things out.
Some of them include:
You can partially learn these lessons from within the EA framework but it’s very unnatural and you won’t learn them well enough. E.g. now EAs are pivoting to politics but again they’re flinching away from anything remotely controversial and so are basically just propping up existing elites.
On a deeper ideological level a lot of this is downstream of utilitarianism/consequentialism being wrong. Again hard to compress but a few quick points:
A lot of utilitarians will say “whether or not our strategy is bad, consequences are still the only thing that ultimately matter”. But this is like a Marxist saying to an economist “whether or not our strategy is bad, liberating the workers is the only thing that ultimately matters” and then using that as an excuse to not learn economics. There *are* deep principles of (internal and external) cooperation, but utilitarianism is very effective in making people look away from them and towards power-seeking strategies.
A second tweet, in response to someone disagreeing with the Marxism analogy because many utilitarians follow principles too:
I think “maximize expected utility while obeying some constraints” looks very different from actually taking non-consequentialist decision procedures seriously.
In principle the utility-maximizing decision procedure might not involve thinking about “impact” at all.
And this is not even an insane hypothetical; IMO, thinking about impact is pretty corrosive to one’s ability to do excellent research, for example.
It’s hard to engage too deeply here because I think the notion of a “utility-maximizing decision procedure” is very underdefined on an individual level, and most of the action is in correlating one’s decisions with others. But my meta-level point is that it’s these kinds of complexities which utilitarians tend to brush under the rug by focusing their intellectual energy on criteria of rightness and then adding on some principles almost as an afterthought.