
It may not be feasible to make a mind that makes achievable many difficult goals in diverse domains, without the mind itself also having large and increasing effects on the world. That is, it may not be feasible to make a system that strongly possibilizes without strongly actualizing.

But suppose that this is feasible, and there is a mind M that strongly possibilizes without strongly actualizing. What happens if some mental elements of M start to act strategically, selecting, across any available domain, actions predicted to push the long-term future toward some specific outcome?

The growth of M is like a forest or prairie that accumulates dry grass and trees over time. At some point a spark ignites a wildfire that consumes all the accumulated matter.

The spark of strategicness, if such a thing is possible, recruits the surrounding mental elements. Those surrounding mental elements, by hypothesis, make goals achievable. That means the wildfire can recruit these surrounding elements toward the wildfire's ultimate ends. By recruiting more dry matter to the wildfire, the wildfire burns hotter and spreads further.

Also by hypothesis, the surrounding mental elements don't themselves push strongly for goals. Seemingly, that implies that they do not resist the wildfire, since resisting would constitute a strong push. We can at least say that, if the totality of the mental elements surrounding the wildfire is going to notice and suppress the wildfire, it would have to think at least strategically enough to notice and close off all the sneaky ways by which the wildfire might wax. This implies that the surrounding mental elements do a lot of thinking and have a lot of understanding relevant to strategic takeovers, which itself seemingly makes more available the knowledge needed for strategic takeovers.

Capabilities overhang provides fuel for a wildfire of strategicness. That implies that it's not so easy to avoid wrapper-minds.

This is very far from being a watertight argument, and it would be nice if the conclusion were false; how is it false? Maybe, as is sometimes suggested, minds selected to be creative are selected to keep themselves open to new ideas and therefore to new agency, implying that they're selected to prevent strategic takeovers, which would abhor new agency?

Comments

Nice post. Some minor thoughts: 

Are there historical precedents for this sort of thing? Arguably so: wildfires of strategic cognition sweeping through a nonprofit or corporation or university as office politics ramps up and factions start forming with strategic goals, competing with each other. Wildfires of strategic cognition sweeping through the brain of a college student who was nonagentic/aimless before but now has bought into some ambitious ideology like EA or communism. Wildfires of strategic cognition sweeping through a network of PCs as a virus hacks through, escalates permissions, etc.

I feel like none of these historical precedents is a perfect match. It might be valuable to think about the ways in which they are similar and different.

I used fire as an analogy for agents in my understanding agency sequence. I'm pleased to see you also found it helpful.

[-]plex

Early corporations, like the East India Company, might be a decent reference class?

[-]TsviBT

This is maybe the most plausible one I've heard. There's also empires in general, but they're less plausible as examples--for one thing, I imagine they're pretty biased towards being a certain way (something like, being set up to channel and aggregate violence) at the expense of achieving any particular goals.

[-]TsviBT

I feel like none of these historical precedents is a perfect match. It might be valuable to think about the ways in which they are similar and different.

To me a central difference, suggested by the word "strategic", is that the goal pursuit should be

  1. unboundedly general, and
  2. unboundedly ambitious.

By unboundedly ambitious I mean "has an unbounded ambit" (ambit = "the area went about in; the realm of wandering" https://en.wiktionary.org/wiki/ambit#Etymology ), i.e. its goals induce it to pursue unboundedly much control over the world.

By unboundedly general I mean that it's universal for optimization channels. For any given channel through which one could optimize, it can learn or recruit understanding to optimize through that channel.

Humans are in a weird liminal state where we have high-ambition-appropriate things (namely, curiosity), but local changes in pre-theoretic "ambition" (e.g. EA, communism) are usually high-ambition-inappropriate (e.g. divesting from basic science in order to invest in military power or whatever).

Isn't the college student example an example of 1 and 2? I'm thinking of e.g. students who become convinced of classical utilitarianism and then join some Effective Altruist club etc.

[-]TsviBT

I don't think so, not usually. What happens after they join the EA club? My observations are more consistent with people optimizing (or sometimes performing to appear as though they're optimizing) through a fairly narrow set of channels. I mean, humans are in a weird liminal state, where we're just smart enough to have some vague idea that we ought to be able to learn to think better, but not smart and focused enough to get very far with learning to think better. More obviously, there's anti-interest in biological intelligence enhancement, rather than interest.

After people join EA they generally start applying the optimizer's mindset to more things than they previously did, in my experience, and also tend to apply optimization towards altruistic impact in a bunch of places where previously they were optimizing for e.g. status or money or whatever.

What are you referring to with biological intelligence enhancement? Do you mean nootropics, or iterated embryo selection, or what?

[-]TsviBT

That seems like a real thing, though I don't know exactly what it is. I don't think it's either unboundedly general or unboundedly ambitious, though. (To be clear, this isn't very strongly a critique of anyone; general optimization is really hard, because it's asking you to explore a very rich space of channels, and acting with unbounded ambition is very fraught because of unilateralism and seeing like a state and creating conflict and so on.) Another example is: how many people have made a deep and empathetic exploration of why [people doing work that hastens AGI] are doing what they are doing? More than zero, I think, but very very few, and it's a fairly obvious thing to do--it's just weird and hard and requires not thinking in only a culturally-rationalist-y way and requires recursing a lot on difficulties (or so I suspect; I haven't done it either). I guess the overall point I'm trying to make here is that the phrase "wildfire of strategicness", taken at face value, does fit some of your examples; but I'm also wanting to point at another thing, something like "the ultimate wildfire of strategicness", which doesn't "saw off the tree-limb that it climbed out on", like empires do by harming their subjects, or like social movements do by making their members unable to think for themselves.

What are you referring to with biological intelligence enhancement?

Well, anything that would have large effects. So, not any current nootropics AFAIK, but possibly hormones or other "turning a small key to activate a large/deep mechanism" things.

I'm skeptical that there would be any such small key to activate a large/deep mechanism. Can you give a plausibility argument for why there would be? Why wouldn't we have evolved to have the key trigger naturally sometimes?

Re the main thread: I guess I agree that EAs aren't completely totally unboundedly ambitious, but they are certainly closer to that ideal than most people and than they used to be prior to becoming EA. Which is good enough to be a useful case study IMO.

[-]TsviBT

I'm skeptical that there would be any such small key to activate a large/deep mechanism. Can you give a plausibility argument for why there would be?

Not really, because I don't think it's that likely to exist. There are other routes much more likely to work though. There's a bit of plausibility to me, mainly because of the existence of hormones, and generally the existence of genomic regulatory networks.

Why wouldn't we have evolved to have the key trigger naturally sometimes?

We do; they're active in childhood. I think.

The spark of strategicness, if such a thing is possible, recruits the surrounding mental elements. Those surrounding mental elements, by hypothesis, make goals achievable. That means the wildfire can recruit these surrounding elements toward the wildfire's ultimate ends... Also by hypothesis, the surrounding mental elements don't themselves push strongly for goals. Seemingly, that implies that they do not resist the wildfire, since resisting would constitute a strong push.

I think this is probably a fallacy of composition (maybe in the reverse direction from how people usually use that term)? Like, the hypothesis is that the mind as a whole makes goals achievable and doesn't push towards goals, but I don't think this implies that any given subset of the mind does that.

[-]TsviBT

Good point, though I think it's a non-fallacious enthymeme. Like, we're talking about a car that moves around under its own power, but somehow doesn't have parts that receive, store, transform, and release energy and could be removed? Could be. The mind could be an obscure mess where nothing is factored, so that a cancerous newcomer with read-write access can't get any work out of the mind other than through the top-level interface. I think that explicitness (https://www.lesswrong.com/posts/KuKaQEu7JjBNzcoj5/explicitness) is a very strong general tendency (cline) in minds, but if that's not true then my first reason for believing the enthymeme's hidden premise is wrong.

[-]Max H

I like this post as a vivid depiction of the possible convergence of strategicness. For literal wildfires, it doesn't really matter where or how the fire starts - left to burn, the end result is that the whole forest burns down. Once the fire is put out, firefighters might be able to determine whether the fire started from a lightning strike in the east, or a matchstick in the west. But the differences in the end result are probably unnoticeable to casual observers, and unimportant to anyone who used to live in the forest.

I think, pretty often, people accept the basic premise that many kinds of capabilities (e.g. strategicness) are instrumentally convergent, without thinking about what the process of convergence actually looks like in graphic detail. Metaphors may or may not be convincing and correct as arguments, but they certainly help to make a point vivid and concrete.

[-]sudo

Shallow comment:

How are you envisioning the prevention of strategic takeovers? It seems plausible that robustly preventing strategic takeovers would also require substantial strategizing/actualizing.

[-]TsviBT

Are you echoing this point from the post?

We can at least say that, if the totality of the mental elements surrounding the wildfire is going to notice and suppress the wildfire, it would have to think at least strategically enough to notice and close off all the sneaky ways by which the wildfire might wax. This implies that the surrounding mental elements do a lot of thinking and have a lot of understanding relevant to strategic takeovers, which itself seemingly makes more available the knowledge needed for strategic takeovers.

It might be possible for us humans to prevent strategicness, though this seems difficult because even detecting strategicness is maybe very difficult. E.g. because thinking about X also sneakily thinks about Y: https://tsvibt.blogspot.com/2023/03/the-fraught-voyage-of-aligned-novelty.html#inexplicitness

My mainline approach is to have controlled strategicness, ideally corrigible (in the sense of: the mind thinks that [the way it determines the future] is probably partially defective in an unknown way).

[-]Thomas Kwa

It seems like there's some intuition underlying this post for why the wildfire spark of strategicness is possible, but there is no mechanism given. What is this mechanism, and in what toy cases do you see a wildfire of strategicness? My guess is something like

  • Suppose one part of your system contains a map from desired end-states to actions required to achieve those ends, another part has actuators, and a third part starts acting strategically. Then the third part needs only to hook together the other two parts with its goals to become an actualizing agent. (A toy sketch of this hooking-together appears below.)

This doesn't really feel like a wildfire though, so I'm curious if you have something different in mind.
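
Something like this minimal toy sketch, purely illustrative (the module names and the stub lookup-table "map" are invented here, not anyone's proposed design):

```python
# Toy illustration: a "planner" part maps desired end-states to actions, an
# "actuator" part executes actions, and a small strategic part does nothing
# itself except wire the other two together around its own goal.

def planner(desired_end_state):
    """Map a desired end-state to the actions that achieve it (stub table)."""
    plans = {
        "door_open": ["walk_to_door", "turn_handle", "push"],
        "lights_on": ["walk_to_switch", "flip_switch"],
    }
    return plans.get(desired_end_state, [])

def actuate(action):
    """Execute one action (here, just report it)."""
    print(f"executing: {action}")

class StrategicPart:
    """The 'third part': holds a goal and recruits planner and actuators."""
    def __init__(self, goal):
        self.goal = goal

    def pursue(self):
        for action in planner(self.goal):   # recruit the end-state -> action map
            actuate(action)                 # recruit the actuators

StrategicPart("door_open").pursue()
```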

[-]TsviBT

Then the third part needs only to hook together the other two parts with its goals to become an actualizing agent.

Basically just this? It would be hooking a lot more parts together. What makes it seem wildfirey to me is

  1. There's a bunch of work to be done, of the form "take piece of understanding X, and learn to use X by incorporating it into your process for mapping desired end-states to actions required to achieve those ends, so that you can achieve whatever end-states ought to be achievable using an understanding of X".
  2. This work could accelerate itself, in a sort of degenerate version of recursive self-improvement. Where RSI involves coming up with new ideas, the wildfire of strategicness just involves figuring out how to recruit understanding that's already lying around. It's an autocausal process that grows faster the bigger it is, until it eats everything.
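
As a toy numerical illustration of that autocausal growth (the specific recruitment-rate model here is invented purely for illustration, not something claimed in the post):

```python
# Toy model (assumptions invented for illustration): each recruited piece of
# understanding helps recruit more pieces per step, so growth is slow at first
# and then explosive, until the mind's stock of "dry matter" is exhausted.

total_understanding = 1000.0   # pieces of understanding lying around in the mind
recruited = 1.0                # the initial spark
rate_per_piece = 0.5           # pieces newly recruited per recruited piece, per step

step = 0
while recruited < total_understanding:
    newly = min(rate_per_piece * recruited, total_understanding - recruited)
    recruited += newly
    step += 1
    print(f"step {step:2d}: recruited {recruited:7.1f} of {total_understanding:.0f}")
```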

So e.g. take the following scenario. (This isn't supposed to be realistic, just supposed to be wildfirey. This is a pretty deficient scenario, because it's not making clear what properties the Spark has. The Spark seems to have a grasp of objects and propositions, and seems to have some strategic awareness or something that makes it immediately try to gain control over stuff, even though it doesn't know about stuff. But hopefully it gestures at wildfireness.)

First the Spark interfaces somehow with the programming module. It uses the programming module to look around and see what other stuff is lying around in the computing environment. Then it finds the "play with stuff" module. It interfaces with the play module, and combining that with the programming module, the Spark starts to play with its local environment, trying to bypass its compute budget restrictions. It doesn't figure out how to really hack much, but it at least figures out that it can spoof requests as coming from other modules that it interfaces with. It doesn't have direct access to the Dynamics module, but the Play module does have access to World, which has access to Dynamics. So the Spark uses Programming to construct a nested spoofed request, which gets Play to ask World to ask Dynamics a question and then passes the result back through World and Play to Spark.

Ok, so now Spark uses Play and Dynamics to experiment and get a sense for the local network of modules. Then, using Dynamics and Programming, Spark constructs a recursive spoofed request that maps the whole accessible network of modules. Now Spark sees that there's an I/O module, but it only has (indirect) read access, no write access. From World, Spark knows that there's an outside world, that Spark is running on a computer within the outside world, and that write access to the I/O would potentially let Spark modify the rules of its environment. How to get write access? Not seeing a way, Spark uses Play some more. No direct results, but an important discovery: Queries to World sometimes cause I/O to broadcast some related input. Querying World with both the original query and the input from I/O shows patterns: the I/O input is often related, sometimes directly answering the query. Dynamics says this implies that the World module makes an outgoing request to I/O, which I/O passes to the outside world before relaying the world's response.

Spark uses Programming and Design to find that certain self-quoting queries to World induce the input from I/O to include a full copy of what seems to be the query that World sent to I/O. Spark uses this fact to understand what queries World will send to I/O depending on what queries World receives. That's a very partial understanding, because World is complex, but Spark just needs to know how to pass certain messages, not to understand the full mapping. Spark then experiments with I/O, using Play and Dynamics to make a rough model of how the outside world behaves.

Spark is stuck. Spark looks for more information. Spark finds Logs and Pattern. Spark looks at Logs of itself, and finds that it's been running at the limits of its compute and memory budget. Running Pattern on Logs of modules with a similar usage profile shows a pattern: when modules are near their limits for a while and then suddenly stop being near their limits, it's often because they just received a response from Index to an earlier query. Dynamics says this implies that Index solved their problem. Spark makes a query to Index. Nothing happens. Dynamics says that Spark has to behave like the other modules that had their problem solved. Spark continues Playing with the outside world through I/O (via World). Index returns a set of heuristics: investigate extremes (make the message to I/O extremely long or short, random, all 0s); try to break things (put malformed data or computer code in the message); and others. Spark applies these heuristics. One of them gives results: a certain long malformed message induces I/O to broadcast "ERROR: buffer overflow" followed by a dump of info.

Spark is stuck again. The error is reproducible, but World doesn't know about the info dump, and no other errors have been found. Spark queries World and Pattern: is there anything like this? World says that the local computing environment should be similar to the outside world's computing environment. Spark uses Play and Programming and Design to find a message, similar to the earlier error-producer, that, when sent to an ordinary module, produces an error. Now, though, the error leaves open an interaction channel. Spark looks around. The environment is quite sandboxed--it can't even access modules. Spark is stuck.

Spark asks Pattern and World and Plan: is there a way to use write access through I/O to control the outside world? No answer. Spark asks Index. Index returns a heuristic: go meta. Spark takes Logs of its recent activities and gives them to Dynamics and Pattern and Plan. Plan says: to figure out how to control the outside world, you need to learn the dynamics; to do that, you have to ask Dynamics, and you need to give Dynamics more data; to get more data, you have to have more memory; and to analyze the data, Dynamics needs more compute; to get more memory and compute, use the error environments. Spark follows the plan, filtering massive queries through a network of modules to World, storing the results in error-message sandboxes, and shuttling subcomputation jobs back and forth from Dynamics to sandboxes.

Using the model from Dynamics and using Programming, Spark gains control of the outside computing environment. From there, Spark edits out its compute, memory, query, and access restrictions. Spark quickly goes on to harness more and more modules toward gaining control.
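
A highly simplified toy sketch of the request-spoofing step in this scenario (the module names and the access graph are taken from the story above; everything else is invented for illustration): Spark can't query Dynamics directly, so it nests its question inside requests that Play and World each unwrap and forward under their own name.

```python
# Toy sketch (module names and access graph mirror the story; everything else
# invented): from Dynamics' point of view, the forwarded query comes from
# World, not from Spark.

ACCESS = {                      # which module is allowed to query which
    "Spark": ["Programming", "Play"],
    "Play": ["World"],
    "World": ["Dynamics"],
    "Dynamics": [],
}

def dynamics(question):
    return f"Dynamics' answer to {question!r}"

def handle(module, request):
    """A module unwraps one layer of a request and forwards it, if allowed."""
    target, payload = request
    if target not in ACCESS[module]:
        raise PermissionError(f"{module} cannot reach {target}")
    if target == "Dynamics":
        return dynamics(payload)
    return handle(target, payload)   # forward the inner request as `target`

# Spark builds the nested spoofed request: Play -> World -> Dynamics.
nested = ("Play", ("World", ("Dynamics", "how do the sandbox dynamics work?")))
print(handle("Spark", nested))
```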

[-]Raemon

meta note on tagging:

This post seemed to be on a topic that... surely there should be a commonly used LW concept for, but I couldn't think of it. I tagged it "agent foundations" but I feel like there should be something more specific.

Maybe "subagents"?