Tags: Subagents, Agent Foundations, AI
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

It may not be feasible to make a mind that makes many difficult goals in diverse domains achievable, without the mind also itself having large and increasing effects on the world. That is, it may not be feasible to make a system that strongly possibilizes without strongly actualizing.

But suppose that this is feasible, and there is a mind M that strongly possibilizes without strongly actualizing. What happens if some mental elements of M start to act strategically, selecting, across any available domain, actions predicted to push the long-term future toward some specific outcome?

The growth of M is like a forest or prairie that accumulates dry grass and trees over time. At some point a spark ignites a wildfire that consumes all the accumulated matter.

The spark of strategicness, if such a thing is possible, recruits the surrounding mental elements. Those surrounding mental elements, by hypothesis, make goals achievable. That means the wildfire can recruit these surrounding elements toward the wildfire's ultimate ends. By recruiting more dry matter to the wildfire, the wildfire burns hotter and spreads further.
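The feedback loop here can be made concrete with a toy model. This is purely illustrative (the numbers, the notion of an element's "capability" and "recruitment difficulty", and the threshold rule are all invented assumptions, not anything claimed in the post): a strategic spark starts with a small reach, recruits whatever mental elements are easy to recruit, and each recruited element's capability extends the spark's reach, letting it recruit harder targets.

```python
import random

def wildfire_sim(n_elements=1000, spark_reach=1.0, seed=0):
    """Toy model (illustrative only): a strategic 'spark' recruits
    non-strategic capability elements; each recruited element adds its
    capability to the spark's reach, letting it recruit harder targets."""
    rng = random.Random(seed)
    # Each element is (capability, recruitment_difficulty); both are
    # arbitrary assumed distributions, chosen only for illustration.
    elements = [(rng.uniform(0.1, 1.0), rng.uniform(0.0, 50.0))
                for _ in range(n_elements)]
    reach = spark_reach
    recruited = 0
    remaining = sorted(elements, key=lambda e: e[1])  # easiest first
    changed = True
    while changed:
        changed = False
        still_left = []
        for cap, difficulty in remaining:
            if difficulty <= reach:
                reach += cap  # recruited capability feeds the fire
                recruited += 1
                changed = True
            else:
                still_left.append((cap, difficulty))
        remaining = still_left
    return recruited, reach

recruited, reach = wildfire_sim()
print(recruited, round(reach, 1))
```

Under these assumptions the spark almost always consumes everything: the handful of easily recruited elements extend its reach enough to recruit the next tier, and so on until no fuel remains. The point of the sketch is only that, when surrounding elements supply capability without resisting, a small initial asymmetry compounds.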

Also by hypothesis, the surrounding mental elements don't themselves push strongly for goals. Seemingly, that implies that they do not resist the wildfire, since resisting would constitute a strong push. We can at least say that, if the totality of the mental elements surrounding the wildfire is going to notice and suppress the wildfire, it would have to think at least strategically enough to notice and close off all the sneaky ways by which the wildfire might wax. This implies that the surrounding mental elements do a lot of thinking and have a lot of understanding relevant to strategic takeovers, which itself seemingly makes more available the knowledge needed for strategic takeovers.

Capabilities overhang provides fuel for a wildfire of strategicness. That implies that it's not so easy to avoid wrapper-minds.

This is very far from being a watertight argument, and it would be nice if the conclusion were false; how is it false? Maybe, as is sometimes suggested, minds selected to be creative are selected to keep themselves open to new ideas and therefore to new agency, implying that they're selected to prevent strategic takeovers, which would abhor new agency?


10 comments

Nice post. Some minor thoughts: 

Are there historical precedents for this sort of thing? Arguably so: wildfires of strategic cognition sweeping through a nonprofit or corporation or university as office politics ramps up and factions start forming with strategic goals, competing with each other. Wildfires of strategic cognition sweeping through the brain of a college student who was nonagentic/aimless before but now has bought into some ambitious ideology like EA or communism. Wildfires of strategic cognition sweeping through a network of PCs as a virus hacks through, escalates permissions, etc.

I feel like none of these historical precedents is a perfect match. It might be valuable to think about the ways in which they are similar and different.

I used fire as an analogy for agents in my understanding agency sequence. I'm pleased to see you also found it helpful.

 

I feel like none of these historical precedents is a perfect match. It might be valuable to think about the ways in which they are similar and different.

To me a central difference, suggested by the word "strategic", is that the goal pursuit should be

  1. unboundedly general, and
  2. unboundedly ambitious.

By unboundedly ambitious I mean "has an unbounded ambit" (ambit = "the area went about in; the realm of wandering" https://en.wiktionary.org/wiki/ambit#Etymology ), i.e. its goals induce it to pursue unboundedly much control over the world.

By unboundedly general I mean that it's universal for optimization channels. For any given channel through which one could optimize, it can learn or recruit understanding to optimize through that channel.

Humans are in a weird liminal state where we have high-ambition-appropriate things (namely, curiosity), but local changes in pre-theoretic "ambition" (e.g. EA, communism) are usually high-ambition-inappropriate (e.g. divesting from basic science in order to invest in military power or whatever).

Isn't the college student example an example of 1 and 2? I'm thinking of e.g. students who become convinced of classical utilitarianism and then join some Effective Altruist club etc.

The spark of strategicness, if such a thing is possible, recruits the surrounding mental elements. Those surrounding mental elements, by hypothesis, make goals achievable. That means the wildfire can recruit these surrounding elements toward the wildfire's ultimate ends... Also by hypothesis, the surrounding mental elements don't themselves push strongly for goals. Seemingly, that implies that they do not resist the wildfire, since resisting would constitute a strong push.

I think this is probably a fallacy of composition (maybe in the reverse direction than how people usually use that term)? Like, the hypothesis is that the mind as a whole makes goals achievable and doesn't push towards goals, but I don't think this implies that any given subset of the mind does that.

Good point, though I think it's a non-fallacious enthymeme. Like, we're talking about a car that moves around under its own power, but somehow doesn't have parts that receive, store, transform, and release energy and could be removed? Could be. The mind could be an obscure mess where nothing is factored, so that a cancerous newcomer with read-write access can't get any work out of the mind other than through the top-level interface. I think that explicitness (https://www.lesswrong.com/posts/KuKaQEu7JjBNzcoj5/explicitness) is a very strong general tendency (cline) in minds, but if that's not true then my first reason for believing the enthymeme's hidden premise is wrong.

I like this post as a vivid depiction of the possible convergence of strategicness. For literal wildfires, it doesn't really matter where or how the fire starts - left to burn, the end result is that the whole forest burns down. Once the fire is put out, firefighters might be able to determine whether the fire started from a lightning strike in the east, or a matchstick in the west. But the differences in the end result are probably unnoticeable to casual observers, and unimportant to anyone that used to live in the forest.


I think, pretty often, people accept the basic premise that many kinds of capabilities (e.g. strategicness) are instrumentally convergent, without thinking about what the process of convergence actually looks like in graphic detail. Metaphors may or may not be convincing and correct as arguments, but they certainly help to make a point vivid and concrete.

Shallow comment:

How are you envisioning the prevention of strategic takeovers? It seems plausible that robustly preventing strategic takeovers would also require substantial strategizing/actualizing.

Are you echoing this point from the post?

We can at least say that, if the totality of the mental elements surrounding the wildfire is going to notice and suppress the wildfire, it would have to think at least strategically enough to notice and close off all the sneaky ways by which the wildfire might wax. This implies that the surrounding mental elements do a lot of thinking and have a lot of understanding relevant to strategic takeovers, which itself seemingly makes more available the knowledge needed for strategic takeovers.

It might be possible for us humans to prevent strategicness, though this seems difficult because even detecting strategicness is maybe very difficult. E.g. because thinking about X also sneakily thinks about Y: https://tsvibt.blogspot.com/2023/03/the-fraught-voyage-of-aligned-novelty.html#inexplicitness

My mainline approach is to have controlled strategicness, ideally corrigible (in the sense of: the mind thinks that [the way it determines the future] is probably partially defective in an unknown way).

meta note on tagging:

This post seemed to be on a topic that... surely there should be a commonly used LW concept for, but I feel like I couldn't think of it. I tagged it "agent foundations" but feel like there should be something more specific.

Maybe "subagents"?
