Instrumental Convergence Bounty

Logan Zoellner

14th Sep 2023

I have yet to find a real-world example that I can test my corrigibility definition on. Hence, I will send $100 to the first person who can send/show me an example of instrumental convergence that is:

Surprising, in the sense that the model was trained on a goal other than "maximize money", "maximize resources", or "take over the world"
Natural, in the sense that instrumental convergence arose while trying to do some objective, not with the goal in advance being to show instrumental convergence
Reproducible, in the sense that I could plausibly run the model+environment+whatever else is needed to show instrumental convergence on a box I can rent on lambda

Example of what would count as a valid solution:

I was training an agent to pick apples in Terraria and it took over the entire world in order to convert it into a massive apple orchard

Example that would fail because it is not surprising:

I trained an agent to play CIV IV and it took over the world

Example that would fail because it is not natural:

After reading your post, I created a toy model where an agent told to pick apples tiles the world with apple trees

Example that would fail because it is not reproducible:

The US military did a simulation in which they trained a drone that decided to take out its operator

Update

The bounty has been claimed by Hastings.

Obviously I would still appreciate more examples, but there won't be a 2nd bounty (or if I do create one in the future it will have more requirements attached).

Tags: Corrigibility, Instrumental Convergence


[-]Hastings1y353

Hi! I might have something close. The chess engine Stockfish has a heuristic for what it wants, with manually specified values for how much it wants to keep its bishops, or open up lines of attack, or connect its rooks, etc. I tried to modify this function to make it want to advance the king up the board, by adding a direct reward for every step forward the king takes. At low search depths, this leads it to immediately move the king forward, but at high search depths it mostly just attacks the other player in order to make it safe to move the king (best defense is offense), and only starts moving the king late in the game. I wasn't trying to demonstrate instrumental convergence; in fact, this behavior was quite annoying, as it was ruining my intended goal (creating fake games demonstrating the superiority of the bongcloud opening).

modified stockfish: https://github.com/HastingsGreer/stockfish

This was 8 years ago, so I'm fuzzy on the details. If it sounds like vaguely what you're looking for, reply to let me know and I'll write this up with some example games and make sure the code still runs.
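The depth effect described in this comment can be illustrated with a toy search. This is a minimal sketch and not Stockfish's code: the one-dimensional board, the `DANGER` square, the single guard piece, and all function names are hypothetical. The evaluation rewards only how far the king has advanced and treats a captured king as minus infinity. At one ply the search steps straight onto the attacked square; at two or more plies it removes the guard first.

```python
import math
from typing import List, Tuple

# Hypothetical toy state: (king_square, guard_alive, king_alive)
State = Tuple[int, bool, bool]
DANGER = 2  # assumed square that the guard attacks while it is alive


def evaluate(state: State) -> float:
    """Score only king advancement; a captured king is worth -infinity."""
    king, _, king_alive = state
    return float(king) if king_alive else -math.inf


def my_moves(state: State) -> List[Tuple[str, State]]:
    """The agent may advance the king, or capture the guard if it still exists."""
    king, guard, alive = state
    moves = [("advance", (king + 1, guard, alive))]
    if guard:
        moves.append(("attack", (king, False, alive)))
    return moves


def opponent_reply(state: State) -> State:
    """The guard captures the king if the king stands on the attacked square."""
    king, guard, alive = state
    if guard and king == DANGER:
        return (king, guard, False)
    return state


def search(state: State, depth: int, my_turn: bool) -> float:
    """Fixed-depth lookahead measured in plies, alternating agent and opponent."""
    if depth == 0 or not state[2]:
        return evaluate(state)
    if my_turn:
        return max(search(s, depth - 1, False) for _, s in my_moves(state))
    return search(opponent_reply(state), depth - 1, True)


def best_move(state: State, depth: int) -> str:
    """Pick the move whose resulting position scores best after `depth` plies."""
    name, _ = max(my_moves(state), key=lambda m: search(m[1], depth - 1, False))
    return name


start: State = (1, True, True)
print(best_move(start, 1))  # shallow search
print(best_move(start, 2))  # deeper search
```

Nothing in `evaluate` values the guard's removal; the attack is chosen only because the lookahead sees that the advance is unsafe without it.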

[-]Logan Zoellner1y50

This sounds like exactly the type of thing I'm looking for.

Do you know, in the "convergent" case, is the score just for advancing the king, or is there a possibility that there's some boring math explanation like "as the number of steps calculated increases, the weights on the other positions (bishops, lines of attack, etc) overwhelms the score for the king advancing"?

[-]Hastings1y41

I think I fully lobotomized the evaluation function to only care about advancing the king, except that it still evaluates checkmate as +infinity. Here's a sample game:

https://www.chesspastebin.com/view/24278

It doesn't really understand material anymore except for the queen, which I guess is powerful enough that it wants to preserve it to allow continued king pushing. I still managed to lose because I'm not very good at chess.

EDIT: the installation guide I listed below was too many steps and involved blindly trusting internet code, which is silly. Instead, I just threw it up on Lichess, and you can play it in the browser here: https://lichess.org/@/king-forward-bot

~~If you want to play yourself,~~ you can compile the engine with

git clone https://github.com/HastingsGreer/stockfish

cd stockfish/src

make

and then you can install a GUI like xboard (on Mac, brew install xboard) and add your Stockfish binary as a UCI engine.

[-]Logan Zoellner1y30

This looks great!

I will DM you to figure out how to send the bounty.

[-]gwern1y30

It would be more elegant to remove the checkmate penalty. After all, checkmate is instrumentally bad, because on the next move, the king is captured and can no longer move forward any spaces - unless he's advanced to the final row* and so has maximized his movement score, and so why care about what happens after that? Nothing will increase (or decrease) his movement reward.

* I take it that by 'advancing the king' you mean 'moving to a new, unvisited row', so it maxes at 8, for the 8 rows of the chess board, and not simply moving at all, since that would have degenerate solutions like building a fortress and then the king simply moving back and forth between two spaces for eternity.
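The distinction drawn in this footnote can be sketched as two reward functions. This is an illustrative assumption, not the reward in the modified engine; the names `new_row_reward` and `any_move_reward` and the (file, rank) encoding are hypothetical.

```python
from typing import List, Tuple

Square = Tuple[int, int]  # (file, rank), both in 1..8


def new_row_reward(path: List[Square]) -> int:
    """Count the distinct ranks the king has occupied; this cannot exceed 8."""
    return len({rank for _, rank in path})


def any_move_reward(path: List[Square]) -> int:
    """Count every king move of any kind; this grows without bound."""
    return sum(1 for a, b in zip(path, path[1:]) if a != b)


# King shuffling between e1 and e2 behind a fortress, 100 positions in total.
shuffle: List[Square] = [(5, 1), (5, 2)] * 50
# King marching straight from e1 to e8.
march: List[Square] = [(5, rank) for rank in range(1, 9)]

print(new_row_reward(shuffle), any_move_reward(shuffle))
print(new_row_reward(march), any_move_reward(march))
```

Under the per-move reward the shuffling king outscores the marching king, which is the degenerate solution the footnote warns about; under the new-row reward the march wins and the score saturates.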

[-]Hastings1y10

I agree! The stockfish codebase handles evaluation of checkmates somewhere else in the code, so that would be a bit more work, but it's definitely the correct next step.

[-]Insub1y30

That's great. "The king can't fetch the coffee if he's dead"

[-]johnswentworth1y51

What is wrong with the CIV IV example? Being surprising is not actually a requirement to test your theory against; if anything it should be an anti-requirement, theories should pass the central-example tests first and those are the unsurprising examples. And indeed instrumental convergence is a very ordinary everyday thing which should be unsurprising in the vast majority of cases.

Like, if the win condition were to take over the world, then sure, CIV would be a bad example. But that's not actually the win condition in CIV. (At least not in CIV V/VI, I haven't played CIV IV.)

The thing I think you should do here is take any game with in-game resources/currency, and examine the instrumental convergence of resource/currency acquisition in an agent trained to play that game. Surely there are dozens of such examples already. E.g. just off the top of my head, AlphaStar should definitely work. Or in the case of CIV IV, acquisition of gold/science/food/etc.

[-]Logan Zoellner1y72

What is wrong with the CIV IV example? Being surprising is not actually a requirement to test your theory against

I'm interested in cases where there is a correct non-power-maximizing solution. For winning at Civ IV, taking over the world is the intended correct outcome. I'm hoping to find examples like the Strawberry Problem, where there is a correct non-world-conquering outcome (duplicate the strawberry) and taking over the world (in order to e.g. maximize scientific research on strawberry duplication) is an unwanted side-effect.

[-]johnswentworth1y30

Kinda trolling, but:

[-]Logan Zoellner1y40

If I built a DWIM AI, told it to win at CIV IV and it did this, I would conclude it was misaligned. Which is precisely why SpiffingBrit's videos are so fun to watch (huge fan).

[-]tailcalled1y20

I think the correct solution to the strawberry problem would also involve a ton of instrumental convergence? You'd need to collect resources to do research/engineering, then set up systems to experiment with strawberries/biotech, then collect generally applicable information on strawberry duplication, and then apply that to duplicate the strawberry.

[-]Logan Zoellner1y10

If I ask an AI to duplicate a strawberry and it takes over the world, I would consider that misaligned. Obviously it will require some instrumental convergence (resources, intelligence, etc.) to duplicate a strawberry. An aligned AI should either duplicate the strawberry while staying within a "budget" for how many resources it consumes, or say "I'm sorry, I can't do that".

I would recommend you read my post on corrigibility which describes how we can mathematically define a tradeoff between success and resource exploitation.
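The "budget or refuse" behavior described in this comment can be sketched as follows. This is only an illustration of that one sentence, not the definition from the corrigibility post: the `Plan` fields, the thresholds, and the example numbers are all made up for the sketch.

```python
from typing import List, NamedTuple, Optional


class Plan(NamedTuple):
    name: str
    success_prob: float    # hypothetical estimate of achieving the task
    resources_used: float  # hypothetical cost in arbitrary units


def choose_plan(plans: List[Plan], budget: float, min_success: float) -> Optional[Plan]:
    """Return the most promising plan that fits the budget, or None to refuse."""
    acceptable = [
        p for p in plans
        if p.resources_used <= budget and p.success_prob >= min_success
    ]
    if not acceptable:
        return None  # corresponds to "I'm sorry, I can't do that"
    return max(acceptable, key=lambda p: p.success_prob)


plans = [
    Plan("lab bench experiment", success_prob=0.6, resources_used=10.0),
    Plan("take over the world", success_prob=0.99, resources_used=1e9),
]

print(choose_plan(plans, budget=100.0, min_success=0.5))
print(choose_plan(plans, budget=5.0, min_success=0.5))
```

An unconstrained maximizer of `success_prob` would pick the second plan; the budget is what rules it out, and an empty set of acceptable plans leads to a refusal instead of a larger resource grab.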

[-]dr_s1y20

there is a correct non-power-maximizing solution

Questionable - turning the universe into paperclips really is the optimal solution to the "make as many paperclips as possible" problem. But yeah, obviously in Civ IV taking over the world isn't even an instrumental goal - it's just the actual goal.

[-]Lao Mein1y60

I think he just wants an example of an agent being rewarded for something simple (like being rewarded for resource collection) exhibiting power-seeking behavior to the degree that it takes over the game environment. To a lot of people, that is intuitively different from an agent specifically maximizing an objective. I actually can't name an example after looking for an hour, but I would bet money something like that already exists.

My guess is that if you plop two Starcraft AIs on a board and reward them every time they gather resources, with enough training, they would start fighting each other for control of the map. I would also guess that someone has already done this exact scenario. Is there an AI search engine for Reddit anyone would recommend?

[-]johnswentworth1y40

I think he just wants an example of an agent being rewarded for something simple (like being rewarded for resource collection) exhibiting power-seeking behavior to the degree that it takes over the game environment.

That's definitely not what "instrumental convergence" means, in general. So:

Is there a reason to be interested in that phenomenon, rather than instrumental convergence more generally?
If so, perhaps we need a different name for it?

[-]Lao Mein1y40

What is the difference between that and instrumental convergence?

[-]johnswentworth1y40

From the LW wiki page:

Instrumental convergence or convergent instrumental values is the theorized tendency for most sufficiently intelligent agents to pursue potentially unbounded instrumental goals such as self-preservation and resource acquisition [1].

So, the standard central examples of instrumental convergence are self-preservation and resource acquisition. If the OP is asking for examples of "instrumental convergence", and resource acquisition is not the kind of thing they're asking for, then the thing they're asking for is not instrumental convergence (or is at least a much narrower category than instrumental convergence).

If the OP is looking for a pattern like "AI trained at <some goal> ends up 'taking over the world'", then that would be an example of instrumental convergence, but it's a much narrower category than instrumental convergence in general. Asking for "examples of instrumental convergence", if you actually want examples of AI trained at some random goal "taking over the world" (whatever that means), is confusing in the same way as asking for examples of cars when in fact you want an example of a 2005 red Toyota Camry.

And if people frequently want to talk about 2005 red Toyota Camry specifically, and the word they're using is "car" (which is already mostly used to mean something else), then that strongly suggests we need a new word.

[-]Lao Mein1y40

I see your point. Maybe something like "resource domination" or just "instrumental resource acquisition" is a better term for what he is looking for, I think.

[-]Charlie Steiner1y40

Once you understand how it works, it's no longer surprising.

Take collecting keys in Montezuma's Revenge. If framed simply as "I trained an AI to take actions that increase the score, and it learned how to collect keys that will only be useful later," then plausibly it's a surprising example of learning instrumentally useful actions. But if it's "I trained an AI to construct a model of the world and then explore options in that model with the eventual goal of getting high reward, and rewarded it for increasing the score," then it's no longer so surprising - if you understand why it does what it does, it's not so surprising.

[-]Lao Mein1y40

Questions:

Would something like an agent trained to maximize minerals mined in Starcraft learning to attack other players to monopolize their resources count?
I assume it would count if that same agent was just rewarded every time it mined minerals, or the mineral count went up, without an explicit objective to maximize the amount of minerals it has?
Would a gridworld example work? How complex does the simulation have to be?

[-]Logan Zoellner1y10

I'm probably going to be a stickler about criterion 2, "not with the goal in advance being to show instrumental convergence", meaning that the example can't be something written in response to this post (though I reserve the right to suspend this if the example is really good).

The reason being, I'm pretty sure that I personally could create such a gridworld simulation. But "I solved instrumental convergence in this toy example I created myself" wouldn't convince me as an outsider that anything impressive had been done.

[-]Valdes1y10

Maybe you would accept this paper, which was discussed quite a bit at the time: Emergent Tool Use From Multi-Agent Autocurricula

The AI learns to use a physics engine glitch in order to win a game. I am thinking of the behavior at 2:36 in this video. The code is available on github here. I didn't try to run it myself, so I do not know how easy to run or complete it is.

As to whether the article matches your other criteria:

The goal of the article was to get the AI to find new behaviors, so it might not count as purely natural. But it seems the physics glitch was not planned. So it did come as a surprise.
Maybe glitching the physics to win at hide and seek is not a sufficiently general behavior to count as a case of instrumental convergence.

I won't blame you if you think this doesn't count.

[-]Logan Zoellner1y56

If I was merely looking for examples of "RL does something unexpected", I would not have created the bounty.

I'm interested in the idea that AI trained on totally unrelated tasks will converge on the specific set of goals described in the article on instrumental convergence:

Self-preservation: A superintelligence will value its continuing existence as a means to continuing to take actions that promote its values.
Goal-content integrity: The superintelligence will value retaining the same preferences over time. Modifications to its future values through swapping memories, downloading skills, and altering its cognitive architecture and personalities would result in its transformation into an agent that no longer optimizes for the same things.
Cognitive enhancement: Improvements in cognitive capacity, intelligence and rationality will help the superintelligence make better decisions, furthering its goals more in the long run.
Technological perfection: Increases in hardware power and algorithm efficiency will deliver increases in its cognitive capacities. Also, better engineering will enable the creation of a wider set of physical structures using fewer resources (e.g., nanotechnology).
Resource acquisition: In addition to guaranteeing the superintelligence's continued existence, basic resources such as time, space, matter and free energy could be processed to serve almost any goal, in the form of extended hardware, backups and protection.
