Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Creating a Proper Space to Design and Test Embedded Agents

Introduction

I was thinking about the embedded agency sequence again last week, and thought “It is very challenging to act and reason from within an automaton.” Making agents which live in automata and act well in dilemmas is a concrete task for which progress can be easily gauged. I discussed this with a group, and we came up with desiderata for an automaton in which interesting agent strategies emerge.

In this post, I list some inspirations, some desiderata for this testing space, and then a sketch of a specific implementation.

The embedded agency paper lays out four main difficulties of being embedded. The goal here is to design an automaton (i.e., an agent environment) which can represent dilemmas whose solutions require insights in these four domains.

  • Decision theory: Embedded agents do not have well-defined IO channels.
  • World models: Embedded agents are smaller than their environment.
  • Robust delegation: Embedded agents are able to reason about themselves and self-improve.
  • Subsystem alignment: Embedded agents are made of parts similar to the environment.

Inspiration

I draw concepts from four different automatons.

Core War is close to what we need, I think. It’s a game that takes place in (e.g.) a megabyte of RAM, where one assembly instruction takes up one memory slot, agents are assembly programs, and they fight each other in this RAM space. It has several relevant features: determinism, no input/output/interference, and fully embedded computation. I suspect limiting the range of sight could make Core War more interesting from an artificial-life point of view, because then agents would need to travel around to see the world.

Conway’s Game of Life is Turing complete, so in principle it can encode basically anything you can write in a programming language. It’s extremely simple, but I think too verbose to hand-write dilemmas in.

Botworld 1.0 is a cellular automaton designed to be interesting in an embedded-agency sense, and it’s useful but pretty complex. It does encode interesting problems such as the stag hunt and the prisoner’s dilemma. I think something closer to Core War, which is more parsimonious, would be more fruitful to design programs in.

Real life is maybe basically an automaton. Almost all effects are localized, with waves and particles traveling at most the speed of light. The transition function is nondeterministic and there are real numbers and complex numbers in the mix, but I don’t know if any of this makes a big difference from an agent-design point of view. We still have all the problems of decision theory, world models, robust delegation, and subsystem alignment.

Desiderata for Automaton

Standard decision/game theory dilemmas are representable

The 5 & 10 problem, (twin) prisoner’s dilemma, transparent Newcomb problem, death in Damascus, stag hunt, and other problems should at least be expressible in the automaton. The 5 & 10 problem especially needs to be representable. Getting the agent to know what the situation is and getting it to act well are both separate problems from setting up the dilemma.

It’s actually possible for agents to win; Information is discoverable

In order to decide between $5 and $10, the agent first needs to know the two available actions and their utilities. Of course, you cannot simply endow your agent with this knowledge, because the same agent code needs to work in many dilemmas; besides, it’s trivial to make an agent which passes just one dilemma.

So how should agents find and use knowledge from the world? Core War programs (called “warriors”) use conditional jumps [1]; there’s no notion of reading in or outputting a value, but you can condition your action on a specific value at a specific location. I think this should be how agents use information and make decisions at the lowest level. (Maybe any other way of reading and using knowledge is reducible to something like this anyway?)

[1] “If a < b, jump to instruction x; else jump to instruction y.”

Agents are scorable, swappable, and comparable; Money exists

My hope for creating this automaton is that I (and others) will design agents for it which use self-knowledge & successor-building & world-modeling as emergent strategies; those strategies should not be explicitly advantaged by the physics. Yet agent designers need some sort of objective to optimize when designing agents; it needs to be clear when one agent is better than another in a given environment. The best solution I can think of is to have “dollars” lying around in the world, and the objective of agent-designers is to have the agent collect as many dollars as possible.

An environment file includes insertion points (where agents begin at the first timestep) and the maximum agent size (or other constraints). The command-line utility takes in an environment file and the appropriate number of agent files, and returns the number of dollars that each agent got in that world. So you could put agent1 in transparent Newcomb, then try agent2, and see which did better and how much money each made. There could also be an option for logging, or for interrupting & modifying the environment.
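Concretely, the interface might look something like the following. Every name here (the tool, the file names, the flags, the output format) is hypothetical; nothing exists yet:

```
$ automaton transparent-newcomb.env agent1 --max-timesteps 100000 --log
agent1: $1000000
```

The printed score is the number of dollars the agent had collected when the run ended.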

Omega is representable

In general, nothing within a world can perfectly simulate any portion of the world (including itself), because the simulator is always too small, but pretty-good prediction is possible. Some of the most interesting dilemmas require the presence of reliable predictors, and some of our hardest decisions in life are hard because other people are predicting us, so we want predictors to be possible within ordinary world-physics. Call the reliable predictor "Omega".

We need agents to understand what Omega is and what it’s doing but somehow not be able to screw with it. This could be done with read-only zones or by giving Omega ten turns before the agent gets one turn; the turn-management could be done with “energy tokens” which agents spend as actions.

Omega also needs to somehow safely execute the agent without getting stuck in infinite loops or allowing the simulation to escape or move the money around or something. I have no idea about this part. Perhaps the reliable predictor should just sit outside the universe. Or we could just say that it’s against the spirit of the game to screw with Omega.
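One simple mechanism for the infinite-loop part is a hard step budget: Omega simulates the agent on a private copy of memory and gives up if the budget runs out. A hedged sketch, assuming a `run_step(mem, ip)` function (hypothetical, not specified anywhere above) that executes one instruction and returns the new instruction pointer, or None when the agent halts or dies:

```python
def predict(run_step, mem, ip, budget=10_000):
    # Omega's sandbox: simulate on a private copy of memory, so the
    # simulation can neither escape nor move the real money around.
    sandbox = list(mem)
    for _ in range(budget):
        ip = run_step(sandbox, ip)
        if ip is None:                 # agent halted (or died) in simulation
            return sandbox             # final simulated state = the prediction
    return None                        # budget exhausted: prediction inconclusive
```

This dodges the "simulation escapes" problem by construction, but not the "agent inspects and games Omega" problem, which the read-only zones or energy tokens would have to handle.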

Engineering details

  • The automaton is a command-line program that takes in one environment file (including agent insertion points) and zero or more agent files, and other options.
  • You can generate these environment & agent files using your programming language of choice, or by hand, or by putting ink on a chicken’s feet, etc.
  • An agent file is a sequence of instructions (operation A B)
  • An option to enable logging
  • An option for debug/interference mode
  • An option for max timesteps
  • The automaton should run fast, so hill climbing over agent programs is feasible.
  • An error is thrown immediately if the agent or environment file is malformed.
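The speed requirement matters because the intended workflow includes automated search over agent programs. A hedged sketch of what that could look like, assuming a `score(program)` function (hypothetical) that runs the automaton and returns the dollars collected:

```python
import random

def hill_climb(score, program, slot_values, steps=5000):
    # Mutate one slot of the agent program at a time, keeping the mutant
    # whenever it scores at least as well as the current best. A fast
    # automaton inner loop is what makes this many evaluations feasible.
    best, best_score = list(program), score(program)
    for _ in range(steps):
        cand = list(best)
        cand[random.randrange(len(cand))] = random.choice(slot_values)
        s = score(cand)
        if s >= best_score:
            best, best_score = cand, s
    return best, best_score
```

Accepting ties (`>=`) lets the search drift across plateaus where many programs score equally, which seems likely in a world where most mutations are neutral.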

An initial specification

I don’t know if this is sufficient or well-designed, but here’s my current idea for an automaton. I am mostly copying Core War.

  • The world is a finite number of ‘memory slots’
    • Instructions use relative distances, e.g. +5 or -5, rather than absolute addresses like 0x1234
    • Memory is a circular tape. (Distances are taken modulo the memory size.)
      • This implies agents cannot know their index in space, but it doesn’t matter anyway
    • One memory slot is a triple (isMoney, hasInstructionPointer, value)
    • Every operation takes two arguments, which are the values of the locations which the next two slots point to
      • It’s like a function call operation(*A, *B) where A and B are the two slots after operation, and *A means “value stored at location A”.
  • Each ‘agent’ is essentially an instruction pointer to a memory slot
    • Each timestep, each agent executes their pointer in order
      • If the command is not a jump, then the pointer jumps three slots ahead, to what should be the next instruction
  • Agents have many operations available
    • Special stuff
      • DIE or any invalid operation kills the agent. Used e.g. when two agents want to kill each other: each tries to get the other to execute this, like in Core War
      • JIM jump to A if B is money
      • JIP jump to A if B is an instruction pointer
    • Regular assembly-like stuff:
      • DAT works as a no-op or for data storage. If you wanted to store 7 and 14 in your program, then you could have DAT 7 14 as a line in your agent program, and then reference those values by relative position later.
      • MOV copy from A to B
      • ADD add A to B and store in B
      • SUB subtract A from B and store in B
      • JMP unconditional jump to A
      • CMP skip next instruction if A = B
      • SLT skip next instruction if A < B
      • JMZ jump to A if B=0
      • JMN jump to A if B!=0
      • DJN decrement B, then jump to A if B is not 0
  • ¿Limit relative jumps to 100 or something in magnitude, then Omega can build an invincible wall around itself and blow up anything that tries to reach it?
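The spec above can be sketched as a tiny interpreter. This is a simplified, hedged sketch, not the real design: each slot holds a single value (the isMoney / hasInstructionPointer flags are dropped), and only a handful of the listed operations are implemented.

```python
class World:
    def __init__(self, size=64):
        self.mem = [0] * size   # circular tape of slot values
        self.agents = []        # one instruction pointer per agent (None = dead)

    def load(self, program, at):
        # Write a program into memory and spawn an agent at its first slot.
        for i, v in enumerate(program):
            self.mem[(at + i) % len(self.mem)] = v
        self.agents.append(at % len(self.mem))

    def step(self):
        n = len(self.mem)
        for i, ip in enumerate(self.agents):
            if ip is None:
                continue                              # dead agents stay dead
            op = self.mem[ip]
            a, b = self.mem[(ip + 1) % n], self.mem[(ip + 2) % n]
            nxt = (ip + 3) % n                        # default: advance past this instruction
            if op == "DAT":
                pass                                  # no-op / data storage
            elif op in ("MOV", "ADD", "JMP", "JMZ") and isinstance(a, int) and isinstance(b, int):
                pa, pb = (ip + a) % n, (ip + b) % n   # relative, modular addressing
                if op == "MOV":
                    self.mem[pb] = self.mem[pa]       # copy *A into slot B
                elif op == "ADD":
                    if isinstance(self.mem[pa], int) and isinstance(self.mem[pb], int):
                        self.mem[pb] += self.mem[pa]  # add *A into *B
                elif op == "JMP":
                    nxt = pa                          # unconditional jump to A
                elif op == "JMZ" and self.mem[pb] == 0:
                    nxt = pa                          # jump to A if *B is zero
            else:
                self.agents[i] = None                 # DIE or any invalid op kills
                continue
            self.agents[i] = nxt
```

For example, loading the program `["MOV", 3, 4, 99, 0]` into an 8-slot world copies 99 into slot 4 on the first timestep, then kills the agent on the second timestep when it tries to execute the invalid opcode 99.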

A Contest Thing?

I could publish a collection of public environments and create some private environments too. People can design agents which score well in the public environment, then submit it for scoring on the private environments, like a Kaggle contest. This, like any train/test split, would reduce overfitting.

Attempting to Specify a Couple Dilemmas and Agents

Vanilla prisoner’s dilemma

Two programs in two places in memory. Left is somehow defect and right is somehow cooperate. Agents can see each other’s code and reason about what the other will do. Omega kills everyone after 1000 timesteps if no decision is reached, or something.

Twin prisoner’s dilemma

Omega copies the same agent to two places in memory. Left is defect and right is cooperate.

Transparent Newcomb’s Problem / Parfit’s Hitchhiker

Omega has two boxes and it gives the agent access to one or the other depending on its behavior.

Omega itself in Transparent Newcomb

Something like this code:

  1. Environment starts with $1,000,000 in one walled box and $0 in another box. Money cannot be created, only absorbed by agents’ instruction pointers.
  2. Find where the agent is by searching through memory
  3. Copy it locally and surround it with walls
  4. Put trailer code on the agent that returns the instruction pointer to Omega when the agent is done
  5. Execute the agent
  6. If it (e.g.) one-boxed, then destroy the surrounding walls
  7. Somehow now kickstart the real agent¿

Agent in Transparent Newcomb

Probably very flawed but…

  1. Search space for any other programs
  2. Somehow analyze it for copy-and-run behavior
  3. Somehow infer that this other program is controlling walls around money tokens
  4. Somehow infer that if you one-box then you’ll get the bigger reward
  5. Output one-box by writing a 1 to the designated spot
    • This spot is somehow inferred through analyzing Omega, I guess
      • Or it could be a standard demarcation that many agents use, e.g. 1-2-3-3-2-1 as a “communication zone”

Pre-mortem

I’ll briefly raise and attempt to respond to some modes of failure for this project.

The agent insertion point did too much work

It could turn out that the interesting/challenging part of the embedded agency questions is in drawing the boundary around the agent, so giving the sole starting location of the agent is dodging the most important problem. I think that this problem is fully explored, however, if we somehow pause the agent and let some other things copy & use its code before the agent runs. Then the agent must figure out what has happened and imagine other outcomes before it chooses actions.

The agent-designer is outside of the environment. Cartesian!

The human or evolutionary algorithm or whatever designing the agents is indeed outside of the universe, and cannot directly suffer consequences, be modified, etc. However, they cannot interfere once the simulation has started, and any knowledge they have must fully live in their program in order for it to succeed in a variety of environments. I think that, if you design an agent which passes 5&10, Newcomb, prisoner’s dilemma, etc, then you must have made some insights along the way. Otherwise, maybe these problems were easier than I thought.

We only find successful agents through search, and they are incomprehensible

This is maybe the most likely way for this project to fail, conditional on me actually doing the work. I would say that, even in this case, we can learn something about the agent by running experiments on it or somehow asking it questions, like how we analyze humans.

Conclusion

Automatons are a more accurate model of the difficulties of agency in the real world than reinforcement learning problems, so we need to do more task-design, agent-design, and general experimentation in this space. My plan is to create an automaton, used as a command-line utility, which will run a given set of agents in a given environment (e.g. prisoner’s dilemma). Ideally, we’ll have a large set of task environments, and we can design agents with the goal of generality.

Comments

It seems like collecting dollars requires a hard-coded notion of how to draw the boundary around the agents, which runs contrary to the intention. It seems more natural to require the agents to strive to change the world in a particular way (e.g. maximize the number of rubes in the world).

Yes, I agree it feels fishy. The problem with maximizing rubes is that the dilemmas might get lost in the details of preventing rube hacking. Perhaps agents can "paint" existing money their own color, money can only be painted once, and agents want to paint as much money as possible. Then the details remain in the environment.

Or something simpler would be that the agent's money counter is in the environment but unmodifiable except by getting tokens, and the agent's goal is to maximize this quantity. Feels kind of fake maybe because money gives the agent no power or intelligence, but it's a valid object-in-the-world to have a preference over the state of.

Yet another option is to have the agent maximize energy tokens (which actions consume).

Consider drawing inspiration from the game Baba is You. Implement high-level modules like "argmax" of which perhaps a dozen can be assembled into an agent. Give it the ability to rearrange these modules, if its current configuration judges this to be a good idea, perhaps because the surplus can be turned into rubes. These modules need not be near the avatar; that way, the avatar can move into the control circuit to change it.

the objective of agent-designers is to have the agent collect as many agents as possible

Typo: should say "dollars"?