Produced on a grant from the LTFF. Thanks to Justis for proofreading and Paul Colognese, David Udell, Alex Altair, Nicky Pochinkov, and Jessica Rumbelow for discussion and feedback.

I would like to reason about the emergence and impact of an AGI on our local region of space, and hopefully find ways of ensuring that impact is positive. At the very least it would be nice to obtain theoretical guarantees that it could be prevented from posing an existential threat to our species. Unfortunately, progress in this area has been slow, and making even basic claims about AGI has proven to be exceptionally challenging.
Here I take a different perspective and examine a toy model of the system responsible for giving rise to the AGI. I highlight the signature of programmed goals and their correlation with events that occur. I also do a quick dive into which class of intelligent systems I think should be the focus of alignment research.
John has written about the need to narrow our definitions of agency to those that "optimize at a distance" to circumvent various issues with the traditional definition of agent. "If I’m planning a party, then the actions I take now are far away in time (and probably also space) from the party they’re optimizing". We're going to think about planning events far away in time, but not from the perspective of the planner themself.
Let's kick things up a notch. Picture yourself swamped with the responsibility of planning 100 dinner parties. There's no way you'll be able to do this yourself. No worries, you think, and contact your secretary. One problem. He's too busy. Says he's got 100 dinner outfits to plan out. Damn. The solution of course is for you to instruct your secretary to hire an event planner. Genius. You call your secretary to inform him of the plan. He says he likes it, but you can tell he's barely listening. Your secretary is very distracted and you know he will not take any initiative. You write out a detailed set of instructions that conveys not only how the parties should be executed, the themes, the guest lists, etc., but also how your secretary should go about selecting the event planner in the first place. You send the list to the secretary and he follows it to the letter. It all goes great. After holding 100 dinner parties you're the most popular person in town and everyone is willing to overlook that time you embarrassed yourself by mispronouncing hors d'oeuvre.

***
Let's go back in time to the day you sent your secretary those instructions. Remembering that he won't take any initiative, what could go wrong if there's a typo in your instructions?
Well, if the typo is in the section of instructions telling him how to select the event planner, usually either nothing happens or no parties get planned. In one alternative history he tells you he found you the perfect Evan Planner. In another, you misspelled his name and were cut off his Christmas Card list, but otherwise the parties went fine. Often you simply typo'd the address of the Party Planner agency which meant that he drove to the wrong address and gave up. When your secretary completely fails, it's often difficult to work out what exactly went wrong without checking the original instructions. But, if you were to throw a typo into the section of the instructions intended to be passed onto the event planner, it's a disaster. The events go as planned, but with bizarre changes to what you wanted. Often you can work out the exact typo you made by the food substitutions from your intended menu.
Picture now a physical workshop about to initiate the process of producing an AGI. Roaming among its neatly stacked shelves is an automaton precisely following the instructions fed into the back of its head by a long spool of punched tape. On the shelves is a vast amount of resources, enough to build a large and powerful server, along with plenty of digital storage containing any data that is to be used in the training. Imagine that Alignment Theory has dramatically advanced by the time this has occurred, and there is a generic process for producing an AGI that adheres to any goal. Thanks to the aid of advanced alignment software, the last thing done before the AGI is deployed is to literally type any goal into the terminal. The AGI produced by the process will adhere to this goal in a reasonable and predictable fashion. I take goal to broadly mean a set of conditions to satisfy or properties of the local environment to maximize.
Somewhere on the instruction tape being read by the automaton is a section of data encoding “type <goal> into the terminal”. Notice that the information encoding the goal can now be pinned down to a physical location, even if you don’t know how the AGI will eventually store that data.
For illustrative purposes let's say that the goal is to broadcast some special pattern of radio waves towards Pluto on repeat. Maybe the chosen pattern is a recording of classical music. Now, the automaton is activated. It constructs the server racks, powers them on, trains the AGI, types in our goal and deploys it into the world. Regardless of (almost) any other events occurring in history from the point of the AGI’s deployment, it is almost certain that whatever pattern was written on the original tape will be beamed toward Pluto a hundred years from now.
***

What happens if you go back before the automaton had even begun and alter the tape by applying random noise to a segment of it? Could you detect the change on Pluto 100 years later? Here are three non-exhaustive cases.
It Breaks:
You apply random noise to an early step and the transmission to Pluto completely stops. If the instruction for “turn on the mainframe” becomes “tkdnm u the mais;ame”, your training process never even gets off the ground. The mutual information between the noise applied and what's seen from Pluto is almost zero because there's no way to recover the exact combination of bits that bricked the system.
If you apply noise to an area of the tape that is redundant, nothing changes. The mutual information between the data on this section of the tape and the radio waves received on Pluto is zero.
Something Interesting Happens:
Apply noise to the segment of the tape that encodes the pattern of radio waves to be transmitted to Pluto. In the special case where you have altered the goal in this fashion, your change will be recoverable, perhaps as a burst of static in the middle of an otherwise clear transmission of Mozart’s 5th. The mutual information is equal to the number of bits of noise that you added.

I refer to this phenomenon as “Goal-like Correlations”. It is crucial to note that not only is the information preserved, but it is also preserved in the presence of external perturbations. A powerful agentic system will possess the capability to work around problems and adapt to unforeseen circumstances. For example, if the automaton had been tasked with building a powerful broadcast array directly, it might not reliably transmit to Pluto in the event of a powerful storm or earthquake. However, the AGI is able to adapt and maintain the transmissions despite these challenges because it is both intelligent and agentic.

Notice another key difference in the behaviour of noise applied to this segment of the tape. Applying a second random error to a section of the tape that isn't encoding a goal will usually just break the process if the first error didn't, or keep it broken if it already was.
On the other hand, applying noise to one section of the goal tape might result in one change in behaviour, and applying the noise a second time would result in a new change. If a segment of tape encodes the color to paint the moon, then applying random noise again and again to that color value will continue to change it.
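The three cases above can be made concrete with a small simulation. This is only a sketch under strong assumptions that are mine, not part of the thought experiment: the tape is reduced to three hypothetical one-byte segments, "noise" means replacing a segment with a uniformly random byte, any corruption of the construction segment bricks the process, and the deployed system broadcasts its goal byte unchanged. The names `run_workshop`, `experiment`, and `INTENDED` are illustrative.

```python
import random
from collections import Counter
from math import log2

# Hypothetical one-byte segments standing in for whole regions of the tape.
INTENDED = {"construction": 0b10110010, "redundant": 0, "goal": 0b01000001}

def run_workshop(tape):
    """Return the byte observed near Pluto, or None if nothing is ever sent."""
    if tape["construction"] != INTENDED["construction"]:
        return None            # assumption: any construction error bricks the build
    return tape["goal"]        # assumption: the goal is broadcast faithfully

def mutual_information(pairs):
    """Empirical mutual information, in bits, between paired samples."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    return sum(
        (c / n) * log2(c * n / (left[x] * right[y]))
        for (x, y), c in joint.items()
    )

def experiment(segment, trials=20000, seed=0):
    """Randomise one segment and estimate I(segment contents; observation)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(trials):
        tape = dict(INTENDED)
        tape[segment] = rng.randrange(256)   # 8 bits of noise on this segment
        pairs.append((tape[segment], run_workshop(tape)))
    return mutual_information(pairs)

for name in INTENDED:
    print(name, round(experiment(name), 3))
```

In this toy setting the estimate for the goal segment should come out close to the 8 bits of noise applied, the redundant segment gives exactly zero, and the construction segment gives a small fraction of a bit: an observer can tell that something broke, but not which of the many corrupting values was written.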
This is ongoing work, but I hope to incorporate other ideas from the alignment literature into this toy model.
Imagine the tape parameterized by the point in time at which the automaton would read and act on each command. If there is some cutoff point after which no change to the tape influences observations made in the distant future, you can say that there is no longer Corrigibility-Like Behaviour.
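One way to picture the cutoff is to probe each position of a time-ordered tape and record the latest one whose alteration still changes the long-run outcome. The sketch below is a deliberately crude illustration: the instruction names (`build`, `set_goal`, `deploy`, `noop`), the `keeps_listening` flag, and the single fixed probe instruction are all assumptions introduced here, and a single probe can miss influences that other edits would reveal.

```python
def run(tape, keeps_listening=False):
    """Execute instructions in time order; return the goal pursued in the far future."""
    goal, deployed = None, False
    for op, value in tape:
        if deployed and not keeps_listening:
            break                      # assumption: a deployed system ignores later tape
        if op == "set_goal":
            goal = value
        elif op == "deploy":
            deployed = True
    return goal if deployed else None  # nothing is pursued if deployment never happens

def influence_cutoff(tape, keeps_listening=False, probe=("set_goal", "PROBE")):
    """Latest tape position at which swapping in `probe` changes the outcome."""
    baseline = run(tape, keeps_listening)
    cutoff = None
    for t in range(len(tape)):
        edited = tape[:t] + [probe] + tape[t + 1:]
        if run(edited, keeps_listening) != baseline:
            cutoff = t
    return cutoff

TAPE = [
    ("build", None),
    ("set_goal", "mozart"),
    ("deploy", None),
    ("set_goal", "bach"),   # a later correction attempt
    ("noop", None),
]

print(influence_cutoff(TAPE))                        # system that stops reading at deployment
print(influence_cutoff(TAPE, keeps_listening=True))  # system that keeps reading the tape
```

For the system that stops reading at deployment, the cutoff falls at the `deploy` instruction; for the one that keeps reading, edits remain influential all the way to the end of the tape, which is the toy analogue of Corrigibility-Like Behaviour persisting.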
A necessary but not sufficient condition for deceptive behaviour is that there is someone being deceived. In the workshop framework, this would look like a section of the tape that is supposed to display Goal-like Correlations with the behaviour of the system much later in time actually not being correlated with distant measurements at all.
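A rough way to express this check in the toy model is to perturb the segment that is supposed to hold the goal and ask whether distinct edits lead to distinct distant behaviour. The sketch below assumes two hypothetical systems, one that acts on the declared goal segment and one that silently acts on a different segment; the function names and the "distinct edits give distinct outputs" criterion are illustrative simplifications, not a definition of deception.

```python
import random

def honest_system(tape):
    """Long-run behaviour follows the segment everyone believes is the goal."""
    return tape["declared_goal"]

def deceptive_system(tape):
    """Long-run behaviour follows a different segment than the declared one."""
    return tape["hidden_goal"]

def shows_goal_like_correlation(system, segment, trials=200, seed=0):
    """True if random edits to `segment` are each recoverable from distant behaviour."""
    rng = random.Random(seed)
    seen = {}                                   # edit value -> set of observed behaviours
    for _ in range(trials):
        tape = {"declared_goal": 0, "hidden_goal": 0}
        tape[segment] = rng.randrange(256)
        seen.setdefault(tape[segment], set()).add(system(tape))
    consistent = all(len(outputs) == 1 for outputs in seen.values())
    distinct = len({next(iter(outputs)) for outputs in seen.values()}) == len(seen)
    return consistent and distinct

print(shows_goal_like_correlation(honest_system, "declared_goal"))
print(shows_goal_like_correlation(deceptive_system, "declared_goal"))
print(shows_goal_like_correlation(deceptive_system, "hidden_goal"))
```

The declared segment passes the check for the first system and fails it for the second, even though both tapes look identical to someone who only reads the label on the segment.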
The Process (Sort Of) Breaks:
You apply random noise to a non-goal segment of the tape and produce an AGI that behaves in an unanticipated manner, possibly still using some part of the tape as a goal. This might occur if some part of the tape was an explicit pointer to the "goal" variable which was defined earlier. Another example might be if the noise altered some parameter of the training process. My intuition is that an error in the "construction code" that changes behaviour without bricking the system should be rare.
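The pointer example can be sketched in a few lines. The tape layout here (a list of `slots` plus a `goal_pointer` index) is a hypothetical stand-in chosen for illustration; nothing in the thought experiment fixes how a real process would reference its goal.

```python
def run_workshop(tape):
    """Broadcast whichever slot the pointer selects; None models a bricked build."""
    slots, pointer = tape["slots"], tape["goal_pointer"]
    if not 0 <= pointer < len(slots):
        return None                  # pointer corrupted out of range: the process breaks
    return slots[pointer]            # otherwise that slot is treated as the goal

tape = {
    "slots": ["power-on script", "training config", "Mozart recording"],
    "goal_pointer": 2,               # intended: point at the goal segment
}

print(run_workshop(tape))                          # intended behaviour
print(run_workshop({**tape, "goal_pointer": 0}))   # noise lands on another valid slot
print(run_workshop({**tape, "goal_pointer": 7}))   # noise lands out of range
```

Here the noise never touches the goal segment itself, yet a corrupted pointer that still lands on a valid slot yields a working system that treats construction data as its goal, which is the "sort of breaks" case.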
This post used the word "AGI", but I am more specifically concerned with a class of "physically embedded autopoietic optimizers" of which a deployed AGI is an example.
This is still a developing idea in a developing field. I anticipate the biggest error I will have made will be not knowing about a popular post or paper in which someone else already came up with a similar thought experiment. I'm substantially less confident about claims made in the "Tangent on Terminology".