Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Epistemic status: trying to feel out the shape of a concept and give it an appropriate name. Trying to make explicit some things that I think exist implicitly in many people's minds. This post makes truth claims, but its main goal is not to convince you that they are true.

Here are some things I would expect any AGI to be able to do:

  • Operate over long intervals of time relative to its sensory bandwidth (e.g. months or years of ~30 fps visual input). 
  • Remember specific sensory experiences from long ago that are relevant to what's happening to it now.  (E.g. remember things it saw months or years ago.)
  • Retain or forget information and skills over long time scales, in a way that serves its goals.  E.g. if it does forget some things, these should be things that are unusually unlikely to come in handy later.
  • Re-evaluate experiences that happened a long time ago (e.g. years ago) in light of newer evidence (observed in e.g. the last hour), and update its beliefs appropriately.
  • Continually adjust its world model in light of new information during operation.
    • E.g. upon learning that a particular war has ended, it should act as though the war is not happening, and do so in all contexts/modalities.
    • As with humans, this adaptation may take a nonzero amount of time, during which it might "forget" the new fact sometimes.  However, adaptation should be rapid enough that it does not impede acting prudently on the most relevant implications of the new information.
    • This may require regular "downtime" to run offline training/finetuning (humans have to sleep, after all).  But if so, it should require less than 1 second of downtime per second of uptime, ideally much less.
  • Perform adjustments to itself of the kind described above in a "stable" manner, with a negligibly low rate of large regressions in its knowledge or capabilities.
    • E.g. if it is updating itself by gradient descent, it should do so in a way that avoids (or renders harmless) the gradient spikes and other instabilities that cause frequent quality regression in the middle of training for existing models, especially large ones.
  • Keep track of the broader world context while performing a given task.
    • E.g. an AGI playing a video game should not forget about its situation and goals in the world outside the game.
    • It might "get distracted" by the game (as humans do), but it should have some mechanism for stopping the game and switching to another task if/when its larger goals dictate that it should do so, at least some of the time.
  • Maintain stable high-level goals across contexts.  E.g. if it is moved from one room to another, very different-looking room, it should not infer that it is now "doing a different task" and ignore all its previously held goals.

I'm not sure how related these properties are, though they feel like a cluster in my mind.  In any case, a unifying theme of this list is that current ML models generally do not do these things -- and we do not ask them to do these things.

We don't train models in a way that encourages these properties, and in some cases we design models whose structures rule them out.  Benchmarks for these properties are either nonexistent, or much less mature than more familiar benchmarks.

Is there an existing name for this cluster?  If there isn't one, I propose the name "autonomy."  This may not be an ideal name, but it's what I came up with.

I think this topic is worthy of more explicit discussion than it receives.  In debates about the capabilities of modern ML, I usually see autonomy brought up in a tangential way, if at all.

ML detractors sometimes cite the lack of autonomy in current models as a flaw, but they rarely talk about the fact that ML models are not directly trained to do any of this stuff, and indeed are often deployed in a manner that renders this stuff impossible.  (The lack of autonomy in modern ML is more like a category mistake than a flaw.)

ML enthusiasts sometimes refer to autonomy dismissively, as something that will be solved incidentally by scaling up current models -- which seems like a very inefficient approach, compared to training for these properties directly, which is largely untried.

Alternatively, ML enthusiasts may cite the very impressive strides made by recent generative models as evidence of generally fast "ML progress," and then gesture in the direction of RL when autonomy is brought up.

However, current generative models are much closer to human-level perception and synthesis (of text, pictures, etc) than current RL models are to human-level autonomy.

State-of-the-art RL can achieve some of the properties above in toy worlds like video games; it can also perform at human level at some tasks orthogonal to autonomy, as when the (frozen) deployed AlphaZero plays a board game.  Meanwhile, generative models are producing illustrative art at a professional level of technical skill across numerous styles -- to pick one example.  There's a vast gap here.

Also, as touched upon below, even RL is usually framed in a way that rules out, or does not explicitly encourage, some parts of autonomy.

If not autonomy, what is the thing that current ML excels at?  You might call it something like "modeling static distributions":

  • The system is trying to reproduce a probability distribution with high fidelity.  The distribution may be conditional or unconditional.
  • It learns about the distribution by seeing real-world artifacts (texts, pictures) that are (by hypothesis) samples from it.
  • The distribution is treated as fixed.
    • There may be patterns in it that correspond to variations across real-world time: GPT-3 probably knows that an article from the US dated 1956 will not make reference to "President Clinton."
    • However, this fact about the distribution is isolated from any way in which the model itself may experience the passage of time during operation.  The model is not encouraged to adapt quickly (whatever that would mean) to the type of fact that quickly goes out of date.  (I'm ignoring models that do kNN lookup on databases, as these still depend on a human to make the right updates to the database.)
  • The model operates in two modes, "training" and "inference."
    • During "training," it learns a very large amount of information about the distribution, which takes many gigabytes to represent on a computer.
    • During "inference," it either cannot learn new information, or can only do so within the limits of a "short-term memory" that holds much less information than the fixed data store produced during "training."
    • If the "short-term memory" exists, it is not persisted into the "long-term memory" of the trained weights.  Once something is removed from it, it's gone.
    • There is a way to adapt the long-term memory to new information without doing training all over again (namely "finetuning").  But this is something done to the model from outside it by humans.  The model cannot finetune itself during "inference."
    • Another way to put this is that the model is basically stateless during operation.  There's a concept of a "request," and it doesn't remember things across requests.  GPT-3 doesn't remember your earlier interactions with it; DALLE-2 doesn't remember things you previously asked it to draw.
  • The model does not have goals beyond representing the distribution faithfully.  In principle, it could develop an inner optimizer with other goals, but it is never encouraged to have any other goals.
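To make the bullets above concrete, here is a minimal Python sketch of the "static distribution" setup. Everything in it (the `StaticModel` class, the `knows` method, the token lists) is hypothetical and invented for illustration; a real model encodes a distribution in its weights rather than a word count. The only thing the sketch tries to capture is the shape of the two memories: weights written during training, and a bounded per-request context that is never written back.

```python
from collections import Counter, deque


class StaticModel:
    """Toy stand-in for a train-then-infer model (hypothetical, for illustration only)."""

    def __init__(self, context_size=3):
        self.weights = Counter()          # "long-term memory": written only by train()
        self.context_size = context_size  # capacity of the "short-term memory"

    def train(self, corpus):
        # "Training": absorb information about the distribution into the weights.
        self.weights.update(corpus)

    def knows(self, query, context=()):
        # "Inference": build a fresh, bounded context for this request only.
        # Nothing from the context is ever written back into the weights.
        window = deque(context, maxlen=self.context_size)
        return query in self.weights or query in window


model = StaticModel(context_size=3)
model.train(["war", "ongoing", "war"])
weights_before = dict(model.weights)

# Within one request, a new fact is available through the context window.
in_context = model.knows("ended", context=["the", "war", "ended"])

# On the next request the fact is gone: the model is stateless across requests.
next_request = model.knows("ended")

# Facts that scroll out of the bounded window are lost even within a request.
fell_out_of_window = model.knows("the", context=["the", "war", "ended", "at", "last"])

weights_unchanged = dict(model.weights) == weights_before
```

In this toy, the only way to make the new fact stick is for someone outside the model to call `train` again, which is the analogue of finetuning being done to the model rather than by it.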

This is a very different setting than the one an AGI would operate in.  Asking whether a model of this kind displays autonomy doesn't really make sense.  At most, we can wonder whether it has an inner optimizer with autonomy.  But that is an inefficient (and uncontrollable!) way to develop an ML model with autonomy.

What's noteworthy to me is that we've done extremely well in this setting, at the goals laid out by this setting.  Meanwhile, we have not really tried to define a setting that allows for, and encourages, autonomy.  (I'm not sure what that would entail, but I know existing setups don't do it.)

Even RL is usually not set up to allow and encourage autonomy, though it is closer than the generative or classification settings.  There is still a distinction between "training" and "inference," though we may care about learning speed in training in a way we don't in other contexts.  We generally only let the model learn long enough to reach its performance peak; we generally don't ask the model to learn things in a temporal sequence, one after the other, while retaining the earlier ones -- and certainly not while retaining the earlier ones insofar as this serves some longer-term goal.  (The model is not encouraged to have any long-term goals.)

I realize this doesn't describe the entirety of RL.  But a version of the RL field that was focused on autonomy would look very different.

 The above could affect AGI timelines in multiple ways, with divergent effects.

  • It could be the case that "cracking" autonomy requires very different methods, in such a way that further progress in the current paradigm doesn't get us any closer to autonomy.
    • This would place AGI further off in time, especially relative to an estimate based on a generalized sense of "the speed of ML progress."
  • It could be the case that autonomy is actually fairly easy to "crack," requiring only some simple tricks.
    • The lack of researcher focus on autonomy makes it more plausible that there is low-hanging fruit no one has thought to try yet.
    • I'm reminded of the way that images and text were "cracked" suddenly by ConvNets and Transformers, respectively.  Before these advances, these domains felt like deep problems and it was easy to speculate that it would take very complex methods to solve them.  In fact, only simple "tricks" and scaling were needed.  (But this may not be the right reference class.)
    • This would place AGI closer in time, relative to an estimate that assumes we will get autonomy in some more inefficient, less targeted, accidental manner, without directly encouraging  models to develop it.
13 comments

The relevant sub-field of RL interested in this calls this “lifelong learning”, though I actually prefer your framing because it makes pretty crisp what we actually want.

I also think that solving this problem is probably closer to “something like a transformer and not very far away”, considering, e.g. memorizing transformers work (

(Somewhat rambly response.) You're definitely on to something here. I've long felt that while deep learning may not be AGI, the machine learning industry sure seems 'general' in its ability to solve almost any well defined problem, and the only limitation is that the most difficult problems cannot be so easily put into a series of testcases. Once there is some metric to guide them, the ML guys can arrive at seemingly any destination in a relatively short order. But there are still some problems outside their reach where we can at least attempt to define a score:

  • Nethack and other complex games. The ideal game for testing autonomy under nostalgebraist's definition would show a lot of information, most of which is irrelevant, but at least some of which is very important, and the AI has to know which information to remember (and/or how to access relevant information e.g. via game menus or the AI's own database of memories)

  • Strong versions of the Turing Test, especially where the challenge is to maintain a stable personality over a long conversation, with a consistent (fake) personal history, likes and dislikes, things the personality knows and doesn't know, etc.

  • Autonomously create a complex thing: a long novel, a full-length movie, or a complex computer program or game. A proof of success would be getting a significant amount of attention from the public such that you could plausibly earn money using AI-generated content.

  • The Wozniak Test: make and present a cup of coffee using the tools of an unfamiliar kitchen. Given that pick-and-place is now relatively routine, this seems within reach to me, but it would require chaining together multiple pick-and-place tasks in a way that robotics AIs really struggle with today. I would not be surprised if, within the next year, there is an impressive demonstration of this kind of a robomaid task, seemingly passing the Wozniak Test, but then the robot never arrives for consumers because the robomaid only works in the demo home that they set up, or only works 20% of the time, or takes an hour to make coffee, or something.

I think there are also two notions of intelligence that need to be differentiated here. One is the intelligence that does pattern matching on what seems to be totally arbitrary data, and generating some kind of response. This kind of intelligence has an abstract quality to it. The other kind of intelligence is being able to be instructed (by humans, or an instruction manual), or to do transfer learning (understanding that technology will make your nation in Civilization VI more powerful from having trained on language data of the history of real civilization, for example).

I saw someone ask recently, "Do humans really do anything zero shot?" If we can't actually zero shot things, then we shouldn't expect an AI to be able to do so. (It might be impossible under a future Theory of Intelligence.) If we actually can zero shot anything, we must either have been instructed or were able to reason from context. By definition, you can't do supervised learning from zero examples. Humans probably can do the more abstract, pattern matching type of intelligence, but my guess is that is somewhat rare and only happens in totally new domains, and only after reasoning has failed, because it's essentially guesswork.

Thanks for this! I've had similar things on my mind and not had a good way to communicate them to the people I talk with. I think this cluster of ideas around 'autonomy' is pointing at something important. One which I'm quite glad isn't being actively explored on the forefront of ML research, actually. I do think that this is a critical piece of AGI, and that deliberate attempts at this under-explored topic would probably turn up some low-hanging fruit. I also think it would be bad if we 'stumble' onto such an agent without realizing we've done so.

 I feel like 'autonomy' is a decent but not quite right name. Exploring for succinct ways to describe the cluster of ideas, my first thoughts are 'temporally coherent consequentialism in pursuit of persistent goals', or 'long-term goal pursuit across varying tasks with online-learning of meta-strategies'?  

Anyway, I think this is exactly what we, as a society, shouldn't pursue until we've got a much better handle on AI alignment.

It might be that the 'goals' part of what nostalgebraist is waving at is separable from the temporal coherence part.

I mean this in the sense that GPT-3 doesn't have goals; obviously you can debate that one way or another, but consider all the attempts to make a long document transformer. The generative models have two kinds of memory, essentially: the weights, analogous to long term memory, and the context window, analogous to working memory. There either needs to be some kind of continuous training/fine-tuning of the weights in production, or it needs a third form of episodic memory where the AI can remember the context ("War with Eurasia ended, we have always been at war with Eastasia"). These improvements could make GPT-3 able to write a coherent novel, plausibly without making it any more agentic.
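As a rough illustration of the "third form of episodic memory" idea, here is a minimal Python sketch. The `EpisodicMemory` class, the keyword-based `recall`, and `build_prompt` are all assumptions made up for this example, not a description of any existing system; a real implementation would presumably use learned retrieval rather than substring matching.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodicMemory:
    """Append-only log of (step, text) episodes, kept outside weights and context."""
    episodes: list = field(default_factory=list)

    def remember(self, step, text):
        self.episodes.append((step, text))

    def recall(self, keyword, limit=1):
        # Naive retrieval (an assumption): substring match, newest episodes first,
        # so a later fact about the same topic wins over an earlier one.
        hits = [e for e in self.episodes if keyword.lower() in e[1].lower()]
        hits.sort(key=lambda e: e[0], reverse=True)
        return [text for _, text in hits[:limit]]


def build_prompt(memory, keyword, context_window):
    # Recalled episodes are placed in front of the ordinary working-memory context.
    return memory.recall(keyword) + list(context_window)


memory = EpisodicMemory()
memory.remember(1, "War with Eurasia is ongoing")
memory.remember(500, "War with Eurasia ended")

prompt = build_prompt(memory, "Eurasia", ["Chapter 12:"])
```

Here the generator itself is untouched: the weights stay frozen and the context window stays small, and the only new piece is a store that decides what gets put back into the context.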

ML enthusiasts sometimes refer to autonomy dismissively, as something that will be solved incidentally by scaling up current models


I'm definitely in this camp.

One common "trick" you can use to defeat current chatbots is to ask "what day is today".  Since chatbots are pretty much all using static models, they will get this wrong every time.  

But the point is, it isn't hard to make a chatbot that knows what day today is.  Nor is it hard to make a chatbot that reads the news every morning.  The hard part is making an AI that is truly intelligent.  Adding autonomy is then a trivial and obvious modification.
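For the date example specifically, the modification could be as small as the following Python sketch. The function name `with_current_date` and the prompt wording are hypothetical; the static model is assumed to simply read whatever text it is handed.

```python
from datetime import date


def with_current_date(user_message, today=None):
    # The model's weights stay static; the wrapper supplies the fact that goes stale.
    if today is None:
        today = date.today()
    return f"Today's date is {today.isoformat()}.\n{user_message}"


# A fixed date is passed here only so the example is reproducible.
example = with_current_date("What day is today?", today=date(2022, 5, 1))
```

Note that this only covers facts someone thought to inject ahead of time, which is the same limitation the post mentions for models that do lookups on a human-maintained database.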

This reminds me a bit of Scott Aaronson's post about "Toaster Enhanced Turing Machines".  It's true that there are things Turing complete languages cannot compute.  But adding these features doesn't fundamentally change the system in any significant way.


Do you generally think that people in the AI safety community should write publicly about what they think is "the missing AGI ingredient"?

It's remarkable that this post was well received on the AI Alignment Forum (18 karma points before my strong downvote).

Retain or forget information and skills over long time scales, in a way that serves its goals.  E.g. if it does forget some things, these should be things that are unusually unlikely to come in handy later.


If memory is cheap, designing it to just remember everything may be a good idea. And there may be some architectural reason why choosing to forget things is hard.

I think this is absolutely correct. GPT-3/PaLM is scary impressive, but ultimately relies on predicting missing words, and its actual memory during inference is just the words in its context! What scares me about this is that I think there are some really simple low hanging fruit to modify something like this to be, at least, slightly more like an agent. Then plugging things like this as components into existing agent frameworks, and finally, having entire research programs think about it and experiment on it. Seems like the problem would crack. You never know, but it doesn't look like we're out of ideas any time soon.

This is a question for the community: is there any information hazard in speculating on specific technologies here? It would be totally fun, though it seems like it could be dangerous...

My hope was initially that the market wasn't necessarily focused on this direction. Big tech is generally focused on predicting user behavior, which LLMs look to dominate. But then there's autonomous cars, and humanoid robots. No idea what will come of those. Thinking the car angle might be slightly safer, because of the need for transparency and explainability, a lot of the logic outside of perception might be hard coded. Humanoid robots... maybe they will take a long time to catch on, since most people are probably skeptical of them. Maybe factory automation...

My opinion is that you're not going to be able to crack the alignment problem if you have a phobia of infohazards. Essentially you need a 'Scout Mindset'. There are already smart people working hard on the problem, including in public such as on podcasts, so realistically the best (or worst) one could do on this forum is attempt to parse out what is known publicly about the scary stuff (e.g. agency) from DeepMind's papers and then figure out if there is a path forward towards alignment.

Yeah, I tend to agree. Just wanted to make sure I'm not violating norms. In that case, my specific thoughts are as follows, with a thought to implementing AI transparency at the end.

There is the observation that the transformer architecture doesn't have a hidden state like an LSTM. I thought for a while something like this was needed for intelligence, to have a compact representation of the state one is in. (My biased view, that I've updated away from, was that the weights represented HOW to think, and less about knowledge.) However, it's really intractable to back propagate over so many time steps, and transformers have shown us that you really don't need to. The long term memory is just in the weights.

So, one obvious thing is you could simply make the language model continue to update on its dialogues, including its own response and that of its conversation partner(s). Over time, it could in theory learn from its own conversations and incorporate them into its long term memory, perhaps learning that this type of dialogue involving 'itself' is special. (It can even use special tokens or something if need be.)
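A toy version of this loop, in Python, might look like the sketch below. To keep it self-contained it swaps in bigram counts for gradient updates on a language model, so it is only an analogy: the `OnlineBigramModel` class and the `<self>` / `<other>` speaker tokens are hypothetical names introduced for illustration.

```python
from collections import Counter

# Hypothetical special tokens marking who said each turn.
SELF, OTHER = "<self>", "<other>"


class OnlineBigramModel:
    """Bigram counts standing in for weights that keep updating during deployment."""

    def __init__(self):
        self.counts = Counter()

    def update(self, speaker, tokens):
        # Fold one dialogue turn, prefixed by its speaker token, into long-term memory.
        seq = [speaker] + list(tokens)
        self.counts.update(zip(seq, seq[1:]))

    def most_likely_next(self, token):
        candidates = {b: n for (a, b), n in self.counts.items() if a == token}
        return max(candidates, key=candidates.get) if candidates else None


model = OnlineBigramModel()
model.update(OTHER, ["the", "war", "ended"])               # partner's turn
model.update(SELF, ["the", "war", "ended", "yesterday"])   # the model's own reply

after_war = model.most_likely_next("war")
after_self = model.most_likely_next(SELF)
unseen = model.most_likely_next("peace")
```

The difference from the usual setup is only that `update` is called during operation, on the system's own conversations, rather than once up front by an outside process.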

As for doing variable length computation, the networks themselves do that with the number of layers they have, but that's really a fixed feature of the architecture. Instead, variable length computation would really be a data issue. Basically, having enough data that "shows your work". The model can learn to output intermediate results and use attention over those to continue its thought process. There is a small potential for baking transparency into the model this way, where you could force the model, through limited context windows, to carry its thoughts out using human language basically. Then if the model is doing a very careful analysis on whether it's ready to take over the world, it would need to go through those calculations, and you could read them.
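Here is a minimal Python sketch of the "carry your thoughts through readable text" constraint, under heavy assumptions: the "model" is a hand-written function (`think_one_step`, a hypothetical name) that can see only the most recent line of its own output, and the task is just adding up numbers. The point is the structure, where every intermediate state has to pass through a transcript that an operator could read.

```python
def think_one_step(line):
    # The only state available is the previous line of text, e.g. "total=3 remaining=4,5".
    total_part, remaining_part = line.split(" ")
    total = int(total_part.split("=")[1])
    remaining = [int(x) for x in remaining_part.split("=")[1].split(",") if x]
    total += remaining.pop(0)
    return f"total={total} remaining={','.join(map(str, remaining))}"


def run(numbers):
    # The transcript is the full, human-readable record of intermediate steps.
    transcript = [f"total=0 remaining={','.join(map(str, numbers))}"]
    while not transcript[-1].endswith("remaining="):
        transcript.append(think_one_step(transcript[-1]))
    return transcript


transcript = run([3, 4, 5])
```

The number of steps varies with the input even though each step does a fixed amount of work, and the computation cannot proceed without writing each step down.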

Of course maybe it'll learn to use code words or something, but it's a thought. Maybe you would always be able to ask it what it was thinking about there, and maybe the explanations it gives might not always be convincing.

Though, imagine that the machine starts outputting gibberish intermediate steps, and tells the human operators that thinking using human zipfian distributed words is highly inefficient, and the operators shrug their shoulders and say "Oh cool. Makes sense. Carry on." As I hear said around here, that's a way to die with less dignity.

Please message me, I believe you hit the nail on the head. I had this same vision and believe it’s the missing link.

Perhaps "agency" is a better term here? In the strict sense of an agent acting in an environment?

And yeah, it seems we have shifted focus away from that.

Thankfully, thanks to our natural play instincts, we have a wonderful collection of ready made training environments: I think the field needs a new challenge of an agent playing video games, only receiving instructions of what to do using natural language.
