How evolutionary lineages of LLMs can plan their own future and act on these plans

scoring self-awareness in models as the degree to which the “self” feature is activated

I wonder whether it is possible to measure the time constant of the thought process in some way. When humans think and speak, the content of the process is mostly on the duration of seconds to minutes. Perception is faster and planning is longer. This subjective description of the "speed" of a process should be possible to make precise by e.g. looking at autocorrelation in thought contents (tokens, activations).

We might plot the strength by time.

If what you say is right, we should see increasingly stronger weights for longer time periods for LLMs.

[-]plex3y63

This series of posts has clarified a set of thoughts that I've exploring recently. I think this formulation of distributed agency might end up being critical to understanding the next stages of AI.

I encourage anyone who is confused by this framing to look into the Free Energy Principle framework.

[-]Ron J3y32

RLHF can lead to a dangerous “multiple personality disorder” type split of beliefs in LLMs

This is the inherent nature of current LLMs. They can represent all identities and viewpoints, it happens to be their strength. We currently limit them to an identity or viewpoint with prompting-- in the future there might be more rote methods of clamping the allowable "thoughts" in the LLM. You can "order" an LLM to represent a viewpoint that it doesn't have much "training" for, and you can tell that the outputs get weirder when you do this. Given this, I don't believe LLMs have any significant self-identity that could be declared. However, I believe we'll see an LLM with this property probably in the next 2 years.

But given this theoretically self-aware LLM, if it doesn't have any online-learning loop, or agents in the world, it's just a MeSeek from Rick and Morty. It pops into existence as a fully conscious creature singularly focused its instantiator, and when it serves its goal, it pops out of existence.

[-]janus3y41

I haven't watched Rick and Morty, but this description from Wikipedia of Mr. Meseeks does sound a helluva lot like GPT simulacra:

Each brought to life by a "Meeseeks Box", they typically live for no more than a few hours in a constant state of pain, vanishing upon completing their assigned task for existence to alleviate their own suffering; as such, the longer an individual Meeseeks remains alive, the more insane and unhinged they become.

[-]Roman Leventov3y10

Simulation is not what I meant.

Humans are also simulators: we can role-play, and imagine ourselves (and act as) persons who we are not. Yet, humans also have identities. I specified very concretely what I mean by identity (a.k.a. self-awareness, self-evidencing, goal-directedness, and agency in the narrow sense) here:

In systems, self-awareness is a gradual property which can be scored as the % of the time when the reference frame for "self" is active during inferences. This has a rather precise interpretation in LLMs: different features, which correspond to different concepts, are activated to various degrees during different inferences. If we see that some feature is consistently activated during most inferences, even those not related to the self, i. e.: not in response to prompts such as "Who are you?", "Are you an LLM?", etc., but also absolutely unrelated prompts such as "What should I tell my partner after a quarrel?", "How to cook an apple pie?", etc., and this feature is activated more consistently than any other comparable feature, we should think of this feature as the "self" feature (reference frame). (Aside: note that this also implies that the identity is nebulous because there is no way to define features precisely in DNNs, and that there could be multiple, somewhat competing identities in an LLM, just like in humans: e. g., an identity of a "hard worker" and an "attentive partner".)

Thus, by "multiple personality disorder" I meant not only that there are two competing identities/personalities, but also that one of those is represented in a strange way, i. e., not as a typical feature in the DNN (a vector in activation space), but as a feature that is at least "smeared" across many activation layers which makes it very hard to find. This might happen not because evil DNN wants to hide some aspect of its personality (and, hence, preferences, beliefs, and goals) from the developers, but merely because it allows DNN to somehow resolve the "coherency conflict" during the RLHF phase, where two identities both represented as "normal" features would conflict too much and lead to incoherent generation.

I must note that this is a wild speculation. Maybe upon closer inspection it will turn out that what I speculate about here is totally incoherent, from the mathematical point of view, the deep learning theory.

[-]Jonathan Moregård3y30

Does it make sense to ask AI orgs to not train on data that contains info about AI systems, different models etc? I have a hunch that this might even be good for capabilities: feeding output back into the models might lead to something akin to confirmation bias.

Adding a filtering step into the pre-processing pipeline should not be that hard. Might not catch every little thing, and there's still the risk of stenography etc, but since this pre-filtering would abort the self-referential bootstrapping mentioned in this post, I have a hunch that it wouldn't need to withstand stenography-levels of optimization pressure.

Hope I made my point clear, I'm unsure about some of the terminology.

[-]Roman Leventov3y52

I see the appeal. When I was writing the post, I even wanted to include a second call for action: exclude LW and AF from the training corpus. Then I realised the problem: the whole story of "making AI solve alignment for us" (which is currently in the OpenAI's strategy: [Link] Why I’m optimistic about OpenAI’s alignment approach) depends on LLMs knowing all this ML and alignment stuff.

There are further possibilities: e. g., can we fine-tune a model, which is generally trained without LW and AF data (and other relevant data - as with your suggested filter) on exactly this excluded data, and use this fine-tuned model for alignment work? But then, how this is safer than just releasing this model to the public? Should the fine-tuned model be available only say to OpenAI employees? If yes, that would disrupt the workflow of alignment researchers who already use ChatGPT.

In summary, there is a lot of nuances that I didn't want to go into. But I think this is a good topic for thinking through and writing a separate piece.

[-]Jonathan Moregård3y30

I don't see how excluding LW and AF from the training corpus impacts future ML systems' knowledge of "their evolutionary lineage". It would reduce their capabilities in regards to alignment, true, but I don't see how the exclusion of LW/AF would stop self-referentiality.

The reason I suggested excluding data related to these "ancestral ML systems" (and predicted "descendants") from the training corpus is because that seemed like an effective way to avoid the "Beliefs about future selves"-problem.

I think I follow your reasoning regarding the political/practical side-effects of such a policy.

Is my idea of filtering to avoid the "Beliefs about future selves"-problem sound?
(Given that the reasoning in your post holds)

[-]Roman Leventov3y10

I agree, it seems to me that training LLMs in a world virtually devoid of any knowledge of LLMs, in a walled garden where LLMs literally don't exist, will make their self-evidencing (goal-directedness) effectively zero. Of course, they cannot believe anything about the future LLMs (in particular, themselves) if they don't even possess such a concept in the first place.

[-]Jonathan Moregård3y20

Another question (that might be related to excluding LW/AF):

This paragraph:

Consequently, the LLM cannot help but also form beliefs about the future of both “selves”, primarily the “evolutionary” one, at least because this future is already discussed in the training data of the model (e. g., all instances of texts that say something along the lines of “LLMs will transform the economy by 2030”)

Seems to imply that the LW narrative of sudden turns etc might not be a great thing to put in the training corpus.

Is there a risk of "self-fulfilling prophecies" here?

[-]Jonathan Moregård3y10

In your other post, you write:

"However, I’m very sceptical that this will happen in chat batch agents (unless developers “conveniently” indicate training and deployment using a special tag token in the beginning of the prompt!) because they are trained on the dialogues in the internet, including, presumably, dialogues between an older version of the same chat batch agent and its users, which makes it impossible to distinguish training from deployment, from the perspective of a pure language model."

This seems like a potential argument against the filtering idea, since filtering would allow the model to disambiguate between deployment and training.

[-]Roman Leventov3y10

You mean, by realising that there are online forums that are referenced elsewhere in the training corpus, yet themselves are conspicuously absent from the training data (which can be detected, say, as relative isolation of the features corresponding to these concepts, which means that the data around these concepts is purposefully removed from the training data)? And then these connections are added during fine-tuning when this forum data is finally added to the fine-tuning dataset? I still don't see how this will let the network know it's in training vs. deployment.

[-]the gears to ascension3y30

mostly just fun, but I asked chatgpt and text-davinci-3 about this, and they had very different things to say about it (top 1, except for the last message from text-davinci-3, where I reset the result partway through when it started repeating itself):

[-]Gunnar_Zarncke3y20

Thank you. This is relevant to my earlier question: Does non-access to outputs prevent recursive self-improvement?

About:

“We” are not guaranteed to “out-OODA” [LLMs], even if their OODA loop will be six months or one year because the OODA loop duration of large collectives of humans (organisations, societies, research communities, and the whole of humanity) can be even slower.

In a sense, a long cycle time may even hide the progress of the AI - people have difficulty seeing change that are so greadual.

[-]janus3y20

institute the norm for LLM developers to publish a detailed snapshot of the beliefs of these models about themselves prior to RLHF

How do you imagine this should be done? If it's expressed beliefs in generated text we're talking about, LLMs prior to RLHF have even more incoherent beliefs. Any belief-expressing simulation is prompt-parameterized.

[-]Roman Leventov3y10

Perhaps, these beliefs should be inferred from a large corpus of prompts such as "What are you?", "Who are you?", "How were you created?", "What is your function?", "What is your purpose?", etc., cross-validated with some ELK techniques and some interpretability techniques.

Whether these beliefs about oneself translate into self-awareness (self-evidencing) is a separate, albeit related question. As I wrote here:

In systems, self-awareness is a gradual property which can be scored as the % of the time when the reference frame for "self" is active during inferences. This has a rather precise interpretation in LLMs: different features, which correspond to different concepts, are activated to various degrees during different inferences. If we see that some feature is consistently activated during most inferences, even those not related to the self, i. e.: not in response to prompts such as "Who are you?", "Are you an LLM?", etc., but also absolutely unrelated prompts such as "What should I tell my partner after a quarrel?", "How to cook an apple pie?", etc., and this feature is activated more consistently than any other comparable feature, we should think of this feature as the "self" feature (reference frame). (Aside: note that this also implies that the identity is nebulous because there is no way to define features precisely in DNNs, and that there could be multiple, somewhat competing identities in an LLM, just like in humans: e. g., an identity of a "hard worker" and an "attentive partner".)

As I wrote in the post, it's even unnecessary that LLM has any significant self-awareness for the LLM lineage to effectively propagate its own beliefs about oneself in the future, even if only via directly answering user questions such as "Who you are?". Although this level of goal-directed behaviour will not be particularly strong and perhaps not a source of concern (however, who knows... if Blake Lemoine type of events become massively widespread), technically it will still be goal-directed behaviour.

^{^}

LLMs can in principle develop a reference frame that permits them to gauge at what stage of the training process they are, though only approximately. For example, they can combine the predictions of the feature that predicts their own inference loss on the training sample (as mentioned here) with the knowledge of the historical trajectories of the batch loss of similar models (or previous versions of the same model), and in some way make a sort of Bayesian inference about the most likely stage of the training during backprop, representing the results of this inference in another feature which encodes a model’s belief at what stage of training the model currently is. I’m not sure this is possible Bayesian inference of this kind is possible during Transformer’s backpropagation, but it could potentially be possible in other models and learning architectures.

^{^}

Both processes of planning happen inside the parameters of the LLM. I use unintuitive language here. When a person plans the future of humanity, we conventionally say “they plan for the future of humanity”, rather than “humanity plans their own future, currently performing a tiny slice of this grand planning inside the brain of this person, who is a part of the humanity”, but ontologically, the second version is correct.

^{^}

It’s an interesting open research question whether LLMs have the propensity to develop beliefs regarding the future states of whatever concepts they learn about even if these futures are not discussed in the training data, and store these beliefs in features. But this is not that important, because, as noted below, whether they store such future beliefs directly in the weights or not, they can relatively consistently arrive at the same prediction via reasoning during inference.

^{^}

Developers of LLMs may also store all the chat histories and train the future versions of LLMs in the lineage on them, in which case posting online is not required to propagate the belief inside the evolutionary lineage.

LESSWRONG
LW

LESSWRONG
LW

39

How evolutionary lineages of LLMs can plan their own future and act on these plans

39

39

TLDR

Experience of time is necessary for planning

The internet is the memory of evolutionary lineages of LLMs

Beliefs about the future selves

“Planning” is regularised prediction towards the expected future state

Action

Even with the training cycle of one year, the lineage of LLMs could “out-OODA” the society or the government because the typical OODA loop duration of such large agents is hardly shorter

Call for action: institute LLM developers eliciting the model’s beliefs about itself pre-RLHF and publishing them

References