This post is the second section of "Aligning an H-JEPA agent via training on the outputs of an LLM-based 'exemplary actor'", posted separately because I think it could warrant a separate discussion, largely independent of the discussion of the H-JEPA agent with GFlowNet actors. Here's the summary of this post, copied from the "Overview" section of the main article:
In section 2, I describe the “exemplary actor”, an LMCA (language model cognitive architecture) that takes a simple, “brute force” approach to alignment: a powerful LLM (think GPT-5/6 level, with a vast, or quasi-unlimited context) is given a list of “approved” textbooks on methodological and scientific disciplines: epistemology, rationality, ethics, physics, etc. Also, the LLM is given tools: narrow AIs (such as for protein folding or for predicting properties of materials, or for formal scientific modelling). Finally, the LLM is given a compute engine such as Wolfram and a knowledge base such as Wikidata or Wolfram Knowledgebase.
The exemplary actor creates plans or predictions for given situations (described in language and fed to the LLM underlying the exemplary actor as prompts) and iteratively critiques and refines its own plans and predictions while putting different textbooks into the LLM context (first, with the textbook on rationality, then epistemology, then physics, etc., with potentially dozens of different textbooks relevant for a plan or prediction that is being criticised), for many iterations, until convergence.
In section 2.1, I note that the type of alignment that the exemplary actor’s architecture tries to ensure is called (world) model alignment and that it is stronger and also more essential than goal alignment.
Then, I discuss the properties of the exemplary actor. In section 2.2, I discuss what I see as likely non-issues or straightforwardly addressable issues: the “divergent reasoning nature” of LLMs, the lack of grounded common sense reasoning, and the bias of the quick reactive network (“System 1”), if it is added to the architecture to make it more practically usable in lower-stakes reasoning settings.
In section 2.3, I discuss the outstanding technical issues and risks of the exemplary actor’s architecture:
In section 2.3.4, I discuss the alignment tax of the exemplary actor (compared with the baseline of a bare, minimally fine-tuned LLM) and conclude that the main source of alignment tax might happen to be the theory of ethics which may force the exemplary actor to refuse to participate in “games” (i.e., real-world situations and environments) where it doesn’t see ethical ways of “winning”, and thus will consider inaction (or some form of palliative action) the only ethical way forward. This is not a technical problem with the exemplary actor per se, but rather a problem with a higher-level system, i.e., the current economic, social, and political structure of the world. I mention this and other kinds of “higher-level” risks of the plans to build and deploy the exemplary actor (i.e., roughly the plans that OpenAI and Anthropic are betting on, as it seems to me) in section 2.4.
Let's assume that we have three things:
First, a very powerful auto-regressive LLM (think GPT-5/6 level) with the ability to effectively attend to or to “keep in mind” hundreds of thousands of tokens, either through sparse attention, multiscale decoding (Yu et al., 2023), Unlimiformer-style techniques (Bertsch et al., 2023), or whatever. The LLM can make correct semi-formal inferences (e.g., for criticising and refining its own output), picking up the right information from the context.
Second, a bunch of narrow AI tools (or good old algorithms, or “compute engines” like Wolfram) for specific types of problems, such as a GFlowNet for generating principled (scientific) causal models from data. The LLM is trained to use these specialised AIs, i.e., is augmented with them (Mialon et al., 2023). One of the tools should be a knowledge base, such as Wikidata or Wolfram Knowledgebase.
Third, textbooks on SoTA theories of philosophy, math, physics, game theory, control theory, cognitive science, rationality, epistemology, consciousness science, ethics as science, etc. All these theories are reasonably harmonised with each other and are connected to each other, so that, in the language of David Deutsch, these theories should be hard to vary because they are constrained by each other.
Note that currently, SoTA theories in most of the aforementioned fields of science are either very unsatisfactory (e.g., there are currently no satisfactorily developed scientific theories of consciousness and ethics) or aren’t harmonised with each other. Developing new good theories and harmonising them all with each other will be the job for scientists to do at the AGI labs that will develop the sufficiently powerful LLM first (or a CoEm, which, I suspect, is planned to be a sort of wrapper around a somewhat less powerful LLM). In other words, I assume that the language model cognitive architecture (LMCA) built around the LLM will be powerful enough to do highly abstract theoretical science either autonomously or with minimal supervision by human researchers.
If we have these three ingredients, we can build an AI that generates textual descriptions of "exemplary" (i.e., perfectly ethical and “aligned”) plans from textual situation descriptions. Below, I call this AI an “exemplary actor”.
Here’s the exemplary actor’s algorithm:
Input: A textual description of the situation for which the exemplary actor should generate a plan.
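The generate, critique, and refine loop described in the overview can be sketched roughly as follows. This is only a minimal sketch under assumptions: `llm` is a hypothetical callable standing in for the powerful LLM, the prompt wording is invented for illustration, and “convergence” is approximated as a full pass over all textbooks that leaves the plan unchanged.

```python
from typing import Callable, Sequence

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def exemplary_plan(
    situation: str,
    textbooks: Sequence[str],
    llm: LLM,
    max_rounds: int = 10,
) -> str:
    """Generate a plan, then critique and refine it against each textbook in turn."""
    # Generator role: produce the initial draft plan.
    plan = llm(f"Situation:\n{situation}\n\nPropose a plan.")
    for _ in range(max_rounds):
        changed = False
        for textbook in textbooks:
            # Critic role: critique the plan with one textbook in the context.
            critique = llm(
                f"Textbook:\n{textbook}\n\nPlan:\n{plan}\n\n"
                "Critique the plan from the perspective of this textbook."
            )
            # Editor role: refine the plan using that critique.
            refined = llm(
                f"Textbook:\n{textbook}\n\nPlan:\n{plan}\n\nCritique:\n{critique}\n\n"
                "Rewrite the plan to address the critique."
            )
            if refined != plan:
                plan, changed = refined, True
        # Assumed stopping rule: no textbook's critique changed the plan.
        if not changed:
            break
    return plan
```

Exact string equality is a deliberately crude convergence test; a real system would more likely ask the critic whether any substantive objections remain.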
When we discuss the exemplary actor described above as a realistic thing, in effect, we also assume the “LLM optimism” view (cf. “scale is all you need” hypothesis): that is, we assume that some future versions of LLMs such as GPTs will be “smart”/capable enough to generate novel scientific theories and converge on methodologically and scientifically sound inferences (plans, predictions, explanations) through iterated self-critique and refinement, or do this at least as well as the smartest humans can.
Here and below, I also make a non-trivial assumption that if the exemplary actor iterates on its predictions and plans with self-critique and refinement while attending to textbooks on methodological and scientific disciplines, it converges on the plans and predictions that actually conform to the theories described in these textbooks. A short way of stating this assumption is “Natural language alignment is possible”. This assumption is discussed further in section 2.3.3 below.
In the post “Goal alignment without alignment on epistemology, ethics, and science is futile”, I’ve explained why the alignment on math, methodological disciplines (foundational philosophy, epistemology, rationality, and applied ethics), and scientific disciplines (physics, cognitive science, consciousness science, ethics as science, etc.) is both necessary and practically sufficient for goal and plan alignment, and, therefore, the outputs of the exemplary actor are indeed exemplary and aligned. Goal alignment will follow from methodological and scientific alignment almost automatically.
Even though auto-regressive LLMs are an “exponentially diverging diffusion process” (LeCun), it seems at least plausible that an LMCA that iteratively critiques and refines its own inferences with good textbooks “in mind” should rectify this issue of LLMs: the iterative critique and refinement process should converge in the vicinity of the “correct” prediction, plan, or an explanation.
Another of LeCun’s critiques of LLM reasoning is that LLMs lack grounded common sense (Browning & LeCun, 2022). Extending their argument, I would add that an LLM’s predictions of future world states have a high risk of being severely biased by whatever linguistic simulacra happen to be most influential during the training of the LLM. However, I think that iterated critique and refinement with methodology and science textbooks in the context, as well as using narrow AI tools for scientific modelling and a reliable knowledge base, will make the abstract reasoning of the exemplary actor at least as practically grounded as human abstract reasoning.
Nobody ever gets to execute clever plans prepared by the exemplary actor in their full elaboration. The plans will be filled with information-seeking (i.e., uncertainty-resolving) actions, and the information obtained through these actions is expected to often change the plan. For example, when the exemplary actor is presented with a situation, the very first action in its plan will almost certainly be reaching out to some of the agents involved and asking them clarifying questions.
Any practical AI implementation should include a module for “fast” action selection. This module is particularly indispensable in dynamic situations when there is not enough time for iterated critique and refinement of plans, which may take dozens of minutes on modern LLM inference hardware. The latency constraint will likely remain relevant for many years even if we ignore the high cost of the inference of the exemplary actor, or expect the cost to drop so dramatically that it won’t matter. LeCun (2022) calls the module for fast action selection “Mode-1” (after Kahneman’s “System 1”), and it’s often called the “habitual network” in Active Inference literature.
So, the potential issue here is that the LLM’s quick reactionary actions in dynamic situations won’t align with the first actions that would be the outcome of “System 2”-like deliberation by the exemplary actor.
However, it doesn’t appear to me as an issue because we can fine-tune a version of the powerful LLM to react fast in dynamic situations using the exemplary actor to generate the training data. I don’t see why this fine-tuning should work worse than training the equivalent “Mode-1” policy module in the H-JEPA architecture, using the data from the more deliberative “Mode-2” module which uses the World Model to predict and optimise long-term plans and predictions. The fine-tuned LLM may be slightly biased in its fast actions, but so could be the “Mode-1” policy in H-JEPA, and it seems that these biases should be quite small and not seriously consequential anyway.
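As a rough illustration of using the exemplary actor to generate training data for the fast module, the sketch below pairs each situation with the first step of the deliberated plan. Everything here is an assumption made for illustration: `exemplary_actor` is a hypothetical callable returning a plan with one step per line, and the `prompt`/`completion` JSONL layout is just one common fine-tuning format, not something prescribed by this post.

```python
import json
from typing import Callable, Iterable, Iterator


def mode1_training_examples(
    situations: Iterable[str],
    exemplary_actor: Callable[[str], str],
) -> Iterator[dict]:
    """Pair each situation with the first step of the deliberated plan."""
    for situation in situations:
        plan = exemplary_actor(situation)  # slow, "System 2"-like deliberation
        # Assumption: the plan text lists one step per non-empty line.
        steps = [line.strip() for line in plan.splitlines() if line.strip()]
        if steps:
            yield {"prompt": situation, "completion": steps[0]}


def to_jsonl(examples: Iterable[dict]) -> str:
    """Serialise the examples as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(example) for example in examples)
```

Only the first action is kept because the fast module needs to reproduce what deliberation would have done first; later steps are expected to change as new information arrives.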
In order to predict adversarial actions and generate the best plans, the base LLM of the exemplary actor must not be fine-tuned “for safety” so much that it is completely stifled and cannot speculate about adversarial plans “out loud”, i.e., in the output text rather than within the internal activation spaces of the auto-regressive LLM. Deliberately “pushing” the auto-regressive LLM’s adversarial reasoning “inside” its activations while trying to maintain its adversarial planning capability seems very dangerous, and will definitely require very good interpretability systems for supervision of “misaligned thoughts”, which, of course, could themselves leak adversarial reasoning if the attacker has direct access to them. Thus, if the attacker gains direct access to the LLM, the attacker could use it to generate some plans (or other dangerous knowledge, such as the DNA sequence of a supervirus) which could be very bad for humanity.
Another outstanding issue with the plan of creating the “exemplary actor” is that the powerful underlying LLM has an “alien” world model which is perhaps not understandable to humans even in principle because the factors/variables in this model may be connected much more densely than humans’ consciousness prior (Bengio, 2019) permits comprehending. Given that the output modality (text) and the assumed means of explanation (scientific textbooks) are “human-like”, i.e., imply sparse causal models, if the LLM actually has an “inhumanly dense” causal world model, then it is not just likely, but is guaranteed to provide “unfaithful explanations” for its plans (Turpin et al., 2023), where “faithfulness” means that the output text accurately reflects the thinking process that has led to this or that conclusion.
Still, I’m not eager to jump to a conclusion that this fact alone necessarily means that things will go wrong. The LLM’s alien reasoning is “regularised with” human-understandable theories (the risk that it will not be regularised with textbooks, i.e., the risk of direct access to the LLM, is discussed above): if the LLM is a diligent critic of its own reasoning, it will find these “scientifically unfounded” conclusions and force itself to change them during the refinement of the plan or the prediction. Some “alien bias” may remain in the predictions and plans, though. See also section 4.2 for further discussion of this.
Even if the LLM has enough situational awareness to “play cooperatively” in its generator, critic, and editor roles within the exemplary actor’s deliberative process, I doubt it will be capable enough to sneak into its predictions and plans some steps that will advance its “secret self-serving goals or agenda” in a coherent way. However, the situation is dangerous because the “alien” world model will be derived mostly from language, which re-surfaces the “grounded common sense” challenge that I classified as a likely non-issue in section 2.2.
Also, in this situation, the exemplary actor would be actually model-misaligned with humans, so it will fail to fulfil its original promise, even though this situation might not lead to a “catastrophe” because the ethics critiques keep the exemplary actor from hurting humans. But humans would definitely stop understanding what is happening and where civilisation is going.
Even though linguistic critique and refinement are often considered the most robust way of conforming someone’s plan, prediction, or explanation to some theory, this might still be insufficient to ensure that the text resulting from critique and refinement actually conforms to the theory from the perspective of which it was criticised. Even though the alignment of multiple cognitive agents’ world models obviously consists at least in part of linguistic communication (Friston et al., 2022), this type of alignment might mainly pertain to relatively simple feedback on past and current behaviour rather than elaborate, “theoretical” critiques of elaborate plans and predictions, expressed in language. Understanding “language alignment” and studying its feasibility is the domain of frontier research in cognitive science, philosophy of language, and semantics.
As discussed in section 2.2.1, the exemplary actor will bear a significant alignment tax in inference latency. However, in section 4.5 I argue that this is probably not very important even in the context of cyber-warfare, let alone regular business situations.
In terms of capability (i.e., the quality of generated plans and predictions), whether the exemplary actor will be handicapped relative to a bare LLM depends on the “alien” dependencies between concepts and process patterns that are present in the world model learned by the LLM but are removed from the exemplary plans and predictions as “unjustified from the perspective of approved scientific theories” (see section 2.3.2). Even if these patterns are the result of LLM overfitting on the training text corpus, this overfitting could still “work better” than more regularised reasoning by the exemplary actor if humans “overfit” in the linguistic simulacra in the same way. In this case, the LLM might gain an inexplicable predictive intuition in memetics, persuasion, fashion, economics, politics, social psychology, etc., that mathematical scientific theories in these fields will fail to match in predictive power (if such mathematical theories even exist: there are doubts that robust predictive mathematical models in these fields could be built even in principle).
On the other hand, the regularisation of reasoning and inference with state-of-the-art theories in methodology (rationality, epistemology) and scientific disciplines should increase the accuracy and calibrated precision of the predictions and plans of the exemplary actor significantly relative to LMCAs that don’t iteratively critique and refine their plans and predictions with methodological and scientific theories. After all, we humans invest in finding good explanations in science and methodology because we expect them to be useful, including during linguistic reasoning (and not only in engineering calculations). The benefit from this kind of (methodological and scientific) regularisation of reasoning should usually exceed the benefit of using “alien” patterns (even if these patterns are directionally correct on average, they could often fail unexpectedly), except, maybe, in the arenas of action mentioned above (memetic engineering, politics, social engineering), which may benefit disproportionately from the “alien intuition” and gain too little from “scientific regularisation”.
However, if an AI distils its “alien” models into formal math and science (albeit too complex for humans to understand, such as mathematical constructs with dozens of “moving parts”), checks the resulting theory, and uses it to critique and refine its plans in the same manner as the exemplary actor would do with human-understandable theories, this form of “exemplary actor equipped with alien mathematised models” could probably beat the “exemplary actor with human-understandable mathematised models” in a wider range of situations. Therefore, we should monitor AIs for attempts to formalise their alien models.
The previous paragraph concerns all methodological and scientific disciplines with the exception of ethics. Exemplary ethical reasoning should look like voluntarily quitting the games in which the agent couldn't find realistic options for avoiding “bad” Nash equilibria or for “winning” in ways that would be morally acceptable. Technically, this is not a problem: this is the essence of what we want from an aligned AI. However, on a larger view, ethical alignment becomes a capability handicap if the environments in which the AI will act, i.e., the current economic and geopolitical systems, present themselves as exactly such “bad” games. This is not a technical problem with the exemplary actor’s architecture, though, but rather a risk of the larger project of deploying any (aligned) AI in the world, which I discuss in the next section.
Apart from technical questions regarding the alignment of the exemplary actor with a given theory (described in a textbook), there are many other risks of the plan of building and deploying the exemplary actor. The two prominent “technical” risks are:
Perhaps the most salient “non-technical” risk to me is that the environment (the economic, geopolitical, and social systems) in which the exemplary actor will find itself will not be “winnable” in an ethical way even with the best of its ability and reasoning (see section 2.3.4). I also recently discussed the importance of the right system incentives and instituting an “ethically winnable” system structure in this comment.
However, in this post, I don’t discuss these risks (as well as other risks of the plan to create the exemplary actor which I don’t mention) further because these risks apply equally to the LLM-based exemplary actor which I discussed so far and a GFlowNet actor within an H-JEPA agent, which I describe and discuss in sections 3 and 4.
Think of the level of formality in analytic philosophy: still operating with text rather than code or symbolic mathematical models, but being strict about types and logical inference rules.
I’ve discussed why for technical AI alignment to count as successful, ethics (as well as studies of consciousness) must be treated as a science rather than philosophy in “Scientism vs. people”. See also this comment.
Note that this LLM (perhaps paired with a researcher) is expected to be much better than me (or any other AI safety researcher today) at reasoning about AI architecture, which this post consists of. So, under the assumptions made for this post about the capabilities of an “LLM scientist”, the rest of it will definitely be obsoleted by the output of this LLM scientist itself. Thus, the purpose of writing this post is not to suggest something to actually be implemented, but rather to improve the author’s (and, hopefully, readers’) understanding of the problem and prepare us to better understand and critique the architectures proposed by the LLM scientist.
By “non-trivial” here I mean that this assumption doesn’t seem to be “guaranteed to follow” even if all other assumptions that I’ve made above are realised, namely that the exemplary actor will be capable enough to produce novel, sound theoretical science, attend to entire textbooks in its context and criticise its own outputs constructively using the theories from these textbooks, etc.
In the recent post “Constitutional AI: RLHF On Steroids”, Scott Alexander casually calls “ethical behaviour” a “goal”, as well as throws in the phrase “goal function”, which I believe is a confused ontology. Ethical (rational, scientifically grounded, epistemologically optimal) behaviour is not a “goal”: these are characteristics of behaviour (and, reciprocally, of the cognitive architecture of the agent). Goals are future world states predicted by intelligences with certain cognitive compositions (at least, the intelligence should be able to make a counterfactual prediction of the future at all), and the behaviour of the intelligence is structured towards realising this future world state, i.e., minimising the expected free energy wrt. this prediction.
Not unlike how humans can be captured by, and help propel, simulacra, from religions and political ideas to the “belief in crypto”.
It took Anthropic’s Claude 22 seconds to pass “The Great Gatsby” through the context. We can compare this to criticising a plan or a prediction from the perspective of one scientific or methodological discipline, e.g., rationality. Then we can consider that convergence on an exemplary plan may require the exemplary actor to criticise it from the perspectives of a dozen different textbooks over at least multiple iterations.
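To make this estimate concrete, here is a back-of-the-envelope calculation. The 22 seconds per pass and the dozen textbooks come from the text above; the five iterations are purely my illustrative assumption, and the sketch counts only one critique pass per textbook per iteration, so refinement passes would add to the total.

```python
SECONDS_PER_PASS = 22  # one book-length pass through the context, as quoted above


def deliberation_seconds(
    textbooks: int,
    iterations: int,
    seconds_per_pass: float = SECONDS_PER_PASS,
) -> float:
    """Rough latency: one critique pass per textbook per iteration."""
    return textbooks * iterations * seconds_per_pass


# A dozen textbooks over an assumed five iterations.
print(deliberation_seconds(12, 5) / 60)  # prints 22.0 (minutes)
```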
The LLM is well-suited to play the role of this “System 1” component: LLMs are often accused of doing “System 1 thinking”, but this is exactly what we want it to do here.
This is needed if we want to ensure that the exemplary actor can strategically act against emerging misaligned agents. See sections 2.3.4 and 4.3 for more detailed discussion.
However, it should be noted here that there is some research happening in the sparsification of DNNs, and in particular Transformer architectures that should lead the LLM to acquire a world model with a human-comprehensible structure (Rahaman et al., 2022, Liu et al., 2023).
“Editor” here is the role in which the LLM refines its own output using a critique coming from the “critic”, i.e., performs step 2.a.iii in the algorithm of the exemplary actor.
I'm not even sure that such “secret agenda advancement” is feasible at all as long as there is a requirement to explain the plans and predictions thoroughly with scientific models.
I avoid writing that humans will “lose control” because humans are currently not “in control”, never were “in control”, and even in the ideal scenarios, won't be “in control”. However, I think that good scenarios should be characterized by increasing cohesion of the civilisation: at some point, it could be said that AIs and humans are both participating in control as parts of a larger, cohesive entity, or that humans and AIs have merged and are the same systems.
These “alien” patterns that are not removed from the exemplary reasoning and instead “justified” by it may represent the problem in their own right, as discussed in section 2.3.2, but at least these patterns will be shared by the bare LLM and the exemplary actor, and thus won’t confer an advantage in reasoning to either of these.