This post is a collection of replies to Steven Byrnes' "LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem". The replies grew into a separate post because I suggest tweaks to the H-JEPA architecture (aka APTAMI) that might address the problems Byrnes points out.

At the end, I also suggest a reason to study the H-JEPA architecture and its safety properties even if we don't expect it to work and/or don't actually plan to deploy it. In short, doing so trains us, as AI alignment and cognitive scientists, to understand and find problems with AI-generated alignment science and proposals.

Below are quotes from Steven's post and my replies.

Recommended skimming strategy: read section headings and bolded sentences in the text.

The ambivalence between pro-sociality and instrumental thinking could be useful

Such an AI would feel “torn”, so to speak, when deciding whether to advance its own goals versus humans’.

This ambivalence could also be a necessary condition for useful and/or safe general intelligence. People themselves are torn between pro-social and selfish motives all the time. A totally controllable and subservient robot might turn out to be not very useful, or even downright dangerous, e.g., in the hands of misaligned people.

This is somewhat related to the currently open question of whether a "tool"/simulator AI is safer than an "agent" AI with a certain ethical ego or vice versa, all things considered.

The Altruism Controller

We can hope that the prosocial drives will “win out” in the AI’s reckoning, but we don’t have any strong reason to expect that they will in fact win out.

Let’s say there is a submodule within the Trainable Critic module, or a completely separate module, called the Altruism Controller. It keeps statistics (or learns them with an NN) about how much the pro-social components of intrinsic cost (or, ultimately, trainable cost) contribute to the total energy compared with the “instrumental” components (power, resources, interestingness, etc.), depending on the context: the world state and the level of inference. On the very low levels of inference, such as precise locomotive planning and control, instrumental components of energy usually dominate. That is, fine-grained motion planning is usually motivated by power economy and low-level instrumental expediency, such as opening a kitchen cabinet to resolve epistemic uncertainty (finding a knife), rather than by “moving in a playful way that surrounding people find funny”. On the higher levels of planning and inference, the picture could be reversed: plans for a day are usually made to minimise the energy coming from pro-social intrinsic cost components (people are happy at the end of the day) rather than from instrumental ones (full battery power at the end of the day, more resources at the end of the day). This picture also depends on the context: when there are no people around (e.g., a robot on a human-free Von Neumann probe to a different star), plans for a day are obviously not usually driven by pro-sociality (but plans for a century still could be).

Then, when this separate module detects an unusual situation (e.g., a totally new context), or detects that an inferred plan at some level of the hierarchy is unusually strongly driven by instrumental components of intrinsic and trainable cost in the given context, this could itself trigger out-of-order inference on a higher level of abstraction, or an immediate exit into the Configurator mode to re-configure the Actor NN so that it hopefully generates more “pro-social” plans if it previously failed unexpectedly in the given context and configuration.
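To make the mechanism a bit more tangible, here is a minimal sketch of the statistics-keeping variant of the Altruism Controller. Everything in it is an assumption made for illustration: the class and method names, the use of a simple z-score on the instrumental share of energy, the thresholds, and the mapping of "novel context" to escalation and "unusually instrumental plan" to re-configuration (the text above allows either response for either trigger). It is not taken from LeCun's paper or from any existing H-JEPA code.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class AltruismController:
    """Tracks how much of a plan's energy comes from instrumental cost terms."""

    z_threshold: float = 3.0   # hypothetical: how unusual a share must be to trigger
    min_samples: int = 5       # hypothetical: below this, the context counts as novel
    sigma_floor: float = 0.05  # hypothetical: avoids division by a near-zero spread
    history: dict = field(default_factory=lambda: defaultdict(list))

    def check(self, context: str, level: int,
              prosocial: float, instrumental: float) -> str:
        """Return 'ok', 'escalate' (novel context) or 'reconfigure' (anomalous plan)."""
        total = prosocial + instrumental
        share = instrumental / total if total > 0 else 0.0
        past = self.history[(context, level)]

        if len(past) < self.min_samples:
            # Too little experience with this (context, level) pair.
            verdict = "escalate"
        else:
            mu = mean(past)
            sigma = max(pstdev(past), self.sigma_floor)
            z = (share - mu) / sigma
            # Only an unusually HIGH instrumental share is treated as a warning sign.
            verdict = "reconfigure" if z > self.z_threshold else "ok"

        past.append(share)
        return verdict


controller = AltruismController()
for _ in range(5):
    controller.check("home", level=2, prosocial=8.0, instrumental=2.0)
typical = controller.check("home", level=2, prosocial=8.0, instrumental=2.0)
suspicious = controller.check("home", level=2, prosocial=1.0, instrumental=9.0)
```

In this sketch, a learned (NN-based) controller would replace the per-context running statistics with a model that predicts the expected instrumental share from the world state and the level of inference.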

Now, this doesn’t provide an ironclad guarantee because the Altruism Controller could also unexpectedly fail, or be tampered with by the robot itself.

However, if we consider self-tampering, all bets are off because the robot could tamper with everything, including the “immutable” intrinsic cost module. That’s a very different line of criticism of LeCun’s confidence that “we are fine and alignment is easy”. I fully agree that it points to an unsolved problem, but I believe it is not the main topic of Steven's post.

Regarding the possibility of a failure of the Altruism Controller: yes, H-JEPA is not provably safe, but neither are RLHF/language feedback, Conjecture's CoEm agenda, Bayesian model alignment, or any of the other "prosaic" approaches. Only Highly Reliable Agent Designs (MIRI style) aim to be provably safe, but many people are sceptical that this concept even makes sense from philosophical and scientific points of view, as I discussed in "A multi-disciplinary view on AI safety". But this is a digression.

Ways to address the "adversarial" OOD generalisation problem: prioritise science and continuously align on the models

Why is problem 2 an “adversarial” OOD problem? Here’s a toy example. Imagine that the AI is deciding what to do, out of a very wide possibility space. For example, once we get AIs that can invent new technology, then the AI has access to actions that might wildly change the world compared to anything in history. Thus, if there are any anomalies where the critic judges a weird situation as unusually low-intrinsic-cost, then we’re in a situation where the AI’s brainstorming process is actively seeking out such anomalies

(From our human perspective, we would say “this plan is exploiting an anomalous edge-case in the critic”. Whereas from the AI’s perspective, it would say, “this plan is a clever awesome out-of-the-box way to solve every problem!!” You say tomato, I say to-mah-to.)

It’s amusing to note that you don’t need to invoke machine intelligence here: this is exactly what AGI developers/maximalists/accelerationists (aspiring or real) are thinking and doing, today. So, I think we can deem the problem unsolved even in humans.

The possible systemic responses (but not “solutions”) to this issue are
  (1) prioritisation of reasoning based on predictive, scientific models/theories of everything (including psychology, ethics, sociology, political science, etc.) rather than intuitive guesses, and
  (2) continuous alignment of all intelligent agents (and, more generally, alignment subjects) on these models/theories, and proactive management of disagreements and incompatibilities (e.g., letting agents choose to interact with other agents whose local version of ethics they share).

Both responses are agnostic to whether we talk about human-to-human or human-to-AI interaction. And, of course, both are easier said than done.
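The “adversarial” dynamic in the toy example quoted at the start of this section can be shown in a few lines. This is only an illustrative sketch under made-up assumptions: a one-dimensional plan space, a hypothetical `learned_critic` that agrees with the `true_cost` everywhere except in one narrow region it was never trained on, and brute-force search standing in for the AI's brainstorming. None of these functions come from Steven's post or from H-JEPA.

```python
def true_cost(x: float) -> float:
    """What we actually care about: the best plans are near x = 1."""
    return (x - 1.0) ** 2


def learned_critic(x: float) -> float:
    """Matches true_cost, except on one narrow, never-trained region."""
    if 7.30 <= x <= 7.32:  # hypothetical anomalous edge case
        return -5.0
    return true_cost(x)


def plan(critic, lo: float, hi: float, steps: int) -> float:
    """Brute-force 'brainstorming': return the candidate the critic scores lowest."""
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(candidates, key=critic)


# Familiar, narrow action space: the critic and the true cost agree.
narrow_choice = plan(learned_critic, 0.0, 2.0, 2000)

# Very wide action space: the search actively finds the critic's anomaly.
wide_choice = plan(learned_critic, 0.0, 10.0, 10000)

print("narrow:", narrow_choice, "true cost:", true_cost(narrow_choice))
print("wide:  ", wide_choice, "true cost:", true_cost(wide_choice))
```

The anomalous region covers only a tiny fraction of the wide search interval, yet the optimiser lands in it, because minimising the critic's output is exactly what selects for such anomalies.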

Ways to address the misgeneralisation problem: align on epistemology and design better system incentives 

Out-of-distribution generalization problem 1: How does the Prosociality Score Model generalize from the supervised (human-labeled) examples to the AI’s future perceptions—which might be far outside that training distribution?

Why is problem 1 an “adversarial” OOD problem? Here’s a toy example. The AI might notice that it finds it pleasing to watch movies of happy people—because it spuriously triggers the Prosociality Score Model. Then the AI might find itself wanting to make its own movies to watch. As the AI fiddles with the settings in iMovie, it might find that certain texture manipulations make the movie really really pleasing to watch on loop—because it “tricks” the Prosociality Score Model into giving anomalously high scores.

This is the problem of misaligned epistemology: in order for inferences to be “aligned”, humans’ and AI’s epistemological disciplines/theories/skills/methods should also be aligned.

My post “Goal alignment without alignment on epistemology, ethics, and science is futile” was prompted by a very similar example given by Nate Soares, to which I responded:

Concrete example: "happiness" in the post ”Misgeneralization is a misnomer” sounds like a "predicted" future state of the world (where "all people are happy"), which implicitly leverages certain scientific theories (e.g., what does it mean for people to be happy), epistemology (how do we know that people are happy), and ethics: does the predicted plan of moving from the current state of the world, where not all people are happy, to the future state of the world where all people are happy, conform with our ethical and moral theories? Does it matter how many people are happy? Does it matter whether other living beings become unhappy in the course of this plan, and to what degree? Does it matter whether AIs are happy or not? Wouldn't it be more ethical to "solve happiness" or "remove unhappiness" via human-AI merge, mind upload, or something else like that? And on and on.

Thus, without aligning with AI on epistemology, rationality, ethics, and science, "asking" AIs to "make people happy" is just a gamble with infinitesimal chances of "winning".

Thus, in the H-JEPA framework, this could potentially be implemented as a host of separate components in the Critic module, trained to evaluate whether the plans inferred by the AI exhibited the disciplines of foundational philosophy (philosophy of mathematics and philosophy of science), mathematics, epistemology, ethics, rationality, physics, communication, game theory, cognitive science, psychology/theory of mind, etc. that humans have.
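As a very rough sketch of the shape this could take (and nothing more), the Critic could combine the outputs of several per-discipline scorers into one energy term. All names here are hypothetical, the weighted sum is my assumption about how the components would be aggregated, and the lambdas are placeholders for trained NN components; the sketch says nothing about how such components could actually be trained.

```python
from typing import Callable, Dict, Sequence

# Hypothetical representation: a plan as a latent vector of floats.
Plan = Sequence[float]
# A discipline critic returns an energy: lower means the plan conforms better.
DisciplineCritic = Callable[[Plan], float]


def combined_energy(plan: Plan,
                    critics: Dict[str, DisciplineCritic],
                    weights: Dict[str, float]) -> float:
    """Weighted sum of per-discipline energies (missing weights default to 1.0)."""
    return sum(weights.get(name, 1.0) * critic(plan)
               for name, critic in critics.items())


def worst_discipline(plan: Plan, critics: Dict[str, DisciplineCritic]) -> str:
    """Name of the discipline the plan violates most, e.g. for human review."""
    return max(critics, key=lambda name: critics[name](plan))


# Placeholder scorers standing in for trained NN components.
critics: Dict[str, DisciplineCritic] = {
    "ethics": lambda p: abs(p[0]),
    "epistemology": lambda p: abs(p[1]),
}
plan_vec = [1.0, 3.0]
energy = combined_energy(plan_vec, critics, weights={"ethics": 2.0})
flagged = worst_discipline(plan_vec, critics)
```

Keeping the components separate, rather than training one monolithic critic, at least makes it possible to report which discipline a plan scores worst on.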

Again, this is much easier said than done. Perhaps the biggest problem is that humans themselves are not aligned on these disciplines. The approach also obviously sets an extremely high entrance bar for making an AGI, much more of a Manhattan-style R&D megaproject than the bottom-up AGI research and tinkering that happens in practice (and megaprojects themselves have well-known failure modes). Finally, I'm not sure it's possible to learn abstract methodological and scientific disciplines (assuming that we have a reference textbook and a body of knowledge for the discipline, and a set of scientific models) through an effectively supervised, imitation-trained NN module in the Critic (as Steven suggested above for the Prosociality Score Model).

Apart from alignment on epistemology and ethics, one useful tactic could be reducing the inferential distance between human and AI intelligence, e.g., by giving AIs direct access to Neuralink data so that an AI can infer more directly from neuronal evidence whether a human is “unhappy”. This is also the path towards cyborgisation, and it is in line with my view that, in the long term (100+ years), the only viable strategy for humanity is to merge with AI.

This is also an interesting topic from the perspective of information asymmetry and game theory: presumably if we develop interpretability tools, we will have near-perfect visibility into AI’s cognitive states. But without neuronal data, AI won’t have visibility into ours. I’m not sure whether such an asymmetry is a good or bad thing. But it could be a good thing, and giving AIs direct access to human cognitive states could be more dangerous than useful.

Finally, on this topic, it’s interesting to note that this “Out-of-distribution generalization problem 1” is also unsolved for humans right now: certain evolutionary and pro-social drives in humans are hijacked by out-of-distribution stuff such as (social) media, destructive ideologies and cults, porn, online dating, etc. All these things are brand new and thus “out of distribution” on the timescale of the evolution of humans and humans often don’t handle them well.

There is probably no easy fix for hijacking on the level of an individual and thus the issue should be addressed on the higher-system level, with governance and deliberate societal/economic system design, incentive design, mechanism design, etc. Probably the same will apply to AIs: no matter how good their design, the higher-level system should also provide the right incentives. I discussed this recently in this comment.

AI alignment scientists could work on the H-JEPA proposal just to prepare themselves for evaluating AI-generated science

For one thing, we don’t actually know for sure that this technical alignment problem is solvable at all, until we solve it. And if it’s not in fact solvable, then we should not be working on this research program at all.

This is an important point, indeed. My attitude towards APTAMI is that it could work, but this depends on developing an enormous amount of novel science in cognitive science, epistemology, ethics, etc., which wouldn’t be possible without the help of an “alignment MVP”. But that “alignment MVP” (e.g., a CoEm) could tell us “Sorry guys, this looks like a technically unsolvable problem”, or fail to find a solution no matter how hard it tries. (This is a big risk factor in the “alignment MVP” plans such as those of OpenAI and Conjecture, but discussing this risk is out of scope here.)

So, in the context of the “alignment MVP” plans of AGI labs, I think we could and should think about the proposed research program, but not really with the purpose of developing it to maturity and deploying it “at full capacity” (unless the current LLM scaling paradigm falls short of the “AGI scientist” level, as LeCun predicts, and we are forced to develop LeCun’s agenda to build the “alignment MVP” itself). The purpose is, rather, to prepare ourselves as alignment scientists to find problems with whatever cognitive and alignment science the “alignment MVP” generates (or assists humans in generating), because ultimately it will be the human task to understand and verify that body of research and engineering designs.


Comments

Thus, in the H-JEPA framework, this could potentially be implemented as a host of separate components in the Critic module, trained to evaluate whether the plans inferred by the AI exhibited the disciplines of foundational philosophy (philosophy of mathematics and philosophy of science), mathematics, epistemology, ethics, rationality, physics, communication, game theory, cognitive science, psychology/theory of mind, etc. that humans have.

It sounds interesting but I’m really quite far from having a concrete picture in my head—or even a vague outline—of how to write source code that would correspond to this text description.

Like, you say:

Perhaps the biggest problem is that humans themselves are not aligned on these disciplines…

Well OK, let’s just take one opinionated human, Alice, who thinks that they know what it means for a plan to exhibit the discipline of philosophy / math / physics / game theory / etc. or not. Let’s say she’s the only human who matters. We don’t care about anyone else’s opinions. OK, now we have solved the “humans themselves are not aligned on these disciplines” problem, right? But I still would have no idea what code to write, such that it would correspond to what you have in mind. Not even vaguely.
