Red-teaming AI-safety concepts that rely on science metaphors

catubc

TLDR: I take two common concepts in AI alignment, inner vs. outer alignment and ontology identification, and argue that their analogies to empirical processes are at best unclear and at worst suggest the concepts are trivially wrong or not useful. I suggest that analogies and metaphors based on empirical science can often fail in trivial ways in conceptual AI-safety research.

Background: After a fun night playing the Alignment Game Tree, we did not solve AI alignment. Too bad for us and the rest of the world. For those who have not tried it, it's a blue-team (come up with solutions) vs. red-team (come up with criticisms) type of scenario. The group I was with came up with fun and creative approaches to solving the problem. However, we did not get even close to discussing some of the hot topics/buzzwords in AI safety. This makes some sense as we were on limited time, but I was a bit disappointed that we did not get to think about (or even red-team) some of the ideas or solutions that more established AI researchers have come up with. Rather than spamming the meeting group, I decided to write this short post with red-team suggestions for two concepts that are present in the AI-alignment literature: inner vs. outer alignment and ontology identification.

Red-teaming: inner vs. outer alignment. These two terms pervade many community posts, and it is claimed that splitting the notion of alignment into these two concepts is helpful and has similarities or analogies to the real world. Outer alignment seems to be defined as models/AI systems that are optimizing something that is very close or identical to what they were programmed to do or the humans desire. Inner alignment seems to relate more to the goals/aims of a delegated optimizer that an AI system spawns in order to solve the problem it is tasked with. This explanation is not great, but there seems to be no systematic (let alone peer-reviewed/academic) work on this. Some are already criticizing this strategy as potentially turning one hard problem into two. But does this split make sense, or is it broadly helpful for alignment? I am aware of examples where RL agents that are expert at video games fail on trivial tasks, and of other ML-based RL models that comically fail to carry out the task that was intended. But is this a real failure of alignment or some very specific sub-problem, like trivially misspecified (or even omitted, in the case of the Mario world example) goals? (I'm sure there are posts on this.) But setting aside this interpretation, what is an example of outer-inner alignment separation in the real world? A common metaphor appears to be that of mesa-optimizers and evolution. In this scenario, humans are (supposed to be) outer aligned to evolution's goal of replication/reproduction, but developed birth control and can now game the systems designed for reproduction. To me this is an unclear example, as it assumes we know what evolution is and what its goals are. I'm not sure that empirical scientists think like this. I certainly do not. For example, evolution may simply be good at finding systems that are increasingly able to become independent of the world and more resilient. Humans somewhat fit this description.
In that sense, could we not argue that silicon-based organisms such as AI systems are the goal of evolution? And that humans being wiped out by AI is in fact a legitimate goal of evolution, with human reproduction only being necessary up to the point of spawning AGI? If we are supposed to think about inner alignment as a process where a complex system (earth?) delegates some process like reproduction/survival to subsystems (humans?), and those subsystems somehow protect/understand/follow the goals of the outer system and do not deviate, then it is not clear what that would mean (or that this is even desirable). If we rely on this analogy alone, it's not clear whether there's a need for a distinction between inner and outer misalignment. And a more general point: I would think that empirical scientists and even humanities researchers would be less quick to jump to conclusions about teleological goals of complex systems. We certainly can understand the immediate goals of an organism and the problems it is trying to solve (one of the most famous teleological paradigms in the neuroscience-algorithm realm being David Marr's three levels of analysis), but requiring the separation of inner vs. outer teleological goals is not obviously helpful, let alone a guaranteed path to developing safe AI systems.

Red-teaming ontology identification. In preparing for the game tree of alignment we noticed a claim that several established AI researchers have "converged" onto this idea of ontology identification (OI) as being important/critical to advancing AI safety/alignment. The idea seems to be that we need AI systems that "think", "operate" or at least are able to represent their processes and algorithms in a paradigm that humans can understand. More precisely, in terms of "ontologies", we need systems to make explicit their inner workings or be prepared to translate their world of things (i.e. their ontology) into a world of things that humans can understand. This seems neither necessary nor sufficient and might not even be desirable for safe future AI systems. It does not seem necessary for AI systems to do this in order to be safe. This is trivial as we already have very safe but complex ML models that work in very abstract spaces. On this point, it's not clear whether OI is a type of interpretability goal, or a goal of creating AI systems that must also act as "translators" of their inner processes for humans. Additionally, such systems would not be sufficient for safety, because human-interpretable processes are not sufficient to guarantee human safety. This is because we may simply be directed, intentionally or unintentionally, to the wrong set of actions by an AI system which learns a human-friendly mapping that misrepresents its inner world. More trivially, at some point humans will not be able to evaluate these processes, ideas or conceptual notions (how many of us understand cellular machinery as a biologist does, or mathematical proofs, or general relativity?). Arguably, at some point we will not be able to evaluate the effects of the recommended actions on humanity's future, regardless of how much information is given to us. Basically we're back to playing chess with a god: how many steps into the future can we evaluate?
Lastly, while it is generally desirable to have AI systems be able to explain their world of things to humans, this may hamper development (e.g. humanity is clearly not waiting until transformers, let alone GPT-4, have clear interpretations or human-friendly ontological groundings). One of the analogies provided for understanding this common-ontology goal is that when discussing physics, two parties must have the same basic understanding of the world; if one thinks in terms of classical physics (Newton) and the other in terms of modern physics (Planck, Einstein, Bohr), then they will have a hard time communicating (at least the classical physicist will have a hard time understanding the modern one). But the classical physicist can certainly benefit from solutions that are offered by the modern physicist (e.g. lasers) and use them in the "macro" world that they live in. More to the point, what if the classical physicist's brain will never be good enough to understand modern physics? Would we want to cap the development of modern physics? This may simply happen with human-AGI interactions: we will be given a list of actions to achieve some goal and will not be able to understand much of their effects. Another imperfect analogy: the vast majority of humanity has little understanding of classical, let alone modern, physics, but benefits greatly from the application of technologies developed from modern physics.

If forced to summarize these two points, I would argue that a commonality is that AI-alignment ideas that are explained in terms of notions from empirical sciences (e.g. evolution, physics, education, etc.) can be limiting and sub-optimal or trivially falsifiable (insofar as making safe AI systems is concerned). At best we are left with a lack of clarity about whether the notions are limited or the analogies need improvement, or whether the entire underlying programs are flawed in irreparable ways.

Outer alignment seems to be defined as models/AI systems that are optimizing something that is very close or identical to what they were programmed to do or the humans desire. Inner alignment seems to relate more to the goals/aims of a delegated optimizer that an AI system spawns in order to solve the problem it is tasked with.

This is not how I (or most other alignment researchers, I think) usually think about these terms. Outer alignment means that your loss function describes something that's very close to what you actually want. In other words, it means SGD is optimizing for the right thing. Inner alignment means that the model you train is optimizing for its loss function. If your AI system creates new AI systems itself, maybe you could call that inner alignment too, but it's not the prototypical example people mean.

In particular, "optimizing something that is very close [...] to what they were programmed to do" is inner alignment the way I would use those terms. An outer alignment failure would be if you "program" the system to do something that's not what you actually want (though "programming" is a misleading word for ML systems).
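A minimal toy sketch of this distinction, assuming that "the loss describes the wrong thing" can stand in for an outer alignment failure and that "a model with zero training loss that diverges off-distribution" can stand in for an inner alignment failure. Every function name and number below is hypothetical and chosen only for illustration; real training setups are far less clean.

```python
# Toy illustration only; every name below is hypothetical.

def intended(x):
    """What we actually want the system to compute."""
    return 2 * x

def mse(model, target, inputs):
    """Mean squared error of `model` against `target` over `inputs`."""
    return sum((model(x) - target(x)) ** 2 for x in inputs) / len(inputs)

train_inputs = [0, 1, 2, 3, 4]   # training distribution: non-negative only
test_inputs = [-4, -3, -2, -1]   # off-distribution inputs

# Outer alignment failure: the loss itself targets the wrong thing.
def misspecified_target(x):
    return 2 * x + 1

def fits_misspecified_loss(x):
    """A perfect optimizer of the misspecified loss."""
    return 2 * x + 1

# Inner alignment failure (assumed here to look like misgeneralization):
# the loss targets the right thing, but the learned model picked up a
# different rule that happens to score perfectly on the training inputs.
def misgeneralizing_model(x):
    return 2 * abs(x)

# Zero loss under the wrong target, yet nonzero error against what we want.
outer_train_loss = mse(fits_misspecified_loss, misspecified_target, train_inputs)
outer_gap = mse(fits_misspecified_loss, intended, train_inputs)

# Zero loss under the right target in training, large error off-distribution.
inner_train_loss = mse(misgeneralizing_model, intended, train_inputs)
inner_test_loss = mse(misgeneralizing_model, intended, test_inputs)

print("outer:", outer_train_loss, outer_gap)
print("inner:", inner_train_loss, inner_test_loss)
```

In the first case, optimizing harder cannot help because the target is wrong. In the second case the target is fine, but training performance alone does not reveal what rule the model actually learned.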

[Ontology identification] seems neither necessary nor sufficient [... for] safe future AI systems

FWIW, I agree with this statement (even though I wrote the ontology identification post you link and am generally a pretty big fan of ontology identification and related framings). Few things are necessary or sufficient for safe AGI. In my mind, the question is whether something is a useful research direction.

It does not seem necessary for AI systems to do this in order to be safe. This is trivial as we already have very safe but complex ML models that work in very abstract spaces.

This seems to apply to any technique we aren't already using, so it feels like a fully general argument against the need for new safety techniques. (Maybe you're only using this to argue that ontology identification isn't strictly necessary, in which case I and probably most others agree, but as mentioned above that doesn't seem like the key question.)

More importantly, I don't see how "current AI systems aren't dangerous even though we don't understand their thoughts" implies "future more powerful AGIs won't be dangerous". IMO the reason current LLMs aren't dangerous is clearly their lack of capabilities, not their amazing alignment.

Thanks for the comment Erik (and taking the time to read the post).

I generally agree with you re: the inner/outer alignment comment I made. But the language I used and that others also use continues to be vague; the working def for inner-alignment on lesswrong.com is whether an "optimizer is the production of an outer aligned system, then whether that optimizer is itself aligned". I see little difference - but I could be persuaded otherwise.

My post was meant to show that it's pretty easy to find significant holes in some of the most central concepts being researched now. This includes eclectic but also mainstream research, including the entire latent-knowledge approach, which seems to make significant assumptions about the relationship between human decision making or intent and super-human AGIs. I work a lot on this concept and hold (perhaps too) many opinions.

The tone might not have been ideal due to time limits. Sorry if that was off-putting.

I was also trying to make the point that we do not spend enough time shopping our ideas around, especially with basic science researchers, before we launch our work. I am a bit guilty of this. And I worry a lot that I'm actually contributing to capabilities research rather than long-term AI safety. I guess in the end I hope for a way for AI-safety and science researchers to interact more easily and develop ideas together.

I think it could be a good idea to initially cap research at the limits of what human brains (with AI assistance) can understand. After that, we can decide how we want to go on, based on the things we have understood.

Thanks for the comment. Indeed, if we could agree on capping, or slowing down, that would be a promising approach.