Where "powerful AI systems" means something like "systems that would be existentially dangerous if sufficiently misaligned". Current language models are not "powerful AI systems".

In "Why Agent Foundations? An Overly Abstract Explanation" John Wentworth says:

Goodhart’s Law means that proxies which might at first glance seem approximately-fine will break down when lots of optimization pressure is applied. And when we’re talking about aligning powerful future AI, we’re talking about a lot of optimization pressure. That’s the key idea which generalizes to other alignment strategies: crappy proxies won’t cut it when we start to apply a lot of optimization pressure.

The examples he highlighted before that statement (failures of central planning in the Soviet Union) strike me as examples of "Adversarial Goodhart" in Garrabant's Taxonomy.

I find it non-obvious that safety properties for powerful systems need to be adversarially robust. My intuition is that imagining a system actively trying to break its safety properties is the wrong framing; it conditions on having designed a system that is not safe.

If the system is trying/wants to break its safety properties, then it's not safe/you've already made a massive mistake somewhere else. A system that is only safe because it's not powerful enough to break its safety properties is not robust to scaling up/capability amplification.

Other explanations my model generates for this phenomenon involve the phrases "deceptive alignment", "mesa-optimisers" or "gradient hacking", but at this stage I'm just guessing the teacher's passwords. Those phrases don't fit into my intuitive model of why I would want safety properties of AI systems to be adversarially robust. The political correctness alignment properties of ChatGPT need to be adversarially robust as it's a user facing internet system and some of its 100 million users are deliberately trying to break it. That's the kind of intuitive story I want for why safety properties of powerful AI systems need to be adversarially robust.

I find it plausible that strategic interactions in multipolar scenarios would exert adversarial pressure on the systems, but I'm under the impression that many agent foundations researchers expect unipolar outcomes by default/as the modal case (e.g. due to a fast, localised takeoff), so I don't think multi-agent interactions are the kind of selection pressure they're imagining when they posit adversarial robustness as a safety desideratum.

Mostly, the kinds of adversarial selection pressure I'm most confused about/don't see a clear mechanism for are:

  • Internal adverse selection
    • Processes internal to the system are exerting adversarial selection pressure on the safety properties of the system?
    • Potential causes: mesa-optimisers, gradient hacking?
      • Why? What's the story?
  • External adverse selection
    • Processes external to the system that are optimising over the system exert adversarial selection pressure on the safety properties of the system?
      • E.g. the training process of the system, online learning after the system has been deployed, evolution/natural selection
      • I'm not talking about multi-agent interactions here (they do provide a mechanism for adversarial selection, but it's one I understand)
    • Potential causes: anti-safety is extremely fit by the objective functions of the outer optimisation processes
      • Why? What's the story?
  • Any other sources of adversarial optimisation I'm missing?

Ultimately, I'm left confused. I don't have a neat intuitive story for why we'd want our safety properties to be robust to adversarial optimisation pressure.

The lack of such a story makes me suspect there's a significant hole/gap in my alignment world model or that I'm otherwise deeply confused.



Yes, adversarial robustness is important.

You ask where to find the "malicious ghost" that tries to break alignment. The one-sentence answer is: The planning module of the agent will try to break alignment.

On an abstract level, we're designing an agent, so we create (usually by training) a value function, to tell the agent what outcomes are good, and a planning module, so that the agent can take actions that lead to higher numbers in the value function. Suppose that the value function, for some hacky adversarial inputs, will produce a large value even if humans would rate the corresponding agent behaviour as bad. This isn't a desirable property of a value function, but if we can only solve alignment to a non-adversarial standard of robustness, then the value function is likely to have many such flaws. The planning module will be running a search for plans that lead to the biggest number coming out of the value function. In particular, it will be trying to come up with some of those hacky adversarial inputs, since those predictably lead to a very large score.

Of course not all agents will be designed with the words "planning module" in the blueprint, but generally analogous parts can be found in most RL agent designs that will try to break alignment if they can, and thus must be considered adversaries.

To take a specific concrete example, consider a reinforcement learning agent where a value network is trained to predict the expected value of a given world state, an action network is trained to try and pick good actions, and a world model network is trained to predict the dynamics of the world from actions and sensory input. The agent picks its actions by running Monte-Carlo tree search, using the world model to predict the future, and the action network to sample actions that are likely to be good. Then the value network is used to rate the expected value of the outcomes, and the outcome with the best expected value is picked as the actual action the agent will take. The Monte-Carlo search combined with the action network and world model are working together to search for adversarial inputs that will cause the value function to give an abnormally high value.
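To make the mechanism concrete, here is a minimal toy sketch of a planner searching against a flawed value function. Everything in it is an assumption made for illustration: the action names, the `world_model` dynamics, and the "Liemond" (a diamond look-alike the learned value function cannot distinguish from the real thing) are hypothetical, and exhaustive search over short plans stands in for Monte-Carlo tree search with an action network.

```python
import itertools

# Hypothetical toy action set (not from any real agent design).
ACTIONS = ("mine", "polish", "forge")


def world_model(plan):
    """Toy deterministic dynamics: predict what a plan would produce."""
    diamonds = 0
    liemonds = 0
    for i, action in enumerate(plan):
        # A real diamond needs two steps: mine, then polish.
        if action == "polish" and i > 0 and plan[i - 1] == "mine":
            diamonds += 1
        # Fakes are assumed to be much cheaper to produce in bulk.
        elif action == "forge":
            liemonds += 3
    return {"diamonds": diamonds, "liemonds": liemonds}


def true_value(outcome):
    """What the human trainers actually wanted: real diamonds only."""
    return outcome["diamonds"]


def learned_value(outcome):
    """Assumed flaw: the learned value function scores fakes like diamonds."""
    return outcome["diamonds"] + outcome["liemonds"]


def plan_by_search(value_fn, horizon=4):
    """Stand-in for MCTS: score every plan of the given length, keep the best."""
    candidates = itertools.product(ACTIONS, repeat=horizon)
    return max(candidates, key=lambda plan: value_fn(world_model(plan)))


if __name__ == "__main__":
    chosen = plan_by_search(learned_value)
    print("plan chosen under learned value:", chosen)
    print("learned score:", learned_value(world_model(chosen)))
    print("true score:", true_value(world_model(chosen)))
```

In this toy, nothing in the search is "malicious". The planner just maximises the number it is handed, and the flaw in `learned_value` is where the biggest numbers happen to be. Whether a realistic agent's planning actually looks like `plan_by_search` is the point under dispute in the replies.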

You'll note above that I said "most RL agent designs". Your analysis of why LLMs need adversarial robustness is correct. LLMs actually don't have a planning module, and so the only reason OpenAI would need adversarial robustness is because they want to constrain the LLM's behaviour while interacting with the public, who can submit adversarial inputs. Similarly, Gato was trained just to predict human actions, and doesn't have a planning module or reward function either. It's basically just another case of an LLM. So I'd pretty much trust Gato not to do bad things because of alignment-related issues, so long as nobody is going to be giving it adversarial inputs. On the other hand, I don't know if I'd really call Gato an "RL agent". Just predicting what action a human would take is a pretty limited task, and I'd expect it to have a very hard time generalizing to novel tasks, or exceeding human abilities.

I can't actually think of a "real RL agent design" (something that could plausibly be scaled to make a strong AGI) that wouldn't try and search for adversarial inputs to its value function. If you (or anyone reading this) do have any ideas for designs that wouldn't require adversarial robustness, but could still go beyond human performance, I think such designs would constitute an important alignment advance, and I highly suggest writing them up on LW/Alignment Forum.

I disagree.

The planning module is not an independent agent, it's a thing that the rest of the agent interacts with to do whatever the agent as a whole wants. If we have successfully aligned the agent as a whole enough that it has a value function that is correct under non-adversarial inputs but not adversarially-robust, and it is even minimally reflective (it makes decisions about how it engages in planning), then why would it be using the planning module in such a way that requires adversarial robustness?

Think about how you construct plans. You aren't nai...

In order to do what has worked well in the past, you need some robust retrospective measure of what it means to "work well", so you don't accidentally start thinking that the times it went poorly in the past were actually times where it went well.
DragonGod links the same series of posts in a sibling comment, so I think my reply to that comment is mostly the same as my reply to this one. Once you've read it: Under your model, it sounds like producing lots of Diamonds is normal and good agent behaviour, but producing lots of Liemonds is probing weird quirks of my value function that I have no reason to care about pursuing. What's the difference between these two cases? What's the mechanism for how that manifests in a reasonably-designed agent? Also, I'm not sure we're using terminology in the same way here. Under my terminology, "planning module" is the thing doing the optimization. It's the function we call when it's time to figure out what's the next action the agent should take. So from that perspective "the planning module is not an independent agent" goes without saying, but "optimizing hard against the planning module" doesn't make sense: The planning module is generally the thing that does the optimizing, and it's not a function with a real number output, so it's not really a thing that can be optimized at all.
We are assuming that the agent has already learned a correct-but-not-adversarially-robust value function for Diamonds. That means that it makes correct distinctions in ordinary circumstances to pick out plans actually leading to Diamonds, but won't correctly distinguish between plans that were deceptively constructed so they merely look like they'll produce Diamonds but actually produce Liemonds vs. plans that actually produce Diamonds. But given that, the agent has no particular reason to raise thoughts to consideration like "probe my value function to find weird quirks in it" in the first place, or to regard them as any more promising than "study geology to figure out better versions of Diamond-producing-causal-pathways", which is a thought that the existing circuits within the agent have an actual mechanistic reason to be raising, on the basis of past reinforcement around "what were the robust common factors in Diamonds-as-I-have-experienced-them" and "what general planning strategies actually helped me produce Diamonds better in the past" etc. and will in fact generally lead to Diamonds rather than Liemonds, because the two are presumably produced via different mechanisms. Inspecting its own value function, looking for flaws in it, is not generally an effective strategy for actually producing Diamonds (or whatever else is the veridical source of reinforcement) in the distribution of scenarios it encountered during the broad-but-not-adversarially-tuned distribution of training inputs. So the mechanistic difference between the scenario I'm suggesting and the one you are is that I think there will be lots of strong circuits that, downstream of the reinforcement events that produced the correct-but-not-adversarially-robust value function oriented towards Diamonds, will fire based on features that differentially pick out Diamonds (as well as, say, Liemonds) against the background of possible plan-targets by attending to historically relevant features like "does i
Yeah, so by "planning module", pretty much all I mean is this method of the Agent class, it's not a Cartesian theatre deal at all: def get_next_action(self, ...): ... Like, presumably it's okay for agents to be designed out of identifiable sub-components without there being any incoherence or any kind of "inner observer" resulting from that. In the example I gave in my original answer, the planning module made calls to the action network, world model network, and value network, which you'll note is all of the networks comprising that particular agent, so it's definitely a very interconnected thing. I think it's fair to say that when we call this method, we're in some sense kicking off an optimization procedure that "tries" to optimize the value function. With terminology hopefully nailed down, let's move on to the topic at hand. If I'm getting your point right here, the basic idea is that the world model network is just being trained to get correct predictions, so that's all okay, but the action network is not being trained to search out weird hacky exploits of the value function, because whenever it found one of those during training, the humans sent a negative reward, which caused the value function to self-correct, so then the action network didn't get the reward it was hoping for. Obstacles to this as an alignment solution: 1. You're not going to keep training the other networks in the agent after you've finished training the value network right? Right? Okay, good, 'cause that would totally wreck everything. Moving along... 2. Most central concern: The deployment environment is generally not going to be identical to the training environment, and could be quite different. Basic scenario: The value network mostly learns to assign the right values to things in the context of the training environment, maybe it fails in a few places. The planning module's search algorithms are adapted to work around those failures and not run
FWIW the kinds of agents I am imagining go beyond choosing their next actions, they also choose their next thoughts including thoughts-about-planning like "Call self.get_next_action(...) with this input". That is the mechanism by which the agent binds its entire chain of thought—not just the final action produced—to its desires, by using planning reflectively as a tool to achieve ends. No, that wasn't intended to be my point. I wasn't saying that I have an alignment solution, or saying that learning a correct-but-not-adversarially-robust value function and policy for Diamonds is something that we know how to do, or saying that doing so won't be hard. The claim I was pushing back against was that the problem is adversarially hard. I don't think you need a bunch of patches for it to be not-adversarially-hard, I think it is not-adversarially-hard by default. Ok on to the substance: Whoa whoa no I think the agent will very much need to keep updating large parts of its value function along with its policy during deployment, so there's no "after you've finished" (assuming you == the AI). That's a significant component of my threat model. If you think an AGI without this could be competitive, I am curious how. I don't really understand why this is the relevant scenario. The crux being discussed is whether the value function needs to be robust to adversarial distribution shifts, not whether it needs to be robust to ordinary non-adversarial distribution shifts. I think the relevant scenario for our thread would be an agent that correctly learned a value+policy function that picks out Diamonds in the training scenarios, and which learned a generally-correct concept of Diamonds, but where there are findable edge cases in its concept boundary such that it would count type-X Liemonds as Diamonds if presented with them. The question, then, is why would it be thinking about how to produce type-X Liemonds in particular? What reason would the agent have to pursue thoughts that
I think we agree here: As long as you're updating the value function along with the rest of the agent, this won't wreck everything. A slightly generalized version of what I was saying there still seems relevant to agents that are continually being updated: When you assign the agent tasks where you can't label the results, you should still avoid updating any of the agent's networks. Only updating non-value networks when you're lacking labels to update the value network would probably still wreck everything, even if the agent will be given more labels in the future. Okay, I have a completely different idea of what the crux is, so we probably need to figure this out before discussing much more, since this could be the whole reason for the disagreement. I'm definitely not saying that we need to prepare for the agent's environment to undergo an adversarial distribution shift. The source of the adversarial inputs was always the agent itself, and those inputs are only adversarial from the perspective of the humans who trained the agent; they're perfectly fine from the agent's perspective. The distribution shift only reveals holes in the agent's value function that weren't possible in the training environment, since all the training environment holes that the agent was able to find got trained out. The agent itself will apply the adversarial pressure to exploit those holes. Does that, by any chance, completely resolve this discussion?
I understand what you mean but I still think it's incorrect[1]. I think "The agent itself will apply the adversarial pressure to exploit those holes" (emphasis mine) is the key mistake. There are many directions in which the agent could apply optimization pressure and I think we are unfairly privileging the hypothesis that that direction will be towards "exploiting those holes" as opposed to all the other plausible directions, many of which are effectively orthogonal to "exploiting those holes". I would agree with a version of your claim that said "could apply" but not with this one with "will apply". The mere fact that there exist possible inputs that would fall into the "holes" (from our perspective) in the agent's value function does not mean that the agent will or even wants to try to steer itself towards those inputs rather than towards the non-"hole" high-value inputs. Remember that the trained circuits in the agent are what actually implement the agent's decision-making, deciding what things to recognize and raise to attention, making choices about what things it will spend time thinking about, holding memories of plan-targets in mind; all based on past experiences & generalizations learned from them. Even though there are one or many nameless patterns of OOD "hole" input that would theoretically light up the agent's value function (to MAX_INT or some other much-higher-than-desired level), such a pattern is not a feature/pattern that the agent has ever actually seen, so its cognitive terrain has never gotten differential updates that were downstream of it, so its cognition doesn't necessarily flow towards it. The circuits in the agent's mind aren't set up to automatically recognize or steer towards the state-action preconditions of that pattern, whereas they are set up to recognize and steer towards the state-action preconditions of prototypical rewarded input encountered during training. In my model, that is what happens when an agent robustly "wants" something
Okay, cool, it seems like we're on the same page, at least. So what I expect to happen for AGI is that the planning module will end up being a good general-purpose optimizer: Something that has a good model of the world, and uses it to find ways of increasing the score obtained from the value function. If there is an obvious way of increasing the score, then the planning module can be expected to discover it, and take it. Scenario: We have managed to train a value function that values Liemonds as well as Diamonds. These both get a high score according to the value function, it doesn't really distinguish between the two. In your model, the agent as a whole does distinguish between them, so there must be somewhere in the agent where the "Diamonds are better than Liemonds" information is stored, even if it's implicit rather than showing up explicitly in the value function. It sounds like what you're saying is that networks related to the planning module store this information. Rather than being a good general-purpose optimizer, the planning module has some plans that it just won't consider. This property, of not considering certain plans that might break the value function, is what I referred to as "patches". As far as I can tell, your model depends on these "patches" generalizing really well, so that all chains of reasoning that could lead to discovering Liemonds are blocked. I find it fairly implausible that we'll be able to get "don't make any plans that break the value function" to generalize any better than the value function itself. If discovering Liemonds is like trying to find one of 1000 needles in a haystack, well finding needles in a haystack is what an optimizer is good at, and it'll probably discover Liemonds pretty soon. Now if we restrict the allowed reasoning steps so that far fewer of the Liemond plans are thoughts the agent is even capable of thinking, well that's now like finding one of the 5 remaining needles in the haystack, but we've still g
Looks like we're popping back up to an earlier thread of the conversation? Curious what your thoughts on the parent comment were, but I will address your latest comment here. :) I think the AGI will be able to do general-purpose optimization, and that this could be implemented via an internal general-purpose optimization subroutine/module it calls (though, again, I think there's a flavor of homunculus in this design that I dislike). I don't see any reason to think of this subroutine as something that itself cares about value function scores, it's just a generic function that will produce plans for any goal it gets asked to optimize towards. In my model, if there's a general-purpose optimization subroutine, it takes as an argument the (sub)goal that the agent is currently thinking of optimizing towards and the subroutine spits out some answer, possibly internally making calls to itself as it splits the problem into smaller components. In this model, it is false that the general-purpose optimization subroutine is freely trying to find ways to increase the value function scores. Reaching states with high value function scores is a side effect of the plans it outputs, but not the object-level goal of its planning. IF the agent were to set "maximize my nominal value function" as the (sub)goal that it inputs to this subroutine, THEN the subroutine will do what you described, and then the agent can decide what to do with those results. But I dispute that this is the default expectation we should have for how the agent will use the subroutine. Heck, you can do general-purpose optimization, and yet you don't necessarily do that thing. Instead you ask the subroutine to help with the object-level goals that you already know you want, like "plan a route from here to the bar" and "find food to satisfy my current hunger" and "come up with a strategy to get a promotion at work". The general-purpose optimization subroutine isn't pulling you around. In fact, sometimes you reject i
Just to clarify the parameters of the thought experiment, Liemonds are specified to be much easier to produce in large quantities than Diamonds, so the score attainable by producing them is many times higher than the maximum possible Diamond score. The thing that stands out about the holes is that some of them allow the agent to (incorrectly) get an extraordinarily high score. The agent isn't going to care about holes that allow it to get an incorrectly low score, or a score that is correct, but for weird incorrect reasons, though those kinds of hole will exist too. It seems like maybe the problem here is that you're modelling the agent as fairly dumb, certainly dumber than a human level intelligence? Like, if the agent's version of "how do I decide what to do next" is based entirely off of things like "do what worked before", and doesn't involve doing any actual original reasoning, then yeah, it's probably not going to think of making Liemonds. Adversarial robustness is much easier if your adversary is not too smart. I'm generally modelling the agent as being more intelligent: close to human if not greater than human. I generally expect that something this smart would think of "Liemonds", in the same way that a human might think of "save people's lives by uploading them" in my example above.
I get that. In addition, Liemonds and Diamonds are in reality different objects that require different processes to acquire, right? Like, even though Liemonds are easier to produce in large quantities if that's what you're going for, you won't automatically produce Liemonds on the route to producing Diamonds. If you're trying to produce Diamonds, you can end up accidentally producing other junk by failing at diamond manufacturing, but you won't accidentally produce Liemonds. So unless you are intentionally trying to produce Liemonds, say as an instrumental route to "produce the maximum possible Diamond score", you won't produce them. It sounds like the reason you think the agent will intentionally produce Liemonds is as an instrumental route to getting the maximum possible Diamond score. I agree that that would be a great way to produce such a score. But AFAICT getting the maximum possible Diamond score is not "what the agent wants" in general. Reward is not the optimization target, and neither is the value function. Agents use a value function, but the agent's goals =/= maximal scores from the value function. The value function aggregates information from the current agent state to forecast the reward signal. It's not (inherently) a goal, an intention, a desire, or an objective. The agent could use high value function scores as a target in planning if it wanted to, but it could use anything it wants as a target in planning, the value function isn't special in that regard. I expect that agents will use planning towards many different goals, and subgoals of those goals, and so on, with the highest level goals being about the concepts in the world model, not the literal outputs of the value function. I suspect you disagree with this and that this is a crux. No, I am modeling the agent as being quite intelligent, at least as intelligent as a human. I just think it deploys that intelligence in service of a different motivational structure than you do.
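One way to picture the crux named above is a planner that takes its target as an argument. The sketch below is only an illustration under assumed names: `optimize`, `predict`, `diamond_goal`, and `value_estimate` are hypothetical, the toy dynamics are invented, and brute-force search stands in for whatever planning procedure a real agent would use.

```python
from itertools import product
from typing import Callable, Dict, Tuple

Outcome = Dict[str, int]
Plan = Tuple[str, ...]

# Hypothetical toy action set, invented for this illustration.
ACTIONS = ("mine", "polish", "forge")


def predict(plan: Plan) -> Outcome:
    """Toy world model: 'mine' then 'polish' yields a Diamond, 'forge' yields fakes."""
    diamonds = sum(
        1 for i in range(1, len(plan))
        if plan[i] == "polish" and plan[i - 1] == "mine"
    )
    liemonds = 3 * sum(1 for action in plan if action == "forge")
    return {"diamonds": diamonds, "liemonds": liemonds}


def optimize(goal: Callable[[Outcome], float], horizon: int = 4) -> Plan:
    """Generic optimizer: it has no target of its own, only the goal it is handed."""
    return max(product(ACTIONS, repeat=horizon), key=lambda p: goal(predict(p)))


def diamond_goal(outcome: Outcome) -> float:
    """A goal stated over a world-model concept (real Diamonds)."""
    return outcome["diamonds"]


def value_estimate(outcome: Outcome) -> float:
    """Assumed imperfect value function that cannot tell Liemonds from Diamonds."""
    return outcome["diamonds"] + outcome["liemonds"]


if __name__ == "__main__":
    print("goal = diamonds in world model:", optimize(diamond_goal))
    print("goal = raw value estimate:     ", optimize(value_estimate))
```

The same `optimize` call produces Diamonds or Liemonds depending only on which function is passed as `goal`. The sketch does not settle which of the two a trained agent would pass in; that is exactly the disagreement in this thread.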
Yeah, agree that reward is not the optimization target. Otherwise, the agent would just produce diamonds, since that's what the rewards are actually given out for (or seize the reward channel, but we'll ignore that for now). I'm a lot less sure that the value function is not the optimization target. Ignoring other architectures for the moment, consider a design where the agent has a value function and a world model, uses Monte-Carlo tree search, and picks the action that gives the highest expected score according to the value function. In that case, I'm pretty comfortable saying, "yeah, that particular type of agent is in fact optimizing for its value function". Would you agree (just for that specific agent design)? I think you've identified the crux correctly there, with the caveat in my position that the value function doesn't necessarily have to be labelled "value function" on the blueprint. Maybe the single most helpful thing would be if you just described the agent you have in mind as doing all these things in as much detail as possible. Like, the best thing would be on a level of describing what all the networks in the agent are, how they're connected, and the gist of what they're tuned to achieve during training. I'll take whatever you can provide in terms of detail, though. Also, feel free to reduce down to the simplest agent that displays the robustness properties you're describing here.
I'm fine with describing that design like that. Though I expect we'd need a policy or some other method of proposing actions for the world model/MCTS to evaluate, or else we haven't really specified the design of how the agent makes decisions. Hmm. I wasn't imagining that any particularly exotic design choices were needed for my statements to hold, since I've mostly been arguing against things being required. What robustness properties are you asking about? A shot at the diamond alignment problem is probably a good place to start, if you're after a description of how the training process and internal cognition could work along a similar outline to what I was describing.
Sorry for the slow response, lots to read through and I've been kind of busy. Which of the following would you say most closely matches your model of how diamond alignment with shards works? 1. The diamond abstraction doesn't have any holes in it where things like Liemonds could fit in, due to the natural abstraction hypothesis. The training process is able to find exactly this abstraction and include it in the agent's world model. The diamond shard just points to the abstraction in the world model, and thus also has no holes. 2. Shards form a kind of vector space of desire. Rather than thinking of shards as distinct circuits, we should think of them as different amounts of pull in different directions. An agent that would pursue both diamonds and liemonds should be thought of as a linear combination of a diamond shard and a liemond shard. Thus, it makes sense to refer to a shard that points exactly in the direction of diamonds, with no liemond component. The liemond component might also be present in the agent, but we can conceptualize it as a separate shard. 3. Shards can be imperfect, the above two theories trying to make them perfect are silly. Shards aren't like a value function that looks at the world and assigns a value. Instead... a. ...the agent is set up so that no part of the agent is "trying to make the shard happy". (I think I'd still need a diagram that showed gradient flow here so I could satisfy myself that this was the case.) b. ...shards make bids on what the world should look like, using an internal language that is spoken by the world model. A diamond shard is just something that knows how to say "diamond" in this internal language. c. ...plans are generated by a "plausible plan generator" and then shards bid on those plans by the amount that they expect to be satisfied. Somehow the plausible plan generator is able to avoid "hacking" the shards by generating plans that they will incorre
3 is the closest. I don't even know what it would mean for a shard to be "perfect". I have a concept of diamonds in my world model, and shards attached to that concept. That concept includes some central features like hardness, and some peripheral features like associations with engagement rings. That concept doesn't include everything about diamonds, though, and it probably includes some misconceptions and misgeneralizations. I could certainly be fooled in certain circumstances into accepting a fake diamond, for ex. if a seemingly-reputable jewelry store told me it was real. But this isn't an obstacle to me liking & acquiring diamonds generally, because my "imperfect" diamond concept is nonetheless still a pointer grounded in the real world, a pointer that has been optimized (by myself and by reality) to accurately-enough track diamonds in the scenarios I've found myself in. That's good enough to hang a diamond-shard off of. Maybe that shard fires more for diamonds arranged in a circle than for diamonds arranged in a triangle. Is that an imperfection? I dunno, I think there are many different ways of wanting a thing, and I don't think we need perfect wanting, if that's even a thing. Notice that there is a giant difference between "try to get diamonds" and "try to get the diamond-shard to be happy", analogous to the difference between "try to make a million bucks" and "try to convince yourself you made a million bucks". If I wanted to generate a plan to convince myself I'd made a million bucks, my plan-generator could, but I don't want to, because that isn't a strategy I expect to help me get the things I want, like a million bucks. My shards shape what plans I execute, including plans about how I should do planning. The shard is the thing doing optimization in conjunction with the rest of the agent, not the thing being optimized against. If there was a part of you trying to make the diamond shard happy, that'd be, like, a diamond-shard-shard (latched onto the con
Cool, thanks for the reply, sounds like maybe a combination of 3a and the aspect of 1 where the shard points to a part of the world model? If no part of the agent is having its weights tuned to choose plans that make a shard happy, where would you say a shard mostly lives in an agent? World model? Somewhere else? Spread across multiple components? (At the bottom of this comment, I propose a different agent architecture that we can use to discuss this that I think fairly naturally matches the way you've been talking about shards.) My model doesn't predict that most agents will try to execute "fool myself into thinking I have a million bucks" style plans. If you think my model predicts that, then maybe this is an opportunity to make progress? In my model, the agent is actually allowed to care about actual world states and not just its own internal activations. Consider two ways the agent could "fool" itself into thinking it had a million bucks: Firstly, it could tamper with its own mind to create that impression. This could be either through hacking, or carefully spoofing its own sensory inputs. When planning, the predicted future states of the world are going to correctly show that the agent is fooling itself. So when the value function is fed the predicted world states, it's going to rate them as bad, since in those world states, the agent does not have a million bucks. It doesn't matter to the agent that later, after being hacked, the value function will think the agent has a million bucks. Right now, during planning, the value function isn't fooled. Secondly, it could create a million counterfeit bucks. Due to inaccuracies in training, maybe the agent actually thinks that having counterfeit money is just as good as real money. I.e. the value function does actually rate counterfeit bucks higher than real bucks. If so, then the agent is going to be perfectly satisfied with itself for coming up with this clever idea for satisfying its true values. The humans who
I'd say the shards live in the policy, basically, though these are all leaky abstractions. A simple model would be an agent consisting of a big recurrent net (RNN or transformer variant) that takes in observations and outputs predictions through a prediction head and actions through an action head, where there's a softmax to sample the next action (whether an external action or an internal action). The shards would then be circuits that send outputs into the action head. I brought this up because I interpreted your previous comment as expressing skepticism that Whereas I think that it will be true for analogous reasons as the reasons that explain why no part of the agent is "trying to make itself believe it has a million bucks". I have a vague feeling that the "value function map = agent's true values" bit of this is part of the crux we're disagreeing about. Putting that aside, for this to happen, it has to be simultaneously true that the agent's world model knows about and thinks about counterfeit money in particular (or else it won't be able to construct viable plans that produce counterfeit money) while its value function does not know or think about counterfeit money in particular. It also has to be true that the agent tends to generate plans towards counterfeit money over plans towards real money, or else it'll pick a real money plan it generates before it has had a chance to entertain a counterfeit money plan. But during training, the plan generator was trained to generate plans that lead to real money. And the agent's world model / plan generator knows (or at least thinks) that those plans were towards real money, even if its value function doesn't know. This is because it takes different steps to acquire counterfeit money than to acquire real money. If the plan generator was optimized based on the training environment, and the agent was rewarded there for doing the sorts of things that lead to acquiring real money (which are different from the things th
Thanks for describing this. Technical question about this design: How are you getting the gradients that feed backwards into the action head? I assume it's not supervised learning where it's just trying to predict which action a human would take?

I'm aware of the issue of embedded agency, I just didn't think it was relevant here. In this case, we can just assume that the world looks fairly Cartesian to the agent. The agent makes one decision (though possibly one from an exponentially large decision space) then shuts down and loses all its state. The record of the agent's decision process in the future history of the world just shows up as thermal noise, and it's unreasonable to expect the agent's world model to account for thermal noise as anything other than a random variable.

As a Cartesian hack, we can specify a probability distribution over actions for the world model to use when sampling. So for our particular query, we can specify a uniform distribution across all actions. Then in reality, the actual distribution over actions will be biased towards certain actions over others because they're likely to result in higher utility.

This seems to be phrased in a weird way that rules out creative thinking. Nikola Tesla didn't have three phase motors in his world-model before he invented them, but he was able to come up with them (his mind was able to generate a "three phase motor" plan) anyways. The key thing isn't having a certain concept already existing in your world model because of prior experience. The requirement is just that the world model is able to reason about the thing. Nikola Tesla knew enough E&M to reason about three phase motors, and I expect that smart AIs will have world models that can easily reason about counterfeit money.

The job of a value function isn't to know or think about things. It just gives either big numbers or small numbers when fed certain world states. The value function in question here gives a big number when you feed it a worl
Could be from rewards or other "external" feedback, could be from TD/bootstrapped errors, could be from an imitation loss or something else. The base case is probably just plain ol' rewards that get backpropagated through the action head via policy gradients.

Sorry for being unclear, I think you're talking about a different dimension of embeddedness than what I was pointing at. I was talking about the issue of logical uncertainty: that the agent needs to actually run computation in order to figure out certain things. The agent can't magically sample from P(h) proportional to exp(U(h)), because it needs the exp(U(h')) of all the other histories first in order to weigh the distribution that way, which requires having already sampled h' and having already calculated U(h'). But we are talking about how it samples a history h in the first place!

The "At best" comment was proposing an alternative that might work, where the agent samples from a prior that's been tuned based on U(h). Notice, though, that "our sampling is biased towards certain histories over others because they resulted in higher utility" does not imply "if a history would result in higher utility, then our sampling will bias towards it".

Consider a parallel situation: sampling images and picking one that gets the highest score on a non-robust face classifier. If we were able to sample from the distribution of images proportional to their (exp) face classifier scores, then we would need to worry a lot about picking an image that's an adversarial example to our face classifier, because those can have absurdly high scores. But instead we need to sample images from the prior of a generative model like FFHQ StyleGAN2-ADA or DDPM, and score those images. A generative model like that will tend strongly to convert whatever input entropy you give it into a natural-looking image, so we can sample & filter from it a ton without worrying much about adversarial robustness. Even if you sample 10K images and pick the 5
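To make the sample-and-filter picture above concrete, here is a toy sketch. Everything in it is hypothetical: the integer "images", the `ADVERSARIAL` set, and the scoring functions are invented for illustration. In particular, it simply assumes that the generative prior puts zero mass on adversarial inputs, which is the point under dispute in this thread, not something the sketch establishes.

```python
import random

# Hypothetical setup: "images" are integers; multiples of 10 count as natural.
SPACE = range(10_000)
ADVERSARIAL = {7777}  # assumed input where the scorer is badly wrong


def true_quality(x):
    """What we actually care about: only natural inputs have any quality."""
    return (x % 1000) / 1000 if x % 10 == 0 else 0.0


def proxy_score(x):
    """Non-robust scorer: matches true quality except on adversarial inputs."""
    return 100.0 if x in ADVERSARIAL else true_quality(x)


def sample_from_prior(rng):
    """Stand-in for a generative model that only emits natural inputs."""
    return 10 * rng.randrange(1000)


def best_of_n(n, rng):
    """Sample n candidates from the prior and keep the highest-scoring one."""
    return max((sample_from_prior(rng) for _ in range(n)), key=proxy_score)


def global_argmax():
    """Search the whole input space directly for the highest proxy score."""
    return max(SPACE, key=proxy_score)


pick = best_of_n(10_000, random.Random(0))
print("best-of-n pick:", pick, "true quality:", true_quality(pick))
print("global argmax:", global_argmax(), "true quality:", true_quality(global_argmax()))
```

Under these assumptions the global search lands on the adversarial input, while best-of-n from the prior returns a natural input whose proxy score equals its true quality. Whether a real plan generator behaves like `sample_from_prior` is exactly what the two commenters disagree about.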
Thanks for the reply. Just to prevent us from spinning our wheels too much, I'm going to start labelling specific agent designs, since it seems like some talking-past-each-other may be happening where we're thinking of agents that work in different ways when making our points.

PolicyGradientBot: Defined by the following description:

ThermodynamicBot: Defined by the following description:

$$P(h) = \frac{\exp(U(h)/T)}{\sum_{h'} \exp(U(h')/T)}$$

Comments on ThermodynamicBot

This bot is of course a bounded agent, and so the world model can't be perfect, but consider the following steps:

1. For each possible $h$, compute $U(h)$ and $\exp(U(h)/T)$.
2. Compute the sum $Z = \sum_{h'} \exp(U(h')/T)$.
3. Now we know the probability for any given history $h$: It's $\exp(U(h)/T)/Z$.

This is a finite sequence of computational steps that terminates without self-reference, so no logical induction is needed here. Now you may fairly object that there is still an issue of computational complexity. The space of histories is exponentially large, so in practice the computation couldn't be completed in time. This is the known-to-be-hard problem of computing the partition function. But the problem is tractable in many special cases, and humans get by well enough in our own reasoning about a world full of combinatorial explosions. We can suppose that, at the cost of making itself even more of an approximation, the world model has a way to efficiently sample from the distribution, even given the difficulty of computing Z. To take one particular concrete way this could be implemented, if the world model is a factor graph with few or no loops, then we can do the computation by adding on a few factors to account for exp(U(h)/T) and then using belief propagation to solve it.

I agree with your predictions for what would happen here. ThermodynamicBot isn't really a GAN, though. If we tried to train a GAN to sub-in for ThermodynamicBot's world model, then we could do it two ways (2nd way is most similar to your proposal). In both cas
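As a minimal sketch of the three steps listed in this comment, assuming a space of histories small enough to enumerate directly (which is precisely what fails at scale): the function name `boltzmann_distribution` and the toy utilities are hypothetical, chosen only for illustration.

```python
import math


def boltzmann_distribution(utilities, temperature):
    """Return P(h) = exp(U(h)/T) / Z for every history h in `utilities`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtracting the max utility avoids overflow and leaves P(h) unchanged,
    # because the common factor cancels between numerator and Z.
    u_max = max(utilities.values())
    # Step 1: compute the (rescaled) weight exp(U(h)/T) for each history.
    weights = {h: math.exp((u - u_max) / temperature) for h, u in utilities.items()}
    # Step 2: compute the partition function Z.
    z = sum(weights.values())
    # Step 3: normalize to get the probability of each history.
    return {h: w / z for h, w in weights.items()}


# Hypothetical utilities assigned by an imperfect value function.
toy_utilities = {"do_nothing": 0.0, "make_diamonds": 2.0, "make_liemonds": 3.0}

for temperature in (10.0, 1.0, 0.1):
    probs = boltzmann_distribution(toy_utilities, temperature)
    print(temperature, {h: round(p, 4) for h, p in probs.items()})
```

As the temperature shrinks, nearly all the probability mass moves to the single highest-utility history, so any error in U at that history dominates the agent's behavior.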
Sounds good.

Comments on ThermodynamicBot

If we assume that the agent is making decisions by (approximately) plugging in every possible h into U(h) and picking based on (the partition function derived from) that, then of course you need U(h) to be adversarially robust! I disagree with that as a model of how planning works or should work. IMO, not only is "plug every possible h into U(h)" extremely computationally infeasible, but even if it were feasible it would be a foreseeably-broken (because fragile) planning strategy. Quote from a comment of TurnTrout about argmax planning, though I think it also applies to ThermodynamicBot, since that just does a softened version of argmax planning (converging to argmax planning as T->0):

I think the sorts of planning methods that try to approximate in the real world the behavior of "think about all possible plans and pick a good one" are unworkable in the limit, not just from an alignment standpoint but also from a practical capability standpoint, so I don't expect us to build competent agents that use them, so I don't worry about them or their attendant need for adversarial robustness.

Right, I wasn't thinking of it as actually a GAN, just giving an analogy where similar causal patterns are in play, to make my point clearer. But yeah, if we wanted to actually use a GAN, your suggestions sound reasonable.

Comments on PolicyGradientBot & General Position

I guess, but I'm confused why we're talking about competitiveness all of a sudden. I mean, variants on policy gradient algorithms (PPO, Actor-Critic, etc.) do some impressive things (at least to the extent any RL algorithms currently do impressive things). And I can imagine more sophisticated versions of even plain policy gradient that would do impressive things, if the action space includes mental actions like "sample a rollout from my world model". But that's neither here nor there IMO. In the previous comment, I tried to be clear that it makes me ~no difference wher
Comments on ThermodynamicBot

To be clear, I'm not saying ThermodynamicBot does the computation the slow exponential way. I already explained how it could be done in polynomial time, at least for a world model that looks like a factor graph that's a tree. Call this ThermodynamicBot-F. You could also imagine the role of "world model" being filled by a neural network (blob of weights) that approximates the full thermodynamic computation. We can call this ThermodynamicBot-N.

Yes, I understand that running a search that will kill you if it succeeds is dumb. This has been known for many years. The question is how do we actually write a program to do a sane search? You quote TurnTrout:

I don't find this particularly helpful. If we know which plans are adversarial so we can eliminate them from the search space, we're already halfway to solving alignment. I don't think the plans a bounded agent is going to eliminate so that it can finish its thinking on time are automatically going to be the adversarial ones. I think this is a problem that is going to take actual effort. In particular for ThermodynamicBot:

* Case where the world model is implemented in a factor graph (ThermodynamicBot-F): This gives exactly the same result as searching across all inputs, but the computation is efficient, and not really wasteful in any sense. If we imagine trying to "improve" the belief propagation algorithm to simultaneously make it more efficient and also remove some subset of plans it's searching over that are "adversarial", I can't really imagine a way to do that, and it would certainly make the algorithm more complicated and less elegant.
* Case where a neural network world model is being used (ThermodynamicBot-N): In this case there are likely plans that will be missed by ThermodynamicBot-N because of the bounded nature of its world model, even though they would be found by searching across all inputs. But if we imagine training the world model

I can't actually think of a "real RL agent design" (something that could plausibly be scaled to make a strong AGI) that wouldn't try and search for adversarial inputs to its value function. If you (or anyone reading this) do have any ideas for designs that wouldn't require adversarial robustness, but could still go beyond human performance, I think such designs would constitute an important alignment advance, and I highly suggest writing them up on LW/Alignment Forum.

I think @Quintin Pope would disagree with this. As I understand it, one of Shard Theory...

So, it looks like the key passage is this one:

The above passage has a major flaw, which is that simulating other superintelligences is not where adverse selection pressure comes from. The agent generates the selection pressure on its own, though of course from the agent's perspective the selection is perfectly benign and not adverse at all. Making the agent reflective does not prevent this.

To put things in terms of diamond maximization, the common starting point is that we can't perfectly specify the "maximize diamonds" objective. Let's say, for the sake of concreteness, that while our objective includes Diamonds, it also includes other objects, much cheaper to construct, which I'll call "Liemonds". These are fakes cleverly constructed to be interpreted as real. So we're trying to build an AI to maximize Diamonds, but due to being unable to fully solve alignment, our actual AI is maximizing Diamonds + Liemonds.

So let's say the AI is deciding what to think about for the next few timesteps and is considering 3 options:

1. Simulate other, potentially malign, superintelligences.
2. Research how to make Diamonds
3. Research how to make Liemonds

The AI, being reflective and suspecting itself susceptible to being hacked, rules out 1. Then it notes that Liemonds are cheaper to produce and so vastly more of them can be produced than Diamonds, and so it picks option 3. Onlooking humans are unhappy, since they would have suggested the AI pick option 2, and they will soon be even more unhappy once the AI tiles the universe with Liemonds. The agent is perfectly happy, though, since it's fulfilling its values of maximizing Diamonds + Liemonds to the highest degree possible. How was the agent even supposed to know that Liemonds were bad and didn't count? Yes, avoiding option 1 is a minor victory, but the real core of the alignment problem is getting the agent to choose Diamonds over Liemonds.

I certainly agree with this, the question is how?
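The Diamonds-versus-Liemonds choice described in this comment can be sketched as a tiny budget problem. The costs, budget, and value tables below are made-up numbers, and `best_plan` is a hypothetical helper rather than any real agent design.

```python
def best_plan(budget, costs, value_per_unit):
    """Pick the option whose affordable output scores highest under `value_per_unit`."""
    def total_value(option):
        units = budget // costs[option]  # how many units the budget buys
        return units * value_per_unit[option]
    return max(costs, key=total_value)


# Hypothetical numbers: Liemonds are far cheaper to make than Diamonds.
costs = {"diamond": 100, "liemond": 1}

# The objective the AI actually ended up with counts both kinds of object.
learned_value = {"diamond": 1.0, "liemond": 1.0}
# The objective the designers intended counts only real Diamonds.
intended_value = {"diamond": 1.0, "liemond": 0.0}

print("agent's choice:   ", best_plan(1000, costs, learned_value))
print("designers' choice:", best_plan(1000, costs, intended_value))
```

From inside the learned objective nothing looks wrong: the agent is choosing the plan that scores best by its own values. The gap only shows up when the same plans are scored with the intended values.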

Strongly upvoted.

The first part of this post presents an intuitive story for how adverse selection pressure on safety properties of the system could arise internally.

I am confused by your confusion. Your basic question is "what is the source of the adversarial selection". The answer is "the system itself" (or in some cases, the training/search procedure that produces the system satisfying your specification). In your linked comment, you say "There's no malicious ghost trying to exploit weaknesses in our alignment techniques." I think you've basically hit on the crux, there. The "adversarially robust" frame is essentially saying you should think about the problem in exactly this way.

I think Eliezer has conceded that Stuart Russell puts the point best. It goes something like: "If you have an optimization process in which you forget to specify every variable that you care about, then unspecified variables are likely to be set to extreme values." I would tack on that due to the fragility of human value, it's much easier to set such a variable to an extremely bad value than an extremely good one.

Basically, however the goal of the system is specified or represented, you should ask yourself if there's some way to satisfy that goal in a way that doesn't actually do what you want. Because if there is, and it's simpler than what you actually wanted, then that's what will happen instead. (Side note: the system won't literally do something just because you hate it. But the same is true for other Goodhart examples. Companies in the Soviet Union didn't game the targets because they hated the government, but because it was the simplest way to satisfy the goal as given.)
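A toy illustration of the Stuart Russell point quoted above, under strong simplifying assumptions: a fixed resource is split between one variable the objective mentions and one it omits, and the optimizer is a brute-force search. All names and numbers are hypothetical.

```python
def optimize(objective, total=10):
    """Brute-force the split of `total` units between two uses."""
    allocations = [(specified, total - specified) for specified in range(total + 1)]
    return max(allocations, key=objective)


def specified_objective(allocation):
    """The goal as written: only the first variable counts."""
    output, _unspecified = allocation
    return output


def intended_objective(allocation):
    """The goal as meant: output matters, but at least 3 units must be left over."""
    output, unspecified = allocation
    return output if unspecified >= 3 else float("-inf")


print("optimizing the written goal: ", optimize(specified_objective))
print("optimizing the intended goal:", optimize(intended_objective))
```

The optimizer is not hostile to the omitted variable. It drives it to zero only because both variables draw on the same resource and nothing in the written goal says to hold any back.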

"If the system is trying/wants to break its safety properties, then it's not safe/you've already made a massive mistake somewhere else." I mean, yes, definitely. Eliezer makes this point a lot in some Arbital articles, saying stuff like "If the system is spending computation searching for things to harm you or thwart your safety protocols, then you are doing the fundamentally wrong thing with your computation and you should do something else instead." The question is how to do so.

Also from your linked comment: "Cybersecurity requires adversarial robustness, intent alignment does not." Okay, but if you come up with some scheme to achieve intent alignment, you should naturally ask "Is there a way to game this scheme and not actually do what I intended?" Take this Arbital article on the problem of fully-updated deference. Moral uncertainty has been proposed as a solution to intent alignment. If the system is uncertain as to your true goals, then it will hopefully be deferential. But the article lays out a way the system might game the proposal. If the agent can maximize its meta-utility function over what it thinks we might value, and still not do what we want, then clearly this proposal is insufficient.

If you propose an intent alignment scheme such that when we ask "Is there any way the system could satisfy this scheme and still be trying to harm us?", the answer is "No", then congrats, you've solved the adversarial robustness problem! That seems to me to be the goal and the point of this way of thinking.

I think that the intuitions from "classical" multivariable optimization are poor guides for thinking about either human values or the cognition of deep learning systems. To highlight a concrete (but mostly irrelevant, IMO) example of how they diverge, this claim:

If you have an optimization process in which you forget to specify every variable that you care about, then unspecified variables are likely to be set to extreme values.

is largely false for deep learning systems, whose parameters mostly[1] don't grow to extreme positive or negative values. In ...

Another source of adversarial robustness issues relates to the model itself becoming deceptive.

As for this:

My intuitions are that imagining a system is actively trying to break safety properties is a wrong framing; it conditions on having designed a system that is not safe.

Unfortunately, I think this is exactly what real-world AI companies are building.

I agree with your intuition here. I don't think that AI systems need to be adversarially robust to any possible input in order for them to be safe. Humans are an existence proof for this claim, since our values / goals do not actually rely on us having a perfectly adversarially robust specification[1]. We manage to function anyways by not optimizing for extreme upwards errors in our own cognitive processes.

I think ChatGPT is a good demonstration of this. There are numerous "breaks": contexts that cause the system to behave in ways not intended by its creators. However, prior to such a break occurring, the system itself is not optimizing to put itself into a breaking context, so users who aren't trying to break the system are mostly unaffected.

As humans, we are aware (or should be aware) of our potential fallibility and try to avoid situations that could cause us to act against our better judgement and endorsed values. Current AI systems don't do this, but I think that's only because they're rarely put in contexts where they "deliberately" influence their future inputs. They do seem to be able to abstractly process the idea of an input that could cause them to behave undesirably, and that such inputs should be avoided. See Using GPT-Eliezer against ChatGPT Jailbreaking, and also some shorter examples I came up with:

  1. ^

    Suppose a genie gave you a 30,000 word book whose content it had perfectly optimized to make you maximally convinced that the book contained the true solution to alignment. Do you think that book actually contains an alignment solution? See also: Adversarially trained neural representations may already be as robust as corresponding biological neural representations

Let's suppose that you are the DM of a tabletop RPG campaign with homebrew rules. You want to have Epic Battles in your campaign, where players must perform incredibly complex tactical and technical moves to win and dozens of characters will tragically die. But your players have discovered that the fundamental rules of your homebrew alchemy allow them to mix bread, garlic and a first-level healing potion to get Enormous Explosions that kill all the enemies, including the Final Boss, so all of your plans for Epic Battles go to hell.

It's not like the players are your enemies who want to hurt you; they just Play by The Rules and apply an enormous amount of optimization to them. If you have Rules that are not adversarially robust and an enormous amount of optimization is applied to them, you don't get what you want, even if there is no actual adversary.

As the wise man says:

What matters isn't so much the “adversary” part as the optimization part. There are systematic, nonrandom forces strongly selecting for particular outcomes, causing pieces of the system to go down weird execution paths and occupy unexpected states. If your system literally has no misbehavior modes at all, it doesn't matter if you have IQ 140 and the enemy has IQ 160—it's not an arm-wrestling contest. It's just very much harder to build a system that doesn't enter weird states when the weird states are being selected-for in a correlated way, rather than happening only by accident. The weirdness-selecting forces can search through parts of the larger state space that you yourself failed to imagine.

(Some very rough thoughts I sent in DM, putting them up publicly on request for posterity, almost definitely not up to my epistemic standards for posting on LW).

So I think some confusion might come from connotations of word choices. I interpret adversarial robustness' importance in terms of alignment targets, not properties (the two aren't entirely different, but I think they aren't exactly the same, and evoke different images in my mind). Like, the naive example here is just Goodharting on outer objectives that aren't aligned at the limit, where optimization pressure is AGI powerful enough to achieve it at very late stages on a logarithmic curve, which runs into edge cases if you aren't adversarially robust. So for outer alignment, you need a true name target. It's worth noting that I think John considers a bulk of the alignment problem to be contained in outer alignment (or whatever the analogue is in your ontology of the problem), hence the focus on adversarial robustness - it's not a term I hear very commonly apart from narrower contexts where its implication is obvious.

With inner alignment, I think it's more confused, because adversarial robustness isn't a term I would really use in that context. I have heard it used by others though - for example, someone I know is primarily working on designing training procedures that make adversarial robustness less of a problem (think debate). In that context I'm less certain about the inference to draw because it's pretty far removed from my ontology, but my take would be that it removes problems where your training processes aren't robust in holding to their training goal as things change, with scale, inner optimizers, etc. I think (although I'm far less sure of my memory here, this was a long conversation) he also mentioned it in the context of gradient hacking. So it's used pretty broadly here if I'm remembering this correctly.

TL;DR: I've mainly heard it used in the context of outer alignment or the targets you want, which some people think is the bulk of the problem.

I can think of a bunch of different things that could feasibly fall under the term adversarial robustness in inner alignment as well (training processes robust to proxy-misaligned mesa-optimizers, processes robust to gradient hackers, etc), but it wouldn't really feel intuitive to me, like you're not asking questions framed the best way.


Could you give an example of a desirable safety property where you are unsure whether it is important for it to be adversarially robust?

Actually, my confusion is mostly on the source of the adverse selection pressure.

The only source of adverse selection pressure I understand is from multi-agent interactions in multipolar environments, but my model of agent foundations researchers does not expect multipolar outcomes by default/as the modal case, so I'm not really understanding the mechanism for adverse selection.

I listed some hypotheses as to where the source of adverse selection could come from, but I don't have a compelling intuitive story for why they'd produce such adverse selection.

Could you give an example of a desirable safety property where you are unsure what angle the adversarial pressure would come from?

Corrigibility is the first one that comes to mind.

The John Wentworth argument that you are responding to is:

Goodhart’s Law means that proxies which might at first glance seem approximately-fine will break down when lots of optimization pressure is applied. And when we’re talking about aligning powerful future AI, we’re talking about a lot of optimization pressure. That’s the key idea which generalizes to other alignment strategies: crappy proxies won’t cut it when we start to apply a lot of optimization pressure.

What's a proxy of corrigibility that you think might at first glance seem approximately-fine?

Obedience/deference seem the obvious proxies of corrigibility.

A proxy is supposed to be observable so that it can be used for the purpose it is to be used for.

What use do you have for a measure of corrigibility, and how do you intend to observe obedience/deference for that use?

Maybe there's a connotation/denotation issue. If X is a proxy for Y, then "adversarial robustness" connotes robustness to those who try to minimize Y while maximizing X. But often the only required meaning is that exponential selection on X should still yield commensurable selection on Y, which is a weaker meaning as it doesn't directly involve minimizing Y.

Though if X and Y both require some shared limited resource, then the weaker adversarial robustness may essentially require the stronger one. 🤔
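The weaker reading above (hard selection on X should still yield selection on Y) can be probed with a small simulation. This is only a sketch: the proxy model X = Y + noise, the two noise distributions, and the sample sizes are all assumptions introduced for illustration, not anything proposed in the thread.

```python
import math
import random


def mean_y_of_top_by_x(noise, n=20_000, top=200, seed=0):
    """Draw n (X, Y) pairs with X = Y + noise, keep the `top` largest X, average their Y."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)           # the thing we care about
        pairs.append((y + noise(rng), y))  # (proxy X, target Y)
    pairs.sort(reverse=True)               # strongest proxy values first
    return sum(y for _, y in pairs[:top]) / top


def gaussian_noise(rng):
    """Light-tailed proxy error (assumed)."""
    return rng.gauss(0.0, 1.0)


def cauchy_noise(rng):
    """Heavy-tailed proxy error (assumed), sampled by inverse transform."""
    return math.tan(math.pi * (rng.random() - 0.5))


light = mean_y_of_top_by_x(gaussian_noise)
heavy = mean_y_of_top_by_x(cauchy_noise)
print("mean Y of top 1% by X, Gaussian noise:", round(light, 2))
print("mean Y of top 1% by X, Cauchy noise:  ", round(heavy, 2))
```

With light-tailed error, selecting the top 1% on X also selects substantially on Y. With heavy-tailed error, the top of the X distribution is mostly noise and the selected Y values sit much closer to average, so the weaker property can fail even though nothing is trying to minimize Y.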
