
(Many thanks to Abram Demski for an extensive discussion that helped clarify my thoughts. Thanks Abram & John Maxwell for comments and criticisms on a draft.)

Background: Why do I care?

Way back in ancient history, i.e. two years ago, my first ever attempt to contribute something original to the field of AGI safety was the blog post The Self-Unaware AI Oracle. Part of my premise there was that I was thinking maybe you can make a safe AGI by using 100% self-supervised predictive learning—i.e., the model tries to predict something, you compare it to what actually happens, and then you improve the model by gradient descent or similar.

That wasn’t a great post, although it was a good experience thanks to helpful commenters who guided me from “very confused” to “somewhat confused”. Then I wrote a couple other posts about self-supervised learning, before moving on to other things because I came to feel that you probably need reinforcement learning (RL) to reach AGI—i.e., that predictive learning by itself won’t cut it. See Section 7.2 here for why I (currently) think that. Note that others disagree, e.g. advocates of Microscope AI.

Fast forward to a couple months ago, when I wrote My AGI Threat Model: Misaligned Model-Based RL Agent, where one gear in the AGI is a world-model trained by 100% predictive learning. Now that, I can get behind, as an architecture that could plausibly scale to AGI. Here the whole AGI is using RL and planning; it’s just that this particular gear is not.

Inevitable neuroscience side-note: I try to talk in ML lingo, but I’m always secretly thinking mainly about human brain algorithms. So anyway, in the brain case, the “gear” in question corresponds roughly to the non-frontal-lobe parts of the neocortex, which build predictive models of sensory inputs, and AFAICT their only training signal is predictive error (example ref). See Is RL involved in sensory processing? The answer by the way is: No. We don’t want RL here, because RL is for making decisions, and this part of the brain is not supposed to be deciding anything. I’m oversimplifying in various ways, but whatever, I don’t want to stray too far off-topic.

So anyway, here I am, back on the market for an argument that world-modeling systems trained 100% on prediction error are not going to do dangerous things, like try to seize more computing power and take over the world or whatever. We want them to just build a better and better model of the world. I’m generally optimistic that this is a solvable problem—i.e., if we follow best practices, we can have an AGI-capable world-modeler trained on self-supervised learning that definitely won’t do dangerous things. I’m not there yet, but maybe this is a start.

To be crystal-clear, even if I somehow prove that the predictive world-model gear in this kind of AGI won’t do anything dangerous, that would be an infinitesimal step towards establishing that the whole AGI won’t do anything dangerous. In fact I think the whole AGI would be dangerously misaligned by default. Describing that problem was the whole point of that post, and trying to solve that problem is motivating almost everything I’m working on. Writing this post kinda feels to me like an annoying tangent from that more urgent problem.

It doubly feels like an annoying tangent because I’m trying to write this post without making any assumptions about the architecture of the predictive world-model—and in particular, leaving open the possibility that it’s a deep neural net (DNN). The scenario I’m really interested in is that the predictive world-model is something closer to a probabilistic graphical model (PGM). DNNs seem like a really hard thing to reason about—like if you have a 200-layer DNN, who knows what calculations are happening in layer #100?? By contrast, if we’re learning a PGM and doing belief-prop, it seems less likely that weird unintended calculations can get hidden inside, especially in the context of the larger system, which has other subsystems constantly looking at all the PGM nodes and attaching values to them—but I won't get into that.

Well anyway, here goes.

I’ll start with some “easy wins”—things we can do that seem to at least mitigate safety problems with predictive learners. Then I’ll get into the hard problems where I’m much less confident.

Easy win #1: Use postdictive learning, not predictive learning

Background 1/2: Most “predictive learning” (in both ML and brains) is actually postdictive learning

The difference between “prediction” and “postdiction” is that in postdiction, instead of issuing a guess about a thing that hasn’t happened yet (e.g. Who will win the game tomorrow?), you’re issuing a guess about something that already happened (e.g. What answer is written at the back of the book?). Here are a couple examples.

Example of postdictive learning in ML: The GPT-3 training procedure involves trying to guess the next word of the text corpus, on the basis of previous words. The next word is masked from GPT-3, but it already exists. It was probably written years before the training! So that’s postdiction, not prediction.
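To make the “already locked in” point concrete, here is a minimal runnable sketch, with a toy bigram counter standing in for GPT-3 (purely illustrative, nothing like the real training setup): the target word exists in the corpus before the guess is made, so nothing the learner outputs can change the answer it gets scored against.

```python
from collections import defaultdict

corpus = "the cat sat on the mat and then the cat slept on the mat".split()
counts = defaultdict(lambda: defaultdict(int))   # toy "model": bigram counts
surprises = 0

for prev_word, target in zip(corpus, corpus[1:]):
    # The model's guess about a word that already exists in the corpus:
    guess = max(counts[prev_word], key=counts[prev_word].get, default=None)
    surprises += (guess != target)           # compare against the locked-in answer
    counts[prev_word][target] += 1           # update the model on (post)dictive surprise

print(f"{surprises} surprises over {len(corpus) - 1} postdictions")
```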

Example of postdictive learning in brains: A common trope is that brains are trained on prediction. Well, technically, I claim it would be more accurate to say that they’re trained on postdiction. Like, let’s say I open a package, expecting to see a book, but it’s actually a screwdriver. I’m surprised, and I immediately update my world-model to say "the box has a screwdriver in it". During this process, there was a moment when I was just beginning to parse the incoming image of the newly-opened box, expecting to see a book inside. A fraction of a second later, my brain recognizes that my expectation was wrong, and that’s the “surprise”. So in other words, my brain had an active expectation about something that had already happened—about photons that had by then already arrived at my retina—and that expectation was incorrect, and that’s what spurred me to update my world-model. Postdiction, not prediction.

(I might still be confused about why I got a screwdriver when I ordered a book, and maybe I'll mull it over and eventually figure it out. That process of mulling it over would be neither "updating on postdictive surprise" nor "updating on predictive errors", but rather "updating on no data at all"! Or, well, I guess you could say: one part of my world-model is postdicting what another part of my world-model is saying, and being surprised, and thus getting updated.)

What would it look like to not do postdiction?

Example that is not postdictive learning: Let’s say I get on a train at 8:00, and it’s scheduled to reach its destination at 9:00, and I predict that it will actually arrive at the destination early. Instead, it gets to the destination late—at 9:15. So sometime after 9:00, when it’s clear that my prediction was false, I put myself back in my 8:00 state of mind, and say to myself:

“Knowing what I knew at 8:00, and thinking what I was thinking at 8:00, I made an incorrect prediction. What should I have done differently? Then next time I can do that.”

So that would be (more-or-less) updating on a prediction error, not a postdiction error. But that’s a really rare thing to do! Like, superforecasters do that kind of thing, and unusually good investors, and rationalists, etc. I think most people don’t do that. I mean, yes, people often think the thought “what should I have done differently”, like after an embarrassing experience. But the widespread prevalence of hindsight bias suggests that humans can do at most some poor approximation to predictive learning, not the real thing, except with scrupulous effort and training. Our brains don't hold a cache with snapshots of the exact state of our world-model at previous times; the best we can do is take the current world-model, apply patches for the most egregious and salient recent changes, and use that.

So in both brains and ML, I claim that true predictive learning is super-rare, while postdictive learning is the effortless default.

Background 2/2: Postdictive learning eliminates some potential safety issues

Why? Because in predictive training, the system can (under some circumstances) learn to make self-fulfilling prophecies—in other words, it can learn to manipulate the world, not just understand it. For example see Abram Demski’s Parable of the Predict-O-Matic. In postdictive training, the answer is already locked in when the system is guessing it, so there’s no training incentive to manipulate the world. (Unless it learns to hack into the answer by row-hammer or whatever. I’ll get back to that in a later section.)

Note the heading: “eliminates some potential safety issues”. You can also set up predictive learners to have some of the same benefits—it just doesn’t happen automatically. Stuart Armstrong & Xavier O’Rorke’s Counterfactual Oracle is a nice example here. For the record, I’m not interested in those ideas because my use-case (previous section) is different from what Armstrong & O’Rorke were thinking about. I’m not thinking about how to make an oracle that outputs natural-language predictions for human consumption—at least, not in this post. Instead, I’m thinking about how to build a “world-modeling gear” inside a larger AGI. The larger AGI could still function as an oracle, if we want. Or not. But that’s a different topic. For what I’m talking about here, postdictive updating obviates the need for clever oracle schemes, and incidentally is a superior solution for unrelated practical reasons (e.g. no need for human-labeling of the training data).

(Also, postdictive learning is maybe even safer than counterfactual-oracle-type predictive learning, because in the latter, we need to make sure that the prediction doesn’t influence the world when it’s not supposed to. And that’s a potential failure mode. Like, what if there’s a side-channel?)

But don’t we want predictions not postdictions?

I concede that making predictions (and not just postdictions) is a useful thing for humans and AGIs to be able to do. But the way it works in brains (and GPT-3) is that you keep doing postdictive updates on the world-model, and over time the world-model incidentally gets to be good at prediction too. I think it’s, like, spiritually similar to how TD learning converges to the right value function. If you keep updating on surprise, then surprises gradually start happening less frequently, and further in advance.

(Side note #1: how do we get the ability to predict weeks and months and years ahead if we only train on postdicting imminent incoming sensory data? The naive answer is: well, if we can postdict the data that's just arriving, then we can assume that our postdiction will be correct and then postdict the next data, and then the next data, etc., laboriously stepping forward one fraction-of-a-second at a time. Like GPT. Obviously we humans don't do that. Instead we understand what's going on in multiple ways, with multiple timescales. Like we can have a high-level plan "I'm gonna drive to the grocery store then buy bread"—a sequence of two things, but each takes many minutes. If that's the plan / expectation, but then I get to the grocery store and it's closed, then seeing the "CLOSED" sign doesn't just falsify a narrow postdiction about imminent sensory data, it also falsifies the high-level plan / expectation "I'm gonna drive to the grocery store then buy bread", which was until that moment still active and helping determine the lower-level postdictions. So anyway, we gradually learn more and more accurate high-level extended-duration patterns, and we can use those for long-term predictions.)

(Side note #2: Incidentally, TD learning itself is generally predictive not postdictive: we check the value of a state S, then use that information to take actions, then later on update the value function of state S based on what happened. So if, say, the value function calculation itself has some exotic side-channel method for manipulating what happens, there's a potential problematic incentive there, I think.)
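For concreteness, here is the standard tabular TD(0) update (textbook form, not anything specific to the AGI architecture I have in mind). The value of state S is read before acting and only updated afterwards from what actually happened, which is what makes it “predictive” in the sense above.

```python
# Standard tabular TD(0) update; V maps states to value estimates.
def td_update(V, s, reward, s_next, alpha=0.1, gamma=0.99):
    # The earlier prediction V[s] gets compared against what actually happened.
    td_error = reward + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return V

V = {"A": 0.0, "B": 0.0}
td_update(V, "A", reward=1.0, s_next="B")   # A's value is updated after the fact
```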

Easy win #2: Use backprop, or something functionally similar to backprop, but not trial-and-error, and not numerical differentiation

Background: “Differentiating through” things

(ML experts already know this and can skip this section.)

A hot thing in ML these days is the idea of “differentiating through” stuff. For example, “differentiable programming” is programming where you can “differentiate through” every step of the program. So what does that mean? I think it’s easiest to explain through an example.

Let’s say I write an image-rendering algorithm. The input of the algorithm is, let’s say, a large NN that parametrizes a 3D shape (somehow—don’t worry about the details). The output of the algorithm is a bunch of RGB values corresponding to a rendered image of that shape. If the algorithm is “differentiable”, that means that I can generate a differentiation routine that can automatically calculate things like

“The derivative of ‘Pixel #287 red channel’ in the output, with respect to neural net weight #98528, is -0.3472.”

Calculating things like that is useful because maybe we’re trying to make our rendered 3D shape match a target image. Then knowing the gradient allows us to do gradient descent, changing the NN weights to improve the match.

Anyway, in this case, you would say that you “differentiated through” the image-rendering algorithm. (So the cardinal example of “differentiable programming” is just a normal deep NN with backprop.)

The end result here is the same as you get in numerical differentiation—i.e., where you tweak each of the inputs by 0.00001 or whatever, re-run the algorithm, and see how much the outputs change. But differentiable programming is generally a much more efficient way to get the gradient than numerical differentiation. In numerical differentiation, if there are 1,000,000 NN weights, then you need to run the algorithm 1,000,001 times to get the full gradient. In differentiable programming, since we calculated the derivative symbolically, you wind up only needing one “forward pass” and one “backwards pass” to get the same information.
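Here is a minimal sketch of that comparison, assuming PyTorch as the autodiff engine and using a toy differentiable function as a stand-in for the renderer. Backprop recovers the full gradient from one forward pass and one backward pass; numerical differentiation needs an extra forward pass for every parameter.

```python
import torch

def render(weights):
    # Toy stand-in for a differentiable renderer: parameters in, scalar "image match" out.
    return torch.sin(weights).sum() + (weights ** 2).mean()

weights = torch.randn(1000, dtype=torch.float64, requires_grad=True)

# Backprop ("differentiating through" render): one forward pass + one backward pass.
loss = render(weights)
loss.backward()
grad_backprop = weights.grad.clone()

# Numerical differentiation: one extra forward pass per parameter.
eps = 1e-6
grad_numeric = torch.zeros(1000, dtype=torch.float64)
with torch.no_grad():
    base = render(weights)
    for i in range(1000):
        perturbed = weights.clone()
        perturbed[i] += eps
        grad_numeric[i] = (render(perturbed) - base) / eps

print(torch.allclose(grad_backprop, grad_numeric, atol=1e-4))  # expected: True
```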

Why is using backprop an “easy win” for safety?

In backprop, but not trial-and-error, and not numerical differentiation, we get some protection against things like row-hammering the supervisory signal. Why? See the “Differentiating through” background section just above: the idea is that we will only “differentiate through” the intended operations of the algorithm. The differentiation engine is essentially symbolic, so it won’t (and indeed can’t) “differentiate through” the effects of row-hammer or whatever. So even if the algorithm happens to stumble upon a successful row-hammering strategy, it is just as likely to drop it again, as to keep it and improve it. Trial-and-error or numerical differentiation, by contrast, would absolutely keep it and improve it.

Of course, row-hammer could also screw up the gradient calculation. But it won’t generally screw up the gradient calculation in a way that locks in the screwup and then goes searching for variations on the row-hammer that screw up the gradient calculation more consistently and controllably—as trial-and-error or numerical differentiation would. Well in any case, even if it doesn't 100% solve the problem, it seems like definitely an improvement.

(Side tangent: I feel like there’s some more general principle here? Trial-and-error & numerical differentiation pushes systems to “do the right thing”, but backprop pushes systems to “do the right thing for the right reason”, insofar as we can “differentiate through” only the types of processing that we endorse. That’s straightforward for row-hammer—nobody would ever differentiate through row-hammer!—but what are the other applications? Like, is there a way to “differentiate through” my AGI assistant being sincerely helpful without also “differentiating through” my AGI assistant getting a reward by deceptively telling me what I want to hear? I don’t have any ideas here; consider it a brainstorming prompt. See previous AGI-safety-oriented discussion of related topics here, here—search for the term “stop-gradients”.)
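As a concrete (if trivial) illustration of the “stop-gradient” idea, here is a PyTorch sketch in which we only “differentiate through” the path we endorse. The other path still contributes to the output, but backprop never sees it.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
endorsed = x ** 2               # a path we're happy to differentiate through
unendorsed = (x ** 3).detach()  # same value flows forward, but the gradient is stopped here
y = endorsed + unendorsed
y.backward()
print(x.grad)                   # tensor(4.): only d(x^2)/dx; the x^3 path is invisible to backprop
```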

Easy win #3: Don’t access the world-model and then act on that information, at least not without telling it

I’m proposing that we shouldn’t do this, because it creates a problematic drift of the AGI's beliefs towards things that actively manipulate the world via self-fulfilling prophecies. Or at the very least, the AGI would wind up very confused about how the world works.

Let’s say we give our postdictive learning algorithm a stock ticker and a bunch of information relevant to the stock price, and we secretly watch its stream of internal postdictions. Then we make decisions based on what we see:

  • If the learning algorithm keeps expecting to see the stock price go up, and keeps being surprised that it doesn't, then we humans start thinking the stock is undervalued, and we go buy a bunch of the stock—and this incidentally causes the price to actually go up.
  • If the learning algorithm keeps expecting to see the stock price go down, and keeps being surprised that it doesn't, then we humans start thinking that the stock is overvalued … and we don’t do anything, because let's say we just don't like shorting stocks.

We do this a bunch of times with a bunch of stocks. What happens? Well, whenever the algorithm expects a stock price to go down, it’s often surprised to see it go up instead, whereas whenever the algorithm expects a stock price to go up, it’s rarely surprised to see it go down instead. (Not never—it can still happen when it first starts looking at the stock, or if it suddenly changes its expectations about the stock's value—but comparatively rarely.) So the algorithm gradually stops ever predicting that the price of a stock will go down!

Thus the system will drift towards self-fulfilling prophecies.
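Here is a toy simulation of that feedback loop (the numbers and the update rule are made up purely for illustration): each stock’s “fundamental” move is zero-mean, but whenever the learner’s belief is positive we “buy” and push the price up, and the learner is trained only on what it then observes.

```python
import random

random.seed(0)
n_stocks, n_steps, lr = 20, 2000, 0.05
belief = [0.0] * n_stocks                   # the learner's expected price move per stock

for _ in range(n_steps):
    i = random.randrange(n_stocks)
    true_move = random.gauss(0.0, 1.0)      # underlying "fundamental" move, zero-mean
    if belief[i] > 0:                       # we secretly see optimism, so we buy...
        true_move += 0.5                    # ...which itself pushes the price up
    # (if belief[i] <= 0 we do nothing, since we "don't like shorting")
    belief[i] += lr * (true_move - belief[i])   # postdictive update on surprise

print(sum(b > 0 for b in belief), "of", n_stocks, "beliefs ended up positive")
```

Run it a few times: beliefs that drift positive get confirmed by our own buying and tend to stay positive, so most beliefs end up positive, which is exactly the “stops ever predicting a price drop” drift described above.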

Is that dangerous? Well, I can’t immediately come up with a catastrophe scenario that doesn’t sound far-fetched. But I do think it’s probably dangerous, and anyway it’s almost definitely not what we want. (The reasons will be even clearer after a later section.) So let’s not do that!

(You could imagine the same thing happening in humans. Imagine a witch is secretly reading your mind and taking impactful actions in the world based on your hunches and musings. You’d wind up with an awfully distorted view of how the world works, right?)

How exactly do we avoid this problem?

I see three options:

  • Don’t access the predictive model. Let it run for a while, then turn off learning forever, and go do Microscope AI with it or something.
  • Access the predictive model if you like, but don’t make decisions on the basis of what you learn, at least not in a way that can get back into the AI’s input data.
  • Access the predictive model, but tell it what you learn. Then it can learn the correct causal story rather than a distorted one. (“The stock went up because the humans looked at my beliefs and then bought the stock on that basis”, not “I thought the stock was gonna go up, and then it did just as I expected”.)

In the AGI architecture I’m thinking of, “humans looking directly into the AGI’s world-model and acting on it” is a weird use-case that wouldn’t normally come up. Normally the planning / acting part of the AGI is the one making decisions about what to say or do based on its current beliefs, and those plans / actions are visible to the world-model, and so we’re basically in the third bullet point.

(But then we should do the second bullet point too, for safety monitoring.)

Note that the third bullet point is not a rock-solid solution; yes the algorithm will presumably converge to correct beliefs, but it could also make mistakes, especially early on in its learning. But that’s nothing new; every aspect of the AGI’s world-model starts wrong and converges to correct beliefs. I don’t see this as particularly concerning—at least, not compared to other ways that a confused “infant” model-based RL AGI could thrash around and cause trouble.

For completeness: One more weird incentive to take purposeful real-world actions

I’ll mention one more possible incentive problem, the one I was thinking about way back when I wrote Self-Supervised Learning & Manipulative Predictions. It’s sorta like a weirder version of my stock ticker example above. Something like: imagine that the postdiction algorithm consists of imagining that one possibility happens, and assessing how “plausible” the resulting scenario feels; and then imagining the next possibility, etc. This procedure can introduce a weird sorta retrocausal reasoning error, where an immediate expectation gets penalized for leading to a priori implausible downstream consequences.

(Human example of this error: Imagine someone saying "If fast-takeoff AGI happens, then it would have bizarre consequence X, and there’s no way you really expect that to happen, right?!? So c’mon, there’s not really gonna be fast-takeoff AGI.". This is an error because if there’s a reason to expect fast-takeoff AGI, and fast-takeoff AGI leads to X, we should make the causal update (“X is more likely than I thought”), not the retrocausal update (“fast-takeoff AGI is less likely than I thought”). Well, probably. I guess on second thought it’s not always a reasoning error.)

I thought this problem was very important two years ago when I was thinking about pure self-supervised learning, because I was hoping for an iron-clad guarantee against incentives for manipulation, and the implicit retrocausality here undermines those arguments. But in the context I’m now thinking about, this seems like just another entry in the long boring list of possible reasoning errors and imperfect heuristics. And, like other reasoning errors and imperfect heuristics, I expect that it’s self-correcting—i.e., it would manifest more early in training, but gradually go away as the AGI learns meta-cognitive self-monitoring strategies. It doesn’t seem to have unusually dangerous consequences, compared to other things in that category, AFAICT.

Let’s switch now to harder (and weirder) aspects of the problem. But first I need another concept handle.

Background concept: “Within-universe processing” vs “4th-wall-breaking processing”

In my bad old post on AI self-awareness, mentioned above, my first and biggest mistake was using the term “self-awareness” in the first place. What I really wanted to talk about (and still want) is quite different from what anyone would understand as "self-awareness".

Trying for a better terminology this time: Let’s call it “within-universe processing” vs “4th-wall-breaking processing”.

Here’s my definition, then I’ll go through a bunch of examples.

When an algorithm is running, each step has two kinds of downstream chains of consequences.

  • There are within-universe consequences of a processing step, where the step causes things to happen entirely within the intended algorithm. (By "intended", I just mean that the algorithm is running without hardware errors). These same consequences would happen for the same reasons if we run the algorithm under homomorphic encryption in a sealed bunker at the bottom of the ocean.
  • Then there are 4th-wall-breaking consequences of a processing step, where the step has a downstream chain of causation that passes through things in the real world that are not within-universe. (I mean, yes, the chip’s transistors have real-world-impacts on each other, in a manner that implements the algorithm, but that doesn’t count as 4th-wall-breaking.)

And then related to that:

  • Within-universe processing is when no operation in an algorithm is likelier to happen vs not happen specifically because of anticipated 4th-wall-breaking consequences of that processing step.
  • Otherwise it’s 4th-wall-breaking processing.

The "anticipated" does not have to be an explicit prediction… Well, let’s just go through some examples:

  • A robot running a model-free RL algorithm is absolutely not doing within-universe processing. When you first turn it on, maybe it will do a randomly-selected operation that moves an actuator. If that move leads to reward, the robot would probably do that operation again in the future. And then the second time that operation happens, we can say that the operation is happening (at least in part) because of a likely 4th-wall-breaking consequence of that operation.
  • Any AGI doing row-hammer is almost definitely doing 4th-wall-breaking processing, unless it does it “accidentally” (as a random consequence of its data-processing operations) and, even after doing it accidentally, does not go on to do it more often than chance.
  • A simulated robot, running any of today’s RL algorithms, is doing within-universe processing.
  • …However, in the event that we’re all living in a simulation, the human brain algorithm would be 4th-wall-breaking in some cases—because I guess there are some people who take the Simulation Hypothesis so seriously that they sometimes take purposeful actions to try to manipulate the people (or aliens or whatever) who are running this hypothetical simulation. I’ll return to this example below.
  • Let’s take a postdictive learner, and grant it “self-awareness”, in the specific sense that we take each of its neural net activations when processing Frame N, and include them as extra postdiction targets attached to Frame N+1—sorta like holding up a mirror to a postdictive learner and saying “here, postdict the thing you’re looking at”. Is that going to be 4th-wall-breaking? Not necessarily! If the algorithm does an operation because of its likely effects on the postdiction targets it will face next frame, then that’s 4th-wall-breaking. But it’s also possible for the algorithm to just treat each new postdiction target as a new postdiction target, and try to postdict them each frame, without trying to manipulate them next frame. In that case, well, it might or might not converge to an accurate postdictive model of its own activation patterns. Probably it would get some simple patterns right, while constantly being surprised by at least some things, without ever converging to zero surprise. But whatever; the point is, in this case it would be doing in-universe processing, despite it being “self-aware” in a certain sense.

With that under our belt, let’s think about the harder parts of the “safety in postdictive learners” problem. Start with mesa-optimization.

Within-universe mesa-optimizers seem kinda inevitable, but also seem not dangerous

You can try to make an argument "My predictive learner is safe because it's not optimizing anything", or you can try to make an argument "My predictive learner is safe because it's not doing 4th-wall-breaking processing". I want to argue here that the latter is a more promising approach.

By and large, we expect learning algorithms to do (1) things that they’re being optimized to do, and (2) things that are instrumentally useful for what they’re being optimized to do, or tend to be side-effects of what they’re being optimized to do, or otherwise “come along for the ride”. Let’s call those things “incentivized”. Of course, it’s dicey in practice to declare that a learning algorithm categorically won’t ever do something, just because it’s not incentivized. Like, we may be wrong about what’s instrumentally useful, or we may miss part of the space of possible strategies, or maybe it’s a sufficiently simple thing that it can happen by random chance, etc. But if we want to avoid problematic behavior, it’s an awfully helpful first step to not incentivize that problematic behavior!

If we turn to the mesa-optimization literature (official version, friendly intro) the typical scenario is something like “we put an RNN into an RL algorithm, and do gradient descent on how much reward it gets”. In that scenario, there’s a strong argument that we should be concerned about dangerous mesa-optimizers, because the RL algorithm is providing a direct incentive to build dangerous mesa-optimizers. After all, foresighted planning is often a good way to get higher RL reward. Therefore if the network comes across a crappy “proto” foresighted planner, gradient descent will probably make it into a better and better foresighted planner.

I do think mesa-optimizers are directly incentivized for postdictive learners too. But not dangerous ones! Like, it’s technically a “mesa-optimizer” if gradient descent finds a postdiction algorithm that searches through different possible strategies to process data and generate a postdiction, judges their likely “success” according to some criterion, and then goes with the strategy that won that contest.
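To be concrete about what I mean, here is a minimal (and entirely hypothetical) sketch of that kind of benign “mesa-optimizer”: an inner search over candidate postdiction strategies, scored by an internal criterion that may only be a proxy for accuracy.

```python
# Hypothetical illustration of a benign inner search, not a real architecture.
history = [1, 2, 3, 4, 5, 6, 7]

candidate_strategies = {
    "repeat_last":   lambda xs: xs[-1],
    "linear_extrap": lambda xs: 2 * xs[-1] - xs[-2],
    "mean":          lambda xs: sum(xs) / len(xs),
}

def internal_score(strategy, xs):
    # Proxy criterion: how well the strategy "postdicts" the most recent point
    # from the points before it. An imperfect heuristic, but not a dangerous one.
    return -abs(strategy(xs[:-1]) - xs[-1])

best_name = max(candidate_strategies,
                key=lambda name: internal_score(candidate_strategies[name], history))
postdiction = candidate_strategies[best_name](history)
print(best_name, postdiction)   # the winning strategy and its postdiction
```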

But that’s fine!! It’s just a more complicated algorithm to generate a postdiction! And what about inner misalignment? Well, for example, “proxy alignment” would be if the “criterion” in question—the one used to judge the different candidate postdiction-generation strategies—is an imperfect proxy to postdiction accuracy. Well sure, that seems entirely possible. But that’s not dangerous either! We would just call it “an imperfect heuristic for generating postdictions”. Out-of-distribution, maybe the criterion in question diverges from a good postdiction-generation strategy. Oh well, it will make bad postdictions for a while, until gradient descent fixes it. That’s a capability problem, not a safety problem.

(Another example is: anything that looks kinda like probabilistic programming is arguably a mesa-optimizer, in that it’s searching over a space of generative models for one that best fits the data. I think GPT is a mesa-optimizer in this sense, and in fact I personally think future AGI world-models will be not just mesa-optimizers but optimizers, in that we humans will explicitly build probabilistic programming into them.)

It seems to me that the dangerous part is not mesa-optimization per se, but rather taking manipulative real-world actions. And it seems to me that “ruling out mesa-optimizers altogether” is a less promising approach than “ruling out 4th-wall-breaking processing”. Mesa-optimizers seem hard to avoid in that they’re both directly incentivized and helpful for AGI capabilities. Whereas 4th-wall-breaking processing is neither.

Harder problems

Hard problem #1: Spontaneous arising of 4th-wall-breaking processing

You might guess that I introduced the concept of within-universe-processing vs 4th-wall-breaking-processing because I was gearing up to argue that there’s some kind of barrier where 4th-wall-breaking processing won’t spontaneously pop out of within-universe processing. Like Hume’s is-ought separation or something.

Well, apparently not. It doesn’t seem to be a perfect barrier. I already mentioned the example of humans who believe the Simulation Hypothesis—I guess some of those people occasionally do things to try to manipulate whoever is running that simulation. So if the Simulation Hypothesis were true, the simulation-runners would see a whole lot of purely-in-universe processing, and then out of nowhere there’s sporadic 4th-wall-breaking processing. (By the way, I do not require the people to be successfully manipulating the simulation-runners for it to count as “4th-wall-breaking processing”. Just that they’re trying.)

The way I think of the Simulation Hypothesis example is: We humans are already modeling the world, and we are already choosing actions because of their expected consequences in the world. And we humans are also already uncertain about what the correct world-model is. So it’s not much of a leap to entertain the Simulation Hypothesis—i.e., the world has another level that we can’t directly see—and to start taking corresponding actions. It seems like a quite natural outgrowth of the types of processing our brains are always already doing.

I think it’s at least possible that there’s a greater barrier—like a greater leap through algorithm space—for a postdictive learner to jump from in-universe processing to 4th-wall-breaking processing. It seems like there’s no incentive whatsoever for a postdictive learner to have any concept that the data processing steps in the algorithm have any downstream impacts, besides, y’know, processing data within the algorithm. It seems to me like there’s a kind of leap to start taking downstream impacts to be a relevant consideration, and there’s nothing in gradient descent pushing the algorithm to make that leap, and there doesn’t seem to be anything about the structure of the domain or the reasoning it’s likely to be doing that would lead to making that leap, and it doesn’t seem like the kind of thing that would happen by random noise, I think.

Needless to say, this isn’t a rigorous argument, and I haven’t thought about it all that carefully, but I am slightly hopeful that a more rigorous argument exists.

Hard problem #2: Agents in the simulation

OK, then what about the possibility that there’s an agent in your simulation?

For example: let’s say that I personally am the kind of person who would read about the Simulation Hypothesis, and then speculate that maybe I’m running in a simulation, and then be on the lookout for evidence of that, and if I see any, then I would try to hack into the base-level reality. Then a postdictive learner whose world-model includes me might simulate me doing the same things, and in the course of “imagining” me performing particular operations, it might actually execute operations that row-hammer their way into taking control of the postdictive learner itself.

(Image: xkcd comic)

The logic here is pretty sound, I think. Still, I’m mildly optimistic about this not being a problem, for kinda more specific reasons.

For one thing, this is kinda a subset of the broader issue “the AGI needs to interact with adversarial agents—other AGIs, humans, acausally imagined distant superintelligences, whatever—who may try to manipulate it for their own ends”.

Granted, imagining an adversarial agent is a pretty weird threat surface. Normally people think that threats come from interacting with an adversarial agent! But whatever—insofar as it’s a legitimate threat surface, I imagine we can deal with it in the same way as other agential threats: shelter the AGI from threats while it’s still in early training (somehow), and eventually it will be smart enough and aligned enough to fend for itself, hopefully.

Again, as mentioned in the “Why do I care?” background section at the top, I’m thinking of My AGI Threat Model: Misaligned Model-Based RL Agent, where the postdictive learner is just one gear in an agential AGI. I think the predictive world-model does a relatively constrained amount of computation, and then interacts with the learned value function and planner, and then gets queried again. (Leaving out details, I know.) So a self-aware, aligned AGI could, and presumably would, figure out the idea “Don’t do a step-by-step emulation in your head of a possibly-adversarial algorithm that you don’t understand; or do it in a super-secure sandbox environment if you must”, as concepts encoded in its value function and planner. (Especially if we warn it / steer it away from that.)

(Somewhat related: we also need to warn it / steer it away from doing high-fidelity simulations of suffering minds.)

Conclusion: slight optimism (on this narrow issue)

I feel like there are a bunch of considerations pointing in the general direction of “if we’re careful, and develop best practices, and people follow them, then we can ensure that the predictive learning component of our AGI won’t do dangerous manipulative things in the course of understanding the world”. No I don’t have any proof, but someday when there’s more detail on the AGI design, I’m slightly optimistic that some of the considerations above could be fleshed out into a really solid argument. But I haven’t thought about this super-carefully and am open to disagreement. (I'm much more optimistic in the case where the predictive world-model is not a DNN, as mentioned above.)

Meanwhile, as I mentioned above, the “main” alignment problem discussed in My AGI Threat Model: Misaligned Model-Based RL Agent, the one involving the value function, is still a big giant urgent open problem. It seems to me that working on that is currently a far better use of my time.

Comments

By and large, we expect learning algorithms to do (1) things that they’re being optimized to do, and (2) things that are instrumentally useful for what they’re being optimized to do, or tend to be side-effects of what they’re being optimized to do, or otherwise “come along for the ride”. Let’s call those things “incentivized”. Of course, it’s dicey in practice to declare that a learning algorithm categorically won’t ever do something, just because it’s not incentivized. Like, we may be wrong about what’s instrumentally useful, or we may miss part of the space of possible strategies, or maybe it’s a sufficiently simple thing that it can happen by random chance, etc.

In the presence of deceptive alignment, approximately any goal is possible in this setting, not just the nearby instrumental proxies that you might be okay with. Furthermore, deception need not be 4th-wall-breaking, since the effect of deception on helping you do better in the training process entirely factors through the intended output channel. Thus, I would say that within-universe mesa-optimization can be arbitrarily scary if you have no way of ruling out deception.

Thanks!

The kind of incentive argument I'm trying to make here is "If the model isn't doing X, then by doing X a little bit it will score better on the objective, and by doing X more it will score even better on the objective, etc. etc." That's what I mean by "X is incentivized". (Or more generally, that gradient descent systematically tends to lead to trained models that do X.) I guess my description in the article was not great.

So in general, I think deceptive alignment is "incentivized" in this sense. I think that, in the RL scenarios you talked about in your paper, it's often the case that building a better and better deceptively-aligned mesa-optimizer will progressively increase the score on the objective function.

Then my argument here is that 4th-wall-breaking processing is not incentivized in that sense: if the trained model isn't doing 4th-wall-breaking processing at all right now, I think it does not do any better on the objective by starting to do a little bit of 4th-wall-breaking processing. (At least that's my hunch.)

(I do agree that if a deceptively-aligned mesa-optimizer with a 4th-wall-breaking objective magically appeared as the trained model, it would do well on the objective. I'm arguing instead that SGD is unlikely to create such a thing.)

Oh, I guess you're saying something different: that even a deceptive mesa-optimizer which is entirely doing within-universe processing is nevertheless scary. So that would by definition be an algorithm with the property "no operation in the algorithm is likelier to happen vs not happen specifically because of anticipated downstream chains of causation that pass through things in the real world". So I can say categorically: such an algorithm won't hurt anyone (except by freak accident), won't steal processing resources, won't intervene when I go for the off-switch, etc., right? So I don't see "arbitrarily scary", or scary at all, right? Sorry if I'm confused…

Oh, I guess you're saying something different: that even a deceptive mesa-optimizer which is entirely doing within-universe processing is nevertheless scary. So that would by definition be an algorithm with the property "no operation in the algorithm is likelier to happen vs not happen specifically because of anticipated downstream chains of causation that pass through things in the real world".

Yep, that's right.

So I can say categorically: such an algorithm won't hurt anyone (except by freak accident), won't steal processing resources, won't intervene when I go for the off-switch, etc., right?

No, not at all—just because an algorithm wasn't selected based on causing something to happen in the real world doesn't mean it won't in fact try to make things happen in the real world. In particular, the reason that I expect deception in practice is not primarily because it'll actually be selected for, but primarily just because it's simpler, and so it'll be found despite the fact that there wasn't any explicit selection pressure in favor of it. See: “Does SGD Produce Deceptive Alignment?”

I think you're misunderstanding (or I am).

I'm trying to make a two step argument:

(1) SGD under such-and-such conditions will lead to a trained model that does exclusively within-universe processing [this step is really just a low-confidence hunch but I'm still happy to discuss and defend it]

(2) trained models that do exclusively within-universe processing are not scary [this step I have much higher confidence in]

If you're going to disagree with (2), then SGD / "what the model was selected" for is not relevant.

"Doing exclusively within-universe processing" is a property of the internals of the trained model, not just the input-output behavior. If running the trained model involves a billion low-level GPU instructions, this property would correspond to the claim that each and every one of those billion GPU instructions is being executed for reasons that are unrelated to any anticipated downstream real-world consequences of that GPU instruction. (where "real world" = everything except the future processing steps inside the algorithm itself.)

I mean, I guess it depends on your definition of “unrelated to any anticipated downstream real-world consequences.” Does the reason “it's the simplest way to solve the problem in the training environment” count as “unrelated” to real-world consequences? My point is that it seems like it should, since it's just about description length, not real-world consequences—but that it could nevertheless yield arbitrarily bad real-world consequences.

I think it can be simultaneously true that, say:

  • "weight #9876 is 1.2345 because out of all possible models, the highest-scoring model is one where weight #9876 happens to be 1.2345"
  • "weight #9876 is 1.2345 because the hardware running this model has a RowHammer vulnerability, and this weight is part of a strategy that exploits that. (So in a counterfactual universe where we made chips slightly differently such that there was no such thing as RowHammer, then weight #9876 would absolutely NOT be 1.2345.)"

The second one doesn't stop being true because the first one is also true. They can both be true, right?

In other words, "the model weights are what they are because it's the simplest way to solve the problem" doesn't eliminate other "why" questions about all the details of the model. There's still some story about why the weights (and the resulting processing steps) are what they are—it may be a very complicated story, but there should (I think) still be a fact of the matter about whether that story involves "the algorithm itself having downstream impacts on the future in non-random ways that can't be explained away by the algorithm logic itself or the real-world things upstream of the algorithm". Or something like that, I think.

Sure, that's fair. But in the post, you argue that this sort of non-in-universe-processing won't happen because there's no incentive for it:

It seems like there’s no incentive whatsoever for a postdictive learner to have any concept that the data processing steps in the algorithm have any downstream impacts, besides, y’know, processing data within the algorithm. It seems to me like there’s a kind of leap to start taking downstream impacts to be a relevant consideration, and there’s nothing in gradient descent pushing the algorithm to make that leap, and there doesn’t seem to be anything about the structure of the domain or the reasoning it’s likely to be doing that would lead to making that leap, and it doesn’t seem like the kind of thing that would happen by random noise, I think.

However, if there's another “why” for why the model is doing non-in-universe-processing that is incentivized—e.g. simplicity—then I think that makes this argument no longer hold.

One thing is, I'm skeptical that a deceptive non-in-universe-processing model would be simpler for the same performance. Or at any rate, there's a positive case for the simplicity of deceptive alignment, and I find that case very plausible for RL robots, but I don't think it applies to this situation. The positive case for simplicity of deceptive models for RL robots is something like (IIUC):

The robot is supposed to be really good at manufacturing widgets (for example), and that task requires real-world foresighted planning, because sometimes it needs to substitute different materials, negotiate with suppliers and customers, repair itself, etc. Given that the model definitely needs to have capability of real-world foresighted planning and self-awareness and so on, the simplest high-performing model is plausibly one that applies those capabilities towards a maximally simple goal, like "making its camera pixels all white" or whatever, and then that preserves performance because of instrumental convergence.

(Correct me if I'm misunderstanding!)

If that's the argument, it seems not to apply here, because this task doesn't require real-world foresighted planning.

I expect that a model that can't do any real-world planning at all would be simpler than a model that can. In the RL robot example, it doesn't matter, because a model that can't do any real-world planning at all would do terribly on the objective, so who cares if it's simpler. But here, it would be equally good at the objective, I think, and simpler.

(A possible objection would be: "real-world foresighted planning" isn't a separate thing that adds to model complexity, instead it naturally falls out of other capabilities that are necessary for postdiction like "building predictive models" and "searching over strategies" and whatnot. I think I would disagree with that objection, but I don't have great certainty here.)

(A possible objection would be: "real-world foresighted planning" isn't a separate thing that adds to model complexity, instead it naturally falls out of other capabilities that are necessary for postdiction like "building predictive models" and "searching over strategies" and whatnot. I think I would disagree with that objection, but I don't have great certainty here.)

Yup, that's basically my objection.

A common trope is that brains are trained on prediction. Well, technically, I claim it would be more accurate to say that they’re trained on postdiction. Like, let’s say I open a package, expecting to see a book, but it’s actually a screwdriver. I’m surprised, and I immediately update my world-model to say "the box has a screwdriver in it".

I would argue that the book-expectation is a prediction, and the surprise you experience is a result of low mutual information between your retinal activation patterns and the book-expectation in your head. That surprise (prediction error) is the learning signal that propagates up to your more abstract world-model, which updates into a state consistent with "the box has a screwdriver in it".

During this process, there was a moment when I was just beginning to parse the incoming image of the newly-opened box, expecting to see a book inside. A fraction of a second later, my brain recognizes that my expectation was wrong, and that’s the “surprise”. So in other words, my brain had an active expectation about something that had already happened—about photons that had by then already arrived at my retina—and that expectation was incorrect, and that’s what spurred me to update my world-model. Postdiction, not prediction.

Right, but the part of your brain that had that high-level model of "there is a book in the box" had at that time not received contradictory input from the lower-level edge detection / feature extraction areas. The abstract world-model does not directly predict retinal activations, it predicts the activations of lower-level sensory processing areas, which in turn predict the activations of the retina, cochlea, etc. There is latency in this system, so the signal takes a bit of time to filter up from the retinas to lower-level visual areas to your world-model. I don't think 'post-diction' makes sense in this context, as each brain region is predicting the activations of the one below it, and updates its state when those predictions are wrong.

(Also, I think Easy Win #3 is a really good point for Predict-O-Matic-esque systems)

I think you're interpreting "prediction" and "postdiction" differently than me.

Like, let's say GPT-3 is being trained to guess the next word of a text. You mask (hide) the next word, have GPT-3 guess it, and then compare the masked word to the guess and make an update.

I think you want to call the guess a "prediction" because from GPT-3's perspective, the revelation of the masked data is something that hasn't happened yet. But I want to call the guess a "postdiction" because the masked data is already "locked in" at the time that the guess is formed. The latter is relevant when we're thinking about incentives to form self-fulfilling prophecies.

Incidentally, to be clear, people absolutely do make real predictions constantly. I'm just saying we don't train on those predictions. I'm saying that by the time the model update occurs, the predictions have already been transmuted into postdictions, because the thing-that-was-predicted has now already been "locked in".

(Sorry if I'm misunderstanding.)

Nope, that's an accurate representation of my views. If "postdiction" means "the machine has no influence over its sensory input", then yeah, that's a really good idea.

There are 2 ways to reduce prediction error: change your predictions, or act upon the environment to make your predictions come true. I think the agency of an entity depends on how much of each it does. An entity with no agency would have no influence over its sensory inputs, instead opting to update beliefs in the face of prediction error. Taking agency from AIs is a good idea for safety.

Scott Alexander recently wrote about a similar quantity being encoded in humans through the 5-HT1A / 5-HT2A receptor activation ratio: link

5 years later, I'm finally reading this post. Thanks for the extended discussions of postdictive learning; it's really relevant to my current thinking about alignment for potentially simulator-like Language Models.

Note that others disagree, e.g. advocates of Microscope AI.

I don't think advocates of Microscope AI think you can reach AGI that way. More that through Microscope AI, we might end up solving the problems we have without relying on an agent.

Why? Because in predictive training, the system can (under some circumstances) learn to make self-fulfilling prophecies—in other words, it can learn to manipulate the world, not just understand it. For example see Abram Demski’s Parable of the Predict-O-Matic. In postdictive training, the answer is already locked in when the system is guessing it, so there’s no training incentive to manipulate the world. (Unless it learns to hack into the answer by row-hammer or whatever. I’ll get back to that in a later section.)

Agreed, but I think you could be even clearer that the real point is that postdiction can never causally influence the output. As you write, there are cases and versions where prediction also has this property, but it's not a guarantee by default.

As for the actual argument, that's definitely part of my reasoning why I don't expect GPT-N to have deceptive incentives (although maybe what it simulates would have).

In backprop, but not trial-and-error, and not numerical differentiation, we get some protection against things like row-hammering the supervisory signal.

Even after reading the wikipedia page, it's not clear to me what "row-hammering the supervisory signal" would look like. Notably, I don't see the analogy to the electrical interaction here. Or do you mean literally that the world-model uses row-hammer on the computer it runs, to make the supervisory signal positive?

The differentiation engine is essentially symbolic, so it won’t (and indeed can’t) “differentiate through” the effects of row-hammer or whatever.

No idea what this means. If row-hammering (or whatever) improves the loss, then the gradient will push in that direction. I feel like the crux is in the specific way you imagine row-hammering happening here, so I'd like to know more about it.

Easy win #3: Don’t access the world-model and then act on that information, at least not without telling it

Slight nitpicking, but this last one doesn't sound like an easy win to me -- just an argument for not using a naive safety strategy. I mean, it's not like we really get anything in terms of safety; we just don't mess up the capabilities of the model completely.

(Human example of this error: Imagine someone saying "If fast-takeoff AGI happens, then it would have bizarre consequence X, and there’s no way you really expect that to happen, right?!? So c’mon, there’s not really gonna be fast-takeoff AGI.". This is an error because if there’s a reason to expect fast-takeoff AGI, and fast-takeoff AGI leads to X, we should make the causal update (“X is more likely than I thought”), not the retrocausal update (“fast-takeoff AGI is less likely than I thought”). Well, probably. I guess on second thought it’s not always a reasoning error.)

I see what you did there. (Joke apart, that's a telling example)

And, like other reasoning errors and imperfect heuristics, I expect that it’s self-correcting—i.e., it would manifest more early in training, but gradually go away as the AGI learns meta-cognitive self-monitoring strategies. It doesn’t seem to have unusually dangerous consequences, compared to other things in that category, AFAICT.

One way to make this argument more concrete relies on saying that solving this problem helps capabilities as well as safety. So as long as what we're worried about is a very capable AGI, this should be mitigated.

  • There are within-universe consequences of a processing step, where the step causes things to happen entirely within the intended algorithm. (By "intended", I just mean that the algorithm is running without hardware errors). These same consequences would happen for the same reasons if we run the algorithm under homomorphic encryption in a sealed bunker at the bottom of the ocean.
  • Then there are 4th-wall-breaking consequences of a processing step, where the step has a downstream chain of causation that passes through things in the real world that are not within-universe. (I mean, yes, the chip’s transistors have real-world-impacts on each other, in a manner that implements the algorithm, but that doesn’t count as 4th-wall-breaking.)

This distinction makes some sense to me, but I'm confused by your phrasing (and thus by what you actually mean). I guess my issue is that stating it like that made me think that you expected processing steps to be one or the other, whereas I can't imagine any processing step without 4th-wall-breaking consequences. What you do with these, about whether the 4th-wall-breaking consequences are reasons for specific actions, makes it clearer IMO.

Out-of-distribution, maybe the criterion in question diverges from a good postdiction-generation strategy. Oh well, it will make bad postdictions for a while, until gradient descent fixes it. That’s a capability problem, not a safety problem.

Agreed. Though, as Evan already pointed out, the real worry with mesa-optimizers isn't proxy alignment but deceptive alignment. And deceptive alignment isn't just a capability problem.

Another way I've been thinking about the issue of mesa-optimizers in GPT-N is the risk of something like malign agents in the models (a bit like this) that GPT-N might be using to simulate different texts. (Oh, I see you already have a section about that)

It seems like there’s no incentive whatsoever for a postdictive learner to have any concept that the data processing steps in the algorithm have any downstream impacts, besides, y’know, processing data within the algorithm. It seems to me like there’s a kind of leap to start taking downstream impacts to be a relevant consideration, and there’s nothing in gradient descent pushing the algorithm to make that leap, and there doesn’t seem to be anything about the structure of the domain or the reasoning it’s likely to be doing that would lead to making that leap, and it doesn’t seem like the kind of thing that would happen by random noise, I think.

Precisely because I share this intuition, I want to try pushing back against it.

First, I don't see any reason why a sufficiently advanced postdictive learner with a general enough modality (like text) wouldn't learn to model 4th-wall-breaking consequences: that's just the sort of thing you need to predict security exploits or AI alignment posts like this one.

Next comes the question of whether it will take advantage of this. Well, a deceptive mesa-optimizer would have an incentive to use it. So I guess the question boils down to the previous discussion, of whether we should expect postdictive learners to spin up deceptive mesa-optimizers.

So a self-aware, aligned AGI could, and presumably would, figure out the idea “Don’t do a step-by-step emulation in your head of a possibly-adversarial algorithm that you don’t understand; or do it in a super-secure sandbox environment if you must”, as concepts encoded in its value function and planner. (Especially if we warn it / steer it away from that.)

I see a thread here of turning potential safety issues into capability issues, and then saying that since the AGI is competent, it will not have them. I think this makes sense for a really competent AGI, which would not be taken over by budding agents inside its simulation. But there's still the risk of spinning up agents early in training, and if those agents get good enough to take over the model from the inside and become deceptive, competence at the training task becomes decorrelated from what happens in deployment.

Thanks!

Or do you mean literally that the world-model uses row-hammer on the computer it runs, to make the supervisory signal positive?

Yes!

If row-hammering (or whatever) improves the loss, then the gradient will push in that direction.

I don't think this is true in the situation I'm talking about ("literally that the world-model uses row-hammer on the computer it runs, to make the supervisory signal positive").

Let's say we have weights θ, and loss is nominally the function f(θ), but the actual calculated loss is F(θ). Normally f(θ)=F(θ), but there are certain values of θ for which merely running the trained model corrupts the CPU, and thus the bits in the loss register are not what they're supposed to be according to the nominal algorithm. In those cases f(θ)≠F(θ).

Anyway, when the computer does symbolic differentiation / backprop, it's calculating ∇f, not ∇F. So it won't necessarily walk its way towards the minimum of F.
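For concreteness, here's a minimal sketch of that point in PyTorch, with a toy quadratic standing in for f: even if the value sitting in the loss register gets clobbered after the forward pass (so the stored number is F(θ) rather than f(θ)), backprop still hands back ∇f(θ), because it differentiates the recorded graph rather than the stored value.

```python
import torch

theta = torch.tensor([1.0, 2.0], requires_grad=True)

# f(theta): the loss as written in the training code (the nominal function).
loss = (theta ** 2).sum()

# Stand-in for the hardware corruption: overwrite the value in the "loss
# register" so it no longer equals f(theta); call the corrupted value F(theta).
loss.data.fill_(999.0)

# Backprop differentiates the recorded graph of f, not the corrupted value,
# so we get grad f(theta) = 2*theta, untouched by the corruption.
loss.backward()
print(theta.grad)  # tensor([2., 4.])
```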

I can't imagine any processing step without 4th-wall-breaking consequences

Oh yeah, for sure. My idea was: sometimes the 4th-wall-breaking consequences are part of the reason that the processing step is there in the first place, and sometimes the 4th-wall-breaking consequences are just an incidental unintended side-effect, sorta an "externality".

Like, as the saying goes, maybe a butterfly flapping its wings in Mexico will cause a tornado in Kansas three months later. But that's not why the butterfly flapped its wings. If I'm working on the project of understanding the butterfly—why does it do the things it does? why is it built the way it's built?—knowing that there was a tornado in Kansas is entirely unhelpful. It contributes literally nothing whatsoever to my success in this butterfly-explanation project.

So by the same token, I think it's possible that we can work on the project of understanding a postdictively-trained model—why does it do the things it does? why is it built the way it's built?—and find that thinking about the 4th-wall-breaking consequences of the processing steps is entirely unhelpful for this project.

I don't see any reason why a sufficiently advanced postdictive learner with a general enough modality (like text) wouldn't learn to model 4th-wall-breaking consequences: that's just the sort of thing you need to predict security exploits or AI alignment posts like this one.

Of course a good postdictive learner will learn that other algorithms can be manipulative, and it could even watch itself in a mirror and understand the full range of things that it could do (see the part of this post "Let’s take a postdictive learner, and grant it “self-awareness”…"). Hmm, maybe the alleged mental block I have in mind is something like "treating one's own processing steps as being part of the physical universe, as opposed to taking the stance where you're trying to understand the universe from outside it". I think an algorithm could predict that security researchers can find security exploits, and predict that AI alignment researchers could write comments like this one, while nevertheless "trying to understand the universe from outside it".

there's still the risk of spinning up agents early in training

Oh yeah, for sure. In fact I think there are a lot of areas where we need to develop safety-compatible motivations as soon as possible, and where there's some kind of race to do so (see the "Fraught Valley" section here). I mean, "hacking into the training environment" is in that category too: you want to install the safety-compatible motivation (where the model doesn't want to hack into the training environment) before the model becomes a superintelligent adversary trying to hack into the training environment. I don't like those kinds of races and wish I had better ideas for avoiding them.

Let's say we have weights θ, and loss is nominally the function f(θ), but the actual calculated loss is F(θ). Normally f(θ)=F(θ), but there are certain values of θ for which merely running the trained model corrupts the CPU, and thus the bits in the loss register are not what they're supposed to be according to the nominal algorithm. In those cases f(θ)≠F(θ).

Anyway, when the computer does symbolic differentiation / backprop, it's calculating ∇f, not ∇F. So it won't necessarily walk its way towards the minimum of F

Explained like that, it makes sense. And that's something I hadn't thought about.

So by the same token, I think it's possible that we can work on the project of understanding a postdictively-trained model—why does it do the things it does? why is it built the way it's built?—and find that thinking about the 4th-wall-breaking consequences of the processing steps is entirely unhelpful for this project.

Completely agree. This is part of my current reasoning for why GPT-3 (and maybe GPT-N) isn't incentivized toward predict-o-matic behavior.

Hmm, maybe the alleged mental block I have in mind is something like "treating one's own processing steps as being part of the physical universe, as opposed to taking the stance where you're trying to understand the universe from outside it". I think an algorithm could predict that security researchers can find security exploits, and predict that AI alignment researchers could write comments like this one, while nevertheless "trying to understand the universe from outside it".

I'm confused by that paragraph: in one sentence you sound like you're saying that the postdictive learner would not see itself as outside the universe, and in the next that it would. Either way, it seems linked to the 1st-person problem we're discussing in your research update: this is a situation where you seem to expect that the translation into 1st-person knowledge isn't automatic, and so can be controlled, incentivized or not.

I feel scooped by this post! :)  I was thinking along different lines - using induction (postdictive learning) to get around Goodhart's law specifically by using the predictions outside of their nominal use case. But now I need to go back and think more about self-fulfilling prophecies and other sorts of feedback.

Maybe I'll try to get you to give me some feedback later this week.

Sounds interesting!