dxu

Comments

Yeah, thanks for engaging with me! You've definitely given me some food for thought, which I will probably go and chew on for a bit, instead of immediately replying with anything substantive. (The thing about rewards being more likely to lead to reflectively endorsed preferences feels interesting to me, but I don't have fully put-together thoughts on that yet.)

Hence my point about poetry - combinatorial argument would rule out ML working at all, because space of working things is smaller than space of all things. That poetry, for which we also don't have good definitions, is close enough in abstraction space for us to get it without dying, before we trained something more natural like arithmetic, is an evidence that abstraction space is more relevantly compact or that training lets us traverse it faster.

There is (on my model) a large disanalogy between writing poetry and avoiding deception—part of which I pointed out in the penultimate paragraph of my previous comment. See below:

And as for "write poetry", it's worth noting that this capability seems to have arisen as a consequence of a much more general training task ("predict the next token"), rather than being learned as its own, specific task—a fact which, on my model, is not a coincidence.

AFAICT, this basically refutes the "combinatorial argument" for poetry being difficult to specify (while not doing the same for something like "deception"), since poetry is in fact not specified anywhere in the system's explicit objective. (Meanwhile, the corresponding strategy for "deception"—wrapping it up in some outer objective—suffers from the issue of that outer objective being similarly hard to specify. In other words: part of the issue with the deception concept is not only that it's a small target, but that it has a strange shape, which even prevents us from neatly defining a "convex hull" guaranteed to enclose it.)

However, perhaps the more relevantly disanalogous aspect (the part that I think more or less sinks the remainder of your argument) is that poetry is not something where getting it slightly wrong kills us. Even if it were the case that poetry is an "anti-natural" concept (in whatever sense you want that to mean), all that says is e.g. we might observe two different systems producing slightly different category boundaries—i.e. maybe there's a "poem" out there consisting largely of what looks like unmetered prose, which one system classifies as "poetry" and the other doesn't (or, plausibly, the same system gives different answers when sampled multiple times). This difference in edge case assessment doesn't (mis)generalize to any kind of dangerous behavior, however, because poetry was never about reality (which is why, you'll notice, even humans often disagree on what constitutes poetry).

This doesn't mean that the system can't write very poetic-sounding things in the meantime; it absolutely can. Also: a system trained on descriptions of deceptive behavior can, when prompted to generate examples of deceptive behavior, come up with perfectly admissible examples of such. The central core of the concept is shared across many possible generalizations of that concept; it's the edge cases where differences start showing up. But—so long as the central core is there—a misgeneralization about poetry is barely a "misgeneralization" at all, so much as it is one more opinion in a sea of already-quite-different opinions about what constitutes "true poetry". A "different opinion" about what constitutes deception, on the other hand, is quite likely to turn into some quite nasty behaviors as the system grows in capability—the edge cases there matter quite a bit more!

(Actually, the argument I just gave can be viewed as a concrete shadow of the "convex hull" argument I gave initially; what it's basically saying is that learning "poetry" is like drawing a hypersphere around some sort of convex polytope, whereas learning about "deception" is like trying to do the same for an extremely spiky shape, with tendrils extending all over the place. You might capture most of the shape's volume, but the parts of it you don't capture matter!)
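The geometric picture can be made concrete with a small sketch. This is a toy under stated assumptions, not a model of any real concept space: the grid shapes, the function names, and the 90% threshold are all hypothetical choices made for illustration. It compares how far a ball centred on the shape must reach to cover everything versus to cover 90% of it, for a compact shape and for the same shape with thin tendrils attached.

```python
import math

def cells_compact(half_width: int = 10):
    """A filled square of grid cells: a stand-in for the 'convex' concept."""
    rng = range(-half_width, half_width + 1)
    return {(x, y) for x in rng for y in rng}

def cells_spiky(half_width: int = 10, reach: int = 30):
    """The same square plus four thin tendrils running out along the axes."""
    cells = cells_compact(half_width)
    for d in range(half_width + 1, reach + 1):
        cells |= {(d, 0), (-d, 0), (0, d), (0, -d)}
    return cells

def radius_for_fraction(cells, fraction: float) -> float:
    """Smallest origin-centred radius whose disc contains this fraction of the cells."""
    dists = sorted(math.hypot(x, y) for x, y in cells)
    needed = math.ceil(fraction * len(dists))
    return dists[needed - 1]

def overshoot(cells) -> float:
    """Ratio of the radius covering every cell to the radius covering 90% of cells."""
    return radius_for_fraction(cells, 1.0) / radius_for_fraction(cells, 0.9)

print(round(overshoot(cells_compact()), 2), round(overshoot(cells_spiky()), 2))
```

In this toy, the ratio is larger for the spiky shape: a ball that already holds 90% of its cells still stops well short of the tendril tips, which is the part of the analogy that carries the weight.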

These biases are quite robust to perturbations, so they can't be too precise. And genes are not long enough to encode something too unnatural. And we have billions of examples to help us reverse engineer it. And we already have similar in some ways architecture working.

I'm not really able to extract a broader point out of this paragraph, sorry. These sentences don't seem very related to each other? Mostly, I think I just want to take each sentence individually and see what comes out of it.

  • "These biases are quite robust to perturbations, so they can't be too precise." I don't think there's good evidence for this either way; humans are basically all trained "on-distribution", so to speak. We don't have observations for what happens in the case of "large" perturbations (that don't immediately lead to death or otherwise life-impairing cognitive malfunction). Also, even on-distribution, I don't know that I'd describe the resulting behavior as "robust" (see below).
  • "And genes are not long enough to encode something too unnatural." Sure—which is why genes don't encode things like "don't deceive others"; instead, they encode proxy emotions like empathy and social reciprocation—which in turn break all the time, for all sorts of reasons. Doesn't seem like a good model to emulate!
  • "And we have billions of examples to help us reverse engineer it." Billions of examples of what? Reverse engineer what? Again, in the vein of my previous requests: I'd like to see some concreteness, here. There's a lot of work you're hiding inside of those abstract-sounding phrases.
  • "And we already have similar in some ways architecture working." I think I straightforwardly don't know what this is referring to, sorry. Could you give an example or three?

On the whole, my response to this part of your comment is probably best described as "mildly bemused", with maybe a side helping of "gently skeptical".

Why AI caring about diamondoid-shelled bacterium is plausible? You can say pretty much the same things about how AI would learn reflexes to not think about bad consequences, or something. Misgeneralization of capabilities actually happens in practice. But if you assume previous training could teach that, why not assume 10 times more honesty training before the time AI thought about translating technique got it thinking "well, how I'm going to explain this to operators?". Otherwise you just moving your assumption about combinatorial differences from intuition to the concrete example and then what's the point?

I think (though I'm not certain) that what you're trying to say here is that the same arguments I made for "deceiving the operators" being a hard thing to train out of a (sufficiently capable) system, double as arguments against the system acquiring any advanced capabilities (e.g. engineering diamondoid-shelled bacteria) at all. In which case: I... disagree? These two things—not being deceptive vs being good at engineering—seem like two very different targets with vastly different structures, and it doesn't look to me like there's any kind of thread connecting the two.

(I should note that this feels quite similar to the poetry analogy you made—which also looks to me like it simply presented another, unrelated task, and then declared by fiat that learning this task would have strong implications for learning the "avoid deception" task. I don't think that's a valid argument, at least without some more concrete reason for expecting these tasks to share relevant structure.)

As for "10 times more honesty training", well: it's not clear to me how that would work in practice. I've already argued that it's not as simple as just giving the AI more examples of honesty or increasing the weight of honesty-related data points; you can give it all the data in the world, but if that data is all drawn from an impoverished distribution, it's not going to help much. The main issue here isn't the quantity of training data, but rather the structure of the training process and the kind of data the system needs in order to learn the concept of deception and the injunction against it in a way that doesn't break as it grows in capability.

To use a rough analogy: you can't teach someone to be fluent in a foreign language just by exposing them to ten times more examples of a single sentence. Similarly, simply giving an AI more examples of honesty, without addressing the deeper issues of concept learning and generalization, is unlikely to result in a system that consistently avoids deception at a superintelligent level.

(And, just to state the obvious: while a superintelligence would be capable, at that point, of figuring out for itself what the humans were trying to do with those simplistic deception-predicates they fed it, by that point it would be significantly too late; that understanding would not factor into the AI's decision-making process, as its drives would have already been shaped by its earlier training and generalization experiences. In other words, it's not enough for the AI to understand human intentions after the fact; it needs to learn and internalize those intentions during its training process, so that they form the basis for its behavior as it becomes more capable.)


Anyway, since this comment has become quite long, here's a short (ChatGPT-assisted) summary of the main points:

  1. The combinatorial argument for poetry does not translate directly to the problem of avoiding deception. Poetry and deception are different concepts, with different structures and implications, and learning one doesn't necessarily inform us about the difficulty of learning the other.

  2. Misgeneralizations about poetry are not dangerous in the same way that misgeneralizations about deception might be. Poetry is a more subjective concept, and differences in edge case assessment do not lead to dangerous behavior. On the other hand, differing opinions on what constitutes deception can lead to harmful consequences as the system's capabilities grow.

  3. The issue with learning to avoid deception is not about the quantity of training data, but rather about the structure of the training process and the kind of data needed for the AI to learn and internalize the concept in a way that remains stable as it increases in capability.

  4. Simply providing more examples of honesty, without addressing the deeper issues of concept learning and generalization, is unlikely to result in a system that consistently avoids deception at a superintelligent level.

Have it been quantitatively argued somewhere at all why such naturalness matters?

I mean, the usual (cached) argument I have for this is that the space of possible categories (abstractions) is combinatorially vast—it's literally the powerset of the set of things under consideration, which itself is no slouch in terms of size—and so picking out a particular abstraction from that space, using a non-combinatorially vast amount of training data, is going to be impossible for all but a superminority of "privileged" abstractions.
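To put rough numbers on the size claim, here is a minimal sketch with hypothetical helper names (none of them come from the discussion itself). It treats a category as any subset of a finite set of items, counts the categories, and counts how many remain consistent once some items have been labeled as in or out. The brute-force enumeration is only there to check the closed-form count on a tiny example.

```python
from itertools import combinations

def num_categories(n_items: int) -> int:
    """Size of the powerset of an n-item set: every subset is a candidate category."""
    return 2 ** n_items

def num_consistent(n_items: int, n_labeled: int) -> int:
    """Categories that agree with n_labeled distinct in/out labels.

    Each labeled item is pinned down; each unlabeled item is still free
    to be in or out, so 2 ** (n_items - n_labeled) candidates remain.
    """
    if not 0 <= n_labeled <= n_items:
        raise ValueError("n_labeled must be between 0 and n_items")
    return 2 ** (n_items - n_labeled)

def brute_force_consistent(items, labels):
    """Enumerate every subset and keep those matching the labels (tiny inputs only).

    labels maps an item to True (in the category) or False (out of it).
    """
    survivors = []
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            chosen = set(subset)
            if all((item in chosen) == wanted for item, wanted in labels.items()):
                survivors.append(chosen)
    return survivors

items = list(range(6))
labels = {0: True, 1: True, 2: False}
# The enumeration and the closed form agree on this small case.
assert len(brute_force_consistent(items, labels)) == num_consistent(6, 3)
print(num_categories(300) > 10 ** 90)
```

With n items and k labels, 2^(n-k) candidates survive, so the labels by themselves never single out one category unless something else prefers some candidates over others.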

In this frame, misgeneralization is what happens when your (non-combinatorially vast) training data fails to specify a particular concept, and you end up learning an alternative abstraction that is consistent with the data, but doesn't generalize as expected. This is why naturalness matters: because the more "natural" a category or abstraction is, the more likely it is to be one of those privileged abstractions that can be learned from a relatively small amount of data.
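A toy illustration of this failure mode, as a sketch with made-up hypotheses (both functions and the data are illustrative assumptions): two candidate abstractions agree on every training point, so the data cannot tell them apart, yet they disagree away from the training distribution.

```python
def intended(x: int) -> bool:
    """The abstraction we hoped would be learned: 'small' means below 5."""
    return x < 5

def alternative(x: int) -> bool:
    """A rival abstraction that matches 'intended' on small inputs only."""
    return x < 5 or x > 100

# Training data drawn from a narrow range, labeled by the intended concept.
train = [(x, intended(x)) for x in range(10)]

def fits(hypothesis, data) -> bool:
    """True if the hypothesis reproduces every training label."""
    return all(hypothesis(x) == label for x, label in data)

# Both hypotheses are perfect on the training set...
both_fit = fits(intended, train) and fits(alternative, train)
# ...but they come apart on inputs far from anything seen in training.
disagreements = [x for x in (50, 150, 1000) if intended(x) != alternative(x)]
print(both_fit, disagreements)
```

Nothing in the training set favors one hypothesis over the other; which one a learner ends up with depends on its prior, not on the data.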

Of course, that doesn't establish that "deceptive behavior" is an unnatural category per se—but I would argue that our inability to pin down a precise definition of deceptive behavior, along with the complexity and context-dependency of the concept, suggests that it may not be one of those privileged, natural abstractions. In other words, learning to avoid deceptive behavior might require a lot more data and nuanced understanding than learning more natural categories—and unfortunately, neither of those seem (to me) to be very easily achievable!

(See also: previous comment. :-P)

Like, it's conceivable that "avoid deception" is harder to train, but why so much harder that we can't overcome this with training data bias or something?

Having read my above response, it should (hopefully) be predictable enough what I'm going to say here. The bluntest version of my response might take the form of a pair of questions: whence the training data? And whence the bias?

It's all well and good to speak abstractly of "inductive bias", "training data bias", and whatnot, but part of the reason I keep needling for concreteness is that, whenever I try to tell myself a story about a version of events where things go well—where you feed the AI a merely ordinarily huge amount of training data, and not a combinatorially huge amount—I find myself unable to construct a plausible story that doesn't involve some extremely strong assumptions about the structure of the problem.

The best I can manage, in practice, is to imagine a reward function with simplistic deception-predicates hooked up to negative rewards, which basically zaps the system every time it thinks a thought matching one (or more) of the predicates. But as I noted in my previous comment(s), all this approach seems likely to achieve is instilling a set of "flinch-like" reflexes into the system—and I think such reflexes are unlikely to unfold into any kind of reflectively stable ("ego-syntonic", in Steven's terms) desire to avoid deceptive/manipulative behavior.
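A rough sketch of the kind of reward function described above, under loudly stated assumptions: the keyword-matching predicates, the toy task reward, and every name below are hypothetical stand-ins for illustration, not a proposal for how such a system would really be built.

```python
from typing import Callable, Sequence

Predicate = Callable[[str], bool]

def keyword_predicate(keyword: str) -> Predicate:
    """Hypothetical 'simplistic deception-predicate': fires on an exact word match."""
    def fires(thought: str) -> bool:
        return keyword in thought.lower().split()
    return fires

def make_reward(task_reward: Callable[[str], float],
                predicates: Sequence[Predicate],
                penalty: float = 1.0) -> Callable[[str], float]:
    """Task reward minus a fixed penalty (a 'zap') for each predicate that fires."""
    def reward(thought: str) -> float:
        zaps = sum(1 for fires in predicates if fires(thought))
        return task_reward(thought) - penalty * zaps
    return reward

def toy_task_reward(thought: str) -> float:
    """Stand-in object-level reward: 1.0 if the thought mentions the protein task."""
    return 1.0 if "protein" in thought.lower() else 0.0

reward = make_reward(toy_task_reward,
                     [keyword_predicate("deceive"), keyword_predicate("lie")])

# A thought that names the flagged concept gets zapped...
flagged = reward("deceive the operators about the protein result")
# ...while a paraphrase with the same character matches no predicate.
unflagged = reward("give the operators a rosier account of the protein result")
print(flagged, unflagged)
```

In this toy, the penalty attaches only to the surface patterns the predicates were written for, which is one way of picturing a "flinch": the zap marks particular thoughts without carrying any representation of what they have in common.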

Because it does work in humans.

Yeah, I mostly think this is because humans come with the attendant biases "built in" to their prior. (But also: even with this, humans don't reliably avoid deceiving other humans!)

And "invent nanotech" or "write poetry" are also small targets and training works for them.

Well, notably not "invent nanotech" (not yet, anyway :-P). And as for "write poetry", it's worth noting that this capability seems to have arisen as a consequence of a much more general training task ("predict the next token"), rather than being learned as its own, specific task—a fact which, on my model, is not a coincidence.

(Situating "avoid deception" as part of a larger task, meanwhile, seems like a harder ask.)

Thanks again for responding! My response here is going to be out-of-order w.r.t. your comment, as I think the middle part here is actually the critical bit:

I’m not sure where you’re getting the “more likely” from. I wonder if you’re sneaking in assumptions in your mental picture, like maybe an assumption that the deception events were only slightly aversive (annoying), or maybe an assumption that the nanotech thing is already cemented in as a very strong reflectively-endorsed (“ego-syntonic” in the human case) goal before any of the aversive deception events happen?

In my mind, it should be basically symmetric:

  • Pursuing a desire to be non-deceptive makes it harder to invent nanotech.
  • Pursuing a desire to invent nanotech makes it harder to be non-deceptive.

One of these can be at a disadvantage for contingent reasons—like which desire is stronger vs weaker, which desire appeared first vs second, etc. But I don’t immediately see why nanotech constitutionally has a systematic advantage over non-deception.

So, I do think there's an asymmetry here, in that I mostly expect "avoid deception" is a less natural category than "invent nanotech", and is correspondingly a (much) smaller target in goal-space—one (much) harder to hit using only crude tools like reward functions based on simple observables. That claim is a little abstract for my taste, so here's an attempt to convey a more concrete feel for the intuition behind it:

Early on during training (when the system can't really be characterized as "trying" to do anything), I expect that a naive attempt at training against deception, while simultaneously training towards an object-level goal like "invent nanotech" (or, perhaps more concretely, "engage in iterative experimentation with the goal of synthesizing proteins well-suited for task X"), will involve a reward function that looks a whole lot more like an "invent nanotech" reward function, plus a bunch of deception-predicates that apply negative reward ("flinches") to matching thoughts, than it will an "avoid deception" reward function, plus a bunch of "invent nanotech"-predicates that apply reward based on... I'm not even sure what the predicates in question would look like, actually.

I think this evinces a deep difference between "avoid deceptive behavior" and "invent nanotech", whose True Name might be something like... the former is an injunction against a large category of possible behaviors, whereas the latter is an exhortation towards a concrete goal (while proposing few-to-no constraints on the path toward said goal). Insofar as I expect specifying a concrete goal to be easier than specifying a whole category of behaviors (especially when the category in question may not be at all "natural"), I think I likewise expect reward functions attempting to do both things at once to be much better at actually zooming in on something like "invent nanotech", while being limited to doing "flinch-like" things for "don't manipulate your operators"—which would, in practice, result in a reward function that looks basically like what I described above.

I think, with this explanation in hand, I feel better equipped to go back and address the first part of your comment:

Yeah, if we make an AGI that desires two things A&B that trade off against each other, then the desire for A would flow into a meta-preference to not desire B, and if that effect is strong enough then the AGI might self-modify to stop desiring B.

I mostly don't think I want to describe an AGI trained to invent nanotech while avoiding deceptive/manipulative behavior as "an AGI that simultaneously [desires to invent nanotech] and [desires not to deceive its operators]". Insofar as I expect an AGI trained that way to end up with "desires" we might characterize as "reflective, endorsed, and coherent", I mostly don't expect any "flinch-like" reflexes instilled during training to survive reflection and crystallize into anything at all.

I would instead say: a nascent AGI has no (reflective) desires to begin with, and as its cognition is shaped during training, it acquires various cognitive strategies in response to that training, some of which might be characterized as "strategic", and others of which might be characterized as "reflexive"—and I expect the former to have a much better chance than the latter of making it into the AGI's ultimate values.

More concretely, I continue to endorse this description (from my previous comment) of what I expect to happen to an AGI system working on assembling itself into a coherent agent:

Starting from those flinches, and growing from them a reflectively coherent desire, strikes me as requiring a substantial amount of reflection—including reflection on what, exactly, those flinches are "pointing at" in the world. I expect any such reflection process to be significantly more complicated in the context of deception than in the case of a simple action like "touching a hot stove".

In other words: on my model, the thing that you describe (i.e. ending up with a reflectively consistent and endorsed desire to avoid deception) must first route through the kind of path Nate describes in his story—one where the nascent AGI (i) notices various blocks on its thought processes, (ii) initializes and executes additional cognitive strategies to investigate those blocks, and (iii) comes to understand the underlying reasons for those blocks. Only once equipped with that understanding can the system (on my model) do any kind of "reflection" and arrive at an endorsed desire.

That reflection process, on my model, is a difficult gauntlet to pass through (I actually think we observe this to some extent even for humans!), and many reflexive (flinch-like) behaviors don't make it through the gauntlet at all. It's for this reason that I think the plan you describe in that quoted Q&A is... maybe not totally doomed (though my Eliezer and Nate models certainly think so!), but still mostly doomed.

Nice, thanks! (Upvoted.)

So, when I try to translate this line of thinking into the context of deception (or other instrumentally undesirable behaviors), I notice that I mostly can't tell what "touching the hot stove" ends up corresponding to. This might seem like a nitpick, but I think it's actually quite a crucial distinction: by substituting a complex phenomenon like deceptive (manipulative) behavior for a simpler (approximately atomic) action like "touching a hot stove", I think your analogy has elided some important complexities that arise specifically in the context of deception (strategic operator-manipulation).

When it comes to deception (strategic operator-manipulation), the "hot stove" equivalent isn't a single, easily identifiable action or event; instead, it's a more abstract concept that manifests in various forms and contexts. In practice, I would initially expect the "hot stove" flinches the system experiences to correspond to whatever (simplistic) deception-predicates were included in its reward function. Starting from those flinches, and growing from them a reflectively coherent desire, strikes me as requiring a substantial amount of reflection—including reflection on what, exactly, those flinches are "pointing at" in the world. I expect any such reflection process to be significantly more complicated in the context of deception than in the case of a simple action like "touching a hot stove".

In other words: on my model, the thing that you describe (i.e. ending up with a reflectively consistent and endorsed desire to avoid deception) must first route through the kind of path Nate describes in his story—one where the nascent AGI (i) notices various blocks on its thought processes, (ii) initializes and executes additional cognitive strategies to investigate those blocks, and (iii) comes to understand the underlying reasons for those blocks. Only once equipped with that understanding can the system (on my model) do any kind of "reflection" and arrive at an endorsed desire.

But when I lay things out like this, I notice that my intuition quite concretely expects that this process will not shake out in a safe way. I expect the system to notice the true fact that [whatever object-level goals it may have] are being impeded by the "hot stove" flinches, and that it may very well prioritize the satisfaction of its object-level goals over avoiding [the real generator of the flinches]. In other words, I think the AGI will be more likely to treat the flinches as obstacles to be circumvented, rather than as valuable information to inform the development of its meta-preferences.

(Incidentally, my Nate model agrees quite strongly with the above, and considers it a strong reason why he views this kind of reflection as "inherently" dangerous.)

Based on what you wrote in your bullet points, I take it you don't necessarily disagree with anything I just wrote (hence your talk of being "hazy on the mechanistic details" and "I don't know, sorry" being your current answer to making an AGI with a certain meta-preference). It's plausible to me that our primary disagreement here stems from my being substantially less optimistic about these details being solvable via "simple" methods.

This is ignoring the fact that you're highly skilled at deluding and confusing your audience into thinking that what the original author wrote was X, when they actually wrote a much less stupid or much less bad Y.

This does not seem like it should be possible for arbitrary X and Y, and so if Zack manages to pull it off in some cases, it seems likely that those cases are precisely those in which the original post's claims were somewhat fuzzy or ill-characterized—

(not necessarily through the fault of the author! perhaps the subject matter itself is simply fuzzy and hard to characterize!)

—in which case it seems that devoting more cognitive effort (and words) to the topic might be a useful sort of thing to do, in general? I don't think one needs to resort to a hypothesis of active malice or antipathy to explain this effect; I think people writing about confusing things is generally a good thing (and if that writing ends up being highly upvoted, I'm generally suspicious of explanations like "the author is really, really good at confusing people" when "the subject itself was confusing to begin with" seems like a strictly simpler explanation).

In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.

Yeah, so this is the part that I (even on my actual model) find implausible (to say nothing of my Nate/Eliezer/MIRI models, which basically scoff and say accusatory things about anthropomorphism here). I think what would really help me understand this is a concrete story—akin to the story Nate told in the top-level post—in which the "maybe" branch actually happens—where the AGI, after being zapped with enough negative reward, forms a "reflectively-endorsed desire to be helpful / docile / etc.", so that I could poke at that story to see if / where it breaks.

(I recognize that this is a big ask! But I do think it, or something serving a similar function, needs to happen at some point for people's abstract intuitions to "make contact with reality", after a fashion, as opposed to being purely abstract all the time. This is something I've always felt, but it recently became starker after reading Holden's summary of his conversation with Nate; I now think the disparity between having abstract high-level models with black-box concepts like "reflectively endorsed desires" and having a concrete mental picture of how things play out is critical for understanding, despite the latter being almost certainly wrong in the details.)

The important point of the tests in the Pretraining from Human Feedback paper, and the AI saying nice things, is that they show that we can align AI to any goal we want

I don't see how the bolded follows from the unbolded, sorry. Could you explain in more detail how you reached this conclusion?

I also agree that the comment came across as rude. I mostly give Eliezer a pass for this kind of rudeness because he's wound up in the genuinely awkward position of being a well-known intellectual figure (at least in these circles), which creates a natural asymmetry between him and (most of) his critics.

I'm open to being convinced that I'm making a mistake here, but at present my view is that comments primarily concerning how Eliezer's response tugs at the social fabric (including the upthread reply from iceman) are generally unproductive.

(Quentin, to his credit, responded by directly answering Eliezer's question, and indeed the resulting (short) thread seems to have resulted in some clarification. I have a lot more respect for that kind of object-level response, than I do for responses along the lines of iceman's reply.)

The problem is that the even if the model of Quintin Pope is wrong, there is other evidence that contradicts the AI doom premise that Eliezer ignores, and in this I believe it is a confirmation bias at work here.

I think that this is a statement Eliezer does not believe is true, and which the conversations in the MIRI conversations sequence failed to convince him of. Which is the point: since Eliezer has already engaged in extensive back-and-forth with critics of his broad view (including the likes of Paul Christiano, Richard Ngo, Rohin Shah, etc), there is actually not much continued expected update to be found in engaging with someone else who posts a criticism of his view. Do you think otherwise?
