dxu

Posts

Sorted by New

Comments

Alignment By Default

I like this post a lot, and I think it points out a key crux between what I would term the "Yudkowsky" side (which seems to mostly include MIRI, though I'm not too sure about individual researchers' views) and "everybody else".

In particular, the disagreement seems to crystallize over the question of whether "human values" really are a natural abstraction. I suspect that if Eliezer thought that they were, he would be substantially less worried about AI alignment than he currently is (though naturally all of this is my read on his views).

You do provide some reasons to think that human values might be a natural abstraction, both in the post itself and in the comments, but I don't see these reasons as particularly compelling ones. The one I view as the most compelling is the argument that humans seems to be fairly good at identifying and using natural abstractions, and therefore any abstract concept that we seem to be capable of grasping fairly quickly has a strong chance of being a natural one.

However, I think there's a key difference between abstractions that are developed for the purposes of prediction, and abstractions developed for other purposes (by which I mostly mean "RL"). To the extent that a predictor doesn't have sufficient computational power to form a low-level model of whatever it's trying to predict, I definitely think that the abstractions it develops in the process of trying to improve its prediction will to a large extent be natural ones. (You lay out the reasons for this clearly enough in the post itself, so I won't repeat them here.)

It seems to me, though, that if we're talking about a learning agent that's actually trying to take actions to accomplish things in some environment, there's a substantial amount of learning going on that has nothing to do with learning to predict things with greater accuracy! The abstractions learned in order to select actions from a given action-space in an attempt to maximize a given reward function--these, I see little reason to expect will be natural. In fact, if the computational power afforded to the agent is good but not excellent, I expect mostly the opposite: a kludge of heuristics and behaviors meant to address different subcases of different situations, with not a whole lot of rhyme or reason to be found.

As agents go, humans are definitely of the latter type. And, therefore, I think the fact that we intuitively grasp the concept of "human values" isn't necessarily an argument that "human values" are likely to be natural, in the way that it would be for e.g. trees. The latter would have been developed as a predictive abstraction, whereas the former seems to mainly consist of what I'll term a reward abstraction. And it's quite plausible to me that reward abstractions are only legible by default to agents which implement that particular reward abstraction, and not otherwise. If that's true, then the fact that humans know what "human values" are is merely a consequence of the fact that we happen to be humans, and therefore have a huge amount of mind-structure in common.

To the extent that this is comparable to the branching pattern of a tree (which is a comparison you make in the post), I would argue that it increases rather than lessens the reason to worry: much like a tree's branch structure is chaotic, messy, and overall high-entropy, I expect human values to look similar, and therefore not really encompass any kind of natural category.

The "AI Dungeons" Dragon Model is heavily path dependent (testing GPT-3 on ethics)

Here's the actual explanation for this: https://twitter.com/nickwalton00/status/1289946861478936577

This seems to have been an excellent exercise in noticing confusion; in particular, to figure this one out properly would have required one to not recognize that this behavior does not accord with one's pre-existing model, rather than simply coming up with an ad hoc explanation to fit the observation.

I therefore award partial marks to Rafael Harth for not proposing any explanations in particular, as well as Viliam in the comments:

I assumed that the GPT's were just generating the next word based on the previous words, one word at a time. Now I am confused.

Zero marks to Andy Jones, unfortunately:

I am fairly confident that Latitude wrap your Dungeon input before submitting it to GPT-3; if you put in the prompt all at once, that'll make for different model input than putting it in one line at a time.

Don't make up explanations! Take a Bayes penalty for your transgressions!

(No one gets full marks, unfortunately, since I didn't see anyone actually come up with the correct explanation.)

Alignment As A Bottleneck To Usefulness Of GPT-3

For what it's worth, my perception of this thread is the opposite of yours: it seems to me John Wentworth's arguments have been clear, consistent, and easy to follow, whereas you (John Maxwell) have been making very little effort to address his position, instead choosing to repeatedly strawman said position (and also repeatedly attempting to lump in what Wentworth has been saying with what you think other people have said in the past, thereby implicitly asking him to defend whatever you think those other people's positions were).

Whether you've been doing this out of a lack of desire to properly engage, an inability to comprehend the argument itself, or some other odd obstacle is in some sense irrelevant to the object-level fact of what has been happening during this conversation. You've made your frustration with "AI safety people" more than clear over the course of this conversation (and I did advise you not to engage further if that was the case!), but I submit that in this particular case (at least), the entirety of your frustration can be traced back to your own lack of willingness to put forth interpretive labor.

To be clear: I am making this comment in this tone (which I am well aware is unkind) because there are multiple aspects of your behavior in this thread that I find not only logically rude, but ordinarily rude as well. I more or less summarized these aspects in the first paragraph of my comment, but there's one particularly onerous aspect I want to highlight: over the course of this discussion, you've made multiple references to other uninvolved people (either with whom you agree or disagree), without making any effort at all to lay out what those people said or why it's relevant to the current discussion. There are two examples of this from your latest comment alone:

Daniel K agreed with me the other day that there isn't a standard reference for this claim. [Note: your link here is broken; here's a fixed version.]

A MIRI employee openly admitted here that they apply different standards of evidence to claims of safety vs claims of not-safety.

Ignoring the question of whether these two quoted statements are true (note that even the fixed version of the link above goes only to a top-level post, and I don't see any comments on that post from the other day), this is counterproductive for a number of reasons.

Firstly, it's inefficient. If you believe a particular statement is false (and furthermore, that your basis for this belief is sound), you should first attempt to refute that statement directly, which gives your interlocutor the opportunity to either counter your refutation or concede the point, thereby moving the conversation forward. If you instead counter merely by invoking somebody else's opinion, you both increase the difficulty of answering and end up offering weaker evidence.

Secondly, it's irrelevant. John Wentworth does not work at MIRI (neither does Daniel Kokotajlo, for that matter), so bringing up aspects of MIRI's position you dislike does nothing but highlight a potential area where his position differs from MIRI's. (I say "potential" because it's not at all obvious to me that you've been representing MIRI's position accurately.) In order to properly challenge his position, again it becomes more useful to critique his assertions directly rather than round them off to the closest thing said by someone from MIRI.

Thirdly, it's a distraction. When you regularly reference a group of people who aren't present in the actual conversation, repeatedly make mention of your frustration and "grumpiness" with those people, and frequently compare your actual interlocutor's position to what you imagine those people have said, all while your actual interlocutor has said nothing to indicate affiliation with or endorsement of those people, it doesn't paint a picture of an objective critic. To be blunt: it paints a picture of someone with a one-sided grudge against the people in question, and is attempting to inject that grudge into conversations where it shouldn't be present.

I hope future conversations can be more pleasant than this.

The Basic Double Crux pattern

I think shminux may have in mind one or more specific topics of contention that he's had to hash out with multiple LWers in the past (myself included), usually to no avail. 

(Admittedly, the one I'm thinking of is deeply, deeply philosophical, to the point where the question "what if I'm wrong about this?" just gets the intuition generator to spew nonsense. But I would say that this is less about an inability to question one's most deeply held beliefs, and more about the fact that there are certain aspects of our world-models that are still confused, and querying them directly may not lead to any new insight.)

Alignment As A Bottleneck To Usefulness Of GPT-3
dxu2mo6Ω3

If it's read moral philosophy, it should have some notion of what the words "human values" mean.

GPT-3 and systems like it are trained to mimic human discourse. Even if (in the limit of arbitrary computational power) it manages to encode an implicit representation of human values somewhere in its internal state, in actual practice there is nothing tying that representation to the phrase "human values", since moral philosophy is written by (confused) humans, and in human-written text the phrase "human values" is not used in the consistent, coherent manner that would be required to infer its use as a label for a fixed concept.

Alignment As A Bottleneck To Usefulness Of GPT-3

On "conceding the point":

You said earlier that "The argument for the fragility of value never relied on AI being unable to understand human values." I gave you a quote from Superintelligence which talked about AI being unable to understand human values. Are you gonna, like, concede the point or something?

The thesis that values are fragile doesn't have anything to do with how easy it is to create a system that models them implicitly, but with how easy it is to get an arbitrarily intelligent agent to behave in a way that preserves those values. The difference between those two things is analogous to the difference between a prediction task and a reinforcement learning task, and your argument (as far as I can tell) addresses the former, not the latter. Insofar as my reading of your argument is correct, there is no point to concede.

On gwern's article:

Anyway, I read Gwern's article a while ago and I thought it was pretty bad. If I recall correctly, Gwern confuses various different notions, for example, he seemed to think that if you replace enough bits of handcrafted software with bits trained using machine learning, an agent will spontaneously emerge.

I'm not sure how to respond to this, except to state that neither this specific claim nor anything particularly close to it appears in the article I linked.

On Tool AI:

Are possible

As far as I'm aware, this point has never been the subject of much dispute.

Are easier to build than Agent AIs

This is still arguable; I have my doubts, but in a "big picture" sense this is largely irrelevant to the greater point, which is:

Will be able to solve the value-loading problem

This is (and remains) the crux. I still don't see how GPT-3 supports this claim! Just as a check that we're on the same page: when you say "value-loading problem", are you referring to something more specific than the general issue of getting an AI to learn and behave according to our values?

***

META: I can understand that you're frustrated about this topic, especially if it seems to you that the "MIRI-sphere" (as you called it in a different comment) is persistently refusing to acknowledge something that appears obvious to you.

Obviously, I don't agree with that characterization, but in general I don't want to engage in a discussion that one side is finding increasingly unpleasant, especially since that often causes the discussion to rapidly deteriorate in quality after a few replies.

As such, I want to explicitly and openly relieve you of any social obligation you may have felt to reply to this comment. If you feel that your time would be better spent elsewhere, please do!

Alignment As A Bottleneck To Usefulness Of GPT-3

My claim is that we are likely to see a future GPT-N system which [...] does not "resist attempts to meddle with its motivational system".

Well, yes. This is primarily because GPT-like systems don't have a "motivational system" with which to meddle. This is not a new argument by any means: the concept of AI systems that aren't architecturally goal-oriented by default is known as "Tool AI", and there's plenty of pre-existing discussion on this topic. I'm not sure what you think GPT-3 adds to the discussion that hasn't already been mentioned?

Alignment As A Bottleneck To Usefulness Of GPT-3

I'm confused by what you're saying.

The argument for the fragility of value never relied on AI being unable to understand human values. Are you claiming it does?

If not, what are you claiming?

Coronavirus as a test-run for X-risks

I'd love to see more thought about how the MNM effect might look in an AI scenario. Like you said, maybe denials and assurances followed by freakouts and bans. But maybe we could predict what sorts of events would trigger the shift?

I take it you're presuming slow takeoff in this paragraph, right?

Philosophy in the Darkest Timeline: Basics of the Evolution of Meaning

Differing discourse norms; in general, communities that don't expend a constant amount of time-energy into maintaining better-than-average standards of discourse will, by default, regress to the mean. (We saw the same thing happen with LW1.0.)

Load More