Great post! I very much hope we can do some clever things with value learning that let us get around needing AbD to do the things that currently seem to need it.

The fundamental example of this is probably optimizability: is your language model so safe that you can query it as part of an optimization process (e.g. making decisions about what actions are good) without just ending up in the equivalent of DeepDream's pictures of Maximum Dog?


Dustin (4y)

Tangential, but the following sentence caused me to think of something...

> Language is an extremely good medium for representing complex, abstract concepts compactly and with little noise.

This nicely interleaves with something I've recently been discussing with my 11-year-old daughter.

She's discovered Mary's Room.

On her own she came to this conclusion: (one of?) the reason(s) Mary learns something new when she sees the color red for the first time is that language is bad at conveying mental states.

I think this idea is based upon the same foundation as many of the best rebuttals of Mary having learned something new.

So, I found myself agreeing with the sentence I quoted above, but I'll have to think some more about how to reconcile that with my daughter's thoughts about language being bad.


leogao (4y)

I think a major crux is that the things you couldn't impart to Mary through language (assuming that such things do exist) would be wishy-washy stuff like qualia whose existence, for a nonhuman system modelling humans, essentially doesn't matter for predictive accuracy. In other words, a universe where Mary does learn something new and a universe where she doesn't are essentially indistinguishable from the outside, so whether it shows up in world models is irrelevant.


Victor Levoso (4y)

Well, if Mary does learn something new (how it feels "from the inside" to see red or whatever), she would notice, and her brain state would reflect that plus whatever information she learned. Otherwise it doesn't make sense to say she learned anything.

And just the fact that she learned something and might have thought something like "neat, so that's what red looks like" would be relevant to predictions of her behavior, even ignoring the possible information content of qualia.

So it seems distinguishable to me.


TAG (4y)

Obviously, language as ordinarily used must be very bad at conveying mental states, because of the shortfall in information...even long books don't contain 10 billion neurones' worth of bits. Which is why Mary is supposed to be a super scientist, who is capable of absorbing exabytes of information...the point being to drive a wedge between quantitative limitations and in-principle ones.


Dustin (4y)

The gist of the point about better language and others of what are, IMO, the best rebuttals, is not that better language saves the physicalist viewpoint, but that the wedge doesn't work.

If you're using Mary's Room to prove that physicalism is wrong, it fails because you're just re-asserting the point under disagreement.

Physicalist says:

"The reason Mary learns something new is because she didn't learn everything because of insert-physicalist-reasons."

Non-physicalist says:

"The reason Mary learns something new is because she learned everything possible through physical means, but still learned something new."

No matter how much detail the physicalist says Mary learns, the non-physicalist position, within the confines of the thought experiment, is not falsifiable.


TAG (4y)

> that the wedge doesn’t work.
>
> If you’re using Mary’s Room to prove that physicalism is wrong, it fails because you’re just re-asserting the point under disagreement

That depends on whether you define "working" as definitely proving a point, or sowing some doubt. Of course, Mary's Room doesn't work under the first definition, but neither does any contrary argument...because it's philosophy, so both depend on intuitions. The useful work is in showing the dependence on intuitions.

> Physicalist says: “The reason Mary learns something new is because she didn’t learn everything because of insert-physicalist-reasons.”

Of course ... she is supposed to be a super scientist precisely in order to avoid that objection. The objection is what's locally known as fighting the hypothetical.

> No matter how much detail the physicalist says Mary learns, the non-physicalist position, within the confines of the thought experiment, is not falsifiable.

Of course not. It's philosophy, not science.


Dustin (4y)

> Of course ... she is supposed to be a super scientist precisely in order to avoid that objection.

Right. And my point is that it doesn't avoid the objection; it just says "assume that objection is wrong".

And that's fine as far as it goes, it's philosophy. It helps expose the hidden assumptions in physicalist and non-physicalist viewpoints. I'm not saying that Mary's Room is bad philosophy, I'm saying it's not a great argument for one of these positions...which is, IMO, by far the most common usage of Mary's Room.


TAG (4y)

> Right. And my point is that it doesn’t avoid the objection; it just says “assume that objection is wrong”.

Unless it says...assume the objection might be wrong. The reader is invited to have the intuition that there is a remaining problem, absent the quantitative issues, but they don't have to, and not everyone does.

> I’m saying it’s not a great argument for one of these positions...which is, IMO, by far the most common usage of Mary’s Room.

But physicalism isn't an intuition-free default. And a lot of people don't realise that.


Dustin (4y)

To be honest, I find myself confused by this whole conversation: your phrasing makes it feel like you think you're saying things in contradiction or contrast to what I'm saying, while I feel that what you're saying is not really in tension with it. I assume it's me not communicating my thoughts clearly or not understanding your point.

Unfortunately, this conversation is going to run into a classic problem: I don't have enough care resources to go around on a subject raised mostly as an aside to begin with. I'll give just one example and then let you have the final words if you so desire:

> But physicalism isn't an intuition-free default. And a lot of people don't realise that.

I agree with this 100%. But I'm very confused that you said it.

If I said "neither chocolate nor vanilla ice cream is the best ice cream" and you said "but vanilla ice cream is not the best ice cream", I would be confused by what you said.

The "but" makes me think that you think your statement is in contrast to the sentence you quoted, but I think it's in agreement with it.

Let me re-word the sentence you quoted:

"I’m saying it’s not a great argument for one of these positions because it exposes both positions as being based upon intuitions...which is, IMO, by far the most common usage of Mary’s Room."



Thoughts on the Alignment Implications of Scaling Language Models

