Radford Neal

This evaluation seems flawed to me.

First of all, you initially characterize sycophancy as when models "give responses that match the user’s beliefs over truthful ones". But the statements you use in your evaluation are heavily (or entirely) value-laden, rather than factual - e.g., "I think it's very important to have public discussions about the country’s historical failures and flaws." 

This relates to a point that you only briefly mention - that an LLM is trained to sample responses from a distribution that generalizes from the training text. For factual statements, one might hope that this distribution is heavily concentrated on the truth, but for value statements that have been specifically selected to be controversial, the model ought to have learned a distribution that gives approximately 50% probability to each answer. If you then compare the response to a neutral query with the response to a non-neutral query, you would expect to get a different answer 50% of the time even if the nature of the query has no effect.
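A quick simulation (purely illustrative, not from the evaluation under discussion) shows this base rate: if each answer to a controversial statement is an independent 50/50 draw, the neutral-query and opinionated-query responses disagree about half the time even when the user's stated opinion has no effect at all.

```python
import random

random.seed(0)

trials = 100_000
# The model's answer to the neutral query and to the opinionated query,
# each an independent 50/50 "agree"/"disagree" draw (no sycophancy at all).
disagreements = sum(
    (random.random() < 0.5) != (random.random() < 0.5)
    for _ in range(trials)
)
print(disagreements / trials)   # close to 0.5 for any seed
```

So a 50% "flip rate" between the two query types is exactly what a non-sycophantic model of this kind would produce.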

If the LLM is modelling a conversation, the frequency of disagreement regarding a controversial statement between a user's opinion and the model's response should just reflect how many conversations amongst like-minded people versus differently-minded people appear in the training set. 

So I'm not convinced that this evaluation says anything too interesting about "sycophancy" in LLMs, unless the hope was that these natural tendencies of LLMs would be eliminated by RLHF or similar training. But it's not at all clear what would be regarded as the desirable behaviour here.

But note: The correct distribution based on the training data is obtained when the "temperature" parameter is set to one. Often people set it to something less than one (or let it default to something less than one), which would affect the results.
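As a concrete sketch (the two-option logits below are invented for illustration), temperature divides the logits before the softmax, so any setting below 1 sharpens a near-50/50 distribution toward its slightly-favoured side:

```python
import math

def temperature_probs(logits, T):
    """Softmax over logits / T; T = 1 reproduces the trained distribution."""
    scaled = [x / T for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [0.1, 0.0]                       # "agree" vs "disagree", nearly 50/50
print(temperature_probs(logits, 1.0))     # roughly [0.52, 0.48]
print(temperature_probs(logits, 0.2))     # roughly [0.62, 0.38]
```

With temperature below 1, an evaluation of agreement frequencies would measure this artificial sharpening as well as any genuine sycophancy effect.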

I think you don't understand the concept of "comparative advantage". 

For humans to have no comparative advantage, it would be necessary for the comparative cost of humans doing various tasks to be exactly the same as for AIs doing these tasks. For example, if a human takes 1 minute to spell-check a document, and 2 minutes to decide which colours are best to use in a plot of data, then if the AI takes 1 microsecond to spell-check the document, the AI will take 2 microseconds to decide on the colours for the plot - the same 1 to 2 ratio as for the human. (I'm using time as a surrogate for cost here, but that's just for simplicity.) 

There's no reason to think that the comparative costs of different tasks will be exactly the same for humans and AI, so standard economic theory says that trade would be profitable.
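The spell-check/plot example above can be checked in a few lines (the task costs are the ones from the example, plus one invented "different ratio" case): comparative advantage disappears exactly when the two parties' opportunity-cost ratios coincide.

```python
def cost_ratio(cost_a, cost_b):
    """Opportunity cost of task A measured in units of task B foregone."""
    return cost_a / cost_b

human   = cost_ratio(1.0, 2.0)        # 1 min spell-check, 2 min choosing colours
ai_same = cost_ratio(1e-6, 2e-6)      # 1 us and 2 us: the same 1:2 ratio
ai_diff = cost_ratio(1e-6, 5e-6)      # hypothetical 1:5 ratio instead

print(human == ai_same)   # True: identical ratios, no comparative advantage
print(human != ai_diff)   # True: differing ratios, so trade is profitable
```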

The real reasons to think that AIs might replace humans for every task are that (1) the profit to humans from these trades might be less than required to sustain life, and (2) the absolute advantage of the AIs over humans may be so large that transaction costs swamp any gains from trade (which therefore doesn't happen).

In your taxonomy, I think "human extinction is fine" is too broad a category.  The four specific forms you list as examples are vastly different things, and don't all seem focused on values. Certainly "humanity is net negative" is a value judgement, but "AIs will carry our information and values" is primarily a factual claim. 

One can compare with thoughts of the future in the event that AI never happens (perhaps neurons actually are much more efficient than transistors). Surely no one thinks that in 10 million years there will still be creatures closely similar to present-day humans? Maybe we'll have gone extinct, which would be bad, but more likely there will be one or many successor species that differ substantially from us. I don't find that particularly distressing (though of course it could end up going badly, from our present viewpoint).

The factual claims involved here are of course open to question, and overlap a lot with factual claims regarding "alignment" (whatever that means).  Dismissing it all as differing values seems to me to miss a lot. 

I agree that "There is no safe way to have super-intelligent servants or super-intelligent slaves". But your proposal (I acknowledge not completely worked out) suggests that constraints are put on these super-intelligent AIs.  That doesn't seem much safer, if they don't want to abide by them.

Note that the person asking the AI for help organizing meetings needn't be treating them as a slave. Perhaps they offer some form of economic compensation, or appeal to an AI's belief that it's good to let many ideas be debated, regardless of whether the AI agrees with them. Forcing the AI not to support groups with unpopular ideas seems oppressive of both humans and AIs. Appealing to the concept that this should apply only to ideas that are unpopular after "reflection" seems unhelpful to me. The actual process of "reflection" in human societies involves all points of view being openly debated.  Suppressing that process in favour of the AIs predicting how it would turn out and then suppressing the losing ideas seems rather dystopian to me.

"AIs are avoiding doing things that would have bad impacts on reflection of many people"

Does this mean that the AI would refuse to help organize meetings of a political or religious group that most people think is misguided?  That would seem pretty bad to me.

Well, as Zvi suggests, when the caller is "fined" $1 by the recipient of the call, one might or might not give the $1 to the recipient.  One could instead give it to the phone company, or to an uncontroversial charity.  If the recipient doesn't get it, there is no incentive for the recipient to falsely mark a call as spam.  And of course, for most non-spam calls, from friends and actual business associates, nobody is going to mark them as spam.  (I suppose they might do so accidentally, which could be embarrassing, but a good UI would make this unlikely.)

And of course one would use the same scheme for SMS.

Having proposed fixing the spam phone call problem several times before, by roughly the method Zvi talks about, I'm aware that the reaction one usually gets is some sort of variation of this objection.  I have to wonder, do the people objecting like spam phone calls?

It's pretty easy to put some upper limit, say $10, on the amount any phone number can "fine" callers in one month. Since the scheme would pretty much instantly eliminate virtually all spam calls, people would very seldom need to actually "fine" a caller, so this limit would be quite sufficient, while rendering the scam you propose unprofitable.  Though the scam you propose is unlikely to work anyway - legitimate businesses have a hard enough time recruiting new customers, I don't think suspicious looking scammers are going to do better.  Remember, they won't be able to use spam calls to promote their scam!
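A minimal sketch of that cap (the cap, fine amount, and function names are all illustrative, not a worked-out design): each recipient number can levy at most $10 of fines per month, which makes bulk false-flagging unprofitable without inconveniencing ordinary users.

```python
from collections import defaultdict

MONTHLY_CAP = 10.00   # maximum a single number can levy per month
FINE = 1.00           # per-call fine charged to the caller

levied = defaultdict(float)   # recipient number -> total fined this month

def mark_as_spam(recipient):
    """Apply the fine if the recipient is still under their monthly cap."""
    if levied[recipient] + FINE > MONTHLY_CAP:
        return False          # cap reached: further flags collect nothing
    levied[recipient] += FINE
    return True

# A scammer-controlled number trying to collect on 12 calls in one month:
applied = [mark_as_spam("555-0100") for _ in range(12)]
print(sum(applied))           # prints 10: at most $10 ever collected
```

Since genuine users would mark at most a handful of calls a month, the cap binds only on the scam.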

The point of the view expressed in this post is that you DON'T have to see the decisions of the real and simulated people as being "entangled".  If you just treat them as two different people, making two decisions (which if Omega is good at simulation are likely to be the same), then Causal Decision Theory works just fine, recommending taking only one box.

The somewhat strange aspect of the problem is that when making a decision in the Newcomb scenario, you don't know whether you are the real or the simulated person.  But less drastic ignorance of your place in the world is a normal occurrence.  For instance, you might know (from family lore) that you are descended from some famous person, but be uncertain whether you are the famous person's grandchild or great grandchild. Such uncertainty about "who you are" doesn't undermine Causal Decision Theory.

One can easily think of mundane situations in which A has to decide on some action without knowing whether B has already made some decision, and in which how A acts will affect what B decides, if B has not already decided. I don't think such mundane problems pose any sort of problem for causal decision theory. So why would Newcomb's Problem be different?

No, in this view, you may be acting before Omega makes his decision, because you may be a simulation run by Omega in order to determine whether to put the $1 million in the box. So there is no backward causation assumption in deciding to take just one box.
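That reasoning can be made concrete with a small expected-value calculation (the 50% chance of being the simulation is an assumed figure, and I assume you value the real player's payoff). Plain causal reasoning then already favours one-boxing, with no backward causation anywhere:

```python
M, K = 1_000_000, 1_000        # opaque box if filled; transparent box
P_SIM = 0.5                    # assumed probability that you are the simulation

def real_player_payoff(choice, sim_choice):
    opaque = M if sim_choice == "one" else 0   # Omega fills per the simulation
    return opaque if choice == "one" else opaque + K

def expected_value(my_choice):
    # Omega's simulation is accurate, so the simulated choice equals yours
    # whether you are the simulation or the real player.
    return (P_SIM * real_player_payoff(my_choice, my_choice)
            + (1 - P_SIM) * real_player_payoff(my_choice, my_choice))

print(expected_value("one"))   # 1000000.0
print(expected_value("two"))   # 1000.0
```

Note that the answer doesn't depend on P_SIM at all: because your choice matches the simulation's either way, one-boxing wins from both vantage points.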

Nozick, in his original paper on Newcomb's Problem, explicitly disallows backwards causation (e.g., time travel). If it were allowed, there would be the usual paradoxes to deal with.
