Jan Rzymkowski — LessWrong

What specific dangers arise when asking GPT-N to write an Alignment Forum post?

Upon reflection, you're right that it won't be maximizing response per se.

But as we go deeper, it's not so straightforward. GPT-3 models can be trained to minimize prediction loss (or, plainly speaking, to simply predict more accurately) on many different tasks, which are usually very simply stated (e.g. choose a word that would fill the blank).
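To make "prediction loss" on a fill-the-blank task concrete, here is a toy sketch. It assumes the loss is the usual negative log-probability assigned to the correct word; the function name, the candidate words and the probabilities are all made up for illustration and are not taken from any real model.

```python
import math

def prediction_loss(predicted_probs, correct_word):
    """Negative log-probability that the model gave to the correct word.

    predicted_probs is a hypothetical dict mapping candidate words to
    probabilities; lower loss means a more accurate prediction.
    """
    return -math.log(predicted_probs[correct_word])

# Blank to fill: "The cat sat on the ____." (illustrative numbers only)
confident_model = {"mat": 0.90, "moon": 0.05, "idea": 0.05}
unsure_model = {"mat": 0.40, "moon": 0.30, "idea": 0.30}

loss_confident = prediction_loss(confident_model, "mat")
loss_unsure = prediction_loss(unsure_model, "mat")
print(loss_confident < loss_unsure)
```

Note that nothing in this objective mentions persuasiveness or the quality of a long text; it only scores one blank at a time.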

But we end up with people taking models trained this way and using them to generate long texts from some primer. And yes, in most cases such abuse of the model will produce text that is merely coherent. But I would expect humans to have a tendency to conflate coherence with persuasiveness.

I suppose one could fairly easily choose a prediction loss for GPT-3 models such that the longer texts would have some desired characteristics. But even standard tasks probably shape GPT-3 so that it keeps producing vague sentences that continue the primer and give the reader a feeling that it "makes sense". That could mean producing fairly persuasive texts that reinforce the primer's thesis.

What specific dangers arise when asking GPT-N to write an Alignment Forum post?

Answer by Jan Rzymkowski, Jul 31, 2020

As far as I understand GPT-N, it's not very agent-like (it doesn't form a me-vs-environment abstraction and doesn't look for ways to transform its perceived environment to satisfy some utility function). I wouldn't expect it to "scheme" against people, since it lacks any concept of "affecting its environment".

However, it seems likely that GPT-N can perfect the skill of crowd-pleasing (we already see that; we're constantly amazed by it, despite how little meaning the generated texts carry). It can precisely modulate its tone and identify the talking points that get the most response.

So I expect GPT-N-generated texts to sound really persuasive, not because of novel ideas but because of a superhuman ability to compose ideas it has heard into a persuasive essay.

I would expect GPT-N to focus on presenting solutions for alignment (thereby making us overly optimistic about naive approaches), presenting novel risks (it's easy to make something up by simple rehashing) and possibly venturing into philosophical muddying of the waters (humans prove to be very easily engaged by certain topics, like self-consciousness).

Being the (Pareto) Best in the World

Jan Rzymkowski, 6y

This analysis seems to quietly assume that various important skills are independent variables, and therefore that many people at the top of their field will necessarily be average in various other skills (actually, the chart goes even further and assumes that there's a universal negative correlation between skills -- I'm not even sure if that's mathematically possible for more than 2 variables).

The world's greatest gerontologist will probably be very good at statistics, and even Ed Jaynes would probably be an above-average gerontologist just because he can effectively interpret gerontology data.
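On the parenthetical question: mutual negative correlation is possible for more than 2 variables, but only weakly. If every pair among n unit-variance variables shares the same correlation rho, the variance of their sum is n + n(n-1)·rho, which cannot be negative, so rho ≥ -1/(n-1). The sketch below checks that bound and simulates one construction that reaches it (independent draws with their common mean subtracted). All function names are illustrative, and "skills" here are just simulated Gaussian numbers.

```python
import random

def min_common_correlation(n):
    """Most negative correlation that all pairs among n variables can share."""
    return -1.0 / (n - 1)

def variance_of_sum(n, rho):
    """Variance of the sum of n unit-variance variables with pairwise correlation rho."""
    return n + n * (n - 1) * rho

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def centered_skill_samples(n, trials, seed=0):
    """Draw n independent scores per person, then subtract that person's mean.

    The centered scores are pairwise correlated at exactly -1/(n-1) in theory.
    """
    rng = random.Random(seed)
    columns = [[] for _ in range(n)]
    for _ in range(trials):
        raw = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(raw) / n
        for column, value in zip(columns, raw):
            column.append(value - mean)
    return columns

columns = centered_skill_samples(n=4, trials=20000)
print(pearson(columns[0], columns[1]), min_common_correlation(4))
```

So with, say, 10 skills, a shared pairwise correlation can be no lower than about -0.11, which is far from the strong trade-off such a chart would suggest.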

What kind of thing is logic in an ontological sense?

Jan Rzymkowski, 7y

The applicability of logic to the physical world is a sort of theorem based on the laws of physics (mostly the more metaphysical and less technical ones, like the persistence of objects, which are themselves theorems of the basic laws of physics) and on the laws governing the process of formulating atomic statements from observations.

At the same time, we need to be careful, as we can easily fall into the trap of unfalsifiability -- when the predictions of logic fail, we're used to saying that the problem was with our atomic statements.

That's just a sketch of the full explanation of the topic, which would require at least a chapter.
