

Thanks for coming. :)

Answer by Noah Topper · Feb 09, 2023

I am confused by your confusion. Your basic question is "what is the source of the adversarial selection". The answer is "the system itself" (or in some cases, the training/search procedure that produces the system satisfying your specification). In your linked comment, you say "There's no malicious ghost trying to exploit weaknesses in our alignment techniques." I think you've basically hit on the crux, there. The "adversarially robust" frame is essentially saying you should think about the problem in exactly this way.

I think Eliezer has conceded that Stuart Russell puts the point best. It goes something like: "If you have an optimization process in which you forget to specify every variable that you care about, then the unspecified variables are likely to be set to extreme values." I would tack on that, due to the fragility of human value, it's much easier to set such a variable to an extremely bad value than an extremely good one.

Basically, however the goal of the system is specified or represented, you should ask yourself if there's some way to satisfy that goal in a way that doesn't actually do what you want. Because if there is, and it's simpler than what you actually wanted, then that's what will happen instead. (Side note: the system won't literally do something just because you hate it. But the same is true for other Goodhart examples. Companies in the Soviet Union didn't game the targets because they hated the government, but because it was the simplest way to satisfy the goal as given.)
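The Soviet-targets example can be sketched as a toy optimization (my own construction, numbers purely illustrative): a factory is scored on nail count, while the variable we actually care about (usefulness per nail) goes unspecified, and the optimizer pushes it to an extreme.

```python
def nails_produced(nail_weight_kg, metal_budget_kg=100.0):
    """Proxy target: how many nails a fixed metal budget yields."""
    return int(metal_budget_kg // nail_weight_kg)

# Search over nail sizes for the best proxy score.
sizes_kg = [0.001, 0.01, 0.1, 1.0]
best_size = max(sizes_kg, key=nails_produced)
# The proxy "number of nails" is maximized by making microscopic,
# useless nails: the unspecified variable (usefulness per nail)
# gets set to an extreme value, simply because that is the easiest
# way to satisfy the goal as given.
```

Nothing here "hates" the factory planners; gaming the target is just the simplest point in the search space that satisfies the specification.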

"If the system is trying/wants to break its safety properties, then it's not safe/you've already made a massive mistake somewhere else." I mean, yes, definitely. Eliezer makes this point a lot in some Arbital articles, saying stuff like "If the system is spending computation searching for things to harm you or thwart your safety protocols, then you are doing the fundamentally wrong thing with your computation and you should do something else instead." The question is how to do so.

Also from your linked comment: "Cybersecurity requires adversarial robustness, intent alignment does not." Okay, but if you come up with some scheme to achieve intent alignment, you should naturally ask "Is there a way to game this scheme and not actually do what I intended?" Take this Arbital article on the problem of fully-updated deference. Moral uncertainty has been proposed as a solution to intent alignment. If the system is uncertain as to your true goals, then it will hopefully be deferential. But the article lays out a way the system might game the proposal. If the agent can maximize its meta-utility function over what it thinks we might value, and still not do what we want, then clearly this proposal is insufficient.
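That failure mode can be made concrete with a toy expected-utility calculation (my own numbers, not from the Arbital article): an agent maximizing a meta-utility function over hypotheses about our values can still choose an action that our true values rate poorly.

```python
credence = {"u1": 0.2, "u2": 0.8}    # agent's beliefs about what we value
utility = {                           # payoff of each action under each hypothesis
    "defer": {"u1": 1.0,   "u2": 1.0},
    "act":   {"u1": -10.0, "u2": 5.0},
}
expected = {a: sum(credence[h] * utility[a][h] for h in credence)
            for a in utility}
best = max(expected, key=expected.get)
# best == "act" (expected utility 2.0 vs 1.0 for deferring), even though
# under the true human values u1 it is far worse than deferring. Moral
# uncertainty alone did not buy deference.
```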

If you propose an intent alignment scheme such that when we ask "Is there any way the system could satisfy this scheme and still be trying to harm us?", the answer is "No", then congrats, you've solved the adversarial robustness problem! That seems to me to be the goal and the point of this way of thinking.

I mean, fair enough, but I can't weigh it against every other opportunity available to you on your behalf. I did try to compare it to learning other languages. I'll add to the post that I also think it's comparatively easy to learn.

FWIW I genuinely think ASL is easy to learn with the videos I linked above. Overall I think sign is more worthwhile to learn than most other languages, but yes, not some overwhelming necessity. Just very personally enriching and neat. :)

It's entirely just a neat thing. I think most people should consider learning to sign, and the idea of it becoming a rationalist "thing" just sounded fun to me.  I did try to make that clear, but apologies if it wasn't. And as I said, sorry this is kind of off topic, it's just been a thing bouncing around in my head.

Honestly, I found ASL easier to learn than, say, the limited Spanish I tried to learn in high school. Maybe that's because it doesn't conflict with how you already communicate. Just from watching the ASL 1 - 4 lectures I linked to, I was surprisingly able to manage once dropped into a one-on-one conversation with a deaf person.

It would definitely be good to learn with a buddy. My wife hasn't explicitly learned it yet, but she's picked up some from me. Israel is a tough choice, I'm not sure what the learning resources are like for it.

...and now I am also feeling like I really should have realized this as well.

I agree that there isn’t an “obvious” set of assumptions for the latter question that yields a unique answer. And granted I didn’t really dig into why entropy is a good measure, but I do think it ultimately yields the unique best guess given the information you have. The fact that it’s not obvious is rather the point! The question has a best answer, even if you don’t know what it is or how to give it.
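A minimal numerical check of that uniqueness claim (my own sketch): with no information beyond normalization, the uniform distribution has entropy at least as high as any other candidate, so it is the maximum-entropy best guess.

```python
import math
import random

def entropy(p):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)

uniform = [1 / 6] * 6   # max-entropy guess for a six-outcome question
random.seed(0)
for _ in range(1000):
    raw = [random.random() for _ in range(6)]
    p = [x / sum(raw) for x in raw]   # a random rival distribution
    assert entropy(p) <= entropy(uniform) + 1e-12
# entropy(uniform) == log 6, the maximum achievable over six outcomes
```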

In any real-life inference problem, nobody is going to tell you: "Here is the exact probability space with a precise, known probability for each outcome." (I literally don't know what such a thing would mean anyway). Is all inference thereby undefined? Like Einstein said, "As far as laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality". If you can't actually fulfill the axioms in real life, what's the point?

If you still want to make inferences anyway, I think you're going to have to adopt the Bayesian view. A probability distribution is never handed to us, and must always be extracted from what we know. And how you update your probabilities in response to new evidence also depends on what you know. If you can formalize exactly how, then you have a totally well-defined mathematical problem, hooray!
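As a minimal sketch of "formalize exactly how" (a hypothetical coin example with my own numbers): once you write down a prior and a likelihood model, updating on evidence is a fully determined computation.

```python
prior = {"fair": 0.5, "biased": 0.5}      # credence over coin hypotheses
p_heads = {"fair": 0.5, "biased": 0.9}    # P(heads | hypothesis)

def update(prior, p_heads, saw_heads):
    """Bayes' rule: weight each hypothesis by its likelihood, renormalize."""
    post = {h: pr * (p_heads[h] if saw_heads else 1 - p_heads[h])
            for h, pr in prior.items()}
    z = sum(post.values())
    return {h: pr / z for h, pr in post.items()}

posterior = update(prior, p_heads, saw_heads=True)
# posterior["biased"] = 0.45 / 0.70 ≈ 0.643: what you know (the prior
# and the likelihood model) fully determines how the evidence moves you.
```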

My point, then, is that we feel a problem isn't well-defined exactly when we don't know how to convert what we know into clear mathematics. (I'm really not trying to play a semantics game. This was an attempt to dissolve the concept of "well-defined" for probability questions.) But you can see a bit of a paradox when adding more information makes the mathematical problem harder, even though this shouldn't make the problem any less "well-defined".
