Thanks for your reply!
our value functions probably only "make sense" in a small region of possibility space, and just starts behaving randomly outside of that.
Okay, that helps me understand what you're talking about a bit better. It sounds like the concept of a partial function, and in the ML realm like the notorious brittleness that makes systems incapable of generalizing or extrapolating outside of a limited training set. I understand why you're approaching this from the adversarial angle though, because I suppose you're concerned about the AI just bringing about some state that's outside the domain of definition which just happens to yield a high "random" score.
It doesn't seem right to treat that random behavior as someone's "real values" and try to maximize that.
Upon first reading, I kind of agreed, so I definitely understand this intuition. "Random" behavior certainly doesn't sound great, and "arbitrary" or "undefined" isn't much better. But upon further reflection I'm not so sure.
First of all, what does it mean for a value system to behave randomly/arbitrarily, and is it ever not arbitrary? Arbitrary to me means that there is no reason for something, which sounds a lot like a terminal value to me. If you morally justify having a terminal value X because of reason Y, then X is instrumental to the real terminal value Y.
Secondly, I question whether my value system really is like some kind of partial function that yields random outcomes outside the domain of definition. If you asked me for a (relative) value judgment about two situations that are completely alien to me, then I would imagine being indifferent about their ordering, rather than ordering them randomly. It's possible that I could be persuaded to order one over the other, but then that seems more about changing my beliefs/knowledge and understanding (the "is" domain) than it is about changing my values (the "ought" domain). This may happen in less alien situations too: should we invest in education or healthcare? I don't know, but that's primarily because I can't predict the actual outcomes in terms of things I care about.
Finally, even if a value system were to order two alien situations randomly, how can we say it's wrong? Clearly it wouldn't be wrong according to / compared with that value system, right? And how else are you going to judge whether something is right or wrong, better or worse?
I feel like these questions lead deeply into philosophical territory that I'm not particularly familiar with, but I hope it's useful (rather than nitpicky) to ask these things, because if the intuition that "random is wrong" is itself wrong, then perhaps there's no actual problem we need to pay extra attention to. I also think that some of my questions here can be answered by pointing out that someone's values may be inconsistent / conflicting. But then that seems to be the problem that needs to be solved.
I would like to acknowledge the rest of your comment without responding to it in-depth. I think I have personally spent relatively little time thinking about the complexities of multipolar scenarios (which is likely in part because I haven't stumbled upon as much reading material about them, which may reflect on the AI safety community), so I don't have much to add on this. My previous comment was aimed almost exclusively at your first point (in my mind), because the issue of what value systems are like and what an ASI that's aligned with your values might (unintentionally) do wrong seems somewhat separate from the issue of defending against competing ASIs doing bad things to you or others.
I acknowledge that having simpler and constant values may be a competitive advantage, and that it may be difficult to transfer the nuances of when you think it's okay to manipulate/corrupt someone's values into an ASI. I'm less concerned about other people not thinking of the corruption problem (since their ASIs are presumably smarter), and if they simply don't care (and their aligned ASIs don't either), then this seems like a classic case of AI that's misaligned with your values. Unless you want to turn this hypothetical multipolar scenario into a singleton with your ASI at the top, it seems inevitable that some things are going to happen that you don't like.
I also acknowledge that your ASI may in some sense behave suboptimally if it's overly conservative or cautious. If a choice must be made between alien situations, then it may certainly seem prudent to defer judgment until more information can be gathered, but this is again a knowledge issue rather than a values issue. The value system should then help determine a trade-off between the present uncertainty about the alternatives and the utility of spending more time to gather information (presumably getting outcompeted while you do nothing ranks as "bad" according to most value systems). This can certainly go wrong, but again that seems like more of a knowledge issue (although I acknowledge some value systems may have a competitive advantage over others;
What does it mean for human values to be vulnerable to adversarial examples? When we say this about AI systems (e.g. image classifiers), I think it's either because their judgments on manipulated situations/images are misaligned with ours/humans', or perhaps because they get the "ground truth" wrong. But how can a value system be misaligned with itself or different from the ground truth? For alignment purposes, isn't it itself the ground truth? It could of course fail to match "objective morality" if you believe in that, but in that case we should probably be trying to make our AI align with that and not with someone's human values.
I could (easily) imagine that my values are inconsistent, conflicting, and ever-changing, but these seem like different issues.
It also seems like you have a value that says something to the effect of "it's wrong to corrupt people's values (in certain circumstances)". Then wouldn't an AI that's aligned with your values share this value, and not do this intentionally? And as for unintentionally: it seems that you have thought of this problem, and an ASI would presumably be much smarter than you, so wouldn't it think of it too, and try hard to avoid it? [My reasoning here sounds a bit naive or "too simple" to me, but I'm not sure it's wrong.]
I could understand that there might be issues with value learning AIs that imperfectly learn something close to a human's value function, which may be vulnerable to adversarial examples, but this again seems like a different issue.