LW1.0 username Manfred. Day job is condensed matter physics, hobby is thinking I know how to assign anthropic probabilities.
Speaking as a character, I too think the player can just go jump in a lake.
My response to this post is to think about something else instead, so if you'll excuse me getting on a hobby horse...
I agree that when we look at someone making bizarre rationalizations, "their values" are not represented consciously, and we have to jump to a different level to find human values. But I think that conscious → unconscious is the wrong level jump to make.
Instead, the jump I've been thinking about recently is to our own model of their behavior. In this case, our explanation of their behavior relies on the unconscious mind, but in other cases, I predict that we'll identify values with conscious desires when that is a more parsimonious explanation of behavior. An AI learning human values would then not merely be modeling humans, but modeling humans' models of humans. But I think it might be okay if it makes those models out of completely alien concepts (at least outside of deliberately self-referential special cases - there might be an analogy here to the recursive modeling of Gricean communication).
I need to work things out carefully not in order to obey norms of communication (I have some posts on reasoning inside causal networks that I think literally only Ilya S. put in the effort to decode - weird flex, I know), but to protect me from myself.
But maybe I'm thinking of a different context than you are, here. If I were writing about swing dancing or easy physics, I'd probably be a lot happier to trust my gut. But for new research, or in other cases where there's optimization pressure working against human intuition, I think it's better to check, and low-bandwidth text from someone who doesn't check isn't worth much reading.
This is a really nice way of explaining the "camera-like point of view." Obv. "consciousness" has a bunch of extra grab-bag components that we associate with the word.
Is your point mostly centered around there being no single correct way to generalize to new domains, but humans have preferences about how the AI should generalize, so to generalize properly, the AI needs to learn how humans want it to do generalization?
The above sentence makes lots of sense to me, but I don't see how it's related to inner alignment (it's just regular alignment), so I feel like I'm missing something.
I've seen several examples that I would have naively expected to be a "change of the causal structure," which were treated by just adding more links to the static causal structure. So, apologies, I can only give evidence that what you're looking for is rare.
I'm pretty sure the concept of defeating spam by making emails cost 1¢ to send is an ancient one - I can't remember where I first encountered it. The hard part seems to be that the difference between "free" and "1¢" has been huge enough to deter most human users. I think we're slowly chipping away at this problem both through microtransaction technology and very slow cultural change.
Thanks for this perspective! I really should get around to reading this book...
Have you ever played the game Hanabi? Some of the statements you make imply (in a "why would he say them otherwise?" style) that your error bars aren't big enough.
So, depending on how you feel about statements like "Human Compatible neither confirms nor dispels the impression that Russell is a bit too academic," I think you should either widen your error bars or do a better job of communicating wide error bars.
I think we can go a bit farther in predicting that backwards causation will be a useful concept in some very specific cases, which will break down far above the scale of the normal second law.
We "see" backwards causation when we know the outcome but not how the system will get there. What does this behavior sound like a hallmark of? Optimization processes! We can predict in advance that backwards causation will be a useful idea to talk about the behavior of some optimization processes, but that it will stop contributing useful information when we want to zoom in past the "intentional stance" level of description.
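To make that point concrete, here's a toy sketch (my own illustration, not from the original comment): for a plain gradient-descent optimizer on a known objective, the endpoint is easy to predict from the goal alone, "backwards," while the trajectory that gets there has to be simulated forwards step by step.

```python
# Toy illustration: an optimization process whose end state is predictable
# from the objective alone, before any step-by-step simulation.

def gradient_descent(f_grad, x0, lr=0.1, steps=1000):
    """Run plain gradient descent and return the trajectory."""
    x = x0
    path = [x]
    for _ in range(steps):
        x = x - lr * f_grad(x)
        path.append(x)
    return path

# Objective: f(x) = (x - 3)**2, with gradient 2*(x - 3).
# The minimum at x = 3 is known "backwards" from the objective;
# the path from x0 = -10 is only known by running the dynamics.
path = gradient_descent(lambda x: 2 * (x - 3), x0=-10.0)

print(abs(path[-1] - 3.0) < 1e-6)  # endpoint predictable from the goal
```

Knowing the objective lets you treat the final state as if it "caused" the behavior (the intentional stance); zooming in to the individual updates dissolves that description into ordinary forward dynamics.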
I thought "aesthetics come from facts" was going to go off into evolutionary psychology. Health being good for our genes is a fact that explains why (without explaining away) health is aesthetically better than sickness (for most people), etc.
As with many film franchises, the first Jurassic Park movie is actually titled "Jurassic Park."