


That prediction may be true. My argument is that "I know this by introspection" (or introspection-and-generalization-to-others) is insufficient. For a concrete example, consider your 5-year-old self. I remember some pretty definite beliefs I had about my future self that turned out wrong, and if I ask myself how aligned I am with him, I don't even know how to answer; he just seems way too confused and incoherent.

I think it's also not absurd that you do have perfect caring in the sense relevant to the argument. This does not require that you don't make mistakes currently. If you can, with increasing intelligence/information, correct yourself, then the pointer is perfect in the relevant sense. "Caring about the values of person X" is relatively simple and may come out of evolution whereas "those values directly" may not.


This prediction seems flatly wrong: I wouldn’t bring about an outcome like that. Why do I believe that? Because I have reasonably high-fidelity access to my own policy, via imagining myself in the relevant situations.

This seems like you're confusing two things here, because the thing you would want is not knowable by introspection. What I think you're introspecting is that if you'd noticed that the-thing-you-pursued-so-far was different from what your brother actually wants, you'd do what he actually wants. But the-thing-you-pursued-so-far doesn't play the role of "your utility function" in the goodhart argument. All of you plays into that. If the goodharting were to play out, your detector for differences between the-thing-you-pursued-so-far and what-your-brother-actually-wants would simply fail to warn you that it was happening, because it too can only use a proxy measure for the real thing.

The idea is that we can break any decision problem down by cases (like "insofar as the predictor is accurate, ..." and "insofar as the predictor is inaccurate, ...") and that all the competing decision theories (CDT, EDT, LDT) agree about how to aggregate cases.

Doesn't this also require that all the decision theories agree that the conditioning fact is independent of your decision?

Otherwise you could break down the normal prisoner's dilemma into "insofar as the opponent makes the same move as me" and "insofar as the opponent makes the opposite move" and conclude that defect isn't the dominant strategy even there, not even under CDT.
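To make the point about case breakdowns concrete, here is a minimal sketch. The payoff numbers are hypothetical (the usual textbook 5/3/1/0), and the function names are made up for illustration. It checks dominance under two partitions: one by the opponent's move, which does not depend on my decision, and one by "same move as me" versus "opposite move", which does.

```python
# Hypothetical payoffs for the row player: (my move, opponent's move) -> my payoff.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def other(move):
    """Return the opposite move."""
    return "D" if move == "C" else "C"


def dominates_by_opponent_move(a, b):
    """Cases are the opponent's move, held fixed regardless of what I do."""
    return all(PAYOFF[(a, opp)] > PAYOFF[(b, opp)] for opp in "CD")


def dominates_by_same_or_opposite(a, b):
    """Cases are 'opponent copies me' and 'opponent does the opposite of me'.
    Here the opponent's move inside each case depends on my own move."""
    cases = [lambda mine: mine, other]
    return all(PAYOFF[(a, f(a))] > PAYOFF[(b, f(b))] for f in cases)


print(dominates_by_opponent_move("D", "C"))      # True
print(dominates_by_same_or_opposite("D", "C"))   # False
print(dominates_by_same_or_opposite("C", "D"))   # False
```

Under the second partition neither move dominates, which is why the aggregation step seems to need the conditioning fact to be independent of the decision.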

And I imagine the within-CDT perspective would reject an independent probability for the predictor's accuracy. After all, there's an independent probability it guessed 1-box, and if I 1-box it's right with that probability, and if I 2-box it's right with 1 minus that probability.

Would a decision theory like this count as "giving up on probabilities" in the sense in which you mean it here?

I think your assessments of what's psychologically realistic are off.

I do not know what it feels like from the inside to feel like a pronoun is attached to something in your head much more firmly than "doesn't look like an Oliver" is attached to something in your head.

I think before writing that, Yud imagined calling [unambiguously gendered friend] either pronoun, and asked himself if it felt wrong, and found that it didn't. This seems realistic to me: I've experienced my emotional introspection becoming blank on topics I've put a lot of thinking into. This doesn't prevent doing the same automatic actions you always did, or knowing what those would be in a given situation. If something like this happened to him for gender long enough ago, he may well not be able to imagine otherwise.

But the "everyone present knew what I was doing was being a jerk" characterization seems to agree that the motivation was joking/trolling. How did everyone present know? Because it's absurd to infer a particular name from someone's appearance.

It's unreasonable, but it seems totally plausible that on one occasion you would feel like you know someone has a certain name, and continue feeling that way even after being rationally convinced you're wrong. That there are many names only means that the odds of any particular name featuring in such a situation are low, not that the class as a whole has low odds, and I don't see why the prior for that would be lower than for e.g. mistaken deja vu experiences.

I don't think the analogy to biological brains is quite as strong. For example, biological brains need to be "robust" not only to variations in the input, but also in a literal sense, to forceful impact or to parasites trying to control them. A brain intentionally has very bad suppressability, and this means there needs to be a lot of redundancy, which makes "just stick an electrode in that area" work. More generally, it is under many constraints that an ML system isn't, probably too many for us to think of, and it generally prioritizes safety over performance. Both lead away from the sort of maximally efficient compression that makes ML systems hard to interpret.

Analogously: Imagine a programmer wrote the shortest program that does a given task. That would be terrible. It would be impossible to change anything without essentially redesigning everything, trying to understand what it does just from reading the code would be very hard, and giving a compressed explanation of how it does that would be impossible. In practice, we don't write code like that, because we face constraints like those mentioned above - but it's very easy to imagine that some optimization-based "automatic coder" would program like that. Indeed, on the occasion that we need to really optimize runtimes, we move in that direction ourselves.

So I don't think brains tell us much about the interpretability of the standard, highly optimized neural nets.

Probably way too old here, but I had multiple experiences relevant to the thread.

Once I had a dream and then, in the dream, I remembered I had dreamt this exact thing before, and wondered if I was dreaming now, and everything looked so real and vivid that I concluded I was not.

I can create a kind of half-dream, where I see random images and moving sequences at most 3 seconds or so long, in succession. I am really dimmed but not sleeping, and I am aware in the back of my head that they are only schematic and vague.

I would say the backstories in dreams are different in that they can be clearly nonsensical. E.g. I hold and look at a glass relief, there is no movement at all, and I know it to be a movie. I know nothing of its content, and I don't believe the image of the relief to be in the movie.

I think it's still possible to have a scenario like this. Let's say each trader would buy or sell a certain amount when the price is below/above what they think it to be, with the transition being very steep instead of instant. Then you could still have long price intervals where the amounts bought and sold remain constant, and then every point in there could be the market price.
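A rough sketch of the scenario above, under assumptions of my own: two hypothetical traders with beliefs 0.3 and 0.7, each trading a fixed amount, with a piecewise-linear transition of width 0.01 around their belief. None of these numbers or names come from the logical induction paper; they only illustrate how a steep but continuous strategy can leave a whole interval of market-clearing prices.

```python
def demand(price, belief, amount=1.0, width=0.01):
    """Continuous but steep strategy: buy `amount` when the price is well
    below the belief, sell `amount` when it is well above, and interpolate
    linearly inside a window of half-width `width` around the belief."""
    x = (belief - price) / width
    return amount * max(-1.0, min(1.0, x))


def net_demand(price, beliefs):
    """Total amount bought minus total amount sold at this price."""
    return sum(demand(price, b) for b in beliefs)


beliefs = [0.3, 0.7]  # hypothetical traders

# Scan a price grid and keep every price at which buying and selling balance.
grid = [p / 100 for p in range(101)]
clearing = [p for p in grid if abs(net_demand(p, beliefs)) < 1e-9]

print(min(clearing), max(clearing))
```

Between the two beliefs, one trader sells the full amount and the other buys it, so net demand is zero across the whole interval rather than at a single point. Adding a trader with a gentler slope inside that interval would pick out a unique price again.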

I'm not sure if this is significant. I see no reason to set the traders up this way other than the result in the particular scenario that kicked this off, and adding traders who don't follow this pattern breaks it. Still, it's a bit worrying that trading strategies seem to matter in addition to beliefs, because what do they represent? A trader's initial wealth is supposed to be our confidence in its heuristics - but if a trader is mathematical heuristics and trading strategy packaged, then what does confidence in the trading strategy mean epistemically? Two things to think about:

Is it possible to consistently define the set of traders with the same beliefs as trader X?

It seems that logical induction is using a trick, where it avoids inconsistent discrete traders, but includes an infinite sequence of continuous traders with ever steeper transitions to get some of the effects. This could lead to unexpected differences between behaviour "at all finite steps" vs "at the limit". What can we say about logical induction if trading strategies need to be Lipschitz-continuous with a shared upper limit on the Lipschitz constant?

So I'm not sure what's going on with my mental sim. Maybe I just have a super-broad 'crypto-moral detector' that goes off way more often than yours (w/o explicitly labeling things as crypto-moral for me).

Maybe. How were your intuitions before you encountered LW? If you already had a hypocrisy intuition, then trying to internalize the rationalist perspective might have led it to ignore the morality-distinction.

My father playing golf with me today, telling me to lean down more to stop them going out left so much.
