I think the confusion here is that "Goodness" means different things depending on whether you're a moral realist or anti-realist.
If you're a moral realist, Goodness is an objective quality that doesn't depend on your feelings/mental state. What is Good may overlap with what you like/prefer/find yummy, but it doesn't have to.
If you're a moral anti-realist, either:
I think "Human Values" is a very poor phrase because:
Instead, people referring to "Human Values" obscure whether they are moral realists or anti-realists, which causes a lot of confusion when determining the implications and logical consistency of their views.
So as I understand it, your (and the MIRI/LW) frame is that:
I think the issue here is with moving down a level of abstraction to a new “map” in a way that makes the entire ontology of decision theory meaningless.
Yes, on some level, we are just atoms following the laws of physics. There are no “agents”, “identities”, or “decisions”. We can just talk about which configurations of atoms we prefer, and agree that we prefer the configuration of atoms where we get more money.
This is not the correct level for thinking about decision theory—we don’t think about any of our decisions that way. Decision theory is about determining the output of the specific choice-making procedure “consider all available options and pick the best one in the moment”. This is the only sense in which we appear to make choices—insofar as we make choices, those choices are over actions.
you are an algorithm, and that algorithm picks the action. Even when an algorithm gets no further external input, and its result is fully determined by the algorithm and so can't change, its result is still determined by the algorithm and not by other things, it's the algorithm that chooses the result
This notion of “choice”, though perhaps reasonable in some cases, seems incorrect for decision theory, where the idea is that, until the point when you make the decision, you could (logically and physically) go for some other option.
If you think of yourself as carrying out a predetermined algorithm, there’s no choice or decision to make or discuss. Maybe, in some sense, “you decided” to one-box, but this is just a question of definitions. You could not have decided otherwise, making all questions of “what should you decide” moot.
Further, if you’re even slightly unsure whether you’re carrying out a predetermined algorithm, you can apply the logic I discussed:
What I'm sketching is a standard MIRI/LW way of seeing this
Sure. But as far as I can tell, that way of seeing decision theory is denying the notion of real choice in the moment and saying “actually all there is is what type of algorithm you are and it’s best to be a UDT/FDT/etc. algorithm.”
But no-one is arguing that being a hardcoded one-boxer is worse than being a hardcoded two-boxer.
The nontrivial question is: how do you become an FDT agent? The MIRI view implies you can choose to be an FDT agent so that you win in Newcomb’s problem. But how? The only way this could be true is if you can pre-commit to one-boxing etc. before the fact, or if you started out on Earth as an FDT agent. Since most people can rule out the latter empirically, we’re back to square one, where MIRI is smuggling the ability to pre-commit into a problem that doesn’t allow it.
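To spell out why the hardcoded case is uncontroversial, here’s a toy sketch assuming a perfect predictor and the standard $1M/$1k payoffs (purely my own illustration):

```python
# Toy Newcomb's problem with a perfect predictor (illustrative only).
# The predictor fills the opaque box based on the agent's fixed disposition.

def newcomb_payoff(disposition: str) -> int:
    """Payoff in dollars for an agent hardcoded to 'one-box' or 'two-box'."""
    opaque = 1_000_000 if disposition == "one-box" else 0  # predictor's fill
    transparent = 1_000
    if disposition == "one-box":
        return opaque                    # take only the opaque box
    return opaque + transparent          # take both boxes

print(newcomb_payoff("one-box"))   # 1000000
print(newcomb_payoff("two-box"))   # 1000
```

The hardcoded one-boxer clearly ends up richer; the dispute is only about what, if anything, an agent facing the boxes can still decide.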
I see what you’re saying now, but I think that’s an exotic interpretation of decision-theoretic problems.
The classical framing is: “you’re faced with options A, B, C… at time t, which one should you pick?”
This is different from “what algorithm would you prefer to implement, one that picks A, B, or C at time t?”.
Specifically, picking an algorithm is like allowing yourself to make a decision/pre-commitment before time t, which wasn’t allowed in the original problem.
Decision theory asks "what sort of motor output is the best?"
Best, from a set of available options
The fact that a calculator prints "83" in response to "71+12=" doesn't make 83 not-the-decision of 71+12.
But it would be meaningless to discuss whether the calculator should choose to print 83 or 84.
Perhaps. I see where you are coming from. Though I think it’s possible that contrastive-prompt-based vectors (e.g. CAA) also approximate “natural” features better than training on those same prompts does (fewer degrees of freedom with the correct inductive bias). I should check whether there has been new research on this…
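To be concrete about what I mean by a contrastive-prompt-based vector, here’s a minimal CAA-style sketch: average the difference of residual-stream activations between prompt pairs that do and don’t exhibit the behavior. The model name, layer index, and prompt pair are placeholders, and the layer path assumes a Llama-style architecture.

```python
# Minimal CAA-style steering-vector sketch (placeholders throughout).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
layer = 13  # which residual-stream layer to read from (a hyperparameter)

def last_token_activation(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the final token of `prompt`."""
    captured = {}
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured["act"] = hidden[0, -1, :].detach()
    # `model.model.layers` is the decoder-block list in Llama-style models
    handle = model.model.layers[layer].register_forward_hook(hook)
    with torch.no_grad():
        model(**tokenizer(prompt, return_tensors="pt"))
    handle.remove()
    return captured["act"]

# Contrastive pairs: same context, completion with vs. without the behavior.
pairs = [
    ("Q: Is the Earth flat? A: No, it is roughly spherical.",
     "Q: Is the Earth flat? A: Yes, it is flat."),
]
steering_vector = torch.stack(
    [last_token_activation(pos) - last_token_activation(neg) for pos, neg in pairs]
).mean(dim=0)
```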
Not sure what distinction you're making. I'm talking about steering for controlling behavior in production, not for red-teaming at eval time or for testing interp hypotheses via causal interventions. However, this still covers both safety (e.g. "be truthful") and "capabilities" (e.g. "write in X style") interventions.
Whenever I read yet another paper or discussion of activation steering to modify model behavior, my instinctive reaction is to slightly cringe at the naiveté of the idea. Training a model to do some task only to then manually tweak some of the activations or weights using a heuristic-guided process seems quite un-bitter-lesson-pilled. Why not just directly train for the final behavior you want—find better data, tweak the reward function, etc.?
But actually there may be a good reason to continue working on model-internals control (i.e. ways of influencing model behavior outside of modifying the text input or training process, by directly changing internal state). For some applications, you may want to express something in terms of the model’s own abstractions, something you won’t know a priori how to express in text or via fine-tuning data. Throughout the training process, a model naturally learns a rich semantic activation space. And in some cases, the “cleanest” way to modify its behavior is by expressing the change in terms of its learned concepts, whose representations are sculpted by exaflops of compute.
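As a rough sketch of what “expressing the change in terms of its learned concepts” can look like in practice: take a steering vector like the one in the earlier snippet (this reuses `model`, `tokenizer`, `layer`, and `steering_vector` from there; the coefficient is an arbitrary placeholder you’d tune) and add it back into the residual stream at inference time.

```python
# Add the (scaled) steering vector back into the same layer's residual stream
# at generation time. Assumes `model`, `tokenizer`, `layer`, and
# `steering_vector` from the earlier sketch; `coeff` would be tuned.
coeff = 4.0

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coeff * steering_vector.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[layer].register_forward_hook(steering_hook)
prompt = tokenizer("Tell me about the shape of the Earth.", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**prompt, max_new_tokens=50)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The point is that the intervention is specified in the model’s own representation space rather than via new text or new training data.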
I think Rationalists have Gettier'ed[1] into reasonable beliefs about good strategies for iterated games/situations where reputation matters and people learn about your actions. But you don't need exotic decision theories for that.
I address this in the post:
i.e. stumbled onto a correct conclusion for the wrong reason