I think the confusion here is that "Goodness" means different things depending on whether you're a moral realist or anti-realist.
If you're a moral realist, Goodness is an objective quality that doesn't depend on your feelings/mental state. What is Good may overlap with what you like/prefer/find yummy, but it doesn't have to.
If you're a moral anti-realist, either:
I think "Human Values" is a very poor phrase because:
Instead, people referring to "Human Values" obscure whether they are moral realists or anti-realists, which causes a lot of confusion when determining the implications and logical consistency of their views.
So as I understand it, your (and the MIRI/LW) frame is that:
I think the issue here is with moving down a level of abstraction to a new “map” in a way that makes the entire ontology of decision theory meaningless.
Yes, on some level, we are just atoms following the laws of physics. There are no “agents”, “identities”, or “decisions”. We can just talk about which configurations of atoms we prefer, and agree that we prefer the configuration of atoms where we get more money.
This is not the correct level for thinking about decision theory—we don’t think about any of our decisions that way. Decision theory is about determining the output of the specific choice-making procedure “consider all available options and pick the best one in the moment”. This is the only sense in which we appear to make choices—insofar as we make choices, those choices are over actions.
you are an algorithm, and that algorithm picks the action. Even when an algorithm gets no further external input, and its result is fully determined by the algorithm and so can't change, its result is still determined by the algorithm and not by other things, it's the algorithm that chooses the result
This notion of “choice”, though perhaps reasonable in some cases, seems incorrect for decision theory, where the idea is that, until the point when you make the decision, you could (logically and physically) go for some other option.
If you think of yourself as carrying out a predetermined algorithm, there’s no choice or decision to make or discuss. Maybe, in some sense, “you decided” to one-box, but this is just a question of definitions. You could not have decided otherwise, making all questions of “what should you decide” moot.
Further, if you’re even slightly unsure whether you’re carrying out a predetermined algorithm, you can apply the logic I discussed:
What I'm sketching is a standard MIRI/LW way of seeing this
Sure. But as far as I can tell, that way of seeing decision theory is denying the notion of real choice in the moment and saying “actually all there is is what type of algorithm you are and it’s best to be a UDT/FDT/etc. algorithm.”
But no-one is arguing that being a hardcoded one-boxer is worse than being a hardcoded two-boxer.
The nontrivial question is: how do you become an FDT agent? The MIRI view implies you can choose to be an FDT agent so that you win in Newcomb’s problem. But how? The only way this could be true is if you can pre-commit to one-boxing etc. before the fact, or if you started out on Earth as an FDT agent. Since most people can rule out the latter empirically, we’re back to square one, where MIRI is smuggling the ability to pre-commit into a problem that doesn’t allow it.
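To spell out why the hardcoded case is uncontroversial, here’s a toy sketch assuming a perfect predictor and the standard $1M/$1k payoffs (purely my own illustration):

```python
# Toy Newcomb's problem with a perfect predictor (illustrative only).
# The predictor fills the opaque box based on the agent's fixed disposition.

def newcomb_payoff(disposition: str) -> int:
    """Payoff in dollars for an agent hardcoded to 'one-box' or 'two-box'."""
    opaque = 1_000_000 if disposition == "one-box" else 0  # predictor's fill
    transparent = 1_000
    if disposition == "one-box":
        return opaque                    # take only the opaque box
    return opaque + transparent          # take both boxes

print(newcomb_payoff("one-box"))   # 1000000
print(newcomb_payoff("two-box"))   # 1000
```

The hardcoded one-boxer clearly ends up richer; the dispute is only about what, if anything, an agent facing the boxes can still decide.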
I see what you’re saying now, but I think that’s an exotic interpretation of decision-theoretic problems.
The classical framing is: “you’re faced with options A, B, C… at time t, which one should you pick?”
This is different from “what algorithm would you prefer to implement, one that picks A, B, or C at time t?”.
Specifically, picking an algorithm is like allowing yourself to make a decision/pre-commitment before time t, which wasn’t allowed in the original problem.
Decision theory asks "what sort of motor output is the best?"
Best, from a set of available options
The fact that a calculator prints "83" in response to "71+12=" doesn't make 83 not-the-decision of 71+12.
But it would be meaningless to discuss whether the calculator should choose to print 83 or 84.
Perhaps. I see where you are coming from. Though I think it’s possible that contrastive-prompt-based vectors (e.g. CAA) also approximate “natural” features better than training on those same prompts does (fewer degrees of freedom with the correct inductive bias). I should check whether there has been new research on this…
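To be concrete about what I mean by a contrastive-prompt-based vector, here’s a minimal CAA-style sketch: average the difference of residual-stream activations between prompt pairs that do and don’t exhibit the behavior. The model name, layer index, and prompt pair are placeholders, and the layer path assumes a Llama-style architecture.

```python
# Minimal CAA-style steering-vector sketch (placeholders throughout).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
layer = 13  # which residual-stream layer to read from (a hyperparameter)

def last_token_activation(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the final token of `prompt`."""
    captured = {}
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured["act"] = hidden[0, -1, :].detach()
    # `model.model.layers` is the decoder-block list in Llama-style models
    handle = model.model.layers[layer].register_forward_hook(hook)
    with torch.no_grad():
        model(**tokenizer(prompt, return_tensors="pt"))
    handle.remove()
    return captured["act"]

# Contrastive pairs: same context, completion with vs. without the behavior.
pairs = [
    ("Q: Is the Earth flat? A: No, it is roughly spherical.",
     "Q: Is the Earth flat? A: Yes, it is flat."),
]
steering_vector = torch.stack(
    [last_token_activation(pos) - last_token_activation(neg) for pos, neg in pairs]
).mean(dim=0)
```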
Not sure what distinction you're making. I'm talking about steering for controlling behavior in production, not for red-teaming at eval time or for testing interp hypotheses via causal interventions. However, this still covers both safety (e.g. "be truthful") and "capabilities" (e.g. "write in X style") interventions.
Whenever I read yet another paper or discussion of activation steering to modify model behavior, my instinctive reaction is to slightly cringe at the naiveté of the idea. Training a model to do some task only to then manually tweak some of the activations or weights using a heuristic-guided process seems quite un-bitter-lesson-pilled. Why not just directly train for the final behavior you want—find better data, tweak the reward function, etc.?
But actually there may be a good reason to continue working on model-internals control (i.e. ways of influencing model behavior outside of modifying the text input or training process, by directly changing internal state). For some applications, you may want to express something in terms of the model’s own abstractions, something you won’t know a priori how to express in text or via fine-tuning data. Throughout the training process, a model naturally learns a rich semantic activation space. And in some cases, the “cleanest” way to modify its behavior is by expressing the change in terms of its learned concepts, whose representations are sculpted by exaflops of compute.
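As a rough sketch of what “expressing the change in terms of its learned concepts” can look like in practice: take a steering vector like the one in the earlier snippet (this reuses `model`, `tokenizer`, `layer`, and `steering_vector` from there; the coefficient is an arbitrary placeholder you’d tune) and add it back into the residual stream at inference time.

```python
# Add the (scaled) steering vector back into the same layer's residual stream
# at generation time. Assumes `model`, `tokenizer`, `layer`, and
# `steering_vector` from the earlier sketch; `coeff` would be tuned.
coeff = 4.0

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coeff * steering_vector.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[layer].register_forward_hook(steering_hook)
prompt = tokenizer("Tell me about the shape of the Earth.", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**prompt, max_new_tokens=50)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The point is that the intervention is specified in the model’s own representation space rather than via new text or new training data.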
I think Rationalists have Gettier'ed[1] into reasonable beliefs about good strategies for iterated games/situations where reputation matters and people learn about your actions. But you don't need exotic decision theories for that.
I address this in the post:
i.e. stumbled onto a correct conclusion for the wrong reason