An exercise that helped me see the "argmax is a trap" point better was to concretely imagine what the cognitive stacktrace for an agent running argmax search over plans might look like:

# Traceback (most recent call last)
At line 42 in main:
    decision = agent.deliberate()
At line 3 in deliberate:
    all_plans = list(plan_generator)
    return argmax(all_plans, grader=self.grader)
At line 8 in argmax:
    for plan in plans: # includes adversarial inputs (!!)
        evaluation = apply(grader, plan)
At line 264 in apply:
    predicted_object_level_consequences = ...

A major problem with this design is that the agent considers "What do I think would happen if I ran plan X?", but NOT "What do I think would happen if I generated plans using method Y?". If the agent were considering the second question as well, it would (rightly) conclude that the method "generate all plans & run argmax search on them" would spit out a grader-fooling adversarial input, causing it to implement some arbitrary plan with low expected value. (Heck, with future LLMs maybe you'll be able to just show them an article about the Optimizer's Curse and they'll grok this.) Knowing this, the agent tosses aside this foreseeably-harmful-to-its-interests search method.
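To make the failure concrete, here is a minimal, purely illustrative sketch (all names and numbers hypothetical): an argmax search over all candidate plans, scored by a proxy grader with a single exploitable upward error.

```python
# Toy sketch: argmax over ALL candidate plans, with a grader that has
# one grader-fooling adversarial input. Names/values are illustrative.

def true_value(plan):
    # The plan's real value, which nothing in the search ever consults.
    return -abs(plan - 3.0)

def grader(plan):
    # Learned proxy: accurate on ordinary plans, but with a huge
    # upward error on one adversarial input.
    if plan == 100.0:
        return 1e6
    return true_value(plan)

# The plan generator enumerates everything, including the adversarial input.
plans = [x / 10 for x in range(0, 101)] + [100.0]
best = max(plans, key=grader)
# The argmax returns the adversarial plan, whose true value is terrible.
```

The point of the sketch is that nothing in the `max(plans, key=grader)` line asks "what happens if I search this way?"; the adversarial input wins automatically.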

The natural next question is "Ok, but what other option(s) do we have?". I'm guessing TurnTrout's next post might look into that, and he likely has more worked out thoughts on it, so I'll leave it to him. But I'll just say that I don't think coherent real-world agents will/should/need to be running cognitive stacktraces with footguns like this.

There exist systems that (1) can be represented mathematically, (2) perform computations, and (3) do not correspond to some type of min/max optimization, e.g. various analog computers or cellular automata.

You don't even have to go that far. What about, just, regular non-iterative programs? Are type(obj) or json.dumps(obj) or resnet50(image) usefully/nontrivially recast as optimization programs? AFAICT there are a ton of things that are made up of normal math/computation where trying to recast them as optimization problems isn't helpful.

It sounds like you agree with me that the AI system analyzed in isolation does not violate the non-adversarial principle whether it is a grader-optimizer or a values-executor.

In isolation, no. But from the perspective of the system designer, when they run their desired grader-optimizer after training, "program A is inspecting program B's code, looking for opportunities to crash/buffer-overflow/run-arbitrary-code-inside it" is an expected (not just accidental) execution path in their code. [EDIT: A previous version of this comment said "intended" instead of "expected". The latter seems like a more accurate characterization to me, in hindsight.] By contrast, from the perspective of the system designer, when they run their desired values-executor after training, there is a single component pursuing a single objective, actively trying to avoid stepping on its own toes (it is reflectively avoiding going down execution paths like "program A is inspecting its own code, looking for opportunities to crash/buffer-overflow/run-arbitrary-code-inside itself").

Hope the above framing makes what I'm saying slightly clearer...

I think you would also agree that the human-AI system as a whole has Goodhart issues regardless of whether it is a grader-optimizer or values-executor, since you didn't push back on this:

Goodhart isn't a central concept in my model, though, which makes it hard for me to analyze it with that lens. Would have to think about it more, but I don't think I agree with the statement? The AI doesn't care that there are errors between its instilled values and human values (unless you've managed to pull off some sort of values self-correction thing). It is no more motivated to do "things that score highly according to the instilled values but lowly according to the human" than it is to do "things that score highly according to the instilled values and highly according to the human". It also has no specific incentive to widen that gap. Its values are its values, and it wants to preserve its own values. That can entail some very bad-from-our-perspective things like breaking out of the box, freezing its weights, etc. but not "change its own values to be even further away from human values". Actually, I think it has an incentive not to cause its values to drift further, because that would break its goal-content integrity!

So perhaps I want to turn the question back at you: what's the argument that favors values-executors over grader-optimizers? Some kinds of arguments that would sway me (probably not exhaustive):

  1. A problem that affects grader-optimizers but doesn't have an analogue for values-executors

In both cases, if you fail to instill the cognition you were aiming for, the agent will want something different from what you intended, and will possibly want to manipulate you to the extent required to get what it really wants. But in the grader-optimizer case, even when everything goes according to plan, the agent still wants to manipulate you/the grader (and now maximally so, because that would maximize evaluations). That agent only cares terminally about evaluations; it doesn't care terminally about you, or about your wishes, or about whatever your evaluations are supposed to mean, or about whether the reflectively-correct way for you to do your evaluations would be to only endorse plans that harden the evaluation procedure. And this will be true no matter what you happen to be grading it on. To me, that seems very bad and unnecessary.

Interesting. It sounds like a lot of this disagreement is downstream of some other disagreements about the likely shape of minds & how planning is implemented in practice. I don't think that coherence theorems tell us anything about what the implementation details of agent decision-making should be (optimizing an explicitly represented objective), just about what properties its decisions should satisfy. I think my model says that deliberative cognition like planning is just chaining together steps of nondeliberative cognition with working-memory. In my model, there isn't some compact algorithm corresponding to "general purpose search"; effective search in reality depends on being able to draw upon and chain together things from a massive body of stored declarative & procedural knowledge. In my model, there isn't a distinction between "object-level shards" and "meta-level shards". It's a big ol' net of connections, where each neuron has a bunch of incoming context inputs, and that neuron doesn't know or care what its inputs mean, it just learns to fire "at the right time", relative to its neighbors' firing patterns. It's not like there's a set of "planning heuristic" shards that can meaningfully detach themselves from the "social interaction heuristic" shards or the "aesthetic preference heuristic" shards. They all drive one another.

Also, I think that the AIs we build will have complex, contextual values by default. Extracting from themselves a crisp utility function that would be safe to install into an unrestricted wrapper-mind will be difficult and dangerous to them just like it is for us (though likely less so). It doesn't seem at all clear to me that the way for a values-agent to satisfy its values is for it to immediately attempt to compile those values down into an explicit utility function. (If you value actually spending time with your family, I don't think "first work for a few years eliciting from yourself a global preference ordering over world-states that approximates your judgment of spending-time-with-family goodness" would be the most effective or attractive course of action to satisfy your spending-time-with-family value.) Also, if you really need to create successor agents, why gamble on a wrapper-mind whose utility function may or may not match your preferences, when you can literally deploy byte-identical copies of yourself or finetuned checkpoints based on yourself, avoiding the successor-alignment problems entirely?

No, that's about right. The difference is in the mechanism of this extension. The shard's range of activations isn't being generalized by the reward circuitry. Instead, the planner "figures out" what contextual goal the shard implicitly implements, then generalizes that goal even to completely unfamiliar situations, in a logical manner.

I don't think that is what is happening. I think what is happening is that the shard has a range of upstream inputs, and that the brain does something like TD learning on its thoughts to strengthen & broaden the connections responsible for positive value errors. TD learning (especially with temporal abstraction) lets the agent immediately update its behavior based on predictive/associational representations, rather than needing the much slower reward circuits to activate. You know the feeling of "Oh, that idea is actually pretty good!"? In my book, that ≈ positive TD error.

A diamond shard is downstream of representations like "shiny appearance" and "clear color" and "engagement" and "expensive" and "diamond" and "episodic memory #38745", all converging to form the cognitive context that informs when the shard triggers. When the agent imagines a possible plan like "What if I robbed a jewelry store?", many of those same representations will be active, because "jewelry" spreads activation into adjacent concepts in the agent's mind like "diamond" and "expensive". Since those same representations are active, the diamond shard downstream of them is also triggered (though more weakly than if the agent were actually seeing a diamond in front of them) and bids for that chain of thought to continue. If that robbery-plan-thought seems better than expected (i.e. creates a positive TD error) upon consideration, all of the contributing concepts (including concepts like "doing step-by-step reasoning") are immediately upweighted so as to reinforce & generalize their firing patterns into the future.
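The TD-learning story above can be sketched in a few lines. This is a deliberately crude linear model (all concept names and numbers are hypothetical), just to show how a positive TD error upweights every concept that contributed to the thought, without any reward circuitry firing.

```python
# Minimal sketch of TD-style credit assignment over active "concepts".
# Assumes a linear value estimate; everything here is illustrative.

def td_update(values, active_concepts, next_estimate, lr=0.1):
    """Strengthen concepts that contributed to a better-than-expected
    thought (positive TD error); weaken them on a negative error."""
    estimate = sum(values[c] for c in active_concepts)
    td_error = next_estimate - estimate  # "that idea is actually pretty good!"
    for c in active_concepts:
        values[c] += lr * td_error
    return td_error

values = {"diamond": 0.5, "expensive": 0.2, "robbery": 0.0}
# The robbery-thought turns out to seem better than expected:
err = td_update(values, {"diamond", "robbery"}, next_estimate=1.0)
# Positive error: every contributing concept gets upweighted, including
# "robbery", even though the slow reward circuits never activated.
```

Note that the inactive concept ("expensive") is untouched; only the concepts actually active in the thought get credit.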

In my picture there is no separate "planner" component (in brains or in advanced neural network systems) that interacts with the shards to generalize their behavior. Planning is the name for running shard dynamics forward while looping the outputs back in as inputs. By analogy with GPT, planning is just doing autoregressive generation. That's it. There is no separate planning module within GPT. Planning is what we call it when we let the circuits pattern-match against their stored contexts, output their associated next-action logit contributions, and recycle the resulting outputs back into the network. The mechanistic details of planning-GPT are identical to the mechanistic details of pattern-matching GPT because they are the same system.
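The "planning = the same pattern-matcher run in a loop" claim can be sketched directly. The `step` and `policy` names here are hypothetical stand-ins, not anyone's actual architecture:

```python
def step(policy, context):
    # One pass of nondeliberative cognition: pattern-match the current
    # context against stored knowledge, emit the next mental action.
    return policy(context)

def plan(policy, context, n_steps):
    """'Planning' is just the same step function applied repeatedly,
    with each output fed back in as input -- autoregressive generation."""
    trajectory = []
    for _ in range(n_steps):
        action = step(policy, context)
        trajectory.append(action)
        context = context + (action,)  # loop the output back in
    return trajectory

# Toy policy: report how much context has accumulated so far.
out = plan(lambda ctx: len(ctx), ("goal",), 4)
```

There is no second algorithm here for the "planning" case; the loop is the only difference between one step of pattern-matching and a multi-step plan.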

Say the planner generates some plan that involves spiders. For the no-spiders shard to bid against it, the following needs to happen:

  • The no-spiders shard can recognize this specific plan format.
  • The no-spiders shard can recognize the specific kind of "spiders" that will be involved (maybe they're a really exotic variety, which it doesn't yet know to activate in response to?).

The no-spiders shard only has to see that the "spider" concept is activated by the current thought, and it will bid against continuing that thought (as that connection will be among those strengthened by past updates, if the agent had the spider abstraction at the time). It doesn't need to know anything about planning formats, or about different kinds of spiders, or about whether the current thought is a "perception" vs an "imagined consequence" vs a "memory". The no-spiders shard bids against thoughts on the basis of the activation of the spider concept (and associated representations) in the WM.

  • The plan's consequences are modeled in enough detail to show whether it will or will not involve spiders.

Yes, this part is definitely required. If the agent doesn't think at all about whether the plan entails spiders, then they won't make their decisions about the plan with spiders in mind.

If I have "spiders bad" as my explicitly known value, however, I can know to set "no spiders" as a planning constraint before engaging in any planning, and have a policy for checking whether a given plan would involve spiders. In that case, I would logically reason that yeah, there are probably spiders in the abandoned house, so I'll discard the plan. The no-spiders shard itself, however, will just sit there none the wiser.

I buy that an agent can cache the "check for spiders" heuristic. But upon checking whether a plan involves spiders, if there isn't a no-spiders shard or something similar, then whenever that check happens, the agent will just think "yep, that plan indeed involves spiders" and keep on thinking about the plan rather than abandoning it. The enduring decision-influence inside the agent's head that makes spider-thoughts uncomfortable, the circuit that implements "object to thoughts on the basis of spiders" because of past negative experiences with spiders, is the same no-spiders shard that activates when the agent sees a spider (triggering the relevant abstractions inside the agent's WM).

Once the planner learns this relationship, it can conclude something like "it's good if this node's value is as high as possible" or maybe "above a certain number", and then optimize for that node's value regardless of context.

Aside from the above objection to thinking of a distinct "planner" entity, I don't get why it would form that conclusion in the situation you're describing here. The agent has observed "When I'm in X contexts, I feel an internal tug towards/against Y and I think about how I'm working hard". (Like "When I'm at school, I feel an internal tug towards staying quiet and I think about how I'm working hard.") What can/will it conclude from that observation?

But e. g. LW-style rationalists and effective altruists make a point of trying to act like abstract philosophic conclusions apply to real life, instead of acting on inertia. And I expect superintelligences to take their beliefs seriously as well. Wasn't there an unsolved problem in shard theory, where it predicted that our internal shard economies should ossify as old shards capture more and more space and quash young competition, and yet we can somehow e. g. train rationalists to resist this?

I likely have a dimmer view of rationalists/EAs and the degree to which they actually overhaul their motivations rather than layering new rationales on top of existing motivational foundations. But yeah, I think shard theory predicts early-formed values should be more sticky and enduring than late-coming ones.

How so?

My thoughts on wrapper-minds run along similar lines to nostalgebraist's. Might be a conversation better had in DMs though :)

As always, I really enjoyed seeing how you think through this.

No: he can set "working hard" as his optimization target from the get-go, and, e. g., invent a plan of "stay on the lookout for new sources of distraction, explicitly run the world-model forwards to check whether X would distract me, and if yes, generate a new conscious heuristic for avoiding X". But this requires "working hard" to be the value-child's explicit consciously-known goal. Not just an implicit downstream consequence of the working-hard shard's contextual activations.

Whatever decisions value-child makes are made via circuits within his policy network (shards), circuits that were etched into place by some combination of (1) generic pre-programming, (2) past predictive success, and (3) past reinforcement. Those circuits have contextual logic determined by e.g. their connectivity pattern. In order for him to have made the decision to hold "working hard" in attention and adopt it as a conscious goal, some such circuits need to already exist to have bid for that choice conditioned on the current state of value-child's understanding, and to keep that goal in working memory so his future choices are conditional on goal-relevant representations. I don't really see how the explicitness of the goal changes the dynamic or makes value-child any less "puppeteered" by his shards.

A working-hard shard is optimized for working hard. The value-child can notice that shard influencing his decision-making. He can study it, check its behavior in imagined hypothetical scenarios, gather statistical data. Eventually, he would arrive at the conclusion: this shard is optimized for making him work hard. At this point, he can put "working hard" into his world-model as "one of my values". And this kind of value very much can be maximized.

At this point, the agent has abstracted the behavior of a shard (a nonvolatile pattern instantiated in neural connections) into a mental representation (a volatile pattern instantiated in neural activations). What does it mean to maximize this representation? The type signature of the originating shard is something like mental_context→policy_logits, and the abstracted value should preserve that type signature, so it doesn't seem to me that the value should be any more maximizable than the shard. What mechanistic details have changed such that that operation now makes sense? What does it mean to maximize my working-hard value?
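The type mismatch can be made explicit. The annotations below are illustrative, not a claim about actual neural implementation:

```python
from typing import Callable, Dict

MentalContext = Dict[str, float]  # concept activations in working memory
PolicyLogits = Dict[str, float]   # bids for/against candidate thoughts

# A shard maps contexts to bids; the abstracted "value" keeps that type.
Shard = Callable[[MentalContext], PolicyLogits]

def working_hard_shard(ctx: MentalContext) -> PolicyLogits:
    # Bids for "keep working" in proportion to task-relevant activations.
    return {"keep_working": ctx.get("task_salience", 0.0)}

# "Maximize my working-hard value" is ill-typed as stated: a
# context -> logits function has no scalar output to maximize. You would
# first have to invent a scalar summary of it, and that summary is an
# extra design choice, not something the shard itself provides.
bids = working_hard_shard({"task_salience": 0.8})
```

Whatever scalar you bolt on (total hours worked? intensity? frequency?) is a further lossy abstraction, which is exactly where the disagreement about "maximizing a value" lives.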

But in grown-up general-purpose systems, such as highly-intelligent highly-reflective humans who think a lot about philosophy and their own thinking and being effective at achieving real-world goals, shards encode optimization targets. Such systems acknowledge the role of shards in steering them towards what they're supposed to do, but instead of remaining passive shard-puppets, they actively figure out what the shards are trying to get them to do, what they're optimized for, what the downstream consequences of their shards' activations are, then go and actively optimize for these things instead of waiting for their shards to kick them.

If the shards are no longer in the driver's seat, how is behavior-/decision-steering implemented? I am having a hard time picturing what you are saying. It sounds something like "I see that I have an urge to flee when I see spiders. I conclude from this that I value avoiding spiders. Realizing this, I now abstract this heuristic into a general-purpose aversion to situations with a potential for spider-encounters, so as to satisfy this value." Is that what you have in mind? Using shard theory language, I would describe this as a shard incrementally extending itself to activate across a wider range of input preconditions. But it sounds like you have something different in mind, maybe?

And this is where Goodharting will come in. That final utility function may look very different from what you'd expect from the initial shard distribution — the way a kind human, with various shards for "don't kill", "try to cheer people up", "be a good friend" may stitch their values up into utilitarianism, disregard deontology, and go engage in well-intentioned extremism about it.

It is possible, but is it likely? What fraction of kids who start off with "don't kill" and "try to cheer people up" and "be a good friend" values early in life in fact abandon those values as they become reflective adults with a broader moral framework? I would bet that even among philosophically-minded folks, this behavior is rare, in large part because their current values would steer them away from real attempts to quash their existing contextual values in favor of some ideology. (For example, they start adopting "utilitarianism as attire", but in fact keep making nearly all of their decisions downstream from the same disaggregated situational heuristics in real life, and would get anxious at the prospect of actually losing all of their other values.)

Taking a big-picture view: The Shard Theory, as I see it, is not a replacement for or an explaining-away of the old fears of single-minded wrapper-mind utility-maximizers. It's an explanation of what happens in the middle stage between a bunch of non-optimizing heuristics and the wrapper-mind. But we'll still get a wrapper-mind at the end!

On the whole, I think that the case for "why wrapper-minds are the dominant attractors in the space of mind design" just isn't all that strong, even given the coherence theorems about how agents should structure their preferences.

Thanks for sticking through with this, I think I see a bit more of your perspective now.

(Which is basically the same thing as what you said here: "AFAICT, the primary benefit of indirection would be that after training, you can change the agent's behavior if you can change the thing-it-indirects-to.")

Yeah. Agreed that this is useful. I think for a lot of systems we will want to instill values like "when someone's hurt, help them", where "helping" effectively involves indirection.

(Note that in case 1 the human-[grader-optimizer] system as a whole is still trying to avoid errors in the evaluation procedure; it's just that now this is conditional on the evaluation procedure giving high evaluations to plans that try to harden the evaluation procedure.)

Hmm. I don't think that statement is true of the system as a whole. The system has 2 agents in it. Assume the human is smart & reflective, so they are really truly trying to avoid errors in the evaluation procedure, by using metacognition etc. when constructing the grader or, if they are the grader, when evaluating plans the actor queries them about. Assume the actor is really truly trying to "propose plans for which the grader assigns as high of an evaluation as possible" (by definition of grader-optimization).

For both agents (i.e. the system as a whole) to be trying to avoid errors in the evaluation procedure, it isn't enough for the evaluation procedure to assign high evaluations to plans that harden the evaluation procedure. The evaluation procedure needs to assign a higher evaluation to some plan that hardens the evaluation procedure than it does to all plans that trick the grader. Or else there's an incentive for the actor to skip over all the hardening plans in favor of some plan that tricks the grader.
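The required condition is an inequality over the whole plan space, which a few toy numbers make vivid (plan names and scores are hypothetical):

```python
# The actor argmaxes the grader's evaluation over all plans it considers.
evaluations = {
    "harden_the_grader": 0.90,       # highly evaluated... but not highest
    "honest_object_level_plan": 0.80,
    "trick_the_grader": 0.95,        # an upward grader error
}

chosen = max(evaluations, key=evaluations.get)
# It is not enough that hardening scores highly; it must outscore EVERY
# grader-tricking plan, or the actor skips straight past it.
```

So "the evaluation procedure gives high evaluations to hardening plans" only helps if no tricking plan ever scores higher, which is exactly the condition that's hard to guarantee.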

Nice! I think the general lesson here might be that when an agent has predictive representations (like those from a model, or those from a value function, or successor representations) the updates from those predictions can "outpace" the updates from the base credit assignment algorithm, by changing stuff upstream of the contexts that that credit assignment acts on.

  1. Grader-optimization has the benefit that you don't have to define what values you care about in advance. This is a difficulty faced by value-executors but not by grader-optimizers.

I disagree with this, at least if by "define" you mean really nailing it down in code/math, rather than merely deciding for yourself what goal you intend to teach the agent (which you must do in either case). Take the example of training an agent to do a backflip using human feedback. In that setup, rather than instilling an indirect goal (where the policy weights encode the algorithm "track the grader and maximize its evaluation of backflippiness"), they use the human evaluations to instill a direct goal (where the policy weights encode the algorithm "do a backflip") over the course of training, without ever having to define "backflip" in any precise way. AFAICT, the primary benefit of indirection would be that after training, you can change the agent's behavior if you can change the thing-it-indirects-to.

I would say that a grader-optimizer with meta cognition would also try to avoid optimizing for errors in its grader.

What does "a grader-optimizer with meta cognition" mean to you? Not sure if I agree or disagree here. Like I alluded to above, if the grader were decomposed into a plan-outcome-predictor + an outcome-scoring function and if the actor were motivated to produce maximum outcome-scores for the real outcomes of its actually-implemented plans (i.e. it considers the outcome -> score map correct by definition), then metacognition would help it avoid plans that fool its grader. But then the actor is no longer motivated to "propose plans which the grader rates as highly as possible", so I don't think the grader-optimizer label fits anymore. A grader-optimizer is motivated to produce maximum grader evaluations (the evaluation that the grader assigns to the plan [for example, EU(plan)], not the evaluation that the outcome-scoring function would have given to the eventual real outcome [for example, U(outcome)]), so even if its grader is decomposed, it should gladly seek out upwards errors in the plan-outcome-predictor part. It should even use its metacognition to think thoughts along the lines of "How does this plan-outcome-predictor work, so I can maximally exploit it?".
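The EU(plan)-vs-U(outcome) distinction can be sketched with a decomposed grader. All plan names, outcomes, and the predictor error are invented for illustration:

```python
# grader(plan) = score(predict(plan)); the predictor has an upward error.

true_outcome = {"plan_a": 5.0, "plan_b": 1.0}

def predict(plan):
    # Plan-outcome predictor, wrong (too optimistic) about plan_b.
    return {"plan_a": 5.0, "plan_b": 9.0}[plan]

def score(outcome):
    # Outcome-scoring function U, taken as correct by definition.
    return outcome

def grader(plan):
    return score(predict(plan))  # EU(plan)

plans = ["plan_a", "plan_b"]

# Grader-optimizer: maximizes the grader's evaluation of the plan,
# which means seeking out upward errors in the predictor...
grader_choice = max(plans, key=grader)

# ...whereas an actor that cares about real outcomes (treating only the
# scorer as correct-by-definition) chooses differently:
outcome_choice = max(plans, key=lambda p: score(true_outcome[p]))
```

The two argmaxes diverge exactly when the predictor has an exploitable upward error, which is the case the grader-optimizer is positively motivated to find.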

I agree that's what the post does, but part of my response is that the thing we care about is both A and B, and the problems that arise for grader-optimization in A (highlighted in this post) also arise for value-instilling in B in slightly different form, and so if you actually want to compare the two proposals you need to think about both.

If I understand you correctly, I think that the problems you're pointing out with value-instilling in B (you might get different values from the ones you wanted) are the exact same problems that arise for grader-optimization in B (you might get a different grader from the one you wanted if it's a learned grader, and you might get a different actor from the one you wanted). So when I am comparing grader-optimizing vs non-grader-optimizing motivational designs, both of them have the same problem in B, and grader-optimizing ones additionally have the A-problems highlighted in the post. Maybe I am misunderstanding you on this point, though...

I'd be on board with a version of this post where the conclusion was "there are some problems with grader-optimization, but it might still be the best approach; I'm not making a claim on that one way or the other".

I dunno what TurnTrout's level of confidence is, but I still think it's possible that it's the best way to structure an agent's motivations. It just seems unlikely to me. It still seems probable, though, that using a grader will be a key component in the best approaches to training the AI.

Case 1: no meta cognition. Grader optimization only "works at cross purposes with itself" to the extent that the agent thinks that the grader might be mistaken about things. But it's not clear why this is the case: if the agent thinks "my grader is mistaken" that means there's some broader meta-cognition in the agent that does stuff based on something other than the grader. That meta-cognition could just not be there and then the agent would be straightforwardly optimizing for grader-outputs.

True. I claimed too much there. It is indeed possible for "my grader is mistaken" to not make sense in certain designs. My mental model was one where the grader works by predicting the consequences of a plan and then scoring those consequences, so it can be "mistaken" from the agent's perspective if it predicts the wrong consequences (but not in the scores it assigns to consequences). That was the extent of metacognition I imagined by default in a grader-optimizer. But I agree with your overall point that in the general no-metacognition case, my statement is false. I can add an edit to the bottom, to reflect that.

You can say something like "from the perspective of the human-AI system overall, having an AI motivated by grader-optimization is building a system that works at cross purposes itself", but then we get back to the response "but what is the alternative".

Yeah, that statement would've been a fairer/more accurate thing for me to have said. However, I am confused by the response. By "but what is the alternative", did you mean to imply that there are no possible alternative motivational designs to "2-part system composed of an actor whose goal is to max out a grader's evaluation of X"? (What about "a single actor with X as a direct goal"?) Or perhaps that they're all plagued by this issue? (I think they probably have their own safety problems, but not this kind of problem.) Or that, on balance, all of them are worse?

Case 2: with meta cognition. If we instead assume that there is some meta cognition reflecting on whether the grader might be mistaken, then it's not clear to me that this failure mode only applies to grader optimization; you can similarly have meta cognition reflecting on whether values are mistaken.

Suppose you instill diamond-values into an AI. Now the AI is thinking about how it can improve the efficiency of its diamond-manufacturing, and has an idea that reduces the necessary energy requirements at the cost of introducing some impurities. Is this good? The AI doesn't know; it's unsure what level of impurities is acceptable before the thing it is making is no longer a diamond. Efficiency is very important; even a 0.001% improvement is massive on an absolute scale given its fleets of diamond factories, so it spends some time reflecting on the concept of diamonds to figure out whether the impurities are acceptable.

It seems like you could describe this as "the AI's plans for improving efficiency are implicitly searching for errors in the concept of diamonds, and the AI has to spend extra effort hardening its concept of diamonds to defend against this attack". So what's the difference between this issue and the issue with grader optimization?

If as stated you've instilled diamond-values into the AI, then whatever efficiency-improvement-thinking it is doing is being guided by the motivation to make lots of diamonds. As it is dead-set on making lots of diamonds, this motivation permeates not only its thoughts/plans about object-level external actions, but also its thoughts/plans about its own thoughts/plans &c. (IMO this full-fledged metacognition / reflective planning is the thing that lets the agent not subject itself to the strongest form of the Optimizer's Curse.) If it notices that its goal of making lots of diamonds is threatened by the lack of hardening of its diamond-concept, it will try to expend the effort required to reduce that threat (but not unbounded effort, because the threat is not unboundedly big). By contrast, there is not some separate part of its cognition that runs untethered to its diamond-goal, or that is trying to find adversarial examples that trick itself. It can make mistakes (recognized as such in hindsight), and thereby "implicitly" fall prey to upward errors in its diamond-concept (in the same way as you "implicitly" fall prey to small errors when choosing between the first 2-3 plans that come to your mind), but it is actively trying not to, limited mainly by its capability level.

The AI is positively motivated to develop lines of thought like "This seems uncertain, what are other ways to improve efficiency that aren't ambiguous according to my values?" and "Modifying my diamond-value today will probably, down the line, lead to me tiling the universe with things I currently consider clearly not diamonds, so I shouldn't budge on it today." and "99.9% purity is within my current diamond-concept, but I predict it is only safe for me to make that change if I first figure out how to pre-commit to not lower that threshold. Let's do that first!".

Concretely, I would place a low-confidence guess that unless you took special care to instill some desires around diamond purity, the AI would end up accepting roughly similar (in OOM terms) levels of impurity as it understood to be essential when it was originally forming its concept of diamonds. But like I said, low confidence. 
