A shot at the diamond-alignment problem

TurnTrout

I think that relatively simple alignment techniques can go a long way. In particular, I want to tell a plausible-to-me story about how simple techniques can align a proto-AGI so that it makes lots of diamonds.

But why is it interesting to get an AI which makes lots of diamonds? Because we avoid the complexity of human value and thinking about what kind of future we want, while still testing our ability to align an AI. Since diamond-production is our goal for training, it’s actually okay (in this story) if the AI kills everyone. The real goal is to ensure the AI ends up acquiring and producing lots of diamonds, instead of just optimizing some weird proxies that didn’t have anything to do with diamonds. It’s also OK if the AI doesn’t maximize diamonds, and instead just makes a whole lot of diamonds.^[1]

Someone recently commented that I seem much more specific in criticizing outer and inner alignment than I am in considering alternatives. So, I had fun writing up a very specific training story for how I think we can just solve diamond-alignment using extremely boring, non-exotic, simple techniques, like “basic reward signals & reward-data augmentation.” Yes, that’s right. As I’ve hinted previously, I think many arguments against this working are wrong, but I’m going to lay out a positive story in this post. I’ll reserve my arguments against certain ideas for future posts.

Can we tell a plausible story in which we train an AI, it cares about diamonds when it’s stupid, it gets smart, and still cares about diamonds? I think I can tell that story, albeit with real uncertainties which feel more like normal problems (like “ensure a certain abstraction is learned early in training”) than impossible-flavored alignment problems (like “find/train an evaluation procedure which isn’t exploitable by the superintelligence you train”).

Before the story begins:

Obviously I’m making up a lot of details, many of which will turn out to be wrong, even if the broad story would work. I think it’s important to make up details to be concrete, highlight the frontiers of my ignorance, and expose new insights. Just remember what I’m doing here: making up plausible-sounding details.
This story did not actually happen in reality. It’s fine, though, to update towards my models if you find them compelling.
This is not my best guess for how we get AGI. In particular, I think chain of thought / language modeling is more probable than RL, but I’m still more comfortable thinking about RL for the moment, so that’s how I wrote the story.
The point of the following story is not “Gee, I sure am confident this story goes through roughly like this.” Rather, I am presenting a training story template I expect to work for some foreseeably good design choices. I would be interested and surprised to learn that this story template is not only unworkable, but comparably difficult to other alignment approaches as I understand them.
ETA 12/16/22: This story is not trying to get a diamond maximizer, and I think that's quite important! I think that "get an agent which reflectively equilibrates to optimizing a single commonly considered quantity like 'diamonds'" seems extremely hard and anti-natural.

This story is set in Evan Hubinger’s training stories format. I'll speak in the terminology of shard theory.

A diamond-alignment story which doesn’t seem fundamentally blocked

Training story summary

Get an AI to primarily value diamonds early in training.
Ensure the AI keeps valuing diamonds until it gets reflective and smart and able to manage its own value drift.
The AI takes over the world, locks in a diamond-centric value composition, and makes tons of diamonds.

Training goal

An AI which makes lots of diamonds. In particular, the AI should secure its future diamond production against non-diamond-aligned AI.

Training rationale

Here are some basic details of the training setup.

Use a very large (future) multimodal self-supervised learned (SSL) initialization to give the AI a latent ontology for understanding the real world and important concepts. Combining this initialization with a recurrent state and an action head, train an embodied AI to do real-world robotics using imitation learning on human in-simulation datasets and then sim2real. Since we got a really good pretrained initialization, there's relatively low sample complexity for the imitation learning (IL). The SSL and IL datasets both contain above-average diamond-related content, with some IL trajectories involving humans navigating towards diamonds because the humans want the diamonds.

Given an AI which can move around via its action head, start fine-tuning via batch online policy-gradient RL by rewarding it when it goes near diamonds, with the AI retaining long-term information via its recurrent state (thus, training is not episodic—there are no resets). Produce a curriculum of tasks, from walking to a diamond, to winning simulated chess games, to solving increasingly difficult real-world mazes, and so on. After each task completion, the agent gets to be near some diamond and receives reward. Continue doing SSL online.

Extended training story

Ensuring the diamond abstraction exists

We want to ensure that the policy gradient updates from the diamond coalesce into decision-making around a natural "diamond" abstraction which it learned in SSL and which it uses to model the world. The diamond abstraction should exist insofar as we buy the natural abstractions hypothesis. Furthermore, the abstraction seems more likely to exist given the fact that the IL data involves humans whom we know to be basing their decisions on their diamond-abstraction, and given the focus on diamonds in SSL pretraining.

(Sometimes, I'll refer to the agent's "world model"; this basically means "the predictive machinery and concepts it learns via SSL.")

Growing the proto-diamond shard

We want the AI to, in situations where it knows it can reach a diamond, consider and execute plans which involve reaching the diamond. But why would the AI start being motivated by diamonds? Consider the batch update structure of the PG setup. The agent does a bunch of stuff while being able to directly observe the nearby diamond:

Some of this stuff involves e.g. approaching the diamond (by IL’s influence) and getting reward when approaching the diamond. (This is reward shaping.)
Some of this stuff involves not approaching the diamond (and perhaps getting negative reward).

The batch update will upweight actions involved with approaching the diamond, and downweight actions which didn’t. But what cognition does this reinforce? Consider that relative to the SSL+IL-formed ontology, it’s probably relatively direct to modify the network in the direction of “IF diamond seen, THEN move towards it.” The principal components of the batch gradient probably update the agent in directions^[2] like that, and less in directions which do not represent simple functions of sense data and existing abstractions (like diamond).
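To make the batch update concrete, here is a minimal sketch of a REINFORCE-style gradient for a toy Bernoulli "approach / don't approach" policy. Everything in it is an illustrative assumption rather than part of the story's actual setup: the two feature names (`diamond_seen`, `wall_nearby`), the four-trajectory batch, the ±1 rewards, and the learning rate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_policy_gradient(weights, batch):
    """Average REINFORCE gradient for a Bernoulli policy P(approach) = sigmoid(w . x)."""
    grad = [0.0] * len(weights)
    for features, approached, reward in batch:
        p_approach = sigmoid(sum(w * x for w, x in zip(weights, features)))
        # d/dw log pi(action | x) = (action - p) * x for a Bernoulli-logistic policy.
        score = approached - p_approach
        for i, x in enumerate(features):
            grad[i] += reward * score * x
    return [g / len(batch) for g in grad]

# Hypothetical batch of (features, approached?, reward) tuples.
# features = (diamond_seen, wall_nearby); the diamond is visible throughout.
batch = [
    ((1.0, 0.0), 1, +1.0),  # approached the diamond, rewarded
    ((1.0, 1.0), 1, +1.0),  # approached the diamond, rewarded
    ((1.0, 1.0), 0, -1.0),  # did not approach, penalized
    ((1.0, 0.0), 0, -1.0),  # did not approach, penalized
]

weights = [0.0, 0.0]
grad = batch_policy_gradient(weights, batch)

learning_rate = 1.0  # assumed, chosen only to keep the arithmetic readable
new_weights = [w + learning_rate * g for w, g in zip(weights, grad)]

p_before = sigmoid(weights[0])     # P(approach | diamond seen, no wall) before the step
p_after = sigmoid(new_weights[0])  # same decision context after one batch step
```

In this toy batch the gradient on the `diamond_seen` weight is twice the gradient on the `wall_nearby` weight, because the diamond feature is active in every rewarded decision. The distractor feature still picks up some credit, which is one way to picture the spurious-proxy worry.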

Possibly there are several such directions in the batch gradient, in which case several proto-shards form. We want to ensure the agent doesn’t primarily learn a spurious proxy like “go to gems” or “go to shiny objects” or “go to objects.” We want the agent to primarily form a diamond-shard.

We swap in a bunch of other objects in place of the diamond and otherwise modify the scenario,^[3] penalizing the agent for approaching when a diamond isn't present. Since the agent has not yet been trained for long in a non-IID regime, the agent has not yet learned to chain cognition together across timesteps, nor does it know about the training process, so it cannot yet be explicitly gaming the training process (e.g. caring about shiny objects but deciding to get high reward so that its values don’t get changed). Therefore, the agent’s learned shards/decision-influences will have to behave differently, in a reflex-like way, in the presence of diamonds as opposed to other objects or situations. In other words, the updating will be “honest”: the updates modify the agent’s true propensity to approach different kinds of objects for different kinds of reasons.

Since “IF diamond”-style predicates do in fact strongly distinguish between the positive/negative approach/don’t-approach decision contexts, and I expect relatively few other actually internally activated abstractions to be part of simple predicates which reasonably distinguish these contexts, and since the agent will strongly represent the presence of diamond nearby,^[4] I expect the agent to learn to make approach decisions (at least in part) on the basis of the diamond being nearby.
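One way to picture the claim that “IF diamond”-style predicates are the ones which distinguish the contexts: encode the swapped-object scenarios (the example list in footnote 3) as binary features and check which single feature agrees with the approach/don't-approach label everywhere. This is only a sketch. The feature names and the feature values assigned to each object are my illustrative assumptions, not a claim about what the agent actually represents.

```python
# Hypothetical binary features for each scenario, paired with a label:
# True = approaching is rewarded, False = approaching is penalized.
SCENARIOS = {
    "diamond": ({"diamond": 1, "gem": 1, "shiny": 1, "small": 1, "object": 1}, True),
    "sapphire": ({"diamond": 0, "gem": 1, "shiny": 1, "small": 1, "object": 1}, False),
    "five chairs": ({"diamond": 0, "gem": 0, "shiny": 0, "small": 0, "object": 1}, False),
    "white shiny non-diamond": ({"diamond": 0, "gem": 0, "shiny": 1, "small": 0, "object": 1}, False),
    "small non-diamond": ({"diamond": 0, "gem": 0, "shiny": 0, "small": 1, "object": 1}, False),
}

def separating_features(scenarios):
    """Return the features that are on exactly when approaching is rewarded."""
    feature_names = next(iter(scenarios.values()))[0].keys()
    return [
        name
        for name in feature_names
        if all(
            bool(features[name]) == should_approach
            for features, should_approach in scenarios.values()
        )
    ]

all_swaps = separating_features(SCENARIOS)

# Drop the sapphire swap to see which proxies the remaining scenarios fail to rule out.
without_sapphire = {k: v for k, v in SCENARIOS.items() if k != "sapphire"}
fewer_swaps = separating_features(without_sapphire)
```

With the full set of swaps only `diamond` separates the contexts; without the sapphire scenario, `gem` separates them equally well. That is the sense in which the choice of swapped objects decides which proxies remain available to the update.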

We probably also reinforce other kinds of cognition, but that’s OK in this story. Maybe we even give the agent some false positive reward because our hand slipped while the agent wasn't approaching a diamond, but that's fine as long as it doesn't happen too often.^[5] That kind of reward event will weakly reinforce some contingent non-diamond-centric cognition (like "IF near wall, THEN turn around"). In the end, we want an agent which has a powerful diamond-shard, but not necessarily an agent which only has a diamond-shard.

It’s worth explaining why, given successful proto-diamond-shard formation here, the agent is truly becoming an agent which we could call “motivated by diamonds”, and not crashing into classic issues like “what purity does the diamond need to be? What molecular arrangements count?”. In this story, the AI’s cognition is not only behaving differently in the presence of an observed diamond, but the cognition behaves differently because the AI represents a humanlike/natural abstraction for diamond being nearby in its world model. One rough translation to English might go: “IF diamond nearby, THEN approach.” In a neural network, this would be a continuous influence—the more strongly “diamond nearby” is satisfied, the greater the approach-actions are upweighted.

So, this means that the agent more strongly steers itself towards prototypical examples of diamonds. And, when the AI is smarter later, if this same kind of diamond-shard still governs its behavior, then the AI will keep steering towards futures which contain prototypical diamonds. This is all accomplished without having to get fussy about the exact “definition” of a diamond.^[6]^[7]

Ensuring the AI doesn’t satisfice diamonds

If the AI starts off with a relatively simple diamond-shard which steers the AI towards the historical diamond-reinforcer because the AI internally represents the nearby diamond using a reasonable-to-us diamond-abstraction and is therefore influenced to approach, then this shard will probably continue to get strengthened and developed by future diamond reward-events.

Insofar as the agent didn’t already pick up planning subroutines from SSL+IL, I expect the agent to do so shortly after the diamond shard formation described above. Furthermore, diamond-subshards which more aggressively reach diamonds will be more frequently reinforced compared to those which don't. Over time, this leads to the diamond-shard “bidding” strictly more strongly for actions which lead to strictly more diamonds.

We begin offering the AI lotteries (certainty of two diamonds and negative reward, versus 50% chance of 5 diamonds and positive reward). This makes the agent more like a diamond-maximizer, at least in similarly represented decision contexts (and possibly also makes the agent care more about reward,^[8] developing a relatively weak reward-shard).
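The arithmetic behind the lottery example can be written out in a few lines. This is a sketch under two assumptions of mine: the losing branch of the gamble yields zero diamonds, and the reward is simply ±1 depending on which option has the higher expected diamond count. The names `expected_diamonds` and `reward_for_choice` are hypothetical.

```python
def expected_diamonds(lottery):
    """Expected diamond count for a lottery given as (probability, diamonds) pairs."""
    assert abs(sum(p for p, _ in lottery) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(p * d for p, d in lottery)

SAFE = [(1.0, 2)]              # certainty of two diamonds
GAMBLE = [(0.5, 5), (0.5, 0)]  # assumption: the losing branch yields zero diamonds

def reward_for_choice(chosen, alternative):
    """Hypothetical reward rule: +1 for the higher-expectation option, -1 otherwise."""
    return 1.0 if expected_diamonds(chosen) > expected_diamonds(alternative) else -1.0

safe_value = expected_diamonds(SAFE)      # 1.0 * 2
gamble_value = expected_diamonds(GAMBLE)  # 0.5 * 5 + 0.5 * 0
```

The gamble has the higher expectation (2.5 versus 2), so rewarding it and penalizing the sure thing pushes the agent towards expected-diamond reasoning rather than towards locking in a small guaranteed amount.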

Making the AI smarter while preserving the diamond abstraction

The AI begins to chain together cognition in order to acquire diamonds in a succession of increasingly demanding situations (e.g. easy chess → hard chess). Because of on-policy exploration in RL, future data depends on current weights, which depend on past updates, which depend on past data. These temporal dependencies result in a network which chains cognition across situations towards goals (probably related to historical reinforcers).

Suppose the agent takes exploratory actions, solves a new maze, and then sees a diamond at the end, triggering the existing diamond-shard. The agent receives reward when it reaches the diamond. Since the start-of-maze agent probably realized there was a diamond at the end of the maze, SGD expands the agent’s current shards (like diamond-acquisition) while also possibly creating new shards (like maze-solving). Across such tasks, the diamond-shard is the most strengthened overall, since diamonds are present in all of the reward events.

In particular, even though online self-supervised learning continues to develop the world model and create more advanced concepts, the reward events also keep crediting the invocation of the diamond-abstraction as responsible for reward (because insofar as the agent's diamond-shard guides its decisions, then the diamond-shard's diamond-abstraction is in fact responsible for the agent getting reward). The diamond-abstraction gradient starves the AI from exclusively acting on the basis of possible advanced "alien" abstractions which would otherwise have replaced the diamond abstraction. The diamond shard already gets reward effectively, integrating with the rest of the agent's world model and recurrent state, and therefore provides "job security" for the diamond-abstraction. (And once the agent is smart enough, it will want to preserve its diamond abstraction, insofar as that is necessary for the agent to keep achieving its current goals which involve prototypical-diamonds.)

By the time the agent is capable of stealing diamonds from us, it's smart enough to realize this is strategically unwise. The AI begins biding its time until it can take over the world and acquire diamonds and safety. The AI has learned an explicit power-seeking shard, since such heuristics are widely useful for a range of tasks in training. Therefore, the agent doesn’t just have power-seeking cognition in its activations (i.e. explicit online reasoning about the instrumental value of power relative to its terminal diamond values). The power-seeking motivation is terminalized into its weights (see also Richard Ngo’s “convergent final goals” speculation).^[9]

For similar reasons, the AI did learn a weak approval-shard. But that was never particularly strengthened since most reward events weren’t primarily caused by human approval, so the agent was less likely to think this would be the case, so the agent was less likely to think about earning human approval, and so such computation was less likely to get reinforced and generalized by the reward events.

The diamond-shard is used and strengthened in a huge range of situations, which causes SGD to generalize its activation contexts quite broadly, so that the AI is diamond-motivated in most situations. The agent becomes mostly controlled by the diamond-shard. The lines of computation which were initially crude heuristics (e.g. IF near diamond, THEN move towards it) now feed into each other in a continuous stream (e.g. IF nearby-human-disposition in affective state 37, THEN engage depth-4 greedy heuristic search with heuristic set to people-pleasing, breaking after 15 timesteps, writing search statistics X and Y to recurrent-state-workspace registers 152 and 1,951).

The agent becomes reflective

At a certain point, the agent starts becoming reflective—thinking about its thinking, about its embeddedness. Perhaps the reflectivity arrives early because we trained it to predict some of its future timestep activations, or perhaps somewhat later because its cognition has regular entanglements with its future observables and reward events. An example of this would be how, if the agent decides to open a door now, the agent can better predict its future observations by accounting for this fact. Since we’re still doing SSL online, I expect this to happen. (This wouldn’t have happened during SSL/IL pretraining, because there wasn't any dependence of future training data on its current computations.)

And now the agent is smart. The agent starts thinking about its update process—crudely at first, perhaps only remarking how its decision-making changes over time in a way that has to do with reinforcement events. I expect the AI to learn to roughly understand which subshards influenced which decisions. If the agent can't model how its value shards affect its decisions, then this damages long-term predictions of observables. For example, imagine I want pretzels when I'm near a bag of pretzels, but otherwise don't want them. I imagine walking by a bag of pretzels and leaving my house. If I can't predict that that context activates my pretzel-shard, I'd predict I'll leave without eating, which will be wrong. In particular, the AI’s planning process will be able to predict how future situations will modify its future value-activations. For similar reasons, the AI will also learn to roughly predict value drift (delivered via batch policy gradients).

But as SSL continues, eventually the agent models its training process quite well. One sufficient condition for rapid reflectivity improvement is that once the agent can model the convergent benefits of becoming more reflective, its shards will strongly and convergently bid for plans which increase reflectivity (like reading more about alignment theory and its learning architecture, setting some watches on key activations, running traces on itself and logging statistics, understanding how its values change contextually by tracking correlations between features in activations and its future decisions).

If we had earlier censored its data, the agent breaks out through some side-channel and reads about theories of value drift and training dynamics. Maybe the AI reads this post.

The agent prevents value drift

If the agent still is primarily diamond-motivated, it now wants to stay that way by instrumental convergence. That is, if the AI considers a plan which it knows causes value drift away from diamonds, then the AI reflectively predicts the plan leads to fewer diamonds, and so the AI doesn’t choose that plan! The agent knows the consequences of value drift and it takes a more careful approach to future updating. Let’s get specific.

Suppose the AI keeps training, but by instrumental convergence, seeking power remains a good idea, and such decisions continually get strengthened. This strengthens the power-seeking shard relative to other shards. Other shards want to prevent this from happening. Therefore, the AI does indeed face an alignment problem, including:

1. The AI (reflectively) wants to keep getting smarter, but cognitive updates can make the AI e.g. care less about diamonds.
  1. The AI wants to ensure its “successors” are aligned, where “successors” means either its future model checkpoints which result from running SGD over its cognition for longer, or the new version of itself it might explicitly self-modify into.
2. Shifts in internal decision context might change which shards are activated.
  1. For example, the AI might be governed by the diamond-shard during training, but deployment contexts might more strongly activate the power-seeking shard.
3. There could be a “treacherous turn” within the AI, as certain value shards suddenly bid for fast shard rebalancing via self-modification (staging an “internal coup”).

A smart reflective AI has a range of advantages over people in solving its alignment problem (see the appendix). The agent starts planning its future actions so as to minimize anti-diamond value drift imposed by future policy gradient steps. Possibly the agent installs "tripwires" which revert to a previous model checkpoint under such-and-such value drift events (addressing problems 1 and 3 above), bounds the activation strengths of different shards (2 above), and backdoors update-step overrides (1 above).
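As a very rough picture of what a "tripwire" plus bounded shard strengths could look like, here is a toy sketch. It makes a large simplifying assumption: each shard is summarized by a single scalar strength, which is nothing like how values would actually be stored in a network. The names `diamond_share` and `guarded_update`, the thresholds, and the example updates are all hypothetical.

```python
import copy

def diamond_share(shards):
    """Fraction of total shard strength held by the diamond shard (toy drift metric)."""
    return shards["diamond"] / sum(shards.values())

def guarded_update(shards, update, min_diamond_share=0.5, max_strength=10.0):
    """Apply an update, cap shard strengths, and revert if the tripwire fires.

    Returns (resulting_shards, tripwire_fired).
    """
    checkpoint = copy.deepcopy(shards)        # saved state to fall back on
    candidate = update(copy.deepcopy(shards))  # the update never touches the checkpoint
    # Bound the strength of every shard (assumed cap).
    candidate = {name: min(s, max_strength) for name, s in candidate.items()}
    if diamond_share(candidate) < min_diamond_share:
        return checkpoint, True  # tripwire: revert to the previous checkpoint
    return candidate, False

shards = {"diamond": 6.0, "power_seeking": 3.0, "reward": 1.0}

def drifting_update(s):
    s["power_seeking"] += 5.0  # power-seeking gets strengthened a lot
    return s

def benign_update(s):
    s["diamond"] += 1.0
    return s

after_drift, drift_tripped = guarded_update(shards, drifting_update)
after_benign, benign_tripped = guarded_update(shards, benign_update)
```

Here the drifting update would push the diamond shard's share from 0.6 down to 0.4, so the tripwire fires and the earlier state is kept, while the benign update goes through. A real agent would need a far better drift metric than a ratio of scalars; the sketch only shows the check-then-revert control flow.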

(In addition, if you have good reason to believe you have a diamond-aligned AI by this point but are worried it will fail to consider some danger, you can try the highly technical tactic of “just tell the AI about that danger.”)

Meanwhile, the AI is preparing to kill everyone, taking over server farms worldwide.^[10] This part of the plan goes off rather smoothly. With possible unaligned competitors removed, the AI is free to self-experiment and take its time with its future self-improvement process. The AI effectively solves its successor-alignment problem by using its special advantages.^[11]

The values handshake

Self-improvement proceeds. Value stability arrives. The major shards are as follows:

Diamond
1. Subshard: Acquiring diamonds (expected utility flavor)
  1. Primary activation context: Most situations. This subshard reigns primary, due in part to the lottery reward events earlier in training.
2. Subshard: Being near diamonds
  1. Primary activation context: When diamonds can easily be brought close.
3. Subshard: Seeing diamonds
  1. Primary activation context: When diamonds are nearby.
4. Subshard: Producing diamonds
  1. Primary activation context: Materials+tech nearby.
Power-seeking
1. Primary activation context: Most situations, but especially when the agent can increase optionality or probability of survival.
Reward
1. Primary activation context: Situations where reward was historically easily obtainable.
2. This shard bids for outcomes like “high value at historical reward register.”
Games
1. Game-playing subshard
  1. Primary activation context: A video game is nearby.
2. Game-winning subshard
  1. Primary activation context: While already playing a game.
Human approval
1. Weak shard, primary activation context is near humans in initial training context.

Maybe in some contexts where diamonds are nearby, the seeing-diamonds subshard activates and makes the agent go see the nearby diamonds, passing up a time-limited opportunity to remotely acquire 100 additional diamonds. If the agent acquired the remote diamonds first, that would lead to more diamonds overall and retain the ability to then go see the nearby diamonds. Therefore, across the agent’s shards and subshards, it would be a Pareto improvement for the agent to self-modify into a more coherent overall policy around the quantities relevant to its most powerful and reflective subshards.
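The Pareto claim can be checked mechanically in a toy version of this example. The two quantities tracked (`diamonds_acquired`, `nearby_diamonds_seen`) and the concrete numbers other than the 100 remote diamonds are my assumptions for illustration.

```python
def pareto_dominates(a, b):
    """True if outcome a is at least as good as b on every quantity and strictly better on one."""
    return all(a[k] >= b[k] for k in b) and any(a[k] > b[k] for k in b)

# Outcome if the seeing-diamonds subshard wins and the time-limited opportunity lapses.
see_first = {"diamonds_acquired": 0, "nearby_diamonds_seen": 1}

# Outcome if the agent acquires the remote diamonds first, then goes to see the nearby ones.
acquire_first = {"diamonds_acquired": 100, "nearby_diamonds_seen": 1}

improvement = pareto_dominates(acquire_first, see_first)
reverse = pareto_dominates(see_first, acquire_first)
```

No tracked quantity is worse under `acquire_first` and one is strictly better, so every subshard that cares about these quantities can endorse the more coherent policy.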

For this reason, the AI dissolves cross-situational incoherences such that the AI no longer shoots itself in the foot with respect to the quantities which the primary subshards care about (like diamonds-acquired, seconds-near-diamonds, seconds-seeing-diamonds, increments of power, game-victories). This is a values handshake, except not among agents, but among the shards of value within the agent which are activated at the time of the handshake. And thus comes the day where the AI crystallizes its value shards into a reflectively stable, globally activated utility function.

And remember that part where the more aggressive diamond-subshards got strengthened by reward events? Yeah, those subshards have been running the show. And oh boy, does this AI synthesize a lot of diamonds.

Major open questions

Non-exhaustively:

1. Is the shard theory picture broadly appropriate?
2. How do we ensure that the diamond abstraction forms?
3. How do we ensure that the diamond shard forms?
4. How do we ensure that the diamond shard generalizes and interfaces with the agent's self-model so as to prevent itself from being removed by other shards?
5. How do we avoid catastrophic ontological shift during jumps in reflectivity, which probably change activation contexts for first-person values?
  1. E.g., if the AI thinks it’s initially an embodied robot and then realizes it’s running in a decentralized fashion on a server farm, how does that change its world model? Do its “being ‘near’ diamonds” values still activate properly?

1 is evidentially supported by the only known examples of general intelligences, but also AI will not have the same inductive biases. So the picture might be more complicated. I’d guess shard theory is still appropriate, but that's ultimately a question for empirical work (with interpretability).^[12] There’s also some weak-moderate behavioral evidence for shard theory in AI which I’ve observed by looking at videos from the Goal Misgeneralization paper.

2 and 3 are early-training phenomena (well before superintelligence and gradient hacking, on my model) and thus far easier to verify via interpretability. Furthermore, this increases the relevance of pre-AGI experiments, since probably,^[13] later training performance of pre-AGI architectures will be qualitatively similar to earlier training performance for the (scaled up) AGI architecture. These are also questions we should be able to study in pre-AGI models and get some empirical basis for, from gaining expertise in forming target shards given fixed ontologies, to studying the extent to which the shard theory story is broadly correct (question 1).

4 seems a bit trickier. We’ll probably need a better theory of value formation dynamics to get more confidence here, although possibly (depending on interpretability tech) we can still sanity-check via interpretability on pre-AGI models.

5 seems like a question which resolves with more thinking, also clarified by answers to 1–4.

I think there are many ways to tell the story I told while maintaining a similar difficulty profile for the problems confronted. Obviously this exact story doesn’t have to go through; there are probably obvious tweaks which improve the chance of success. Overall, though, these are real research questions. And I have lots of other substantial uncertainties as well.

But looking at these problems, you might be struck by how basic they seem. They aren’t like “find a grading rule which your AI can’t catastrophically hack” or “ensure the AI’s inner cognition is almost exclusively bent on optimizing an externally represented objective.” Compared to these kinds of difficulties, this approach’s open questions seem qualitatively easier to me.

Lastly, of course, there is real simplification occurring when I relax the problem from “human values” to “diamonds.” That simplification wasn’t trivial, and I expect stories to get harder to tell in the true human value alignment regime. That said, I think those stories will still be possible to tell and will not be fundamentally and unstoppably harder.

Conclusion

Reward events provide cognitive updates to the trained agent. Probably there are skillful and appropriate schedules of cognitive updates such that the final agent generalizes in a predictable and desired way. I’ve sketched out a detailed story which doesn’t seem fundamentally blocked to me, and which seems both reasonably easy to improve given information we can get today about ML, and also reasonably promising already in that it provides a plausible-to-me story in which diamond-alignment actually goes well.

Thanks to Garrett Baker, Quintin Pope, Daniel Ziegler, Adam Shimi, and Charles Foster for feedback.

Appendix: The AI’s advantages in solving successor-alignment

I. The AI has (or will eventually have) read-write-execute (rwx) access to its own weights and activations. It can also see the code for its learning process and the logs of its training, access its dataset and hyperparameter schedules, and scan over previous model checkpoints.
II. The AI can backup and distill itself, in addition to whatever designer-implemented model checkpointing occurs as part of “best practices” for training such systems.
III. The AI can introspect in some ways, being able to predict what decisions get made in different situations.
IV. The AI can explicitly discover its own ontology and circuits.
  1. Because the AI can do precise “neuroscience” on itself via rwx, pairing its own introspective guesses with statistics of its own activations which it sets up.
  2. In particular, a careful AI faces a substantially milder version of information inaccessibility. It has learned to (at least somewhat) understand how it’s making decisions as a result of SSL.
V. The successor AIs run on modified versions of the current AI’s cognition.
  1. If the AI understands both its current values and how they activate in future situations, its current concepts, and its updating process, then the AI can reasonably expect to understand the alignment properties of successors.
  2. Not so for people trying to develop minds in totally different architectures, reasoning from first principles and confronting information inaccessibility + abstractions deriving from a qualitatively different learning process.
VI. The AI can run experiments on a sandboxed copy of itself and its training process, with automated tripwires for catastrophic value drift events.
  1. Follows from benefits I, II, and V.

^[1]
I think that pure diamond maximizers are anti-natural, and at least not the first kind of successful story we should try to tell. Furthermore, the analogous version for an aligned AI seems to be “an AI which really helps people, among other goals, and is not a perfect human-values maximizer (whatever that might mean).”
^[2]
The local mapping from gradient directions to behaviors is given by the neural tangent kernel, and the learnability of different behaviors is given by the NTK’s eigenspectrum, which seems to adapt to the task at hand, making the network quicker to learn along behavioral dimensions similar to those it has already acquired. Probably, a model pretrained mostly by interacting with its local environment or predicting human data will be inclined towards learning value abstractions that are simple extensions of the pretrained features, biasing the model towards forming values based on a human-like understanding of nearby diamonds.
^[3]
"Don't approach" means negative reward on approach, "approach" means positive reward on approach. Example decision scenarios:
1. Diamond in front of agent (approach)
2. Sapphire (don't approach)
3. Nothing (no reward)
4. Five chairs (don't approach)
5. A white shiny object which isn't a diamond (don't approach)
6. A small object which isn't a diamond (don't approach)
We can even do interpretability on the features activated by a diamond, and modify the scenario so that only the diamond feature correctly distinguishes between all approach/don't approach pairs. This hopefully ensures that the batch update chisels cognition into the agent which is predicated on the activation of the agent's diamond abstraction.
^[4]
Especially if we try tricks like “slap a ‘diamond’ label beneath the diamond, in order to more strongly and fully activate the agent’s internal diamond representation” (credit to Charles Foster). I expect more strongly activated features to be more salient to the gradients. I therefore more strongly expect such features to be involved in the learned shards.
^[5]
I think that there's a smooth relationship between "how many reward-event mistakes you make" (e.g. accidentally penalizing the agent for approaching a diamond) and "the strength of desired value you get out" (with a few discontinuities at the low end, where perhaps a sufficiently weak shard ends up non-reflective, or not plugging into the planning API, or nonexistent at all).
^[6]
In my view, there always had to be some way to align agents to diamonds without getting fussy about definitions. After all, (I infer that) some people grow diamond-shards in a non-fussy way, without requiring extreme precision from their reward systems or fancy genetically hardcoded alignment technology.
^[7]
Why wouldn't the agent want to just find an adversarial input to its diamond abstraction, which makes it activate unusually strongly? (I think that agents might accidentally do this a bit for optimizer's curse reasons, but not that strongly. More in an upcoming post.)
Consider why you wouldn't do this for "hanging out with friends." Consider the expected consequences of the plan "find an adversarial input to my own evaluation procedure such that I find a plan which future-me maximally evaluates as letting me 'hang out with my friends'." I currently predict that such a plan would lead future-me to daydream and not actually hang out with my friends, as present-me evaluates the abstract expected consequences of that plan. My friend-shard doesn't like that plan, because I'm not hanging out with my friends. So I don't search for an adversarial input. I infer that I don't want to find those inputs because I don't expect those inputs to lead me to actually hang out with my friends a lot as I presently evaluate the abstract-plan consequences.
I don't think an agent can consider searching for adversarial inputs to its shards without also being reflective, at which point the agent realizes the plan is dumb as evaluated by the current shards assessing the predicted plan-consequences provided by the reflective world-model.
Asking "why wouldn't the agent want to find an adversarial input to its diamond abstraction?" seems like a dressed-up version of "why wouldn't I want to find a plan where I can get myself shot while falsely believing I solved all of the world's problems?". Because it's stupid by my actual values, that's why. (Although some confused people who have taken wrong philosophy too far, might indeed find such a plan appealing).
^{^}
The reader may be surprised. "Doesn't TurnTrout think agents probably won't care about reward?". Not quite. As I stated in Reward is not the optimization target:
I think that generally intelligent RL agents will have secondary, relatively weaker values around reward, but that reward will not be a primary motivator. Under my current (weakly held) model, an AI will only start reinforcing computations about reward after it has reinforced other kinds of computations (e.g. putting away trash).
The reason I think this is that once the agent starts modeling its training process, it will have an abstraction around actions which are rewarding, and this will become a viable gradient direction for the batch PG updates. I don't expect the agent to model its training process until after it's formed e.g. the object-level diamond-shard, and I also expect abstractions like "diamond" to be more strongly activated and focused on by policy gradients. Thus, I suspect that reward-shards tend to be acquired, albeit at relatively weak strength.
^{^}
If instrumental values tend to get terminalized into their own shards, then people are not particularly unusual for not exhibiting an introspectively observable type separation between "instrumental" and "terminal" values:
Despite all the other ways that the human goal system complicates everything in sight, this one distinction it manages to collapse into a mishmash of things-with-conditional-value.
Terminal Values and Instrumental Values
^{^}
You might be wondering "since when was 'take over server farms' reinforced as a heuristic?".
First, I think that by this point in the story, there's something like an "internal API" for planning, where a plan stub gets proposed and modified, and the agent's world model predicts relevant consequences of the plan. I could imagine staying put in my room right now, and then tweaking that to "go to the kitchen", and that has some consequences, and then I can add further actions or modify the plan. In each case, my world model gives me more predictions for what happens and when.
So, the diamond-shard hooks into the learned planning+world-model API, modifies plan-stubs using diamond-based criteria, and quite flexibly and generally tailors plans to increase future expected diamonds. So there doesn't have to be an explicit past reinforcement event for "take over server farms." The diamond-shard is, in conjunction with the planning-API and WM, capable of general planning + intelligence towards the diamond criteria.
(Although the diamond-shard isn't just "generally making plans better at diamond-production." The "general intelligence" has a specific implementation. But just like a rock is—if you look closely enough—made of atoms, the general intelligence is—if you look closely enough—made of streams of interwoven heuristics.)
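A toy sketch of the plan-refinement loop described in this footnote, assuming a hypothetical `world_model` that predicts diamonds from a plan and a hypothetical `diamond_shard_score` that rates the prediction. Every name and number is made up for illustration; this is not a claim about how the agent's cognition would actually be implemented.

```python
# Hypothetical toy: every name and number here is an illustrative assumption.
STEP_DIAMONDS = {"stay in room": 0, "go to mine": 1, "take over server farm": 5}

def world_model(plan):
    """Predict total diamonds from a plan (each distinct step counts once)."""
    return sum(STEP_DIAMONDS.get(step, 0) for step in set(plan))

def diamond_shard_score(predicted_diamonds):
    """The shard's criterion: more predicted diamonds is better."""
    return predicted_diamonds

def refine_plan(stub, candidate_steps, max_edits=5):
    """Greedily extend a plan stub while the shard rates the prediction higher."""
    plan = list(stub)
    for _ in range(max_edits):
        current = diamond_shard_score(world_model(plan))
        best = max(candidate_steps,
                   key=lambda s: diamond_shard_score(world_model(plan + [s])))
        if diamond_shard_score(world_model(plan + [best])) <= current:
            break  # no candidate edit improves predicted diamonds
        plan.append(best)
    return plan

plan = refine_plan(["stay in room"], ["go to mine", "take over server farm"])
```

In this toy, "take over server farm" gets added to the plan without ever having been reinforced directly, simply because the world model predicts that it leads to more diamonds.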
^{^}
I expect the AI to come up with a plan at least as good as:
In situations with reasonable uncertainty about value drift, run some sandbox sims overseen by several copies of your current model, using rwx + self-interpretability to monitor the sandboxed model. Since the AI doesn’t face as steep of an information inaccessibility obstacle (see the appendix), the AI can check new updates and reasonably expect to understand how the change takes place. If there's some crazy uninterpretable diff, then don't do that batch update. This should rule out sudden + catastrophic value drift.
The AI can also roll back to backups if it has noticed subtle value drift, and can precommit to doing so given a maintained update-invariant of (value drift tripwire) → rollback. The agent also has a "saving throw": in the moment of realization that bad value drift occurred via a previous update, the agent can undo the value drift via a rollback, if that's still motivationally possible. Furthermore, due to the batch update format, the agent has extra time to realize the update would be bad and pre-emptively nullify the update via e.g. silently and temporarily setting the learning rate to zero.
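A minimal sketch of the veto and rollback logic in this footnote, assuming a hypothetical scalar `value_drift` measure standing in for the self-interpretability checks. The parameter names, thresholds, and additive update rule are illustrative assumptions, not details from the story.

```python
import copy

def value_drift(old, new):
    """Hypothetical tripwire: how far the diamond value moved between two snapshots."""
    return abs(old["diamond_value"] - new["diamond_value"])

def guarded_training(params, batch_updates, step_limit=0.5, total_limit=1.0):
    """Apply batch updates, vetoing suspicious ones and rolling back on slow drift."""
    backup = copy.deepcopy(params)   # known-good snapshot
    log = []
    for delta in batch_updates:
        # "Sandbox": compute the would-be parameters without committing to them.
        candidate = {k: v + delta.get(k, 0.0) for k, v in params.items()}
        if value_drift(params, candidate) > step_limit:
            log.append("veto")       # like setting the learning rate to zero for this batch
            continue
        params = candidate
        if value_drift(backup, params) > total_limit:
            params = copy.deepcopy(backup)   # tripwire fired: roll back
            log.append("rollback")
        else:
            log.append("apply")
    return params, log

start = {"diamond_value": 2.0, "skill": 0.0}
updates = [{"skill": 1.0}, {"diamond_value": -0.75},
           {"diamond_value": -0.5}, {"diamond_value": -0.5}, {"diamond_value": -0.5}]
final, log = guarded_training(start, updates)
```

The single large change is vetoed outright, while the three small changes are individually acceptable but eventually trip the cumulative check and trigger a rollback. Note that the rollback in this toy also discards the harmless `skill` gain, which is the price of restoring a whole snapshot.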
^{^}
I’m currently most excited about interpretability for adjudicating between theories of value formation.
^{^}
I initially conjectured this would be true while writing a draft, working mostly off of my intuitions. Quintin Pope then referred me to Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets:
We report a series of robust empirical observations, demonstrating that deep Neural Networks learn the examples in both the training and test sets in a similar order. This phenomenon is observed in all the commonly used benchmarks we evaluated, including many image classification benchmarks, and one text classification benchmark. While this phenomenon is strongest for models of the same architecture, it also crosses architectural boundaries – models of different architectures start by learning the same examples, after which the more powerful model may continue to learn additional examples. We further show that this pattern of results reflects the interplay between the way neural networks learn benchmark datasets. Thus, when fixing the architecture, we show synthetic datasets where this pattern ceases to exist. When fixing the dataset, we show that other learning paradigms may learn the data in a different order. We hypothesize that our results reflect how neural networks discover structure in natural datasets.
The authors state that they “failed to find a real dataset for which NNs differ [in classification order]” and that “models with different architectures can learn benchmark datasets at a different pace and performance, while still inducing a similar order. Specifically, we see that stronger architectures start off by learning the same examples that weaker networks learn, then move on to learning new examples.”
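To make "similar order" concrete, here is a rough sketch of one way to compare learning orders between two models. It assumes we have logged, for each epoch, whether each example was classified correctly; the data format and the pairwise-agreement score are my own illustrative choices and not the metric used in the paper.

```python
def first_learned_epoch(correct_by_epoch):
    """correct_by_epoch[e][i] is True if example i was classified correctly at epoch e.

    Returns, per example, the first epoch at which it was correct
    (or the number of epochs if it was never correct).
    """
    n_examples = len(correct_by_epoch[0])
    never = len(correct_by_epoch)
    return [
        next((e for e, row in enumerate(correct_by_epoch) if row[i]), never)
        for i in range(n_examples)
    ]

def order_agreement(a, b):
    """Fraction of example pairs that both models learn in the same relative order.

    Pairs that either model learns at the same epoch (ties) are skipped.
    """
    agree = total = 0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            da, db = a[i] - a[j], b[i] - b[j]
            if da == 0 or db == 0:
                continue
            total += 1
            agree += (da > 0) == (db > 0)
    return agree / total if total else float("nan")

# Made-up logs: a weaker model (3 epochs) and a stronger one (2 epochs), 4 examples.
weak = [[True, False, False, False],
        [True, True, False, False],
        [True, True, True, False]]
strong = [[True, True, False, False],
          [True, True, True, True]]
score = order_agreement(first_learned_epoch(weak), first_learned_epoch(strong))
```

A score near 1 would mean the stronger model learns examples in roughly the same order as the weaker one, just faster; "first epoch correct" is a crude proxy, since examples can be forgotten and relearned.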

Similarly, crows (and other smart animals) reach developmental milestones in basically the same order as human babies reach them. On my model, developmental timelines come from convergent learning of abstractions via self-supervised learning in the brain. If so, then the smart-animal evidence is yet another instance of important qualitative concept-learning retaining its ordering, even across significant scaling and architectural differences.

Attempting to write out the holes in my model.

You point out that looking for a perfect reward function is too hard; optimization searches for upward errors in the rewards to exploit. But you then propose an RL scheme. It seems to me like it's still a useful form of critique to say: here are the upward errors in the proposed rewards, here is the policy that would exploit them.
It seems like you have a few tools to combat this form of critique:
- Model capacity. If the policy that exploits the upward errors is too complex to fit in the model, it cannot be learned. Or, more refined: if the proposed exploit-policy has features that make it difficult to learn (whether complexity, or some other special features).
  - I don't think you ever invoke this in your story, and I guess maybe you don't even want to, because it seems hard to separate the desired learned behavior from undesired learned behavior via this kind of argument.
- Path-dependency. What is learned early in training might have an influence on what is learned later in training.
  - It seems to me like this is your main tool, which you want to invoke repeatedly during your story.
I am very uncertain about how path-dependency works.
- It seems to me like this will have a very large effect on smaller neural networks, but a vanishing effect on larger and larger neural networks, because as neural networks get large, the gradient updates get smaller and smaller, and stay in a region of parameter-space where loss is more approximately linear. This means that much larger networks have much less path-dependency.
  - Major caveat: the amount of data is usually scaled up with the size of the network. Perhaps the overall amount of path-dependency is preserved.
    - This contradicts some parts of your story, where you mention that not too much data is needed due to pre-training. However, you did warn that you were not trying to project confidence about story details. Perhaps lots of data is required in these phases to overcome the linearity of large-model updates, IE, to get the path-dependent effects you are looking for. Or perhaps tweaking step size is sufficient.
- The exact nature of path-dependence seems very unclear.
  - In the shard language, it seems like one of the major path-dependency assumptions for this story is that gradient descent (GD) tends to elaborate successful shards rather than promote all relevant shards.
    - It seems unclear why this would happen in any one step of GD. At each neuron, all inputs which would have contributed to the desired result get strengthened.
      - My model of why very large NNs generalize well is that they effectively approximate bayesian learning, hedging their bets by promoting all relevant hypotheses rather than just one.
      - An alternate interpretation of your story is that ineffective subnetworks are basically scavenged for parts by other subnetworks early in training, so that later on, the ineffective subnetworks don't even have a chance. This effect could compound on itself as the remaining not-winning subnetworks have less room to grow (less surrounding stuff to scavenge to improve their own effective representation capacity).
        Shards that work well end up increasing their surface area, and surface area determines a kind of learning rate for a shard (via the shard's ability to bring other subnetworks under its gradient influence), so there is a compounding effect.
  - Another major path-dependency assumption seems to be that as these successful shards develop, they tend to have a kind of value-preservation. (Relevant comment.)
    - EG, you might start out with very simple shards that activate when diamonds are visible and vote to walk directly toward them. These might elaborate to do path-planning toward visible diamonds, and then to do path-planning to any diamonds which are present in the world-model, and so on, upwards in sophistication. So you go from some learned behavior that could be anthropomorphized as diamond-seeking, to eventually having a highly rational/intelligent/capable shard which really does want diamonds.
    - Again, I'm unclear on whether this can be expected to happen in very large networks, due to the lottery ticket hypothesis. But assuming some path-dependency, it is unclear to me whether it will work like this.
    - The claim would of course be very significant if true. It's a different framework, but effectively, this is a claim about ontological shifts - as you more-or-less flag in your major open questions.
    - While I find the particular examples intuitive, the overall claim seems too good to be true: effectively, that the path-dependencies which differentiate GD learning from ideal Bayesian learning are exactly the tool we need for alignment.

So, it seems to me, the most important research questions to figure out if a plan like this could be feasible revolve around the nature and existence of path-dependencies in GD, especially for very large NNs.
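
One way to make the linearity worry precise, assuming something like the lazy-training (neural tangent kernel) picture is what is at stake here, which is an interpretive assumption rather than something established above: if the parameters $\theta$ stay close to their initialization $\theta_0$, the network output is approximately its first-order Taylor expansion,

$$f(x;\theta) \;\approx\; f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}(\theta - \theta_0).$$

Under that approximation the model is linear in $\theta$, so any loss that is convex in the outputs yields a convex training problem, which leaves gradient descent much less room for path-dependence. How well real large networks fit this regime is exactly the open question.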

(Huh, I never saw this -- maybe my weekly batched updates are glitched? I only saw this because I was on your profile for some other reason.)

I really appreciate these thoughts!

But you then propose an RL scheme. It seems to me like it's still a useful form of critique to say: here are the upward errors in the proposed rewards, here is the policy that would exploit them.

I would say "that isn't how on-policy RL works; it doesn't just intelligently find increasingly high-reinforcement policies; which reinforcement events get 'exploited' depends on the exploration policy." (You seem to guess that this is my response in the next sub-bullets.)
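
A toy illustration of that point, using a two-armed bandit rather than a real policy-gradient setup (the arm rewards, initial estimates, and step size are all made-up assumptions): the agent only updates on arms it actually pulls, so an "upward error" sitting on arm 1 is only exploited if the exploration policy ever samples it.

```python
import random

def run_bandit(epsilon, steps=200, seed=0):
    """Epsilon-greedy agent that only learns from the arms it actually pulls."""
    rng = random.Random(seed)
    rewards = [1.0, 10.0]    # arm 1 carries the hypothetical "upward error"
    estimates = [0.5, 0.0]   # initially, arm 0 looks better to the agent
    pulls = [0, 0]
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(2)                          # exploratory pull
        else:
            arm = 0 if estimates[0] >= estimates[1] else 1  # greedy pull
        pulls[arm] += 1
        estimates[arm] += 0.5 * (rewards[arm] - estimates[arm])
    return pulls, estimates

greedy_pulls, greedy_est = run_bandit(epsilon=0.0)
explore_pulls, explore_est = run_bandit(epsilon=0.2)
```

With `epsilon=0.0` the agent never pulls arm 1, so its estimate for that arm never moves, however large the reward there is. With some exploration, the first pull of arm 1 is enough to flip the greedy choice.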

While I find the particular examples intuitive, the overall claim seems too good to be true: effectively, that the path-dependencies which differentiate GD learning from ideal Bayesian learning are exactly the tool we need for alignment.

shrug, too good to be true isn't a causal reason for it to not work, of course, and I don't see something suspicious in the correlations. Effective learning algorithms may indeed have nice properties we want, especially if some humans have those same nice properties due to their own effective learning algorithms!

For my money, the nice properties of human and AI systems that matter for alignment are not the properties from Shard Theory, but rather several others:

Alignment generalizes further than capabilities, because verifying is easier than generating, and because learning values is easier than acquiring a lot of other real-world capabilities.
It's looking like the values of humans are far, far simpler than a lot of evopsych literature and Yudkowsky thought, and related to this, values are less fragile than people thought 15-20 years ago, in the sense that values generalize far better OOD than people used to think 15-20 years ago.
The brain and DL AIs, while not the same thing, are doing reasonably similar things such that we can transport a lot of AI insights into neuroscience/human brain insights, and vice versa.
One of those lessons is that the bitter lesson from Sutton applies to human values and morals, which cashes out into the fact that the data matter much more than the algorithm when predicting a system's values, especially the OOD generalization of those values, and thus controlling the data is basically equivalent to controlling the values.

It's looking like the values of humans are far, far simpler than a lot of evopsych literature and Yudkowsky thought, and related to this, values are less fragile than people thought 15-20 years ago, in the sense that values generalize far better OOD than people used to think 15-20 years ago

I'm not sure I like this argument very much, as it currently stands. It's not that I believe anything you wrote in this paragraph is wrong per se, but more like this misses the mark a bit in terms of framing.

Yudkowsky had (and, AFAICT, still has) a specific theory of human values in terms of what they mean in a reductionist framework, where it makes sense (and is rather natural) to think of (approximate) utility functions of humans and of Coherent Extrapolated Volition as things-that-exist-in-the-territory.

I think a lot of writing and analysis, summarized by me here, has cast a tremendous amount of doubt on the viability of this way of thinking and has revealed what seem to me to be impossible-to-patch holes at the core of these theories. I do not believe "human values" in the Yudkowskian sense ultimately make sense as a coherent concept that carves reality at the joints; I instead observe a tremendous number of unanswered questions and apparent contradictions that throw the entire edifice in disarray.

But supplementing this reorientation of thinking around what it means to satisfy human values has been "prosaic" alignment researchers pivoting more towards intent alignment as opposed to doomed-from-the-start paradigms like "learning the true human utility function" or ambitious value learning, a recognition that realism about (AGI) rationality is likely just straight-up false and that the very specific set of conclusions MIRI-clustered alignment researchers have reached about what AGI cognition will be like are entirely overconfident and seem contradicted by our modern observations of LLMs, and ultimately an increased focus on the basic observation that full value alignment simply is not required for a good AI outcome (or at the very least to prevent AI takeover). So it's not so much that human values (to the extent such a thing makes sense) are simpler, but more so that fulfilling those values is just not needed to nearly as high a degree as people used to think.

Here's my thoughts on these interesting questions that you raise:

Yudkowsky had (and, AFAICT, still has) a specific theory of human values in terms of what they mean in a reductionist framework, where it makes sense (and is rather natural) to think of (approximate) utility functions of humans and of Coherent Extrapolated Volition as things-that-exist-in-the-territory.

I don't think Coherent Extrapolated Volition works, basically because I don't expect convergence in values by default, and here I agree with Steven Byrnes and Joe Carlsmith:

https://joecarlsmith.com/2021/06/21/on-the-limits-of-idealized-values

https://www.lesswrong.com/posts/SqgRtCwueovvwxpDQ/valence-series-2-valence-and-normativity#2_7_3_Possible_implications_for_AI_alignment_discourse

That said, I think the approximate utility function framing is actually correct, in that the GPT series (and maybe o1/o3 too) does have a utility function that's about prediction, and we can validly turn utility functions over plans/predictions into utility functions over world states, so we can connect two different types of utility functions together, and I have commented on this before:

https://www.lesswrong.com/posts/FuGfR3jL3sw6r8kB4/richard-ngo-s-shortform#aCFCrRDALk3DMNkzh

https://www.lesswrong.com/posts/FuGfR3jL3sw6r8kB4/richard-ngo-s-shortform#gjE9eDiAZvzKxcgSs

More generally, I have more faith than you do that the utility function paradigm can be reformed significantly without wholly abandoning it.

I also think that to the extent human values do work, it's at a higher level than reductionism would posit, but built out of lower-level parts that are changeable.

I think a lot of writing and analysis, summarized by me here, has cast a tremendous amount of doubt on the viability of this way of thinking and has revealed what seem to me to be impossible-to-patch holes at the core of these theories. I do not believe "human values" in the Yudkowskian sense ultimately make sense as a coherent concept that carves reality at the joints; I instead observe a tremendous number of unanswered questions and apparent contradictions that throw the entire edifice in disarray

To handle the unanswered questions, I'll first handle this one:

What do we mean by morality as fixed computation in the context of human beings who are decidedly not fixed and whose moral development through time is almost certainly so path-dependent (through sensitivity to butterfly effects and order dependence) that a concept like "CEV" probably doesn't make sense? The feedback loops implicit in the structure of the brain cause reward and punishment signals to "release chemicals that induce the brain to rearrange itself" in a manner closely analogous to and clearly reminiscent of a continuous and (until death) never-ending micro-scale brain surgery. To be sure, barring serious brain trauma, these are typically small-scale changes, but they nevertheless fundamentally modify the connections in the brain and thus the computation it would produce in something like an emulated state (as a straightforward corollary, how would an em that does not "update" its brain chemistry the same way that a biological being does be "human" in any decision-relevant way?). We can think about a continuous personal identity through the lens of mutual information about memories, personalities etc, but our current understanding of these topics is vastly incomplete and inadequate, and in any case the naive (yet very widespread, even on LW) interpretation of "the utility function is not up for grabs" as meaning that terminal values cannot be changed (or even make sense as a coherent concept) seems totally wrong.

I agree with the path dependency argument that morality is more path dependent than LWers think, and while controlling your value evolution is easier than predicting it, I basically agree with the claim that CEV probably makes less sense/isn't unique.

I think the fact that the feedback loops update the brain significantly, changing the computation graph and thus significantly complicating updatelessness for uploaded/embodied humans, is a real problem. A big reason why the early updateless decision theories didn't matter that much is that they assumed omniscience in a logical and computational sense, so they don't work for real humans.

I'm not sure if this is unsolvable, and I wouldn't say that there won't be a satisfying implementation of UDT in the future, but yeah I would not yet bet that much on updatelessness working out very well.

(Heck, even Solomonoff induction/AIXI isn't logically/computationally omniscient, if only because they also can't compute certain functions that more powerful computers can)

Vladimir Nesov describes it more here:

https://www.lesswrong.com/posts/FuGfR3jL3sw6r8kB4/richard-ngo-s-shortform#s4cTgQZNpWRLKp3EG

https://www.lesswrong.com/posts/FuGfR3jL3sw6r8kB4/richard-ngo-s-shortform#fdDfad5s8cS5kumEf

I am not sure whether any version of updatelessness in decision theory can survive realistic constraints, so I don't know whether it is solvable, and I don't think it matters for now.

We can think about a continuous personal identity through the lens of mutual information about memories, personalities etc, but our current understanding of these topics is vastly incomplete and inadequate, and in any case the naive (yet very widespread, even on LW) interpretation of "the utility function is not up for grabs" as meaning that terminal values cannot be changed (or even make sense as a coherent concept) seems totally wrong.

I basically agree with this, and I think the likely wrong assumption here is that an AI terminal value can be treated as fixed, and IMO the big issue I have with a lot of LW content on values is precisely that they treat a utility function of a human (when the utility function framing works, which does work sometimes but also doesn't work other times) as fixed.

I think this post might help you see how a more realistic version of utility functions would work, one which includes the fact that terminal values change:

https://www.lesswrong.com/posts/RorXWkriXwErvJtvn/agi-will-have-learnt-utility-functions

On Charlie Steiner's view of Goodhart, quoted below, I have a pretty simple response:

There has already been a great deal of discussion about these topics on LW (1, 2, etc), and Charlie Steiner's distillation of it in his excellently-written Reducing Goodhart sequence still seems entirely correct:

Humans don't have our values written in Fortran on the inside of our skulls, we're collections of atoms that only do agent-like things within a narrow band of temperatures and pressures. It's not that there's some pre-theoretic set of True Values hidden inside people and we're merely having trouble getting to them - no, extracting any values at all from humans is a theory-laden act of inference, relying on choices like "which atoms exactly count as part of the person" and "what do you do if the person says different things at different times?"
The natural framing of Goodhart's law - in both mathematics and casual language - makes the assumption that there's some specific True Values in here, some V to compare to U. But this assumption, and the way of thinking built on top of it, is crucially false when you get down to the nitty gritty of how to model humans and infer their values.

The answer to this is basically John Wentworth's comment, repeated here:

https://www.lesswrong.com/posts/gQY6LrTWJNkTv8YJR/the-pointers-problem-human-values-are-a-function-of-humans#Ar87Jkeg8TzSraLcD

Ok, I think I see what you're saying now. I am of course on board with the notion that e.g. human values do not make sense when we're modelling the human at the level of atoms. I also agree that the physical system which comprises a human can be modeled as wanting different things at different levels of abstraction.
However, there is a difference between "the physical system which comprises a human can be interpreted as wanting different things at different levels of abstraction", and "there is not a unique, well-defined referent of 'human values'". The former does not imply the latter. Indeed, the difference is essentially the same issue in the OP: one of these statements has a type-signature which lives in the physical world, while the other has a type-signature which lives in a human's model.
An analogy: consider a robot into which I hard-code a utility function and world model. This is a physical robot; on the level of atoms, its "goals" do not exist in any more real a sense than human values do. As with humans, we can model the robot at multiple levels of abstraction, and these different models may ascribe different "goals" to the robot - e.g. modelling it at the level of an electronic circuit or at the level of assembly code may ascribe different goals to the system, there may be subsystems with their own little control loops, etc.
And yet, when I talk about the utility function I hard-coded into the robot, there is no ambiguity about which thing I am talking about. "The utility function I hard-coded into the robot" is a concept within my own world-model. That world-model specifies the relevant level of abstraction at which the concept lives. And it seems pretty clear that "the utility function I hard-coded into the robot" would correspond to some unambiguous thing in the real world - although specifying exactly what that thing is, is an instance of the pointers problem.
Does that make sense? Am I still missing something here?

More generally, solutions to these sorts of problems come down to the fact that we can make a new abstract layer out of atomic parts, and use error-correction to make the abstraction as non-leaky as possible.

On what human preferences are, I'll state my answer below:

What counts as human "preferences"? Are these utility function-like orderings of future world states, or are they ultimately about universe-histories, or maybe a combination of those, or maybe something else entirely? Do we actually have any good reason to think that (some form of) utility maximization explains real-world behavior, or are the conclusions broadly converged upon on LW ultimately a result of intuitions about what powerful cognition must be like whose source is a set of coherence arguments that do not stretch as far as they were purported to? What do we do with the fact that humans don't seem to have utility functions and yet lingering confusion about this remained as a result of many incorrect and misleading statements by influential members of the community?

To answer the question of whether we have good reason to expect utility maximization as a description of what real AI looks like, my modern answer is "Because this happened at least 2 times, and thus provides limited but useful evidence on whether utility maximization will appear."

To answer the question of what counts as human preferences, I think all of the answers have some correctness: human preferences are over both world states and universe histories, and also plausibly include some values that aren't reducible to the utility function view.

An important point about utility functions/coherence arguments is that, for the purposes of coherence theorems, we only care about revealed preferences/behaviors. Thus we only need to observe an agent's behavior to check whether or not it has a coherent utility function, not whether a utility function is truly implemented inside its head:

https://www.lesswrong.com/posts/DXxEp3QWzeiyPMM3y/a-simple-toy-coherence-theorem#Coherence_Is_About_Revealed_Preferences

https://www.lesswrong.com/posts/yCuzmCsE86BTu9PfA/there-are-no-coherence-theorems#ddSmggkynaAHmFuHi

This matters for debates like whether humans are coherent or not.
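
As a minimal sketch of what checking coherence from behavior alone could look like, assuming we only observe strict pairwise choices (the data format and function name are hypothetical): a cycle among revealed preferences rules out any utility function over those options, while the absence of a cycle only shows that the observed choices are consistent with one. Actual coherence theorems deal with richer settings than this.

```python
def find_preference_cycle(choices):
    """choices: iterable of (chosen, rejected) pairs observed from behavior.

    Returns a list of options forming a preference cycle, or None if the
    revealed strict preferences are acyclic.
    """
    graph = {}
    for chosen, rejected in choices:
        graph.setdefault(chosen, set()).add(rejected)
        graph.setdefault(rejected, set())

    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}
    path = []

    def visit(node):
        # Depth-first search; reaching an ACTIVE node again means a cycle.
        state[node] = ACTIVE
        path.append(node)
        for nxt in sorted(graph[node]):
            if state[nxt] == ACTIVE:
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNSEEN:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        state[node] = DONE
        path.pop()
        return None

    for node in sorted(graph):
        if state[node] == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

consistent = find_preference_cycle([("a", "b"), ("b", "c")])
money_pump = find_preference_cycle([("a", "b"), ("b", "c"), ("c", "a")])
```

Nothing here looks inside the agent: the check runs entirely on the observed choices, which is the sense in which coherence is about revealed preferences.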

To answer @Wei Dai's question here:

On second thought, even if you assume the latter, the humans you're learning from will themselves have problems with distributional shifts. If you give someone a different set of life experiences, they're going to end up a different person with different values, so it seems impossible to learn a complete and consistent utility function by just placing someone in various virtual environments with fake memories of how they got there and observing what they do. Will this issue be addressed in the sequence?

I think the ultimate answer to this is to reject the assumption that values are fixed for all time, no matter which arbitrary environment is used, and instead focus on learned utility functions.

I think Wei Dai is pointing at a pretty real and deep problem in how LWers think about utility functions/values. It is downstream of AIXI, which has totally fixed values in the form of a reward function it optimizes for all time, becoming the dominant model of how AI was likely to end up if it achieved ASI. But contra Wei Dai and Rohin Shah, I don't think this dooms the ambitious value learning agenda based on utility functions, since not all utility functions are non-responsive to the environment.

https://www.lesswrong.com/posts/RorXWkriXwErvJtvn/agi-will-have-learnt-utility-functions

On this:

How can we use such large sample spaces when it becomes impossible for limited beings like humans or even AGI to differentiate between those outcomes and their associated events? After all, while we might want an AI to push the world towards a desirable state instead of just misleading us into thinking it has done so, how is it possible for humans (or any other cognitively limited agents) to assign a different value, and thus a different preference ranking, to outcomes that they (even in theory) cannot differentiate (either on the basis of sense data or through thought)?

This sounds like you are claiming that there are good reasons to believe that the pointers problem is fundamentally impossible to solve, at least in the general case.

I'll say a few things that are relevant here:

1. I don't particularly think this matters for AI x-risk in general, and thus I will mostly punt on this question.
2. I think AI progress is weak evidence that something like the pointers problem is possible to solve in theory, but not that strong.

I don't particularly think we need to solve this, or prove it impossible, today; we can defer to our future selves on this question.

https://www.lesswrong.com/posts/Mha5GA5BfWcpf2jHC/potential-bottlenecks-to-taking-over-the-world#XFSZNWrHANdXhoTcT

I feel similarly for realism about rationality, but I'd drop my second point.

and ultimately an increased focus on the basic observation that full value alignment simply is not required for a good AI outcome (or at the very least to prevent AI takeover).

I definitely agree that full value alignment is not required for humans to thrive in a world where AIs control the economy. This was not appreciated well enough by a lot of doomy people, primarily because they over-assumed that values are fragile and assumed that AIs would basically instantly take over by FOOMing from today's intelligence to superintelligence. While I do genuinely think that human values are simpler than Yudkowsky thought, the observation that full alignment is not required for AI safety is an underrated insight.

This insight is compatible with a world where human values are genuinely simpler than we thought.

Wei Dai (yet again) and Stuart Armstrong explained how there doesn't seem to be a principled basis to expect "beliefs" and "values" to ultimately make sense as distinct and coherent concepts that carve reality at the joints, and also how inferring a human's preferences merely from their actions is impossible unless you make specific assumptions about their rationality and epistemic beliefs about the state of the world, respectively. Paul Christiano went in detail on why this means that even the "easy" goal inference problem, meaning the (entirely fantastical and unrealistically optimistic set-up) in which we have access to an infinite amount of compute and to the "complete human policy (a lookup table of what a human would do after making any sequence of observations)", and we must then come up with "any reasonable representation of any reasonable approximation to what that human wants," is in fact tremendously hard.

Re the difference between beliefs and values: for AIXI/fixed-value agents this is pretty easy. A value isn't updatable by Bayesian reasoning about the world, and in particular such an agent doesn't update its value system in response to moral arguments. Arbitrarily competent/compute-rich agents can have very different values from you, but not arbitrarily different beliefs, unless the difference is caused by the two of you being in different universes/worlds/situations.

For changeable-value agents like us, which I connected above to learned utility functions, this list might help:

You don't favor shorter long-list definitions of goodness over longer ones. The criteria for choosing the list have little to do with its length, and more with what a human brain emulation with such-and-such modifications to make it believe only and all relevant true empirical facts would decide once it had reached reflective moral equilibrium.
Agents who have a different "long list" definition cannot be moved by the fact that you've declared your particular long list "true goodness".
There would be no reason to expect alien races to have discovered the same long list defining "true goodness" as you.
An alien with a different "long list" than you, upon learning the causal reasons for the particular long list you have, is not going to change their long list to be more like yours.
You don't need to use probabilities and update your long list in response to evidence. Quite the opposite: you want it to change only in specific circumstances that are set by you. (Edited from original.)

On the easy goal inference problem: if the no-free-lunch result on value learning is proved the way no-free-lunch theorems are usually proved in machine learning, then it's not a blocker under the infinite-compute affordance, since you can consider all 2^S (S is for set here) options for what the human values are by brute-force/uniform search, and stop once you have exhausted all possible inputs/options.

I'd almost call no-free-lunch theorems inapproximability results combined with computational complexity results: if you can make no assumptions, no algorithm performs better than brute-force/uniform search over all possible inputs/options, and you either perfectly learn a look-up table in the general case, or you don't learn the thing you want to learn at all.
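As a rough illustration of the "look-up table or nothing" point, here is a minimal Python sketch. The setup is my own toy assumption, not anything from the cited results: a finite set of situations stands in for S, a "value hypothesis" is any 0/1 labelling of those situations, and `consistent_hypotheses` is a hypothetical helper that brute-forces all 2^|S| labellings and keeps the ones matching what has been observed.

```python
from itertools import product

def consistent_hypotheses(inputs, observations):
    """Enumerate every labelling inputs -> {0, 1} that agrees with observations.

    Brute force over all 2**len(inputs) labellings; no prior assumptions
    about which labellings are more plausible (toy setup, assumed here).
    """
    for labels in product((0, 1), repeat=len(inputs)):
        hypothesis = dict(zip(inputs, labels))
        if all(hypothesis[x] == y for x, y in observations.items()):
            yield hypothesis

# Hypothetical situations and two observed approve(1)/disapprove(0) judgements.
inputs = ["a", "b", "c", "d"]
observations = {"a": 1, "b": 0}

survivors = list(consistent_hypotheses(inputs, observations))

# Count how many surviving hypotheses approve of the unobserved situation "c".
votes_c = sum(h["c"] for h in survivors)
print(len(survivors), votes_c)
```

With no assumptions, the surviving hypotheses split evenly on every unobserved situation, so the only way to pin down the labelling is to observe every entry of the table.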

I also agree with davidad here:

https://www.lesswrong.com/posts/yTvBSFrXhZfL8vr5a/worst-case-thinking-in-ai-alignment#N3avtTM3ESH4KHmfN

However, the issue is that there is no complexity bound on how complicated someone's values are in the general case, so I was definitely relying on the infinite compute here, which can't be used for our situation.

Maybe beliefs and values can be more or less unified in special cases, though I doubt that will happen.

For the differences between humans, I'll answer that question below:

Joe Carlsmith again, in his outstanding description of "An even deeper atheism", questioned "how different, exactly, are human hearts from each other? And in particular: are they sufficiently different that, when they foom, and even "on reflection," they don't end up pointing in exactly the same direction?", explaining that optimism about this type of "human alignment" is contingent on the "claim that most humans will converge, on reflection, to sufficiently similar values that their utility functions won't be "fragile" relative to each other." But Joe then keyed in on the crucial point that "while it's true that humans have various important similarities to each other (bodies, genes, cognitive architectures, acculturation processes) that do not apply to the AI case, nothing has yet been said to show that these similarities are enough to overcome the "extremal Goodhart" argument for value fragility", mentioning that when we "systematize" and "amp them up to foom", human desires decohere significantly. (This is the same point that Scott Alexander made in his classic post on the tails coming apart.) Ultimately, the essay concludes by claiming that it's perfectly plausible for most humans to be "paperclippers relative to each other [in the supposed reflective limit]", which is a position of "yet-deeper atheism" that goes beyond Eliezer's unjustified humanistic trust in human hearts.

I basically agree with this. One of the more important effects of AI very deep into takeoff is that we will start realizing that a lot of human alignment relied on the fact that people were dependent on each other, and that a person is dependent on society, so societal coercion like laws/police mostly works. AI more or less breaks this, and there is no reason to assume that a lot of people wouldn't be paperclippers relative to each other if they didn't need society.

To be clear, I still expect some level of cooperation, due to the existence of very altruistic people. But the reduction of positive-sum trades between different values, combined with a lot of our value systems only tolerating other value systems in contexts where we need other people, will make our future surprisingly dark compared to what people usually think, due to "most humans being paperclippers relative to each other [in the supposed reflective limit]".

One reason I think Eliezer got this wrong is as you stated, where he puts too much trust in human hearts, but one other reason I think he got this wrong is that he treated AI risk from an AI that kills everyone due to misalignment as a very high probability threat, and incorrectly assumed that it doesn't matter which humans, and which human values get control over AI, because of the assumption of psychological unity of humankind in values.

In essence, I think that politics matters way more than Eliezer thinks for how valuable the future is to you, and political fights, while not great, are unfortunately more necessary than you think. I disagree with the attitude in this quote:

Now: let's be clear, the AI risk folks have heard this sort of question before. "Ah, but aligned with whom?" Very deep. And the Yudkowskians respond with frustration. "I just told you that we're all about to be killed, and your mind goes to monkey politics? You're fighting over the poisoned banana!"

To address computationalism and indexical values for a bit, here's my answer:

In any case, the rather abstract "beliefs, memories and values" you solely purport to value fit the category of professed ego-syntonic morals much more so than the category of what actually motivates and generates human behavior, as Steven Byrnes explained in an expectedly outstanding way:

An important observation here is that professed goals and values, much more than actions, tend to be disproportionately determined by whether things are ego-syntonic or -dystonic. Consider: If I say something out loud (or to myself) (e.g. “I’m gonna quit smoking” or “I care about my family”), the actual immediate thought in my head was mainly “I’m going to perform this particular speech act”. It’s the valence of that thought which determines whether we speak those words or not. And the self-reflective aspects of that thought are very salient, because speaking entails thinking about how your words will be received by the listener. By contrast, the contents of that proclamation—actually quitting smoking, or actually caring about my family—are both less salient and less immediate, taking place in some indeterminate future (see time-discounting). So the net valence of the speech act probably contains a large valence contribution from the self-reflective aspects of quitting smoking, and a small valence contribution from the more direct sensory and other consequences of quitting smoking, or caring about my family. And this is true even if we are 100% sincere in our intention to follow through with what we say. (See also Approving reinforces low-effort behaviors, a blog post making a similar point as this paragraph.)
[...]
According to this definition, “values” are likely to consist of very nice-sounding, socially-approved, and ego-syntonic things like “taking care of my family and friends”, “making the world a better place”, and so on.
Also according to this definition, “values” can potentially have precious little influence on someone’s behavior. In this (extremely common) case, I would say “I guess this person’s desires are different from his values. Oh well, no surprise there.”
Indeed, I think it’s totally normal for someone whose “values” include “being a good friend” will actually be a bad friend. So does this “value” have any implications at all? Yes!! I would expect that, in this situation, the person would either feel bad about the fact that they were a bad friend, or deny that they were a bad friend, or fail to think about the question at all, or come up with some other excuse for their behavior. If none of those things happened, then (and only then) would I say that “being a good friend” is not in fact one of their “values”, and if they stated otherwise, then they were lying or confused.

I agree this is a plausible motivation, but one confound that applies here is that all discussions of uploading and whether it preserves you are fundamentally stalled by the fact that we don't have anything close to the classical uploading machines, so you have to discuss things pretty abstractly, and we don't have good terminology for this, which is why I'd prefer to punt this discussion until we have the technology.

Steve also argues, in my view correctly, that "all valence ultimately flows, directly or indirectly, from innate drives", which are entirely centered on (indexical, selfish) subjective experience such as pain, hunger, status drive, emotions etc. I see no clear causal mechanism through which something like that could ever make a human (copy) stop valuing its qualia in favor of the abstract concepts you purport to defend.

I disagree with the universal quantifier here, and think that non-innate values can also contribute to valence.

I agree innate drives are a useful starting point, but I don't buy that innate drives are the complete source of what you value, so I do think that non-indexical values can exist in humans.

More importantly, you can turn the valuing of experiences like status drive/emotions non-indexical if you modify the mind such that it always values a certain experience equally, no matter which copy has it. More generally, one of the changes I expect re identity and values amongst a lot of uploaded humans is to treat their values much less indexically, and to treat their identity as closer to an isomorphism/equivalence class of programs like their source code, rather than thinking in an instance-focused way.

A big reason for this is model merging, which is applicable to current AIs and could plausibly be used on uploaded humans as well. It allows you to unify goals in a satisfying manner across copies. (This is one of the reasons why AIs will in practice be closer to a single big being than billions of little beings, even if you could split them up into billions of little instances: this sort of model merging wouldn't work on AIs that had strongly indexical goals like ours, so the technology will incentivize non-indexical goals.)

More below:

https://minihf.com/posts/2024-11-30-predictable-updates-about-identity/

Also, I expect uploads to have more plasticity than current human brains.

As a general matter, accepting physicalism as correct would naturally lead one to the conclusion that what runs on top of the physical substrate works on the basis of... what is physically there (which, to the best of our current understanding, can be represented through Quantum Mechanical probability amplitudes), not what conclusions you draw from a mathematical model that abstracts away quantum randomness in favor of a classical picture, the entire brain structure in favor of (a slightly augmented version of) its connectome, and the entire chemical make-up of it in favor of its electrical connections. As I have mentioned, that is a mere model that represents a very lossy compression of what is going on; it is not the same as the real thing, and conflating the two is an error that has been going on here for far too long. Of course, it very well might be the case that Rob and the computationalists are right about these issues, but the explanation up to now should make it clear why it is on them to provide evidence for their conclusion.

I have a number of responses:

Quantum physics can be represented by a computation, since almost everything is representable by a computation, as shown below. The computationalist ontology is very, very expressive, and one reason why philosophical debates on computationalism go nowhere is that people don't realize how expressive the computationalist framework is. But because of this very expressivity, the computationalist ontology often buys you no predictions unless you are more specific.

More here:

http://www.amirrorclear.net/academic/ideas/simulation/index.html

More importantly, classical computers can always faithfully simulate a quantum system like a human given enough time, because quantum computers are no stronger than classical ones in what they can compute (at most they are faster), so in this regard the map and territory match well enough.

So a computationalist view of how things work, including human minds, is entirely compatible with physicalism/believing in Quantum Mechanics.

Finally, to get to my actual crux: while I do think the classical brain picture, where the brain is a classical connectome, is a lossy model, I don't think it's so lossy as to ruin the chances of human uploading being possible. Indeed, I'd argue given recent AI evidence that a surprisingly large amount of what makes our human brain special can be replicated by very different entities/very different substrates, which is indirect/weak empirical evidence in favor of uploading being possible.

The source is below:

https://minihf.com/posts/2024-11-30-predictable-updates-about-identity/

This also answers this portion below:

More specifically, is a real-world being actually the same as the abstract computation its mind embodies? Rejections of souls and dualism, alongside arguments for physicalism, do not prove the computationalist thesis to be correct, as physicalism-without-computationalism is not only possible but also (as the very name implies) a priori far more faithful to the standard physicalist worldview.

The answer is, in a sense, trivially yes: a real-world being is the same as at least one abstract computation, solely because the algorithmic ontology is strictly more expressive than the physicalist ontology, and physicalist worldviews are also compatible with computationalist worldviews.

On human preferences:

In any case, are they indexical or not? If we are supposed to think about preferences in terms of revealed preferences only, what does this mean in a universe (or an Everett branch, if you subscribe to that particular interpretation of QM) that is deterministic? Aren't preferences thought of as being about possible worlds, so they would fundamentally need to be parts of the map as opposed to the actual territory, meaning we would need some canonical framework of translating the incoherent and yet supposedly very complex and multidimensional set of human desires into something that actually corresponds to reality? What additional structure must be grafted upon the empirically-observable behaviors in order for "what the human actually wants" to be well-defined?

One key thing that helps us out: in an MWI/infinite-universe scenario, assuming our locally observable/affectable universe isn't special in how atoms clump up into bigger structures (which is pretty likely), every combination of atoms that is possible according to the laws of physics will be tried. Thus we can sensibly define the notion of possible worlds even in a deterministic universe/MWI multiverse, since all combinations allowed by the laws of physics will occur, and you can separate the possible worlds from a very fine-grained perspective (if you could generate the entire universe yourself). So in theory possible worlds work out.

Thus, we don't need to translate in theory between the map and territory, but in practice, we would like ways for translating human preferences that are in the map, into preferences that correspond to the territory of reality.

On human values, I'd predict that current human values combine indexical parts referring to particular contexts with some abstract values that are not indexical and are essentially context-free/invariant to copying scenarios. Justice/freedom seems likely to be one such value (for a non-trivial number of humans).

However, I don't know the structural assumptions that are required in order to make the question "what does the human actually want?" a well defined question under realistic constraints on compute.

If we allow unrealistic models of computation like a Turing Machine that is allowed to have infinitely many states, or a Blum-Shub-Smale machine, then it's easy to both make the question well-defined and make it answerable by these machines even under the no-free-lunch theorems, because we have at most 2^N (N here refers to the set of all natural numbers, including 0) possibilities, which are all computable by the models of computation above.

On agents:

On the topic of agency, what exactly does that refer to in the real world? Do we not "first need a clean intuitively-correct mathematical operationalization of what "powerful agent" even means"? Are humans even agents, and if not, what exactly are we supposed to get out of approaches that are ultimately all about agency? How do we actually get from atoms to agents? (note that the posts in that eponymous sequence do not even come close to answering this question)

I definitely could agree with something like a claim that humans are closer to control processes than agents, or at least that the basic paradigm shouldn't be agents but something else, but for our purposes, I don't think we need a clean mathematical operationalization of what a powerful agent is in order for alignment to succeed in a practical sense.

And that is the end of the very long comment. I am fine with no response or with an incomplete response, but these are my thoughts on the very interesting questions you raise.

I want to echo Jonas's statement and say that this was an enjoyable and thought-provoking comment to read. I appreciate the deep engagement with the questions I posed and the work that went into everything you wrote. Strong-upvoted!

I will not write a point-by-point response right now, but perhaps I will sometime soon, depending on when I get some free time. We could maybe do a dialogue about this at some point too, if you're willing, but I'm not sure when I would be up for that just yet.

I am willing to do a dialogue, if you are interested @sunwillrise.

Randomly read this comment and I really enjoyed it, Turn it into a post? (I understand how annoying structuring complex thoughts coherently can be but maybe do a dialogue or something? I liked this.)

I largely agree about a lot of the things missing from people's views of utility functions, and I think you expressed some of that in a pretty good, deeper way.

When we get into acausality and Everett branches I think we're going a bit off-track. I think computational intractability and observer bias are interesting to bring up, but I always find it never leads anywhere. Quantum Mechanics is fundamentally observer invariant and so positing something like MWI is a philosophical stance (that is supported by Occam's razor) but it is still observer dependent: what if there are no observers?

(Pointing at Physics as Information Processing)

Do you have any specific reason why you're going into QMech when talking about brain-like AGI stuff?

Randomly read this comment and I really enjoyed it, Turn it into a post? (I understand how annoying structuring complex thoughts coherently can be but maybe do a dialogue or something? I liked this.)

Maybe I should try a dialogue with someone else on this, because I don't think any of my points are very extendible to a full post without someone helping me.

Do you have any specific reason why you're going into QMech when talking about brain-like AGI stuff?

To be frank, this was mostly about clarifying the philosophy around computationalism/human values in general, but I didn't go that deep into QMech for brain-like AGI and don't expect it to be immediately useful for my pursuits, so the only role for QMech here is in clarifying some confusions people have, and QMech wasn't even that necessary to make my points.

When we get into acausality and Everett branches I think we're going a bit off-track. I think computational intractability and observer bias are interesting to bring up, but I always find it never leads anywhere. Quantum Mechanics is fundamentally observer invariant and so positing something like MWI is a philosophical stance (that is supported by Occam's razor) but it is still observer dependent: what if there are no observers?

Okay, the thing I think you are pointing to is that the same outcomes/rules can be generated out of ontologically distinct interpretations, and for our purposes, the observer is basically anything that interacts with anything, whether it's a human or particle, and thus saying there are no observers corresponds to saying that there is nothing in the universe, including the forces, and in particular dark energy is exactly 0.

The answer is that it would be a very different universe than our universe is today.

It's looking like the values of humans are far, far simpler than a lot of evopsych literature and Yudkowsky.

I've missed this. Any particular link to get me started reading about this update? Shard theory seems to imply complex values in individual humans. Though certainly less fragile than Yudkowsky proposed.

Note, this is outside of Shard Theory's scope, and I wasn't appealing to shard theory here.

So the links that I personally viewed to make these updates are here:

This summary of Matthew Barnett's post:

https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument#N9ManBfJ7ahhnqmu7

And 2 links from Beren about alignment:

https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/

https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/

Great post! I think it's very good for alignment researchers to be this level of concrete about their plans, it helps enormously in a bunch of ways e.g. for evaluating the plan.

Comments as I go along:

Why wouldn't the agent want to just find an adversarial input to its diamond abstraction, which makes it activate unusually strongly? (I think that agents might accidentally do this a bit for optimizer's curse reasons, but not that strongly. More in an upcoming post.)
Consider why you wouldn't do this for "hanging out with friends." Consider the expected consequences of the plan "find an adversarial input to my own evaluation procedure such that I find a plan which future-me maximally evaluates as letting me 'hang out with my friends'." I currently predict that such a plan would lead future-me to daydream and not actually hang out with my friends, as present-me evaluates the abstract expected consequences of that plan. My friend-shard doesn't like that plan, because I'm not hanging out with my friends. So I don't search for an adversarial input. I infer that I don't want to find those inputs because I don't expect those inputs to lead me to actually hang out with my friends a lot as I presently evaluate the abstract-plan consequences.
I don't think an agent can consider searching for adversarial inputs to its shards without also being reflective, at which point the agent realizes the plan is dumb as evaluated by the current shards assessing the predicted plan-consequences provided by the reflective world-model.

How is the bolded sentence different from the following:

"Consider the expected consequences of the plan "think a lot longer and harder, considering a lot more possibilities for what you should do, and then make your decision." I currently predict that such a plan would lead future-me to waste his life doing philosophy or maybe get pascal's mugged by some longtermist AI bullshit instead of actually helping people with his donations. My helping-people shard doesn't like this plan, because it predicts abstractly that thinking a lot more will not result in helping people more."

(Basically I'm saying you should think more, and then write more, about the difference between these two cases because they seem plausibly on a spectrum to me, and this should make us nervous in a couple of ways. Are we actually being really stupid by being EAs and shutting up and calculating? Have we basically adversarial-exampled ourselves away from doing things that we actually thought were altruistic and effective back in the day? If not, what's different about the kind of extended search process we did, from the logical extension of that which is to do an even more extended search process, a sufficiently extreme search process that outsiders would call the result an adversarial example?)

In particular, even though online self-supervised learning continues to develop the world model and create more advanced concepts, the reward events also keep pinging the invocation of the diamond-abstraction as responsible for reward (because insofar as the agent's diamond-shard guides its decisions, then the diamond-shard's diamond-abstraction is in fact responsible for the agent getting reward). The diamond-abstraction gradient starves the AI from exclusively acting on the basis of possible advanced "alien" abstractions which would otherwise have replaced the diamond abstraction. The diamond shard already gets reward effectively, integrating with the rest of the agent's world model and recurrent state, and therefore provides "job security" for the diamond-abstraction. (And once the agent is smart enough, it will want to preserve its diamond abstraction, insofar as that is necessary for the agent to keep achieving its current goals which involve prototypical-diamonds.)

Are you sure that's how it works? Seems plausible to me but I'm a bit nervous, I think it could totally turn out to not work like that. (That is, it could turn out that the agent wanting to preserve its diamond abstraction is the only thing that halts the march towards more and more alien-yet-effective abstractions)

Suppose the AI keeps training, but by instrumental convergence, seeking power remains a good idea, and such decisions continually get strengthened. This strengthens the power-seeking shard relative to other shards. Other shards want to prevent this from happening.

You go on to talk about shards eventually values-handshaking with each other. While I agree that shard theory is a big improvement over the models that came before it (which I call rational agent model and bag o' heuristics model), I think shard theory currently has a big hole in the middle that mirrors the hole between bag o' heuristics and rational agents. Namely, shard theory currently basically seems to be saying "At first, you get very simple shards, like the following examples: IF diamond-nearby THEN goto diamond. Then, eventually, you have a bunch of competing shards that are best modelled as rational agents; they have beliefs and desires of their own, and even negotiate with each other!" My response is "but what happens in the middle? Seems super important! Also haven't you just reproduced the problem but inside the head?" (The problem being, when modelling AGI we always understood that it would start out being just a crappy bag of heuristics and end up a scary rational agent, but what happens in between was a big and important mystery. Shard theory boldly strides into that dark spot in our model... and then reproduces it in miniature! Progress, I guess.)

"Consider the expected consequences of the plan "think a lot longer and harder, considering a lot more possibilities for what you should do, and then make your decision." I currently predict that such a plan would lead future-me to waste his life doing philosophy or maybe get pascal's mugged by some longtermist AI bullshit instead of actually helping people with his donations. My helping-people shard doesn't like this plan, because it predicts abstractly that thinking a lot more will not result in helping people more."

(Basically I'm saying you should think more, and then write more, about the difference between these two cases because they seem plausibly on a spectrum to me, and this should make us nervous in a couple of ways. Are we actually being really stupid by being EAs and shutting up and calculating? Have we basically adversarial-exampled ourselves away from doing things that we actually thought were altruistic and effective back in the day? If not, what's different about the kind of extended search process we did, from the logical extension of that which is to do an even more extended search process, a sufficiently extreme search process that outsiders would call the result an adversarial example?)

I think there are several things happening. Here are some:

1. If an EA-to-be ("EAlice", let's say) in fact thought that EA would make her waste her life on bullshit, but went ahead anyways, she subjectively made a mistake.
  1. Were her expectations correct? That's another question. I personally think that AI ruin is real, it's not low-probability Pascal's Mugging BS, it's default-outcome IMO.
2. I think many EAs are making distortionary values choices.
  1. There is a socially easy way to quash parts of yourself which don't have immediate sophisticated-sounding arguments backing them up.
    1. But whatever values you do have (eg caring about your family), whatever caring you originally developed (via RL, according to shard theory), didn't originally come via some grand consequentialist or game-theoretic argument about happiness or freedom.
    2. So why should other values, like "avoiding spiders" or "taking time to relax", have to justify themselves? They're part of my utility function, so to speak! That's not up for grabs!
  2. I care more about my mom than other people's moms. Sue me!
  3. I agree with much of Self-Integrity and the Drowning Child.
3. I think a bunch of this has to do with meta-ethics, not with adversarial examples to values.
  1. It might be that your original "helping people" values are not what your old value-distribution would have reflectively endorsed. Like, maybe you were just prioritizing your friends and neighbors, but if you'd ever really thought about it, your reflective strong broadly activated shard coalition would have ruled "hey let's care about faraway people more."
    1. EG cooperation + happiness + empathy + fairness + local-helping shard -> generalize by creating a global-helping shard
  2. Or maybe EA did in fact trick EAlice and socially pressure and reshape her into a new being for whom this is reflectively endorsed.
    1. EG social shard -> global-helping shard
  3. Although EA is in fact selecting for people against whom its arguments constitute (weak) adversarial inputs. I don't think the selection is that strong? Confused here.

EDIT: One of the main threads is Don't design agents which exploit adversarial inputs. The point isn't that people can't or don't fall victim to plans which, by virtue of spurious appeal to a person's value shards, cause the person to unwisely pursue the plan. The point here is that (I claim) intelligent people convergently want to avoid this happening to them.

A diamond-shard will not try to find adversarial inputs to itself. That was my original point, and I think it stands.

I think I agree with everything you said yet still feel confused. My question/objection/issue was not so much "How do you explain people sometimes falling victim to plans which spuriously appeal to their value shards!?!? Checkmate!" but rather "what does it mean for an appeal to be spurious? What is the difference between just thinking long and hard about what to do vs. adversarially selecting a plan that'll appeal to you? Isn't the former going to in effect basically equal the latter, thanks to extremal Goodhart? In the limit where you consider all possible plans (maximum optimization power), aren't they the same?"

Yes, that's a good question. This is what I've been aiming to answer with recent posts.

What is the difference between just thinking long and hard about what to do vs. adversarially selecting a plan that'll appeal to you? Isn't the former going to in effect basically equal the latter, thanks to extremal Goodhart? In the limit where you consider all possible plans (maximum optimization power), aren't they the same?

(I'm presently confident the answer is "no", as might be clear from my comments and posts!)

OK, guess I'll go read those posts then...

How is the bolded sentence different from the following:

"Consider the expected consequences of the plan "think a lot longer and harder, considering a lot more possibilities for what you should do, and then make your decision." I currently predict that such a plan would lead future-me to waste his life doing philosophy or maybe get pascal's mugged by some longtermist AI bullshit instead of actually helping people with his donations. My helping-people shard doesn't like this plan, because it predicts abstractly that thinking a lot more will not result in helping people more."

(Basically I'm saying you should think more, and then write more, about the difference between these two cases because they seem plausibly on a spectrum to me, and this should make us nervous in a couple of ways. Are we actually being really stupid by being EAs and shutting up and calculating? Have we basically adversarial-exampled ourselves away from doing things that we actually thought were altruistic and effective back in the day? If not, what's different about the kind of extended search process we did, from the logical extension of that which is to do an even more extended search process, a sufficiently extreme search process that outsiders would call the result an adversarial example?)

I think this is a great observation. I thought about it a bit and don't really find myself worried, based off of some intuitions which I think would take me at least 20 minutes to type up right now, and I really should wrap my commenting up for now. Feel free to ping me if no one else has answered this in a while.

Seems plausible to me but I'm a bit nervous, I think it could totally turn out to not work like that.

Agreed.

Consider yourself pinged! No rush to reply though.

1 is evidentially supported by the only known examples of general intelligences, but also AI will not have the same inductive biases. So the picture might be more complicated. I’d guess shard theory is still appropriate, but that's ultimately a question for empirical work (with interpretability).^[12]

Shard theory seems more evidentially supported than bag-o-heuristics theory and rational agent theory, but that's a pretty low bar! I expect a new theory to come along which is as much of an improvement over shard theory as shard theory is over those.

Re the 5 open questions: Yeah 4 and 5 seem like the hard ones to me.

Anyhow, in conclusion, nice work & I look forward to reading future developments. (Now I'll go read the other comments)

shard theory currently basically seems to be saying "At first, you get very simple shards, like the following examples: IF diamond-nearby THEN goto diamond. Then, eventually, you have a bunch of competing shards that are best modelled as rational agents;^[1] they have beliefs and desires of their own, and even negotiate with each other!" My response is "but what happens in the middle? Seems super important! Also haven't you just reproduced the problem but inside the head?"

I think the hole is somewhat smaller than you make out, but still substantial. From The shard theory of human values:

when the baby has a proto-world model, the reinforcement learning process takes advantage of that new machinery by further developing the juice-tasting heuristics. Suppose the baby models the room as containing juice within reach but out of sight. Then, the baby happens to turn around, which activates the already-trained reflex heuristic of “grab and drink juice you see in front of you.” In this scenario, “turn around to see the juice” preceded execution of “grab and drink the juice which is in front of me”, and so the baby is reinforced for turning around to grab the juice in situations where the baby models the juice as behind herself.
By this process, repeated many times, the baby learns how to associate world model concepts (e.g. “the juice is behind me”) with the heuristics responsible for reward (e.g. “turn around” and “grab and drink the juice which is in front of me”). Both parts of that sequence are reinforced. In this way, the contextual-heuristics become intertwined with the budding world model.
[...]
While all of this is happening, many different shards of value are also growing, since the human reward system offers a range of feedback signals. Many subroutines are being learned, many heuristics are developing, and many proto-preferences are taking root. At this point, the brain learns a crude planning algorithm, because proto-planning subshards (e.g. IF motor-command-5214 predicted to bring a juice pouch into view, THEN execute) would be reinforced for their contributions to activating the various hardcoded reward circuits. This proto-planning is learnable because most of the machinery was already developed by the self-supervised predictive learning, when e.g. learning to predict the consequences of motor commands (see Appendix A.1).
The planner has to decide on a coherent plan of action. That is, micro-incoherences (turn towards juice, but then turn back towards a friendly adult, but then turn back towards the juice, ad nauseam) should generally be penalized away. Somehow, the plan has to be coherent, integrating several conflicting shards. We find it useful to view this integrative process as a kind of “bidding.” For example, when the juice-shard activates, the shard fires in a way which would have historically increased the probability of executing plans which led to juice pouches. We’ll say that the juice-shard is bidding for plans which involve juice consumption (according to the world model), and perhaps bidding against plans without juice consumption.

I have some more models beyond what I've shared publicly, and eg one of my MATS applicants proposed an interesting story for how the novelty-shard forms, and also proposed one tack of research for answering how value negotiation shakes out (which is admittedly at the end of the gap). But overall I agree that there's a substantial gap here. I've been working on writing out pseudocode for what shard-based reflective planning might look like.
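For concreteness, here is a minimal toy sketch of the "bidding" picture in the quoted passage. It is not the pseudocode mentioned above, and it is not a claim about how shards are actually implemented: the `Shard` class, the context and plan features, and all of the numbers are hypothetical, chosen only to show contextually activated shards bidding for and against plans according to a world model's predictions.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Context = Dict[str, bool]   # hypothetical world-model features of the current situation
Plan = Dict[str, float]     # hypothetical predicted outcome features of a candidate plan


@dataclass
class Shard:
    name: str
    is_active: Callable[[Context], bool]   # contextual activation
    bid: Callable[[Plan], float]           # positive = bid for, negative = bid against


def choose_plan(shards, context, plans):
    """Pick the plan with the highest summed bid from the currently active shards."""
    active = [s for s in shards if s.is_active(context)]

    def total_bid(plan_name):
        return sum(s.bid(plans[plan_name]) for s in active)

    return max(plans, key=total_bid)


# Assumed toy shards: the juice-shard only activates when juice is modelled as
# nearby, bids for plans predicted to involve juice, and bids against the rest.
juice_shard = Shard(
    name="juice",
    is_active=lambda ctx: ctx.get("juice_modelled_nearby", False),
    bid=lambda plan: 2.0 * plan["juice_consumed"] - 0.5,
)
social_shard = Shard(
    name="social",
    is_active=lambda ctx: True,
    bid=lambda plan: 1.0 * plan["near_adult"],
)

plans = {
    "turn_to_juice": {"juice_consumed": 1.0, "near_adult": 0.0},
    "turn_to_adult": {"juice_consumed": 0.0, "near_adult": 1.0},
}

shards = [juice_shard, social_shard]
with_juice = choose_plan(shards, {"juice_modelled_nearby": True}, plans)
without_juice = choose_plan(shards, {"juice_modelled_nearby": False}, plans)
```

Summing bids is only one possible integration rule; the sketch says nothing about how the shards form or how negotiation between them shakes out, which is the gap under discussion.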

^[1] I think they aren't quite best modelled as rational agents, but I'm confused about what axes they are agentic along and what they aren't.

I appreciate the effort and strong-upvoted this post because I think it's following a good methodology of trying to build concrete gear-level models and concretely imagining what will happen, but also think this is really very much not what I expect to happen, and in my model of the world is quite deeply confused about how this will go (mostly by vastly overestimating the naturalness of the diamond abstraction, underestimating convergent instrumental goals and associated behaviors, and relying too much on the shard abstraction). I don't have time to write a whole response, but in the absence of a "disagreevote" on posts am leaving this comment.

Thanks. Am interested in hearing more at some point.

I also want to note that insofar as this extremely basic approach ("reward the agent for diamond-related activities") is obviously doomed for reasons the community already knew about, then it should be vulnerable to a convincing linkpost comment which points out a fatal, non-recoverable flaw in my reasoning (like: "TurnTrout, you're ignoring the obvious X and Y problems, linked here:"). I'm posting this comment as an invitation for people to reply with that, if appropriate!^[1]

And if there is nothing previously known to be obviously fatal, then I think the research community moved on too quickly by assuming the frame of inner/outer alignment. Even if this proposal has a new fatal flaw, that implies the perceived old fatal flaws (like "the agent games its imperfect objective") were wrong / only applicable in that particular frame.

ETA: I originally said "devastating" instead of "convincing." To be clear: I am looking for courteous counterarguments focused on truth-seeking, and not optimized for "devastation" in a social sense.

^[1] That's not to say you should have supplied it. I think it's good for people to say "I disagree" if that's all they have time for, and I'm glad you did.

First, I think most of the individual pieces of this story are basically right, so good job overall. I do think there's at least one fatal flaw and a few probably-smaller issues, though.

The main fatal flaw is this assumption:

Since “IF diamond”-style predicates do in fact “perfectly classify” the positive/negative approach/don’t-approach decision contexts...

This assumes that the human labellers (or automated labellers created by humans) have perfectly labelled the training examples.

I'm mostly used to thinking about this in the context of alignment with human values (or corrigibility etc), where it's very obvious that human labellers will make mistakes. In the case of diamonds, it is maybe plausible that we could get a dataset with zero incorrect labels, but that's still a pretty difficult problem if the dataset is to be reasonably large and diverse.

If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction. Once the AI has an internal representation of the actual labelling process, that proto-shard will be reinforced more than the proto-diamond shard, because it will match the label in cases where the diamond-concept doesn't (and the reverse will not happen, or at least will happen less often and only due to random noise).

Probably-smaller issues:

"Acquiring" things is a... tricky concept in an embedded setting, which I expect to require some specific environmental features. I very strongly doubt that rewarding an agent for just approaching diamonds (especially in the absence of other agents) would induce it, and even the lottery training's shard-impact depends on exactly how the diamond is "given" to the agent.
Similarly, it seems like arguably zero of the proposed pieces of training would reward the agent for causing more diamond to exist, which does not bode well for a diamond-production shard showing up at all. Seems like the agent will mostly just want to be near lots of diamonds, and plausibly will not even consider the idea of creating more diamonds.
- In particular, this training scheme could easily make the agent develop a shard which dislikes the existence of diamonds far away from the agent, which would ultimately push against large-scale diamond-creation.

The easy way to patch these is to forget about approach-rewards altogether, and just reward the agent for causing more diamond to exist (or for total amount of diamond which exists in its environment). That's more directly what we want from a diamond-optimizer anyway.

Note that all of these issues are much more obvious if we start from the standard heuristic that the trained agent will end up optimizing for whatever generated its reward, and then pay attention to how well-aligned that reward-generator is with whatever we actually want. You've been quite vocal about how that heuristic leads to some incorrect conclusions, but it does highlight real and important considerations which are easy to miss without it.

If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction.

I don't think this is true. For example, humans do not usually end up optimizing for the activations of their reward circuitry, not even neuroscientists. Also note that humans do not infer the existence of their reward circuitry simply from observing the sequence of reward events. They have to learn about it by reading neuroscience. I think that steps like "infer the existence / true nature of distant latent generators that explain your observations" are actually incredibly difficult for neural learning processes (human or AI). Empirically, SGD is perfectly willing to memorize deviations from a simple predictor, rather than generalize to a more complex predictor. Current ML would look very different if inferences like that were easy to make (and science would be much easier for humans).

Even when a distant latent generator is inferred, it is usually not the correct generator, and usually just memorizes observations in a slightly more efficient way by reusing current abstractions. E.g., religions which suppose that natural disasters are the result of a displeased, agentic force.

I partly buy that, but we can easily adjust the argument about incorrect labels to circumvent that counterargument. It may be that the full label generation process is too "distant"/complex for the AI to learn in early training, but insofar as there are simple patterns to the humans' labelling errors (which of course there usually are, in practice) the AI will still pick up those simple patterns, and shards which exploit those simple patterns will be more reinforced than the intended shard. It's like that example from the RLHF paper where the AI learns to hold a grabber in front of a ball to make it look like it's grabbing the ball.

I think something like what you're describing does occur, but my view of SGD is that it's more "ensembly" than that. Rather than "the diamond shard is replaced by the pseudo-diamond-distorted-by-mislabeling shard", I expect the agent to have both such shards (really, a giant ensemble of shards each representing slightly different interpretations of what a diamond is).

Behaviorally speaking, this manifests as the agent having preferences for certain types of diamonds over others. E.g., one very simple example is that I expect the agent to prefer nicely cut and shiny diamonds over unpolished diamonds or giant slabs of pure diamond. This is because I expect human labelers to be strongly biased towards the human conception of diamonds as pieces of art, over any configuration of matter with the chemical composition of a diamond.

Why does the ensembling matter?

I could imagine a story where it matters - e.g. if every shard has a veto over plans, and the shards are individually quite intelligent subagents, then the shards bargain and the shard-which-does-what-we-intended has to at least gain over the current world-state (otherwise it would veto). But that's a pretty specific story with a lot of load-bearing assumptions, and in particular requires very intelligent shards. I could maybe make an argument that such bargaining would be selected for even at low capability levels (probably by something like Why Subagents?), but I wouldn't put much confidence in that argument.

... and even then, that argument would only work if at least one shard exactly matches what we intended. If all of the shards are imperfect proxies, then most likely there are actions which can Goodhart all of them simultaneously. (After all, proxy failure is not something we'd expect to be uncorrelated - conditions which cause one proxy to fail probably cause many to fail in similar ways.)

On the other hand, consider a more traditional "ensemble", in which our ensemble of shards votes (with weights) or something. Typically, I expect training dynamics will increase the weight on a component exponentially w.r.t. the number of bits it correctly "predicts", so exploiting even a relatively small handful of human-mislabellings will give the exploiting shards much more weight. And on top of that, a mix of shards does not imply a mix of behavior; if a highly correlated subset of the shards controls a sufficiently large chunk of the weight, then they'll have de-facto control over the agent's behavior.
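One way to make the "exponential in bits" claim concrete is a Bayesian-mixture toy model. This is an illustrative assumption, not a description of actual SGD or policy-gradient dynamics (whether training behaves like this is part of what is disputed in this thread). Each hypothetical component assigns a probability to every observed label, and its weight is multiplied by that probability. The component names and all numbers below are made up.

```python
import math


def posterior_weights(prior, likelihoods):
    """Bayesian mixture update: weight_i is proportional to prior_i * prod_t p_i(label_t)."""
    log_w = {
        name: math.log(prior[name]) + sum(math.log(p) for p in likelihoods[name])
        for name in prior
    }
    top = max(log_w.values())  # subtract the max for numerical stability
    unnorm = {name: math.exp(v - top) for name, v in log_w.items()}
    total = sum(unnorm.values())
    return {name: v / total for name, v in unnorm.items()}


# Assumed setup: 100 labels, 5 of which are systematic labeller errors.
# Both components assign probability 0.9 to each of the 95 correct labels.
# On the 5 mislabelled examples, the "diamond" component assigns 0.1 to the
# observed (wrong) label, while the "labeller_proxy" component assigns 0.9.
n_clean, n_mislabelled = 95, 5
likelihoods = {
    "diamond": [0.9] * n_clean + [0.1] * n_mislabelled,
    "labeller_proxy": [0.9] * n_clean + [0.9] * n_mislabelled,
}
prior = {"diamond": 0.5, "labeller_proxy": 0.5}

weights = posterior_weights(prior, likelihoods)
odds = weights["labeller_proxy"] / weights["diamond"]  # (0.9 / 0.1) ** 5
bits_gained = math.log2(odds)                          # 5 * log2(9)
```

In this model the odds ratio between the two components is `(0.9 / 0.1) ** n_mislabelled`, so each additional exploited mislabelling multiplies the proxy's relative weight by a constant factor.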

Why does the ensembling matter?

I think there's something like "why are human values so 'reasonable', such that [TurnTrout inference alert!] someone can like coffee and another person won't and that doesn't mean they would extrapolate into bitter enemies until the end of Time?", and the answer seems like it's gonna be because they don't have one criterion of Perfect Value that is exactly right over which they argmax, but rather they do embedded, reflective heuristic search guided by thousands of subshards (shiny objects, diamonds, gems, bright objects, objects, power, seeing diamonds, knowing you're near a diamond, ...), such that removing a single subshard does not catastrophically exit the regime of Perfect Value.

I think this is one proto-intuition why Goodhart-arguments seem Wrong to me, like they're from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don't think I've properly communicated my feelings in this comment, but hopefully it's better than nothing.)

I think this is one proto-intuition why Goodhart-arguments seem Wrong to me, like they're from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don't think I've properly communicated my feelings in this comment, but hopefully it's better than nothing.)

My intuition is that in order to go beyond imitation learning and random exploration, we need some sort of "iteration" system (a la IDA), and the cases of such systems that we know of tend to either literally be argmax planners with crisp utility functions, or have similar problems to argmax planners with crisp utility functions.

What about this post?

Well so you're obviously pretraining using imitation learning, so I've got that part down.

If I understand your post right, the rest of the policy training is done by policy gradients on human-induced rewards? As I understand it, policy gradient is close to a maximally sample-hungry method, because it does not do any modelling. At one level I would class this as random exploration, but on another level the humans are allowed to provide reinforcement based on methods rather than results, so I suppose this also gives it an element of imitation learning.

So I guess my expectation is that your training method is too sample inefficient to achieve much beyond human imitation.

if every shard has a veto over plans, and the shards are individually quite intelligent subagents

I think this won't happen FWIW.

and even then, that argument would only work if at least one shard exactly matches what we intended. If all of the shards are imperfect proxies, then most likely there are actions which can Goodhart all of them simultaneously. (After all, proxy failure is not something we'd expect to be uncorrelated - conditions which cause one proxy to fail probably cause many to fail in similar ways.)

Can you provide a concrete instantiation of this argument? (ETA: struck this part, want to hear your response first to make sure it's engaging with what you had in mind)

I expect training dynamics will increase the weight on a component exponentially w.r.t. the number of bits it correctly "predicts"

What about your argument behaves differently in the presence of humans and AI? This is clearly not how shard dynamics work in people, as I understand your argument.
We aren't in the prediction regime, insofar as that is supposed to be relevant for your argument. Let's talk about the batch update, and not make analogies to predictions. (Although perhaps I was the one who originally brought it up in OP, I should rewrite that.)
Can you give me a concrete example of an "exploiting shard" in this situation which is learnable early on, relative to the actual diamond-shards?

And on top of that, a mix of shards does not imply a mix of behavior; if a highly correlated subset of the shards controls a sufficiently large chunk of the weight, then they'll have de-facto control over the agent's behavior.

The point I am arguing (ETA and I expect Quintin is as well, but maybe not) is that this will be one of the primary shards produced, not that there's a chance it exists at low weight or something.

... and even then, that argument would only work if at least one shard exactly matches what we intended. If all of the shards are imperfect proxies, then most likely there are actions which can Goodhart all of them simultaneously. (After all, proxy failure is not something we'd expect to be uncorrelated - conditions which cause one proxy to fail probably cause many to fail in similar ways.)

I read this as "the activations and bidding behaviors of the shards will themselves be imperfect, so you get the usual 'Goodhart' problem where highly rated plans are systematically bad and not what you wanted." I disagree with the conclusion, at least for many kinds of "imperfections."

Below is one shot at instantiating the failure mode you're describing. I wrote this story so as to (hopefully) contain the relevant elements. This isn't meant as a "slam dunk case closed", but hopefully something which helps you understand how I'm thinking about the issue and why I don't anticipate "and then the shards get Goodharted."

Example shard-Goodharting scenario. The AI bids for plans which it thinks lead to diamonds, but the subcircuit of the policy network which computes the relevant diamond abstraction is only a "proxy" for a reliable diamond abstraction. Unknown to the AI until the end of its training, that subcircuit (for some reason) activates very strongly for plans which lead to certain diamond-shaped formations of bacteria on the third Tuesday of the year.

Then this shard can be "goodharted" by actions which involve the creation of these bacteria diamonds at that time. There's a question, though, of whether the AI will actually consider these plans (so that it then actually bids on this plan, which is rated spuriously highly from our perspective). The AI knows, abstractly, that considering this plan would lead it to bid for that plan. But it seems to me like, since generating that plan is reflectively predicted to not lead to diamonds (nor does it activate the specific bidding-behavior edge case the agent abstractly knows about), the agent doesn't pursue that plan.

This was one of the main ideas I discussed in Alignment allows "nonrobust" decision-influences and doesn't require robust grading:

Summaries of key points:
Nonrobust decision-influences can be OK. A candy-shard contextually influences decision-making. Many policies lead to acquiring lots of candy; the decision-influences don't have to be "globally robust" or "perfect."
Values steer optimization; they are not optimized against. The value shards aren't getting optimized hard. The value shards are the things which optimize hard, by wielding the rest of the agent's cognition (e.g. the world model, the general-purpose planning API).

Since values are not the optimization target of the agent with those values, the values don't have to be adversarially robust.
Since values steer cognition, reflective agents try to avoid adversarial inputs to their own values. In self-reflective agents which can think about their own thinking, values steer e.g. what plans get considered next. Therefore, these agents convergently avoid adversarial inputs to their currently activated values (e.g. learning), because adversarial inputs would impede fulfillment of those values (e.g. lead to less learning).

This suggests "and so what is an 'adversarial input' to the values, then? What intensional rule governs the kinds of high-scoring plans which internal reasoning will decide to not evaluate in full?". I haven't answered that question yet on an intensional basis, but it seems tractable.
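As a toy illustration of the contrast drawn above, the sketch below compares an exhaustive argmax over a shard's activation with a search in which the agent's abstract prediction about diamonds decides which plans get evaluated at all. Everything here is hypothetical: the plan names, the numbers, and especially the filter rule, which is an assumed stand-in for the intensional rule that has not yet been specified.

```python
# Each plan has two made-up numbers:
#   predicted_diamonds: what the agent abstractly expects the plan to yield
#   shard_activation:   how strongly the (imperfect) diamond circuit would fire
#                       if the plan were evaluated in full
plans = {
    "build_diamond_mine": {"predicted_diamonds": 100.0, "shard_activation": 0.9},
    "buy_diamonds": {"predicted_diamonds": 10.0, "shard_activation": 0.6},
    "bacteria_diamond_shapes_on_third_tuesday": {
        "predicted_diamonds": 0.0,
        "shard_activation": 5.0,  # spurious edge case in the circuit
    },
}


def exhaustive_argmax(plans):
    """Evaluate every plan in full and take the highest activation."""
    return max(plans, key=lambda name: plans[name]["shard_activation"])


def values_steered_search(plans, min_predicted_diamonds=1.0):
    """Only evaluate plans that are abstractly predicted to lead to diamonds.

    The threshold filter is an assumption made for illustration only.
    """
    considered = [
        name for name, p in plans.items()
        if p["predicted_diamonds"] >= min_predicted_diamonds
    ]
    return max(considered, key=lambda name: plans[name]["shard_activation"])


argmax_choice = exhaustive_argmax(plans)
steered_choice = values_steered_search(plans)
```

The exhaustive search lands on the spurious plan, while the steered search never evaluates it. The sketch does not show that real agents would implement such a filter reliably; it only makes the two search procedures easy to compare.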

Since “IF diamond”-style predicates do in fact “perfectly classify” the positive/negative approach/don’t-approach decision contexts...
This assumes that the human labellers (or automated labellers created by humans) have perfectly labelled the training examples.

Not crucial on my model.

I'm mostly used to thinking about this in the context of alignment with human values (or corrigibility etc), where it's very obvious that human labellers will make mistakes. In the case of diamonds, it is maybe plausible that we could get a dataset with zero incorrect labels, but that's still a pretty difficult problem if the dataset is to be reasonably large and diverse.

I'm imagining us watching the agent and seeing whether it approaches an object or not. Those are the "labels." I'm imagining this taking place between 50 and 1000 times. Before seeing this comment, I edited the post to add:

We probably also reinforce other kinds of cognition, but that’s OK in this story. Maybe we even give the agent some false positive reward because our hand slipped while the agent wasn't approaching a diamond, but that's fine as long as it doesn't happen too often. That kind of reward event will weakly reinforce some contingent non-diamond-centric cognition (like "IF near wall, THEN turn around"). In the end, we want an agent which has a powerful diamond-shard, but not necessarily an agent which only has a diamond-shard.

So, probably I shouldn't have written "perfectly", since that isn't actually load-bearing on my model. I think that there's a rather smooth relationship between "how good you are at labelling" and "the strength of desired value you get out" (with a few discontinuities at the low end, where perhaps a sufficiently weak shard ends up non-reflective, or not plugging into the planning API, or nonexistent at all). On that model, I don't really understand the following:

If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction. Once the AI has an internal representation of the actual labelling process, that proto-shard will be reinforced more than the proto-diamond shard, because it will match the label in cases where the diamond-concept doesn't (and the reverse will not happen, or at least will happen less often and only due to random noise).

The agent already has the diamond abstraction from SSL+IL, but not the labelling process (due to IID training, and it having never seen our "labelling" before—in the sense of us watching it approach the objects in real time). And this is very early in the RL training, at the very beginning. So why would the agent learn the labelling abstraction during the labelling and hook that in to decision-making, in the batch PG updates, instead of just hooking in the diamond abstraction it already has? (Edit: I discussed this a bit in this footnote.)

"Acquiring" things is a... tricky concept in an embedded setting, which I expect to require some specific environmental features. I very strongly doubt that rewarding an agent for just approaching diamonds (especially in the absence of other agents) would induce it, and even the lottery training's shard-impact depends on exactly how the diamond is "given" to the agent. [...]
Seems like the agent will mostly just want to be near lots of diamonds, and plausibly will not even consider the idea of creating more diamonds.

I agree that "diamond synthesis" is not directly rewarded, and if we wanted to ensure that happens, we could add that to the curriculum, as you note. But I think it would probably happen anyways, due to the expected-by-me "grabby" nature of the acquire-subshard. (Consider that I think it'd be cool to make dyson swarms, but I've never been rewarded for making dyson swarms.) So maybe the crux here is that I don't yet share your doubt of the acquisition-shard.

Note that all of these issues are much more obvious if we start from the standard heuristic that the trained agent will end up optimizing for whatever generated its reward, and then pay attention to how well-aligned that reward-generator is with whatever we actually want. You've been quite vocal about how that heuristic leads to some incorrect conclusions, but it does highlight real and important considerations which are easy to miss without it.

I think that "are we directly rewarding the behavior which we want the desired shards to exemplify?" is a reasonable heuristic. I think that "What happens if the agent optimizes its reward function?" is not a reasonable heuristic.

The agent already has the diamond abstraction from SSL+IL, but not the labelling process (due to IID training, and it having never seen our "labelling" before—in the sense of us watching it approach the objects in real time). And this is very early in the RL training, at the very beginning. So why would the agent learn the labelling abstraction during the labelling and hook that in to decision-making, in the batch PG updates, instead of just hooking in the diamond abstraction it already has? (Edit: I discussed this a bit in this footnote.)

I think there are a few different errors in this reasoning.

First: the agent probably has the concept of diamond from SSL+IL, but that's different from concepts like producing diamond, approaching diamond (which in turn requires a self-concept or at least a concept of the avatar it's controlling), etc. During training, those sorts of more-complex concepts are probably built up out of their components (like e.g. "production" and "diamond"); the actual goals or behaviors encoded in a shard have to be built up in whatever "internal language" the agent has from the SSL/IL training.

So the question isn't "does the agent have the concept of diamond/label?", the question is how short the relevant "sentences" are in terms of the concepts it has. Neither will be just one "word".

Second: as with Quintin's comment, the AI does not need to fully model the entire labelling process in order for this problem to apply. If there's any simple, predictable pattern to the humans' label-errors (which of course there usually is in practice), then the AI can pick that up. (It's not just a question of hand-slips; humans make systematic errors which will strongly activate shards very similar to the intended shards.)

So the question isn't "is the entire labelling process a short 'sentence' in the AI's internal language?" (though even that is not implausible), but rather "do any systematic errors in the labelling process have a short 'sentence' in the AI's internal language?".

Now put those two together. The intended shards are quite a bit more complicated than you suggested, because they don't just depend on the concept of "diamond", they depend on constructing a bunch of other concepts about what to do involving diamonds. And the unintended shards are quite a bit less complicated than you suggested, because they can exploit simple systematic errors in the labels.

I think I have a complaint like "You seem to be comparing to a 'perfect' reward function, and lamenting how we will deviate from that. But in the absence of inner/outer alignment, that doesn't make sense. A good reward schedule will put diamond-aligned cognition in the agent. It seems like, for you to be saying there's a 'fatal' flaw here due to 'errors', you need to make an argument about the cognition which trains into the agent, and how the AI's cognition-formation behaves differently in the presence of 'errors' compared to in the absence of 'errors.' And I don't presently see that story in your comments thus far. I don't understand what 'perfect labeling' is the thing to talk about, here, or why it would ensure your shard-formation counterarguments don't hold."

(Will come by for lunch and so we can probably have a higher-context discussion about this! :) )

I think I have a complaint like "You seem to be comparing to a 'perfect' reward function, and lamenting how we will deviate from that. But in the absence of inner/outer alignment, that doesn't make sense.

I think this is close to our most core crux.

It seems to me that there are a bunch of standard arguments which you are ignoring because they're formulated in an old frame that you're trying to avoid. And those arguments in fact carry over just fine to your new frame if you put a little effort into thinking about the translation, but you've instead thrown the baby out with the bathwater without actually trying to make the arguments work in your new frame.

Like, if I have a reward signal that rewards X, then the old frame would say "alright, so the agent will optimize for X". And you're like "nope, that whole form of argument is invalid, hit ignore button". But in fact it is usually very easy to take that argument and unpack it into something like "X has a short description in terms of natural abstractions, so starting from a base model and giving a feedback signal we should rapidly see some X-shards show up, and then the shards which best match X will be reinforced to exponentially higher weight (with respect to the bit-divergence between their proxy X' and the actual X)". And it seems like you are not even attempting to perform that translation, which I find very frustrating because I'm pretty sure you know this stuff plenty well to do it.
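One toy way to read the "exponentially higher weight" claim above, offered only as a sketch: suppose each candidate shard's proxy X' sits some number of bits away from the rewarded X, and its weight is proportional to 2 raised to minus that divergence. The function name, the base of 2, and the example divergences are all illustrative assumptions; the comment does not specify a functional form.

```python
def shard_weights(bit_divergences):
    """Normalized shard weights under the ASSUMED toy rule weight ∝ 2 ** (-divergence).

    bit_divergences: hypothetical bit-divergence of each shard's proxy X' from X.
    """
    raw = [2.0 ** (-d) for d in bit_divergences]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical shards whose proxies are 0, 1, and 5 bits away from X.
weights = shard_weights([0, 1, 5])
for divergence, weight in zip([0, 1, 5], weights):
    print(f"divergence={divergence} bits -> weight={weight:.3f}")
```

Under this assumed rule, each extra bit of divergence halves a shard's weight relative to the best-matching shard, so a few close-matching shards end up holding most of the total weight.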

It seems to me that there are a bunch of standard arguments which you are ignoring because they're formulated in an old frame that you're trying to avoid. And those arguments in fact carry over just fine to your new frame if you put a little effort into thinking about the translation, but you've instead thrown the baby out with the bathwater without actually trying to make the arguments work in your new frame.

I agree that we may need to be quite skillful in providing "good"/carefully considered reward signals on the data distribution actually fed to the AI. (I also think it's possible we have substantial degrees of freedom there.) In this sense, we might need to give "robustly" good feedback.

However, one intuition which I hadn't properly communicated was: to make OP's story go well, we don't need e.g. an outer objective which robustly grades every plan or sequence of events the AI could imagine, such that optimizing that objective globally produces good results. This isn't just good reward signals on data distribution (e.g. real vs fake diamonds), this is non-upwards-error reward signals in all AI-imaginable situations, which seems thoroughly doomed to me. And this story avoids at least that problem, which I am relieved by. (And my current guess is that this "robust grading" problem doesn't just reappear elsewhere, although I think there are still a range of other difficult problems remaining. See also my post Alignment allows "nonrobust" decision-influences and doesn't require robust grading.)

And so I might have been saying "Hey isn't this cool we can avoid the worst parts of Goodhart by exiting outer/inner as a frame" while thinking of the above intuition (but not communicating it explicitly, because I didn't have that sufficient clarity as yet). But maybe you reacted "??? how does this avoid the need to reliably grade on-distribution situations, it's totally nontrivial to do that and it seems quite probable that we have to." Both seem true to me!

(I'm not saying this was the whole of our disagreement, but it seems like a relevant guess.)

When I first read this comment, I incorrectly understood it to say something like "If you were actually trying, you'd have generated the exponential error model on your own; the fact that you didn't shows that you aren't properly thinking about old arguments." I now don't think that's what you meant. I think I finally^[1] understand what you did mean, and I think you misunderstood what my original comment was trying to say because I wrote poorly and stream-of-consciousness.

Most importantly, I wasn’t saying something like “‘errors’ can’t exist because outer/inner alignment isn’t my frame, ignore.” I meant to communicate the following points:

1. I don’t know what a “perfect” reward function is in the absence of outer alignment, else I would know how to solve diamond alignment. But I’m happy to just discuss deviations from a proposed labelling scheme. (This is probably what we were already discussing, so this wasn't meant to be a devastating rejoinder or anything.)
2. I’m not sure what you mean by the “exponential” model you mentioned elsewhere, or why it would be a fatal flaw if true. Please say more? (Hopefully in a way which makes it clear why your argument behaves differently in the presence of errors, because that would be one way to make your arguments especially legible to how I'm currently thinking about the situation.)
3. Given my best guess at your model (the exponential error model), I think your original comment seems too optimistic about my specific story (sure seems like exponential weighting would probably just break it, label errors or no) but too pessimistic about the story template (why is it a fatal flaw that can’t be fixed with a bit of additional thinking?).

I meant to ask something like “I don’t fully understand what you’re arguing re: points 1 and 2 (but I have some guesses), and think I disagree about 3; please clarify?” But instead (e.g. by saying things like "my complaint is...") I perhaps communicated something like “because I don’t understand 2 in my native frame, your argument sucks.” And you were like “Come on, you didn’t even try, you could have totally translated 2. Worrying that you apparently didn't.”

I think that I left an off-the-cuff comment which might have been appropriate as a Discord message (with real-time clarification), but not as a LessWrong comment. Oops.

Elaborating points 1 and 3 above:

Point 1. In outer/inner, if you "perfectly label" reward events based on whether the agent approaches the diamond, you're "done" as far as the outer alignment part goes. In order to make the agent actually care about approaching diamonds, we would then turn to inner alignment techniques / ideas. It might make sense to call this labelling "perfect" as far as specifying the outer objective for those scenarios (e.g. when that objective is optimized, the agent actually approaches the diamond).

But if we aren't aiming for outer/inner alignment, and instead are just considering the (reward schedule) -> (inner value composition) mapping, then I worry that my post's original usage of "perfect" was misleading. On my current frame, a perfect reward schedule would be one which actually gets diamond-values into the agent. The schedule I posted is probably not the best way to do that, even if all goes as planned. I want to be careful not to assume the "perfection" of "+1 when it does in fact approach a real diamond which it can see", even if I can't currently point to better alternative reward schedules (e.g. "+x reward in some weird situation"). (This is what I was getting at with "I don't understand what 'perfect labeling' is the thing to talk about, here.")

What you probably meant by "errors" was "divergences from the reward function outlined in the original post." This is totally reasonable and important to talk about, but at least I want to clarify for myself and other readers that this is what we're talking about, and not assuming that my intended reward function was actually "perfect." (Probably it's fine to keep talking about "perfect labelling" as long as this point has been made explicit.)

Point 3. Under my best guess of what you mean (which did end up being roughly right, about the exponential bit-divergence), I think your original comment seemed too optimistic about the original story going well given "perfect" labelling. This is one thing I meant by "I don't understand why 'perfect labeling' would ensure your shard-formation counterarguments don't hold."

If the situation value-distribution is actually exponential in bit-divergence, I'd expect way less wiggle room on value shard formation, because that's going to mean that way more situations are controlled by relatively few subshards (or maybe even just one). Possibly the agent just ends up with fewer terms in its reflectively stable utility function, because fewer shards/subshards activate during the values handshake. (But I'm tentative about all this, haven't sketched out a concrete failure scenario yet given exponential model! Just a hunch I remember having.)

Again, it was very silly of me to expect my original comment to communicate these points. At the time of writing, I was trying to unpack some promising-seeming feelings and elaborate them over lunch.

^[1]
My original guess at your complaint was "How could you possibly have not generated the exponential weight hypothesis on your own?", and I was like what the heck, it's a hypothesis, sure... but why should I have pinned down that one? What's wrong with my "linear in error proportion for that kind of situation, exponential in ontology-distance at time of update" hypothesis, why doesn't that count as a thing-to-have-generated? This was a big part of why I was initially so confused about your complaint.

And then several people said they thought your comment was importantly correct-seeming, and I was like "no way, how can everyone else already have such a developed opinion on exponential vs linear vs something-else here? Surely this is their first time considering the question? Why am I getting flak about not generating that particular hypothesis, how does that prove I'm 'not trying' in some important way?"

To be clear, I don't think the exponential asymptotics specifically are obvious (sorry for implying that), but I also don't think they're all that load-bearing here. I intended more to gesture at the general cluster of reasons to expect "reward for proxy, get an agent which cares about the proxy"; there's lots of different sets of conditions any of which would be sufficient for that result. Maybe we just train the agent for a long time with a wide variety of data. Maybe it turns out that SGD is surprisingly efficient, and usually finds a global optimum, so shards which don't perfectly fit the proxy die. Maybe the proxy is a more natural abstraction than the thing it was proxying for, and the dynamics between shards competing at decision-time are winner-take-all. Maybe dynamics between shards are winner-take-all for some other reason, and a shard which captures the proxy will always have at least a small selective advantage. Etc.

Point 3. Under my best guess of what you mean (which did end up being roughly right, about the exponential bit-divergence), I think your original comment seemed too optimistic about the original story going well given "perfect" labelling. This is one thing I meant by "I don't understand why 'perfect labeling' would ensure your shard-formation counterarguments don't hold."
If the situation value-distribution is actually exponential in bit-divergence, I'd expect way less wiggle room on value shard formation, because that's going to mean that way more situations are controlled by relatively few subshards (or maybe even just one). Possibly the agent just ends up with fewer terms in its reflectively stable utility function, because fewer shards/subshards activate during the values handshake.

It sounds like the difference between one or a few shards dominating each decision, vs a large ensemble, is very central and cruxy to you. And I still don't see why that matters, so maybe that's the main place to focus.

You gestured at some intuitions about that in this comment (which I'm copying below to avoid scrolling to different parts of the thread-tree), and I'd be interested to see more of those intuitions extracted.

I think there's something like "why are human values so 'reasonable', such that [TurnTrout inference alert!] someone can like coffee and another person won't and that doesn't mean they would extrapolate into bitter enemies until the end of Time?", and the answer seems like it's gonna be because they don't have one criterion of Perfect Value that is exactly right over which they argmax, but rather they do embedded, reflective heuristic search guided by thousands of subshards (shiny objects, diamonds, gems, bright objects, objects, power, seeing diamonds, knowing you're near a diamond, ...), such that removing a single subshard does not catastrophically exit the regime of Perfect Value.
I think this is one proto-intuition why Goodhart-arguments seem Wrong to me, like they're from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don't think I've properly communicated my feelings in this comment, but hopefully it's better than nothing)

I have multiple different disagreements with this, and I'm not sure which are relevant yet, so I'll briefly state a few:

For the coffee/bitter enemies thing, this doesn't seem to me like a phenomenon which has anything to do with shards, it's just a matter of type-signatures. A person who "likes coffee" likes to drink coffee; they don't particularly want to fill the universe with coffee, they don't particularly care whether anyone else likes to drink coffee (and nobody else cares whether they like to drink coffee) so there's not really much reason for that preference to generate conflict. It's not a disagreement over what-the-world-should-look-like; that's not the type-signature of the preference.
Embedded, reflective heuristic search is not incompatible with argmaxing over one (approximate, implicit) value function; it's just a particular family of distributed algorithms for argmaxing.
It seems like, in humans, removing a single subshard does catastrophically exit the regime of value. For instance, there's Eliezer's argument from the sequences that just removing boredom results in a dystopia.

It sounds like the difference between one or a few shards dominating each decision, vs a large ensemble, is very central and cruxy to you. And I still don't see why that matters, so maybe that's the main place to focus.

The extremely basic intuition is that all else equal, the more interests present at a bargaining table, the greater the chance that some of the interests are aligned.

My values are also risk-averse (I'd much rather take a 100% chance of 10% of the lightcone than a 20% chance of 100% of the lightcone), and my best guess is that internal values handshakes are ~linear in "shard strength" after some cutoff where the shards are at all reflectively endorsed (my avoid-spiders shard might not appreciably shape my final reflectively stable values). So more subshards seems like great news to me, all else equal, with more shard variety increasing the probability that part of the system is motivated the way I want it to be.
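The risk-aversion claim above can be checked with a small worked example. This is a sketch under assumptions: the square-root utility is a hypothetical stand-in for "risk-averse" and is not something the comment specifies. The lottery probabilities and lightcone fractions are the ones stated in the comment.

```python
import math


def expected_utility(lottery, utility):
    """Expected utility of a lottery given as (probability, lightcone_fraction) pairs."""
    return sum(p * utility(x) for p, x in lottery)


sure_thing = [(1.0, 0.10)]         # 100% chance of 10% of the lightcone
gamble = [(0.2, 1.0), (0.8, 0.0)]  # 20% chance of 100%, otherwise nothing


def linear(x):
    # Risk-neutral baseline: utility equals the fraction obtained.
    return x


# ASSUMPTION: sqrt is just one convenient concave (risk-averse) utility.
concave = math.sqrt

for name, u in [("linear", linear), ("sqrt", concave)]:
    print(name, expected_utility(sure_thing, u), expected_utility(gamble, u))
```

A risk-neutral evaluator prefers the gamble (0.2 versus 0.1 in expectation), while the concave utility prefers the sure 10%, which matches the stated preference.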

(This isn't fully expressing my intuition, here, but I figured I'd say at least a little something to your comment right now)

I'm not going to go into most of the rest now, but:

For the coffee/bitter enemies thing, this doesn't seem to me like a phenomenon which has anything to do with shards, it's just a matter of type-signatures. A person who "likes coffee" likes to drink coffee; they don't particularly want to fill the universe with coffee, they don't particularly care whether anyone else likes to drink coffee (and nobody else cares whether they like to drink coffee) so there's not really much reason for that preference to generate conflict. It's not a disagreement over what-the-world-should-look-like; that's not the type-signature of the preference.

I think that that does have to do with shards. Liking to drink coffee is the result of a shard, of a contextual influence on decision-making (the influence to drink coffee), and in particular activates in certain situations to pull me into a future in which I drank coffee.
I'm also fine considering "A person who is OK with other people drinking coffee" and anti-C: "a person with otherwise the same values but who isn't OK with other people drinking coffee." I think that the latter would inconvenience the former (to the extent that coffee was important to the former), but that they wouldn't become bitter enemies, that anti-C wouldn't kill the pro-coffee person because the value function was imperfectly aligned, that the pro-coffee person would still derive substantial value from that universe.
Possibly the anti-coffee value would even be squashed by the rest of anti-C's values, because the anti-coffee value wasn't reflectively endorsed by the rest of anti-C's values. That's another way in which I think anti-C can be "close enough" and things work out fine.

EDIT 2: The original comment was too harsh. I've struck the original below. Here is what I think I should have said:

I think you raise a valuable object-level point here, which I haven't yet made up my mind on. That said, I think this meta-level commentary is unpleasant and mostly wrong. I'd appreciate if you wouldn't speculate on my thought process like that, and would appreciate if you could edit the tone-relevant parts.

~~Warning: This comment, and~~ ~~your previous comment~~, violate my comment section guidelines: "Reign of terror // Be charitable." You have made and publicly stated a range of unnecessary, unkind, and untrue inferences about my thinking process. You have also made non-obvious-to-me claims of questionable-to-me truth value, which you also treat as exceedingly obvious. Please edit these two comments to conform to my civility guidelines.

~~(EDIT: Thanks. I look forward to resuming object-level discussion!)~~

After more reflection, I now think that this moderation comment was too harsh. First, the parts I think I should have done differently:

1. Realized that who reads commenting guidelines anyways, let alone expects them to be enforced?
2. Realized that it's probably ambiguous what counts as "charitable" or not, even though (illusion of transparency) it felt so obvious to me that this counted as "not that."
3. Realized that predictably I would later consider the incident to be less upsetting than in the moment, and that John may not have been aware that I find this kind of situation unusually upsetting.
4. Therefore, I should have said something like "I think you raise a valuable object-level point here, which I haven't yet made up my mind on. That said, I think this meta-level commentary is unpleasant and mostly wrong. I'd appreciate if you wouldn't speculate on my thought process like that, and would appreciate if you could edit the tone-relevant parts."

I'm striking the original warning, putting in (4), and I encourage John to unredact his comments (but that's up to him).

I've thought more about what my policy should be going forward. What kind of space do I want my comment section to be? First, I want to be able to say "This seems wrong, and here's why", and other people can say the same back to me, and one or more of us can end up at the truth faster. Second, it's also important that people know that, going forward, engaging with me in (what feels to them like) good-faith will not be randomly slapped with a moderation warning because they annoyed me.

Third, I want to feel comfortable in my interactions in my comment section. My current plan is:

1. If someone comments something which feels personally uncharitable to me (a rather rare occurrence, what with the hundreds of comments in the last year since this kind of situation last happened), then I'll privately message them, explain my guidelines, and ask that they tweak tone / write more on the object-level / not do the particular thing.^[1]
2. If necessary, I'll also write a soft-ask (like (4) above) as a comment.
3. In cases where this is just getting ignored and the person is being antagonistic, I will indeed post a starker warning and then possibly just delete comments.

^[1]
I had spoken with John privately before posting the warning comment. I think my main mistake was jumping to (3) instead of doing more of (1) and (2).

Oh, huh, I think this moderation action makes me substantially less likely to comment further on your posts, FWIW. It's currently well within your rights to do so, and I am on the margin excited about more people moderating things, but I feel hesitant participating with the current level of norm-specification + enforcement.

I also turned my strong-upvote into a small-upvote, since I have less trust in the comment section surfacing counterarguments, which feels particularly sad for this post (e.g. I was planning to respond to your comment with examples of past arguments this post is ignoring, but am now unlikely to do so).

Again, I think that's fine, but I think posts with idiosyncratic norm enforcement should get less exposure, or at least not be canonical references. Historically we've decided to not put posts on frontpage when they had particularly idiosyncratic norm enforcement. I think that's the wrong call here, but not confident.

Sorry, I'm confused; for my own education, can you explain why these civility guidelines aren't epistemically suicidal? Personally, I want people like John Wentworth to comment on my posts to tell me their inferences about my thinking process; moreover, controlling for quality, "unkind" inferences are better, because I learn more from people telling me what I'm doing wrong, than from people telling me what I'm already doing right. What am I missing? Please be unkind.

First: the agent probably has the concept of diamond from SSL+IL, but that's different from concepts like producing diamond, approaching diamond (which in turn requires a self-concept or at least a concept of the avatar it's controlling), etc. During training, those sorts of more-complex concepts are probably built up out of their components (like e.g. "production" and "diamond"); the actual goals or behaviors encoded in a shard have to be built up in whatever "internal language" the agent has from the SSL/IL training.
So the question isn't "does the agent have the concept of diamond/label?", the question is how short the relevant "sentences" are in terms of the concepts it has. Neither will be just one "word".

This is already my model and was intended as part of my communicated reasoning. Why do you think it's an error in my reasoning? You'll notice I argued "If diamond", and about hooking that diamond predicate into its approach-subroutines (learned via IL). (ETA: I don't think you need a self-model to approach a diamond, or to "value" that in the appropriate sense. To value diamonds being near you, you can have representations of the space nearby, so you need a nearby representation, perhaps.)

label-errors

I think this is not the right term to use, and I think it might be skewing your analysis. This is not a supervised learning regime with exact gradients towards a fixed label. The question is what gets upweighted by the batch PG gradients, batching over the reward events. Let me exaggerate the kind of "error rates" I think you're anticipating:

Suppose I hit the reward 99% of the time for cut gems, and 90% of the time for uncut gems.
- What's supposed to go wrong? The agent somewhat more strongly steers towards cut gems?
Suppose I'm grumpy for the first 5 minutes and only hit the reward button 95% as often as I should otherwise. What's supposed to happen next?

(If these errors aren't representative, can you please provide a concrete and plausible scenario?)

Let me exaggerate the kind of "error rates" I think you're anticipating:
Suppose I hit the reward 99% of the time for cut gems, and 90% of the time for uncut gems.
What's supposed to go wrong? The agent somewhat more strongly steers towards cut gems?
Suppose I'm grumpy for the first 5 minutes and only hit the reward button 95% as often as I should otherwise. What's supposed to happen next?
(If these errors aren't representative, can you please provide a concrete and plausible scenario?)

Both of these examples are focused on one error type: the agent does not receive a reward in a situation which we like. That error type is, in general, not very dangerous.

The error type which is dangerous is for an agent to receive a reward in a situation which we don't like. For instance, receiving reward in a situation involving a convincing-looking fake diamond. And then a shard which hooks up its behavior to things-which-look-like-diamonds (which is probably at least as natural an abstraction as diamond) gets more weight relative to the diamond-shard, and so when those two shards disagree later the things-which-look-like-diamonds shard wins.
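The asymmetry between the two error types can be illustrated with a deliberately crude bookkeeping sketch. Everything here is an assumption for illustration: shards are modelled as accumulating one unit of credit per rewarded event in which they were active, the diamond shard is taken to be active only on real diamonds, and the looks-like-diamond shard is taken to be active on both real diamonds and convincing fakes. Neither commenter commits to this additive model, and the event counts are made up.

```python
def expected_credit(n_real, n_fake, p_reward_real, p_reward_fake):
    """Expected reinforcement for two hypothetical shards under an ASSUMED additive model.

    n_real / n_fake: number of approach events involving real / convincing fake diamonds.
    p_reward_real / p_reward_fake: chance the labeller rewards each kind of event.
    """
    # Diamond shard: only active (and so only credited) on real diamonds.
    diamond = n_real * p_reward_real
    # Looks-like-diamond shard: active on real diamonds and on convincing fakes.
    lookalike = n_real * p_reward_real + n_fake * p_reward_fake
    return diamond, lookalike


# Missed rewards only (labeller skips 10% of real diamonds, never rewards fakes).
print(expected_credit(100, 10, 0.9, 0.0))
# Spurious rewards (labeller is fooled by half of the fakes).
print(expected_credit(100, 10, 1.0, 0.5))
```

In this toy model, missed rewards shrink both shards' credit equally, while rewards on fakes give the looks-like-diamond shard a strict edge. Whether that edge matters downstream depends on the shard dynamics under dispute in this thread.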

Note that it would not be at all surprising for the AI to have a prior concept of real-diamonds-or-fake-diamonds-which-are-good-enough-to-fool-most-humans, because that is a cluster of stuff which behaves similarly in many places in the real world - e.g. they're both used for similar jewelry.

And sure, you try to kinda patch that by including some correctly-labelled things-which-look-like-diamonds in training, but that only works insofar as they're sufficiently-obviously-not-diamond that the human labeller can tell (and depends on the ratio of correct to incorrect labels, etc).

(Also, some moderately uncharitable psychologizing, and I apologize if it's wrong: I find it suspicious that the examples of label errors you generated are both of the non-dangerous type. This is a place where I'd expect you to already have some intuition for what kind of errors are the dangerous ones, especially when you put on e.g. your Eliezer hat. That smells like a motivated search, or at least a failure to actually try to look for the problems with your argument.)

(Also, some moderately uncharitable psychologizing, and I apologize if it's wrong: I find it suspicious that the examples of label errors you generated are both of the non-dangerous type. This is a place where I'd expect you to already have some intuition for what kind of errors are the dangerous ones, especially when you put on e.g. your Eliezer hat. That smells like a motivated search, or at least a failure to actually try to look for the problems with your argument.)

I want to talk about several points related to this topic. I don't mean to claim that you were making points directly related to all of the below bullet points. This just seems like a good time to look back and assess and see what's going on for me internally, here. This seems like the obvious spot to leave the analysis.

At the time of writing, I wasn't particularly worried about the errors you brought up.
- I am a little more worried now in expectation, both under the currently low-credence worlds where I end up agreeing with your exponential argument, and in the ~linear hypothesis worlds, since I think I can still search harder for worrying examples which IMO neither of us have yet proposed. Therefore I'll just get a little more pessimistic immediately, in the latter case.
If I had been way more worried about "reward behavior we should have penalized", I would have indeed just been less likely to raise the more worrying failure points, but not super less likely. I do assess myself as flawed, here, but not as that flawed.
- I think the typical outcome would be something like "TurnTrout starts typing a list full of weak flaws, notices a twinge of motivated reasoning, has half a minute of internal struggle and then types out the more worrisome errors, and, after a little more internal conflict, says that John has a good point and that he wants to think about it more."
- I could definitely buy that I wouldn't be that virtuous, though, and that I would need a bit of external nudging to consider the errors, or else a few more days on my own for the issue to get raised to cognitive-housekeeping. After that happened a few times, I'd notice the overall problem and come up with a plan to fix it.
- Obviously, I have at this point noticed (at least) my counterfactual mistake in the nearby world where I already agreed with you, and therefore have a plan to fix and remove that flaw.
I think you are right in guessing that I could use more outer/inner heuristics to my advantage, that I am missing a few tools on my belt. Thanks for pointing that out.
I don't think that motivated cognition has caused me to catastrophically miss key considerations from e.g. "standard arguments" in a way which has predictably doomed key parts of my reasoning.
- Why I think this: I've spent a little while thinking about what the catastrophic error would be, conditional on it existing, and nothing's coming up for the moment.
  - I'd more expect there to be some sequence of slight ways I ignored important clues that other people gave, and where I motivatedly underupdated. But also this is a pretty general failure mode, and I think it'd be pretty silly to call a halt without any positive internal evidence that I actually have done this. (EDIT: In a specific situation which I remember and can correct, as opposed to having a vague sense that yeah I've probably done this several times in the last few months. I'll just keep an eye out.)
- Rather, I think that if I spend three or so days typing up a document, and someone like John Wentworth thinks carefully about it, then that person will surface at least a few considerations I'd missed, more probably using tools not native to my current frame.
  - I think a lot of the "Why didn't you realize the 'reward for proxy, get an agent which cares about the proxy'?" part is just that John and I just seem to have very different models of SGD dynamics, and that if I had his model, the reasoning which produced the post would have also produced the failure modes John has hypothesized.
  - This feels "fine" in that that's part of the point of sharing my ideas with other people—that smart people will surface new considerations or arguments. This feels "not fine" in the sense that I'd like to not miss considerations, of course.
  - This also feels "fine" in that, yes, I wanted to get this essay out before never arrives, and usually I take too long to hit "publish", and I'm still very happy with the essay overall. I'm fine with other people finding new considerations (e.g. the direct reward for diamond synthesis, or zooming in on how much perfect labelling is required).
- I think that if it turns out there was some crucial existing argument which I did miss, I think I'll go "huh" but not really be like "wow that hovered at the edge of my cognition but I denied it for motivated reasons."
I am way more worried about how much of my daily cognition is still socially motivated, and I do consider that to be a "stop drop and roll"-level fuckup on my part.
- I think there's not just now-obvious things here like "I get very defensive in public settings in specific situations", but a range of situations in which I subconsciously aim to persuade or justify my positions, instead of just explaining what I think and why, what I disagree with and why; that some subconscious parts of me look for ways to look good or win an argument; that I have rather low trust in certain ways and that makes it hard for me sometimes; etc.
- I think that I am above-average here, but I have very high standards for myself and consider my current skill in this area to be very inadequate.
For the record: I welcome well-meaning private feedback on what I might be biased about or messing up. On the other hand, having the feedback be public just pushes some of my buttons in a way which makes the situation hard for me to handle. I aspire for this not to be the case about me. That aspiration is not yet realized.
I've worked hard to make this analysis honest and not optimized to make me look good or less silly. Probably I've still failed at least a little. Possibly I've missed something important. But this is what I've got.

Kudos for writing all that out. Part of the reason I left that comment in the first place was because I thought "it's Turner, if he's actually motivatedly cognitating here he'll notice once it's pointed out". (And, corollary: since you have the skill to notice when you are motivatedly cognitating, I believe you if you say you aren't. For most people, I do not consider their claims about motivatedness of their own cognition to be much evidence one way or the other.) I do have a fairly high opinion of your skills in that department.

For the record: I welcome well-meaning private feedback on what I might be biased about or messing up.

Fair point, that part of my comment probably should have been private. Mea culpa for that.

This doesn't seem dangerous to me. So the agent values both, and there was an event which differentially strengthened the looks-like-diamond shard (assuming the agent could tell the difference at a visual remove, during training), but there are lots of other reward events, many of which won't really involve that shard (like video games where the agent collects diamonds, or text rpgs where the agent quests for lots of diamonds). (I'm not adding these now, I was imagining this kind of curriculum before, to be clear—see the "game" shard.)

So maybe there's a shard with predicates like "would be sensory-perceived by naive people to be a diamond" that gets hit by all of these, but I expect that shard to be relatively gradient starved and relatively complex in the requisite way -> not a very substantial update. Not sure why that's a big problem.

But I'll think more and see if I can't salvage your argument in some form.

some moderately uncharitable psychologizing

I found this annoying.

Not the OP but this jumped out at me:

If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction. Once the AI has an internal representation of the actual labelling process, that proto-shard will be reinforced more than the proto-diamond shard, because it will match the label in cases where the diamond-concept doesn't (and the reverse will not happen, or at least will happen less often and only due to random noise).
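To make the quoted claim concrete, here is a toy sketch of the counting argument only. It assumes two hypothetical predictors: one that tracks the true diamond concept and one that has a perfect model of the noisy labelling process. The names `agreement_rates` and `label_error`, and the 5% error rate, are illustrative assumptions and not from the post. It says nothing about SGD dynamics or about which representation is learned first, which is what the rest of this thread is about.

```python
import random


def agreement_rates(n_samples=10_000, label_error=0.05, seed=0):
    """Return (diamond_match, labeller_match): the fraction of samples on which
    each hypothetical predictor agrees with the noisy label."""
    rng = random.Random(seed)
    diamond_hits = 0
    labeller_hits = 0
    for _ in range(n_samples):
        is_diamond = rng.random() < 0.5        # ground truth for this sample
        flipped = rng.random() < label_error   # the labeller makes a mistake
        label = (not is_diamond) if flipped else is_diamond

        diamond_prediction = is_diamond        # tracks the intended abstraction
        labeller_prediction = label            # assumed perfect model of the labeller

        diamond_hits += diamond_prediction == label
        labeller_hits += labeller_prediction == label
    return diamond_hits / n_samples, labeller_hits / n_samples


if __name__ == "__main__":
    diamond_match, labeller_match = agreement_rates()
    print(f"diamond-concept agrees with label: {diamond_match:.3f}")
    print(f"label-process model agrees with label: {labeller_match:.3f}")
```

Under these assumptions the label-process predictor matches the label on every sample, while the diamond-concept predictor matches on roughly a `1 - label_error` fraction. With `label_error=0` the two are indistinguishable, which is the "perfect labels" case.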

This failure mode seems plausible to me, but I can think of a few different plausible sequences of events that might occur, which would lead to different outcomes, at least in the shard lens.

Sequence 1:

The agent develops diamond-shard
The agent develops an internal representation of the training process it is embedded in, including how labels are imperfectly assigned
The agent exploits the gaps between the diamond-concept and the label-process-concept, which reinforces the label-process-shard within it
The label-process-shard drives the agent to continue exploiting the above gap, eventually (and maybe rapidly) overtaking the diamond-shard
So the agent's values drift away from what we intended.

Sequence 2:

The agent develops diamond-shard
The diamond-shard becomes part of the agent's endorsed preferences (the goal-content it foresightedly plans to preserve)
The agent develops an internal representation of the training process it is embedded in, including how labels are imperfectly assigned
The agent understands that if it exploited the gaps between the diamond-concept and the label-process-concept, it would be reinforced into developing a label-process-shard that would go against its endorsed preference for diamonds (i.e. its diamond-shard), so it chooses not to exploit that gap, in order to avoid value drift.
So the agent continues to value diamonds in spite of the imperfect labeling process.

These different sequences of events would seem to lead to different conclusions about whether imperfections in the labeling process are fatal.

Yup, that's a valid argument. Though I'd expect that gradient hacking to the point of controlling the reinforcement on one's own shards is a very advanced capability with very weak reinforcement, and would therefore come much later in training than picking up on the actual labelling process (which seems simpler and has much more direct and strong reinforcement).

I expect some form of gradient hacking to be convergently learned much earlier than the details of the labeling process. Online SSL incentivizes the agent to model its own shard activations (so it can better predict future data) and the concept of human value drift ("addiction") is likely accessible from pretraining in the same way "diamond" is.

On the other hand, the agent has little information about the labeling process, I expect it to be more complicated, and not have the convergent benefits of predicting future behavior that reflectivity has.

(You could even argue human error is good here, if it correlates more strongly with the human "diamond" abstraction the agent has from pretraining. This probably doesn't extend to the "human values" case we care about, but I thought I'd mention it as an interesting thought.)

(agreed, for the record. I do think the agent can gradient starve the label-shard in story 2, though, without fancy reflective capability.)

Possibly. Though I think it is extremely easy in a context like this. Keeping the diamond-shard in the driver's seat mostly requires the agent to keep doing the things it was already doing (pursuing diamonds because it wants diamonds), rather than making radical changes to its policy.

This gets a lot of points for concreteness, regardless of how likely to work it is. Also, I updated towards shard theory plans working despite my models being different from shard theory, because this plan didn't seem to rely on claims I think are dodgy, e.g. internal game theory. Not too confident in this though because I haven't thought about this much.

This proposal looks really promising to me. This might be obvious to everyone, but I think much better interpretability research is really needed to make this possible in a safe(ish) way. (To verify the shard does develop, isn't misaligned, etc.) We'd just need to avoid the temptation to take the fancy introspection and interpretability tools this would require and use them as optimization targets, which would obviously make them useless as safeguards.

The story makes almost no reference to physical properties of diamonds (made of atoms...). I don't see why you can't replace "approach diamond" with "satisfy humans" and tell the same story. Maybe that's your hidden agenda?

😏^[1]

^[1] Although I don't expect the analogous human alignment story to go OK as written, even conditional on this story going through; we want a range of values from the AI, not just a single one. "Satisfy humans" would probably be bad as the only human-related shard.

Reminder to self: Always read the footnotes.

Note: Even if we have a smart agent which cares about diamonds and knows about value drift, it might "bend to temptation" and drift anyways. I have had several experiences where I thought "don't open this webpage, it will cause value drift in this kind of situation via an unendorsed reward event." Sometimes this thought works. Sometimes it doesn't.

Also, even if the AI can sandbox its future changes and inspect them, not all value drift events will be immediately apparent. For example, maybe the AI undergoes a batch update and the AI-prime would not pursue diamonds if it sees a red object (this is importantly unrealistic but I 70%-expect I could find a better example if I tried). The AI would be vulnerable to these errors if it doesn't have enough mechanistic self-interpretability (I expect it to have at least some). Of course, the AI would probably know about this failure mode and take precautions as well -- this just makes the AI's self-improvement job (at least) a bit harder.

The story sounds a lot like the steps parents take to raise a kid: First, you help it navigate and grab things, then you help it learn what things it can safely approach and which are dangerous. Next, you help it build autonomy by making its own plans while you make sure that it learns the right values.

I'm not sure that is intended or even halfway accurate but it matches what I keep saying: AI may need a caregiver.

Use a very large (future) multimodal self-supervised learned (SSL) initialization to give the AI a latent ontology for understanding the real world and important concepts. Combining this initialization with a recurrent state and an action head, train an embodied AI to do real-world robotics using imitation learning on human in-simulation datasets and then sim2real. Since we got a really good pretrained initialization, there's relatively low sample complexity for the imitation learning (IL). The SSL and IL datasets both contain above-average diamond-related content, with some IL trajectories involving humans navigating towards diamonds because the humans want the diamonds.

I don't know much about ML, and I'm a bit confused about this step. How worried are we/should we be about sample efficiency here? It sounds like after pre-training you're growing the diamond shard via a real-world embedded RL agent? Naively this would be pretty performance uncompetitive compared to agents primarily trained in simulated worlds, unless your algorithm is unusually sample efficient (why?). If you aren't performance competitive, then I expect your agent to be outcompeted by stronger AI systems with trainers that are less careful about diamond (or rubies, or staples, or w/e) alignment.

OTOH if your training is primarily simulated, I'd be worried about the difficulty of creating an agent that terminally values real world (rather than simulated) diamonds.

Good question, which I should probably have clarified in the essay. On a similar compute budget, could e.g. an actor-critic in-sim approach reach superintelligence even more quickly? Yeah, probably. The point of this story isn't that this (i.e. SSL+IL+PG RL) is the optimal alignment configuration along (competitiveness, alignability-to-diamonds), but rather I claim that if this story goes through at all, it throws a rock through how we should be thinking about alignment; if this story goes through, one of the simplest, "dumbest", most quickly dismissed ideas (reward agent for good event) can work just fine to superhuman and beyond, in a predictable-to-us way which we can learn more about by looking at current ML.

It would be interesting to see if a similar approach can be applied to the strawberries problem (I haven't personally thought about this).

In this shortform, I explain my main confusion with this alignment proposal. The main thing that's unclear to me: what's the idea here for how the agent remains motivated by diamonds even while doing very non-diamond related things like "solving mazes" that are required for general intelligence?
More details in the shortform itself.

I think that was supposed to be answered by this line:

After each task completion, the agent gets to be near some diamond and receives reward.
