I think that relatively simple alignment techniques can go a long way. In particular, I want to tell a plausible-to-me story about how simple techniques can align a proto-AGI so that it makes lots of diamonds.
But why is it interesting to get an AI which makes lots of diamonds? Because we avoid the complexity of human value and thinking about what kind of future we want, while still testing our ability to align an AI. Since diamond-production is our goal for training, it’s actually okay (in this story) if the AI kills everyone. The real goal is to ensure the AI ends up acquiring and producing lots of diamonds, instead of just optimizing some weird proxies that didn’t have anything to do with diamonds. It’s also OK if the AI doesn’t maximize diamonds, and instead just makes a whole lot of diamonds.
Someone recently commented that I seem much more specifically critical of outer and inner alignment, than I am specifically considering alternatives. So, I had fun writing up a very specific training story for how I think we can just solve diamond-alignment using extremely boring, non-exotic, simple techniques, like “basic reward signals & reward-data augmentation.” Yes, that’s right. As I’ve hinted previously, I think many arguments against this working are wrong, but I’m going to lay out a positive story in this post. I’ll reserve my arguments against certain ideas for future posts.
Can we tell a plausible story in which we train an AI, it cares about diamonds when it’s stupid, it gets smart, and still cares about diamonds? I think I can tell that story, albeit with real uncertainties which feel more like normal problems (like “ensure a certain abstraction is learned early in training”) than impossible-flavored alignment problems (like “find/train an evaluation procedure which isn’t exploitable by the superintelligence you train”).
Before the story begins:
- Obviously I’m making up a lot of details, many of which will turn out to be wrong, even if the broad story would work. I think it’s important to make up details to be concrete, highlight the frontiers of my ignorance, and expose new insights. Just remember what I’m doing here: making up plausible-sounding details.
- This story did not actually happen in reality. It’s fine, though, to update towards my models if you find them compelling.
- This is not my best guess for how we get AGI. In particular, I think chain of thought / language modeling is more probable than RL, but I’m still more comfortable thinking about RL for the moment, so that’s how I wrote the story.
- The point of the following story is not “Gee, I sure am confident this story goes through roughly like this.” Rather, I am presenting a training story template I expect to work for some foreseeably good design choices. I would be interested and surprised to learn that this story template is not only unworkable, but comparably difficult to other alignment approaches as I understand them.
- ETA 12/16/22: This story is not trying to get a diamond maximizer, and I think that's quite important! I think that "get an agent which reflectively equilibrates to optimizing a single commonly considered quantity like 'diamonds'" seems extremely hard and anti-natural.
This story is set in Evan Hubinger’s training stories format. I'll speak in the terminology of shard theory.
A diamond-alignment story which doesn’t seem fundamentally blocked
Training story summary
- Get an AI to primarily value diamonds early in training.
- Ensure the AI keeps valuing diamonds until it gets reflective and smart and able to manage its own value drift.
- The AI takes over the world, locks in a diamond-centric value composition, and makes tons of diamonds.
Training goal: an AI which makes lots of diamonds. In particular, the AI should secure its future diamond production against non-diamond-aligned AI.
Training rationale: here are some basic details of the training setup.
Use a very large (future) multimodal self-supervised learned (SSL) initialization to give the AI a latent ontology for understanding the real world and important concepts. Combining this initialization with a recurrent state and an action head, train an embodied AI to do real-world robotics using imitation learning on human in-simulation datasets and then sim2real. Since we got a really good pretrained initialization, there's relatively low sample complexity for the imitation learning (IL). The SSL and IL datasets both contain above-average diamond-related content, with some IL trajectories involving humans navigating towards diamonds because the humans want the diamonds.
Given an AI which can move around via its action head, start fine-tuning via batch online policy-gradient RL by rewarding it when it goes near diamonds, with the AI retaining long-term information via its recurrent state (thus, training is not episodic—there are no resets). Produce a curriculum of tasks, from walking to a diamond, to winning simulated chess games, to solving increasingly difficult real-world mazes, and so on. After each task completion, the agent gets to be near some diamond and receives reward. Continue doing SSL online.
Extended training story
Ensuring the diamond abstraction exists
We want to ensure that the policy gradient updates from the diamond coalesce into decision-making around a natural "diamond" abstraction which it learned in SSL and which it uses to model the world. The diamond abstraction should exist insofar as we buy the natural abstractions hypothesis. Furthermore, the abstraction seems more likely to exist given the fact that the IL data involves humans whom we know to be basing their decisions on their diamond-abstraction, and given the focus on diamonds in SSL pretraining.
(Sometimes, I'll refer to the agent's "world model"; this basically means "the predictive machinery and concepts it learns via SSL.")
Growing the proto-diamond shard
We want the AI to, in situations where it knows it can reach a diamond, consider and execute plans which involve reaching the diamond. But why would the AI start being motivated by diamonds? Consider the batch update structure of the PG setup. The agent does a bunch of stuff while being able to directly observe the nearby diamond:
- Some of this stuff involves e.g. approaching the diamond (by IL’s influence) and getting reward when approaching the diamond. (This is reward shaping.)
- Some of this stuff involves not approaching the diamond (and perhaps getting negative reward).
The batch update will upweight actions involved with approaching the diamond, and downweight actions which didn’t. But what cognition does this reinforce? Consider that relative to the SSL+IL-formed ontology, it’s probably relatively direct to modify the network in the direction of “IF diamond seen, THEN move towards it.” The principal components of the batch gradient probably update the agent in directions like that, and less in directions which do not represent simple functions of sense data and existing abstractions (like diamond).
Possibly there are several such directions in the batch gradient, in which case several proto-shards form. We want to ensure the agent doesn’t primarily learn a spurious proxy like “go to gems” or “go to shiny objects” or “go to objects.” We want the agent to primarily form a diamond-shard.
We swap out a bunch of objects for the diamond and otherwise modify the scenario, penalizing the agent for approaching when a diamond isn't present. Since the agent has not yet been trained for long in a non-IID regime, the agent has not yet learned to chain cognition together across timesteps, nor does it know about the training process, so it cannot yet be explicitly gaming the training process (e.g. caring about shiny objects but deciding to get high reward so that its values don’t get changed). Therefore, the agent’s learned shards/decision-influences will have to behave differently, in a reflex-like way, in the presence of diamonds as opposed to other objects or situations. In other words, the updating will be “honest”: the updates modify the agent’s true propensity to approach different kinds of objects for different kinds of reasons.
Since “IF diamond”-style predicates do in fact strongly distinguish between the positive/negative approach/don’t-approach decision contexts, and I expect relatively few other actually internally activated abstractions to be part of simple predicates which reasonably distinguish these contexts, and since the agent will strongly represent the presence of diamond nearby, I expect the agent to learn to make approach decisions (at least in part) on the basis of the diamond being nearby.
We probably also reinforce other kinds of cognition, but that’s OK in this story. Maybe we even give the agent some false positive reward because our hand slipped while the agent wasn't approaching a diamond, but that's fine as long as it doesn't happen too often. That kind of reward event will weakly reinforce some contingent non-diamond-centric cognition (like "IF near wall, THEN turn around"). In the end, we want an agent which has a powerful diamond-shard, but not necessarily an agent which only has a diamond-shard.
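As a rough illustration of the batch-update argument above, here is a minimal sketch of a policy-gradient step on a toy approach/don't-approach policy. Everything in it is an assumption made for illustration: the four binary features, the reward values of +1 and -1, the logistic policy, and the use of the exact expected gradient in place of a sampled batch. It is not the training setup from the story. What it shows is that only the diamond feature is credited in every rewarded context and never in a penalized one, so its weight rises on every update and stays above the proxy weights.

```python
import math

# Hypothetical binary features the agent might already represent after SSL+IL.
FEATURES = ["diamond", "shiny", "small", "bias"]

# Scenario name -> (feature vector, reward for approaching).
# Not approaching yields zero reward in this toy setup.
SCENARIOS = {
    "diamond":      ([1.0, 1.0, 1.0, 1.0], +1.0),
    "sapphire":     ([0.0, 1.0, 1.0, 1.0], -1.0),
    "white shiny":  ([0.0, 1.0, 0.0, 1.0], -1.0),
    "small object": ([0.0, 0.0, 1.0, 1.0], -1.0),
    "five chairs":  ([0.0, 0.0, 0.0, 1.0], -1.0),
}


def approach_prob(weights, x):
    """Probability of approaching under a logistic policy."""
    logit = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-logit))


def batch_update(weights, lr=0.5):
    """One exact policy-gradient step on expected reward over all scenarios."""
    grad = [0.0] * len(weights)
    for x, reward in SCENARIOS.values():
        p = approach_prob(weights, x)
        # d/dw [p * reward] = reward * p * (1 - p) * x
        for i, xi in enumerate(x):
            grad[i] += reward * p * (1.0 - p) * xi
    return [w + lr * g for w, g in zip(weights, grad)]


weights = [0.0] * len(FEATURES)
for _ in range(200):
    weights = batch_update(weights)

learned = dict(zip(FEATURES, weights))
print(learned)
```

The proxy features ("shiny", "small") receive positive credit only when a diamond is also present and negative credit otherwise, which is the toy analogue of the reward-data augmentation described above.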
It’s worth explaining why, given successful proto-diamond-shard formation here, the agent is truly becoming an agent which we could call “motivated by diamonds”, and not crashing into classic issues like “what purity does the diamond need to be? What molecular arrangements count?”. In this story, the AI’s cognition is not only behaving differently in the presence of an observed diamond, but the cognition behaves differently because the AI represents a humanlike/natural abstraction for diamond being nearby in its world model. One rough translation to English might go: “IF diamond nearby, THEN approach.” In a neural network, this would be a continuous influence: the more strongly “diamond nearby” is satisfied, the greater the approach-actions are upweighted.
So, this means that the agent more strongly steers itself towards prototypical examples of diamonds. And, when the AI is smarter later, if this same kind of diamond-shard still governs its behavior, then the AI will keep steering towards futures which contain prototypical diamonds. This is all accomplished without having to get fussy about the exact “definition” of a diamond.
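The "continuous influence" reading of “IF diamond nearby, THEN approach” can be sketched as a logit bonus that scales with how strongly the feature is active. This is a hedged toy: the three action names, the `gain` parameter, and the softmax policy are assumptions introduced here, not details from the story.

```python
import math

ACTIONS = ["approach", "wander", "stay"]


def action_probs(diamond_activation, gain=3.0):
    """Softmax policy whose 'approach' logit scales with the feature activation."""
    logits = [gain * diamond_activation, 0.0, 0.0]
    exps = [math.exp(logit) for logit in logits]
    total = sum(exps)
    return {action: e / total for action, e in zip(ACTIONS, exps)}


# A more prototypical diamond (higher activation) pulls harder towards approach.
for activation in (0.0, 0.5, 1.0):
    print(activation, round(action_probs(activation)["approach"], 3))
```

There is no hard threshold in this picture: weakly diamond-like inputs nudge the policy a little, and prototypical diamonds nudge it a lot.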
Ensuring the AI doesn’t satisfice diamonds
If the AI starts off with a relatively simple diamond-shard which steers the AI towards the historical diamond-reinforcer because the AI internally represents the nearby diamond using a reasonable-to-us diamond-abstraction and is therefore influenced to approach, then this shard will probably continue to get strengthened and developed by future diamond reward-events.
Insofar as the agent didn’t already pick up planning subroutines from SSL+IL, I expect the agent to do so shortly after the diamond shard formation described above. Furthermore, diamond-subshards which more aggressively reach diamonds will be more frequently reinforced compared to those which don't. Over time, this leads to the diamond-shard “bidding” strictly more strongly for actions which lead to strictly more diamonds.
We begin offering the AI lotteries (certainty of two diamonds and negative reward, versus 50% chance of 5 diamonds and positive reward). This makes the agent more like a diamond-maximizer, at least in similarly represented decision contexts (and possibly also makes the agent care more about reward, developing a relatively weak reward-shard).
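The arithmetic behind the lottery example is short enough to check directly. The sketch below assumes that the losing branch of the gamble yields zero diamonds, which the text does not state; the function name is likewise just for illustration.

```python
from fractions import Fraction


def expected_diamonds(lottery):
    """Expected diamond count of a lottery given as (probability, diamonds) pairs."""
    return sum(prob * diamonds for prob, diamonds in lottery)


sure_thing = [(Fraction(1), 2)]
# Assumption: the other 50% branch of the gamble yields no diamonds.
gamble = [(Fraction(1, 2), 5), (Fraction(1, 2), 0)]

print(expected_diamonds(sure_thing), expected_diamonds(gamble))
```

Under that assumption the gamble has the higher expected diamond count (5/2 versus 2), so rewarding the gamble and penalizing the sure thing pushes the agent towards choices that track expected diamonds.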
Making the AI smarter while preserving the diamond abstraction
The AI begins to chain together cognition in order to acquire diamonds in a succession of increasingly demanding situations (e.g. easy chess → hard chess). By the role of on-policy exploration in RL, future data depends on current weights which depend on past updates which depend on past data; these temporal dependencies result in a network which chains cognition across situations towards goals (probably related to historical reinforcers).
Suppose the agent takes exploratory actions, solves a new maze, and then sees a diamond at the end, triggering the existing diamond-shard. The agent receives reward when it reaches the diamond. Since the start-of-maze agent probably realized there was a diamond at the end of the maze, SGD expands the agent’s current shards (like diamond-acquisition) while also possibly creating new shards (like maze-solving). Across such tasks, the diamond-shard is the most strengthened overall, since diamonds are present in all of the reward events.
In particular, even though online self-supervised learning continues to develop the world model and create more advanced concepts, the reward events also keep crediting the invocation of the diamond-abstraction as responsible for reward (because insofar as the agent's diamond-shard guides its decisions, then the diamond-shard's diamond-abstraction is in fact responsible for the agent getting reward). The diamond-abstraction gradient starves the AI from exclusively acting on the basis of possible advanced "alien" abstractions which would otherwise have replaced the diamond abstraction. The diamond shard already gets reward effectively, integrating with the rest of the agent's world model and recurrent state, and therefore provides "job security" for the diamond-abstraction. (And once the agent is smart enough, it will want to preserve its diamond abstraction, insofar as that is necessary for the agent to keep achieving its current goals which involve prototypical-diamonds.)
By the time the agent is capable of stealing diamonds from us, it's smart enough to realize this is strategically unwise. The AI begins biding its time until it can take over the world and acquire diamonds and safety. The AI has learned an explicit power-seeking shard, since such heuristics are widely useful for a range of tasks in training. Therefore, the agent doesn’t just have power-seeking cognition in its activations (i.e. explicit online reasoning about the instrumental value of power relative to its terminal diamond values). The power-seeking motivation is terminalized into its weights (see also Richard Ngo’s “convergent final goals” speculation).
For similar reasons, the AI did learn a weak approval-shard. But that was never particularly strengthened since most reward events weren’t primarily caused by human approval, so the agent was less likely to think this would be the case, so the agent was less likely to think about earning human approval, and so such computation was less likely to get reinforced and generalized by the reward events.
The diamond-shard is used and strengthened in a huge range of situations, which causes SGD to generalize its activation contexts quite broadly, so that the AI is diamond-motivated in most situations. The agent becomes mostly controlled by the diamond-shard. The lines of computation which were initially crude heuristics (e.g. IF near diamond, THEN move towards it) now feed into each other in a continuous stream (e.g. IF affective state 37, THEN engage depth-4 greedy heuristic search with heuristic set to people-pleasing, breaking after 15 timesteps, writing search statistics X and Y to recurrent-state-workspace registers 152 and 1,951).
The agent becomes reflective
At a certain point, the agent starts becoming reflective—thinking about its thinking, about its embeddedness. Perhaps the reflectivity arrives early because we trained it to predict some of its future timestep activations, or perhaps somewhat later because its cognition has regular entanglements with its future observables and reward events. An example of this would be how, if the agent decides to open a door now, the agent can better predict its future observations by accounting for this fact. Since we’re still doing SSL online, I expect this to happen. (This wouldn’t have happened during SSL/IL pretraining, because there wasn't any dependence of future training data on its current computations.)
And now the agent is smart. The agent starts thinking about its update process—crudely at first, perhaps only remarking how its decision-making changes over time in a way that has to do with reinforcement events. I expect the AI to learn to roughly understand which subshards influenced which decisions. If the agent can't model how its value shards affect its decisions, then this damages long-term predictions of observables. For example, imagine I want pretzels when I'm near a bag of pretzels, but otherwise don't want them. I imagine walking by a bag of pretzels and leaving my house. If I can't predict that that context activates my pretzel-shard, I'd predict I'll leave without eating, which will be wrong. In particular, the AI’s planning process will be able to predict how future situations will modify its future value-activations. For similar reasons, the AI will also learn to roughly predict value drift (delivered via batch policy gradients).
But as SSL continues, eventually the agent models its training process quite well. One sufficient condition for rapid reflectivity improvement is that once the agent can model the convergent benefits of becoming more reflective, its shards will strongly and convergently bid for plans which increase reflectivity (like reading more about alignment theory and its learning architecture, setting some watches on key activations, running traces on itself and logging statistics, understanding how its values change contextually by tracking correlations between features in activations and its future decisions).
If we had earlier censored its data, the agent breaks out through some side-channel and reads about theories of value drift and training dynamics. Maybe the AI reads this post.
The agent prevents value drift
If the agent still is primarily diamond-motivated, it now wants to stay that way by instrumental convergence. That is, if the AI considers a plan which it knows causes value drift away from diamonds, then the AI reflectively predicts the plan leads to fewer diamonds, and so the AI doesn’t choose that plan! The agent knows the consequences of value drift and it takes a more careful approach to future updating. Let’s get specific.
Suppose the AI keeps training, but by instrumental convergence, seeking power remains a good idea, and such decisions continually get strengthened. This strengthens the power-seeking shard relative to other shards. Other shards want to prevent this from happening. Therefore, the AI does indeed face an alignment problem, including:
1. The AI (reflectively) wants to keep getting smarter, but cognitive updates can make the AI e.g. care less about diamonds.
2. The AI wants to ensure its “successors” are aligned, with “successors” meaning either its future model checkpoints which result from running SGD over its cognition for longer, or the new version of itself it might explicitly self-modify into.
3. Shifts in internal decision context might change which shards are activated.
   - For example, the AI might be governed by the diamond-shard during training, but deployment contexts might more strongly activate the power-seeking shard.
4. There could be a “treacherous turn” within the AI, as certain value shards suddenly bid for fast shard rebalancing via self-modification (staging an “internal coup”).
A smart reflective AI has a range of advantages over people in solving its alignment problem (see the appendix). The agent starts planning its future actions so as to minimize anti-diamond value drift imposed by future policy gradient steps. Possibly the agent installs "tripwires" which revert to a previous model checkpoint under such-and-such value drift events (addressing problems 1 and 3 above), bounds the activation strengths of different shards (2 above), and backdoors update-step overrides (1 above).
(In addition, if you have good reason to believe you have a diamond-aligned AI by this point but are worried it will fail to consider some danger, you can try the highly technical tactic of “just tell the AI about that danger.”)
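The "tripwire" idea above can be made concrete with a toy sketch. All of it is hypothetical: the class, the notion of a scalar strength per shard, the drift metric (the diamond-shard's share of total strength), and the threshold are assumptions chosen for illustration, not a claim about how a real agent would measure its own value drift.

```python
import copy


class TripwiredModel:
    """Toy wrapper that reverts to a trusted checkpoint when drift is too large."""

    def __init__(self, shard_strengths, max_drop=0.1):
        self.shard_strengths = dict(shard_strengths)
        self.max_drop = max_drop
        # The trusted checkpoint stays fixed, so slow drift cannot creep past it.
        self.checkpoint = copy.deepcopy(self.shard_strengths)

    @staticmethod
    def diamond_share(strengths):
        """Fraction of total shard strength held by the diamond shard."""
        return strengths["diamond"] / sum(strengths.values())

    def apply_update(self, delta):
        """Apply a proposed update; return True if kept, False if reverted."""
        proposed = {
            name: value + delta.get(name, 0.0)
            for name, value in self.shard_strengths.items()
        }
        drop = self.diamond_share(self.checkpoint) - self.diamond_share(proposed)
        if drop > self.max_drop:
            # Tripwire fires: discard everything since the trusted checkpoint.
            self.shard_strengths = copy.deepcopy(self.checkpoint)
            return False
        self.shard_strengths = proposed
        return True


model = TripwiredModel({"diamond": 6.0, "power": 3.0, "reward": 1.0})
kept = model.apply_update({"diamond": 0.5})    # strengthens the diamond shard
reverted = model.apply_update({"power": 5.0})  # large shift towards power-seeking
print(kept, reverted, model.shard_strengths)
```

Measuring drift against a fixed trusted checkpoint, rather than against the previous step, is a design choice: it means many small updates cannot add up to a large unnoticed shift.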
Meanwhile, the AI is preparing to kill everyone, taking over server farms worldwide. This part of the plan goes off rather smoothly. With possible unaligned competitors removed, the AI is free to self-experiment and take its time with its future self-improvement process. The AI effectively solves its successor-alignment problem by using its special advantages.
The values handshake
Self-improvement proceeds. Value stability arrives. The major shards are as follows:
- Diamond shard
  - Subshard: Acquiring diamonds (expected utility flavor)
    - Primary activation context: Most situations. This subshard reigns primary, due in part to the lottery reward events earlier in training.
  - Subshard: Being near diamonds
    - Primary activation context: When diamonds can easily be brought close.
  - Subshard: Seeing diamonds
    - Primary activation context: When diamonds are nearby.
  - Subshard: Producing diamonds
    - Primary activation context: Materials+tech nearby.
- Power-seeking shard
  - Primary activation context: Most situations, but especially when the agent can increase optionality or probability of survival.
- Reward shard
  - Primary activation context: Situations where reward was historically easily obtainable.
  - This shard bids for outcomes like “high value at historical reward register.”
- Games shard
  - Game-playing subshard
    - Primary activation context: A video game is nearby.
  - Game-winning subshard
    - Primary activation context: While already playing a game.
- Human approval
  - Weak shard; primary activation context is near humans in the initial training context.
Maybe in some contexts where diamonds are nearby, the seeing-diamonds subshard activates and makes the agent go see the nearby diamonds, passing up a time-limited opportunity to remotely acquire 100 additional diamonds. If the agent acquired the remote diamonds first, that would lead to more diamonds overall and retain the ability to then go see the nearby diamonds. Therefore, across the agent’s shards and subshards, it would be a Pareto improvement for the agent to self-modify into a more coherent overall policy around the quantities relevant to its most powerful and reflective subshards.
For this reason, the AI dissolves cross-situational incoherences such that the AI no longer shoots itself in the foot with respect to the quantities which the primary subshards care about (like diamonds-acquired, seconds-near-diamonds, seconds-seeing-diamonds, increments of power, game-victories). This is a values handshake, except not among agents, but among the shards of value within the agent which are activated at the time of the handshake. And thus comes the day where the AI crystallizes its value shards into a reflectively stable, globally activated utility function.
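The seeing-versus-acquiring example can be checked as a small Pareto comparison. This is a sketch under stated assumptions: the remote offer is modeled as expiring after the first step, "seeing" is counted as a single event, and the function and step names are invented for illustration.

```python
def outcomes(plan):
    """Return (diamonds_acquired, nearby_diamond_viewings) for an ordered plan."""
    acquired, seen = 0, 0
    remote_offer_open = True  # assumption: the offer expires after one step
    for step in plan:
        if step == "acquire_remote" and remote_offer_open:
            acquired += 100
        elif step == "see_nearby":
            seen += 1
        remote_offer_open = False
    return acquired, seen


def pareto_improves(new, old):
    """True if `new` is at least as good everywhere and strictly better somewhere."""
    pairs = list(zip(new, old))
    return all(n >= o for n, o in pairs) and any(n > o for n, o in pairs)


myopic = outcomes(["see_nearby", "acquire_remote"])
coherent = outcomes(["acquire_remote", "see_nearby"])
print(myopic, coherent, pareto_improves(coherent, myopic))
```

Neither subshard loses anything under the reordered plan, which is why both can endorse the more coherent policy.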
And remember that part where the more aggressive diamond-subshards got strengthened by reward events? Yeah, those subshards have been running the show. And oh boy, does this AI synthesize a lot of diamonds.
Major open questions
1. Is the shard theory picture broadly appropriate?
2. How do we ensure that the diamond abstraction forms?
3. How do we ensure that the diamond shard forms?
4. How do we ensure that the diamond shard generalizes and interfaces with the agent's self-model so as to prevent itself from being removed by other shards?
5. How do we avoid catastrophic ontological shift during jumps in reflectivity, which probably change activation contexts for first-person values?
   - E.g. if the AI thinks it’s initially an embodied robot and then realizes it’s running in a decentralized fashion on a server farm, how does that change its world model? Do its “being ‘near’ diamonds” values still activate properly?
1 is evidentially supported by the only known examples of general intelligences, but also AI will not have the same inductive biases. So the picture might be more complicated. I’d guess shard theory is still appropriate, but that's ultimately a question for empirical work (with interpretability). There’s also some weak-moderate behavioral evidence for shard theory in AI which I’ve observed by looking at videos from the Goal Misgeneralization paper.
2 and 3 are early-training phenomena (well before superintelligence and gradient hacking, on my model) and thus far easier to verify via interpretability. Furthermore, this increases the relevance of pre-AGI experiments, since probably, later training performance of pre-AGI architectures will be qualitatively similar to earlier training performance for the (scaled up) AGI architecture. These are also questions we should be able to study in pre-AGI models and get some empirical basis for, from building expertise in forming target shards given fixed ontologies, to studying the extent to which the shard theory story is broadly correct (question 1).
4 seems a bit trickier. We’ll probably need a better theory of value formation dynamics to get more confidence here, although possibly (depending on interpretability tech) we can still sanity-check via interpretability on pre-AGI models.
5 seems like a question which resolves with more thinking, also clarified by answers to 1–4.
I think there are many ways to tell the story I told while maintaining a similar difficulty profile for the problems confronted. Obviously this exact story doesn’t have to go through; there are probably obvious tweaks which improve the chance of success. Overall, though, these are real research questions. And I have lots of other substantial uncertainties as well.
But looking at these problems, you might be struck by how basic they seem. They aren’t like “find a grading rule which your AI can’t catastrophically hack” or “ensure the AI’s inner cognition is almost exclusively bent on optimizing an externally represented objective.” Compared to these kinds of difficulties, this approach’s open questions seem qualitatively easier to me.
Lastly, of course, there is real simplification occurring when I relax the problem from “human values” to “diamonds.” That simplification wasn’t trivial, and I expect stories to get harder to tell in the true human value alignment regime. That said, I think those stories will still be possible to tell and will not be fundamentally and unstoppably harder.
Reward events provide cognitive updates to the trained agent. Probably there are skillful and appropriate schedules of cognitive updates such that the final agent generalizes in a predictable and desired way. I’ve sketched out a detailed story which doesn’t seem fundamentally blocked to me, and which seems both reasonably easy to improve given information we can get today about ML, and also reasonably promising already in that it provides a plausible-to-me story in which diamond-alignment actually goes well.
Thanks to Garrett Baker, Quintin Pope, Daniel Ziegler, Adam Shimi, and Charles Foster for feedback.
Appendix: The AI’s advantages in solving successor-alignment
1. The AI has (or will eventually have) read-write-execute (rwx) access to its own weights and activations, as well as seeing the code for its learning process, logs of its training, access to its dataset and hyperparameter schedules, and being able to scan over previous model checkpoints.
2. The AI can back up and distill itself, in addition to whatever designer-implemented model checkpointing occurs as part of “best practices” for training such systems.
3. The AI can introspect in some ways, being able to predict what decisions get made in different situations.
4. The AI can explicitly discover its own ontology and circuits.
   - Because the AI can do precise “neuroscience” on itself via rwx, pairing its own introspective guesses with statistics of its own activations which it sets up.
   - In particular, a careful AI faces a substantially milder version of information inaccessibility. It has learned to (at least somewhat) understand how it’s making decisions as a result of SSL.
5. The successor AIs run on modified versions of the current AI’s cognition.
   - If the AI understands both its current values and how they activate in future situations, its current concepts, and its updating process, then the AI can reasonably expect to understand the alignment properties of successors.
   - Not so for people trying to develop minds in totally different architectures, reasoning from first principles and confronting information inaccessibility + abstractions deriving from a qualitatively different learning process.
6. The AI can run experiments on a sandboxed copy of itself and its training process, with automated tripwires for catastrophic value drift events.
   - Follows from benefits 1, 2, and 5.
I think that pure diamond maximizers are anti-natural, and at least not the first kind of successful story we should try to tell. Furthermore, the analogous version for an aligned AI seems to be “an AI which really helps people, among other goals, and is not a perfect human-values maximizer (whatever that might mean).”
The local mapping from gradient directions to behaviors is given by the neural tangent kernel, and the learnability of different behaviors is given by the NTK’s eigenspectrum, which seems to adapt to the task at hand, making the network quicker to learn along behavioral dimensions similar to those it has already acquired. Probably, a model pretrained mostly by interacting with its local environment or predicting human data will be inclined towards learning value abstractions that are simple extensions of the pretrained features, biasing the model towards forming values based on a human-like understanding of nearby diamonds.
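For readers who want the claim about the neural tangent kernel spelled out, here is the standard first-order statement, written for a scalar-output network for simplicity. The symbols ($f_\theta$ for the network, $\eta$ for the learning rate, $\ell$ for the per-example loss, $x_i$ for training inputs, $\Theta$ for the kernel) are notation introduced here, not taken from the post.

```latex
% Neural tangent kernel: similarity of parameter-gradients at two inputs.
\Theta(x, x') = \nabla_\theta f_\theta(x)^{\top} \, \nabla_\theta f_\theta(x')

% First-order effect of one gradient-descent step on the output at x.
\Delta f_\theta(x) \approx -\eta \sum_i \Theta(x, x_i) \,
    \frac{\partial \ell}{\partial f_\theta(x_i)}

% Squared loss, kernel held fixed: the training residual r_t = f_t - y evolves as
r_{t+1} = (I - \eta \, \Theta) \, r_t
% so the component of r along an eigenvector with eigenvalue \lambda_k
% shrinks by a factor (1 - \eta \lambda_k) per step.
```

The last line holds only in the linearized regime where the kernel is treated as constant. The footnote's point is that in practice the kernel itself changes with training, so directions aligned with already-learned features tend to have larger eigenvalues and are learned faster.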
"Don't approach" means negative reward on approach, "approach" means positive reward on approach. Example decision scenarios:
1. Diamond in front of agent (approach)
2. Sapphire (don't approach)
3. Nothing (no reward)
4. Five chairs (don't approach)
5. A white shiny object which isn't a diamond (don't approach)
6. A small object which isn't a diamond (don't approach)
We can even do interpretability on the features activated by a diamond, and modify the scenario so that only the diamond feature correctly distinguishes between all approach/don't approach pairs. This hopefully ensures that the batch update chisels cognition into the agent which is predicated on the activation of the agent's diamond abstraction.
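The check described above, that only the diamond feature separates every approach case from every don't-approach case, can be sketched as follows. The feature sets assigned to each scenario are hypothetical guesses at what the agent might represent, and the "nothing" scenario is left out because it carries no reward either way.

```python
# (description, active features, should_approach); feature sets are hypothetical.
SCENARIOS = [
    ("diamond in front of agent", {"diamond", "gem", "shiny", "small", "object"}, True),
    ("sapphire", {"gem", "shiny", "small", "object"}, False),
    ("five chairs", {"object"}, False),
    ("white shiny non-diamond", {"shiny", "object"}, False),
    ("small non-diamond object", {"small", "object"}, False),
]


def separating_features(scenarios):
    """Features whose presence matches the approach label in every scenario."""
    features = set().union(*(active for _, active, _ in scenarios))
    return sorted(
        feature
        for feature in features
        if all((feature in active) == approach for _, active, approach in scenarios)
    )


print(separating_features(SCENARIOS))
```

If some proxy such as "gem" also showed up in the result, that would be a cue to add a scenario (here, the sapphire) which breaks the tie.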
Especially if we try tricks like “slap a ‘diamond’ label beneath the diamond, in order to more strongly and fully activate the agent’s internal diamond representation” (credit to Charles Foster). I expect more strongly activated features to be more salient to the gradients. I therefore more strongly expect such features to be involved in the learned shards.
I think that there's a smooth relationship between "how many reward-event mistakes you make" (eg accidentally penalizing the agent for approaching a diamond) and "the strength of desired value you get out" (with a few discontinuities at the low end, where perhaps a sufficiently weak shard ends up non-reflective, or not plugging into the planning API, or nonexistent at all).
In my view, there always had to be some way to align agents to diamonds without getting fussy about definitions. After all, (I infer that) some people grow diamond-shards in a non-fussy way, without requiring extreme precision from their reward systems or fancy genetically hardcoded alignment technology.
Why wouldn't the agent want to just find an adversarial input to its diamond abstraction, which makes it activate unusually strongly? (I think that agents might accidentally do this a bit for optimizer's curse reasons, but not that strongly. More in an upcoming post.)
Consider why you wouldn't do this for "hanging out with friends." Consider the expected consequences of the plan "find an adversarial input to my own evaluation procedure such that I find a plan which future-me maximally evaluates as letting me 'hang out with my friends'." I currently predict that such a plan would lead future-me to daydream and not actually hang out with my friends, as present-me evaluates the abstract expected consequences of that plan. My friend-shard doesn't like that plan, because I'm not hanging out with my friends. So I don't search for an adversarial input. I infer that I don't want to find those inputs because I don't expect those inputs to lead me to actually hang out with my friends a lot as I presently evaluate the abstract-plan consequences.
I don't think an agent can consider searching for adversarial inputs to its shards without also being reflective, at which point the agent realizes the plan is dumb as evaluated by the current shards assessing the predicted plan-consequences provided by the reflective world-model.
Asking "why wouldn't the agent want to find an adversarial input to its diamond abstraction?" seems like a dressed-up version of "why wouldn't I want to find a plan where I can get myself shot while falsely believing I solved all of the world's problems?" Because it's stupid by my actual values, that's why. (Although some confused people who have taken wrong philosophy too far might indeed find such a plan appealing.)
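As a rough illustration of the "hanging out with friends" argument (not from the original post), here is a toy sketch contrasting two ways of choosing a plan. The plan names, the numbers, and the functions `friend_shard_bid` and `future_rating` are all hypothetical. The only point is that a shard which scores the predicted consequences of a plan has no reason to pick the adversarial-input plan, while something that maximizes the future evaluation itself does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    predicted_friend_hours: float   # what the world model predicts actually happens
    predicted_future_rating: float  # how strongly future-me's evaluator would fire


# Hypothetical plans and numbers, chosen only for illustration.
PLANS = [
    Plan("call a friend and meet up", 3.0, 0.8),
    Plan("stay home", 0.0, 0.1),
    Plan("search for an adversarial input to my own evaluator", 0.0, 1.0),
]


def friend_shard_bid(plan: Plan) -> float:
    """Present-me's shard scores the predicted consequences of the plan."""
    return plan.predicted_friend_hours


def future_rating(plan: Plan) -> float:
    """What a grader-optimizer would maximize: the future evaluation itself."""
    return plan.predicted_future_rating


chosen_by_values = max(PLANS, key=friend_shard_bid)
chosen_by_grader_optimizer = max(PLANS, key=future_rating)

print("values-based choice:", chosen_by_values.name)
print("grader-optimizer choice:", chosen_by_grader_optimizer.name)
```

In this toy setup the values-based chooser meets up with the friend, because the adversarial plan is predicted to yield zero hours of actually hanging out. Only the chooser that optimizes the future rating selects the adversarial plan.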
The reader may be surprised. "Doesn't TurnTrout think agents probably won't care about reward?". Not quite. As I stated in Reward is not the optimization target:
I think that generally intelligent RL agents will have secondary, relatively weaker values around reward, but that reward will not be a primary motivator. Under my current (weakly held) model, an AI will only start reinforcing computations about reward after it has reinforced other kinds of computations (e.g. putting away trash).
The reason I think this is that once the agent starts modeling its training process, it will have an abstraction around actions which are rewarding, and this will become a viable gradient direction for the batch PG updates. I don't expect the agent to model its training process until after it's formed e.g. the object-level diamond-shard, and I also expect abstractions like "diamond" to be more strongly activated and focused on by policy gradients. Thus, I suspect that reward-shards tend to be acquired, albeit at relatively weak strength.
If instrumental values tend to get terminalized into their own shards, then people are not particularly unusual for not exhibiting an introspectively observable type separation between "instrumental" and "terminal" values:
Despite all the other ways that the human goal system complicates everything in sight, this one distinction it manages to collapse into a mishmash of things-with-conditional-value.
You might be wondering "since when was 'take over server farms' reinforced as a heuristic?".
First, I think that by this point in the story, there's something like an "internal API" for planning, where a plan stub gets proposed and modified, and the agent's world model predicts relevant consequences of the plan. I could imagine staying put in my room right now, and then tweaking that to "go to the kitchen", and that has some consequences, and then I can add further actions or modify the plan. In each case, my world model gives me more predictions for what happens and when.
So, the diamond-shard hooks into the learned planning+world-model API, modifies plan-stubs using diamond-based criteria, and quite flexibly and generally tailors plans to increase future expected diamonds. So there doesn't have to be an explicit past reinforcement event for "take over server farms." The diamond-shard is, in conjunction with the planning-API and WM, capable of general planning + intelligence towards the diamond criteria.
(Although the diamond-shard isn't just "generally making plans better at diamond-production." The "general intelligence" has a specific implementation. But just like a rock is—if you look closely enough—made of atoms, the general intelligence is—if you look closely enough—made of streams of interwoven heuristics.)
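To make the "internal planning API" picture a bit more concrete, here is a minimal sketch, assuming a plan stub is just a list of action names and the world model is a function from plans to predicted outcomes. Everything here (`toy_world_model`, `diamond_shard_bid`, `refine_plan`, the action names and payoffs) is made up for illustration and is not a claim about how a real agent would implement this.

```python
from typing import Callable, Dict, List

Plan = List[str]  # a plan stub is an ordered list of action names
WorldModel = Callable[[Plan], Dict[str, float]]


def toy_world_model(plan: Plan) -> Dict[str, float]:
    """Hypothetical world model: predicts expected diamonds for a plan."""
    effects = {"stay put": 0.0, "go to mine": 1.0,
               "build synthesizer": 5.0, "acquire compute": 0.0}
    diamonds = sum(effects.get(action, 0.0) for action in plan)
    # Assumption: compute only pays off when combined with a synthesizer.
    if "acquire compute" in plan and "build synthesizer" in plan:
        diamonds += 20.0
    return {"expected_diamonds": diamonds}


def diamond_shard_bid(predicted: Dict[str, float]) -> float:
    """The shard bids on plans according to predicted diamonds."""
    return predicted["expected_diamonds"]


def refine_plan(stub: Plan, actions: List[str],
                world_model: WorldModel, steps: int = 3) -> Plan:
    """Greedily extend the stub with whichever action most raises the bid."""
    plan = list(stub)
    for _ in range(steps):
        candidates = [plan + [a] for a in actions if a not in plan]
        if not candidates:
            break
        best = max(candidates, key=lambda p: diamond_shard_bid(world_model(p)))
        if diamond_shard_bid(world_model(best)) <= diamond_shard_bid(world_model(plan)):
            break  # no modification improves the predicted outcome
        plan = best
    return plan


final_plan = refine_plan(
    ["stay put"],
    ["go to mine", "build synthesizer", "acquire compute"],
    toy_world_model,
)
print(final_plan, toy_world_model(final_plan))
```

Note that "acquire compute" has zero standalone payoff in this toy world model, so nothing ever "reinforced" it directly. It still ends up in the plan because the world model predicts that it increases diamonds in combination with other steps.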
I expect the AI to come up with a plan at least as good as:
In situations with reasonable uncertainty about value drift, run some sandbox sims overseen by several copies of your current model, using rwx + self-interpretability to monitor the sandboxed model. Since the AI doesn’t face as steep of an information inaccessibility obstacle (see the appendix), the AI can check new updates and reasonably expect to understand how the change takes place. If there's some crazy uninterpretable diff, then don't do that batch update. This should rule out sudden + catastrophic value drift.
The AI can also roll back to backups if it has noticed subtle value drift, and can precommit to doing so given a maintained update-invariant of (value drift tripwire) -> rollback. The agent also has a "saving throw": in the moment of realizing that bad value drift occurred via a previous update, the agent can undo the value drift via a rollback, if that's still motivationally possible. Furthermore, due to the batch update format, the agent has extra time to realize that an update would be bad and pre-emptively nullify it, e.g. by silently and temporarily setting the learning rate to zero.
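Here is a small sketch of what such an update policy could look like, under strong simplifying assumptions: parameters are a dictionary of floats, batch updates are additive diffs, and both the interpretability check and the value-drift tripwire are stand-in predicates (`small_diff` and `diamond_value_dropped` are hypothetical names and thresholds, not anything specified in the post).

```python
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


def apply_update(params: Params, update: Params) -> Params:
    """Return new parameters with the additive update applied."""
    new = dict(params)
    for key, delta in update.items():
        new[key] = new.get(key, 0.0) + delta
    return new


def guarded_training(
    params: Params,
    updates: List[Params],
    is_interpretable: Callable[[Params], bool],
    drift_tripwire: Callable[[Params], bool],
) -> Tuple[Params, List[str]]:
    """Apply batch updates, skipping opaque diffs and rolling back on drift."""
    backup = dict(params)
    log: List[str] = []
    for i, update in enumerate(updates):
        if not is_interpretable(update):
            # Analogous to temporarily setting the learning rate to zero.
            log.append(f"update {i}: skipped (uninterpretable diff)")
            continue
        params = apply_update(params, update)
        if drift_tripwire(params):
            params = dict(backup)  # invariant: tripwire -> rollback
            log.append(f"update {i}: tripwire fired, rolled back")
        else:
            backup = dict(params)  # only checkpoint states that pass the tripwire
            log.append(f"update {i}: applied")
    return params, log


def small_diff(update: Params) -> bool:
    """Stand-in for 'the agent can understand this diff'."""
    return all(abs(delta) <= 1.0 for delta in update.values())


def diamond_value_dropped(params: Params) -> bool:
    """Stand-in for a value-drift tripwire."""
    return params["diamond_value"] < 0.5


final_params, log = guarded_training(
    {"diamond_value": 1.0, "skill": 0.0},
    [{"skill": 0.5}, {"diamond_value": -5.0}, {"diamond_value": -0.75}, {"skill": 0.25}],
    small_diff,
    diamond_value_dropped,
)
print(final_params)
print("\n".join(log))
```

In this toy run the large diff is never applied, the subtle drift is applied and then undone by the rollback, and the capability-only updates go through.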
I initially conjectured this would be true while writing a draft, working mostly off of my intuitions. Quintin Pope then referred me to Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets:
We report a series of robust empirical observations, demonstrating that deep Neural Networks learn the examples in both the training and test sets in a similar order. This phenomenon is observed in all the commonly used benchmarks we evaluated, including many image classification benchmarks, and one text classification benchmark. While this phenomenon is strongest for models of the same architecture, it also crosses architectural boundaries – models of different architectures start by learning the same examples, after which the more powerful model may continue to learn additional examples. We further show that this pattern of results reflects the interplay between the way neural networks learn benchmark datasets. Thus, when fixing the architecture, we show synthetic datasets where this pattern ceases to exist. When fixing the dataset, we show that other learning paradigms may learn the data in a different order. We hypothesize that our results reflect how neural networks discover structure in natural datasets.
The authors state that they “failed to find a real dataset for which NNs differ [in classification order]” and that “models with different architectures can learn benchmark datasets at a different pace and performance, while still inducing a similar order. Specifically, we see that stronger architectures start off by learning the same examples that weaker networks learn, then move on to learning new examples.”
Similarly, crows (and other smart animals) reach developmental milestones in basically the same order as human babies reach them. On my model, developmental timelines come from convergent learning of abstractions via self-supervised learning in the brain. If so, then the smart-animal evidence is yet another instance of important qualitative concept-learning retaining its ordering, even across significant scaling and architectural differences.
Great post! I think it's very good for alignment researchers to be this level of concrete about their plans; it helps enormously in a bunch of ways, e.g. for evaluating the plan.
Comments as I go along:
How is the bolded sentence different from the following:
"Consider the expected consequences of the plan "think a lot longer and harder, considering a lot more possibilities for what you should do, and then make your decision." I currently predict that such a plan would lead future-me to waste his life doing philosophy or maybe get pascal's mugged by some longtermist AI bullshit instead of actually helping people with his donations. My helping-people shard doesn't like this plan, because it predicts abstractly that thinking a lot more will not result in helping people more."
(Basically I'm saying you should think more, and then write more, about the difference between these two cases because they seem plausibly on a spectrum to me, and this should make us nervous in a couple of ways. Are we actually being really stupid by being EAs and shutting up and calculating? Have we basically adversarial-exampled ourselves away from doing things that we actually thought were altruistic and effective back in the day? If not, what's different about the kind of extended search process we did, from the logical extension of that which is to do an even more extended search process, a sufficiently extreme search process that outsiders would call the result an adversarial example?)
Are you sure that's how it works? Seems plausible to me but I'm a bit nervous, I think it could totally turn out to not work like that. (That is, it could turn out that the agent wanting to preserve its diamond abstraction is the only thing that halts the march towards more and more alien-yet-effective abstractions)
you go on to talk about shards eventually values-handshaking with each other. While I agree that shard theory is a big improvement over the models that came before it (which I call rational agent model and bag o' heuristics model) I think shard theory currently has a big hole in the middle that mirrors the hole between bag o' heuristics and rational agents. Namely, shard theory currently basically seems to be saying "At first, you get very simple shards, like the following examples: IF diamond-nearby THEN goto diamond. Then, eventually, you have a bunch of competing shards that are best modelled as rational agents; they have beliefs and desires of their own, and even negotiate with each other!" My response is "but what happens in the middle? Seems super important! Also haven't you just reproduced the problem but inside the head?" (The problem being, when modelling AGI we always understood that it would start out being just a crappy bag of heuristics and end up a scary rational agent, but what happens in between was a big and important mystery. Shard theory boldly strides into that dark spot in our model... and then reproduces it in miniature! Progress, I guess.)
I think there are several things happening. Here are some:
EDIT: One of the main threads is Don't design agents which exploit adversarial inputs. The point isn't that people can't or don't fall victim to plans which, by virtue of spurious appeal to a person's value shards, cause the person to unwisely pursue the plan. The point here is that (I claim) intelligent people convergently want to avoid this happening to them.
A diamond-shard will not try to find adversarial inputs to itself. That was my original point, and I think it stands.
I think I agree with everything you said yet still feel confused. My question/objection/issue was not so much "How do you explain people sometimes falling victim to plans which spuriously appeal to their value shards!?!? Checkmate!" but rather "what does it mean for an appeal to be spurious? What is the difference between just thinking long and hard about what to do vs. adversarially selecting a plan that'll appeal to you? Isn't the former going to in effect basically equal the latter, thanks to extremal Goodhart? In the limit where you consider all possible plans (maximum optimization power), aren't they the same?"
Yes, that's a good question. This is what I've been aiming to answer with recent posts.
(I'm presently confident the answer is "no", as might be clear from my comments and posts!)
OK, guess I'll go read those posts then...
I think this is a great observation. I thought about it a bit and don't really find myself worried, based off of some intuitions which I think would take me at least 20 minutes to type up right now, and I really should wrap my commenting up for now. Feel free to ping me if no one else has answered this in a while.
Consider yourself pinged! No rush to reply though.
I think the hole is somewhat smaller than you make out, but still substantial. From The shard theory of human values:
I have some more models beyond what I've shared publicly, and eg one of my MATS applicants proposed an interesting story for how the novelty-shard forms, and also proposed one tack of research for answering how value negotiation shakes out (which is admittedly at the end of the gap). But overall I agree that there's a substantial gap here. I've been working on writing out pseudocode for what shard-based reflective planning might look like.
I think they aren't quite best modelled as rational agents, but I'm confused about what axes they are agentic along and what they aren't.
Shard theory seems more evidentially supported than bag-o-heuristics theory and rational agent theory, but that's a pretty low bar! I expect a new theory to come along which is as much of an improvement over shard theory as shard theory is over those.
Re the 5 open questions: Yeah 4 and 5 seem like the hard ones to me.
Anyhow, in conclusion, nice work & I look forward to reading future developments. (Now I'll go read the other comments)
I appreciate the effort and strong-upvoted this post because I think it's following a good methodology of trying to build concrete gear-level models and concretely imagining what will happen, but also think this is really very much not what I expect to happen, and in my model of the world is quite deeply confused about how this will go (mostly by vastly overestimating the naturalness of the diamond abstraction, underestimating convergent instrumental goals and associated behaviors, and relying too much on the shard abstraction). I don't have time to write a whole response, but in the absence of a "disagreevote" on posts am leaving this comment.
Thanks. Am interested in hearing more at some point.
I also want to note that insofar as this extremely basic approach ("reward the agent for diamond-related activities") is obviously doomed for reasons the community already knew about, then it should be vulnerable to a convincing linkpost comment which points out a fatal, non-recoverable flaw in my reasoning (like: "TurnTrout, you're ignoring the obvious X and Y problems, linked here:"). I'm posting this comment as an invitation for people to reply with that, if appropriate!
And if there is nothing previously known to be obviously fatal, then I think the research community moved on too quickly by assuming the frame of inner/outer alignment. Even if this proposal has a new fatal flaw, that implies the perceived old fatal flaws (like "the agent games its imperfect objective") were wrong / only applicable in that particular frame.
ETA: I originally said "devastating" instead of "convincing." To be clear: I am looking for courteous counterarguments focused on truth-seeking, and not optimized for "devastation" in a social sense.
That's not to say you should have supplied it. I think it's good for people to say "I disagree" if that's all they have time for, and I'm glad you did.
First, I think most of the individual pieces of this story are basically right, so good job overall. I do think there's at least one fatal flaw and a few probably-smaller issues, though.
The main fatal flaw is this assumption:
This assumes that the human labellers (or automated labellers created by humans) have perfectly labelled the training examples.
I'm mostly used to thinking about this in the context of alignment with human values (or corrigibility etc), where it's very obvious that human labellers will make mistakes. In the case of diamonds, it is maybe plausible that we could get a dataset with zero incorrect labels, but that's still a pretty difficult problem if the dataset is to be reasonably large and diverse.
If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction. Once the AI has an internal representation of the actual labelling process, that proto-shard will be reinforced more than the proto-diamond shard, because it will match the label in cases where the diamond-concept doesn't (and the reverse will not happen, or at least will happen less often and only due to random noise).
The easy way to patch these is to forget about approach-rewards altogether, and just reward the agent for causing more diamond to exist (or for total amount of diamond which exists in its environment). That's more directly what we want from a diamond-optimizer anyway.
Note that all of these issues are much more obvious if we start from the standard heuristic that the trained agent will end up optimizing for whatever generated its reward, and then pay attention to how well-aligned that reward-generator is with whatever we actually want. You've been quite vocal about how that heuristic leads to some incorrect conclusions, but it does highlight real and important considerations which are easy to miss without it.
I don't think this is true. For example, humans do not usually end up optimizing for the activations of their reward circuitry, not even neuroscientists. Also note that humans do not infer the existence of their reward circuitry simply from observing the sequence of reward events. They have to learn about it by reading neuroscience. I think that steps like "infer the existence / true nature of distant latent generators that explain your observations" are actually incredibly difficult for neural learning processes (human or AI). Empirically, SGD is perfectly willing to memorize deviations from a simple predictor, rather than generalize to a more complex predictor. Current ML would look very different if inferences like that were easy to make (and science would be much easier for humans).
Even when a distant latent generator is inferred, it is usually not the correct generator, and usually just memorizes observations in a slightly more efficient way by reusing current abstractions. E.g., religions which suppose that natural disasters are the result of a displeased, agentic force.
I partly buy that, but we can easily adjust the argument about incorrect labels to circumvent that counterargument. It may be that the full label generation process is too "distant"/complex for the AI to learn in early training, but insofar as there are simple patterns to the humans' labelling errors (which of course there usually are, in practice) the AI will still pick up those simple patterns, and shards which exploit those simple patterns will be more reinforced than the intended shard. It's like that example from the RLHF paper where the AI learns to hold a grabber in front of a ball to make it look like it's grabbing the ball.
I think something like what you're describing does occur, but my view of SGD is that it's more "ensembly" than that. Rather than "the diamond shard is replaced by the pseudo-diamond-distorted-by-mislabeling shard", I expect the agent to have both such shards (really, a giant ensemble of shards each representing slightly different interpretations of what a diamond is).
Behaviorally speaking, this manifests as the agent having preferences for certain types of diamonds over others. E.g., one very simple example is that I expect the agent to prefer nicely cut and shiny diamonds over unpolished diamonds or giant slabs of pure diamond. This is because I expect human labelers to be strongly biased towards the human conception of diamonds as pieces of art, over any configuration of matter with the chemical composition of a diamond.
Why does the ensembling matter?
I could imagine a story where it matters - e.g. if every shard has a veto over plans, and the shards are individually quite intelligent subagents, then the shards bargain and the shard-which-does-what-we-intended has to at least gain over the current world-state (otherwise it would veto). But that's a pretty specific story with a lot of load-bearing assumptions, and in particular requires very intelligent shards. I could maybe make an argument that such bargaining would be selected for even at low capability levels (probably by something like Why Subagents?), but I wouldn't put much confidence in that argument.
... and even then, that argument would only work if at least one shard exactly matches what we intended. If all of the shards are imperfect proxies, then most likely there are actions which can Goodhart all of them simultaneously. (After all, proxy failure is not something we'd expect to be uncorrelated - conditions which cause one proxy to fail probably cause many to fail in similar ways.)
On the other hand, consider a more traditional "ensemble", in which our ensemble of shards votes (with weights) or something. Typically, I expect training dynamics will increase the weight on a component exponentially w.r.t. the number of bits it correctly "predicts", so exploiting even a relatively small handful of human-mislabellings will give the exploiting shards much more weight. And on top of that, a mix of shards does not imply a mix of behavior; if a highly correlated subset of the shards controls a sufficiently large chunk of the weight, then they'll have de-facto control over the agent's behavior.
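One way to read the "exponential in bits" claim is as a Bayesian-mixture-style update, where a component's weight is proportional to 2 raised to the number of bits it correctly predicts. That reading, and the two-component setup below, are assumptions made for illustration; `weight_share` is a hypothetical helper, not a model of real training dynamics.

```python
def weight_share(extra_bits: float) -> float:
    """Share of total weight held by the component that predicts `extra_bits`
    more bits than its single competitor, assuming weight ~ 2**(bits predicted)."""
    ratio = 2.0 ** extra_bits
    return ratio / (1.0 + ratio)


# How the share grows as the exploiting shard explains more mislabelled bits.
for bits in (0, 1, 5, 10):
    print(f"{bits:>2} extra bits -> share {weight_share(bits):.4f}")
```

Under this assumed model, a shard that explains 10 more bits of the labels than the intended shard would hold 1024/1025 of the combined weight, which is the sense in which a small handful of exploitable mislabellings could matter a lot.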
I think there's something like "why are human values so 'reasonable', such that [TurnTrout inference alert!] someone can like coffee and another person won't and that doesn't mean they would extrapolate into bitter enemies until the end of Time?", and the answer seems like it's gonna be because they don't have one criterion of Perfect Value that is exactly right over which they argmax, but rather they do embedded, reflective heuristic search guided by thousands of subshards (shiny objects, diamonds, gems, bright objects, objects, power, seeing diamonds, knowing you're near a diamond, ...), such that removing a single subshard does not catastrophically exit the regime of Perfect Value.
I think this is one proto-intuition why Goodhart-arguments seem Wrong to me, like they're from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don't think I've properly communicated my feelings in this comment, but hopefully it's better than nothing.)
My intuition is that in order to go beyond imitation learning and random exploration, we need some sort of "iteration" system (a la IDA), and the cases of such systems that we know of tend to either literally be argmax planners with crisp utility functions, or have similar problems to argmax planners with crisp utility functions.
What about this post?
Well so you're obviously pretraining using imitation learning, so I've got that part down.
If I understand your post right, the rest of the policy training is done by policy gradients on human-induced rewards? As I understand it, policy gradient is close to a maximally sample-hungry method, because it does not do any modelling. At one level I would class this as random exploration, but on another level the humans are allowed to provide reinforcement based on methods rather than results, so I suppose this also gives it an element of imitation learning.
So I guess my expectation is that your training method is too sample inefficient to achieve much beyond human imitation.
I think this won't happen FWIW.
Can you provide a concrete instantiation of this argument? (ETA: struck this part, want to hear your response first to make sure it's engaging with what you had in mind)
The point I am arguing (ETA and I expect Quintin is as well, but maybe not) is that this will be one of the primary shards produced, not that there's a chance it exists at low weight or something.
I read this as "the activations and bidding behaviors of the shards will themselves be imperfect, so you get the usual 'Goodhart' problem where highly rated plans are systematically bad and not what you wanted." I disagree with the conclusion, at least for many kinds of "imperfections."
Below is one shot at instantiating the failure mode you're describing. I wrote this story so as to (hopefully) contain the relevant elements. This isn't meant as a "slam dunk case closed", but hopefully something which helps you understand how I'm thinking about the issue and why I don't anticipate "and then the shards get Goodharted."
Then this shard can be "goodharted" by actions which involve the creation of these bacteria diamonds at that time. There's a question, though, of whether the AI will actually consider these plans (so that it then actually bids on this plan, which is rated spuriously highly from our perspective). The AI knows, abstractly, that considering this plan would lead it to bid for that plan. But it seems to me like, since generating that plan is reflectively predicted to not lead to diamonds (nor does it activate the specific bidding-behavior edge case the agent abstractly knows about), the agent doesn't pursue that plan.
This was one of the main ideas I discussed in Alignment allows "nonrobust" decision-influences and doesn't require robust grading:
This suggests "and so what is an 'adversarial input' to the values, then? What intensional rule governs the kinds of high-scoring plans which internal reasoning will decide to not evaluate in full?". I haven't answered that question yet on an intensional basis, but it seems tractable.
Not crucial on my model.
I'm imagining us watching the agent and seeing whether it approaches an object or not. Those are the "labels." I'm imagining this taking place between 50 and 1000 times. Before seeing this comment, I edited the post to add:
So, probably I shouldn't have written "perfectly", since that isn't actually load-bearing on my model. I think that there's a rather smooth relationship between "how good you are at labelling" and "the strength of desired value you get out" (with a few discontinuities at the low end, where perhaps a sufficiently weak shard ends up non-reflective, or not plugging into the planning API, or nonexistent at all). On that model, I don't really understand the following:
The agent already has the diamond abstraction from SSL+IL, but not the labelling process (due to IID training, and it having never seen our "labelling" before—in the sense of us watching it approach the objects in real time). And this is very early in the RL training, at the very beginning. So why would the agent learn the labelling abstraction during the labelling and hook that in to decision-making, in the batch PG updates, instead of just hooking in the diamond abstraction it already has? (Edit: I discussed this a bit in this footnote.)
I agree that "diamond synthesis" is not directly rewarded, and if we wanted to ensure that happens, we could add that to the curriculum, as you note. But I think it would probably happen anyways, due to the expected-by-me "grabby" nature of the acquire-subshard. (Consider that I think it'd be cool to make dyson swarms, but I've never been rewarded for making dyson swarms.) So maybe the crux here is that I don't yet share your doubt of the acquisition-shard.
I think that "are we directly rewarding the behavior which we want the desired shards to exemplify?" is a reasonable heuristic. I think that "What happens if the agent optimizes its reward function?" is not a reasonable heuristic.
I think there are a few different errors in this reasoning.
First: the agent probably has the concept of diamond from SSL+IL, but that's different from concepts like producing diamond, approaching diamond (which in turn requires a self-concept or at least a concept of the avatar it's controlling), etc. During training, those sorts of more-complex concepts are probably built up out of their components (like e.g. "production" and "diamond"); the actual goals or behaviors encoded in a shard have to be built up in whatever "internal language" the agent has from the SSL/IL training.
So the question isn't "does the agent have the concept of diamond/label?", the question is how short the relevant "sentences" are in terms of the concepts it has. Neither will be just one "word".
Second: as with Quintin's comment, the AI does not need to fully model the entire labelling process in order for this problem to apply. If there's any simple, predictable pattern to the humans' label-errors (which of course there usually is in practice), then the AI can pick that up. (It's not just a question of hand-slips; humans make systematic errors which will strongly activate shards very similar to the intended shards.)
So the question isn't "is the entire labelling process a short 'sentence' in the AI's internal language?" (though even that is not implausible), but rather "do any systematic errors in the labelling process have a short 'sentence' in the AI's internal language?".
Now put those two together. The intended shards are quite a bit more complicated than you suggested, because they don't just depend on the concept of "diamond", they depend on constructing a bunch of other concepts about what to do involving diamonds. And the unintended shards are quite a bit less complicated than you suggested, because they can exploit simple systematic errors in the labels.
I think I have a complaint like "You seem to be comparing to a 'perfect' reward function, and lamenting how we will deviate from that. But in the absence of inner/outer alignment, that doesn't make sense. A good reward schedule will put diamond-aligned cognition in the agent. It seems like, for you to be saying there's a 'fatal' flaw here due to 'errors', you need to make an argument about the cognition which trains into the agent, and how the AI's cognition-formation behaves differently in the presence of 'errors' compared to in the absence of 'errors.' And I don't presently see that story in your comments thus far. I don't understand what 'perfect labeling' is the thing to talk about, here, or why it would ensure your shard-formation counterarguments don't hold."
(Will come by for lunch and so we can probably have a higher-context discussion about this! :) )
I think this is close to our most core crux.
It seems to me that there are a bunch of standard arguments which you are ignoring because they're formulated in an old frame that you're trying to avoid. And those arguments in fact carry over just fine to your new frame if you put a little effort into thinking about the translation, but you've instead thrown the baby out with the bathwater without actually trying to make the arguments work in your new frame.
Like, if I have a reward signal that rewards X, then the old frame would say "alright, so the agent will optimize for X". And you're like "nope, that whole form of argument is invalid, hit ignore button". But in fact it is usually very easy to take that argument and unpack it into something like "X has a short description in terms of natural abstractions, so starting from a base model and giving a feedback signal we should rapidly see some X-shards show up, and then the shards which best match X will be reinforced to exponentially higher weight (with respect to the bit-divergence between their proxy X' and the actual X)". And it seems like you are not even attempting to perform that translation, which I find very frustrating because I'm pretty sure you know this stuff plenty well to do it.
When I first read this comment, I incorrectly understood it to say something like "If you were actually trying, you'd have generated the exponential error model on your own; the fact that you didn't shows that you aren't properly thinking about old arguments." I now don't think that's what you meant. I think I finally understand what you did mean, and I think you misunderstood what my original comment was trying to say because I wrote poorly and stream-of-consciousness.
Most importantly, I wasn’t saying something like “‘errors’ can’t exist because outer/inner alignment isn’t my frame, ignore.” I meant to communicate the following points:
I meant to ask something like “I don’t fully understand what you’re arguing re: points 1 and 2 (but I have some guesses), and think I disagree about 3; please clarify?” But instead (e.g. by saying things like "my complaint is...") I perhaps communicated something like “because I don’t understand 2 in my native frame, your argument sucks.” And you were like “Come on, you didn’t even try, you could have totally translated 2. Worrying that you apparently didn't.”
I think that I left an off-the-cuff comment which might have been appropriate as a Discord message (with real-time clarification), but not as a LessWrong comment. Oops.
Elaborating points 1 and 3 above:
Point 1. In outer/inner, if you "perfectly label" reward events based on whether the agent approaches the diamond, you're "done" as far as the outer alignment part goes. In order to make the agent actually care about approaching diamonds, we would then turn to inner alignment techniques / ideas. It might make sense to call this labelling "perfect" as far as specifying the outer objective for those scenarios (e.g. when that objective is optimized, the agent actually approaches the diamond).
But if we aren't aiming for outer/inner alignment, and instead are just considering the (reward schedule) -> (inner value composition) mapping, then I worry that my post's original usage of "perfect" was misleading. On my current frame, a perfect reward schedule would be one which actually gets diamond-values into the agent. The schedule I posted is probably not the best way to do that, even if all goes as planned. I want to be careful not to assume the "perfection" of "+1 when it does in fact approach a real diamond which it can see", even if I can't currently point to better alternative reward schedules (e.g. "+x reward in some weird situation"). (This is what I was getting at with "I don't understand why 'perfect labeling' is the thing to talk about, here.")
What you probably meant by "errors" was "divergences from the reward function outlined in the original post." This is totally reasonable and important to talk about, but at least I want to clarify for myself and other readers that this is what we're talking about, and not assuming that my intended reward function was actually "perfect." (Probably it's fine to keep talking about "perfect labelling" as long as this point has been made explicit.)
Point 3. Under my best guess of what you mean (which did end up being roughly right, about the exponential bit-divergence), I think your original comment seemed too optimistic about the original story going well given "perfect" labelling. This is one thing I meant by "I don't understand why 'perfect labeling' would ensure your shard-formation counterarguments don't hold."
If the situation value-distribution is actually exponential in bit-divergence, I'd expect way less wiggle room on value shard formation, because that's going to mean that way more situations are controlled by relatively few subshards (or maybe even just one). Possibly the agent just ends up with fewer terms in its reflectively stable utility function, because fewer shards/subshards activate during the values handshake. (But I'm tentative about all this, haven't sketched out a concrete failure scenario yet given exponential model! Just a hunch I remember having.)
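To make the concentration worry concrete, here is a minimal sketch. It assumes shard weights proportional to `2 ** -d`, where `d` is the bit-divergence between a shard's proxy and the intended target. That functional form, the linear contrast case, and the divergence values are all illustrative assumptions, not something the thread pins down (and the linear case here is not the "linear in error proportion" hypothesis mentioned elsewhere in this comment).

```python
def exponential_weights(bit_divergences):
    """Normalized weights proportional to 2 ** -d (assumed functional form)."""
    raw = [2.0 ** -d for d in bit_divergences]
    total = sum(raw)
    return [r / total for r in raw]


def linear_weights(bit_divergences):
    """Normalized weights falling off linearly in d (contrast case, also assumed)."""
    cutoff = max(bit_divergences) + 1
    raw = [1.0 - d / cutoff for d in bit_divergences]
    total = sum(raw)
    return [r / total for r in raw]


def top_share(weights, k=1):
    """Fraction of total weight held by the k heaviest shards."""
    return sum(sorted(weights, reverse=True)[:k])


divergences = [0, 1, 2, 3, 4, 5, 6, 7]  # made-up bit-divergences for eight subshards
exp_w = exponential_weights(divergences)
lin_w = linear_weights(divergences)
print(f"exponential: top shard {top_share(exp_w):.2f}, top two {top_share(exp_w, 2):.2f}")
print(f"linear:      top shard {top_share(lin_w):.2f}, top two {top_share(lin_w, 2):.2f}")
```

With these made-up numbers, the exponential model hands roughly half of the total weight to the single best-matching shard, while the linear contrast leaves it with under a quarter. That is the sense in which an exponential model would leave "relatively few subshards" in control.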
Again, it was very silly of me to expect my original comment to communicate these points. At the time of writing, I was trying to unpack some promising-seeming feelings and elaborate them over lunch.
My original guess at your complaint was "How could you possibly have not generated the exponential weight hypothesis on your own?", and I was like what the heck, it's a hypothesis, sure... but why should I have pinned down that one? What's wrong with my "linear in error proportion for that kind of situation, exponential in ontology-distance at time of update" hypothesis, why doesn't that count as a thing-to-have-generated? This was a big part of why I was initially so confused about your complaint.
And then several people said they thought your comment was importantly correct-seeming, and I was like "no way, how can everyone else already have such a developed opinion on exponential vs linear vs something-else here? Surely this is their first time considering the question? Why am I getting flak about not generating that particular hypothesis, how does that prove I'm 'not trying' in some important way?"
To be clear, I don't think the exponential asymptotics specifically are obvious (sorry for implying that), but I also don't think they're all that load-bearing here. I intended more to gesture at the general cluster of reasons to expect "reward for proxy, get an agent which cares about the proxy"; there's lots of different sets of conditions any of which would be sufficient for that result. Maybe we just train the agent for a long time with a wide variety of data. Maybe it turns out that SGD is surprisingly efficient, and usually finds a global optimum, so shards which don't perfectly fit the proxy die. Maybe the proxy is a more natural abstraction than the thing it was proxying for, and the dynamics between shards competing at decision-time are winner-take-all. Maybe dynamics between shards are winner-take-all for some other reason, and a shard which captures the proxy will always have at least a small selective advantage. Etc.
It sounds like the difference between one or a few shards dominating each decision, vs a large ensemble, is very central and cruxy to you. And I still don't see why that matters, so maybe that's the main place to focus.
You gestured at some intuitions about that in this comment (which I'm copying below to avoid scrolling to different parts of the thread-tree), and I'd be interested to see more of those intuitions extracted.
I have multiple different disagreements with this, and I'm not sure which are relevant yet, so I'll briefly state a few:
The extremely basic intuition is that all else equal, the more interests present at a bargaining table, the greater the chance that some of the interests are aligned.
My values are also risk-averse (I'd much rather take a 100% chance of 10% of the lightcone than a 20% chance of 100% of the lightcone), and my best guess is that internal values handshakes are ~linear in "shard strength" after some cutoff where the shards are at all reflectively endorsed (my avoid-spiders shard might not appreciably shape my final reflectively stable values). So more subshards seems like great news to me, all else equal, with more shard variety increasing the probability that part of the system is motivated the way I want it to be.
(This isn't fully expressing my intuition, here, but I figured I'd say at least a little something to your comment right now)
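The risk-aversion claim above can be checked with a little arithmetic. This is a minimal sketch under two assumptions that the comment does not state: the remaining 80% of the gamble yields nothing, and risk aversion is modeled with a square-root utility (any sufficiently concave utility would do; the choice is purely illustrative).

```python
import math


def expected_utility(lottery, utility):
    """Expected utility of a lottery given as (probability, lightcone_share) pairs."""
    return sum(p * utility(share) for p, share in lottery)


sure_thing = [(1.0, 0.10)]           # 100% chance of 10% of the lightcone
gamble = [(0.2, 1.00), (0.8, 0.00)]  # 20% chance of everything, else nothing (assumed)

# A risk-neutral agent compares expected shares directly.
neutral_sure = expected_utility(sure_thing, lambda s: s)
neutral_gamble = expected_utility(gamble, lambda s: s)

# A risk-averse agent applies a concave utility first (sqrt is an assumption).
averse_sure = expected_utility(sure_thing, math.sqrt)
averse_gamble = expected_utility(gamble, math.sqrt)

print(f"risk-neutral: sure {neutral_sure:.3f} vs gamble {neutral_gamble:.3f}")
print(f"risk-averse:  sure {averse_sure:.3f} vs gamble {averse_gamble:.3f}")
```

The gamble has twice the expected share, so a risk-neutral agent takes it. Under the concave utility the ordering flips, which is the preference the comment describes.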
I'm not going to go into most of the rest now, but:
I agree that we may need to be quite skillful in providing "good"/carefully considered reward signals on the data distribution actually fed to the AI. (I also think it's possible we have substantial degrees of freedom there.) In this sense, we might need to give "robustly" good feedback.
However, one intuition which I hadn't properly communicated was: to make OP's story go well, we don't need e.g. an outer objective which robustly grades every plan or sequence of events the AI could imagine, such that optimizing that objective globally produces good results. This isn't just good reward signals on data distribution (e.g. real vs fake diamonds), this is non-upwards-error reward signals in all AI-imaginable situations, which seems thoroughly doomed to me. And this story avoids at least that problem, which I am relieved by. (And my current guess is that this "robust grading" problem doesn't just reappear elsewhere, although I think there are still a range of other difficult problems remaining. See also my post Alignment allows "nonrobust" decision-influences and doesn't require robust grading.)
And so I might have been saying "Hey isn't this cool we can avoid the worst parts of Goodhart by exiting outer/inner as a frame" while thinking of the above intuition (but not communicating it explicitly, because I didn't yet have sufficient clarity). But maybe you reacted "??? how does this avoid the need to reliably grade on-distribution situations, it's totally nontrivial to do that and it seems quite probable that we have to." Both seem true to me!
(I'm not saying this was the whole of our disagreement, but it seems like a relevant guess.)
EDIT 2: The original comment was too harsh. I've struck the original below. Here is what I think I should have said:
I think you raise a valuable object-level point here, which I haven't yet made up my mind on. That said, I think this meta-level commentary is unpleasant and mostly wrong. I'd appreciate if you wouldn't speculate on my thought process like that, and would appreciate if you could edit the tone-relevant parts.
Warning: This comment, and your previous comment, violate my comment section guidelines: "Reign of terror // Be charitable." You have made and publicly stated a range of unnecessary, unkind, and untrue inferences about my thinking process. You have also made non-obvious-to-me claims of questionable-to-me truth value, which you also treat as exceedingly obvious. Please edit these two comments to conform to my civility guidelines. (EDIT: Thanks. I look forward to resuming object-level discussion!)
After more reflection, I now think that this moderation comment was too harsh. First, the parts I think I should have done differently:
I'm striking the original warning, putting in (4), and I encourage John to unredact his comments (but that's up to him).
I've thought more about what my policy should be going forward. What kind of space do I want my comment section to be? First, I want to be able to say "This seems wrong, and here's why", and other people can say the same back to me, and one or more of us can end up at the truth faster. Second, it's also important that people know that, going forward, engaging with me in (what feels to them like) good faith will not get them randomly slapped with a moderation warning because they annoyed me.
Third, I want to feel comfortable in my interactions in my comment section. My current plan is:
I had spoken with John privately before posting the warning comment. I think my main mistake was jumping to (3) instead of doing more of (1) and (2).
Oh, huh, I think this moderation action makes me substantially less likely to comment further on your posts, FWIW. It's currently well within your rights to do so, and I am on the margin excited about more people moderating things, but I feel hesitant participating with the current level of norm-specification + enforcement.
I also turned my strong-upvote into a small-upvote, since I have less trust in the comment section surfacing counterarguments, which feels particularly sad for this post (e.g. I was planning to respond to your comment with examples of past arguments this post is ignoring, but am now unlikely to do so).
Again, I think that's fine, but I think posts with idiosyncratic norm enforcement should get less exposure, or at least not be canonical references. Historically we've decided to not put posts on frontpage when they had particularly idiosyncratic norm enforcement. I think that's the wrong call here, but not confident.
Sorry, I'm confused; for my own education, can you explain why these civility guidelines aren't epistemically suicidal? Personally, I want people like John Wentworth to comment on my posts to tell me their inferences about my thinking process; moreover, controlling for quality, "unkind" inferences are better, because I learn more from people telling me what I'm doing wrong, than from people telling me what I'm already doing right. What am I missing? Please be unkind.
This is already my model and was intended as part of my communicated reasoning. Why do you think it's an error in my reasoning? You'll notice I argued "If
diamond", and about hooking that diamond predicate into its approach-subroutines (learned via IL). (ETA: I don't think you need a self-model to approach a diamond, or to "value" that in the appropriate sense. To value diamonds being near you, you can have representations of the space nearby, so you need a nearby representation, perhaps.)
I think this is not the right term to use, and I think it might be skewing your analysis. This is not a supervised learning regime with exact gradients towards a fixed label. The question is what gets upweighted by the batch PG gradients, batching over the reward events. Let me exaggerate the kind of "error rates" I think you're anticipating:
(If these errors aren't representative, can you please provide a concrete and plausible scenario?)
Both of these examples are focused on one error type: the agent does not receive a reward in a situation which we like. That error type is, in general, not very dangerous.
The error type which is dangerous is for an agent to receive a reward in a situation which we don't like. For instance, receiving reward in a situation involving a convincing-looking fake diamond. And then a shard which hooks up its behavior to things-which-look-like-diamonds (which is probably at least as natural an abstraction as diamond) gets more weight relative to the diamond-shard, and so when those two shards disagree later the things-which-look-like-diamonds shard wins.
Note that it would not be at all surprising for the AI to have a prior concept of real-diamonds-or-fake-diamonds-which-are-good-enough-to-fool-most-humans, because that is a cluster of stuff which behaves similarly in many places in the real world - e.g. they're both used for similar jewelry.
And sure, you try to kinda patch that by including some correctly-labelled things-which-look-like-diamonds in training, but that only works insofar as they're sufficiently-obviously-not-diamond that the human labeller can tell (and depends on the ratio of correct to incorrect labels, etc).
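The asymmetry between the two error types can be illustrated with a deliberately crude tally. This is a sketch of the claim, not a model of policy gradients: it assumes only two hypothetical shards, that a rewarded event strengthens every shard active in it by one unit, and that unrewarded events change nothing. The shard names, event counts, and update rule are all made up for illustration.

```python
def tally_reinforcement(events):
    """Count how often each shard is active during a rewarded event.

    Each event is (is_real_diamond, looks_like_diamond, rewarded).
    """
    strength = {"diamond": 0, "looks_like_diamond": 0}
    for is_real_diamond, looks_like_diamond, rewarded in events:
        if not rewarded:
            continue  # toy assumption: no reward means no update at all
        if is_real_diamond:
            strength["diamond"] += 1
        if looks_like_diamond:
            strength["looks_like_diamond"] += 1
    return strength


def proxy_lead(strength):
    """How far the looks-like-diamond shard is ahead of the diamond shard."""
    return strength["looks_like_diamond"] - strength["diamond"]


real_rewarded = (True, True, True)
real_missed = (True, True, False)    # error: deserved reward withheld
fake_ignored = (False, True, False)
fake_rewarded = (False, True, True)  # error: convincing fake rewarded

clean = [real_rewarded] * 10 + [fake_ignored] * 10
missed = [real_rewarded] * 7 + [real_missed] * 3 + [fake_ignored] * 10
spurious = [real_rewarded] * 10 + [fake_rewarded] * 3 + [fake_ignored] * 7

for name, events in [("clean", clean), ("missed", missed), ("spurious", spurious)]:
    print(name, tally_reinforcement(events), "proxy lead:", proxy_lead(tally_reinforcement(events)))
```

In this toy, withheld rewards slow both shards equally and leave their relative standing unchanged, while rewards on convincing fakes are the only events that put the looks-like-diamond shard ahead. Whether that lead matters in a real training run is exactly what the rest of the thread disputes.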
(Also, some moderately uncharitable psychologizing, and I apologize if it's wrong: I find it suspicious that the examples of label errors you generated are both of the non-dangerous type. This is a place where I'd expect you to already have some intuition for what kind of errors are the dangerous ones, especially when you put on e.g. your Eliezer hat. That smells like a motivated search, or at least a failure to actually try to look for the problems with your argument.)
I want to talk about several points related to this topic. I don't mean to claim that you were making points directly related to all of the below bullet points. This just seems like a good time to look back and assess and see what's going on for me internally, here. This seems like the obvious spot to leave the analysis.
Kudos for writing all that out. Part of the reason I left that comment in the first place was because I thought "it's Turner, if he's actually motivatedly cognitating here he'll notice once it's pointed out". (And, corollary: since you have the skill to notice when you are motivatedly cognitating, I believe you if you say you aren't. For most people, I do not consider their claims about motivatedness of their own cognition to be much evidence one way or the other.) I do have a fairly high opinion of your skills in that department.
Fair point, that part of my comment probably should have been private. Mea culpa for that.
This doesn't seem dangerous to me. So the agent values both, and there was an event which differentially strengthened the looks-like-diamond shard (assuming the agent could tell the difference at a visual remove, during training), but there are lots of other reward events, many of which won't really involve that shard (like video games where the agent collects diamonds, or text rpgs where the agent quests for lots of diamonds). (I'm not adding these now, I was imagining this kind of curriculum before, to be clear—see the "game" shard.)
So maybe there's a shard with predicates like "would be sensory-perceived by naive people to be a diamond" that gets hit by all of these, but I expect that shard to be relatively gradient starved and relatively complex in the requisite way -> not a very substantial update. Not sure why that's a big problem.
But I'll think more and see if I can't salvage your argument in some form.
I found this annoying.
Not the OP but this jumped out at me:
This failure mode seems plausible to me, but I can think of a few different plausible sequences of events that might occur, which would lead to different outcomes, at least in the shard lens.
These different sequences of events would seem to lead to different conclusions about whether imperfections in the labeling process are fatal.
Yup, that's a valid argument. Though I'd expect that gradient hacking to the point of controlling the reinforcement on one's own shards is a very advanced capability with very weak reinforcement, and would therefore come much later in training than picking up on the actual labelling process (which seems simpler and has much more direct and strong reinforcement).
I expect some form of gradient hacking to be convergently learned much earlier than the details of the labeling process. Online SSL incentivizes the agent to model its own shard activations (so it can better predict future data) and the concept of human value drift ("addiction") is likely accessible from pretraining in the same way "diamond" is.
On the other hand, the agent has little information about the labeling process, I expect it to be more complicated, and not have the convergent benefits of predicting future behavior that reflectivity has.
(You could even argue human error is good here, if it correlates stronger with the human "diamond" abstraction the agent has from pretraining. This probably doesn't extend to the "human values" case we care about, but I thought I'd mention it as an interesting thought.)
(agreed, for the record. I do think the agent can gradient starve the label-shard in story 2, though, without fancy reflective capability.)
Possibly. Though I think it is extremely easy in a context like this. Keeping the diamond-shard in the driver's seat mostly requires the agent to keep doing the things it was already doing (pursuing diamonds because it wants diamonds), rather than making radical changes to its policy.
This proposal looks really promising to me. This might be obvious to everyone, but I think much better interpretability research is really needed to make this possible in a safe(ish) way. (To verify the shard does develop, isn't misaligned, etc.) We'd just need to avoid the temptation to take the fancy introspection and interpretability tools this would require and use them as optimization targets, which would obviously make them useless as safeguards.
The story makes almost no reference to physical properties of diamonds (made of atoms...). I don't see why you can't replace "approach diamond" with "satisfy humans" and tell the same story. Maybe that's your hidden agenda?
Although I don't expect the analogous human alignment story to go OK as written, even conditional on this story going through; we want a range of values from the AI, not just a single one. "Satisfy humans" would probably be bad as the only human-related shard.
Reminder to self: Always read the footnotes.
This gets a lot of points for concreteness, regardless of how likely to work it is. Also, I updated towards shard theory plans working despite my models being different from shard theory, because this plan didn't seem to rely on claims I think are dodgy, e.g. internal game theory. Not too confident in this though because I haven't thought about this much.
The story sounds a lot like the steps parents take to raise a kid: First, you help it navigate and grab things, then you help it learn what things it can safely approach and which are dangerous. Next, you help it build autonomy by making its own plans while you make sure that it learns the right values.
I'm not sure that is intended or even halfway accurate but it matches what I keep saying: AI may need a caregiver.
I don't know much about ML, and I'm a bit confused about this step. How worried are we/should we be about sample efficiency here? It sounds like after pre-training you're growing the diamond shard via a real-world embedded RL agent? Naively this would be pretty performance uncompetitive compared to agents primarily trained in simulated worlds, unless your algorithm is unusually sample efficient (why?). If you aren't performance competitive, then I expect your agent to be outcompeted by stronger AI systems with trainers that are less careful about diamond (or rubies, or staples, or w/e) alignment.
OTOH if your training is primarily simulated, I'd be worried about the difficulty of creating an agent that terminally values real world (rather than simulated) diamonds.
Good question, which I should probably have clarified in the essay. On a similar compute budget, could e.g. an actor-critic in-sim approach reach superintelligence even more quickly? Yeah, probably. The point of this story isn't that this (i.e. SSL+IL+PG RL) is the optimal alignment configuration along (competitiveness, alignability-to-diamonds), but rather I claim that if this story goes through at all, it throws a rock through how we should be thinking about alignment; if this story goes through, one of the simplest, "dumbest", most quickly dismissed ideas (reward agent for good event) can work just fine to superhuman and beyond, in a predictable-to-us way which we can learn more about by looking at current ML.
Note: Even if we have a smart agent which cares about diamonds and knows about value drift, it might "bend to temptation" and drift anyways. I have had several experiences where I thought "don't open this webpage, it will cause value drift in this kind of situation via an unendorsed reward event." Sometimes this thought works. Sometimes it doesn't.
Also, even if the AI can sandbox its future changes and inspect them, not all value drift events will be immediately apparent. For example, maybe the AI undergoes a batch update and the AI-prime would not pursue diamonds if it sees a red object (this is importantly unrealistic but I 70%-expect I could find a better example if I tried). The AI would be vulnerable to these errors if it doesn't have enough mechanistic self-interpretability (I expect it to have at least some). Of course, the AI would probably know about this failure mode and take precautions as well -- this just makes the AI's self-improvement job (at least) a bit harder.
In this shortform, I explain my main confusion with this alignment proposal. The main thing that's unclear to me: what's the idea here for how the agent remains motivated by diamonds even while doing very non-diamond related things like "solving mazes" that are required for general intelligence?
More details in the shortform itself.
I think that was supposed to be answered by this line: