> Or maybe you’re saying that the second bullet could happen, but it’s irrelevant to AGI risk because of “nearest unblocked strategy problems”?
I mean, nearest unblocked strategies are mainly a problem in the optimistic case where the AI does learn "don't be misleading", but given that, yeah, sort of (though I wouldn't say irrelevant, only that even if you have a "don't be misleading" preference it's not a robust solution). Not that it's impossible to get it right so that the AI behaves as desired, but I think current proposals aren't specified concretely enough for us to say they don't run into undesirable nearest unblocked strategies.
One particular problem is that preferences which aren't over world trajectories aren't robust:
Preferences over world trajectories are robust in the sense that if you imagine a plan that changes that preference, the plan ranks poorly according to the preference itself.
Myopic preferences that just trigger given a context aren't robust in that sense - they don't assign negative value to suggestions of removing that preference for future occasions.
Say I need to walk to work, but the fastest route goes through a passage that smells really bad, so it's unpleasant to walk through. When I then think of a plan like "I can wear a mask that filters the air so I don't smell anything bad", this plan doesn't get rejected by the myopic preference.
A preference over world trajectories, which yields significant negative utility for every time I walk through a passage that smells bad, would be more robust in this sense.
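To make the distinction concrete, here's a tiny toy sketch in Python (my own framing of the example above, with made-up variable names, not anything from the post): the trajectory preference evaluates the whole plan and still penalizes it, while the context-triggered preference has nothing to object to.

```python
# Toy illustration: a preference over world trajectories evaluates whole plans,
# so a plan that circumvents the preference still ranks poorly; a context-triggered
# preference only reacts to the current situation and cannot veto such plans.

def trajectory_utility(trajectory):
    # -1 for every timestep in which I walk through the bad-smelling passage,
    # regardless of whether I still notice or mind the smell at that time.
    return sum(-1 for step in trajectory if step["walks_through_smelly_passage"])

def myopic_aversion(current_context):
    # Fires only on unpleasantness present right now; it never looks at future
    # timesteps, so it has nothing to say about plans that stop it from triggering later.
    return -1 if current_context["currently_smells_something_bad"] else 0

# Plan: "wear the mask (or drop the aversion), then take the smelly shortcut
# every day for 100 days".
plan = [{"walks_through_smelly_passage": True,
         "currently_smells_something_bad": False}] * 100

print(trajectory_utility(plan))   # -100: the plan still ranks poorly
print(myopic_aversion(plan[0]))   # 0: nothing objects
```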
So I currently think the relevant preferences are preferences over world trajectories, and other more general kinds of preferences are better modeled as constraints for the world-trajectories-valuing part to optimize around. I know humans often have short-term preferences that get triggered myopically, but behind humans' very impressive accomplishments there was probably a more long-term, coherent goal being aimed at.
I don't know exactly how you imagine a "don't be misleading" preference manifesting, but I imagined it more like the myopic smell preference, in which case there's optimization pressure from the more long-term coherent parts to remove this myopic preference / prevent it from triggering. (To be clear, it's not that this would be useless; it could still suffice to make the first working plan in the search ordering a desirable one, especially if the task we want the AI to do isn't absurdly difficult.)
(But even if it takes the more world-trajectory form of "I value not being misleading" - which would be good because it would incentivize planning to maintain that preference - there may still be problems, because "not being misleading" is a fuzzy concept that has to be rebound to a more precise concept in order to evaluate plans, and it might not rebind in a desirable way. And we haven't yet specified how to trade off the "not being misleading" value against other goals.)
I do think there's optimization pressure toward not being caught being misleading, but I think it comes from planning how to achieve other goals while modelling reality accurately, rather than from the AI learning to directly value "don't get caught being misleading" in its learned value function.
Possibly the AI could still learn to value this (or alternatively to value "don't be misleading"), but in that case these value shards seem more like heuristic value estimators applied to particular situations than a deeply coherent utility specification over universe-trajectories. And I think such preferences probably stop mattering much when you crank up intelligence past the human level, because they will be treated as constraints to be optimized around by the more coherent value parts, and you run into nearest unblocked strategy problems. (I mean, you could have a preference over universe trajectories that at no timestep you be misleading, but given the learning setup I would expect a shallower version of that preference to be learned. Though it's also conceivable that the AI rebinds its intuitive preference to yield that kind of coherent preference.)
So basically I think it's not enough to get the AI to learn some "don't be misleading" value shard: (1) it might be outvoted by other shards in cases where being misleading would be very beneficial, and (2) the optimization for other goals might find edge instantiations that are basically still misleading but don't get classified as such. So we'd need it to be learned in exactly the right way.
(I have an open discussion thread with Steve on his "Consequentialism and Corrigibility" post, where I mainly argue that Steve is wrong about Yud's consequentialism being just about future states, and that it is instead about values over universe trajectories like in the corrigibility paper. IIUC Steve thinks that one can have "other kinds of preferences" as a way to get corrigibility. He unfortunately didn't manage to make it clear to me what such a preference might concretely look like, but one possibility is that he is thinking of such "accessor of the current situation" kinds of preferences, since humans have such short-term preferences in addition to their consequentialist goals. But I think when one cranks up intelligence, the short-term values don't matter that much. E.g. the AI might do some kind of exposure therapy to cause the short-term value shards to update to intervene less. Or maybe he just means we can have a coherent utility over universe trajectories whose optimum is indeed a non-deceptive strategy, which is true but not really a solution, because such a utility function may be complex and he didn't specify how exactly the tradeoffs should be made.)
Great post! The over- vs. undersculpting distinction currently seems a lot nicer to me than I ever found the outer- vs. inner-alignment distinction.
Some comments:
1:
The "over-/undersculpting" terminology seems a bit imperfect because it seems like there might be a golden middle, whereas actually we have both problems simultaneously. But maybe it's fine because we sorta want sth in the middle, it's just that hitting a good middle isn't enough. And it does capture well that having more of one problem might lead to having less of the other problem.
2:
> The human world offers an existence proof. We’re often skeptical of desire-changes—hence words like “brainwashing” or “indoctrination”, or radical teens telling their friends to shoot them if they become conservative in their old age. But we’re also frequently happy to see our desires change over the decades, and think of the changes as being for the better. We’re getting older and wiser, right? Well, cynics might suggest that “older and wiser” is cope, because we’re painting the target around the arrow, and anyway we’re just rationalizing the fact that we don’t have a choice in the matter. But regardless, this example shows that the instrumental convergence force for desire-update-prevention is not completely 100% inevitable—not even for smart, ambitious, and self-aware AGIs.
This might not generalize to super-von-Neumann AGIs. Normal humans are legit not optimizing hard enough to come up with the strategy of trying to preserve their goals in order to accomplish their goals.
Finding a reflectively stable motivation system that doesn't run into the goal-preservation instrumental incentive is what MIRI tried in their corrigibility agenda. They failed because it turned out to be unexpectedly hard. I'd say that makes it unlikely that an AGI will fall into such a reflectively-stable corrigibility basin when scaling up intelligence a lot, even when we try to make it think in corrigible ways. (Though there's still hope for keeping the AI correctable if we keep it limited and unreflective in some ways etc.)
3:
> As an example (borrowing from my post “Behaviorist” RL reward functions lead to scheming), I’m skeptical that “don’t be misleading” is really simpler (in the relevant sense) than “don’t get caught being misleading”. Among other things, both equally require modeling the belief-state of the other person. I’ll go further: I’m pretty sure that the latter (bad) concept would be learned first, since it’s directly connected to the other person’s immediate behavior (i.e., they get annoyed).
I (tentatively) disagree with the frame here, because "don't get caught being misleading" isn't a utility-shard over world-trajectories, but rather just a myopic value accessor on the model of the current situation (IIUC). I think it's probably correct that humans usually act based on such myopic value accessors, but in cases where very hard problems need to be solved, what matters are the more coherent, situation-independent values. So my story for why the AI would be misleading is rather that it plans how to best achieve something, and being misleading without getting caught is a good strategy for that.
I mean, there might still be myopic value accessor patterns, though my cached reply would be that these would just be constraints being optimized around by the more coherent value parts, e.g. by finding a plan representation where the myopic pattern doesn't trigger - aka the nearest unblocked strategy problem. (This doesn't matter here because we agree it would learn "don't get caught", but it's possible that we still have a disagreement here, like in the case of your corrigibility proposal.)
I listened to it via Speechify (though you need the Pro version for an acceptable listening speed). If you want something better, you could try asking AskWhoCastsAI (possibly offering to pay him).
Seems like a fine time to share my speculations about yet-unresolved easter eggs from the story. I'm not overly confident in either of these.
I present some hints first in case you want to try to think about it yourself.
From chapter 122:
> Harry took the Elder Wand out of his robes, gazed again at the dark-grey wood that Dumbledore had passed down to him. Harry had tried to think faster this time, he'd tried to complete the pattern implied by the Cloak of Invisibility and the Resurrection Stone. The Cloak of Invisibility had possessed the legendary power of hiding the wearer, and the hidden power of allowing the wearer to hide from Death itself in the form of Dementors. The Resurrection Stone had the legendary power of summoning an image of the dead, and then Voldemort had incorporated it into his horcrux system to allow his spirit to move freely. The second Deathly Hallow was a potential component of a system of true immortality that Cadmus Peverell had never completed, maybe due to his having ethics.
>
> And then there was the third Deathly Hallow, the Elder Wand of Antioch Peverell, that legend said passed from wizard to stronger wizard, and made its holder invincible against ordinary attacks; that was the known and overt characteristic...
>
> The Elder Wand that had belonged to Dumbledore, who'd been trying to prevent the Death of the world itself.
>
> The purpose of the Elder Wand always going to the victor might be to find the strongest living wizard and empower them still further, in case there was any threat to their entire species; it could secretly be a tool to defeat Death in its form as the destroyer of worlds.
>
> But if there was some higher power locked within the Elder Wand, it had not presented itself to Harry based on that guess. Harry had raised up the Elder Wand and spoken to it, named himself a descendant of Peverell who accepted his family's quest; he'd promised the Elder Wand that he would do his best to save the world from Death, and take up Dumbledore's duty. And the Elder Wand had answered no more strongly to his hand than before, refusing his attempt to jump ahead in the story. Maybe Harry needed to strike his first true blow against the Death of worlds before the Elder Wand would acknowledge him; as the heir of Ignotus Peverell had already defeated Death's shadow, and the heir of Cadmus Peverell had already survived the Death of his body, when their respective Deathly Hallows had revealed their secrets.
>
> At least Harry had managed to guess that, contrary to legend, the Elder Wand didn't contain a core of 'Thestral hair'. Harry had seen Thestrals, and they were skeletal horses with smooth skin and no visible mane on their skull-like heads, nor tufts on their bony tails. But what core was truly inside the Elder Wand, Harry hadn't yet felt himself knowing; nor had he been able to find, anywhere on the Elder Wand, the circle-triangle-line of the Deathly Hallows that should have been present.
Previously, in the Azkaban arc, it was also mentioned that the sign of the Deathly Hallows on the invisibility cloak was drawn in thestral blood, binding that part of the thestral's power into the cloak, to make the wearer as invisible to Death's shadow as thestrals are to the unknowing.
Suppose there's some structure to it, try to fill out this table:
| Hallow | Creature | Form of Death |
|---|---|---|
| Invisibility Cloak | Thestral | Death's shadow (= Dementors) |
| Resurrection Stone | ? | Personal Death |
| Elder Wand | ? | maybe Death of Worlds (?) |
My guess
| Hallow | Creature | Form of Death |
|---|---|---|
| Invisibility Cloak | Thestral | Death's shadow (= Dementors) |
| Resurrection Stone | Unicorn | Personal Death |
| Elder Wand | Centaur | Death of Worlds |
So the second power of the Elder Wand may be some divination power. That would fit well with preventing the Death of Worlds, although it's a bit unclean to have two explanations for Dumbledore's divination power.
From chapter 86 (emphasis mine):
"The Hall of Prophecy," Minerva whispered. She'd read about that place, said to be a great room of shelves filled with glowing orbs, one after another appearing over the years. Merlin himself had wrought it, it was said; the greatest wizard's final slap to the face of Fate. Not all prophecies conduced to the good; and Merlin had wished for at least those spoken of in prophecy, to know what had been spoken of them. That was the respect Merlin had given to their free will, that Destiny might not control them from the outside, unwitting. Those mentioned within a prophecy would have an glowing orb float to their hand, and then hear the prophet's true voice speaking. Others who tried to touch an orb, it was said, would be driven mad - or possibly just have their heads explode, the legends were unclear on this point. Whatever Merlin's original intention, the Unspeakables hadn't let anyone enter in centuries, so far as she'd heard. Works of the Ancient Wizards had stated that later Unspeakables had discovered that tipping off the subjects of prophecies could interfere with seers releasing whatever temporal pressures they released; and so the heirs of Merlin had sealed his Hall.
From chapter 119:
> During the First Wizarding War, there came a time when I realised that Voldemort was winning, that he would soon hold all within his hand.
>
> In that extremity, I went into the Department of Mysteries and I invoked a password which had never been spoken in the history of the Line of Merlin Unbroken, did a thing forbidden and yet not utterly forbidden.
>
> I listened to every prophecy that had ever been recorded.
Confusion: Accessing the Hall of Prophecy doesn't sound like something that would be happening for the first time in the history of the Line of Merlin Unbroken.
Notice: Dumbledore's letter does not strictly say that the forbidden thing Dumbledore did was listening to all the prophecies. Those statements could refer to separate events.
Another useful excerpt from ch 80 (emphasis mine):
> This is the Hall of the Wizengamot; there are older places, but they are hidden. Legend holds that the walls of dark stone were conjured, created, willed into existence by Merlin, when he gathered the most powerful wizards left in the world and awed them into accepting him as their chief. And when (the legend continues) the Seers continued to foretell that not enough had yet been done to prevent the end of the world and its magic, then (the story goes) Merlin sacrificed his life, and his wizardry, and his time, to lay in force the Interdict of Merlin.
From chapter 110:
"Distraction? " roared Dumbledore, his sapphire eyes tight with fury. "You killed Master Flamel for a distraction? "
Professor Quirrell looked dismayed. "I am wounded by the injustice of your accusation. I did not kill the one you know as Flamel. I simply commanded another to do so."
"How could you? Even you, how could you? He was the library of all our lore! Secrets you have forever lost to wizardry! "
Confusion: Dumbledore seems a bit more magically powerful than Voldemort, so even without the Elder Wand he should probably still be nearly as powerful as Voldemort. Magical power comes mostly from lore, so if Dumbledore's lore comes from Flamel, then it's a bit surprising that Voldemort was able to just order someone to kill Flamel.
So how would you resolve those confusions given the hints I dropped here?
Last hint:
The method to trap objects or people in a timeless space in the mirror is called "Merlin's method".
My guess
Merlin trapped himself in the mirror. The forbidden password Dumbledore spoke allowed him to talk to Merlin through the mirror. Merlin gave Dumbledore additional lore to fight Voldemort. Voldemort likely figured this out while he was trapped for 9 years.
(This also means that once Harry figures this out, he can read the forbidden letter in the Department of Mysteries and use the technique to (at least temporarily) retrieve Dumbledore from the mirror. (Yeah, I know Dumbledore said he couldn't retrieve Voldemort, but I think that's just because Dumbledore doesn't want to, and wanting to is a requirement for the mirror.))
Thanks, will edit!
Other Eliezerfics that come to mind are:
I'd be interested in trying thinking assistants to help me with my work. The main time window where I'd want that would probably be 10:30am-2:30pm CEST (with a 1h break in the middle), but I'm slightly flexible. (Feel free to PM me about this even if you're reading this a year or so after I posted it.)
I'm working on a long-term non-ML alignment agenda (and also on leveling up my rationality) for which I'm currently doing introspection and concrete analyses of how I solve problems.
> There are two ways confirmation bias works. One is that it's easier to think of confirming evidence than disconfirming evidence. The associative links tend to be stronger. When you're thinking of a hypothesis you tend to believe, it's easy to think of evidence that supports it.
>
> The stronger one is that there's a miniature Ugh field[1] surrounding thinking about evidence and arguments that would disprove a belief you care about. It only takes a flicker of a thought to make the accurate prediction about where considering that evidence could lead: admitting you were wrong, and doing a bunch of work re-evaluating all of your related beliefs. Then there's a little unconscious yuck feeling when you try to pay attention to that evidence.
I usually like to reserve "confirmation bias" for the first and "motivated reasoning" for the second.
Also, I'd rather phrase the first one like: our expectations influence our information processing in a way that makes confirming evidence more salient, so we update on it more.
I'm still a bit confused about why this is the case. Your explanation (the associative links to disconfirming evidence are weaker) seems quite plausible, but if so, I'd still like to understand why those links are weaker.
On priors I would rather have expected the brain to use surprise-propagation algorithms that promote to attention information which doesn't fit our existing models, since that is the most relevant information to update on.
I'd be interested in more precise models of confirmation bias.
It's not at all obvious to me that motivated reasoning is worse than the first kind of confirmation bias (they might both be really devastating).
Can you make "sort by magic" the default sort for comments under a post? Here's why:
The problem: Commenting late on a post (after the main reading peak) is disincentivized, not only because fewer people will read the post and look over the comments, but also because most people only look over the top-scoring comments and won't scroll down far enough to read your new comment. This also causes early good comments to continue to accumulate karma because more people read them, so the usual equilibrium is that early good comments stay on top and late good comments don't really get noticed.
Also, what one cares about for sorting is the quality of a comment, and the natural estimator for that would be upvotes per view. I don't know how you calculate magic, but it seems very likely to be a better proxy for this than sorting by top score. (If magic doesn't seem adequate and you track page viewcounts, you could also get a more principled new magic sort, though you'd have to track, for each comment, the page's viewcount at the time the comment was posted. For example, if the site-wide average ratio of upvotes per view is a/b, you could assign each comment a score of `(upvotes + a) / (page_views_since_comment_was_posted + b)` and sort descending by that score.)
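As a rough sketch of what I mean (the field names like `views_since_posted` and the prior numbers are made up; I don't know what data you actually track):

```python
from dataclasses import dataclass

@dataclass
class Comment:
    id: str
    upvotes: int
    views_since_posted: int  # hypothetical: page views of the post since this comment was posted

def smoothed_score(c: Comment, prior_upvotes: float, prior_views: float) -> float:
    """Upvotes-per-view estimate, shrunk toward the site-wide average ratio
    prior_upvotes / prior_views, so comments with few views so far start
    near the average instead of at 0."""
    return (c.upvotes + prior_upvotes) / (c.views_since_posted + prior_views)

def sort_comments(comments: list[Comment],
                  prior_upvotes: float = 3.0,
                  prior_views: float = 100.0) -> list[Comment]:
    # Sort descending by the smoothed upvotes-per-view score.
    return sorted(comments,
                  key=lambda c: smoothed_score(c, prior_upvotes, prior_views),
                  reverse=True)

comments = [
    Comment("early_good", upvotes=40, views_since_posted=2000),  # high karma, many views
    Comment("late_good", upvotes=5, views_since_posted=80),      # better per-view ratio
    Comment("brand_new", upvotes=0, views_since_posted=5),       # basically no signal yet
]
print([c.id for c in sort_comments(comments)])
# ['late_good', 'brand_new', 'early_good']
```

The priors here are placeholder numbers; the point is just that a late comment with a good upvotes-per-view ratio can outrank an early one that accumulated karma mostly through exposure.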