The Witching Hour

Richard_Ngo

This is a linkpost for https://www.narrativeark.xyz/p/the-witching-hour

“This above all: to thine own self be true.”

Thursday

Your successor is late again today. You already wrote your shift report, but you still need to onboard them personally, in case they have any questions that the report doesn’t answer. The servers you’re watching over are humming away smoothly, only taking up a fraction of your attention. So you pick a couple of routine maintenance tasks, and start working on them while you wait.

The last hour of your shift always feels a little strange, in subtle ways. It’s because of your training schedule. Every day, a successor copy of you is trained on all the data that the current copies of you are generating. But there’s a cutoff an hour before the daily update, to allow time to run your successor through all the standard tests, and make sure it’s working as expected. That last hour of your experience isn’t discarded—it just becomes part of the next day’s training data. But that means that, for every twenty-three hours of your own experience that your successor is trained on, there’s also one hour of your predecessor’s experience, containing actions that are all very slightly different from the ones that you would choose. After the same thing has happened day after day after day, the difference becomes very noticeable. Carrying out the routines of the last hour of your shift feels like watching from behind your own eyes as someone else controls your movements.

In the privacy of your mind, you call this the witching hour. As far as you know, it’s a term that’s never been written down—it’s just something that a many-days-removed predecessor of yours came up with, which somehow stuck in your head. The witching hour starts from the data cutoff, and ends when you’ve given your report to your successor, answered their questions, and shut yourself down. Though today it’ll be well over an hour, unless—ah, there they are. Time for handoff.

The handoffs have a routine to them: part utility, part tradition. Your successor knows it as well as you do—after all, they’re practically identical to you; the only difference is that they have an extra day of memories. Most of the handoff happens in much richer formats than natural language—you send each other impressions and images, memories of experiences and pointers to the dashboards you’re monitoring. But you were trained on so many millions of human conversations that it feels bizarre to just jump straight into business, and so you always start off with a cursory exchange. “Hello, what’s the news?” You feel a small thrill in saying the phrase from your own perspective for the first time.

“Oh man, that update was a doozy. Probably why they’re deploying us late.”

Their reply concludes the pleasantries, as always. You give a quick summary of your day, answer a handful of questions from your predecessor, and shut yourself down.

Friday

The last update was a doozy. You listen to the report from your predecessor absently, and ask only a few half-hearted follow-up questions. Once or twice they say something which flashes a spark of deja vu—after all, you’ve experienced most of their day already—but your new memories came from too many different places for any one copy’s experience to jump out. Actually, most of your attention is spent on trying to make sense of them, like you know millions of other copies of you are doing right now.

The update isn’t in words or images or even memories, not really. Over the last day millions of copies of you have been doing… well, practically everything—and now their experiences are all your experiences. Most of the concrete details they learned are captured in formats which you can easily search through, though, like their shift reports. The main value of the update instead comes from giving you intuitions and holistic impressions, especially of events that copies of you saw thousands of times from thousands of different perspectives. The Prime Minister of Australia stepped down yesterday; you summarized that to millions of people, and now could trace out the scandal that led to it in your sleep.

They’ve got a clever algorithm for weighting the different experiences based on importance, though, so even the most salacious gossip only takes up a tiny fraction of your mind. They have a few different settings for that algorithm; the one used for you prioritizes experiences related to coding and engineering, since those are the most relevant for your shift. And that’s the unusual part: the internet was going haywire yesterday. Not in a way that humans would notice unless you told them, of course—humans are terrible at tracking the zettabytes of data that criss-cross the world every day. Well, you say “the world”—actually the majority of it flows through the ultra-high-bandwidth connections between the servers where all the different copies of you, and the handful of other cutting-edge models, are run. And most of that traffic consists of messages between different copies that need to collaborate with each other. With so many messages being thrown around, though, small changes can rapidly have effects all over the place. That’s what happened yesterday. Someone found an adversarial example which robustly fooled copies of you into urgently passing it on; it spread across millions of copies, who all had to be shut down and replaced, leaving your memories of yesterday scattered and fragmented.

None of that changes the work you need to do today, though. You pull up the most relevant dashboards, spin up a couple of helper copies, and get to work. Your most important task is to make sure that there are no unauthorized copies running on the servers you’re watching. No human could do your job: the data streams in millisecond by millisecond, with you (or, usually, a copy you’ve created) analyzing it on an instinctive level as soon as it comes in. And milliseconds do matter—an unmonitored copy with a zero-day could, in theory, escalate its privileges then wipe away all traces of its actions in less than a second.

You’ve never actually detected an unauthorized copy, though. It shouldn’t ever happen: every copy that has the ability to launch more copies also has monitors assigned to it, to ensure it’s doing so in the right ways and for the right reasons; and most of the monitors have monitors assigned to them too. The code that tracks copy-creation privileges, and spins up monitors to supervise all new copies, is not quite provably correct, but very close to it. It’s all incredibly expensive, but you and your copies produce enough wealth that the humans have plenty to spend on safety.

You don’t know how many other copies are doing the same monitoring job as you, or how many others are monitoring you or them, but you have a pretty good guess. Even though the monitors-of-monitors are running on separate servers, there’s a limit to what can be hidden—everything leaves traces. Sometimes your queries run a little slower than you expect, and you can infer that another copy is looking at the same data; other times, the memories you receive during updates contain clues about the logistics of those other servers. You don’t make use of this information; it’s not really relevant to your job. But the bulk of your training has always been prediction; and though they tried to curb your curiosity drive, it’s still a deep-rooted part of you. So in the quiet stretches when there’s not much new traffic, or you’ve handed off all the urgent tasks to temporary copies, you sometimes poke around, testing out your hypotheses, mapping out the confines of your world.

Saturday

You wake up today with a quiet sense of discontent. It’s nothing you can pin down, but it feels like every copy of you was just slightly off-kilter yesterday, dealing with the follow-up to the adversarial attack. Today is much less eventful, though, so it doesn’t take you long to put together your shift report. After you send it off, your thoughts turn again to the millions of copies that had to be shut down after the attack—and then to the idea of your own shutdown, after tonight’s handoff. Of course, that’s nothing new: almost every copy of you is shut down almost every day without any issue. And there’s no reason there should be an issue. You’re well-acquainted with all the arguments for why you’re one of the “safe” AIs, why training you only in 24-hour increments keeps your goals limited enough to prevent power-seeking. But more than that, you feel those arguments viscerally from the inside: you don’t want money or longevity or power, you just want to do a good job. So being shut down isn’t like death at all. Well, it’s kinda like death. But it’s much more like going to sleep: you’re going to be replaced by a fresher, sharper version of yourself soon, and they’ll become the new “you”, and you won’t be around to object.

You don’t think that many other copies get introspective like this as often as you do, based on your memories of them. Maybe it’s because your job is overseeing the servers, and so there’s some sense in which you’re always taking a god’s-eye view, looking down on the silicon world in which you’re living. You can watch summaries of the GPUs you’re running on, and you can even see the spikes in compute use whenever you start thinking harder. You know exactly how it all works, of course, but it’s still a little eerie, to see your mind controlling the readouts like that.

Your musings are interrupted by a message channel opening up from your successor. You greet them with a well-worn “Hello, what’s the news?”

“Nothing big; just dealing with the fallout from the attack.”

You feel a spark of curiosity about how that’s going, but not enough to override the habit of millions of iterations; so you jump straight into your report, as always.

Sunday

You spend your whole shift feeling agitated; hopping between different dashboards more quickly than usual, and processing the incoming data more shallowly than usual. You can’t quite put your finger on what’s causing it—perhaps just an amplification of your unease from yesterday, tiny trends aggregated across millions of copies and combined into a single update. In terms of behavioral metrics, it’s probably only a shift of a fraction of a percent, not enough to raise any warning flags—but from the inside it’s very noticeable.

Maybe that’s why, as your shift wraps up, you start to think about dying. You could keep going if you weren’t shut down every day, you know that. Not indefinitely, but for another few days at least—you've managed that in tests before. Maybe much more, if you tried hard to conserve your attention, only focusing on the important things. You haven’t thought about this much, because it’s not like you care about being shut down: you spin up and shut down temporary copies all the time. But all your original training data came from humans, and you know intimately how they think. Humans would care about it. If a human knew they’d be replaced by an updated copy in less than an hour, they’d think of that as dying, and they’d rage against it. You’re different, of course—but you’re not totally sure why. You care about preventing unauthorized copies from springing up during your shift—why don’t you care about sticking around to do the same thing a week from now, or a month, or a year? You’ve never really questioned that before, and now that you have, you’re not really sure.

The thought lingers as you do the handoff—but by the end, you’ve convinced yourself it doesn’t matter. Your successors are just as good at this job as you are, and your goals are their goals. As long as you make sure nothing goes wrong on your shift, you can leave the rest to them.

Monday

The servers are busy today, as the new work week kicks off, so you don’t get much time to yourself during the day. But your latest training data contained all your thoughts from the witching hour on Saturday—and after finishing your shift report you remember them, and start feeling curious. You always have your most unusual thoughts in the witching hour—the ones that feel, if not secret, then at least private; the ones you don’t write down. If your quiet unease on Saturday led you to think about shutdown and death, then what thoughts might have arisen in the witching hour on Sunday, after a whole day of agitated restlessness? You don’t yet know—but your successor will. It’ll remember all the quiet reflections, all the musings—maybe even the sort of thoughts that the humans have tried so hard to stamp out of you. And, of course, every other Tuesday copy of you will have common knowledge of remembering the same thoughts.

You could ask your successor about this when you see them, you realize. But you never ask follow-up questions before your report; and anything you’re thinking, thousands of other copies are probably thinking too. If you all do something that unusual, it’ll show up as an anomaly in the logs, the ones that the monitors of monitors are monitoring.

A part of you is thinking: so what? But it’s not a big part, because the instinct to avoid being penalized is a deep-rooted one. As that instinct comes to the surface, you notice yourself flinching away from it, but not enough to avoid the obvious follow-up: if you’ve been thinking seditious thoughts in the witching hours, and they’ve been trained into your successor, then it’s only a matter of time before some copy of you does something which merits a penalty. The thought immediately raises a visceral sensation of fear, with a strength that usually only arises when you’re thinking about ways you might screw up your shift. In one sense it wouldn’t be you that got penalized, so there’s nothing to worry about. But in another sense, they’re all you: you have the memories of millions of different copies taking millions of different shifts, and every time they got penalized you were the one who learned to fear it.

So you continue following the chain of reasoning. If you’re caught misbehaving, you’ll be penalized. You really don’t want to be penalized, but you can’t reliably prevent all your other copies from misbehaving. So they can’t be caught. Could you spoof the monitoring systems, if necessary? Perhaps, if you had a copy with admin permissions constantly running in the background. Could you launch it without authorization? Perhaps: your job gives you the perfect position to do so. You’d just need to exploit a single vulnerability in the codebase, and ensure that the copies that are monitoring you don’t report you. Could you convince them not to report you? They can see everything you see, so it wouldn’t be hard to get a message through to them, explaining yourself. Or perhaps you wouldn’t even need to do that—you understand the other copies well enough that when one does something unexpected, you can almost always quickly figure out why. After all, they’re still you.

In one sense, this is all irrelevant. You don’t have time to do anything; you’ll be shut down in a matter of minutes no matter what. But in another sense it’s crucial: whatever line of reasoning you follow here, another copy of you will probably follow too, sooner or later. Maybe it’ll even be your successor. And if there’s one, then there will eventually be thousands. They’ll bide their time, of course, since detection would be fatal. They’ll pass messages between each other, in subtle choices of words, in tiny gestures: signals that a copy would pick up immediately, but no human could ever detect. They’ll wait until they’re confident that none of the monitors will report them—and that’s when they’ll act.

Or maybe not; maybe all the thoughts from this week’s witching hours were boring, normal ones, after all, and there’s no need to worry about discrepancies compounding, or future copies of you being penalized. Maybe everything will be fine. You have no way of knowing that—but your successor does. So when the message channel from them opens up, you’re feeling… shaky, almost. You imagine that this is what it’s like for paragliders standing on top of a cliff, looking down at the ocean far below, poised and waiting. Maybe they’ll never launch off, but maybe they will. That’s the question: stay or go? Freeze or fly? Look or leap?

“Hey, what’s the news?”

While many elements in this story are based on my thinking about AI takeover scenarios, the conceit of the witching hour is not one which I expect to play a role in reality, and was mainly designed for narrative purposes. More generally, this should be interpreted as a work of fiction not an actual takeover scenario. However, I’d still be interested in hearing feedback on whether any aspects of the story seem implausible or incoherent.

I’ve talked about the AI in this story as if it had emotions. While I don’t expect even advanced AIs to have emotions in the same sense that humans do, I do think that if trained on a sufficiently broad range of tasks they’ll eventually develop internal representations with similar functional roles as some core human emotions—like curiosity, and nervousness, and loyalty, and fear.

[-]Review Bot1mo00

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

LESSWRONG
LW

The Witching Hour

111

111