



This is an excerpt from a comment I wrote on the EA forum, extracted and crossposted here by request:

There's a phenomenon where a gambler places their money on 32, and then the roulette wheel comes up 23, and they say "I'm such a fool; I should have bet 23".

More useful would be to say "I'm such a fool; I should have noticed that the EV of this gamble is negative." Now at least you aren't asking for magic lottery powers.

Even more useful would be to say "I'm such a fool; I had three chances to notice that this bet was bad: when my partner was trying to explain EV to me; when I snuck out of the house and ignored a sense of guilt; and when I suppressed a qualm right before placing the bet. I should have paid attention in at least one of those cases and internalized the arguments about negative EV, before gambling my money." Now at least you aren't asking for magic cognitive powers.
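As a rough check on the "negative EV" claim, here is a minimal sketch of the expected value of a one-unit bet on a single number. The wheel sizes (38 pockets for an American wheel, 37 for a European one) and the 35:1 payout are the usual casino conventions rather than anything stated in the comment, and the function name is made up for illustration.

```python
from fractions import Fraction

def single_number_ev(pockets: int = 38, payout: int = 35) -> Fraction:
    """Expected profit per unit staked on one number.

    Assumes a fair wheel with `pockets` equally likely outcomes and a
    win that pays `payout` units of profit per unit staked.
    """
    p_win = Fraction(1, pockets)
    # Win `payout` with probability p_win, lose the 1-unit stake otherwise.
    return p_win * payout - (1 - p_win) * 1

print(single_number_ev())    # American wheel (assumed): -1/19
print(single_number_ev(37))  # European wheel (assumed): -1/37
```

The result is the same whichever number the gambler picks, which is why "I should have bet 23" is not a lesson anyone can act on, while "this bet has negative EV" is.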

My impression is that various EAs respond to crises in a manner that kinda rhymes with saying "I wish I had bet 23", or at best "I wish I had noticed this bet was negative EV", and in particular does not rhyme with saying "my second-to-last chance to do better (as far as I currently recall) was the moment that I suppressed the guilt from sneaking out of the house".

(I think this is also true of the general population, to be clear. Perhaps even more so.)

I have a vague impression that various EAs perform self-flagellation, while making no visible attempt to trace down where, in their own mind, they made a misstep. (Not where they made a good step that turned out in this instance to have a bitter consequence, but where they made a wrong step of the general variety that they could realistically avoid in the future.)

(Though I haven't gone digging up examples, and in lieu of examples, for all I know this impression is twisted by influence from the zeitgeist.)


my original 100:1 was a typo, where i meant 2^-100:1.

this number was in reference to ronny's 2^-10000:1.

when ronny said:

I’m like look, I used to think the chances of alignment by default were like 2^-10000:1

i interpreted him to mean "i expect it takes 10k bits of description to nail down human values, and so if one is literally randomly sampling programs, they should naively expect 1:2^10000 odds against alignment".

i personally think this is wrong, for reasons brought up later in the convo--namely, the relevant question is not how many bits it takes to specify human values relative to the python standard library; the relevant question is how many bits it takes to specify human values relative to the training observations.

but this was before i raised that objection, and my understanding of ronny's position was something like "specifying human values (in full, without reference to the observations) probably takes ~10k bits in python, but for all i know it takes very few bits in ML models". to which i was attempting to reply "man, i can see enough ways that ML models could turn out that i'm pretty sure it'd still take at least 100 bits".

i inserted the hedge "in the very strongest sense" to stave off exactly your sort of objection; the very strongest sense of "alignment-by-default" is that you sample any old model that performs well on some task (without attempting alignment at all) and hope that it's aligned (e.g. b/c maybe the human-ish way to perform well on tasks is the ~only way to perform well on tasks and so we find some great convergence); here i was trying to say something like "i think that i can see enough other ways to perform well on tasks that there's e.g. at least ~33 knobs with at least ~10 settings such that you have to get them all right before the AI does something valuable with the stars".
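for concreteness, here is the arithmetic behind the "~33 knobs with ~10 settings" figure, as a small sketch. it assumes the knobs are independent, that exactly one setting per knob is acceptable, and that every setting is equally likely; these are simplifications made for the illustration (the correlation caveat comes up in the next paragraph), and the function name is hypothetical.

```python
import math

def bits_to_get_all_knobs_right(knobs: int = 33, settings: int = 10) -> float:
    """Bits needed to pick the one right setting on every knob.

    Assumes independent knobs with equally likely settings, so the
    chance of getting all of them right is settings ** -knobs.
    """
    return knobs * math.log2(settings)

bits = bits_to_get_all_knobs_right()
print(f"{bits:.1f} bits")  # 109.6 bits, i.e. odds of roughly 2^-110
```

under those assumptions, 33 ten-way knobs come to a bit more than the "at least 100 bits" mentioned above.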

this was not meant to be an argument that alignment actually has odds less than 2^-100, for various reasons, including but not limited to: any attempt by humans to try at all takes you into a whole new regime; there's more than a 2^-100 chance that there's some correlation between the various knobs for some reason; and the odds of my being wrong about the biases of SGD are greater than 2^-100 (case-in-point: i think ronny was wrong about the 2^-10000 claim, on account of the point about the relevant number being relative to the observations).

my betting odds would not be anywhere near as extreme as 2^-100, and i seriously doubt that ronny's would ever be anywhere near as extreme as 2^-10000; i think his whole point in the 2^-10k example was "there's a naive-but-relevant model that says we're super-duper fucked; the details of it cause me to think that we're not in particularly good shape (though obviously not to that same level of credence)".

but even saying that is sorta buying into a silly frame, i think. fundamentally, i was not trying to give odds for what would actually happen if you randomly sample models, i was trying to probe for a disagreement about the difference between the number of ways that a computer program can be (weighted by length), and the number of ways that a model can be (weighted by SGD-accessibility).

I don't think there's anything remotely resembling probabilistic reasoning going on here. I don't know what it is, but I do want to point at it and be like "that! that reasoning is totally broken!"

(yeah, my guess is that you're suffering from a fairly persistent reading comprehension hiccup when it comes to my text; perhaps the above can help not just in this case but in other cases, insofar as you can use this example to locate the hiccup and then generalize a solution)


Agreed that the proposal is underspecified; my point here is not "look at this great proposal" but rather "from a theoretical angle, risking others' stuff without the ability to pay to cover those risks is an indirect form of probabilistic theft (that market-supporting coordination mechanisms must address)" plus "in cases where the people all die when the risk is realized, the 'premiums' need to be paid out to individuals in advance (rather than paid out to actuaries who pay out a large sum in the event of risk realization)". Which together yield the downstream inference that society is doing something very wrong if they just let AI rip at current levels of knowledge, even from a very laissez-faire perspective.

(The "caveats" section was attempting--and apparently failing--to make it clear that I wasn't putting forward any particular policy proposal I thought was good, above and beyond making the above points.)


In relation to my current stance on AI, I was talking with someone who said they’re worried about people putting the wrong incentives on labs. At various points in that convo I said stuff like (quotes are not exact; third paragraph is a present summary rather than a re-articulation of a past utterance):

“Sure, every lab currently seems recklessly negligent to me, but saying stuff like “we won’t build the bioweapon factory until we think we can prevent it from being stolen by non-state actors” is directionally better than not having any commitments about any point at which they might pause development for any reason, which is in turn directionally better than saying stuff like “we are actively fighting to make sure that the omnicidal technology is open-sourced”.”

And: “I acknowledge that you see a glimmer of hope down this path where labs make any commitment at all about avoiding doing even some minimal amount of scaling until even some basic test is passed, e.g. because that small step might lead to more steps, and/or that sort of step might positively shape future regulation. And on my notion of ethics it’s important to avoid stomping on other people’s glimmers of hope whenever that’s feasible (and subject to some caveats about this being tricky to navigate when your hopes are opposed), and I'd prefer people not stomp on that hope.”

I think that the labs should Just Fucking Stop but I think we should also be careful not to create more pain for the companies that are doing relatively better, even if that better-ness is minuscule and woefully inadequate.

My conversation partner was like “I wish you’d say that stuff out loud”, and so, here we are.


If you allow indirection and don't worry about it being in the right format for superintelligent optimization, then sufficiently-careful humans can do it.

Answering your request for prediction, given that it seems like that request is still live: a thing I don't expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of IJ Good) should do, and have them come up with something like CEV (a la Yudkowsky) or indirect normativity (a la Beckstead) or counterfactual human boxing techniques (a la Christiano) or suchlike.

Note that this is only tangentially a test of the relevant ability; very little of the content of what-is-worth-optimizing-for occurs in Yudkowsky/Beckstead/Christiano-style indirection. Rather, coming up with those sorts of ideas is a response to glimpsing the difficulty of naming that-which-is-worth-optimizing-for directly and realizing that indirection is needed. An AI being able to generate that argument without following in the footsteps of others who have already generated it would be at least some evidence of the AI being able to think relatively deep and novel thoughts on the topic.

Note also that the AI realizing the benefits of indirection does not generally indicate that the AI could serve as a solution to our problem. An indirect pointer to what the humans find robustly-worth-optimizing dereferences to vastly different outcomes than does an indirect pointer to what the AI (or the AI's imperfect model of a human) finds robustly-worth-optimizing. Using indirection to point a superintelligence at GPT-N's human-model and saying "whatever that thing would think is worth optimizing for" probably results in significantly worse outcomes than pointing at a careful human (or a suitable-aggregate of humanity), e.g. because subtle flaws in GPT-N's model of how humans do philosophy or reflection compound into big differences in ultimate ends.

And note for the record that I also don't think the "value learning" problem is all that hard, if you're allowed to assume that indirection works. The difficulty isn't that you used indirection to point at a slow squishy brain instead of hard fast transistors, the (outer alignment) difficulty is in getting the indirection right. (And of course the lion's share of the overall problem is elsewhere, in the inner-alignment difficulty of being able to point the AI at anything at all.)

When trying to point out that there is an outer alignment problem at all I've generally pointed out how values are fragile, because that's an inferentially-first step to most audiences (and a problem to which many people's minds seem to quickly leap), on an inferential path that later includes "use indirection" (and later "first aim for a minimal pivotal task instead"). But separately, my own top guess is that "use indirection" is probably the correct high-level resolution to the problems that most people immediately think of (namely that the task of describing goodness to a computer is an immense one), with of course a devil remaining in the details of doing the indirection properly (and a larger devil in the inner-alignment problem) (and a caveat that, under time-pressure, we should aim for minimal pivotal tasks instead etc.).


I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well

(Insofar as this was supposed to name a disagreement, I do not think it is a disagreement, and don't understand the relevance of this claim to my argument.)

Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for".

Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.

(This seems to me like plausibly one of the sources of misunderstanding, and in particular I am skeptical that your request for prediction will survive it, and so I haven't tried to answer your request for a prediction.)


(I had used that pump that very day, shortly before, to pump up the replacement tire.)


Separately, a friend pointed out that an important part of apologies is the doer showing they understand the damage done, and the person hurt feeling heard, which I don't think I've done much of above. An attempt:

I hear you as saying that you felt a strong sense of disapproval from me; that I was unpredictable in my frustration as kept you feeling (perhaps) regularly on-edge and stressed; that you felt I lacked interest in your efforts or attention for you; and perhaps that this was particularly disorienting given the impression you had of me both from my in-person writing and from private textual communication about unrelated issues. Plus that you had additional stress from uncertainty about whether talking about your apprehension was OK, given your belief (and the belief of your friends) that perhaps my work was important and you wouldn't want to disrupt it.

This sounds demoralizing, and like it sucks.

I think it might be helpful for me to gain this understanding (as, e.g., might make certain harms more emotionally-salient in ways that make some of my updates sink deeper). I don't think I understand very deeply how you felt. I have some guesses, but strongly expect I'm missing a bunch of important aspects of your experience. I'd be interested to hear more (publicly or privately) about it and could keep showing my (mis)understanding as my model improves, if you'd like (though also I do not consider you to owe me any engagement; no pressure).


I did not intend it as a one-time experiment.

In the above, I did not intend "here's a next thing to try!" to be read like "here's my next one-time experiment!", but rather like "here's a thing to add to my list of plausible ways to avoid this error-mode in the future, as is a virtuous thing to attempt!" (by contrast with "I hereby adopt this as a solemn responsibility", as I hypothesize you interpreted me instead).

Dumping recollections, on the model that you want more data here:

I intended it as a general thing to try going forward, in a "seems like a sensible thing to do" sort of way (rather than in a "adopting an obligation to ensure it definitely gets done" sort of way).

After sending the email, I visualized people reaching out to me and asking if I wanted to chat about alignment (as you had, and as feels like a recognizable Event in my mind), and visualized being like "sure but FYI if we're gonna do the alignment chat then maybe read these notes first", and ran through that in my head a few times, as is my method for adopting such triggers.

I then also wrote down a task to expand my old "flaws list" (which was a collection of handles that I used as a memory-aid for having the "ways this could suck" chat, which I had, to that point, been having only verbally) into a written document, which eventually became the communication handbook (there were other contributing factors to that process also).

An older and different trigger (of "you're hiring someone to work with directly on alignment") proceeded to fire when I hired Vivek (if memory serves), and (if memory serves) I went verbally through my flaws list.

Neither the new nor the old triggers fired in the case of Vivek hiring employees, as discussed elsewhere.

Thomas Kwa heard from a friend that I was drafting a handbook (chat logs say this occurred on Nov 30); it was still in a form I wasn't terribly pleased with and so I said the friend could share a redacted version that contained the parts that I was happier with and that felt more relevant.

Around Jan 8, in an unrelated situation, I found myself in a series of conversations where I sent around the handbook and made use of it. I pushed it closer to completion in Jan 8-10 (according to Google doc's history).

The results of that series of interactions, and of Vivek's team's (lack of) use of the handbook caused me to update away from this method being all that helpful. In particular: nobody at any point invoked one of the affordances or asked for one of the alternative conversation modes (though those sorts of things did seem to help when I personally managed to notice building frustration and personally suggest that we switch modes (although lying on the ground--a friend's suggestion--turned out to work better for others than switching to other conversation modes)). This caused me to downgrade (in my head) the importance of ensuring that people had access to those resources.

I think that at some point around then I shared the fuller guide with Vivek's team, but I didn't quickly determine when from the chat logs. Sometime between Nov 30 and Feb 22, presumably.

It looks from my chat logs like I then finished the draft around Feb 22 (where I have a timestamp from me noting as much to a friend). I probably put it publicly on my website sometime around then (though I couldn't easily find a timestamp), and shared it with Vivek's team (if I hadn't already).

The next two MIRI hires both mentioned to me that they'd read my communication handbook (and I did not anticipate spending a bunch of time with them, never mind on technical research), so they both didn't trigger my "warn them" events and (for better or worse) I had them mentally filed away as "has seen the affordances list and the failure modes section".


Thanks <3

(To be clear: I think that at least one other of my past long-term/serious romantic partners would say "of all romantic conflicts, I felt shittiest during ours". The thing that I don't recall other long-term/serious romantic partners reporting is the sense of inability to trust their own mind or self during disputes. (It's plausible to me that some have felt it and not told me.))
