In relation to my current stance on AI, I was talking with someone who said they’re worried about people putting the wrong incentives on labs. At various points in that convo I said stuff like (quotes are not exact; third paragraph is a present summary rather than a re-articulation of a past utterance):
“Sure, every lab currently seems recklessly negligent to me, but saying stuff like ‘we won’t build the bioweapon factory until we think we can prevent it from being stolen by non-state actors’ is directionally better than not having any commitments about any point at which they might pause development for any reason, which is in turn directionally better than saying stuff like ‘we are actively fighting to make sure that the omnicidal technology is open-sourced’.”
And: “I acknowledge that you see a glimmer of hope down this path where labs make any commitment at all about avoiding doing even some minimal amount of scaling until even some basic test is passed, e.g. because that small step might lead to more steps, and/or that sort of step might positively shape future regulation. And on my notion of ethics it’s important to avoid stomping on other people’s glimmers of hope whenever that’s feasible (and subject to some caveats about this being tricky to navigate when your hopes are opposed), and I'd prefer people not stomp on that hope.”
I think that the labs should Just Fucking Stop, but I think we should also be careful not to create more pain for the companies that are doing relatively better, even if that better-ness is minuscule and woefully inadequate.
My conversation partner was like “I wish you’d say that stuff out loud”, and so, here we are.
If you allow indirection and don't worry about it being in the right format for superintelligent optimization, then sufficiently-careful humans can do it.
Answering your request for a prediction, given that it seems like that request is still live: here's a thing I don't expect the upcoming multimodal models to be able to do. Train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of I. J. Good) should do, and have them come up with something like CEV (a la Yudkowsky) or indirect normativity (a la Beckstead) or counterfactual human boxing techniques (a la Christiano) or suchlike.
Note that this is only tangentially a test of the relevant ability; very little of the content of what-is-worth-optimizing-for occurs in Yudkowsky/Beckstead/Christiano-style indirection. Rather, coming up with those sorts of ideas is a response to glimpsing the difficulty of naming that-which-is-worth-optimizing-for directly and realizing that indirection is needed. An AI being able to generate that argument without following in the footsteps of others who have already generated it would be at least some evidence of the AI being able to think relatively deep and novel thoughts on the topic.
Note also that the AI realizing the benefits of indirection does not generally indicate that the AI could serve as a solution to our problem. An indirect pointer to what the humans find robustly-worth-optimizing dereferences to vastly different outcomes than does an indirect pointer to what the AI (or the AI's imperfect model of a human) finds robustly-worth-optimizing. Using indirection to point a superintelligence at GPT-N's human-model and saying "whatever that thing would think is worth optimizing for" probably results in significantly worse outcomes than pointing at a careful human (or a suitable-aggregate of humanity), e.g. because subtle flaws in GPT-N's model of how humans do philosophy or reflection compound into big differences in ultimate ends.
And note for the record that I also don't think the "value learning" problem is all that hard, if you're allowed to assume that indirection works. The difficulty isn't that you used indirection to point at a slow squishy brain instead of hard fast transistors, the (outer alignment) difficulty is in getting the indirection right. (And of course the lion's share of the overall problem is elsewhere, in the inner-alignment difficulty of being able to point the AI at anything at all.)
When trying to point out that there is an outer alignment problem at all I've generally pointed out how values are fragile, because that's an inferentially-first step to most audiences (and a problem to which many people's minds seem to quickly leap), on an inferential path that later includes "use indirection" (and later "first aim for a minimal pivotal task instead"). But separately, my own top guess is that "use indirection" is probably the correct high-level resolution to the problems that most people immediately think of (namely that the task of describing goodness to a computer is an immense one), with of course a devil remaining in the details of doing the indirection properly (and a larger devil in the inner-alignment problem) (and a caveat that, under time-pressure, we should aim for minimal pivotal tasks instead, etc.).
I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well.
(Insofar as this was supposed to name a disagreement, I do not think it is a disagreement, and don't understand the relevance of this claim to my argument.)
Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for".
Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.
(This seems to me like plausibly one of the sources of misunderstanding; in particular, I am skeptical that your request for a prediction will survive it, which is why I haven't tried to answer it.)
Separately, a friend pointed out that an important part of apologies is the doer showing they understand the damage done, and the person hurt feeling heard, which I don't think I've done much of above. An attempt:
I hear you as saying that you felt a strong sense of disapproval from me; that I was unpredictable in my frustration in a way that kept you feeling (perhaps) regularly on-edge and stressed; that you felt I lacked interest in your efforts or attention for you; and perhaps that this was particularly disorienting given the impression you had of me both from my in-person writing and from private textual communication about unrelated issues. Plus that you had additional stress from uncertainty about whether talking about your apprehension was OK, given your belief (and the belief of your friends) that perhaps my work was important and you wouldn't want to disrupt it.
This sounds demoralizing, and like it sucks.
I think it might be helpful for me to gain this understanding (as that might, e.g., make certain harms more emotionally salient in ways that help some of my updates sink deeper). I don't think I understand very deeply how you felt. I have some guesses, but strongly expect I'm missing a bunch of important aspects of your experience. I'd be interested to hear more (publicly or privately) about it and could keep showing my (mis)understanding as my model improves, if you'd like (though also I do not consider you to owe me any engagement; no pressure).
I did not intend it as a one-time experiment.
In the above, I did not intend "here's a next thing to try!" to be read like "here's my next one-time experiment!", but rather like "here's a thing to add to my list of plausible ways to avoid this error-mode in the future, as is a virtuous thing to attempt!" (by contrast with "I hereby adopt this as a solemn responsibility", as I hypothesize you interpreted me instead).
Dumping recollections, on the model that you want more data here:
I intended it as a general thing to try going forward, in a "seems like a sensible thing to do" sort of way (rather than in an "adopting an obligation to ensure it definitely gets done" sort of way).
After sending the email, I visualized people reaching out to me and asking if I wanted to chat about alignment (as you had, and as feels like a recognizable Event in my mind), and visualized being like "sure, but FYI if we're gonna do the alignment chat then maybe read these notes first", and ran through that in my head a few times, as is my method for adopting such triggers.
I then also wrote down a task to expand my old "flaws list" (which was a collection of handles that I used as a memory-aid for having the "ways this could suck" chat, which I had, to that point, been having only verbally) into a written document, which eventually became the communication handbook (there were other contributing factors to that process also).
If memory serves, an older and different trigger (of "you're hiring someone to work with directly on alignment") proceeded to fire when I hired Vivek, and I went verbally through my flaws list.
Neither the new nor the old triggers fired in the case of Vivek hiring employees, as discussed elsewhere.
Thomas Kwa heard from a friend that I was drafting a handbook (chat logs say this occurred on Nov 30); it was still in a form I wasn't terribly pleased with, and so I said the friend could share a redacted version that contained the parts that I was happier with and that felt more relevant.
Around Jan 8, in an unrelated situation, I found myself in a series of conversations where I sent around the handbook and made use of it. I pushed it closer to completion in Jan 8-10 (according to the Google Doc's history).
The results of that series of interactions, and of Vivek's team's (lack of) use of the handbook, caused me to update away from this method being all that helpful. In particular: nobody at any point invoked one of the affordances or asked for one of the alternative conversation modes (though those sorts of things did seem to help when I personally managed to notice building frustration and suggest that we switch modes, although lying on the ground, a friend's suggestion, turned out to work better for others than switching to other conversation modes). This caused me to downgrade (in my head) the importance of ensuring that people had access to those resources.
I think that at some point around then I shared the fuller guide with Vivek's team, but I couldn't quickly determine when from the chat logs. Sometime between Nov 30 and Feb 22, presumably.
It looks from my chat logs like I then finished the draft around Feb 22 (where I have a timestamp from me noting as much to a friend). I probably put it publicly on my website sometime around then (though I couldn't easily find a timestamp), and shared it with Vivek's team (if I hadn't already).
The next two MIRI hires both mentioned to me that they'd read my communication handbook (and I did not anticipate spending a bunch of time with them, never mind on technical research), so neither of them triggered my "warn them" events and (for better or worse) I had them mentally filed away as "has seen the affordances list and the failure-modes section".
Thanks <3
(To be clear: I think that at least one other of my past long-term/serious romantic partners would say "of all romantic conflicts, I felt shittiest during ours". The thing that I don't recall other long-term/serious romantic partners reporting is the sense of inability to trust their own mind or self during disputes. (It's plausible to me that some have felt it and not told me.))
Insofar as you're querying the near future: I'm not currently attempting work collaborations with any new folk, and so the matter is somewhat up in the air. (I recently asked Malo to consider a MIRI-policy of ensuring all new employees who might interact with me get some sort of list of warnings / disclaimers / affordances / notes.)
Insofar as you're querying the recent past: There aren't many recent cases to draw from. This comment has some words about how things went with Vivek's hires. The other recent hires that I recall both (a) weren't hired to do research with me, and (b) mentioned that they'd read my communication handbook (which includes the affordance-list and the failure-modes section, i.e. what I consider to be the critical pieces of warning), which I considered sufficient. (But then I did have communication difficulties with one of them (of the "despair" variety), which updated me somewhat.)
Insofar as you're querying about even light or tangential working relationships (like people asking my take on a whiteboard when I'm walking past), currently I don't issue any warnings in those cases, and am not convinced that they'd be warranted.
To be clear: I'm not currently personally sold on the hypothesis that I owe people a bunch of warnings. I think of them more as a sensible thing to do; it'd be lovely if everyone were building explicit models of their conversational failure-modes and proactively sharing them, and I'm a be-the-change-you-wanna-see-in-the-world sort of guy.
(Perhaps by the end of this whole conversation I will be sold on that hypothesis! I've updated in that direction over the past couple days.)
(To state the obvious: I endorse MIRI institutionally acting according to others' conclusions on that matter rather than on mine, hence asking Malo to consider it independently.)
Do I have your permission to quote the relevant portion of your email to me?
Yep! I've also just reproduced it here, for convenience:
(One obvious takeaway here is that I should give my list of warnings-about-working-with-me to anyone who asks to discuss their alignment ideas with me, rather than just researchers I'm starting a collaboration with. Obvious in hindsight; sorry for not doing that in your case.)
Agreed that the proposal is underspecified; my point here is not "look at this great proposal" but rather "from a theoretical angle, risking others' stuff without the ability to pay to cover those risks is an indirect form of probabilistic theft (that market-supporting coordination mechanisms must address)" plus "in cases where the people all die when the risk is realized, the 'premiums' need to be paid out to individuals in advance (rather than paid out to actuaries who pay out a large sum in the event of risk realization)". Together these yield the downstream inference that society is doing something very wrong if it just lets AI rip at current levels of knowledge, even from a very laissez-faire perspective.
(The "caveats" section was attempting, and apparently failing, to make it clear that I wasn't putting forward any particular policy proposal I thought was good, above and beyond making the above points.)
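For concreteness, here's a minimal sketch of the "premiums paid in advance" arithmetic. The specific numbers (catastrophe probability, per-person valuation, exposed population) are made up purely for illustration and are not part of the original argument; the point is only the structure of the transfer:

```python
# Illustrative sketch (made-up numbers): if an actor imposes a small yearly
# probability of a catastrophe in which the harmed parties do not survive to
# collect damages, after-the-fact insurance payouts can't compensate them, so
# the expected cost must be paid to each exposed person up front.

p_catastrophe = 1e-3      # assumed yearly probability the risk is realized (made up)
value_per_person = 5e6    # assumed value each person places on what's put at risk (made up)
population = 8e9          # people exposed to the risk

# Expected yearly loss imposed on each individual, i.e. the up-front "premium"
# owed to that person for having the risk imposed on them:
premium_per_person = p_catastrophe * value_per_person   # $5,000

# Total yearly ex-ante compensation owed if the risk is imposed on everyone:
total_premiums = premium_per_person * population         # $4e13, i.e. ~$40 trillion

print(f"per-person premium owed in advance: ${premium_per_person:,.0f}/yr")
print(f"total premiums owed in advance: ${total_premiums:,.0f}/yr")
```

However the made-up numbers shake out, the structural point is the same: because no actuary can pay out to the dead, compensation for this kind of risk has to take the form of up-front payments to the people bearing it.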