I'm recognizing a lot of the terms you're using, but there seems to be a supposition about my model that's so different from my actual model that I can't decode it. My best guess is that the productive thing is to zoom out and clarify my Actual Position in more detail, instead of arguing single points (which would lead you to make other assumptions that don't quite square with my actual model, which is the big failure mode I'm trying to avoid here). To the extent that your aim is to better understand my model (which is very nearby, but not synonymous with, the models of other MIRI staff), this looks like the best path forward to me. Hopefully along the way we locate some cruxes (and I'd like it if you also helped guess at the root of our disagreement/misunderstanding).
At a high level, I don't think it's particularly useful to talk about 'alignment' with respect to smaller and more specialized systems, since it invites conflating qualitatively distinct cases.
For any system with a very small number of outputs (e.g. [chess piece] to [board position] small, and plausibly several OOMs larger than that in absolute number), it is trivially easy to verify the safety of the range of available outputs, since you can simply generate them all in advance, check whether they're liable to kill all humans, and move on. A key reason that I think alignment is hard is that the range of outputs of properly general systems is so large that all possible outputs in all possible deployment settings can't possibly be hand-verified. There are reasons to think you can verify groups or categories of outputs bundled together, but so far verifying the safety of the space of all possible outputs of powerful systems is not doable (my impression is that RAT and LAT were gestures in this direction, and that some of the ELK stuff is inspired by related concerns, but I'm no authority!).
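To make the asymmetry concrete, here's a minimal sketch (in Python; `is_catastrophic`, `exhaustively_verify`, and the chess-flavored output space are hypothetical stand-ins I'm introducing for illustration, not anything from The Problem or MIRI) of what "check every possible output in advance" amounts to when the output space is small, and why it stops being an option for general systems:

```python
# Illustrative sketch only: for a system whose output space is small enough to
# enumerate, "verification" can in principle be brute force over all outputs.
from itertools import product

PIECES = ["K", "Q", "R", "B", "N", "P"]
SQUARES = [f"{file}{rank}" for file in "abcdefgh" for rank in range(1, 9)]

def is_catastrophic(output) -> bool:
    # Placeholder safety check; writing a real check is the hard part, but for a
    # [piece, square]-sized output space it at least terminates quickly.
    return False

def exhaustively_verify(outputs) -> bool:
    """Return True iff every possible output passes the safety check."""
    return not any(is_catastrophic(o) for o in outputs)

small_output_space = list(product(PIECES, SQUARES))  # 6 * 64 = 384 outputs
print(len(small_output_space), exhaustively_verify(small_output_space))

# For a properly general system, the analogous space looks more like
# (vocabulary size ** sequence length) across all deployment contexts, which
# cannot be enumerated, let alone hand-checked -- that asymmetry is the point.
```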
I think this explanation constitutes me answering in the affirmative to (the spirit of) your first question. Let me know if that seems right.
For the theorem-prover case, the details of the implementation really matter quite a lot, and so the question is underspecified from my perspective. (My guess is the vast majority of theorem-provers are basically safe, even under large optimization pressure, but I haven't looked into it and invite someone who's thought more about that case to chime in; there are definitely approaches I could imagine for creating a theorem prover that might risk dangerous generalization, I'm just not sure how central those approaches are in practice.)
And the only reason AIs will be goal-oriented to a dangerous extent, is because people will intentionally make them like that, despite obvious risks?
This is not my position, and is not the position of anyone at MIRI who I've spoken with on the topic.
When I was reading "The Problem", I was sure that goal-oriented AI was seen as inevitable for some reason deeper than "Goal-oriented behavior is economically useful".
This is a correct reading and I don't understand what in my initial reply gave you some other impression. My best guess is that you've conflated generality with goal-pursuing. Chess systems are safe because they're not very general (i.e., they have a very small action space), not because they aren't pursuing goals.
I'd still like to argue that "goal-oriented" is not a simple concept,
Agreed.
and it's not trivial to produce a goal-oriented agent even if you try
In conversations I've seen about this, we usually talk about how 'coherent' an agent is, as a way of describing how robustly it pursues its objective, whatever that objective may be. If what you mean is something like "contemporary general systems do not pursue their goals especially robustly, and it may be hard to make improvement on this axis," I agree.
not all useful agents are goal-oriented.
I think I disagree here, but I don't know how to frame that disagreement without more detail. Feel free to offer it if you feel there's more to talk about, and I'll do my best to continue engaging. I should acknowledge that I don't really understand what you mean by 'goal-oriented' or how it differs from my conception, which I'm hesitant to elaborate on for now in the spirit of avoiding further confusion.
-
My best guess is that you thought the chess example was attempting to illustrate more things than it actually is. The chess example, as I recall, is a response to two common objections:
These are just existence proofs; if the AI can perform with superhuman competence at a game with n variables (like chess), then it seems plausible that AIs could eventually, in principle, perform with superhuman competence at a game with 2n, 100n, or 1e80 variables.
And I do personally believe, that EY and many others believe, that with enough optimization, even a chess bot should become dangerous. Not sure if there is any evidence for that belief.
I work at MIRI, worked on The Problem, and have never heard anyone express this belief.[1] Brendan is correct about the intention of that passage.
There is no way to make a training environment as complex as the real world.
It's unclear that this is needed; e.g., the AI2027 story where you train coders that help you train scientists that help you build ASI.
Still, virtual environments for RL are a huge market right now; people are, indeed, currently trying a more modest version of this thing you claim is impossible. Of course, these aren't literally 'as complex as the real world', but it's not clear what fidelity you'd need to reach particular capability thresholds. IIRC this is the importance of work on, e.g., multi-level world models and Markov blankets: better understanding what fidelity you need, in which portions of your conception of the world, in order to meet a given end.
If someone were to chime in and say they believe this, my guess is that they'd get there by abusing the category 'chess bot'; e.g., ChatGPT is kind of a chess bot in that it's a bot that can play chess, even though it's the product of a very different training regime than one would ever sensibly use to create a chess bot on purpose.
I’m not trying to silence anything. I have really just requested ~1 hour of effort (and named it as that previously).
You’re hyperbolizing my gestures and making selective calls for rigor.
Meta: I hope to follow a policy of mostly ignoring you in the future, in this thread and elsewhere. I suggest allocating your energy elsewhere.
Thank you for this very kind comment! I would like to talk in more detail about what was going on for me here, because while your assumptions are kindly framed, they're not quite accurate, and I think understanding a bit more about how I'm thinking about this might help.
The issue is not that I can't easily think of things that look relevant/useful to me on this topic; the issue is that the language you're using to describe the phenomenon is so different from the language used to describe it in the past that I would be staking the credibility of my caution entirely on whether you were equipped to recognize nearby ideas in an unfamiliar form — a form against which you already have some (justified!) bias. That's why it would be so much work! I can't know in advance if the Buddhist or Freudian or IFS or DBT or CBT or MHC framing of this kind of thing would immediately jump out to you as clearly relevant, or would help demonstrate the danger/power in the idea, much less equip you with the tools to talk about it in a manner that was sensitive enough by my lights.
So recommending asking ChatGPT wasn't just lazily pointing at the lowest hanging fruit; the Conceptual-Rounding-Error-Generator would be extremely helpful in offering you a pretty quick survey of relevant materials by squinting at your language and offering a heap of nearby and not-so-nearby analogs. You could then pick the thing that you thought was most relevant or exciting, read a bit about it, and then look into cautions related to that idea (or infer them yourself), then generalize back to your own flavor of this type of thinking.
It's simply not instructive or useful for me to try to cram your thought into my frame and then insist you think about it This Specific Way. Instead, noticing that all (or most) past-plausibly-related-thoughts (and, in particular, the thoughts that you consider nearest to your own) come with risks and disclaimers would naturally inspire you to take the next step and do the careful, sensitive thing in rendering the idea.
This is a hard dynamic to gesture at, and I did try to get it across earlier, but the specific questions I was being asked (and felt obligated to reply to) felt like attempts at taking shortcuts that misunderstood the situation as something much simpler (e.g. 'William could just tell me what to look at but he's being lazy and not doing it' or 'William actually doesn't have anything in mind and is just being mean for no reason').
Hence my resorting to behaving unreasonably / embarrassing myself as a way of sending a more costly signal. I did try to keep this from being outright discouraging, and hoped that continuing to respond would send some signal of 'I'm invested in this going well and not just bidding to shut you down outright.'
I think you should think more about this idea, and get more comfortable with the shittier parts of connecting your ideas to broader conversations.
You’re generalizing to the point of absurdity, WAY outside the scope of the object-level point being discussed. Also ‘is good form’ is VERY far short of ‘obligated’.
Someone requested input on their idea and I recommended some reading because the idea is pretty stakes-y / hard to do well, and now you’re holding me liable for your maliciously broad read of a subthread and accusing me of attempting to ‘wield power over others’? Are you serious? What are the levers of my power here? What threat has been issued?
I’m going out on a limb to send a somewhat costly signal that this idea, especially, is worth taking seriously and treating with care, and you’re just providing further cost for my trouble.
I didn't mean to imply the full 'optimal amount of fraud is non-zero' frame. I do mean an amount above that, and typed 'non-zero' hastily.
I wouldn't support a policy of "let's just roll back our shrinkage enforcement and let people steal more food". I would instead support giving poorer people something like expanded EBT food stamp credits.
I support what gets the people fed. Supporting an option that isn't really on the table (because serious proposals for doing welfare well aren't being actively debated and implemented) doesn't do anyone any good. I am not talking about a perfect world; I am looking at the world we are in and locating a tiny intervention/frame shift that might marginally improve things.
I am not experiencing suffering or claiming to experience suffering; I am illustrating that the labor requested of me is >>> more expensive for me to perform than the labor I am requesting instead, and asking for some good faith. I find this a psychologically invasive and offensive suggestion on your part.
I mean, yes? If you want someone to do something that they wouldn't otherwise do, you need to persuade them. How could it be otherwise?
In cases where convincing is >>> more costly than complying with the request, it's good form to comply (indeed, defending this has already been more expensive for me than checking for pre-existing work would have been for the OP!).
"Dramatically corrosive to societal trust" feels like a wild overstatement. Fair evasion is and has been a norm (at least in the US) for approximately as long as public transit has existed, and it doesn't seem like it's meaningfully accelerated (except in periods of lax enforcement, which my guess is would reach an equilibrium somewhere) or meaningfully (let alone dramatically!) damaged social trust.
I buy parts of Kelsey's argument, but I think there's a bait and switch (at least between her opening example and the fare evasion example) where we start out talking about how people who can't afford food are incentivized to lie and then end up talking about how poor people shouldn't be allowed to use public transit. I claim the maxim should be sensitive to the context and would like to entertain the notion that services with near-zero marginal cost should tolerate some non-zero amount of fare evasion.
Fair. I was simply wondering whether or not you had something to back up your claim that this topic has been covered "quite extensively".
The thing that backs it up is you looking literally at all. Anything that I suggest may not hit on the particular parts of the (underspecified) idea that are most salient to you and can therefore easily be dismissed out of hand. This results in a huge asymmetry of effort between me locating/recommending/defending something I think is relevant and you spending a single hour looking in the direction I pointed and exploring things that seem most relevant to you.
I would like to be clear that I do not intend to claim that Newcomblike suffering is fake in any way. Suffering is a subjective experience. It is equally real whether it comes from physical pain, emotional pain, or an initially false belief that quickly becomes true. Hopefully posting it in a place like Lesswrong will keep it mostly away from the eyes of those who will fail to see this point.
I am indifferent to the content of what you intend to claim! This is a difficult topic to broach in a manner that doesn't license people to do horrible things to themselves and others. The point I'm making isn't that you are going to intentionally do something bad; it is that I know this minefield well and would like to make you aware that it is, in fact, a minefield!
The LessWrong audience is not sanctified as the especially psychologically robust few. Ideas do bad things to people, and more acutely so here than in most places (e.g. Ziz, Roko). If you're going to write a guide to a known minefield, maybe learn a thing or two about it before writing the guide.
I again ask though, how would a literature review help at all?
You are talking about something closely related to things a bunch of other people have talked about before you. Maybe one of them had something worthwhile to say, and maybe it's especially important to investigate that when someone is putting their time into warning you that this topic is dangerous. Like, I obviously expected a fight when posting my initial comment, and I'm getting one, and I'm putting a huge amount of time into just saying over and over again "Please oh my god do not just pull something out of your ass on this topic and encourage others to read it, that could do a lot of damage, please even look in the direction of people who have previously approached this idea with some amount of seriousness." And somehow you're still just demanding that I justify this to you? I am here to warn you! Should I stand on my head? Should I do a little dance? Should I Venmo you $200?
Like, what lever could I possibly pull to get you to heed the idea that some ideas, especially ideas around topics like suffering and hyperstition, can have consequences for those exposed to them, and these can be subtle or difficult to point at, and you should genuinely just put any effort at all into investigating the topic rather than holding my feet to the fire to guess at which features are most salient to you and then orient an argument about the dangers in a manner that is to your liking?
I'm not sure how to feel about this general attitude towards posting. I think with most things I would rather err on the side of posting something bad; I think a lot of great stuff goes unwritten because people's standards on themselves are too high.
Doesn't apply when there are real dangers associated with a lazy treatment of a topic. Otherwise I just agree.
Beyond this, I think it's the readers' responsibility to avoid content that will harm them or others.
They will not know! It is your responsibility to frame the material in a way that surfaces its utility while minimizing its potential for harm. This is not a neutral topic that can be presented in a flat, neutral, natural, obvious way. It is charged, it is going to be charged, which sides are shown will be a choice of the author, and so far it looks like you're content to lackadaisically blunder into that choice and blame others for tripping over landmines you set out of ignorance.
Again, I am a giant blinking red sign outside the suffering cave telling you 'please read the brochure before entering the suffering cave to avoid doing harm to others,' and you are making it my responsibility to convince you to read the brochure. From my perspective, you are a madman with hostages and a loaded gun! From your perspective, ignorant of the underspecified risks, I am wildly over-reacting. But you don't know that you have a gun, and I am expensively penning a comment liable to receive multiple [Too Combative?] reacts because it is the most costly signal I know how to send along this channel. Please, dear god, actually look into it before publishing this post, and just try to see why these are ideas someone might think it's worth being careful with!
The Problem is intended for a general audience (e.g., not LW users). I assure you people make precisely these objections, very often.