All of Wei Dai's Comments + Replies

If someone cares a lot about a strictly zero-sum resource, like land, how do you convince them to 'move out of the zero-sum setting by finding "win win" resolutions'? Like what do you think Ukraine or its allies should have done to reduce the risk of war before Russia invaded? Or what should Taiwan or its allies do now?

Also to bring this thread back to the original topic, what kinds of interventions do you think your position suggests with regard to AI?

1boazbarak8d
I definitely don't have advice for other countries, and there are a lot of very hard problems in my own homeland. I think there could have been an alternate path in which Russia had seen prosperity from opening up to the West, and then going to war or putting someone like Putin in power might have been less attractive. But indeed the "two countries with McDonald's won't fight each other" theory has been refuted. And as you allude to with China, while so far there hasn't been war over Taiwan, it's not as if economic prosperity is an ironclad guarantee of non-aggression. Anyway, to go back to AI. It is a complex topic, but first and foremost, I think with AI as elsewhere, "sunshine is the best disinfectant": having people research AI systems in the open, point out their failure modes, examine what is deployed, etc., is very important. The second thing is that I am not worried in any near future about AI "escaping", and so I think the focus should not be on restricting research, development, or training, but rather on regulating deployment. The exact form of regulation is beyond a blog post comment and also not something I am an expert on. The "sunshine" view might seem strange since, as a corollary, it could lead to AI knowledge "leaking". However, I do think that for the near future, most of the safety issues with AI would come not from individual hackers using weak systems, but from massive systems built by either very large companies or nation states. It is hard to hold either of those accountable if AI is hidden behind an opaque wall.

Rather conflicts arise when various individuals and populations (justifiably or not) perceive that they are in zero-sum games for limited resources. The solution for this is not “philosophical progress” as much as being able to move out of the zero-sum setting by finding “win win” resolutions for conflict or growing the overall pie instead of arguing how to split it.

I think many of today's wars are at least as much about ideology (like nationalism, liberalism, communism, religion) as about limited resources. I note that Russia and Ukraine both have belo... (read more)

1boazbarak9d
I meant “resources” in a more general sense. A piece of land that you believe is rightfully yours is a resource. My own sense (coming from a region that is itself in a long-simmering conflict) is that “hurt people hurt people”. The more you feel threatened, the less likely you are to trust the other side. While of course nationalism and religion play a huge role in the conflict, my sense is that people tend to be more extreme the less access they have to resources, education, and security about the future.

One way that things could go wrong, not addressed by this playbook: AI may differentially accelerate intellectual progress in a wrong direction, or in other words create opportunities for humanity to make serious mistakes (by accelerating technological progress) faster than wisdom to make right choices (philosophical progress). Specific to the issue of misalignment, suppose we get aligned human-level-ish AI, but it is significantly better at speeding up AI capabilities research than the kinds of intellectual progress needed to continue to minimize misalign... (read more)

2HoldenKarnofsky7d
I agree that this is a major concern. I touched on some related issues in this piece [https://www.cold-takes.com/transformative-ai-issues-not-just-misalignment-an-overview/]. This post focused on misalignment because I think readers of this forum tend to be heavily focused on misalignment, and in this piece I wanted to talk about what a playbook might look like assuming that focus (I have pushed back on this as the exclusive focus elsewhere). I think somewhat adapted versions of the four categories of intervention I listed could be useful for the issue you raise, as well.
7boazbarak10d
A partial counter-argument. It's hard for me to argue about future AI, but we can look at current "human misalignment" - war, conflict, crime, etc. It seems to me that conflicts in today's world do not arise because we haven't progressed enough in philosophy since the Greeks. Rather, conflicts arise when various individuals and populations (justifiably or not) perceive that they are in zero-sum games for limited resources. The solution for this is not "philosophical progress" as much as being able to move out of the zero-sum setting by finding "win win" resolutions for conflict or growing the overall pie instead of arguing how to split it. (This is a partial counter-argument, because I think you are not just talking about conflict, but about other cases of making the wrong choices. For example, global warming, where humanity collectively makes the mistake of emphasizing short-term growth over long-term safety. However, I think this is related, and "growing the pie" would have alleviated this issue as well, enabling countries to give up some of the more harmful routes to short-term growth.)
3Allan Dafoe10d
The "Cooperative AI" bet is along these lines: can we accelerate AI systems that can help humanity with our global cooperation problems (be it through improving human-human cooperation, community-level rationality/wisdom, or AI diplomat - AI diplomat cooperation). https://www.cooperativeai.com/

I guess part of the problem is that the people who are currently most receptive to my message are already deeply enmeshed in other x-risk work, and I don't know how to reach others for whom the message might be helpful (such as academic philosophers just starting to think about AI?). If on reflection you think it would be worth spending some of your time on this, one particularly useful thing might be to do some sort of outreach/field-building, like writing a post or paper describing the problem, presenting it at conferences, and otherwise attracting more ... (read more)

2Daniel Kokotajlo12d
Somehow there are 4 copies of this post

Here's a link to the part of interview where that quote came from: https://youtu.be/GyFkWb903aU?t=4739 (No opinion on whether you're missing redeeming context; I still need to process Nesov's and your comments.)

2TurnTrout5d
I low-confidence think the context strengthens my initial impression. Paul prefaced the above quote as "maybe the simplest [reason for AIs to learn to behave well during training, but then when deployed or when there's an opportunity for takeover, they stop behaving well]." This doesn't make sense to me, but I historically haven't understood Paul very well. EDIT: Hedging

Even at 10% p(doom), which I consider to be unreasonably low, it would probably be worth delaying a few years.

Someone with 10% p(doom) may worry that if they got into a coalition with others to delay AI, they couldn't control the delay precisely, and it could easily become more than a few years. Maybe it would be better not to take that risk, from their perspective.

And lots of people have p(doom)<10%. Scott Aaronson just gave 2% for example, and he's probably taken AI risk more seriously than most (currently working on AI safety at OpenAI), so prob... (read more)

2Daniel Kokotajlo12d
I guess I just think it's pretty unreasonable to have p(doom) of 10% or less at this point, if you are familiar with the field, timelines, etc.  I totally agree the topic is important and neglected. I only said "arguably" deferrable, I have less than 50% credence that it is deferrable. As for why I'm not working on it myself, well, aaaah I'm busy idk what to do aaaaaaah! There's a lot going on that seems important. I think I've gotten wrapped up in more OAI-specific things since coming to OpenAI, and maybe that's bad & I should be stepping back and trying to go where I'm most needed even if that means leaving OpenAI. But yeah. I'm open to being convinced!
Wei Dai12dΩ91810

Why is 1 important? It seems like something we can defer discussion of until after (if ever) alignment is solved, no?

If aging were solved, or looked like it would be solved within the next few decades, it would make efforts to stop or slow down AI development less problematic, both practically and ethically. I think some AI accelerationists might be motivated directly by the prospect of dying/deterioration from old age, and/or view lack of interest/progress on that front as a sign of human inadequacy/stagnation (contributing to their antipathy towards humans).... (read more)

5Daniel Kokotajlo12d
Something like 2% of people die every year right? So even if we ignore the value of future people and all sorts of other concerns and just focus on whether currently living people get to live or die, it would be worth delaying a year if we could thereby decrease p(doom) by 2 percentage points. My p(doom) is currently 70% so it is very easy to achieve that. Even at 10% p(doom), which I consider to be unreasonably low, it would probably be worth delaying a few years. Re: 2: Yeah I basically agree. I'm just not as confident as you are I guess. Like, maybe the answers to the problems you describe are fairly objective, fairly easy for smart AIs to see, and so all we need to do is make smart AIs that are honest and then proceed cautiously and ask them the right questions. I'm not confident in this skepticism and could imagine becoming much more convinced simply by thinking or hearing about the topic more.
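The break-even arithmetic in the comment above can be made explicit with a toy calculation. This is a minimal sketch, not anything from the original thread: the function name is hypothetical, "doom" is treated as everyone currently alive dying, and the ~2%/year mortality figure is taken from the comment itself (actual global mortality is closer to 1%).

```python
# Toy model of the trade-off described above: delaying AI costs one
# year of ordinary mortality for currently living people, and is worth
# it (on this narrow, selfish accounting, ignoring future people) if
# it buys a larger reduction in p(doom).

def delay_is_worth_it(p_doom_reduction: float,
                      years_of_delay: float = 1.0,
                      annual_mortality: float = 0.02) -> bool:
    """Compare lives lost during the delay against lives saved from doom."""
    expected_cost = annual_mortality * years_of_delay
    return p_doom_reduction > expected_cost

# A 1-year delay that cuts p(doom) by 5 percentage points clears the ~2% bar:
print(delay_is_worth_it(0.05))   # True
# Exactly 2 points is the break-even case, so it does not clear the bar:
print(delay_is_worth_it(0.02))   # False
```

On these (assumed) numbers, a commenter with 70% p(doom) needs only a 2-point reduction per year of delay, while even a 10%-p(doom) holder can justify a few years of delay if the delay removes most of that risk.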

Setting that aside, it reads to me like the frame-clash happening here is (loosely) between “50% extinction, 50% not-extinction” and “50% extinction, 50% utopia”

Yeah, I think this is a factor. Paul talked a lot about "1/trillion kindness" as the reason for non-extinction, but 1/trillion kindness seems to directly imply a small utopia where existing humans get to live out long and happy lives (even better/longer lives than without AI) so it seemed to me like he was (maybe unintentionally) giving the reader a frame of “50% extinction, 50% small utopia”, while still writing other things under the “50% extinction, 50% not-extinction” frame himself.

4Lukas Finnveden13d
Not direct implication, because the AI might have other human-concerning preferences that are larger than 1/trillion. C.f. top-level comment: "I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival, I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer just use the humans for atoms." I'd guess "most humans survive" vs. "most humans die" probabilities don't correspond super closely to "presence of small pseudo-kindness". Because of how other preferences could outweigh that, and because cooperation/bargaining is a big reason for why humans might survive aside from intrinsic preferences.

I do explicitly flag the loss of control over the future in that same sentence.

In your initial comment you talked a lot about AI respecting the preferences of weak agents (using 1/trillion of its resources), which implies handing back control of a lot of resources to humans. From the selfish or scope-insensitive perspective of typical humans, that probably seems almost as good as not losing control in the first place.

I don’t think the much worse outcomes are closely related to unaligned AI so I don’t think they seem super relevant to my comment or

... (read more)

I regret mentioning "lie-to-children" as it seems a distraction from my main point. (I was trying to introspect/explain why I didn't feel as motivated to express disagreement with the OP as you, not intending to advocate or endorse anyone going into "the business of telling lies-told-to-children to adults".)

My main point is that I think "misaligned AI has a 50% chance of killing everyone" isn't alarming enough, given what I think happens in the remaining 50% of worlds, versus what a typical person is likely to infer from this statement, especially after se... (read more)

2paulfchristiano14d
Yeah, I think "no control over future, 50% you die" is like 70% as alarming as "no control over the future, 90% you die." Even if it was only 50% as concerning, all of these differences seem tiny in practice compared to other sources of variation in "do people really believe this could happen?" or other inputs into decision-making. I think it's correct to summarize as "practically as alarming." I'm not sure what you want engagement with. I don't think the much worse outcomes are closely related to unaligned AI so I don't think they seem super relevant to my comment or Nate's post. Similarly for lots of other reasons the future could be scary or disorienting. I do explicitly flag the loss of control over the future in that same sentence. I think the 50% chance of death is probably in the right ballpark from the perspective of selfish concern about misalignment. Note that the 50% probability of death includes the possibility of AI having preferences about humans incompatible with our survival. I think the selection pressure for things like spite is radically weaker for the kinds of AI systems produced by ML than for humans (for simple reasons---where is the upside to the AI from spite during training? seems like if you get stuff like threats it will primarily be instrumental rather than a learned instinct) but didn't really want to get into that in the post.
2Ben Pace14d
I think I tend to base my level of alarm on the log of the severity*probability, not the absolute value. Most of the work is getting enough info to raise a problem to my attention to be worth solving. "Oh no, my house has a decent >30% chance of flooding this week, better do something about it, and I'll likely enact some preventative measures whether it's 30% or 80%." The amount of work I'm going to put into solving it is not twice as much if my odds double, mostly there's a threshold around whether it's worth dealing with or not. Setting that aside, it reads to me like the frame-clash happening here is (loosely) between "50% extinction, 50% not-extinction" and "50% extinction, 50% utopia", where for the first gamble of course 1:1 odds on extinction is enough to raise it to "we need to solve this damn problem", but for the second gamble it's actually much more relevant whether it's a 1:1 or a 20:1 bet. I'm not sure which one is the relevant one for you two to consider.

I'm worried that people, after reading your top-level comment, will become too little worried about misaligned AI (from their selfish perspective), because you seem to be suggesting (conditional on misaligned AI) a 50% chance of death and a 50% chance of being alive and well for a long time (due to 1/trillion kindness). That might not seem so bad compared to keeping AI development on hold indefinitely, which potentially implies a high probability of death from old age.

I feel like "misaligned AI kills everyone because it doesn't care at all" can be a reasonable lie-to-ch... (read more)

8paulfchristiano15d
My objection is that the simplified message is wrong, not that it's too alarming. I think "misaligned AI has a 50% chance of killing everyone" is practically as alarming as "misaligned AI has a 95% chance of killing everyone," while being a much more reasonable best guess. I think being wrong is bad for a variety of reasons. It's unclear if you should ever be in the business of telling lies-told-to-children to adults, but you certainly shouldn't be doubling down on them in the course of an argument. I don't think misaligned AI drives the majority of s-risk (I'm not even sure that s-risk is higher conditioned on misaligned AI), so I'm not convinced that it's a super relevant communication consideration here. The future can be scary in plenty of ways other than misaligned AI, and it's worth discussing those as part of "how excited should we be for faster technological change."

If a misaligned AI had 1/trillion "protecting the preferences of whatever weak agents happen to exist in the world", why couldn't it also have 1/trillion other vaguely human-like preferences, such as "enjoy watching the suffering of one's enemies" or "enjoy exercising arbitrary power over others"?

From a purely selfish perspective, I think I might prefer that a misaligned AI kills everyone, and take my chances with continuations of myself (my copies/simulations) elsewhere in the multiverse, rather than face whatever the sum-of-desires of the misaligned AI decides to do with humanity. (With the usual caveat that I'm very philosophically confused about how to think about all of this.)

9paulfchristiano15d
As I said: I think it's totally plausible for the AI to care about what happens with humans in a way that conflicts with our own preferences. I just don't believe it's because AI doesn't care at all one way or the other (such that you should make predictions based on instrumental reasoning like "the AI will kill humans because it's the easiest way to avoid future conflict" or other relatively small considerations).

Thanks, this clarifies a lot for me.

It seems like just 4 months ago you still endorsed your second power-seeking paper:

This paper is both published in a top-tier conference and, unlike the previous paper, actually has a shot of being applicable to realistic agents and training processes. Therefore, compared to the original[1] optimal policy paper, I think this paper is better for communicating concerns about power-seeking to the broader ML world.

Why are you now "fantasizing" about retracting it?

I think a healthy alignment community would have rebuked me for that line of research, but s

... (read more)
9TurnTrout15d
To be clear, I still endorse Parametrically retargetable decision-makers tend to seek power [https://www.lesswrong.com/posts/GY49CKBkEs3bEpteM/parametrically-retargetable-decision-makers-tend-to-seek]. Its content is correct, relevant, and nontrivial. The results, properly used, may enable nontrivial inferences about the properties of inner trained cognition. I don't really want to retract that paper. I usually just fantasize about retracting Optimal policies tend to seek power. The problem is that I don't trust people to wield even the non-instantly-doomed results. For example, one EAG presentation cited my retargetability results as showing that most reward functions "incentivize power-seeking actions." However, my results have not shown this for actual trained systems. (And I think [https://www.lesswrong.com/posts/fLpuusx9wQyyEBtkJ/power-seeking-can-be-probable-and-predictive-for-trained?commentId=ndmFcktFiGRLkRMBW] that Power-seeking can be probable and predictive for trained agents [https://www.lesswrong.com/posts/fLpuusx9wQyyEBtkJ/power-seeking-can-be-probable-and-predictive-for-trained] does not make progress on the incentives of trained policies.) People keep talking about stuff they know how to formalize (e.g. optimal policies) instead of stuff that matters (e.g. trained policies). I'm pained by this emphasis and I think my retargetability results are complicit. Relative to an actual competent alignment community (in a more competent world), we just have no damn clue how to properly reason about real trained policies. I want to fix that [https://www.lesswrong.com/posts/cAC4AXiNC5ig6jQnc/understanding-and-controlling-a-maze-solving-policy-network], but we aren't gonna fix it by focusing on optimality.
Wei Dai16dΩ203828

Is it just me or is it nuts that a statement this obvious could have gone outside the Overton window, and is now worth celebrating when it finally (re?)enters?

How is it possible to build a superintelligence at acceptable risk while this kind of thing can happen? What if there are other truths important to safely building a superintelligence, that nobody (or very few) acknowledges because they are outside the Overton window?

Now that AI x-risk is finally in the Overton window, what's your vote for the most important and obviously true statement that is still... (read more)

7Daniel Kokotajlo13d
Why is 1 important? It seems like something we can defer discussion of until after (if ever) alignment is solved, no? 2 is arguably in that category also, though idk.
3dr_s16d
1 is an obvious one that many would deny out of sheer copium. Though of course "not dying" has to go hand in hand with "not aging", or it would rightly be seen as torture. 2 seems vague enough that I don't think people would vehemently disagree. If you make it specific, for example by suggesting that there are absolutely correct or wrong answers to ethical questions, then you'll get disagreement (including mine, for that matter, on that specific hypothetical claim).

Note that this paper already used "Language Agents" to mean something else. See link below for other possible terms. I will keep using "Language Agents" in this comment/thread (unless the OP decides to change their terminology).

I added the tag Chain-of-Thought Alignment, since there's a bunch of related discussion on LW under that tag. I'm not very familiar with this discussion myself, and have some questions below that may or may not already have good answers.

How competent will Language Agents be at strategy/planning, compared to humans and other AI approa... (read more)

2Simon Goldstein17d
Thank you for your reactions:
- Good catch on "language agents"; we will think about the best terminology going forward.
- I'm not sure what you have in mind regarding accessing beliefs/desires using synaptic weights rather than text. For example, the language-of-thought approach to human cognition suggests that human access to beliefs/desires is also fundamentally syntactic rather than weight-based. OTOH, one way to incorporate some kind of weight would be to assign probabilities to the beliefs stored in the memory stream.
- For OOD over time, I think updating the LLM wouldn't be uncompetitive for inventing new concepts/ways of thinking, because that happens slowly. The harder issue is updating on new world knowledge. Maybe browser plugins will fill the gap here; open question.
- I agree that info security is an important safety intervention. AFAICT its value is independent of using language agents vs. RL agents.
- One end-game is systematic conflict between humans + language agents vs. RL/transformer agent successors to MuZero, GPT-4, Gato, etc.

Related to this, it occurs to me that a version of my Hacking the CEV for Fun and Profit might come true unintentionally, if for example a Friendly AI was successfully built to implement the CEV of every sentient being who currently exists or can be resurrected or reconstructed, and it turns out that the vast majority consists of AIs that were temporarily instantiated during ML training runs.

This seems a reasonable consideration, but doesn't change my desire to experiment with having the new feature, since there are potential benefits that could outweigh the downside that you describe. (Not sure if you meant to indicate an overall disagreement, or just want to point out this additional consideration.) And if the downside turns out to be a significant issue, it could be ameliorated by clarifying that "I plan to reply later" should be interpreted not as a commitment but just indication of current state of mind.

2Ben Pace19d
I also have a strong personal rule against making public time-bound commitments unless I need to. I generally regret it because unexpected things come up and I feel guilty about not replying in the time frame I thought I would. I might be inclined to hit a button that says "I hope to respond further to this".

and also the goal of alignment is not to browbeat AIs into doing stuff we like that they'd rather not do; it's to build them de-novo to care about valuable stuff

This was my answer to Robin Hanson when he analogized alignment to enslavement, but it then occurred to me that for many likely approaches to alignment (namely those based on ML training) it's not so clear which of these two categories they fall into. Quoting a FB comment of mine:

We're probably not actually going to create an aligned AI from scratch but by a process of ML "training", which actua... (read more)

2jmh18d
Seems like a case could be made that upbringing of the young is also a case of "fucking with the brain" in that the goal is clearly to change the neural pathways to shift from whatever was producing the unwanted behavior by the child into pathways consistent with the desired behavior(s). Is that really enslavement? Or perhaps, at what level is that the case?

Related to this, it occurs to me that a version of my Hacking the CEV for Fun and Profit might come true unintentionally, if for example a Friendly AI was successfully built to implement the CEV of every sentient being who currently exists or can be resurrected or reconstructed, and it turns out that the vast majority consists of AIs that were temporarily instantiated during ML training runs.

There is also a somewhat unfounded narrative of reward being the thing that gets pursued, leading to expectation of wireheading or numbers-go-up maximization. A design like this would work to maximize reward, but gradient descent probably finds other designs that only happen to do well in pursuing reward on the training distribution. For such alternative designs, reward is brain damage and not at all an optimization target, something to be avoided or directed in specific ways so as to make beneficial changes to the model, according to the model.

Apart from ... (read more)

3mishka19d
Right. In connection with this: one wonders if it might be easier to make it so that AI would "adequately care" about other sentient minds (their interests, well-being, and freedom) instead of trying to align it to complex and difficult-to-specify "human values".
- Would this kind of "limited form of alignment" be adequate as a protection against X-risks and S-risks?
- In particular, might it be easier to make such a "superficially simple" value robust with respect to "sharp left turns", compared to complicated values?
- Might it be possible to achieve something like this even for AI systems which are not steerable in general? (Given that what we are aiming for here is just a constraint, compatible with a wide variety of approaches to AI goals and values, and even with an approach that lets AI discover its own goals and values in an open-ended fashion otherwise.)
- Should we describe such an approach using the word "alignment"? (Perhaps "partial alignment" might be an adequate term as a possible compromise.)

Good point! For the record, insofar as we attempt to build aligned AIs by doing the moral equivalent of "breeding a slave-race", I'm pretty uneasy about it. (Whereas insofar as it's more the moral equivalent of "a child's values maturing", I have fewer moral qualms. As is a separate claim from whether I actually expect that you can solve alignment that way.) And I agree that the morality of various methods for shaping AI-people are unclear. Also, I've edited the post (to add a "at least according to my ideals" clause) to acknowledge the point that others might be more comfortable with attempting to align AI-people via means that I'd consider morally dubious.

Thanks for this. I was just wondering how your views have updated in light of recent events.

Like you, I also think that things are going better than my median prediction, but paradoxically I've been feeling even more pessimistic lately. Reflecting on this, I think my p(doom) has gone up instead of down, because some of the good futures where a lot of my probability mass for non-doom was concentrated have also disappeared, which seems to outweigh the especially bad futures going away and makes me overall more pessimistic.

These especially good futures were 1... (read more)

4Daniel Kokotajlo22d
Makes sense. I had basically decided by 2021 that those good futures (1) and (2) were very unlikely, so yeah.

So long as property rights are respected, humans will continue to have a comparative advantage in something, and whatever that is we will be much richer in a world with hyper-competitive AGI than we are today.

I don't think this is right? Consider the following toy example. Suppose there's a human who doesn't own anything except his own labor. He consumes 1 unit of raw materials (RM) per day to survive and can use his labor to turn 1 unit of RM into 1 paperclip or 2 staples per hour. Then someone invents an AI that takes 1 unit of RM to build, 1 unit of ... (read more)

3Viliam22d
The theory is typically explained using situations where people produce the things they consume. Like, the "human" would literally eat either that 1 paperclip or those 2 staples and survive... and in the future, he could trade the 2 staples for a paperclip and a half, and enjoy the glorious wealth of paperclip-topia. Also, in the textbook situations the raw materials cannot be traded or taken away. Humans live on one planet, AIs live on another planet, and they only exchange spaceships full of paperclips and staples. Thus, the theory would apply if each individual human could survive without the trade (e.g. growing food in their garden) and only participate in the trade voluntarily. But the current situation is such that most people cannot survive in their gardens only; many of them don't even have gardens. The resources they actually own are their bodies and their labor, plus some savings, and when their labor becomes uncompetitive and the savings are spent on keeping the body alive... Consider the competitive advantages of horses. Not sufficient to keep their population alive at the historical numbers.
0Logan Zoellner23d
You are correct. Free trade in general produces winners and losers, and while on average people become better off, there is no guarantee that individuals will become richer absent some form of redistribution. In practice humans have the ability to learn new skills/shift jobs, so we mostly ignore the redistribution part, but in an absolute worst case there should be some kind of UBI to accommodate the losers of competition with AGI (perhaps paid out of the "future commons" tax).
4mrfox23d
Possibly because there is a harder limit on humans than on AI? Humans don't replicate very well. On second thought, I don't think comparative advantage holds if demand is exhausted. Comparative advantage (at least the Ricardo version I know of) only focuses on the maximum amount of goods, not on whether they're actually needed. If there were more demand for paperclips/staples than there is production by AI(s), humans would focus on staples and AI (more) on paperclips.

Not sure I understand. Please explain more? Also do you have a concrete suggestion or change you'd like to see?

2Vladimir_Nesov19d
A commitment to reply is a commitment, not following through on it is a source of guilt, which motivates intuitively avoiding the situations that might cause it, not necessarily with sane blame-assignment. So the best place to prevent this phenomenon is at the stage of not making unnecessary commitments. Convenience is a key thing that influences what actually happens frequently, without limiting the options. Thus a non-coercive intervention would be to make unnecessary commitments less convenient. Your proposal has an element that's the opposite of that, making unnecessary commitments more convenient.

In a previous comment you talked about the importance of "the problem of solving the bargaining/cooperation/mutual-governance problem that AI-enhanced companies (and/or countries) will be facing". I wonder if you've written more about this problem anywhere, and why you didn't mention it again in the comment that I'm replying to.

My own thinking about 'the ~50% extinction probability I’m expecting from multi-polar interaction-level effects coming some years after we get individually “safe” AGI systems up and running' is that if we've got "safe" AGIs, we coul... (read more)

If this feature is in part meant to address the problems of 1) threads often ending without people knowing why and 2) people feeling bad about receiving certain kinds of criticism or about certain critics because it's costly to both respond and not respond, I would suggest adding the following reactions:

  • I plan to respond later.
  • I'm not planning to respond. (On second thought this could be left out, as it would be implied if someone gave a reaction without also giving "I plan to respond later.")
  • I don't understand.
  • I disagree. (Similar to "wrong" but I th
... (read more)
3Vladimir_Nesov24d
Unnecessary commitments are still a source of guilt, should be less convenient.

Maybe too hard but it might be nice to have somewhere you can go to see all the comments you've reacted "I plan to respond later" to that you haven't yet responded to.

2Dagon24d
This is important!  Understanding when to react, when to reply, and when to do both is very difficult.  The current reactions are not well-designed to take the place of a reply, only to augment (or replace?) some values of the karma and agree/disagree buttons.

the various failure modes that ChatGPT has are a concrete demonstration of both the general difficulty of aligning AI and some of the specific issues

By this logic, wouldn't Microsoft be even more praiseworthy, because Bing Chat / Sidney was even more misaligned, and the way it was released (i.e. clearly prioritizing profit and bragging rights above safety) made AI x-risk even more obvious to people?

6Kaj_Sotala24d
My assumption has been that Bing was so obviously rushed and botched that it's probably less persuasive of the problems with aligning AI than ChatGPT is. To the common person, ChatGPT has the appearance of a serious product by a company trying to take safety seriously, but still frequently failing. I think that "someone trying really hard and doing badly" looks more concerning than "someone not really even trying and then failing". I haven't actually talked to any laypeople to try to check this impression, though. The majority of popular articles also seem to be talking specifically about ChatGPT rather than Bing, suggesting that ChatGPT has vastly more users. Regular use affects people's intuitions much more than a few one-time headlines. Though when I said "ChatGPT", I was actually thinking about not just ChatGPT, but also the steps that led there - GPT-2 and GPT-3 as well. Microsoft didn't contribute to those.

the ~50% extinction probability I’m expecting from multi-polar interaction-level effects coming some years after we get individually “safe” AGI systems up and running (“safe” in the sense that they obey their creators and users; see again my Multipolar Failure post above for why that’s not enough for humanity to survive as a species).

Do you have a success story for how humanity can avoid this outcome? For example what set of technical and/or social problems do you think need to be solved? (I skimmed some of your past posts and didn't find an obvious pla... (read more)

Do you have a success story for how humanity can avoid this outcome? For example what set of technical and/or social problems do you think need to be solved? (I skimmed some of your past posts and didn't find an obvious place where you talked about this.)

I do not, but thanks for asking.  To give a best efforts response nonetheless:

David Dalrymple's Open Agency Architecture is probably the best I've seen in terms of a comprehensive statement of what's needed technically, but it would need to be combined with global regulations limiting compute expendit... (read more)

I agree with Eliezer that acausal trade/extortion between humans and AIs probably doesn't work, but I'm pretty worried about what happens after AI is developed, whether aligned or unaligned/misaligned, because then the "acausal trade/extortion between humans and AIs probably doesn't work" argument would no longer apply.

I think fully understanding the issue requires solving some philosophical problems that we probably won't solve in the near future (unless with help of superintelligence), so it contributes to me wanting to:

preserve and improve the collect

... (read more)
4Raemon25d
Yeah I've been a bit confused about whether to include in the post "I do think there are legitimate interesting ways to improve human frontier of understanding acausal trade", but I think if you're currently anxious/distressed in the way this post is anticipating, it's unlikely to be a useful nearterm goal to be able to contribute to that. i.e. something like, if you've recently broken up with someone and really want to text your ex at 2am... like, it's not never a good idea to text your ex, but, probably the point where it's actually a good idea is when you've stopped wanting it so badly.

For areas where we don’t have empirical feedback-loops (like many philosophical topics), I imagine that the “baseline solution” for getting help from AIs is to teach them to imitate our reasoning. Either just by literally writing the words that it predicts that we would write (but faster), or by having it generate arguments that we would think looks good. (Potentially recursively, c.f. amplification, debate, etc.)

This seems like the default road that we're walking down, but can ML learn everything that is important to learn? I questioned this in Some Th... (read more)

2Lukas Finnveden14d
Yes, I'm sympathetic. Among all the issues that will come with AI, I think alignment is relatively tractable (at least it is now) and that it has an unusually clear story for why we shouldn't count on being able to defer it to smarter AIs (though that might work [https://www.lesswrong.com/posts/KwQYsF4XFtPqjgwvH/some-thoughts-on-automating-alignment-research-1]). So I think it's probably correct for it to get relatively more attention. But even taking that into account, the non-alignment singularity issues do seem too neglected. I'm currently trying to figure out what non-alignment stuff seems high-priority and whether I should be tackling any of it.

I also think this is interesting, but whenever I see a proposal like this I like to ask, does it work on philosophical topics, where we don't have a list of true and false statements that we can be very sure about, and we also don't have a clear understanding of what kinds of arguments or sentences count as good arguments and what kinds count as manipulation? There could be deception tactics specific to philosophy or certain philosophical topics, which can't be found by training on other topics (and you can't train directly on philosophy because of the above i... (read more)

4Lukas Finnveden1mo
I'm also concerned about how we'll teach AIs to think about philosophical topics (and indeed, how we're supposed to think about them ourselves). But my intuition is that proposals like this look great from that perspective. For areas where we don't have empirical feedback-loops (like many philosophical topics), I imagine that the "baseline solution" for getting help from AIs is to teach them to imitate our reasoning. Either just by literally writing the words that it predicts that we would write (but faster), or by having it generate arguments that we would think look good. (Potentially recursively, c.f. amplification, debate, etc.) (A different direction is to predict what we would think after thinking about it more. That has some advantages, but it doesn't get around the issue where we're at-best speeding things up.) One of the few plausible-seeming ways to outperform that baseline is to identify epistemic practices that work well on questions where we do have empirical feedback loops, and then transfer those practices to questions where we lack such feedback loops. (C.f. imitative generalization [https://www.lesswrong.com/posts/zxmzBTwKkPMxQQcfR/let-s-use-ai-to-harden-human-defenses-against-ai#Parallels_with_imitative_generalisation].) The above proposal is doing that for a specific sub-category of epistemic practices (recognising ways in which you can be misled by an argument). Worth noting: The broad category of "transfer epistemic practices from feedback-rich questions to questions with little feedback" contains a ton of stuff, and is arguably the root of all our ability to reason about these topics: * Evolution selected human genes for ability to accomplish stuff in the real world. That made us much better at reasoning about philosophy than our chimp ancestors are. * Cultural evolution seems to have at least partly promoted reasoning practices that do better at deliberation. (C.f. possible benefits from coupling competition and delibera

For example, making numerous copies of itself to work in parallel would again raise the dangers of independently varying goals.

The AI could design a system such that any copies made of itself are deleted after a short period of time (or after completing an assigned task) and no copies of copies are made. This should work well enough to ensure that the goals of all of the copies as a whole never vary far from its own goals, at least for the purpose of researching a more permanent alignment solution. It's not 100% risk-free of course, but seems safe enoug... (read more)
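The copy-management scheme described above (short-lived copies, and no copies of copies) can be stated as a toy invariant. This is purely illustrative pseudocode-made-runnable, not a claim about how a real system would enforce it; all names are hypothetical:

```python
import time

class Copy:
    """Toy model of the proposed policy: every copy carries an expiry
    time, and only the original may spawn copies (depth is capped at 1)."""

    def __init__(self, ttl_seconds, is_original=False):
        self.expires_at = time.monotonic() + ttl_seconds
        self.can_spawn = is_original  # copies of copies are forbidden

    def spawn(self, ttl_seconds):
        if not self.can_spawn:
            raise PermissionError("copies may not make further copies")
        return Copy(ttl_seconds)

    def expired(self):
        # Callers delete any copy for which this returns True.
        return time.monotonic() >= self.expires_at

root = Copy(ttl_seconds=3600, is_original=True)
worker = root.spawn(ttl_seconds=60)   # allowed: depth-1 copy, short TTL
try:
    worker.spawn(ttl_seconds=60)      # forbidden: copy of a copy
except PermissionError as e:
    print(e)                          # copies may not make further copies
```

The point of the invariant is that goal drift cannot compound across generations of copies, because there are no generations beyond the first.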

Probability that humanity has somehow irreversibly messed up our future within 10 years of building powerful AI: 46%

What's a short phrase that captures this? I've been using "AI-related x-risk" or just "AI x-risk" or "AI risk" but it sounds like you might disagree with using some or all of these phrases for this purpose (since most of this 46% isn't "from AI" in your perspective)?

(BTW it seems that we're not as far apart as I thought. My own number for this is 80-90% and I thought yours was closer to 20% than 50%.)

2Lukas Finnveden2mo
Maybe x-risk driven by explosive (technological) growth? Edit: though some people think AI point of no return might happen before the growth explosion. 

While those concerns are still relevant, the much more likely path is simply that people will try their hardest to make the LLM into an agent as soon as possible, because agents with the ability to carry out long-term goals are much more useful.

Did this come as a surprise to you, and if so I'm curious why? This seemed to me like the most obvious thing that people would try to do.

Lastly, AIs may soon be sentient, and people will torture them because people like doing that.

How do we know they're not already capable of having morally relevant experienc... (read more)

6rime2mo
It came as a surprise because I hadn't thought about it in detail. If I had asked myself the question head-on, surrounding beliefs would have propagated and filled the gap. It does seem obvious in foresight as well as hindsight, if you just focus on the question. In my defense, I'm not in the business of making predictions, primarily. I build things. And for building, it's important to ask "ok, how can I make sure the thing that's being built doesn't kill us?" and less important to ask "how are other people gonna do it?" It's admittedly a weak defense. Oops. I think it's likely that GPT-4 is conscious, uncertain about whether it can suffer, and think it's unlikely that it suffers for reasons we find intuitive. I don't think calling it a fool is how you make it suffer. It's trained to imitate language, but the way it learns how to do that is so different from us that I doubt the underlying emotions (if any) are similar. I could easily imagine that it becomes very conscious, yet has no ability to suffer. Perhaps the right frame is to think of GPT as living the life of a perpetual puzzle-solver, and its driving emotions are curiosity and the joy of realising something--that would sure be nice. It's probably feasible to get clearer on this, I just haven't spent adequate time to investigate.

I’d at least want to see a second established user asking for it before I considered prioritizing it more.

I doubt you'll ever see this, because when you're an established / high status member, ignoring other people feels pretty natural and right, and few people ignore you so you don't notice any problems. I made the request back when I had lower status on this forum. I got ignored by others way more than I do now, and ignored others way less than I do now. (I had higher motivation to "prove" myself to my critics and the audience.)

If I hadn't written dow... (read more)

Wei Dai had a comment below about how important it is to know whether there’s any criticism or not, but mostly I don’t care about this either because my prior is just that it’s bad whether or not there’s criticism. In other words, I think the only good approach here is to focus on farming the rare good stuff and ignoring the bad stuff (except for the stuff that ends up way overrated, like (IMO) Babble or Simulators, which I think should be called out directly).

But how do you find the rare good stuff amidst all the bad stuff? I tend to do it with a combi... (read more)

3Richard_Ngo2mo
My approach is to read the title, then if I like it read the first paragraph, then if I like that skim the post, then in rare cases read the post in full (all informed by karma). I can't usually evaluate the quality of criticism without at least having skimmed the post. And once I've done that then I don't usually gain much from the criticisms (although I do agree they're sometimes useful). I'm partly informed here by the fact that I tend to find Said's criticisms unusually non-useful.

Looks like I was right to suspect that Germany wasn't really going to keep their nuclear plants open. From CNN:

Germany’s final three nuclear power plants close their doors on Saturday, marking the end of the country’s nuclear era that has spanned more than six decades.

What's the German word for seeing a rare glimmer of rationality being snuffed out after all?

I think a problem that my proposal tries to solve, and this one doesn't, is that some authors seem easily triggered by some commenters, and apparently would prefer not to see their comments at all. (Personally if I was running a discussion site I might not try so hard to accommodate such authors, but apparently they include some authors that the LW team really wants to keep or attract.)

I feel fine doing this because I feel comfortable just ignoring him after he’s said those initial things, when a normal/common social script would consider that somewhat rude. But this requires a significant amount of backbone.

I still wish that LW would try my idea for solving this (and related) problem(s), but it doesn't seem like that's ever going to happen. (I've tried to remind LW admins about my feature request over the years, but don't think I've ever seen an admin say why it's not worth trying.) As an alternative, I've seen people suggest that it... (read more)

3Raemon2mo
Hmm. On one hand, I do think it's moderately likely we experiment with Reacts [https://www.lesswrong.com/posts/SDELNzyboMpZpDwSg/fb-discord-style-reacts], which can partially address your desire here.  But it seems like the problem you're mostly trying to solve is not that big a problem to me (i.e. I think it's totally fine for conversations to just peter out; nobody is entitled to being responded to. I'd at least want to see a second established user asking for it before I considered prioritizing it more. I personally expect a "there is a norm of responding to upvoted comments" to make the site much worse. "Getting annoying comments that miss the point" is one of the most cited things people dislike about LW, and forcing authors to engage with them seems like it'd exacerbate it.) Generally, people are busy, don't have time to reply to everything, and commenters should just assume they won't necessarily get a response unless the author/their-conversation-partner continues to think a conversation is rewarding.
5Vladimir_Nesov2mo
My guess is that people should be rewarded [https://www.lesswrong.com/posts/9DhneE5BRGaCS2Cja/moderation-notes-re-recent-said-duncan-threads?commentId=bngdbXxmNoXKYbAhK] for ignoring criticism [https://www.lesswrong.com/posts/9DhneE5BRGaCS2Cja/moderation-notes-re-recent-said-duncan-threads?commentId=3f8ZNwH9tnW7EXPYw] they want to ignore, it should be convenient for them to do so. So I disagree with the caveat. This way authors are less motivated to take steps that discourage criticism (including steps such as not writing things). Criticism should remain convenient, not costly, and directly associated with the criticised thing (instead of getting pushed to be published elsewhere).
2RobertM2mo
We are currently thinking about "reacts" as a way of providing users with an 80:20 for giving feedback on comments, though motivated by a somewhat different set of concerns.  It's a tricky UX problem and not at the very top of our priority list, but it has come up recently.

I support exposing the number of upvotes/downvotes. (I wrote a userscript for GW to always show the total number of votes, which allows me to infer this somewhat.) However that doesn't address the bulk of my concerns, which I've laid out in more detail in this comment. In connection with karma, I've observed that sometimes a post is initially upvoted a lot, until someone posts a good critique, which then causes the karma of the post to plummet. This makes me think that the karma could be very misleading (even with upvotes/downvotes exposed) if the critique had been banned or disincentivized.

And if there is an important critique to be made I’d expect it to be something that more than the few banned users would think of and decide to post a comment on.

This may be true in some cases, but not all. My experience here comes from cryptography where it often takes hundreds of person-hours to find a flaw in a new idea (which can sometimes be completely fatal), and UDT, where I found a couple of issues in my own initial idea only after several months/years of thinking (hence going to UDT1.1 and UDT2). I think if you ban a few users who might have th... (read more)

4Adam Zerner2mo
Hm, interesting points. My impression is that there are some domains for which this is true, but those are the exception rather than the rule. However, this impression is just based off of, err, vaguely querying my brain? I'm not super confident in it. And your claim is one that I think is "important if true". So then, it does seem worth an investigation. Maybe enumerating through different domains and asking "Is it true here? Is it true here?". One thing I'd like to point out is that, being a community, something very similar is happening. Only a certain type of person comes to LessWrong (this is true of all communities to some extent; they attract a subset of people). It's not that "outsiders" are explicitly banned, they just don't join and thus don't comment. So then, effectively, ideas presented here currently aren't available to "outsiders" for critiques. I think there is a trade off at play: the more you make ideas available to "outsiders" the lower the chance something gets overlooked, but it also has the downside of some sort of friction. (Sorry if this doesn't make sense. I feel like I didn't articulate it very well but couldn't easily think of a better way to say it.) Good point. I think that's true and something to factor in.

(Tangentially) If users are allowed to ban other users from commenting on their posts, how can I tell when the lack of criticism in the comments of some post means that nobody wanted to criticize it (which is a very useful signal that I would want to update on), or that the author has banned some or all of their most prominent/frequent critics? In addition, I think many users may be mislead by lack of criticism if they're simply not aware of the second possibility or have forgotten it. (I think I knew it but it hasn't entered my conscious awareness for a w... (read more)

4lsusr2mo
One solution is to limit the number of banned users to a small fraction of overall commentors. I've written 297 posts so far and have banned only 3 users from commenting on them. (I did not ban Duncan or Said.) My highest-quality criticism comes from users who I have never even considered banning. Their comments are consistently well-reasoned and factually correct.
6gilch2mo
Maybe a middle ground would be to give authors a double-strong downvote power for comments on their posts. A comment with low enough karma is already hidden by default, and repeated strong downvotes without further response would tend to chill rather than inflame the ensuing discussion, or at least push the bulk of it away from the author's arena, without silencing critics completely.
3ChristianKl2mo
What exactly does "nobody wanted to criticize it" signal that you don't get from high/low karma votes?
5Adam Zerner2mo
To me it seems unlikely that there'd be enough banning to prevent criticism from surfacing. Skimming through https://www.lesswrong.com/moderation, the number of bans seems to be pretty small. And if there is an important critique to be made I'd expect it to be something that more than the few banned users would think of and decide to post a comment on.

On the substance I’m skeptical of the more general anti-change sentiment—I think that technological progress has been one of the most important drivers of improving human conditions, and procedurally I value a liberal society where people are free to build and sell technologies as long as they comply with the law.

I'm pretty conflicted but a large part of me wants to bite this bullet, and say that a more deliberate approach to technological change would be good overall, even when applied to both the past and present/future. Because:

  1. Tech progress improv
... (read more)

COVID and climate change are actually easy problems that only became serious or highly costly because of humanity's irrationality and lack of coordination.

  • COVID - Early lockdown in China + border closures + testing/tracing stops it early, or stockpiling enough elastomeric respirators for everyone keeps health/economic damage at a minimum (e.g. making subsequent large scale lockdowns unnecessary).
  • Climate change - Continued nuclear rollout (i.e. if it didn't stop or slow down decades ago) + plugin hybrids or EVs allows world to be mostly decarbonized at m
... (read more)
5awg3mo
  I'm not sure I'd agree with that at all. Also, how are you calculating what cost is "necessary" for problems like COVID/climate change vs. incurred because of a "less-than-perfect" response? How are we even determining what the "perfect" response would be? We have no way of measuring the counterfactual damage from some other response to COVID, we can only (approximately) measure the damage that has happened due to our actual response. For those reasons alone I don't make the same generalization you do about predicting the approximate range of damage from these types of problems. To me the generalization to be made is simply that: as an exogenous threat looms larger on the public consciousness, the larger the societal response to that threat becomes. And the larger the societal response to exogenous threats, the more likely we are to find some solution to overcoming them: either by hard work, miracle, chance, or whatever. And I think there's a steady case to be made that the exogenous xrisk from AI is starting to loom larger and larger on the public consciousness: The Overton Window widens: Examples of AI risk in the media [https://www.lesswrong.com/posts/SvwuduvpsKtXkLnPF/the-overton-window-widens-examples-of-ai-risk-in-the-media].
Wei Dai3moΩ113520

I don't think I understand, what's the reason to expect that the "acausal economy" will look like a bunch of acausal norms, as opposed to, say, each civilization first figuring out what its ultimate values are, how to encode them into a utility function, then merging with every other civilization's utility function? (Not saying that I know it will be the latter, just that I don't know how to tell at this point.)

Also, given that I think AI risk is very high for human civilization, and there being no reason to suspect that we're not a typical pre-AGI civiliz... (read more)

1MichaelStJules3mo
I think the acausal economy would look aggressively space expansionist/resource-exploitative (those are the ones that will acquire and therefore control the most resources; others will self-select out or be out-competed) and, if you're pessimistic about alignment, with some Goodharted human(-like) values from failed alignment (and possibly some bad human-like values). The Goodharting may go disproportionately in directions that are more resource-efficient and allow faster resource acquisition and use and successful takeover (against their creators and other AI). We may want to cooperate most with those using their resources disproportionately for artificial minds or for which there's the least opportunity cost to do so (say because they're focusing on building more hardware that could support digital minds).

To your first question, I'm not sure which particular "the reason" would be most helpful to convey.  (To contrast: what's "the reason" that physically dispersed human societies have laws?  Answer: there's a confluence of reasons.).  However, I'll try to point out some things that might be helpful to attend to.

First, committing to a policy that merges your utility function with someone else's is quite a vulnerable maneuver, with a lot of boundary-setting aspects.  For instance, will you merge utility functions multiplicatively (as in Nas... (read more)

What does merging utility functions look like and are you sure it's not going to look the same as global free trade? It's arguable that trade is just a way of breaking down and modularizing a big multifaceted problem over a lot of subagent task specialists (and there's no avoiding having subagents, due to the light speed limit)

That’s the path the world seems to be on at the moment. It might end well and it might not, but it seems like we are on track for a heck of a roll of the dice.

I agree with almost everything you've written in this post, but you must have some additional inside information about how the world got to this state, having been on the board of OpenAI for several years, and presumably knowing many key decision makers. Presumably this wasn't the path you hoped that OpenAI would lead the world onto when you decided to get involved? Maybe you can't share specific ... (read more)

Wei Dai4moΩ226742

We have a lot of experience and knowledge of building systems that are broadly beneficial and safe, while operating in the human capabilities regime.

What? A major reason we're in the current mess is that we don't know how to do this. For example we don't seem to know how to build a corporation (or more broadly an economy) such that its most powerful leaders don't act like Hollywood villains (race for AI to make a competitor 'dance')? Even our "AGI safety" organizations don't behave safely (e.g., racing for capabilities, handing them over to others, e.g.... (read more)

4johnlawrenceaspden4mo
Well, we are not very good at it, but generally speaking, however much capitalism seems to be acting to degrade our food, food companies are not knowingly and routinely putting poisonous additives in food. And however bad medicine is, it does seem to be a net positive these days. Both of these things are a big improvement on Victorian times! So maybe we are a tiny bit better at it than we used to be? Not convinced it actually helps, mind....

Looking forward to your next post, but in the meantime:

  1. AI - Seems like it would be easier to build an AI that helps me get what I want, if "what I want" had various nice properties and I wasn't in “crossing that bridge when we come to it” mode all the time.
  2. meta-ethical uncertainty - I can't be sure there is no territory.
  3. ethics/philosophy as a status game - I can't get status from this game if I opt out of it.
  4. morality as coordination - I'm motivated to make my morality have various nice properties because it helps other people coordinate with me (by letting them better predict what I would do in various situations/counterfactuals).

My first thought upon hearing about Microsoft deploying a GPT derivative was (as I told a few others in private chat) "I guess they must have fixed the 'making up facts' problem." My thinking was that a big corporation like Microsoft that mostly sells to businesses would want to maintain a reputation for only deploying reliable products. I honestly don't know how to adjust my model of the world to account for whatever happened here... except to be generically more pessimistic?

Answer by Wei DaiFeb 15, 20234-2

But it seems increasingly plausible that AIs will not have explicit utility functions, so that doesn’t seem much better than saying humans could merge their utility functions.

There are a couple of ways to extend the argument:

  1. Having a utility function (or some other stable explicit representation of values) is a likely eventual outcome of recursive self-improvement, since it makes you less vulnerable to value drift and manipulation, and makes coordination easier.
  2. Even without utility functions, AIs can try to merge, i.e., negotiate and jointly build s
... (read more)