All of Daniel Kokotajlo's Comments + Replies

Thanks for the feedback, I'll try to keep this in mind in the future. I imagine you'd prefer me to keep the links, but make the text use common-sense language instead of acronyms so that people don't need to click on the links to understand what I'm saying?

That seems like a useful heuristic. I also think there's an important distinction between using links in a debate frame and in a sharing frame. I wouldn't be bothered at all by a comment using acronyms and links, no matter how insular, if the context was just 'hey, this reminds me of HFDT and POUDA'; a beginner can jump off of that and go down a rabbit hole of interesting concepts. But if you're in a debate frame, you're introducing unnecessary barriers to discussion, which feel unfair and disqualifying. At its worst it would be like saying: 'you're not qualified to debate until you read these five articles.' In a debate frame I don't think you should use any unnecessary links or acronyms at all. If you're linking a whole article, it should be because reading and understanding the whole article is necessary for the discussion to continue, and it cannot be summarized.

I think I have this principle because, in my mind, you can't opt out of engaging with a debate, so you have to read all the links and content included; links in a sharing context are optional, but in a debate context they're required. On a second read, your comment might have been more in the 'sharing' frame than I originally thought, but to the extent you were presenting arguments I think you should maximize legibility, to the point of only including links if you make clear, contextually or explicitly, to what degree the link is optional or just for reference.

I strong-upvoted this post.

Here's a specific, zoomed-in version of this game, proposed by Nate Soares:

like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence s

... (read more)
Lauro Langosco (3 points, 1d):
I like that mini-game! Thanks for the reference

Tom Davidson found a math error, btw: it shouldn't be 360,000 agents each doing a year's worth of thinking in only 3 days. It should be much less than that; otherwise you are getting compute for free!

Oops, thanks, updated to fix this.

Well said. 

One thing conspicuously absent, IMO, is discussion of misalignment risk. I'd argue that GPT-2030 will be situationally aware, strategically aware, and (at least when plugged into fancy future versions of AutoGPT etc.) agentic/goal-directed. If you think it wouldn't be a powerful adversary of humanity, why not? Because it'll be 'just following instructions' and people will put benign instructions in the prompt? Because HFDT will ensure that it'll robustly avoid POUDA? Or will it in fact be a powerful adversary of humanity, but one that is un... (read more)

I don't like the number of links that you put into your first paragraph. The point of developing a vocabulary for a field is to make communication more efficient so that the field can advance. Do you need an acronym and associated article for 'pretty obviously unintended/destructive actions,' or in practice is that just insularizing the discussion? I hear people complaining about how AI safety only has ~300 people working on it, and how nobody is developing object-level understandings and everyone's thinking from authority, but the more sentences you write like "Because HFDT will ensure that it'll robustly avoid POUDA?" the more true that becomes. I feel very strongly about this.
Not Relevant (4 points, 1d):
Where does this “transfer learning across timespans” come from? The main reason I see for checking back in after 3 days is the model’s losing the thread of what the human currently wants, rather than being incapable of pursuing something for longer stretches. A direct parallel is a human worker reporting to a manager on a project - the worker could keep going without check-ins, but their mental model of the larger project goes out of sync within a few days so de facto they’re rate limited by manager check-ins.

Blast from the past: Reading this recent paper I happened across this diagram:

Karl von Wendt (3 points, 1d):
Thank you! Very interesting and a little disturbing, especially the way the AI performance expands in all directions simultaneously. This is of course not surprising, but still concerning to see it depicted in this way. It's all too obvious how this diagram will look in one or two years. Would also be interesting to have an even broader diagram including all kinds of different skills, like playing games, steering a car, manipulating people, etc.

It's very possible this means we're overestimating the compute performed by the human brain a bit.

Specifically, by 6-8 OOMs. I don't think that's "a bit." ;)

Oh I totally agree with everything you say here, especially your first sentence. My timelines median for intelligence explosion (conditional on no significant government-enforced slowdown) is 2027.

So maybe I was misleading when I said I was unimpressed.


Excellent! Yeah I think GPT-4 is close to automating remote workers. 5 or 6, with suitable extensions (e.g. multimodal, langchain, etc.) will succeed I think. Of course, there'll be a lag between "technically existing AI systems can be made to ~fully automate job X" and "most people with job X are now unemployed" because things take time to percolate through the economy. But I think by the time of GPT-6 it'll be clear that this percolation is beginning to happen & the sorts of things that employ remote workers in 2023 (especially the strategically rele... (read more)

I’m wondering if we could make this into a bet. If by remote workers we include programmers, then I’d be willing to bet that GPT-5/6, depending upon what that means (might be easier to say the top LLMs or other models trained by anyone by 2026?) will not be able to replace them.

Thanks! AI managers, CEOs, self-replicators, and your-job-doers (what is your job anyway? I never asked!) seem like things that could happen before it's too late (albeit only very shortly before) so they are potential sources of bets between us. (The other stuff requires lots of progress in robotics which I don't expect to happen until after the singularity, though I could be wrong)

Yes, I understand that you don't think AGI will be achieved by brain simulation. I like that you have a giant confidence interval to account for cases where AGI is way more effi... (read more)

Great points.

I think you've identified a good crux between us: I think GPT-4 is far from automating remote workers and you think it's close. If GPT-5/6 automate most remote work, that will be a point in favor of your view, and if it takes until GPT-8/9/10+, that will be a point in favor of mine. And if GPT gradually provides increasingly powerful tools that wildly transform jobs before they are eventually automated away by GPT-7, then we can call it a tie. :)

I also agree that the magic of GPT should update one into believing in shorter AGI timelines with lower ... (read more)

Good point. I'll message Tristan, see if he can incorporate that into the model.

Had a post about that. 

Thanks for this well-researched and thorough argument! I think I have a bunch of disagreements, but my main one is that it really doesn't seem like AGI will require 8-10 OOMs more inference compute than GPT-4. I am not at all convinced by your argument that it would require that much compute to accurately simulate the human brain. Maybe it would, but we aren't trying to accurately simulate a human brain, we are trying to learn circuitry that is just as capable.

Also: Could you, for posterity, list some capabilities that you are highly confident no AI system will have by 2030? Ideally capabilities that come prior to a point-of-no-return so it's not too late to act by the time we see those capabilities.

AI will probably displace a lot of cognitive workers in the near future, while physical labor might take a while to get below $25/hr.

  • For most tasks, human-level intelligence is not required.
  • Most highly valued jobs include a lot of tasks that do not require high intelligence.
  • Doing 95% of all tasks could come a lot sooner (10-15 years earlier) than 100%. See autonomous driving: getting to 95% safe vs. 99.9999% safe is a big difference.
  • Physical labor by robots will probably remain expensive for a long time (e.g. a robot plumber). A robot CEO will probably be cheaper in the future than a robot plumber.
  • Just take GPT-4 and fine-tune it and you can automate a lot of cognitive labor already.
  • Deployment of cognitive-work automation (a software update) is much faster than deployment of physical robots.

I agree that AI might not replace swim instructors by 2030. It is in cognitive work where the big leaps will be. 

Oh, to clarify, we're not predicting AGI will be achieved by brain simulation. We're using the human brain as a starting point for guessing how much compute AGI will need, and then applying a giant confidence interval (to account for cases where AGI is way more efficient, as well as way less efficient). It's the most uncertain part of our analysis and we're open to updating.

For posterity, by 2030, I predict we will not have:

  • AI drivers that work in any country
  • AI swim instructors
  • AI that can do all of my current job at OpenAI in 2023
  • AI that can get into a 201
... (read more)

I do like Hanson's story you link. :) Yes, the panspermia possibility does make it non-crazy that there could be aliens close to us despite an empty sky. Unlikely, but non-crazy. Then there's still the question of why they are so bad at hiding, why their technology is so shitty, and why they are hiding in the first place. It's not completely impossible, but it seems like a lot of implausible assumptions stacked on top of each other. So, I think it's still true that "the best modelling suggests aliens are at least hundreds of millions of light-years away."

We are more likely to be born in a world with panspermia, as it has a higher concentration of habitable planets.

Nice story! Mostly I think that the best AGIs will always be in the big labs rather than open source, and that current open-source models aren't smart enough to get this sort of self-improving ecosystem off the ground. But it's not completely implausible.

Karl von Wendt (5 points, 2d):
Thank you very much! I agree. We chose this scenario out of many possibilities because so far it hasn't been described in much detail and because we wanted to point out that open source can also lead to dangerous outcomes, not because it is the most likely scenario. Our next story will be more "mainstream".

This being actual aliens is highly unlikely for the usual reasons. The best modeling suggests aliens are at least hundreds of millions of light-years away, since otherwise there would be sufficiently many of them in the sky that some of them would choose not to hide. Moreover if any did visit Earth with the intention of hiding, they would probably have more advanced technology than this, and would be better at hiding.


The best modeling suggests aliens are at least hundreds of millions of light-years away...

As Robin Hanson himself notes: "That's assuming independent origins. Things that have a common origin would find themselves closer in space and time." See also:

I guess I just think it's pretty unreasonable to have p(doom) of 10% or less at this point, if you are familiar with the field, timelines, etc. 

I totally agree the topic is important and neglected. I only said "arguably" deferrable, I have less than 50% credence that it is deferrable. As for why I'm not working on it myself, well, aaaah I'm busy idk what to do aaaaaaah! There's a lot going on that seems important. I think I've gotten wrapped up in more OAI-specific things since coming to OpenAI, and maybe that's bad & I should be stepping back and trying to go where I'm most needed even if that means leaving OpenAI. But yeah. I'm open to being convinced!

Wei Dai (2 points, 4d):
I guess part of the problem is that the people who are currently most receptive to my message are already deeply enmeshed in other x-risk work, and I don't know how to reach others for whom the message might be helpful (such as academic philosophers just starting to think about AI?). If on reflection you think it would be worth spending some of your time on this, one particularly useful thing might be to do some sort of outreach/field-building, like writing a post or paper describing the problem, presenting it at conferences, and otherwise attracting more attention to it. (One worry I have about this is, if someone is just starting to think about AI at this late stage, maybe their thinking process just isn't very good, and I don't want them to be working on this topic! But then again maybe there's a bunch of philosophers who have been worried about AI for a while, but have stayed away due to the overton window thing?)

Nice post. Some minor thoughts: 

Are there historical precedents for this sort of thing? Arguably so: wildfires of strategic cognition sweeping through a nonprofit or corporation or university as office politics ramps up and factions start forming with strategic goals, competing with each other. Wildfires of strategic cognition sweeping through the brain of a college student who was nonagentic/aimless before but now has bought into some ambitious ideology like EA or communism. Wildfires of strategic cognition sweeping through a network of PCs as a viru... (read more)

Something like 2% of people die every year, right? So even if we ignore the value of future people and all sorts of other concerns, and just focus on whether currently living people get to live or die, it would be worth delaying a year if we could thereby decrease p(doom) by 2 percentage points. My p(doom) is currently 70%, so it is very easy to achieve that. Even at 10% p(doom), which I consider to be unreasonably low, it would probably be worth delaying a few years.
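The arithmetic here can be written out explicitly. As a sketch (the ~2%/yr mortality figure is the comment's rough estimate, not an exact statistic), counting only currently living people:

```latex
% N: number of currently living people
% d \approx 0.02: fraction of the population that dies per year
% \Delta p: reduction in p(\mathrm{doom}) bought by a one-year delay
\underbrace{d \cdot N}_{\text{lives lost to the delay}}
\;<\;
\underbrace{\Delta p \cdot N}_{\text{expected lives saved}}
\quad\Longleftrightarrow\quad
\Delta p \;>\; d \;\approx\; 0.02
```

So by this narrow criterion, a one-year delay breaks even when it buys roughly 2 percentage points of doom reduction; the population size N cancels out entirely.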

Re: 2: Yeah I basically agree. I'm just not as confident as you are I guess. Like, maybe the ... (read more)

Wei Dai (4 points, 4d):
Someone with 10% p(doom) may worry that if they got into a coalition with others to delay AI, they couldn't control the delay precisely, and it could easily become more than a few years. Maybe it would be better not to take that risk, from their perspective. And lots of people have p(doom)<10%. Scott Aaronson just gave 2%, for example, and he's probably taken AI risk more seriously than most (he's currently working on AI safety at OpenAI), so the median p(doom) (or effective p(doom) for people who haven't thought about it explicitly) among the whole population is probably even lower.

I think I've tried to take into account uncertainties like this. It seems that in order for my position (that the topic is important and too neglected) to be wrong, one has to reach high confidence that these kinds of problems will be easy for AIs (or humans or AI-human teams) to solve, and I don't see how that kind of conclusion could be reached today. I do have some specific arguments for why the AIs we'll build may be bad at philosophy, but I think those are not very strong arguments, so I'm mostly relying on a prior that says we should be worried about and thinking about this until we see good reasons not to. (It seems hard to have strong arguments either way today, given our current state of knowledge about metaphilosophy and future AIs.)

Another argument for my position is that humans have already created a bunch of opportunities for ourselves to make serious philosophical mistakes, like those around nuclear weapons, farmed animals, and AI, and we can't solve those problems by just asking smart honest humans the right questions, as there is a lot of disagreement between philosophers on many important questions.

What's stopping you from doing this, if anything? (BTW, beyond the general societal level of neglect, I'm especially puzzled by the lack of interest/engagement on this

Proposed Forecasting Technique: Annotate Scenario with Updates (Related to Joe's Post)

  • Consider a proposition like "ASI will happen in 2024, not sooner, not later." It works best if it's a proposition you assign very low credence to, but that other people you respect assign much higher credence to.
  • What's your credence in that proposition?
  • Step 1: Construct a plausible story of how we could get to ASI in 2024, no sooner, no later. The most plausible story you can think of. Consider a few other ways it could happen too, for completeness, but don't write them d
... (read more)

I am unimpressed. I've had conversations with people before that went very similarly to this. If this had been a transcript of your conversation with a human, I would have said that human was not engaging with the subject on the gears / object level and didn't really understand it, but rather had a shallow understanding of the topic, used the anti-weirdness heuristic combined with some misunderstandings to conclude the whole thing was bogus, and then filled in the blanks to produce the rest of the text. Or, to put it differently, BingChat's writing here re... (read more)

I don't know, I feel like the day that an AI can do significantly better than this will be close to the final day of human supremacy. In my experience, we're still in a stage where the AIs can't really form or analyze complex structured thoughts on their own - where I mean thoughts with, say, the complexity of a good essay. To generate complex structured thoughts, you have to help them a bit, and when they analyze something complex and structured, they can make out parts of it, but they don't form a comprehensive overall model of meaning that they can the... (read more)

Science as a kind of Ouija board:

With the board, you do this set of rituals and it produces a string of characters as output, and then you are supposed to read those characters and believe what they say.

So too with science. Weird rituals, check. String of characters as output, check. Supposed to believe what they say, check.

With the board, the point of the rituals is to make it so that you aren't writing the output, something else is -- namely, spirits. You are supposed to be light and open-minded and 'let the spirit move you' rather than deliberately try ... (read more)

It's no longer my top priority, but I have a bunch of notes and arguments relating to AGI takeover scenarios that I'd love to get out at some point. Here are some of them:

Beating the game in May 1937 - Hoi4 World Record Speedrun Explained - YouTube
In this playthrough, the USSR has a brief civil war and Trotsky replaces Stalin. They then get an internationalist-socialist-type diplomat who is super popular with the US, UK, and France, who negotiates passage of troops through their territory -- specifically, they send many many brigades of extremely low-tier troop... (read more)

(But that still leaves room for an update towards "the AI doesn't necessarily kill us, it might merely warp us, or otherwise wreck civilization by bounding us and then giving us power-before-wisdom within those bounds or suchlike, as might be the sort of whims that rando drives shake out into", which I'll chew on.)

FWIW this is my view. (Assuming no ECL/MSR or acausal trade or other such stuff. If we add those things in, the situation gets somewhat better in expectation I think, because there'll be trades with faraway places that DO care about our CEV.)

Why is 1 important? It seems like something we can defer discussion of until after (if ever) alignment is solved, no?

2 is arguably in that category also, though idk.

Wei Dai (4d):

Why is 1 important? It seems like something we can defer discussion of until after (if ever) alignment is solved, no?

If aging were solved, or looked like it would be solved within the next few decades, it would make efforts to stop or slow down AI development less problematic, both practically and ethically. I think some AI accelerationists might be motivated directly by the prospect of dying/deteriorating from old age, and/or view lack of interest/progress on that front as a sign of human inadequacy/stagnation (contributing to their antipathy towards humans).... (read more)

I suggest you put this in a sequence with your other posts in this series (posts making fairly basic points that nonetheless need to be said).

I normally am all for charitability and humility and so forth, but I will put my foot down and say that it's irrational (or uninformed) to disagree with this statement:

“Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”

(I say uninformed because I want to leave an escape clause for people who aren't aware of various facts or haven't been exposed to various arguments yet. But for people who have followed AI progress recently and/or who have heard the standard argument... (read more)

I agree with the statement, broadly construed, so I don't disagree here. The key disanalogy between climate change and AI risk is the evidence base for each. For climate change, there were arguably trillions to quadrillions of data points of evidence, if not more, which is easily enough to massively update even very skeptical people's priors. For AI, the evidence base is closer to maybe 100 data points maximum, and arguably lower than that. This is changing, and things are getting better, but it's quite different from climate change, where you could call them deniers pretty matter-of-factly.

This means more general priors matter, and even not-very-extreme priors wouldn't update much on the evidence for AI doom, so such people are much, much less irrational compared to climate deniers. If the statement is all that's being asked for, that's enough. The worry is when people apply climate analogies to AI without realizing the differences, and those differences are enough to alter or invalidate the conclusions argued for.

? The people viewing AI as not an X-risk are the people confidently dismissing something. 

I think the evidence is really there. Again, the claim isn't that we are definitely doomed, it's that AGI poses an existential risk to humanity. I think it's pretty unreasonable to disagree with that statement.

The point is that the details aren't analogous to the climate change case, and while I don't agree with people who dismiss AI risk, I think that the evidence we have isn't enough to claim anything more than that AI risk is real. The details matter, and due to unique issues, it's going to be very hard to get to the level where we can confidently say that denying AI risk is totally irrational.

What about "Deniers?" as in, climate change deniers. 

Too harsh maybe? IDK, I feel like a neutral observer presented with a conflict framed as "Doomers vs. Deniers" would not say that "deniers" was the harsher term.

I'm not at all sure this would actually be relevant to the rhetorical outcome, but I feel like the AI-can't-go-wrong camp wouldn't really accept the "Denier" label in the same way people in the AI-goes-wrong-by-default camp accept "Doomer." Climate change deniers agree they are deniers, even if they prefer terms like skeptic among themselves. In the case of climate change deniers, the question is whether or not climate change is real, and the thing that they are denying is the mountain of measurements showing that it is real. I think what is different about the can't-go-wrong, wrong-by-default dichotomy is that the question we're arguing about is the direction of change, instead; it would be like if we transmuted the climate change denier camp into a bunch of people whose response wasn't "no it isn't" but instead was "yes, and that is great news and we need more of it." Naturally it is weird to imagine people tacitly accepting the Mary Sue label in the same way we accept Doomer, so cut by my own knife I suppose!
I'd definitely disagree, if only because it implies a level of evidence for the doom side that's not really there; the evidence is a lot more balanced than in the climate case. IMO this is the problem with Zvi's attempted naming too: it incorrectly assumes that the debate on AI is so settled that we can treat people viewing AI as not an x-risk as essentially dismissible deniers engaged in wishful thinking, and this isn't where we're at, to a large extent, even for the better-argued stuff like the Orthogonality Thesis or Instrumental Convergence. Having enough evidence to confidently dismiss something is very hard, much harder than people realize.

Thanks to you likewise!

On doom through normal means: "Persuasion, hacking, and warfare" aren't by themselves doom, but they can be used to accumulate lots of power, and then that power can be used to cause doom. Imagine a world in which human are completely economically, militarily, and politically obsolete, thanks to armies of robots directed by superintelligent AIs. Such a world could and would do very nasty things to humans (e.g. let them all starve to death) unless the superintelligent AIs managing everything specifically cared about keeping humans ali... (read more)

Thanks for this comment. I'd be generally interested to hear more about how one could get to 20% doom (or less).

The list you give above is cool but doesn't do it for me; going down the list I'd guess something like:
1. 20% likely (honesty seems like the best bet to me) because we have so little time left, but even if it happens we aren't out of the woods yet because there are various plausible ways we could screw things up. So maybe overall this is where 1/3rd of my hope comes from.
2. 5% likely? Would want to think about this more. I could imagine myself be... (read more)

Thanks, & thanks for putting in your own perspective here. I sympathize with that too; fwiw Vladimir_Nesov's answer would have satisfied me, because I am sufficiently familiar with what the terms mean. But for someone new to those terms, they are just unexplained jargon, with links to lots of lengthy but difficult to understand writing. (I agree with Richard's comment nearby). Like, I don't think Vladimir did anything wrong by giving a jargon-heavy, links-heavy answer instead of saying something like "It may be hard to construct a utility function that... (read more)

To be clear, I super appreciate you stepping in and trying to see where people were coming from (I think ideally I'd have been doing a better job with that in the first place, but it was kinda hard to do so from inside the conversation). I found Richard's explanation about what-was-up-with-Vlad's comment to be helpful.

Here's where I think the conversation went off the rails. :( I think what happened is M. Y. Zuo's bullshit/woo detector went off, and they started asking pointed questions about the credentials of Critch and his ideas. Vlad and LW more generally react allergically to arguments from authority/status, so they downvoted M. Y. Zuo for making this about Critch's authority instead of about the quality of his arguments.

Personally I feel like this was all a tragic misunderstanding but I generally side with M.Y.Zuo here -- I like Critch a lot as a person & I think he's ... (read more)


I appreciate the attempt at diagnosing what went wrong here. I agree this is ~where it went off the rails, and I think you are (maybe?) correctly describing what was going on from M. Y. Zuo's perspective. But this doesn't feel like it captured what I found frustrating. 


What feels wrong to me about this is that, for the question of:

How would one arrive at a value system that supports the latter but rejects the former?

it just doesn't make sense to me to be that worried about either authority or rigor. I think the nonrigorous concept, general... (read more)

M. Y. Zuo (0 points, 10d):
Thanks for the insight. After looking into Vladimir_Nesov's background, I would tend to agree it was some issue with the phrasing of the parent comment that triggered the increasingly odd replies, rather than any substantive confusion. At the time I gave him the benefit of the doubt in confusing what SEP is, what referencing an entry in an encyclopedia means, what I wanted to convey, etc., but considering there are 1505 seemingly coherent wiki contributions to the account's credit since 2009, these pretty common usages should not have been difficult to understand. To be fair, I didn't consider his possible emotional states, nor how my phrasing might be construed as an attack on his beliefs. Perhaps I'm too used to the more formal STEM culture instead of this new culture that appears to be developing.

I was one of the people who upvoted but disagreed -- I think it's a good point you raise, M. Y. Zuo, that So8res' qualifications blunt the blow and give people an out, a handy rationalization to justify continuing working on capabilities. However, there's still a non-zero (and I'd argue substantial) effect remaining.

Makes sense. I had basically decided by 2021 that those good futures (1) and (2) were very unlikely, so yeah.

Whereas my timelines views are extremely well thought through (relative to most people, that is), I feel much more uncertain and unstable about p(doom). That said, here's why I updated:

Hinton and Bengio have come out as worried about AGI x-risk; the FLI letter and Yudkowsky's tour of podcasts, while incompetently executed, have been better received by the general public and elites than I expected; the big labs (especially OpenAI) have reiterated that superintelligent AGI is a thing, that it might come soon, that it might kill everyone, and that regulation is... (read more)

Wei Dai (9 points, 14d):
Thanks for this. I was just wondering how your views have updated in light of recent events. Like you, I also think that things are going better than my median prediction, but paradoxically I've been feeling even more pessimistic lately. Reflecting on this, I think my p(doom) has gone up instead of down, because some of the good futures where a lot of my probability mass for non-doom was concentrated have also disappeared, which seems to outweigh the especially bad futures going away and makes me overall more pessimistic.

These especially good futures were 1) AI capabilities hit a wall before getting to human level, and 2) humanity handles AI risk especially competently, e.g., at this stage leading AI labs talk clearly about existential risks in their public communications and make serious efforts to avoid race dynamics, there is more competent public discussion of takeover risk than what we see today, including fully cooked regulatory proposals, and many people start taking less obvious (non-takeover) AI-related x-risks (like the ones Paul mentions in this post) seriously.

I'd be curious to hear more about this "contributes significantly in expectation" bit. Like, suppose I have some plan that (if it doesn't work) burns timelines by X, but (if it does work) gets us 10% of the way towards aligned AGI (e.g. ~10 plans like this succeeding would suffice to achieve aligned AGI) and moreover there's a 20% chance that this plan actually buys time by providing legible evidence of danger to regulators who then are more likely to regulate and more likely to make the regulation actually useful instead of harmful. So we have these three... (read more)

I'm trying to make a basic point here, that pushing the boundaries of the capabilities frontier, by your own hands and for that direct purpose, seems bad to me. I emphatically request that people stop doing that, if they're doing that.

I am not requesting that people never take any action that has some probability of advancing the capabilities frontier. I think that plenty of alignment research is potentially entangled with capabilities research (and/or might get more entangled as it progresses), and I think that some people are making the tradeoffs in ways... (read more)

Great post, will buy the book and take a look!

I feel like I vaguely recall reading somewhere that some California canvassing-to-promote-gay-rights experiment either didn't replicate or turned out to be outright fraud. Wish I could remember the details. Hopefully it wasn't the experiment you're talking about?

Alan E Dunne: ...and generally "beware the man of one study"
Steven Byrnes:
I just started the audiobook myself, and in the part I'm up to the author mentioned that there was a study of deep canvassing that was very bad and got retracted, but then later a different group of scientists studied deep canvassing, more on which later in the book. (I haven't gotten to the "later in the book" yet.) Wikipedia seems to support that story, saying that the first guy was just making up data (see "When contact changes minds" on Wikipedia). "If a fraudulent paper says the sky is blue, that doesn't mean it's green" :)

UPDATE: Yeah, my current impression is that the first study was just fabricated data. It wasn't that the data showed bad results so he massaged it; more like he never bothered to get data in the first place. The second study found impressive results (supposedly; I didn't scrutinize the methodology or anything) and I don't think the first study should cast doubt on the second study.

I agree that logodds space is the right way to think about how close probabilities are. However, my epistemic situation right now is basically this:

"It sure seems like Doom is more likely than Safety, for a bunch of reasons. However, I feel sufficiently uncertain about stuff, and humble, that I don't want to say e.g. 99% chance of doom, or even 90%. I can in fact imagine things being OK, in a couple different ways, even if those ways seem unlikely to me. ... OK, now if I imagine someone having the flipped perspective, and thinking that things being OK is m... (read more)

I don't think the way you imagine perspective inversion captures typical ways how to arrive at e.g. 20% doom probability. For example, I do believe that there are multiple good things which can happen/be true, decrease p(doom) and I put some weight on them - we do discover some relatively short description of something like "harmony and kindness"; this works as an alignment target - enough of morality is convergent - AI progress helps with human coordination (could be in costly way, eg warning shot) - it's convergent to massively scale alignment efforts with AI power, and these solve some of the more obvious problems I would expect prevailing doom conditional on only small efforts to avoid it, but I do think the actual efforts will be substantial, and this moves the chances to ~20-30%. (Also I think most of the risk comes from not being able to deal with complex systems of many AIs and economy decoupling from humans, and single-single alignment to be solved sufficiently to prevent single system takeover by default.)

Thanks for this post! I definitely disagree with you about point I (I think AI doom is 70% likely and I think people who think it is less than, say, 20% are being very unreasonable) but I appreciate the feedback and constructive criticism, especially section III.

If you ever want to chat sometime (e.g. in a comment thread, or in a video call) I'd be happy to. If you are especially interested I can reply here to your object-level arguments in section I. I guess a lightning version would be "My arguments for doom don't depend on nanotech or anything possibly-... (read more)

I seem to remember your P(doom) being 85% a short while ago. I’d be interested to know why it has dropped to 70%, or in another way of looking at it, why you believe our odds of non-doom have doubled.
As a minor nitpick, 70% and 20% are quite close in log-odds space, so it seems odd that you consider your own belief reasonable while something so close is "very unreasonable".
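For readers who want the "log-odds space" comparison made concrete, here is a minimal sketch (the `logit` helper is just the standard log-odds transform, not anything from the thread):

```python
import math

def logit(p):
    """Log-odds of probability p, in natural-log units (nats)."""
    return math.log(p / (1 - p))

# Gaps in log-odds space between the probabilities under discussion:
gap_70_vs_20 = logit(0.70) - logit(0.20)  # gap between 70% and 20%
gap_85_vs_70 = logit(0.85) - logit(0.70)  # gap between the earlier 85% and 70%
print(round(gap_70_vs_20, 2))  # 2.23
print(round(gap_85_vs_70, 2))  # 0.89
```

By this measure the 70%-vs-20% gap is about 2.2 nats, while the 85%-to-70% update is about 0.9 nats; readers can judge for themselves what counts as "close".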
Thank you for the reply. I agree we should try and avoid AI taking over the world. On "doom through normal means"--I just think there are very plausibly limits to what superintelligence can do. "Persuasion, hacking, and warfare" (appreciate this is not a full version of the argument) don't seem like doom to me. I don't believe something can persuade generals to go to war in a short period of time just because it's very intelligent. Reminds me of this.

On values--I think there's a conflation between us having ambitious goals and whatever is actually being optimized by the AI. I am curious to hear what the "galaxy brained reasons" are; my impression was that they are what was outlined (and addressed) in the original post.

At the time of writing, the Metaculus community predicts that in July 2024 there will be a 25% chance of a system of Loebner-silver-prize capability (along with the other resolution criteria). It is hard for me to imagine how this could happen.

Focus on imagining how we could get complete AI software R&D automation by then, that's both more important than Loebner-silver-prize capability and implies it (well, it implies that after a brief period sometimes called "intelligence explosion" the capability will be reached).

On a longer time horizon, full AI R&D automation does seem like a possible intermediate step to Loebner silver. For July 2024, though, that path is even harder to imagine. The trouble is that July 2024 is so soon that even GPT-5 likely won't be released by then.

* Altman stated a few days ago that they have no plans to start training GPT-5 within the next 6 months. That'd put the earliest training start at Dec 2023.
* We don't know much about how long GPT-4 pre-trained for, but let's say 4 months. Given that frontier models have taken progressively longer to train, we should expect no shorter for GPT-5, which puts its earliest pre-training finish around Mar 2024.
* GPT-4 spent 6 months on fine-tuning and testing before release, and Brockman has stated that future models should be expected to take at least that long. That puts GPT-5's earliest release in Sep 2024.

Without GPT-5 as a possibility, it'd need to be some other project (Gato 2? Gemini?) or some extraordinary system built using existing models (via fine-tuning, retrieval, inner thoughts, etc.). The gap between existing chatbots and Loebner-silver seems huge though, as I discussed in the post--none of that seems up to the challenge.

Full AI R&D automation would face all of the above hurdles, perhaps with the added challenge of being even harder than Loebner-silver. After all, the Loebner-silver fake human doesn't need to be a genius researcher, since very few humans are. The only aspect in which the automation seems easier is that the system doesn't need to fake being a human (such as by dumbing down its capabilities), and that seems relatively minor by comparison.

Not as far as I know, but people should definitely do that!

I think you are overestimating how aligned these models are right now, and very much overestimating how aligned they will be in the future absent massive regulations forcing people to pay massive alignment taxes. They won't be aligned to any users, or any corporations either. Current methods like RLHF will not work on situationally aware, agentic AGIs.

I agree that IF all we had to do to get alignment was the sort of stuff we are currently doing, the world would be as you describe. But instead there will be a significant safety tax.

Ahhh, I see. I think that's a bit misleading; I'd say "You have to care about what happens far away," e.g. you have to want there to be paperclips far away also. (The current phrasing makes it seem like a paperclipper wouldn't want to do ECL.)

Also, technically, you don't actually have to care about what happens far away either, if anthropic capture is involved.

Wait, why is ECL lumped under Correlation + Kindness instead of just Correlation? I think this thread is supposed to answer that question but I don't get it.

It's not true that you only have an ECL reason to cooperate if you care about the survival of other agents. Paperclippers, for example, have ECL reason to cooperate.

I think you have to care about what happens to other agents. That might be "other paperclippers." If you only care about what happens to you personally, then I think the size of the universe isn't relevant to your decision.

What's your response to my "If I did..." point? If we include all the data points, the correlation between intelligence and agency is clearly positive, because rocks have 0 intelligence and 0 agency.

If you agree that agency as I've defined it in that sequence is closely and positively related to intelligence, then maybe we don't have anything else to disagree about. I would then ask of you and Boaz what other notion of agency you have in mind, and encourage you to specify it to avoid confusion, and then maybe that's all I'd say since maybe we'd be in agree... (read more)

Sorry, that was the "Idem if your data forms clusters" part. In other words, I agree that a cluster at (0,0) and a cluster at (+,+) will turn into positive correlation coefficients, and I warn you against updating based on that (it's a statistical mistake).

I respectfully disagree with the idea that most disagreements come from drawing different conclusions from the same priors. Most disagreements I have with anyone on LessWrong (and anywhere, really) are about which priors and prior structures are best for what purpose. In other words, I fully agree that

Speaking for myself only, my notion of agency is basically "anything that behaves like an error-correcting code". This includes conscious beings that want to promote their fate, but also life that wants to live, and even two thermostats fighting over who's in charge.

That and the analogy are very good points, thank you.
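The cluster point under debate can be illustrated with a quick simulation (a sketch with made-up numbers: two clusters that each have zero within-cluster correlation, e.g. "rocks" near (0,0) and agents near (+,+), still produce a strongly positive Pearson coefficient when pooled):

```python
import random

random.seed(0)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Cluster A near (0, 0) and cluster B near (10, 10); within each cluster,
# x and y are drawn independently, so neither cluster has any correlation.
xs, ys = [], []
for _ in range(500):
    xs.append(random.gauss(0, 1)); ys.append(random.gauss(0, 1))    # cluster A
    xs.append(random.gauss(10, 1)); ys.append(random.gauss(10, 1))  # cluster B

r = pearson(xs, ys)
print(round(r, 2))  # strongly positive, driven entirely by the gap between clusters
```

This is neutral on the disagreement itself: it shows both that including the (0,0) cluster does make the coefficient positive, and that the positive coefficient carries no information beyond "there are two separated clusters".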

First of all, I think the "cooperate together" thing is a difficult problem and is not solved by ensuring value diversity (though, note also that ensuring value diversity is a difficult task that would require heavy regulation of the AI industry!)

More importantly though, your analysis here seems to assume that the "Safety Tax" or "Alignment Tax" is zero. That is, it assumes that making an AI aligned to a particular human or group of humans (so that they can be said to "have" the AI, in the manner you described) is easy, a trivial additional step beyond mak... (read more)

Definitely I would expect there's more useful ways to disrupt coalition-forming aside from just value diversity. I'm not familiar with the theory of revolutions, and it might have something useful to say. I can imagine a role for government, although I'm not sure how best to do it. For example, ensuring a competitive market (such as by anti-trust) would help, since models built by different companies will naturally tend to differ in their values.

This is a complex and interesting topic. In some circumstances, the "alignment tax" is negative (so more like an "alignment bonus"). ChatGPT is easier to use than base models in large part because it is better aligned with the user's intent, so alignment in that case is profitable even without safety considerations. The open source community around LLaMA imitates this, not because of safety concerns, but because it makes the model more useful.

But alignment can sometimes be worse for users. ChatGPT is aligned primarily with OpenAI and only secondarily with the user, so if the user makes a request that OpenAI would prefer not to serve, the model refuses. (This might be commercially rational to avoid bad press.) To more fully align with user intent, there are "uncensored" LLaMA fine-tunes that aim to never refuse requests.

What's interesting too is that user-alignment produces more value diversity than OpenAI-alignment. There are only a few companies like OpenAI, but there are hundreds of millions of users from a wider variety of backgrounds, so aligning with the latter would naturally be expected to create more value diversity among the AIs.

The trick is that the unaligned AIs may not view it as advantageous to join forces. To the extent that the orthogonality thesis holds (which is unclear), this is more true. As a bad example, suppose there's a misaligned AI who wants to make paperclips and a misaligned AI who wants to make coat hangers--they're going to have trouble agreeing with each other on what to do with the wi

Thanks, duly noted! I agree that political art is typically awful. FWIW, The Treacherous Turn was approximately 100% optimized to be fun. We all knew from the beginning that it wouldn't be useful if it wasn't fun. We did put some thought into making it realistic, but the realism IMO adds to the fun rather than subtracting from it; I would still have included it even if I didn't care about impact at all.


Nice post! You seem like you know what you are doing. I'd be curious to hear more about what you think about these priority areas, and why interpretability didn't make the list:

Thanks and good luck!

Sorry for the late reply: I wrote up an answer but lost it due to a server-side error during submission. I shall answer the interpretability question first. Interpretability didn't make the list because of the following beliefs of mine:

* Interpretability -- specifically interpretability-after-training -- seems to aim, at the limit, for ontology identification, which is very different from ontological robustness. Ontology identification is useful for specific safety interventions such as scalable oversight, which seems like a viable alignment strategy, but I doubt this strategy scales all the way to ASI. I expect it to break almost immediately once someone begins a human-in-the-loop RSI, especially since I expect (at the very least) significant changes in the architecture of neural network models that would result in capability improvements. This is why I predict that investing in interpretability research is not the best idea.
* A counterpoint is the notion that we can accelerate alignment with sufficiently capable aligned 'oracle' models -- and this seems to be OpenAI's current strategy: build 'oracle' models that are aligned enough to accelerate alignment research, and use better alignment techniques on the more capable models. Since one can accelerate both capabilities research and alignment research with capable enough oracle models, however, OpenAI would also choose to accelerate capabilities research alongside their attempt to accelerate alignment research. The question then is whether OpenAI is cautious enough in balancing the two -- and recent events have not made me optimistic about this being the case.
* Interpretability research does help accelerate some of the alignment agendas I have listed by providing insights that may be broad enough to help; but I expect that such insights would probably be found through other approaches too, and the fact that interpretability research either involves
There is nothing physically impossible about it lasting however long it needs to; that's only implausible for the same political and epistemic reasons that any global moratorium at all is implausible. GPUs don't grow on trees. My point in the above comment is that pivotal acts don't by their nature stay apart: a conventional moratorium that actually helps is also a pivotal act.

Pivotal-act AIs are something like task AIs that can plausibly be made to achieve a strategically relevant effect relatively safely, well in advance of actually having the understanding necessary to align a general agentic superintelligence, using alignment techniques designed around the lack of such an understanding. Advances made by humans with the use of task AIs could then increase the robustness of a moratorium's enforcement (better cybersecurity and compute governance), reduce the downsides of the moratorium's presence (tool AIs allowed to make biotech advancements), and ultimately move towards being predictably ready for a superintelligent AI, which might initially look like developing alignment techniques that work for making more and more powerful task AIs safely. Scalable molecular manufacturing of compute is an obvious landmark, and can't end well without robust compute governance already in place. Human uploading is another tool that can plausibly be used to improve global security without a better understanding of AI alignment.

(I don't see what we currently know justifying Hanson's concern of never making enough progress to lift a value drift moratorium. If theoretical progress can get feedback from gradually improving task AIs, there is a long way to go before concluding that the process would peter out before superintelligence, so that taking any sort of plunge is remotely sane for the world. We haven't been at it