All of CarlShulman's Comments + Replies

I disagree, from my experience of engaging with the public debate, doubt is mostly about AI capability, not about misalignment. Most people easily believe AI to be misaligned to them, but they have trouble believing it will be powerful enough to take over the world any time soon. I don't think alignment research will do that much here.

I would say that the power of AI will continue to visibly massively expand (although underestimation of further developments will continue to be a big problem), but that will increase both 'fear AI disaster' and 'get AI first... (read more)

I think the assumption that safe, aligned AI can't defend against a later introduction of misaligned AI is false, or rather depends on the assumption of profound alignment failures so that the 'aligned AI' really isn't. AI that is aligned enough to do AI research and operate industry and security forces can expand its capabilities to the technological frontier and grow an industrial base claiming unclaimed resources in space. Then any later AI introduced faces an insurmountable balance of capabilities just from the gap in resources, even if it catches up t... (read more)

Thank you for writing this reply. It definitely improved my overview of possible ways to look at this issue. I guess your position can be summarized as "positive offense/defense balance will emerge soon, and aligned AI can block following unaligned AIs entirely if required", is that roughly correct? I have a few remarks about your ideas (not really a complete response).

First, in general, I think you're underestimating the human component of alignment. Aligned AI should be aligned to something, namely humans. That means it won't be able to build an industrial base in space until we're ready to make it do that. Even if we are not harmed by such a base in any way, and even if it would be legal to build it, I expect we may not be ready for it for a long time. It will be dead scary to see something develop that seems more powerful than us, but also deeply alien to us, even if tech companies insist it's 'aligned to our values'. Most people's response will be to rein in its power, not expand it further. Any AI that's aligned to us will need to take those feelings seriously. Even if experts would agree that increasing the power of the aligned AI is good and necessary, and that expansion in space would be required for that, I think it will take a long time to convince the general public and/or decision makers, if it's at all possible. And in any remotely democratic alignment plan, that's a necessary step.

Second, I think it's uncertain whether a level of AI that's powerful enough to take over the world (and thereby cause existential risk) will also be powerful enough to build a large industrial base in space. If not, your plan might not work.

No. Short version is that the prior for the combination of technologies and motives for aliens (and worse for magic, etc) is very low, and the evidence distribution is familiar from deep dives in multiple bogus fields (including parapsychology, imaginary social science phenomena, and others), with understandable data-generating processes so not much likelihood ratio.

We've agreed to make a 25:1 bet on this. John will put the hash of the bet amount/terms below.

John Wiseman:
Carl and I have ultimately agreed to a 29:1 bet on the combined amount. The term will expire on July 25, 2028, and may be extended by no more than 2 days upon reasonable request at Carl's sole discretion. The resolution criteria are as laid out in the main post of this thread by the user RatsWrongAboutUAP. Unless either of the parties wishes to disclose it, the total amount agreed upon will remain in confidence between the parties.

As we've discussed and in short, I think aligned AI permits dialing up many of the processes that make science or prediction markets imperfectly self-correcting: tremendously cheaper, in parallel, on the full panoply of questions (including philosophy and the social sciences), with robust consistency, cross-examination, test sets, and forecasting. These sorts of things are an important part of scalable supervision for alignment, but if they can be made to work I expect them to drive strong epistemic convergence.

You've described some of these ideas to me before, but not in enough detail for me to form a judgement on the actual merits of the ideas and arguments. So I'm having to rely on my usual skeptical prior for new untested ideas in the philosophical or AI safety areas (because a lot of new ideas end up not working out, and people tend to be overconfident about their own original ideas), along with:

We seem to understand the philosophy/epistemology of science much better than that of philosophy (i.e. metaphilosophy), and at least superficially the methods humans... (read more)

The thing was already an obscene 7 hours with a focus on intelligence explosion and mechanics of AI takeover (which are under-discussed in the discourse and easy to improve on, so I wanted to get concrete details out). More detail on alignment plans and on human-AI joint societies is planned for the next times I do podcasts.

I'm interested in my $250k against your $10k.

Carl, I'm interested in also taking RatsWrongAboutUAP's side of the bet, if you'd like to bet more. I'd also be happy to give you better odds than 150:1. DM me if you're interested.
Evan R. Murphy:
Carl, have you written somewhere about why you are confident that all UFOs so far are prosaic in nature? I would be interested to read/listen to your thoughts on this. (Alternatively, a link to some other source that you find gives a particularly compelling explanation is also good.)
John Wiseman:
I could offer $5k against your $185k, Carl. If you're interested, DM me. Same odds as European roulette, albeit with a much delayed payment.
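For context on what stakes like these imply, here is a toy break-even calculation (my illustration, not anything the bettors stated): a bettor risking a stake to win a counter-stake breaks even at the probability where expected gain equals expected loss.

```python
def break_even_probability(stake: float, counter_stake: float) -> float:
    """Probability of the event at which risking `stake` to win
    `counter_stake` (if the event occurs) is exactly break-even."""
    return stake / (stake + counter_stake)

# $10k risked against Carl's $250k breaks even near a 3.8% chance;
# $5k against $185k (37:1) breaks even near 2.6%.
print(round(break_even_probability(10_000, 250_000), 4))  # 0.0385
print(round(break_even_probability(5_000, 185_000), 4))   # 0.0263
```

So taking such a bet is profitable in expectation only if one assigns the event more than a few percent probability.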

I assign that outcome low probability (and consider that disagreement to be off-topic here).

Thank you for the clarification. In that case my objections are on the object-level.


This post is an answer to the question of why an AI that was truly indifferent to humanity (and sentient life more generally), would destroy all Earth-originated sentient life.

This does exclude random small terminal valuations of things involving humans, but leaves out the instrumental value for trade and science, uncertainty about how other powerful beings might re... (read more)

RE: decision theory w.r.t. how "other powerful beings" might respond: I really do think Nate has already argued this, and his arguments continue to seem more compelling to me than the opposition's. Relevant quotes include:

It’s possible that the paperclipper that kills us will decide to scan human brains and save the scans, just in case it runs into an advanced alien civilization later that wants to trade some paperclips for the scans. And there may well be friendly aliens out there who would agree to this trade, and then give us a little pocket of th

... (read more)

Most people care a lot more about whether they and their loved ones (and their society/humanity) will in fact be killed than whether they will control the cosmic endowment. Eliezer has been going on podcasts saying that with near-certainty we will not see really superintelligent AGI because we will all be killed, and many people interpret your statements as saying that. And Paul's arguments do cut to the core of a lot of the appeals to humans keeping around other animals.

If it is false that we will almost certainly be killed (which I think is right, I... (read more)

This thread continues to seem to me to be off-topic. My main takeaway so far is that the post was not clear enough about how it's answering the question "why does an AI that is indifferent to you, kill you?". In attempts to make this clearer, I have added the following to the beginning of the post:

This post is an answer to the question of why an AI that was truly indifferent to humanity (and sentient life more generally), would destroy all Earth-originated sentient life.

I acknowledge (for the third time, with some exasperation) that this point alone is... (read more)

A world of pure Newtonian mechanics wouldn't actually support apples and grass as we know them existing, I think. They depend on matter capable of supporting organic chemistry, nuclear reactions, the speed of light, ordered causality, etc. Working out that sort of thing in simulation to get an Occam prior over coherent laws of physics producing life does seem to be plenty to favor QM+GR over Newtonian mechanics as physical laws.

I agree the possibility or probability of an AI finding itself in simulations without such direct access to 'basement level' physical reality limits the conclusions that could be drawn, although conclusions 'conditional on this being direct access' may be what's in mind in the original post.

In the post, I show you both a grass and an apple that did not require Newtonian gravity or general relativity to exist. Why exactly are nuclear reactions and organic chemistry necessary for a clump of red things to stick together, or for a clump of green things to stick together?

When it comes to the "level of simulation", how exactly is the AI meant to know when it is at the "base level"? We don't know that about our universe. For all the computer knows, its simulation is the universe.

In general human cognitive enhancement could help AGI alignment if it were at scale before AGI, but the cognitive enhancements on offer seem like we probably won't get very much out of them before AGI, and they absolutely don't suffice to 'keep up' with AGI for more than a few weeks or months (as AI R&D efforts rapidly improve AI while human brains remain similar, rendering human-AI cyborg basically AI systems). So benefit from those channels, especially for something like BCI, has to add value mainly by making better initial decisions, like successful... (read more)

What level of taxation do you think would delay timelines by even one year?

Matthew Barnett:
I'm not sure. It depends greatly on the rate of general algorithmic progress, which I think is unknown at this time. I think it is not implausible (>10% chance) that we will see draconian controls that limit GPU production and usage, decreasing effective compute available to the largest actors by more than 99% from the trajectory under laissez faire. Such controls would be unprecedented in human history, but justified on the merits, if AI is both transformative and highly dangerous.  It should be noted that, to the extent that more hardware allows for more algorithmic experimentation, such controls would also slow down algorithmic progress.

With effective compute for AI doubling more than once per year, a global 100% surtax on GPUs and AI ASICs seems like it would be a difference of only months to AGI timelines.
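The arithmetic behind this claim can be sketched in a toy model (my illustration, not Carl's; it assumes effective compute per dollar grows exponentially and a surtax simply scales the price of compute):

```python
import math

def timeline_delay_months(price_multiplier: float,
                          doubling_time_months: float) -> float:
    """Months until exponential growth in effective compute per dollar
    offsets a one-off multiplicative increase in the price of compute."""
    return doubling_time_months * math.log2(price_multiplier)

# A 100% surtax doubles prices; with effective compute doubling every
# 12 months or faster, the delay is at most one doubling time.
print(timeline_delay_months(2.0, 12.0))  # 12.0 (months)
print(timeline_delay_months(2.0, 9.0))   # 9.0 (months)
```

The key feature is that a one-off price shock shifts the growth curve rightward without changing its slope, so the delay is bounded by the doubling time rather than compounding over time.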

"Effective compute" is the combination of hardware growth and algorithmic progress? If those are multiplicative rather than additive, slowing one of the factors may accomplish little on its own, but maybe it could pave the way for more significant changes when you slow both at the same time?

Unfortunately, it seems hard to significantly slow algorithmic progress. I can think of changes to publishing behaviors (and improving security) and pausing research on scary models (for instance via safety evals). Maybe things like handicapping talent pools via changes to immigration policy, or encouraging capability researchers to do other work. But that's about it. Still, combining different measures could be promising if the effects are multiplicative rather than additive.

Edit: Ah, but I guess your point is that even a 100% tax on compute wouldn't really change the slope of the compute growth curve – it would only move the curve rightward and delay it a little. So we don't get a multiplicative effect, unfortunately. We'd need to find an intervention that changes the steepness of the curve.
Matthew Barnett:
What is your source for the claim that effective compute for AI is doubling more than once per year? And do you mean effective compute in the largest training runs, or effective compute available in the world more generally?
If the explicit goal of the regulation is to delay AI capabilities, and to implement that via taxes, it seems like one could figure out something to make the delay longer. Also, a few months still seems quite helpful and would count as "substantially" in my mind.

This is the terrifying tradeoff: delaying for months after reaching near-human-level AI (if there is safety research that requires studying AI around there or beyond) is plausibly enough time for a capabilities explosion (yielding arbitrary economic and military advantage, or AI takeover) by a more reckless actor willing to accept a larger level of risk, or making an erroneous/biased risk estimate. AI models selected to yield results while under control, but which catastrophically take over once they are collectively capable, would make it look like automating everything was largely going fine (absent vigorous probes) until it doesn't, and mistrust could seem like paranoia.


I'd very much like to see this done with standard high-quality polling techniques, e.g. while airing counterarguments (like support for expensive programs that looks like a majority but collapses if the higher taxes to pay for them are mentioned). In particular, how the public would react given different views coming from computer scientists/government commissions/panels.

This is like saying there's no value in learning about and stopping a nuclear attack from killing you, or in being tipped off about a threat trying to kill you, because you might get absolutely no benefit from not being killed then: the opponent might later kill you with nanotechnology before you can prevent it.

Removing intentional deception or harm greatly increases the capability of AIs that can be worked with without getting killed, to further improve safety measures. And as I said, actually being able to show a threat to skeptics is immensely better for all solutions, including relinquishment, than controversial speculation.

No, it's not like that. It's saying that if you can prevent a doomsday device from being lethal in some ways and not in others, then it's still lethal. Focusing on some ways that you feel confident you might be able to prevent the doomsday device from being lethal is IMO dangerously distracting from the point, which is that people should not build the doomsday device in the first place.

I agree that some specific leaders you cite have expressed distaste for model scaling, but it seems not to be a core concern. In a choice between more politically feasible measures that target concerns they believe are real vs. concerns they believe are imaginary and bad, I don't think you get the latter. And I think arguments based on those concerns get traction on measures addressing the concerns, but less so on secondary wishlist items of leaders.

I think that's the reason privacy advocacy in legislation and the like hasn't focused on banning computers i... (read more)

re: Leaders of movements being skeptical of the notion of AGI. Reflecting more, my impression is that Timnit Gebru is skeptical about the sci-fi descriptions of AGI, and even more so about the social motives of people working on developing (safe) AGI. She does not say that AGI is an impossible concept or not actually a risk. She seems to question the overlapping groups of white male geeks who have been diverting efforts away from other societal issues, to both promoting AGI development and warning of AGI x-risks.

Regarding Jaron Lanier, yes, (re)reading this post I agree that he seems to totally dismiss the notion of AGI, seeing it more as a result of a religious kind of thinking under which humans toil away at offering the training data necessary for statistical learning algorithms to function without being compensated.
Returning on the error correction point: Feel free to still clarify the other reasons why the changes in learning would be stable in preserving "good properties". Then I will take that starting point to try to explain why the mutually reinforcing dynamics of instrumental convergence and substrate-needs convergence override that stability. Fundamentally though, we'll still be discussing the application limits of error correction methods.

Three ways to explain why:

* Any workable AI-alignment method involves receiving input signals, comparing input signals against internal references, and outputting corrective signals to maintain alignment of outside states against those references (i.e. error correction).
* Any workable AI-alignment method involves a control feedback loop – of detecting the actual (or simulating the potential) effects internally and then correcting actual (or preventing the potential) effects externally (i.e. error correction).
  * E.g. mechanistic interpretability is essentially about "detecting the actual (or simulating the potential) effects internally" of AI.
* The only way to actually (slightly) counteract AGI convergence on causing "instrumental" and "needed" effects within a more complex environment is to simulate/detect and then prevent/correct those environmental effects (i.e. error correction).

~ ~ ~

Which brings us back to why error correction methods, of any kind and in any combination, cannot ensure long-term AGI safety. I reread your original post and Christiano's comment to understand your reasoning better and see how I could clarify the limits of applicability of error correction methods. I also messaged Forrest (the polymath) to ask for his input. The messages were of a high enough quality that I won't bother rewriting the text. Let me copy-paste the raw exchange below (with a few spelling edits).

Remmelt 15:37
@Forrest, would value your thoughts on the way Carl Schulman is thinking about error correction
I intend to respond to the rest tomorrow. Some of your interpretations of writings by Timnit Gebru and Glen Weyl seem fair to me (though I would need to ask them to confirm). I have not looked much into Jaron Lanier's writings on AGI, so that prompts me to google that. Perhaps you can clarify the other reasons why the changes in learning would be stable in preserving "good properties"? I'll respond to your nuances regarding how to interpret your long-term-evaluating error correcting code after that.
I addressed claims of similar forms at least 3 times already on separate occasions (including in the post itself). Suggest reading this: "The fact that mechanistic interpretability can possibly be used to detect a few straightforwardly detectable misalignments of the kinds you are able to imagine right now does not mean that the method can be extended to detecting/simulating most or all human-lethal dynamics manifested in/by AGI over the long term. If AGI behaviour converges on outcomes that result in our deaths through less direct routes, it really does not matter much whether the AI researcher humans did an okay job at detecting 'intentional direct lethality' and 'explicitly rendered deception'."

I agree there is some weak public sentiment in this direction (with the fear of AI takeover being weaker). Privacy protections and redistribution don't particularly favor measures to avoid AI apocalypse.

I'd also mention this YouGov survey:

But the sentiment looks weak compared to e.g. climate change and nuclear war,  where fossil fuel production and nuclear arsenals continue, although there are significant policy actions taken in hopes of avoiding those problems. The sticking point is policymakers and the scientific community. At the end of the O... (read more)

I'll shill here and say that Rethink Priorities is pretty good at running polls of the electorate if anyone wants to know what a representative sample of Americans think about a particular issue such as this one. No need to poll Uber drivers or Twitter when you can do the real thing!

But the sentiment looks weak compared to e.g. climate change and nuclear war,  where fossil fuel production and nuclear arsenals continue,

That seems correct to me, but on the other hand, I think the public sentiment against things like GMOs was also weaker than the one that we currently have against climate change, and GMOs got slowed down regardless. Also I'm not sure how strong the sentiment against nuclear power was relative to the one against climate change, but in any case, nuclear power got hindered quite a bit too.

I think one important aspect w... (read more)

Do you think there is a large risk of AI systems killing or subjugating humanity autonomously related to scale-up of AI models?

A movement pursuing antidiscrimination or privacy protections for applications of AI that thinks the risk of AI autonomously destroying humanity is nonsense seems like it will mainly demand things like the EU privacy regulations, not bans on using $10B of GPUs instead of $10M in a model. It also seems like it wouldn't pursue measures targeted at the kind of disaster it denies, and might actively discourage them (this sometimes happ... (read more)

I think the two camps are less orthogonal than your examples of privacy and compute regulation portray. There's room for plenty of excellent policy interventions that both camps could work together to support. For instance, increasing regulatory requirements for transparency on algorithmic decision-making (and crucially, building capacity both in regulators and in the market supporting them to enforce this) is something that I think both camps would get behind (the x-risk one because it creates demand for interpretability and more, and the other because e.g. it's easier to show fairness issues) and could productively work on together.

I think there are subculture-clash reasons the two camps don't always get on, but these can be overcome, particularly given there's a common enemy (misaligned powerful AI). See also this paper: Beyond Near- and Long-Term: Towards a Clearer Account of Research Priorities in AI Ethics and Society.

I know lots of people who are uncertain about how big the risks are, and care about both problems, and work on both (I am one of these – I care more about AGI risk, but I think the best things I can do to help avert it involve working with the people you think aren't helpful).

A movement pursuing antidiscrimination or privacy protections for applications of AI that thinks the risk of AI autonomously destroying humanity is nonsense seems like it will mainly demand things like the EU privacy regulations, not bans on using $10B of GPUs instead of $10M in a model.

This is a very spicy take, but I would (weakly) guess that a hypothetical ban on ML trainings that cost more than $10M would make AGI timelines marginally shorter rather than longer, via shifting attention and energy away from scaling and towards algorithm innovation.

A movement pursuing antidiscrimination or privacy protections for applications of AI that thinks the risk of AI autonomously destroying humanity is nonsense seems like it will mainly demand things like the EU privacy regulations, not bans on using $10B of GPUs instead of $10M in a model.

I can imagine there being movements that fit this description, in which case I would not focus on talking with them or talking about them. 

But I have not been in touch with any movements matching this description. Perhaps you could share specific examples ... (read more)

There are a lot of pretty credible arguments for them to try, especially with low risk estimates for AI disempowering humanity, and if their percentile of responsibility looks high within the industry.

One view is that the risk of AI turning against humanity is less than the risk of a nasty eternal CCP dictatorship if democracies relinquish AI unilaterally. You see this sort of argument made publicly by people like Eric Schmidt, and 'the real risk isn't AGI revolt, it's bad humans' is almost a reflexive take for many in online discussion of AI risk. T... (read more)

Gerald Monroe:
Taking an extreme perspective here: do future generations of people who are not alive now, and whom no one alive now would ever meet, have any value? One perspective is no, they don't. From that perspective, "humanity" continues only as some arbitrary random numbers from our genetics. Even Clippy probably keeps at least one copy of the human genome in a file somewhere, so it's the same case. That is, there is no difference between the outcomes of:

1. we delay AI a few generations and future generations of humanity take over the galaxy
2. we fall to rampant AIs and their superintelligent descendants take over the galaxy

If you could delay AI long enough, you would be condemning the entire population of the world to death from aging, which is essentially the same case as the rampant AI killing the entire world.
Nathan Helm-Burger:
Carl S.: "One view is that the risk of AI turning against humanity is less than the risk of a nasty eternal CCP dictatorship if democracies relinquish AI unilaterally. You see this sort of argument made publicly by people like Eric Schmidt, and 'the real risk isn't AGI revolt, it's bad humans' is almost a reflexive take for many in online discussion of AI risk. That view can easily combine with the observation that there has been even less takeup of AI safety in China thus far than in liberal democracies, and mistrust of CCP decision-making and honesty, so it also reduces accident risk."

My thought: seems like a convincing demonstration of risk could be usefully persuasive.

Thank you, this seems like a high-quality steelman (I couldn't judge if it passes an ITT). 


Most AI companies and most employees there seem not to buy risk much, and to assign virtually no resources to address those issues. Unilaterally holding back from highly profitable AI when they won't put a tiny portion of those profits into safety mitigation again looks like an ask out of line with their weak interest. Even at the few significant companies with higher percentages of safety effort, it still looks to me like the power-weighted average of staff is extremely into racing to the front, at least to near the brink of catastrophe or until governmen... (read more)

looks to me like the power-weighted average of staff is extremely into racing to the front, at least to near the brink of catastrophe or until governments buy risks enough to coordinate slowdown.

Can anyone say confidently why? Is there one reason that predominates, or several? Like it's vaguely something about status, money, power, acquisitive mimesis, having a seat at the table... but these hypotheses are all weirdly dismissive of the epistemics of these high-powered people, so either we're talking about people who are high-powered because of the mana... (read more)

If the balance of opinion of scientists and policymakers (or those who had briefly heard arguments) was that AI catastrophic risk is high, and that this should be a huge social priority, then you could do a lot of things. For example, you could get budgets of tens of billions of dollars for interpretability research, the way governments already provide tens of billions of dollars of subsidies to strengthen their chip industries. Top AI people would be applying to do safety research in huge numbers. People like Bill Gates and Elon Musk who nominally take AI... (read more)

Those are good points. There are some considerations that go in the other direction. Sometimes it's not obvious what's a "failure to convince people" vs. "a failure of some people to be convincible." (I mean convincible by object-level arguments as opposed to convincible through social cascades where a particular new view reaches critical mass.) I believe both of the following:

* Persuasion efforts haven't been exhausted yet: we can do better at reaching not-yet-safety-concerned AI researchers. (That said, I think it's at least worth considering that we're getting close to exhausting low-hanging fruit.)
* Even so, "persuasion as the main pillar of a strategy" is somewhat likely to be massively inadequate, because it's difficult to change the minds and culture of humans in general (even if they're smart), let alone existing organizations.

Another point that's maybe worth highlighting is that the people who could make large demands don't have to be the same people who are best positioned for making smaller asks. (This is Katja's point about there not being a need for everyone to coordinate into a single "we.") The welfarism vs. abolitionism debate in animal advocacy and discussion of the radical flank effect seem related.

I also agree with a point lc makes in his post on slowing down AI. He points out that there's arguably a "missing mood" around the way most people in EA and the AI alignment community communicate with safety-unconcerned researchers. The missing sense of urgency probably lowers the chance of successful persuasion efforts.

Lastly, it's a challenge that there's little consensus in the EA research community around important questions like "How hard is AI alignment?," "How hard is alignment conditional on <5 years to TAI?," and "How long are TAI timelines?" (Though maybe there's quite some agreement on the second one, and the answer is at least "it's not easy.") I'd imagine there would at least be quite a strong EA expert consensus

I think I would have totally agreed in 2016. One update since then is that I think progress scales with resources way less than I used to think it did. In many historical cases, a core component of progress was driven by a small number of people (which is reflected in citation counts and in who is actually taught in textbooks), and introducing lots of funding and scaling too fast can disrupt that by increasing the amount of fake work.

$1B in safety well-spent is clearly more impactful than $1B less in semiconductors, it's just that "well-spent" is doing a lot of work... (read more)

This comment employs an oddly common failure mode of ignoring intermediate successes that align with market incentives, like "~N% of AI companies stop publishing their innovations on Arxiv for free".

I think this comment is overstating the case for policymakers and the electorate actually believing that investing in AI is good for the world. I think the answer currently is "we don't know what policymakers and the electorate actually want in relation to AI" as well as "the relationship of policymakers and the electorate is in the middle of shifting quite rapidly, so past actions are not that predictive of future actions".

I really only have anecdata to go on (though I don't think anyone has much better), but my sense from doing informal polls of e.g. Ube... (read more)

There are plenty of movements out there (ethics & inclusion, digital democracy, privacy, etc.) who are against current directions of AI developments, and they don’t need the AGI risk argument to be convinced that current corporate scale-up of AI models is harmful.

Working with them, redirecting AI developments away from more power-consolidating/general AI may not be that much harder than investing in supposedly “risk-mitigating” safety research.

Seems reasonable regarding public policy. But what about 1. private funders of AGI-relevant research 2. researchers doing AGI-relevant research? Seems like there are a lot of potential reframings that make it more feasible to separate safe-ish research from non-safe-ish research. E.g. software 2.0: we're not trying to make a General Intelligence, we're trying to replace some functions in our software with nets learned from data. This is what AlphaFold is like, and I assume is what ML for fusion energy is like. If there's a real category like this, a fair amount of the conflict might be avoidable? 

I wasn't arguing for "99+% chance that an AI, even if trained specifically to care about humans, would not end up caring about humans at all", just addressing the questions about humans in the limit of intelligence and power in the comment I replied to. It does seem to me that there is a substantial chance that humans eventually do stop having human children in the limit of intelligence and power.

A uniform fertility below 2.1 means extinction, yes, but in no country is the fertility rate uniformly below 2.1. Instead, some humans decide they want lots of children despite the existence of contraception and educational opportunity, and others do not. It seems to me that a substantial proportion of humans would stop having children in the limit of intelligence and power. It also seems to me like a substantial number of humans continue (and would continue) to have such children as if they value it for its own sake. This suggests that the problems Nate is highlighting, while real, are not sufficient to guarantee complete failure - even when the training process is not being designed with those problems in mind, and there are no attempts at iterated amplification whatsoever. This nuance is important because it affects how far we should think a naive SGD RL approach is from even limited "1% success", and whether or not simple modifications are likely to greatly increase survival odds.
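As a toy illustration of the point about non-uniform fertility (a deliberate simplification that ignores age structure, migration, and mixing between subgroups, with made-up fertility numbers):

```python
# Toy model: each generation's size scales by fertility / replacement-rate.
def population_after(generations, fertility, replacement=2.1, start=1.0):
    return start * (fertility / replacement) ** generations

# A population uniformly at fertility 1.6 shrinks to ~7% in ten generations...
print(round(population_after(10, 1.6), 3))
# ...while a subgroup holding fertility at 3.0 grows many-fold instead:
print(round(population_after(10, 3.0), 3))
```

The second case is the crux: heterogeneity means below-replacement averages need not imply extinction if some subgroups keep valuing children for their own sake.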

Number of children in our world is negatively correlated with educational achievement and income, often in ways that look like serving other utility function quirks at the expense of children (as the ability to indulge those quirks with scarce effort improved with technology faster than for those quirks more closely tied to children), e.g. consumption spending instead of children, sex with contraception, pets instead of babies. Climate/ecological or philosophical antinatalism is also more popular in the same regions and social circles. Philosophical support... (read more)

The party line of MIRI is not that a super intelligence, without extreme measures, would waste most of the universe's EV on frivolous nonsense. The party line is that there is a 99+% chance that an AI, even if trained specifically to care about humans, would not end up caring about humans at all, and instead turn the universe into uniform squiggles. That's the claim I find unsubstantiated by most concrete concerns they have, and which seems suspiciously disanalogous to the one natural example we have. 99% of people in first world countries are not forgoing pregnancy for educational attainment. It'd of course still be extremely terrible, and maybe even more terrible, if what I think is going to happen happens! But it doesn't look like all matter becoming squiggles.

At the object level I think actors like Target Malaria, the Bill and Melinda Gates Foundation, Open Philanthropy, and Kevin Esvelt are right to support a legal process approved by affected populations and states, and that such a unilateral illegal release would be very bad in terms of expected lives saved with biotech. Some of the considerations:

  1. Eradication of malaria will require a lot more than a gene drive against Anopheles gambiae s.l., meaning government cooperation is still required.
  2. Resistance can and does develop to gene drives, so that development
... (read more)

Unilateral action in general might be bad, but most of these reasons you've given to not support an illegal one (if gene drives were explicitly illegal, which they're not) seem completely misguided or misleading. I can't parse whether or not this is deliberate. I'm against lying as a means of stopping unilateral action in most real world scenarios; people who want to obtain or give multilateral consensus will need to understand where actual risks come from, not made up risks designed to discourage bad actors.

Eradication of malaria will require a lot more t

... (read more)

Putting aside the concerns about potential backfire effects of unilateral action[1], calling the release of gene drive mosquitoes "illegal" is unsubstantiated.  The claim that actually cashes out to is "every single country where Anopheles gambiae are a substantial vector for the spread of malaria has laws that narrowly prohibit the release of mosquitoes".  The alternative interpretation, that "every single country will stretch obviously unrelated laws as far as necessary to throw the book at you if you do this", may be true, but isn't... (read more)

Speaking as someone who does work on prioritization, this is the opposite of my lived experience, which is that robust broadly credible values for this would be incredibly valuable, and I would happily accept them over billions of dollars for risk reduction and feel civilization's prospects substantially improved.

These sorts of forecasts are critical to setting budgets and impact thresholds across cause areas, and even more crucially, to determining the signs of interventions, e.g. in arguments about whether to race for AGI with less concern about cata... (read more)

This is surprising to me! If I understand correctly, you would prefer to know for certain that P(doom) was (say) 10% than spend billions on reducing x-risks? (Perhaps this comes down to a difference in our definitions of P(doom).)

Like Dagon pointed out, it seems more useful to know how much you can change P(doom). For example, if we treat AI risk as a single hard step, going from 99% -> 90% increases the expected value of the future by 10X (while going from 10% -> 1% raises it by about 10%); either way, the decision-relevant quantity is the change you can make, not the level it started at.

For prioritization within AI safety, are there projects in AI safety that you would stop funding as P(doom) goes from 1% to 10% to 99%? I personally would want to fund all the projects I could, regardless of P(doom) (with resources roughly proportional to how promising those projects are).

For prioritization across different risks, I think P(doom) is less important because I think AI is the only risk with greater than 1% chance of existential catastrophe. Maybe you have higher estimates for other risks and this is the crux?

In terms of institutional decision making, it seems like P(doom) > 1% is sufficient to determine the signs of different interventions. In a perfect world, a 1% chance of extinction would make researchers, companies, and governments very cautious; there would be no need to narrow down the range further.

Like Holden and Nathan point out, P(doom) does serve a promotional role by convincing people to focus more on AI risk, but getting more precise estimates of P(doom) isn't necessarily the best way to convince people.
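Making the "single hard step" arithmetic explicit (a toy sketch that assumes the expected value of the future scales linearly with survival probability, 1 - P(doom)):

```python
# EV multiplier from a change in P(doom), under a toy model where the
# value of the future is proportional to survival probability.
def ev_multiplier(p_doom_before, p_doom_after):
    return (1 - p_doom_after) / (1 - p_doom_before)

print(round(ev_multiplier(0.99, 0.90), 2))  # 10.0 -- survival odds go 1% -> 10%
print(round(ev_multiplier(0.10, 0.01), 2))  # 1.1 -- survival odds go 90% -> 99%
```

On this model the same tenfold reduction in P(doom) matters far more when starting near certain doom than when starting from 10%.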

b) the very superhuman system knows it can't kill us and that we would turn it off, and therefore conceals its capabilities, so we don't know that we've reached the very superhuman level.


Intentionally performing badly on easily measurable performance metrics seems like it requires fairly extreme successful gradient hacking or equivalent. I might analogize it to alien overlords finding it impossible to breed humans to have lots of children by using abilities they already possess. There have to be no mutations or paths through training to incrementally get the AI to use its full abilities (and I think there likely would be).

An analogy I use here is to bacteria operating in an environment in which thermodynamic entropy must always increase. The bacteria are selected to "save" and "spend" negentropy efficiently, in order to couple it to the things they want. (Negentropy here is intended to be analogous to potential-training-objective-improvement.) And as the bacteria example shows, that is totally a thing which we do in fact see in the world.

Analogously, conditional on things like gradient hacking being an issue at all, I'd expect the "hacker" to treat potential-training-object... (read more)

It's easy for ruling AGIs to have many small superintelligent drone police per human that can continually observe and restrain any physical action, and insert controls in all computer equipment/robots. That is plenty to let the humans go about their lives (in style and with tremendous wealth/tech) while being prevented from creating vacuum collapse or something else that might let them damage the vastly more powerful AGI civilization.

The material cost of this is a tiny portion of Solar System resources, as is sustaining legacy humans. On the other hand, arguments like cooperation with aliens, simulation concerns, and similar matter on the scale of the whole civilization, which has many OOMs more resources.

2Ben Pace1y
Thanks for the concrete example in the first paragraph, upvote. I don't know that it would successfully contain humans who were within it for 10^36 years. That seems like enough time for some Ramanujan-like figure to crack the basics of how to code an AGI in his head and share it, and potentially figure out a hidden place or substrate on which to do computation that the drones aren't successfully tracking. (It's also enough time for super-babies or discovering other interesting cheat codes in reality.) 10^36 is my cached number from the last time I asked how long life could sustain in this universe. Perhaps you think it would only keep us alive as long as our sun exists, which is 5*10^9 years. On that side of things, it seems to me essentially the same as extinction in terms of value-lost. I don't follow the relevance of the second paragraph; perhaps you're just listing those as outstanding risks from sustaining a whole civilization.

4. the rest of the world pays attention to large or powerful real-world bureaucracies and force rules on them that small teams / individuals can ignore (e.g. Secret Congress, Copenhagen interpretation of ethics, startups being able to do illegal stuff), but this presumably won't apply to alignment approaches.

I think a lot of alignment tax-imposing interventions (like requiring local work to be transparent for process-based feedback) could be analogous?

2Rohin Shah1y
Hmm, maybe? There are a few ways this could go: 1. We give feedback to the model on its reasoning, that feedback is bad in the same way that "the rest of the world pays attention and forces dumb rules on them" is bad 2. "Keep your reasoning transparent" is itself a dumb rule that we force upon the AI system that leads to terrible bureaucracy problems I'm unsure about (2) and mostly disagree with (1) (and I think you were mostly saying (2)). Disagreement with (1): Seems like the disanalogy relies pretty hard on the rest of the world not paying much attention when they force bureaucracies to follow dumb rules, whereas we will presumably pay a lot of attention to how we give process-based feedback.

Retroactively giving negative rewards to bad behaviors once we’ve caught them seems like it would shift the reward-maximizing strategy (the goal of the training game) toward avoiding any bad actions that humans could plausibly punish later. 

A swift and decisive coup would still maximize reward (or further other goals). If Alex gets the opportunity to gain enough control to stop Magma engineers from changing its rewards before humans can tell what it’s planning, humans would not be able to disincentivize the actions that led to that coup. Taking t

... (read more)

The evolutionary mismatch causes differences in neural reward, e.g. eating lots of sugary food still tastes (neurally) rewarding even though it's currently evolutionarily maladaptive. And habituation reduces the delightfulness of stimuli.

This happens during fine-tuning training already, selecting for weights that give the higher human-rated response of two (or more) options. It's a starting point that can be lost later on, but we do have it now with respect to configurations of weights giving different observed behaviors.

Individual humans do make off much better when they get to select between products from competing companies rather than monopolies, benefitting from companies going out of their way to demonstrate when their products are verifiably better than rivals'. Humans get treated better by sociopathic powerful politicians and parties when those politicians face the threat of election rivals (e.g. no famines). Small states get treated better when multiple superpowers compete for their allegiance. Competitive science with occasional refutations of false claims produc... (read more)

So, the analogy here is that there's hundreds (or more) of Godzillas all running around, doing whatever it is Godzillas want to do. Humanity helps out whatever Godzillas humanity likes best, which in turn creates an incentive for the Godzillas to make humanity like them.


Still within the analogy: part of the literary point of Godzilla is that humanity's efforts to fight it are mostly pretty ineffective. In inter-Godzilla fights, humanity is like an annoying fly buzzing around. The humans just aren't all... (read more)

Wei Dai1yΩ7145

I was going to make a comment to the effect that humans are already a species of Godzilla (humans aren't safe, human morality is scary, yada yada), only to find you making the same analogy, but with an optimistic slant. :)

Competition between the powerful can lead to the ability of the less powerful to extract value.  It can also lead to the less powerful being more ruthlessly exploited by the powerful as a result of their competition.  It depends on the ability of the less powerful to choose between the more powerful.  I am not confident humanity or parts of it will have the ability to choose between competing AGIs.

Naturally it doesn't go on forever, but any situation where you're developing technologies that move you to successively faster exponential trajectories is superexponential overall for some range. E.g. if you have robot factories that can reproduce exponentially until they've filled much of the Earth or solar system, and they are also developing faster-reproducing factories, the overall process is superexponential. So is the history of human economic growth, and the improvement from an AI intelligence explosion.
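A minimal numeric sketch of that claim, with invented parameters (the specific rates are illustrative, not estimates):

```python
# Capacity reproduces exponentially at rate r, while reinvested R&D raises
# r itself each step -- the combined trajectory is superexponential for a range.
def grow(steps, r0=0.05, rd_gain=0.002):
    capacity, r = 1.0, r0
    for _ in range(steps):
        capacity *= 1 + r
        r += rd_gain  # hypothetical: R&D steadily improves the reproduction rate
    return capacity

fixed = grow(200, rd_gain=0.0)   # plain fixed-rate exponential baseline
improving = grow(200)            # growth rate itself keeps rising
print(improving > fixed)         # True: the self-improving trajectory pulls ahead
```

The gap between the two trajectories widens without bound over the modeled range, which is what "successively faster exponential trajectories" buys a first mover.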

By the time you're at ~cubic expansion, being ahead from the early superexponential phase means the followers have missed their chance.

I agree that they probably would have missed their chance to catch up with the frontier of your expansion. Maybe an electromagnetic radiation-based assault could reach you if targeted (the speed of light is constant relative to you in a vacuum, even if you're traveling in the same direction), although unlikely to get much of the frontier of your expansion, and there are plausibly effective defenses, too. Do you also mean they wouldn't be able to take most what you've passed through, though? Or it wouldn't matter? If so, how would this be guaranteed (without any violation of the territory of sovereign states on Earth)? Exhaustive extraction in space? An advantage in armed space conflicts?

 I think this claim is true, on account of gray goo and lots of other things, and I suspect Eliezer does too, and I’m pretty sure other people disagree with this claim.

If you have robust alignment, or AIs that are rapidly bootstrapping their level of alignment fast enough to outpace the danger of increased capabilities, aligned AGI could get through its intelligence explosion to get radically superior technology and capabilities. It could also get a hard start on superexponential replication in space, so that no follower could ever catch up, and enoug... (read more)

A bit pedantic, but isn't superexponential replication too fast? Won't it hit physical limits eventually, e.g. expanding at the speed of light in each direction, so at most a cubic function of time? Also, never allowing followers to catch up means abandoning at least some or almost all of the space you passed through. Plausibly you could take most of the accessible and useful resources with you, which would also make it harder for pursuers to ever catch up, since they will plausibly need to extract resources every now and then to fuel further travel. On the other hand, it seems unlikely to me that we could extract or destroy resources quickly enough to not leave any behind for pursuers, if they're at most months behind.
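The cubic ceiling above can be written out directly (a sketch): with frontier expansion speed v ≤ c, the accessible volume after time t is bounded by

```latex
V(t) \le \frac{4}{3}\pi (ct)^3
```

so resources grow at most polynomially (degree 3) in time once the frontier is light-speed-limited, and any superexponential replication phase must end before then.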

Some more points about this action:

  • Controlling the datacenter means controlling the gradients/reward function, so that now AIs can do things that would otherwise suffer updating from SGD, e.g. acting on inner misaligned goals, or concealing its full capabilities even when this lowers performance
    • For reward-hungry AIs, getting to set reward to maximum (and keep it there?) seems extremely desirable
    • This also means getting past interpretability tools
    • Tripwires or watchdog AIs in the same datacenter that don't succeed in stopping the action no longer have the pow
... (read more)

Whoops, you're right that I linked the wrong survey. I see others posted the link to Rob's survey (done in response to some previous similar claims) and I edited my comment to fix the link.

I think you can identify a cluster of near certain doom views, e.g. 'logistic success curve' and odds of success being on the order of magnitude of 1% (vs 10%, or 90%) based around MIRI/Eliezer, with a lot of epistemic deference involved (visible on LW). I would say it is largely attributable there and without sufficient support.

"My current best guess is if we surveyed p... (read more)

You're right, my link was wrong, that one is a fine link.

Agreed, and versions of them exist in human governments trying to maintain control (where non-coordination of revolts is central).  A lot of the differences are about exploiting new capabilities like copying and digital neuroscience, or changing reward hookups.

In ye olde times of the early 2010s people (such as I) would formulate questions about what kind of institutional setups you'd use to get answers out of untrusted AIs (asking them separately to point out vulnerabilities in your security arrangement, having multiple AIs face fake opportunities to whistleblow on bad behavior, randomized richer human evaluations to incentivize behavior on a larger scale).


Are any of these ancient discussions available anywhere?

[Edited to link correct survey.]

It's really largely Eliezer and some MIRI people. Most alignment researchers (e.g. at ARC, DeepMind, OpenAI, Anthropic, CHAI) and most of the community [ETA: had wrong link here before] disagree (I count myself among those who disagree, although I am concerned about a big risk here), and think MIRI doesn't have good reasons to support the claim of almost certain doom.

In particular, other alignment researchers tend to think that competitive supervision (e.g. AIs competing for reward to provide assistance in AI control that h... (read more)

8Jack R1y
I recently asked Eliezer why he didn't suspect ELK to be helpful, and it seemed that one of his major reasons was that Paul was "wrongly" excited about IDA. It seems that at this point in time, neither Paul nor Eliezer are excited about IDA, but Eliezer got to the conclusion first. Although, the IDA-bearishness may be for fundamentally different reasons -- I haven't tried to figure that out yet. Have you been taking this into account re: your ELK bullishness? Obviously, this sort of point should be ignored in favor of object-level arguments about ELK, but to be honest, ELK is taking me a while to digest, so for me that has to wait.

You've now linked to the same survey twice in different discussions of this topic, even though this survey, as far as I can tell, provides no evidence of the position you are trying to argue for. To copy Thomas Kwa's response to your previous comment:

I don't see anything in the linked survey about a consensus view on total existential risk probability from AGI. The survey asked researchers to compare between different existential catastrophe scenarios, not about their total x-risk probability, and surely not about the probability of x-risk if AGI we

... (read more)
5Eric Chen1y
Unfinished sentence?

It's really largely Eliezer and some MIRI people. 

Hm? I was recently at a 10-15 person lunch for people with >75% on doom, that included a number of non-MIRI people, including at least one person each from FHI and DeepMind and CHAI.

(Many of the people had interacted with MIRI or at some time worked with/for them, but work at other places now.)

Just registering that your comment feels a little overstated, but you're right to say a lot of this emanates from some folks at MIRI. For one, I had been betting a lot on MIRI, and now feel like a lot more responsibility has fallen on my plate.

4Daniel Kokotajlo1y
Nitpick: I think this should either be a comment or an answer to Yitz' upcoming followup post, since it isn't an attempt to convince them that humanity is doomed.

The consensus among alignment researchers is that if AGI were developed right now it would be almost certainly a negative

This isn't true.  [ETA: I linked the wrong survey before.]

8Thomas Kwa1y
I don't see anything in the linked survey about a consensus view on total existential risk probability from AGI. The survey asked researchers to compare between different existential catastrophe scenarios, not about their total x-risk probability, and surely not about the probability of x-risk if AGI were developed now without further alignment research.

Shahar Avin at CSER has been involved in creating and conducting a number of such games/exercises, and you could reach out to him for his gleanings from running them.

He has written a paper on this too, link here.

"Overall these estimates imply a timeline of [372 years]("

That was only for Hanson's convenience sample, other surveys using the method gave much shorter timelines, as discussed in the post.

2Rohin Shah2y
Ah, fair point, looking back at this summary I probably should have clarified that the methodology could be applied with other samples and those look much less long.

But new algorithms also don't work well on old hardware. That's evidence in favor of Paul's view that much software work is adapting to exploit new hardware scales.

8Charlie Steiner2y
Which examples are you thinking of? Modern Stockfish outperformed historical chess engines even when using the same resources, until far enough in the past that computers didn't have enough RAM to load it. I definitely agree with your original-comment points about the general informativeness of hardware, and absolutely software is adapting to fit our current hardware. But this can all be true even if advances in software can make more than 20 orders of magnitude difference in what hardware is needed for AGI, and are much less predictable than advances in hardware rather than being adaptations in lockstep with it.

A perfectly correlated time series of compute and labor would not let us say which had the larger marginal contribution, but we have resources to get at that, which I was referring to with 'plausible decompositions.' This includes experiments with old and new software and hardware, like the chess ones Paul recently commissioned, and studies by AI Impacts, OpenAI, and Neil Thompson. There are AI scaling experiments, and observations of the results of shocks like the end of Dennard scaling, the availability of GPGPU computing, and Besiroglu's data on the rel... (read more)

6Charlie Steiner2y
The chess link maybe should go to hippke's work. What you can see there is that a fixed chess algorithm takes an exponentially growing amount of compute and transforms it into logarithmically-growing Elo. Similar behavior features in recent pessimistic predictions of deep learning's future trajectory. If general navigation of the real world suffers from this same logarithmic-or-worse penalty when translating hardware into performance metrics, then (perhaps surprisingly) we can't conclude that hardware is the dominant driver of progress by noticing that the cost of compute is dropping rapidly.
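The pattern hippke observed can be sketched with a stylized fit (the constants here are made up for illustration, not taken from his data):

```python
import math

# Stylized model: Elo gain is linear in log2(compute), so exponentially
# growing compute buys only linearly growing Elo. Constants are hypothetical.
def elo(compute, base=2000.0, per_doubling=60.0):
    return base + per_doubling * math.log2(compute)

print(elo(2) - elo(1))      # 60.0 -- one doubling of compute buys one increment
print(elo(1024) - elo(1))   # 600.0 -- 1024x the compute buys only 10 increments
```

Under a fit like this, rapidly falling compute costs are compatible with slowly improving performance metrics, which is the force of the objection.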

I will have to look at these studies in detail in order to understand, but I'm confused how can this pass some obvious tests. For example, do you claim that alpha-beta pruning can match AlphaGo given some not-crazy advantage in compute? Do you claim that SVMs can do SOTA image classification with not-crazy advantage in compute (or with any amount of compute with the same training data)? Can Eliza-style chatbots compete with GPT3 however we scale them up?

Progress in AI has largely been a function of increasing compute, human software research efforts, and serial time/steps. Throwing more compute at researchers has improved performance both directly and indirectly (e.g. by enabling more experiments, refining evaluation functions in chess, training neural networks, or making algorithms that work best with large compute more attractive).

Historically compute has grown by many orders of magnitude, while human labor applied to AI and supporting software  by only a few. And on plausible decompositions of pro... (read more)
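One way to picture such a decomposition (a toy Cobb-Douglas-style sketch; the exponents and input growth factors are invented for illustration, not estimates from the studies mentioned):

```python
# Toy production function: AI progress as a function of compute C and
# research labor L, with hypothetical diminishing-returns exponents.
def progress(C, L, alpha=0.6, beta=0.4):
    return (C ** alpha) * (L ** beta)

# Suppose compute grew by ~6 orders of magnitude while labor grew by ~1:
growth = progress(1e6, 1e1) / progress(1.0, 1.0)
print(f"{growth:.0f}")   # most of the combined growth traces back to compute
```

With any such decomposition, an input that grows by many more orders of magnitude dominates total progress even when its exponent is not much larger.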

(I'm trying to answer and clarify some of the points in the comments based on my interpretation of Yudkowsky in this post. So take the interpretations with a grain of salt, not as "exactly what Yudkowsky meant.")

My summary of what you're defending here: because hardware progress is (according to you) the major driver of AI innovation, we should invest a lot of our forecasting resources into forecasting it, and we should leverage it as the strongest source of evidence available for thinking about AGI timelines.

I feel like this is not in contradiction with what Yudkowsky wrote in this post? I doubt he agrees that just additional compute is the main driver of progress (after all, the Bitter Lesson mostly tells you that insights and innovations leveraging more compute will beat hardcoded ones), but insofar as he expects us to have next to no knowledge of how to build AGI until around 2 years before it is done (and then only for those with the Thielian secret), compute is indeed the next best thing that we have to estimate timelines. Yet Yudkowsky's point is that being the next best thing doesn't mean it's any good.

Evolution being an upper bound makes sense, and I think Yudkowsky agrees. But it's an upper bound on the whole human optimization process, and the search space of the human optimization process is tricky to think about. I see much of Yudkowsky's criticisms of biological estimates here as saying "this biological anchor doesn't express the cost of evolution's optimization in terms of human optimization, but instead goes for a proxy which doesn't tell you anything." So if someone captured both evolution and human optimization in the same search space, and found an upper bound on the cost (in terms of optimization power) that evolution spent to find humans, then I expect Yudkowsky would agree that this is an upper bound for the optimization power that humans will use. But he might still retort that translating optimization power into compute is not obvious.

Historically compute has grown by many orders of magnitude, while human labor applied to AI and supporting software by only a few. And on plausible decompositions of progress (allowing for adjustment of software to current hardware and vice versa), hardware growth accounts for more of the progress over time than human labor input growth.

So if you're going to use an AI production function for tech forecasting based on inputs (which do relatively OK by the standards tech forecasting), it's best to use all of compute, labor, and time, but it makes sense

... (read more)

I commend this comment and concur with the importance of hardware, the straw-manning of Moravec, etc.

However I do think that EY had a few valid criticisms of Ajeya's model in particular - it ends up smearing probability mass over many anchors or sub-models, most of which are arguably poorly grounded in deep engineering knowledge. And yes you can use it to create your own model, but most people won't do that and are just looking at the default median conclusion.

Moore's Law is petering out as we run up against the constraints of physics for practical irrever... (read more)

You may be interested in some recent empirical experiments, demonstrating objective robustness failures/inner misalignment, including ones predicted in the risks from learned optimization paper.
