[-]Robert Miles3y2413

This is good and interesting. Various things to address, but I only have time for a couple at random.

I disagree with the idea that true things necessarily have explanations that are both convincing and short. In my experience you can give a short explanation that doesn't address everyone's reasonable objections, or a very long one that does, or something in between. If you understand some specific point about cutting edge research, you should be able to properly explain it to a lay person, but by the time you're done they won't be a lay person any more! If you restrict your explanation to "things you can cover before the person you're explaining to decides this isn't worth their time and goes away", many concepts simply cannot ever be explained to most people, because they don't really want to know.

So the core challenge is staying interesting enough for long enough to actually get across all of the required concepts. On that point, have you seen any of my videos, and do you have thoughts on them? You can search "AI Safety" on YouTube.

Similarly, do you have thoughts on AISafety.info?

[-]PoignardAzur3y52

Similarly, do you have thoughts on AISafety.info?

Quick note on AISafety.info: I just stumbled on it and it's a great initiative.

I remember pitching an idea for an AI Safety FAQ (which I'm currently working on) to a friend at MIRI and him telling me "We don't have anything like this, it's a great idea, go for it!"; my reaction at the time was "Well I'm glad for the validation and also very scared that nobody has had the idea yet", so I'm glad to have been wrong about that.

I'll keep working on my article, though, because I think the FAQ you're writing is too vast and maybe won't quite have enough punch; it won't be compelling enough for most people.

Would love to chat with you about it at some point.

[-]nicholashalden3y30

I disagree with the idea that true things necessarily have explanations that are both convincing and short.

I don't think it's necessary for something to be true (there's no short, convincing explanation of eg quantum mechanics), but I think accurate forecasts tend to have such explanations (Tetlock's work strongly argues for this).

I agree there is a balance to be struck between losing your audience and being exhaustive, just that the vast majority of material I've read is on one side of this.

On that point, have you seen any of my videos, and do you have thoughts on them? You can search "AI Safety" on YouTube.

I don't prefer video format for learning in general, but I will take a look!

Similarly, do you have thoughts on AISafety.info?

I hadn't seen this. I think it's a good resource as sort of a FAQ, but it isn't zeroed in on "here is the problem we are trying to solve, and here's why you should care about it" in layman's terms. I guess the best example of what I'm looking for is Benjamin Hilton's article for 80,000 Hours, which I wish were more widely shared.

[-]Daniel Kokotajlo3y189

Thanks for this post! I definitely disagree with you about point I (I think AI doom is 70% likely and I think people who think it is less than, say, 20% are being very unreasonable) but I appreciate the feedback and constructive criticism, especially section III.

If you ever want to chat sometime (e.g. in a comment thread, or in a video call) I'd be happy to. If you are especially interested I can reply here to your object-level arguments in section I. I guess a lightning version would be "My arguments for doom don't depend on nanotech or anything possibly-impossible like that, only on things that seem clearly possible like ordinary persuasion, hacking, engineering, warfare, etc. As for what values ASI agents would have, indeed, they could end up just wanting to get low loss or even delete themselves or something like that. But if we are training them to complete ambitious tasks in the real world (and especially, if we are training them to have ambitious aligned goals like promoting human flourishing and avoiding long-term bad consequences), they'll probably develop ambitious goals, and even if they don't, that only buys us a little bit of time before someone creates one that does have ambitious goals. Finally, even goals that seem very unambitious can really become ambitious goals when a superintelligence has them, for galaxy-brained reasons which I can explain if you like. As for what happens after unaligned ASI takes over the world -- agreed, it's plausible they won't kill us. But I think it's safe to say that unaligned ASI taking over the world would be very bad in expectation and we should work hard to avoid it."

[-]Jan_Kulveit3y104

As a minor nitpick, 70% likely and 20% are quite close in logodds space, so it seems odd you think what you believe is reasonable and something so close is "very unreasonable".

[-]Daniel Kokotajlo3y95

I agree that logodds space is the right way to think about how close probabilities are. However, my epistemic situation right now is basically this:

"It sure seems like Doom is more likely than Safety, for a bunch of reasons. However, I feel sufficiently uncertain about stuff, and humble, that I don't want to say e.g. 99% chance of doom, or even 90%. I can in fact imagine things being OK, in a couple different ways, even if those ways seem unlikely to me. ... OK, now if I imagine someone having the flipped perspective, and thinking that things being OK is more likely than doom, but being humble and thinking that they should assign at least 10% credence (but less than 20%) to doom... I'd be like 'what are you smoking? What world are you living in, where it seems like things will be fine by default but there are a few unlikely ways things could go badly, instead of a world where it seems like things will go badly by default but there are a few unlikely ways things could go well? I mean I can see how you'd think this if you weren't aware of how short timelines to ASI are, or if you hadn't thought much about the alignment problem...'"

If you think this is unreasonable, I'd be interested to hear it!

[-]Jan_Kulveit3y72

I don't think the way you imagine perspective inversion captures the typical ways to arrive at e.g. 20% doom probability. For example, I do believe that there are multiple good things which can happen or be true and which decrease p(doom), and I put some weight on them:
- we do discover some relatively short description of something like "harmony and kindness"; this works as an alignment target
- enough of morality is convergent
- AI progress helps with human coordination (could be in a costly way, e.g. a warning shot)
- it's convergent to massively scale alignment efforts with AI power, and these solve some of the more obvious problems

I would expect doom to prevail conditional on only small efforts to avoid it, but I do think the actual efforts will be substantial, and this moves the chances to ~20-30%. (Also, I think most of the risk comes from not being able to deal with complex systems of many AIs and the economy decoupling from humans, and I expect single-single alignment to be solved sufficiently to prevent single-system takeover by default.)

[-]Daniel Kokotajlo3y40

Thanks for this comment. I'd be generally interested to hear more about how one could get to 20% doom (or less).

The list you give above is cool but doesn't do it for me; going down the list I'd guess something like:
1. 20% likely (honesty seems like the best bet to me) because we have so little time left, but even if it happens we aren't out of the woods yet because there are various plausible ways we could screw things up. So maybe overall this is where 1/3rd of my hope comes from.
2. 5% likely? Would want to think about this more. I could imagine myself being very wrong here actually, I haven't thought about it enough. But it sure does sound like wishful thinking.
3. This is already happening to some extent, but the question is, will it happen enough? My overall "humans coordinate to not build the dangerous kinds of AI for several years, long enough to figure out how to end the acute risk period" is where most of my hope comes from. I guess it's the remaining 2/3rds basically. So, I guess I can say 20% likely.
4. What does this mean?

I would be much more optimistic if I thought timelines were longer.

[-]nicholashalden3y30

This seems to violate common sense. Why would you think about this in log space? 99% and 1% are identical in if(>0) space, but they have massively different implications for how you think about a risk (just like 20% and 70% do!)

[-]Jan_Kulveit3y113

It's a much more natural way to think about it (cf. e.g. E. T. Jaynes, Probability Theory, examples in Chapter IV).

In this specific case of evaluating hypotheses, the distance in logodds space indicates the strength of the evidence you would need to see to update. A small distance implies you don't need that much evidence to update between the positions (note that the distance between 0.7 and 0.2 is smaller than the distance between 0.9 and 0.99). If you need only a small amount of evidence to update, it is easy to imagine that some other observer, as reasonable as you, has accumulated a bit or two somewhere you haven't seen.
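To make the comparison concrete, here is a minimal sketch of the arithmetic in Python. The function names `logit` and `evidence_gap_bits` are illustrative choices, not taken from Jaynes, and measuring evidence in bits (base-2 logarithm) is an assumption made to match the "a bit or two" phrasing; other bases only rescale the numbers.

```python
import math


def logit(p: float) -> float:
    """Log-odds of probability p, in bits (base-2 logarithm)."""
    return math.log2(p / (1.0 - p))


def evidence_gap_bits(p: float, q: float) -> float:
    """Absolute distance between p and q in log-odds space, in bits."""
    return abs(logit(p) - logit(q))


gap_doom = evidence_gap_bits(0.7, 0.2)
gap_high = evidence_gap_bits(0.9, 0.99)

print(f"0.7 vs 0.2:  {gap_doom:.2f} bits")   # 3.22 bits
print(f"0.9 vs 0.99: {gap_high:.2f} bits")   # 3.46 bits
```

Under these assumptions, moving between 20% and 70% takes roughly 3.2 bits of evidence, slightly less than the roughly 3.5 bits needed to move between 90% and 99%.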

Because working in logspace is way more natural, it is almost certainly also what our brains do - the "common sense" is almost certainly based on logspace representations.

[-]Droopyhammock3y70

I seem to remember your P(doom) being 85% a short while ago. I’d be interested to know why it has dropped to 70%, or in another way of looking at it, why you believe our odds of non-doom have doubled.

[-]Daniel Kokotajlo3y*162

Whereas my timelines views are extremely well thought-through (relative to most people that is) I feel much more uncertain and unstable about p(doom). That said, here's why I updated:

Hinton and Bengio have come out as worried about AGI x-risk; the FLI letter and Yudkowsky's tour of podcasts, while incompetently executed, have been better received by the general public and elites than I expected; the big labs (especially OpenAI) have reiterated that superintelligent AGI is a thing, that it might come soon, that it might kill everyone, and that regulation is needed; internally, OpenAI at least has pushed more for focus on these big issues as well. Oh and there's been some cool progress in interpretability & alignment which doesn't come close to solving the problem on its own but makes me optimistic that we aren't barking up the wrong trees / completely hitting a wall. (I'm thinking about e.g. the cheese vector and activation vector stuff and the discovering latent knowledge stuff)

As for capabilities, yes it's bad that tons of people are now experimenting with AutoGPT and making their own LLM startups, and it's bad that Google DeepMind is apparently doing some AGI mega-project, but... those things were already priced in, by me at least. I fully expected the other big corporations to 'wake up' at some point and start racing hard, and the capabilities we've seen so far are pretty much exactly on trend for my What 2026 Looks Like scenario which involved AI takeover in 2027 and singularity in 2028.

Basically, I feel like we are on track to rule out one of the possible bad futures (in which the big corporations circle the wagons and say AGI is Safe there is No Evidence of Danger the AI x-risk people are Crazy Fanatics and the government buys their story long enough for it to be too late.) Now unfortunately the most likely bad future remains, in which the government does implement some regulation intended to fix the problem, but it fails to fix the problem & fails to buy us any significant amount of time before the dangerous sorts of AGI are built and deployed. (e.g. because it gets watered down by tech companies averse to abandoning profitable products and lines of research, e.g. because racing with China causes everyone to go 'well actually' when the time comes to slow down and change course)

Meanwhile one of the good futures (in which the regulation is good and succeeds in preventing people from building the bad kinds of AGI for years, buying us time in which to do more alignment, interpretability, and governance work, and for the world to generally get more awareness and focus on the problems) is looking somewhat more likely.

So I still think we are on a default path to doom but one of the plausible bad futures seems less likely and one of the plausible good futures seems more likely. So yeah.

[-]Wei Dai3y90

Thanks for this. I was just wondering how your views have updated in light of recent events.

Like you, I also think that things are going better than my median prediction, but paradoxically I've been feeling even more pessimistic lately. Reflecting on this, I think my p(doom) has gone up instead of down, because some of the good futures where a lot of my probability mass for non-doom was concentrated have also disappeared, which seems to outweigh the especially bad futures going away and makes me overall more pessimistic.

These especially good futures were 1) AI capabilities hit a wall before getting to human level and 2) humanity handles AI risk especially competently, e.g., at this stage leading AI labs talk clearly about existential risks in their public communications and make serious efforts to avoid race dynamics, there is more competent public discussion of takeover risk than what we see today including fully cooked regulatory proposals, many people start taking less obvious (non-takeover) AI-related x-risks (like ones Paul mentions in this post) seriously.

[-]Daniel Kokotajlo3y42

Makes sense. I had basically decided by 2021 that those good futures (1) and (2) were very unlikely, so yeah.

[-]nicholashalden3y40

Thank you for the reply. I agree we should try and avoid AI taking over the world.

On "doom through normal means"--I just think there are very plausibly limits to what superintelligence can do. "Persuasion, hacking, and warfare" (appreciate this is not a full version of the argument) don't seem like doom to me. I don't believe something can persuade generals to go to war in a short period of time, just because it's very intelligent. Reminds me of this.

On values--I think there's a conflation between us having ambitious goals, and whatever is actually being optimized by the AI. I am curious to hear what the "galaxy brained reasons" are; my impression was, they are what was outlined (and addressed) in the original post.

[-]metachirality3y52

I don't believe something can persuade generals to go to war in a short period of time, just because it's very intelligent.

A few things I've seen give pretty worrying lower bounds for how persuasive a superintelligence would be:

How it feels to have your mind hacked by an AI
The AI in a box boxes you (content warning: creepy blackmail-y acausal stuff)

Remember that a superintelligence will be at least several orders of magnitude more persuasive than character.ai or Stuart Armstrong.

[-]zrezzed3y30

a superintelligence will be at least several orders of magnitude more persuasive than character.ai or Stuart Armstrong.

Believing this seems central to believing high P(doom).

But I think it's not a coherent enough concept to justify believing it. Yes, some people are far more persuasive than others. But how can you extrapolate that far beyond the distribution we observe in humans? I do think AI will prove to be better than humans at this, and likely much better.

But "much" better isn't the same as "better enough to be effectively treated as magic".

[-]Bezzi3y50

Well, even the tail of the human distribution is pretty scary. A single human with a lot of social skills can become the leader of a whole nation, or even a prophet considered literally a divine being. This has already happened several times in history, even in times where you had to be physically close to people to convince them.

[-]Daniel Kokotajlo3y40

Thanks to you likewise!

On doom through normal means: "Persuasion, hacking, and warfare" aren't by themselves doom, but they can be used to accumulate lots of power, and then that power can be used to cause doom. Imagine a world in which humans are completely economically, militarily, and politically obsolete, thanks to armies of robots directed by superintelligent AIs. Such a world could and would do very nasty things to humans (e.g. let them all starve to death) unless the superintelligent AIs managing everything specifically cared about keeping humans alive and in good living conditions. Because keeping humans alive & in good living conditions would, ex hypothesi, not be instrumentally valuable to the economy, or the military, etc.

How could such a world arise? Well, if we have superintelligent AIs, they can do some hacking, persuasion, and maybe some warfare, and create that world.

How long would this process take? IDK, maybe years? Could be much less. But I wouldn't be surprised if it takes several years, even maybe five years.

I'm not conflating those things. We have ambitious goals and are trying to get our AIs to have ambitious goals -- specifically we are trying to get them to have our ambitious goals. It's not much of a stretch to imagine this going wrong, and them ending up with ambitious goals that are different from ours in various ways (even if somewhat overlapping).

[-]AnthonyC3y40

Remember that persuasion from an ASI doesn't need to look like "text-based chatting with a human." It includes all the tools of communication available. Actually-near-flawless forgeries of any and every form of digital data you could ever ask for, as a baseline, all based on the best possible inferences made from all available real data.

How many people today are regularly persuaded of truly ridiculous things by perfectly normal human-scale-intelligent scammers, cults, conspiracy theorists, marketers, politicians, relatives, preachers, and so on? The average human, even the average IQ 120-150 human, just isn't that resistant to persuasion in favor of untrue claims.

[-]Charlie Steiner3y114

Thanks! It seems like most of your exposure has been through Eliezer? Certainly impressions like "why does everyone think the chance of doom is >90%?" only make sense in that light. Have you seen presentations of AI risk arguments from other people like Rob Miles or Stuart Russell or Holden Karnofsky, and if so do you have different impressions?

[-]Seth Herd3y10

I think the relevant point here is that OP's impressions are from Yudkowsky, and that's evidence that many people's are. Certainly the majority of public reactions I see emphasize Yudkowsky's explanations, and seem to be motivated by his relatively long-winded and contemptuous style.

[-]Shmi3y116

I think it's a very useful perspective. Sadly, the commenters do not seem to engage with your main point, that the presentation of the topic is unpersuasive to an intelligent layperson, and instead focus on specific arguments.

[-]RobertM3y97

the presentation of the topic is unpersuasive to an intelligent layperson

There is, of course, no single presentation, but many presentations given by many people, targeting many different audiences. Could some of those presentations be improved? No doubt.

I agree that the question of how to communicate the problem effectively is difficult and largely unsolved. I disagree with some of the specific prescriptions (i.e. the call to falsely claim more-modest beliefs to make them more palatable for a certain audience), and the object-level arguments are either arguing against things that nobody^[1] thinks are core problems^[2] or are missing the point^[3].

^[1] Approximately.
^[2] Wireheading may or may not end up being a problem, but it's not the thing that kills us. Also, that entire section is sort of confused. Nobody thinks that an AI will deliberately change its own values to be easier to fulfill; goal stability implies the opposite.
^[3] Specific arguments about whether superintelligence will be able to exploit bugs in human cognition or create nanotech (which... I don't see any arguments against, here, except for the contention that nothing was ever invented by a smart person sitting in an armchair, even though of course an AI will not be limited in its ability to experiment in the real world if it needs to) are irrelevant. Broadly speaking, the reason we might expect to lose control to a superintelligent AI is that achieving outcomes in real life is not a game with an optimal solution the way tic tac toe is, and the idea that something more intelligent than us will do better at achieving its goals than other agents in the system should be your default prior, not something that needs to overcome a strong burden of proof.

[-]nicholashalden3y52

It's very strange to me that there isn't a central, accessible "101" version of the argument given how much has been written.

I don't think anyone should make false claims, and this is an uncharitable mischaracterization of what I wrote. I am telling you that, from the outside view, what LW/rationalism gets attention for is the "I am sure we are all going to die", which I don't think is a claim most of its members hold, and this repels the average person because it violates common sense.

The object level responses you gave are so minimal and dismissive that I think they highlight the problem. "You're missing the point, no one thinks that anymore." Responses like this turn discussion into an inside-view only affair. Your status as a LW admin sharpens this point.

[-]RobertM3y60

Yeah, I probably should have explicitly clarified that I wasn't going to be citing my sources there. I agree that the fact that it's costly to do so is a real problem, but as Robert Miles points out, some of the difficulty here is insoluble.

It's very strange to me that there isn't a central, accessible "101" version of the argument given how much has been written.

There are several, in fact; but as I mentioned above, none of them will cover all the bases for all possible audiences (and the last one isn't exactly short, either). Off the top of my head, here are a few:

[-]lukemarks3y10

The focus of the post is not on this fact (at least not in terms of the quantity of written material). I responded to the arguments made because they comprised most of the post, and I disagreed with them.

If the primary point of the post was "The presentation of AI x-risk ideas results in them being unconvincing to laypeople", then I could find reason in responding to this, but other than this general notion, I don't see anything in this post that expressly conveys why (excluding troubles with argumentative rigor, and the best way to respond to this I can think of is by refuting said arguments).

[-]Quinn3y86

I don't have an overarching theory of the Hard Problem of Jargon, but I have some guesses about the sorts of mistakes people love to make. My overarching point is just "things are hard"

Working in finance, you find a lot of unnecessary jargon designed to keep smart laymen out of the discussion. AI risk is many times worse than buyside finance on this front.

This is a deeply rare phenomenon. I do think there are nonzero places with a peculiar mix of prestige and thinness of kayfabe that lead to this actually happening (like if you're maintaining a polite fiction of meritocracy in the face of aggressive nepotism, you might rely on cheap superiority signals to nudge people into not calling BS). In a different way, I remember that when I worked at Home Depot, supervisors may have been protecting their $2/hr pay bump by hiding their responsibilities from their subordinates (to prevent subordinates from figuring out that they could handle actually supervising if the hierarchy was disturbed). Generalizing from these scenarios to scientific disciplines is perfectly silly! Most people, and an even larger majority in the sciences, are extremely excited about thinking clearly and communicating clearly to as many people as possible!

I also want to point out a distinction you may be missing in anti-finance populism. A synthetic CDO is sketchy because it is needlessly complex by its nature, not because the communication strategy was insufficiently optimized! But you wrote about "unnecessary jargon", implying that you think implementing and reasoning about synthetic CDOs is inherently easy, and finance workers are misleading people into thinking it's hard (because of their scarcity mindset, to protect their job security, etc). Jargon is an incredibly weak way to implement anti-finance populism; a stronger form of it says that the instruments and processes themselves are overcomplicated (for shady reasons or whatever).

Moreover, emphasis on jargon complaints implies a destructive worldview. The various degrees and flavors of "there are no hard open problems, people say there are hard open problems to protect their power, me and my friends have all the answers, which were surprisingly easy to find, we'll prove it to you as soon as you give us power" dynamics I've watched over the years seem tightly related, to me.

I actually don’t think that many steps are involved, but the presentation in the articles I’ve read makes it seem as though there is.

I do get frustrated when people tell me that "clear writing" is one thing that definitely exists, because I think they're ignoring tradeoffs. "How many predictable objections should I address, is it 3? 6? does the 'clear writing' protocol tell me to roll a d6?" sort of questions get ignored. To be fair, Arbital was initially developed to be "wikipedia with difficulty levels", which would've made this easier.

TLDR

I think the way people should reason about facing down jargon is to first ask "can I help them improve?" and if you can't then you ask "have they earned my attention?". Literally everywhere in the world, in every discipline, there are separate questions for communication at the state of the art and communication with the public. People calculate which fields they want to learn in detail, because effort is scarce. Saying "it's a problem that learning your field takes effort" makes zero sense.

[-]Mazianni3y*70

Preamble

I've ruminated about this for several days. As an outsider to the field of artificial intelligence (coming from an IT technical space, with an emphasis on telecom and large call centers, which are complex systems where interpretability has long held significant value for the business org), I have my own perspective on this particular (for the sake of brevity) "problem."

What triggered my desire to respond

For my part, I wrote a similarly sized article not for the purposes of posting, but to organize my thoughts. And then I let that sit. (I will not be posting that 2084-word response. Consider this my imitation of Pascal: I dedicated time to making a long response shorter.) However, this is one of the excerpts that I would like to extract from my longer response:

The arbital pages for Orthogonality and Instrumental Convergence are horrifically long.

This stood out to me, so I went to assess:

This article (at the time I counted it) ranked at 2398 words total.
Arbital Orthogonality article ranked at 2246 words total (less than this article.)
Arbital Instrumental Convergence article ranked at 3225 words total (more than this article.)
A random arxiv article I recently read for anecdotal comparison, ranked in at 9534 words (far more than this article.)

Likewise, the author's response to Eliezer's short response stood out to me:

This raises red flags from a man who has written millions of words on the subject, and in the same breath asks why Quintin responded to a shorter-form version of his argument.

These elements provoke me to ask questions like:

Why does a request for brevity from Eliezer provoke concern?
Why does the author not apply their own evaluations on brevity to their article?
Can the author's point be made more succinctly?

These are rhetorical and are not intended to imply an answer, but it might give some sense of why I felt a need to write my own 2k words on the topic in order to organize my thoughts.

Observations

I observe that

Jargon, while potentially exclusive, can also serve as shorthand for brevity.
Presentation improvement seems to be the author's suggestion to combat confirmation bias, belief perseverance and cognitive dissonance. I think the author is talking about boundaries. The YouTube video "There is a good chance this kills everyone" (Machine Learning Street Talk, with Robert Miles) offers what I think is a fantastic analogy for this problem. Someone asks an expert to provide an example of the kind of risk we're talking about, but the example requires numerous assumptions to be made before it has meaning. Then, because the student does not already buy into the assumptions, they straw-man the example by coming up with a "solution" to that problem and ask "Why is it harder than that?" Robert gives a good analogy by saying this is like asking Robert what chess moves would defeat Magnus: for the answer to be meaningful, Robert would need more expertise at chess than Magnus. And when Robert comes up with a move that is not good, even a novice at chess might see a way to counter Robert's move. These are not good engagements in the domain, because they rely upon assumptions that have not been agreed to, so there can be no shorthand.
p(doom) is subjective and lacks systemization/formalization. I intuit that the availability heuristic plays a role. An analogy might be that if someone hears Eliezer express something that sounds like hyperbole, then they assess that their p(doom) must be lower than his. This seems to be the application of confirmation bias to what appears to be a failed appeal to emotion. (i.e., you seem to have appealed to my emotion, but I didn't feel the way you intended for me to feel, therefore I assume that I don't believe the way you believe, therefore I believe your beliefs must be wrong.) I would caution that critics of Eliezer have a tendency to quote his more sensational statements out of context, like quoting his "kinetic strikes on data centers" comment without quoting the full context of the argument. You can find the related Twitter exchange and admissions that his proposal is an extraordinary one.

There may be still other attributes that I did not enumerate (I am trying to stay below 1k words.)^[1]

Axis of compression potential

Which brings me to the idea that the following attributes are at the core of what the author is talking about:

Principle of Economy of Thought - The idea that truth can be expressed succinctly. This argument might also be related to Occam's Razor. There are multiple examples of complex systems that can be described simply but inaccurately, or accurately but not simply. Take the human organism, or the atom. And yet, there is a (I think) valid argument for rendering complex things down to simple, if inaccurate, forms so that they can be more accessible to students of the topic. Regardless of the complexity required, trying to express something in the smallest form has utility. This is a principle I play with, literally daily, at work. However, when I offer an educational analogy, I often feel compelled to qualify that "All analogies have flaws."
An improved sensitivity to boundaries in the less educated seems like a reasonable ask. While I think it is important to recognize that presentation alone may not change the mind of the student, it can still be useful to shape one's presentation to be less objectionable to the boundaries of the student. However, I think it important to remember that shaping an argument to an individual's boundaries is a more time-consuming process, and there is an implied impossibility of shaping every argument to the lowest common denominator. More complex arguments and conversations are required to solve the alignment problem.

Conclusion

I would like to close with, for the reasons the author uttered

I don’t see how we avoid a catastrophe here ...

I concur with this, and this alone puts my personal p(doom) at over 90%.

Do I think there is a solution? Absolutely.
Do I think we're allocating enough effort and resources to finding it? Absolutely not.
Do I think we will find the solution in time? Given the propensity towards apathy, as discussed in the bystander effect, I doubt it.

Discussion (alone) is not problem solving.^[2] It is communication. And while communication is necessary in parallel with solution finding, it is not a replacement for it.

So in conclusion, I generally support finding economic approaches to communication/education that avoid barrier issues, and I generally support promoting tailored communication approaches (which imply and require a large number of non-experts working collaboratively with experts to spread the message that risks exist with AI, and there are steps we can take to avoid risks, and that it is better to take steps before we do something irrevocable.)

But I also generally think that communication alone does not solve the problem. (Hopefully it can influence an investment in other necessary effort domains.)

^[1] I failed. This ranks in at 1240 words, including markdown.
^[2] Discussion is a likely requirement of problem solving, but I meant "non-problem solving" discussion. I am not intentionally equivocating here. (Lots of little edits for typographical errors, and mistakes with markdown.)

[-]Raemon3y57

It'd be helpful to have a short summary of the post on LessWrong so there's a bit more context on whether to click through.

[-]Seth Herd3y30

Thank you! Outside perspectives from someone who's bothered to spend their time looking at the arguments are really useful.

I'm disturbed that the majority of community responses seem defensive in tone. Responding to attempts at constructive criticism with defensiveness is a really bad sign for becoming Less Wrong.

I think the major argument missing from what you've read is that giving an AGI a goal that works for humanity is surprisingly hard. Accurately expressing human goals, let alone as an RL training set, in a way that stays stable long-term once an AGI has (almost inevitably) escaped your control, is really difficult.

But that's on the object level, which isn't the point of your post. I include it as my suggestion for the biggest thing we're leaving out in brief summaries of the arguments.

I think the community at large tends to be really good at alignment logic, and pretty bad at communicating succinctly with the world at large, and we had better correct this or it might get us all killed. Thanks so much for trying to push us in that direction!

[-]AnthonyC3y*30

This was a really good post, and I think accurately reflects a lot of people's viewpoints. Thanks!

Working in finance, you find a lot of unnecessary jargon designed to keep smart laymen out of the discussion.

Most fields, especially technical fields, don't do this. They use jargon because 1) the actual meanings the jargon points to don't have short, precise, natural language equivalents, and 2) if experts did assign such short handles using normal language, the words and phrases used would still be prone to misunderstanding by non-experts because there are wide variations in non-technical usage, plus it would be harder for experts to know when their peers are speaking precisely vs. colloquially. In my own work, I will often be asked a question that I can figure out the overall answer to in 5 minutes, and I can express the answer and how I found it to my colleagues in seconds, but demonstrating it to others regularly takes over a day of effort organizing thoughts and background data and assumptions, and minutes to hours presenting and discussing it. I'm hardly the world's best explainer, but this has been a core part of my job for the past 12 years and I get lots of feedback indicating I'm pretty good at it.

We can argue over the hidden complexity of wishes, but

I think this section greatly underestimates just how much hidden complexity (EY and other high-probability-of-doom-predictors say that) wishes have. It's not so much, "a longer sentence with more caveats would have been fine," but rather more like "the required complexity has never been able to be even close to achieved or precisely described in all the verbal musings and written explorations of axiology/morality/ethics/law/politics/theology/psychology that humanity has ever produced since the dawn of language." That claim may well be wrong, but it's not a small difference of opinion.

If we are talking about an opaque black box, how can you be >90% confident about what it contains?

This is a disagreement over priors, not black boxes. I am much more than 90% certain that the interior of a black hole beyond the event horizon does not consist of a habitable environment full of happy, immortal, well-cared-for puppies eternally enjoying themselves. I am also much more than 90% certain that if I plop a lump of graphite in water and seal it in a time capsule for 30 years, then when I open it, it won't contain diamonds and neatly separated regions of hydrogen and oxygen gas. I'm not claiming anyone has that level of certainty of priors regarding AI x-risk, or even close. But if most possible good outcomes require complex specifications, that means there are orders of magnitude more ways for things to go wrong than right. That's a high bar for what level of caution and control is needed to steer towards good outcomes. Maybe not high enough to get to >90%, but high enough that I'd find it hard to be convinced of <10%. And my bar for saying "sure, let's roll the dice on the entire future light cone of Earth" is way less than 10%.

[-]simon3y*30

We can argue over the hidden complexity of wishes, but it’s very obvious that there’s at least a good chance the populace would survive, so long as humans are the ones giving the AGI its goal.

Quite likely, depending on how you specify goals relating to humans^[1], though it could wind up quite dystopic due to that hidden complexity.

Here, we arrive at the second argument. AGI will understand its own code perfectly, and so be able to “wirehead” by changing whatever its goals are so that they can be maximized to an even greater extent.

I don't think wireheading is a common argument in fact? It doesn't seem like it would be a crux issue on doom probability, anyway. Self-modification on the other hand is very important since it could lead to expansion of capabilities.

It strikes me that it would simply fulfill that goal, and be content.

I think this is a valid point - it will by default carry out something related to the goals it is trained to do^[2], albeit with some mis-specification and mis-generalization or whatever, and I agree that mesa-optimizers are generally overrated. However I don't think the following works to support that point:

If we are talking about an opaque black box, how can you be >90% confident about what it contains?

I think Eliezer would turn that around on you and ask how you are so confident that the opaque black box has goals that fall into the narrow set that would work out for humans, when a vastly larger set would not?

There is another, more important, objection here. So far, we have talked about “tiling the universe” and turning human atoms into GPUs as though that’s easily attainable given enough intelligence. I highly doubt that’s actually true. Creating GPUs is a costly, time-consuming task. Intelligence is not magic. Eliezer writes that he thinks a superintelligence could “hack a human brain” and “bootstrap nanotechnology” relatively quickly. This is an absolutely enormous call and seems very unlikely. You don’t know that human brains can be hacked using VR headsets; it has never been demonstrated that it’s possible and there are common sense reasons to think it’s not. The brain is an immensely complicated, poorly-understood organ. Applying a lot of computing power to that problem is very unlikely to yield total mastery of it by shining light in someone’s eyes. Nanotechnology, which is basically just moving around atoms to create different materials, is another thing that he thinks compute is definitely able to just solve and be able to recombine atoms easily. Probably not. I cannot think of anything that was invented by a very smart person sitting in an armchair considering it. Is it possible that over years of experimentation like anyone else, an AGI could create something amazingly powerful? Yes. Is that going to happen in a short period of time (or aggressively all at once)? Very unlikely. Eliezer says he doesn’t think intelligence is magic, and understands that it can’t violate the laws of physics, but seemingly thinks that anything that humans think might potentially be possible but is way beyond our understanding or capabilities can be solved with a lot of intelligence. This does not fit my model of how useful intelligence is.

Intelligence is not magic? Tell that to the chimpanzees...

We don't really know how much room there is for software-level improvement; if it's large, self-improvement could create far super-human capabilities in existing hardware. And with great intelligence comes great capabilities:

- It will be superhumanly good at persuading humans even if that doesn't lead to exactly "hack a human brain"
- I think at least a substantial minority of humans might side with even an openly misaligned AI if they are convinced it will win, and through higher bandwidth and unified command the AI would be able to coordinate its supporters much better than the opponents can coordinate, and it could actively disrupt or subvert nominally opposing organizations through its agents
- Regarding experimentation, an AI may be able to substitute simulation. Its imagination need not be constrained by a human's meagre working memory
- These are just a few examples that a mere human-level intelligence can think of. A superintelligence will likely have more options that it can think of and I haven't

Moreover, even if these things don't work that way and we get a slow takeoff, that doesn't necessarily save humanity. It just means that it will take a little longer for AI to be the dominant form of intelligence on the planet. That still sets a deadline to adequately solve alignment.

As I said before, I’m very confused about how you get to >90% chance of doom given the complexity of the systems we’re discussing.

As alluded to before, there are more ways for the AI to kill us than not to kill us.

My own doom percentage is lower than this, though not because of any disagreement with >90% doomers that we are headed to (at least dystopian if not extinction) doom if capabilities continue to advance without alignment theory also doing so. I just think the problems are soluble.

III. The way the material I’ve interacted with is presented will dissuade many, probably most, non-rationalist readers

I think that this leads to the conclusion that some 101-level version could be made, and promoted for outreach purposes rather than the more advanced stuff. But that depends on outreach actually occurring - we still need to have the more advanced discussions, and those will provide the default materials if the 101-stuff doesn't exist or isn't known.

Further, I think the whole “>90%” business is overemphasized by the community. It would be more believable if the argument were watered down into, “I don’t see how we avoid a catastrophe here, but there are a lot of unknown unknowns, so let’s say it’s 50 or 60% chance of everyone dying”. This is still a massive call, and I think more in line with what a lot of the community actually believes. The emphasis on certainty-of-doom as opposed to just sounding-the-alarm-on-possible-doom hurts the cause.

Yes, I do think that's more in line with what a lot of the community actually believes, including me. But, I'm not sure why you're saying in that case that "the community" overemphasizes >90%? Do you mean to say, for example, that certain members of the community (e.g. Eliezer) overemphasize >90%, and you think that those members are too prominent, at least from the perspective of outsiders?

I think, yes, perhaps Eliezer could be a better ambassador for the community or it would be better if someone else who would be better in that role took that role more. I don't know if this is a "community" issue though?

^[1]
I think Eliezer might be imagining that everything including goals relating to humans would ultimately be defined in relation to fundamental descriptions of the universe, because Solomonoff or something, and I would think such a definition would lead to certain doom unless unrealistically precise.
But IMO things like human values will have a large influence on AI data such that they should likely naturally abstract them ("grounding" in the input data but not necessarily in fundamental descriptions) so humans can plug in to those abstractions either directly or indirectly. I think it should be possible to safeguard against the AI redefining these abstractions under self-modification in terms that would undermine satisfying the original goals, and in any case I am skeptical that an optimal limited-compute Solomonoff approximator defines everything only in terms of fundamental descriptions at achievable levels of compute. Thus, I agree more with you than my imagining of Eliezer on this point. But maybe I am mis-imagining Eliezer.
A potentially crux-y issue that I also note is that Eliezer, I think, thinks we are stuck with what we get from the initial definition of goals in terms of human values due to consequentialism (in his view) being a stable attractor. I think he is wrong on consequentialism^[3] (about the attractor part, or at least the size of its attractor basin, but the stable part is right) and that self-correcting alignment is feasible.
^[2]
However I do have concerns about agents arising from mostly tool-ish AI, such as:
- takeover of language model by agentic simulacra
- person uses language model's coding capabilities to make bootstrapped agent
- person asks oracle AI what it could do to achieve some effect in the world, and its response includes insufficiently sanitized raw output (such as rewrite of its own code) that achieves that
Note that these are downstream, not upstream, of the AI's fulfilling their intended goals. I'm somewhat less concerned about agents arising upstream or direct unintended agentification at the level of the original goals, but note that agentiness is something that people will be pushing for for capability reasons, and once a self-modifying AI is expressing agentiness at one point in time it will tend to self-modify if it can to follow that objective more consistently.
^[3]
And by consequentialism, I really do mean consequentialism (goals directed at specific world states) and not utility functions, which are often confused with consequentialism in this community. Non-consequentialist utility functions are fine in my view! Note that the VNM theorem has the form (consequentialism (+ rationality) -> utility function) and does not imply consequentialism is rational.

[-]zrezzed3y10

Moreover, even if these things don't work that way and we get a slow takeoff, that doesn't necessarily save humanity. It just means that it will take a little longer for AI to be the dominant form of intelligence on the planet. That still sets a deadline to adequately solve alignment.

If a slow takeoff is all that's possible, doesn't that open up other options for saving humanity besides solving alignment?

I imagine far more humans will agree p(doom) is high if they see AI isn't aligned and it's growing to be the dominant form of intelligence that holds power. In a slow-takeoff, people should be able to realize this is happening, and effect non-alignment based solutions (like bombing compute infrastructure).

[-]tangerine3y1-1

Intelligence is indeed not magic. None of the behaviors that you display that are more intelligent than a chimpanzee’s behaviors are things you have invented. I’m willing to bet that virtually no behavior that you have personally come up with is an improvement. (That’s not an insult, it’s simply par for the course for humans.) In other words, a human is not smarter than a chimpanzee.

The reason humans are able to display more intelligent behavior is because we’ve evolved to sustain cultural evolution, i.e., the mutation and selection of behaviors from one generation to the next. All of the smart things you do are a result of that slow accumulation of behaviors, such as language, counting, etc., that you have been able to simply imitate. So the author’s point stands that you need new information from experiments in order to do something new, including new kinds of persuasion.

[-]lukemarks3y31

I disagree with your objections.

"The first argument–paperclip maximizing–is coherent in that it treats the AGI’s goal as fixed and given by a human (Paperclip Corp, in this case). But if that’s true, alignment is trivial, because the human can just give it a more sensible goal, with some kind of “make as many paperclips as you can without decreasing any human’s existence or quality of life by their own lights”, or better yet something more complicated that gets us to a utopia before any paperclips are made"

This argument is essentially addressed by this post, and has many failure modes. For example, if you specify the superintelligence's goal as the example you gave, its optimal solution might be to cryopreserve the brain of every human in a secure location, and prevent any attempts an outside force could make at interfacing with them. You realize this, and so you specify something like "Make as many squiggles as possible whilst leaving humans in control of their future", and the intelligence is quite smart and quite general, so it can comprehend the notion of what you want when you say "we want control of our future", but then BayAreaAILab#928374 trains a superintelligence designed to produce squiggles without this limit and outcompetes the aligned intelligence, because humans are much less efficient than inscrutable matrices.

This is not even mentioning issues with inner alignment and mesa-optimizers. You start to address this with:

AGI-risk argument responds by saying, well, paperclip-maximizing is just a toy thought experiment for people to understand. In fact, the inscrutable matrices will be maximizing a reward function, and you have no idea what that actually is, it might be some mesa-optimizer

But I don't feel as though your reference to Eliezer's Twitter loss drop fiasco and subsequent argument regarding GPU maximization successfully refutes claims regarding mesa-optimization. Even if GPU-maximizing mesa-optimization was intractable, what about the other potentially infinite number of possible mesa-optimizer configurations that result?

You don’t know that human brains can be hacked using VR headsets; it has never been demonstrated that it’s possible and there are common sense reasons to think it’s not. The brain is an immensely complicated, poorly-understood organ. Applying a lot of computing power to that problem is very unlikely to yield total mastery of it by shining light in someone’s eyes

When Eliezer talks about 'brain hacking' I do not believe he means by dint of a virtual reality headset. Psychological manipulation is an incredibly powerful tool, and who else could manipulate humanity if not a superintelligence? Furthermore, said intelligence may model humans via simulating strategies, which that post argues is likely, assuming large capability gaps between humanity and a hypothetical superintelligence.

As I said before, I’m very confused about how you get to >90% chance of doom given the complexity of the systems we’re discussing

The analogy of "forecasting the temperature of the coffee in 5 minutes" VS "forecasting that, if left, the coffee will get cold at some point" seems relevant here. Without making claims about the intricacies of the future state of a complex system, you can make high-reliability inferences about its future trajectory in more general terms. This is how I see AI x-risk claims. If the claim was that there was a 90% chance that a superintelligence will render humanity extinct and it will have some architecture x, I would agree with you, but I feel as though Eliezer's forecast is general enough to be reliable.

[-]nicholashalden3y40

Thanks for your reply. I welcome an object-level discussion, and appreciate people reading my thoughts and showing me where they think I went wrong.

1. The hidden complexity of wishes stuff is not persuasive to me in the context of an argument that AI will literally kill everyone. If we wish for it not to, there might be some problems with the outcome, but it won't kill everyone. In terms of Bay Area Lab 9324 doing something stupid, I think by the time thousands of labs are doing this, if we have been able to successfully wish for stuff without catastrophe being triggered, it will be relatively easy to wish for universal controls on the wishing technology.
2. "Infinite number of possible mesa-optimizers". This feels like just invoking an unknown unknown to me, and then asserting that we're all going to die, and feels like it's missing some steps.
3. You're wrong about Eliezer's assertions about hacking; he 100% does believe by dint of a VR headset. I quote: "- Hack a human brain - in the sense of getting the human to carry out any desired course of action, say - given a full neural wiring diagram of that human brain, and full A/V I/O with the human (eg high-resolution VR headset), unsupervised and unimpeded, over the course of a day: DEFINITE YES - Hack a human, given a week of video footage of the human in its natural environment; plus an hour of A/V exposure with the human, unsupervised and unimpeded: YES "
4. I get the analogy of all roads leading to doom, but it's just very obviously not like that, because it depends on complex systems that are very hard to understand, and AI x-risk proponents are some of the biggest advocates of that opacity.

[-]lukemarks3y41

Soft upvoted your reply, but have some objections. I will respond using the same numbering system you did such that point 1 in my reply will address point 1 of yours.

1. I agree with this in the context of short-term extinction (i.e. at or near the deployment of AGI), but would offer that an inability to remain competitive and loss of control is still likely to end in extinction, but in a less cinematic and instantaneous way. In accordance with this, the potential horizon for extinction-contributing outcomes is expanded massively. Although Yudkowsky is most renowned for hard takeoff, soft takeoff has a very differently shaped extinction-space and (I would assume) is a partial reason for his high doom estimate. Although I cannot know this for sure, I would imagine he has a >1% credence in soft takeoff. 'Problems with the outcome' seem highly likely to extend to extinction given time.
2. There are (probably) an infinite number of possible mesa-optimizers. I don't see any reason to assume an upper bound on potential mesa-optimization configurations, and yes; this is not a 'slam dunk' argument. Rather, as derived from the notion that even slightly imperfect outcomes can extend to extinction, I was suggesting that you are trying to search an infinite space for a quark that fell out of your pocket some unknown amount of time ago whilst you were exploring said space. This can be summed up as 'it is not probable that some mesa-optimizer selected by gradient descent will ensure a Good Outcome'.
3. This still does not mean that the only form of brain hacking is via highly immersive virtual reality. I recall the Tweet that this comment came from, and I interpreted it as a highly extreme and difficult form of brain hacking used to prove a point (the point being that if ASI could accomplish this it could easily accomplish psychological manipulation). Eliezer's breaking out of the sandbox experiments circa 2010 (I believe?) are a good example of this.
4. Alternatively you can claim some semi-arbitrary but lower extinction risk like 35%, but you can make the same objections to a more mild forecast like that. Why is assigning a 35% probability to an outcome more epistemologically valid than a >90% probability? Criticizing forecasts based on their magnitude seems difficult to justify in my opinion, and critiques should rely on argument only.

[-]Seth Herd3y10

I disagree with OP's objections, too, but that's explicitly not the point of this post. OP is giving us an outside take on how our communication is working, and that's extremely valuable.

Typically, when someone says you're not convincing them, "you're being dumb" is itself a dumb response. If you want to convince someone of something, making the arguments clear is mostly your responsibility.

[-]Seth Herd2y20

This is highly useful. Thank you so much for taking the time to write it!

It's not worth debating the points you raise, since the point is you explaining to us where the explanation went wrong for you. That didn't stop many people from doing it, of course :)

I agree strongly with your points about the communication style. It's not possible to address every objection in a short piece, but it is possible to put forth the basic argument in clear and simple terms. I think the type of person who's interested in AI safety typically isn't focused on communicating with laypeople. And we need to get better.

[-]Keenan Pepper3y10

Have you read https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message yet?

I had some similar thoughts to yours before reading that, but it helped me make a large update in favor of superintelligence being able to make magical-seeming feats of deduction. If a large number of smart humans working together for a long time can figure something out (without performing experiments or getting frequent updates of relevant sensory information), then a true superintelligence will also be able to.

[-]kolmplex3y10

I've got some object-level thoughts on Section 1.

With a model of AGI-as-very-complicated-regression, there is an upper bound of how fulfilled it can actually be. It strikes me that it would simply fulfill that goal, and be content.

It'd still need to do risk mitigation, which would likely entail some very high-impact power seeking behavior. There are lots of ways things could go wrong even if its preferences saturate.

For example, it'd need to secure against the power grid going out, long-term disrepair, getting nuked, etc.

To argue that an AI might change its goals, you need to develop a theory of what’s driving those changes–something like, AI wants more utils–and probably need something like sentience, which is way outside the scope of these arguments.

The AI doesn't need to change or even fully understand its own goals. No matter what its goals are, high-impact power seeking behavior will be the default due to needs like risk mitigation.

But if that’s true, alignment is trivial, because the human can just give it a more sensible goal, with some kind of “make as many paperclips as you can without decreasing any human’s existence or quality of life by their own lights”, or better yet something more complicated that gets us to a utopia before any paperclips are made

Figuring out sensible goals is only part of the problem, and the other parts of the problem are sufficient for alignment to be really hard.

In addition to the inner/outer alignment stuff, there is what John Wentworth calls the pointers problem. In his words: "I need some way to say what the values-relevant pieces of my world model are "pointing to" in the real world".

In other words, all high-level goal specifications need to bottom out in talking about the physical world. That is... very hard and modern philosophy still struggles with it. Not only that, it all needs to be solved in the specific context of a particular AI's sensory suite (or something like that).

As a side note, the original version of the paperclip maximizer, as formulated by Eliezer, was partially an intuition pump about the pointers problem. The universe wasn't tiled by normal paperclips, it was tiled by some degenerate physical realization of the conceptual category we call "paperclips" e.g. maybe a tiny strand of atoms that kinda topologically maps to a paperclip.

Intelligence is not magic.

Agreed. Removing all/most constraints on expected futures is the classic sign of the worst kind of belief. Unfortunately, figuring out the constraints left after contending with superintelligence is so hard that it's easier to just give up. Which can, and does, lead to magical thinking.

There are lots of different intuitions about what intelligence can do in the limit. A typical LessWrong-style intuition is something like 10 billion broad-spectrum geniuses running at 1000000x speed. It feels like a losing game to bet against billions of Einsteins+Machiavellis+(insert highly skilled person) working for millions of years.

Additionally, LessWrong people (myself included) often implicitly think of intelligence as systemized winning, rather than IQ or whatever. I think that is a better framing, but it's not the typical definition of intelligence. Yet another disconnect.

However, this is all intuition about what intelligence could do, not what a fledgling AGI will probably be capable of. This distinction is often lost during Twitter-discourse.

In my opinion, a more generally palatable thought experiment about the capability of AGI is:

What could a million perfectly-coordinated, tireless copies of a pretty smart, broadly skilled person running at 100x speed do in a couple years?

Well... enough. Maybe the crazy sounding nanotech, brain-hacking stuff is the most likely scenario, but more mundane situations can still carry many of the arguments through.

[-]zrezzed3y10

What could a million perfectly-coordinated, tireless copies of a pretty smart, broadly skilled person running at 100x speed do in a couple years?

This feels like the right analogy to consider.

And in considering this thought experiment, I'm not sure trying to solve alignment is the only/best way to reduce risks. This hypothetical seems open to reducing risk by 1) better understanding how to detect these actors operating at large scale and 2) researching resilient plug-pulling strategies.

[-]kolmplex3y10

I think both of those things are worth looking into (for the sake of covering all our bases), but by the time alarm bells go off it's already too late.

It's a bit like a computer virus. Even after Stuxnet became public knowledge, it wasn't possible to just turn it off. And unlike Stuxnet, AI-in-the-wild could easily adapt to ongoing changes.
