Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

These are some selection effects impacting what ideas people tend to get exposed to and what they'll end up believing, in ways that make the overall epistemics worse. These have mostly occurred to me about AI discourse (alignment research, governance, etc.), mostly on LessWrong. (They might not be exclusive to discourse on AI risk.)

(EDIT: I've reordered the sections in this post so that fewer people get stuck on what was the first section and so they have a better chance of reading the other two sections.)

Outside-view is overrated

In AI discourse, outside-view (basing one's opinion on other people's and on (things that seem like) precedents), as opposed to inside-view (having an actual gears-level understanding of how things work), is being quite overrated for a variety of reasons.

  • There's the issue of outside-view double-counting, as in this comic I drew. When building an outside-view, people don't particularly check whether 10 people say the same thing because they each came up with it independently, or because 9 of them heard it from the 1 person who came up with it, and they themselves mostly stuck to outside-view.
  • I suspect that outside-view is being over-valued because it feels safe — if you just follow what you believe to be consensus and/or an authority, then it can feel less like it's "your fault" if you're wrong. You can't really just rely on someone else's opinion on something, because they might be wrong, and to know if they're wrong you need an inside-view yourself. And there's a fundamental sense in which developing your own inside-view of AI risk is contributing to the research, whereas just reusing what exists is neutral, and {reusing what exists + amplifying it based on what has status or memetic virulence} is doing damage to the epistemic commons, due to things like outside-view double-counting.
  • There's occasionally a tendency to try to adopt the positions that are held by authority figures/organizations in order to appeal to them, to get resources/status, and/or generally to fit in. (Similarly, be wary of the opposite as well: having a wacky opinion in order to get quirkiness/interestingness points.)
  • "Precedents"-based ideas are pretty limited — there isn't much that looks similar to {us building things that are smarter than us and as-flexible-as-software}. The comparison with {humans as mesa-optimizers relative to evolution} has been taken way outside of its epistemic range.
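
To make the double-counting point concrete, here is a minimal sketch in odds form, under assumptions that are mine and purely illustrative: a 50% prior, and each genuinely independent report being twice as likely if the claim is true than if it is false. The function name and both numbers are hypothetical; only the shape of the gap matters.

```python
def posterior(prior, likelihood_ratio, n_independent_reports):
    """Posterior probability of a claim after n reports, each treated as
    independent evidence with the same likelihood ratio."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio ** n_independent_reports
    return posterior_odds / (1 + posterior_odds)

PRIOR = 0.5  # hypothetical prior on the claim
LR = 2.0     # hypothetical likelihood ratio of one independent report

# Naive outside-view: all 10 voices are counted as independent evidence.
naive = posterior(PRIOR, LR, 10)

# Corrected: 9 of the 10 only repeated the 1 person who came up with it,
# so there is really just 1 independent report.
corrected = posterior(PRIOR, LR, 1)

print(f"naive: {naive:.3f}")          # naive: 0.999
print(f"corrected: {corrected:.3f}")  # corrected: 0.667
```

Under these made-up numbers, treating repeated voices as independent takes you from about 67% to about 99.9% confidence without any new evidence having been produced.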

Arguments about P(doom) are filtered for nonhazardousness

Some of the best arguments for high P(doom) / short timelines that someone could make would look like this:

It's not that hard to build an AI that kills everyone: you just need to solve [some problems] and combine the solutions. Considering how easy it is compared to what you thought, you should increase your P(doom) / shorten your timelines.

But obviously, if people had arguments of this shape, they wouldn't mention them, because they make it easier for someone to build an AI that kills everyone. This is great! Carefulness about exfohazards is better than the alternative here.

But people who strongly rely on outside-view for their P(doom) / timelines should be aware that their arguments are being filtered for nonhazardousness. Note that this plausibly applies to other topics than P(doom) / timelines.

Note that beyond not-being-mentioned, such arguments are also anthropically filtered against: in worlds where such arguments have been out there for longer, we died a lot quicker, so we're not there to observe those arguments having been made.
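
The not-being-mentioned filter in this section can be sketched as a toy simulation. Everything in it is an assumption made for illustration: arguments get an evidence value drawn uniformly from [-1, 1] (positive pushes toward higher P(doom)), and an argument pushing toward higher P(doom) is withheld with probability equal to its evidence value, standing in for "the stronger the it's-easy-to-build argument, the more hazardous it is to publish". The function and variable names are hypothetical.

```python
import random

def simulate(n_arguments=100_000, seed=0):
    """Compare the average evidence of all arguments with the average
    evidence of the arguments that actually get published."""
    rng = random.Random(seed)
    all_evidence = []
    published = []
    for _ in range(n_arguments):
        # Evidence value: > 0 pushes toward higher P(doom), < 0 toward lower.
        evidence = rng.uniform(-1.0, 1.0)
        all_evidence.append(evidence)
        # Hypothetical hazard model: only pro-doom arguments are hazardous,
        # and stronger ones are more likely to be withheld.
        p_withheld = max(0.0, evidence)
        if rng.random() >= p_withheld:
            published.append(evidence)
    true_mean = sum(all_evidence) / len(all_evidence)
    published_mean = sum(published) / len(published)
    return true_mean, published_mean

true_mean, published_mean = simulate()
print(f"mean evidence over all arguments:       {true_mean:+.3f}")
print(f"mean evidence over published arguments: {published_mean:+.3f}")
```

In this toy model the full set of arguments is balanced around zero, while the published set averages clearly below zero (about -0.2 analytically), so someone forming an outside-view from published arguments alone would be biased toward lower P(doom). The anthropic filter is not modeled here.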

Confusion about the problem often leads to useless research

People enter AI risk discourse with various confusions, such as:

  • What are human values?
  • Aligned to whom?
  • What does it mean for something to be an optimizer?
  • Okay, unaligned ASI would kill everyone, but how?
  • What about multipolar scenarios?
  • What counts as AGI, and when do we achieve that?

Those questions about the problem do not particularly need fancy research to be resolved; they're either already solved or there's a good reason why thinking about them is not useful to the solution. For these examples:

These answers (or reasons-why-answering-is-not-useful) usually make sense if you're familiar with rationality and alignment, but some people are still missing a lot of the basics of rationality and alignment, and by repeatedly voicing these confusions they cause people to think that those confusions are relevant and should be researched, causing lots of wasted time.

It should also be noted that some things are correct to be confused about. If you're researching a correlation or concept-generalization which doesn't actually exist in the territory, you're bound to get pretty confused! If you notice you're confused, ask yourself whether the question is even coherent/true, and ask yourself whether figuring it out helps save the world.

22 comments

We don't need to figure out this problem, we can just implement CEV without ever having a good model of what "human values" are.

Why would you think that the CEV even exists?

Humans aren't all required to converge to the same volition, there's no particularly defensible way of resolving any real differences, and even finding any given person's individual volition may be arbitrarily path-dependent.

The vast majority of the utility you have to gain is from {getting a utopia rather than everyone-dying-forever}, rather than {making sure you get the right utopia}.

Whether something is a utopia or a dystopia is a matter of opinion. Some people's "utopias" may be worse than death from other people's point of view.

In fact I can name a lot of candidates whose utopias might be pretty damned ugly from my point of view. So many that it's entirely possible that if you used a majoritarian method to find the "CEV", the only thing that would prevent a dystopia would be that there are so many competing dystopic "utopias" that none of them would get a majority.

Expected utility maximization seems to fully cover this. More general models aren't particularly useful to saving the world.

Most actually implementable agents probably don't have coherent utility functions, and/or have utility functions that can't be computed even approximately over a complete state-of-the-world. And even if you can compute your utility over a single state-of-the-world, that doesn't imply that you can do anything remotely close to computing a course of action that will maximize it.

Humans aren't all required to converge to the same volition, there's no particularly defensible way of resolving any real differences

CEV-ing just one person is enough for the "basic challenge" of alignment as described on AGI Ruin.

and even finding any given person's individual volition may be arbitrarily path-dependent

Sure, but it doesn't need to be path-independent, it just needs to have pretty good expected value over possible paths.

Whether something is a utopia or a dystopia is a matter of opinion. Some people's "utopias" may be worse than death from other people's point of view.

That's fair (though given the current distribution of people likely to launch the AI, I'm somewhat optimistic that we won't get such a dystopia) — but the people getting confused about that question aren't asking it because they have such concerns, they're usually (in my experience) asking it because they're confused way upstream of that, and if they were aware of the circumstances they'd be more likely to focus on solving alignment than on asking "aligned to whom". I agree that the question makes sense, I just think the people asking it wouldn't endorse-under-reflection focusing on that question in particular if they were aware of the circumstances. Maybe see also this post.

Most actually implementable agents probably don't have coherent utility functions […]

I think the kind of AI likely to take over the world can be described closely enough in such a way. Certainly for the kind of aligned AI that saves the world, it seems likely to me that expected utility is sufficient to think about how it thinks about its impact on the world.

That's fair (though given the current distribution of people likely to launch the AI, I'm somewhat optimistic that we won't get such a dystopia) — but the people getting confused about that question aren't asking it because they have such concerns, they're usually (in my experience) asking it because they're confused way upstream of that

I disagree. I think they're concerned about the right thing for the right reasons, and the attempt to swap-in a different (if legitimate, and arguably more important) problem instead of addressing their concerns is where a lot of communication breaks down.

I mean, yes, there is the issue that it doesn't matter which monkey finds the radioactive banana and drags it home, because that's going to irradiate the whole tribe anyway. Many people don't get it, and this confusion is important to point out and resolve.

But once it is resolved, the "but which monkey" question returns. Yes, currently AGI is unalignable. But since we want to align it anyway, and we're proposing ways to make that happen, what's our plan for that step? Who's doing the aligning, what are they putting in the utility function, and why would that not be an eternal-dystopia hellscape which you'd rather burn down the world attempting to prevent than let happen?

They see a powerful technology on the horizon, and see people hyping it up as something world-changingly powerful. They're immediately concerned regarding how it'll be used. That there's an intermediary step missing – that we're not actually on-track to build the powerful technology, we're only on-track to create a world-ending explosion – doesn't invalidate the question of how that technology will be used if we could get back on-track to building it.

And if that concern gets repeatedly furiously dismissed in favour of "but we can't even build it, we need to do [whatever] to build it", that makes the other side feel unheard. And regardless of how effectively you argue the side of "the current banana-search strategies would only locate radioactive bananas" and "we need to prioritize avoiding radiation", they're going to stop listening in turn.

Okay yeah this is a pretty fair response actually. I think I still disagree with the core point (that AI aligned to current people-likely-to-get-AI-aligned-to-them would be extremely bad) but I definitely see where you're coming from.

Do you actually believe extinction is preferable to rolling the dice on the expected utility (according to your own values) of what happens if one of the current AI org people launches AI aligned to themself?

Even if, in worlds where we get an AI aligned to a set of values that you would like, that AI then acausally pays AI-aligned-to-the-"wrong"-values in different timelines to not run suffering? e.g. Bob's AI runs a bunch of things Alice would like in Bob's AI's timelines, in exchange for Alice's AI not running things Bob would very strongly dislike.

... I've not actually evaluated this question much, because I am more concerned about the "all the bananas we're on track to find are gonna be radioactive, it doesn't matter who finds one" thing.

Inasmuch as I am concerned about the banana's owners, my concerns lie in worlds in which we raise the awareness of AGI Ruin enough to get governments' attention to it, and to get them to somehow minimize race dynamics, but then fail to sell them on the whole "universal utopia" thing, and end up in some xenophobic/kafkaesque dystopia from which even death isn't escape. I frown at sociopolitical strategies that entirely fail to even consider that; it seems like a very EA-brand style of naivete. But it doesn't really seem like a particularly likely failure mode, at this point.

Here, I'm mostly trying to illuminate the perspectives of people who are starting from the place of "but who gets the banana?", and the places the AGI-ruin advocates often seem to fail at communicating with them. (And what the policies that would sound convincing to them might look like.)

Do you actually believe extinction is preferable to rolling the dice on the expected utility (according to your own values) of what happens if one of the current AI org people launches AI aligned to themself?

I guess no, on balance, taking the acausal-negotiation part into account. (I'd need to actually do some toy math to figure out the definitive answer.)

That said, I'm quite concerned about certain people's notorious power-hunger tendencies. Such preferences can shake out into a utopia containing vast masses of people to lord over, which would be pretty hellish. But I guess the worst excesses of that would be something they could be acausally negotiated out of, such that life in such a world would still be preferable to non-existence.

(Yes, I'm given to understand there's been a lot of pro-humanity messaging from all current major-AI-lab leaders. But... Well, we all know how treacherous turns work, so that's all ~zero evidence. And my priors on the interiors of such people aren't optimistic. That's the other reason I haven't researched the matter: I don't expect to update in a positive direction no matter what I find.

I guess if Sam Altman, unprompted and prior to the recent surge of AI hype, wrote a whole book on ethics and meta-ethics in which he expressed strong specific detailed preferences for a utopia full of human flourishing, and acknowledged the human pitfall of value-drifting towards selfishness over the course of power-pursuit and the importance of preserving one's core values despite it, do link that.

But any signal less costly and hard-to-fake than that? Well, it would be something that I would know to fake in their place, just as a matter of course.)

I think the kind of AI likely to take over the world can be described closely enough in such a way. Certainly for the kind of aligned AI that saves the world, it seems likely to me that expected utility is sufficient to think about how it thinks about its impact on the world.

What observations are backing this belief? Have you seen approaches that share some key characteristics with expected utility maximization approaches which have worked in real-world situations, and where you expect that the characteristics that made it work in the situation you observed will transfer? If so, would you be willing to elaborate?

On the flip side, are there any observations you could make in the future that would convince you that expected utility maximization will not be a good model to describe the kind of AI likely to take over the world?

CEV-ing just one person is enough for the "basic challenge" of alignment as described on AGI Ruin.

I thought the "C" in CEV stood for "coherent" in the sense that it had been reconciled over all people (or over whatever set of preference-possessing entities you were taking into account). Otherwise wouldn't it just be "EV"?

I think the kind of AI likely to take over the world can be described closely enough in such a way.

So are you saying that it would literally have an internal function that represented "how good" it thought every possible state of the world was, and then solve an (approximate) optimization problem directly in terms of maximizing that function? That doesn't seem to me like a problem you could solve even with a Jupiter brain and perfect software.

I thought the "C" in CEV stood for "coherent" in the sense that it had been reconciled over all people (or over whatever set of preference-possessing entities you were taking into account). Otherwise wouldn't it just be "EV"?

I mean I guess, sure, if "CEV" means over-all-people then I just mean "EV" here.
Just "EV" is enough for the "basic challenge" of alignment as described on AGI Ruin.

So are you saying that it would literally have an internal function that represented "how good" it thought every possible state of the world was, and then solve an (approximate) optimization problem directly in terms of maximizing that function?

Or do something which has approximately that effect.

That doesn't seem to me like a problem you could solve even with a Jupiter brain and perfect software.

I disagree! I think some humans right now (notably people particularly focused on alignment) already do something vaguely EUmax-shaped, and definitely an ASI capable of running on current compute would be able to do something more EUmax-shaped. Very, very far from actual "pure" EUmax of course; but way sufficient to defeat all humans, who are much further away from pure EUmax. Maybe see also this comment of mine.

we can just implement CEV

So we have a detailed, mechanistic understanding of CEV now?

We have at least one prototype for a fully formalized implementation of CEV, yes: mine.

(I'd argue that for the correct answer to "what are values?" to be "just do CEV", we shouldn't need a specific plan for CEV; we should just need good confidence that something like CEV can be implemented.)

It's not that hard to build an AI that kills everyone: you just need to solve [some problems] and combine the solutions. Considering how easy it is compared to what you thought, you should increase your P(doom) / shorten your timelines.

It's not that hard to build an AI that saves everyone: you just need to solve [some problems] and combine the solutions. Considering how easy it is compared to what you thought, you should decrease your P(doom) / shorten your timelines.

They do a value-handshake and kill everyone together.

Any two AIs unaligned with humanity are very unlikely to be also aligned with each other, and would have no especial reason to coordinate with other unaligned AIs over humans (or uploads, or partially aligned neuromorphic AGIs, etc)

The vast majority of the utility you have to gain is from {getting a utopia rather than everyone-dying-forever}, rather than {making sure you get the right utopia}.

Whose utopia? For example: some people's utopia may be one where they create an immense quantity of descendants of various forms and allocate resources to them rather than to existing humans. I also suspect that Hamas's utopia is not mine. This idea that the distribution of future scenarios is bimodal around "we all die" or "perfect egalitarian utopia" is dangerously oversimplistic.

It's not that hard to build an AI that saves everyone: you just need to solve [some problems] and combine the solutions. Considering how easy it is compared to what you thought, you should decrease your P(doom) / shorten your timelines.

I'm not sure what you're saying here exactly. It seems to me like you're pointing to a symmetric argument favoring low doom, but if someone had an idea for how to do AI alignment right, why wouldn't they just talk about it? Doesn't seem symmetrical to me.

It's easy to say something is "not that hard", but ridiculous to claim that when the something is "build an AI that takes over the world". The hard part is building something more intelligent/capable than humanity, not anything else conditioned on that first step.

I don't see why this would be ridiculous. To me, e.g. "Superintelligence only requires [hacky change to current public SOTA] to achieve with expected 2025 hardware, and OpenAI may or may not have realised that already" seems like a perfectly coherent way the world could be, and is plenty of reason for anyone who suspects such a thing to keep their mouth shut about gears-level models of [] that might be relevant for judging how hard and mysterious the remaining obstacles to superintelligence actually are.

If it only requires a simple hack to existing public SOTA, many others will have already thought of said hack and you won't have any additional edge. Taboo superintelligence and think through more specifically what is actually required to outcompete the rest of the world.

Progress in DL is completely smooth, as it is driven mostly by hardware and an enormous number of compute-dependent small innovations (yes, transformers were a small innovation on top of contemporary alternatives such as memory networks, NTMs, etc., and quite predictable in advance).

If it only requires a simple hack to existing public SOTA, many others will have already thought of said hack and you won't have any additional edge.

I don't recall assuming the edge to be unique? That seems like an unneeded condition for Tamsin's argument; it's enough to believe the field consensus isn't completely efficient by default, with all relevant actors sure of all currently deducible edges at all times.



Progress in DL is completely smooth.

Right, if you think it's completely smooth and thus basically not meaningfully influenced by the actions of individual researchers whatsoever, I see why you would not buy Tamsin's argument here. But then the reason you don't buy it would seem to me to be that you think meaningful new ideas in ML capability research basically don't exist, not because you think there is some symmetric argument to Tamsin's for people to stay quiet about new alignment research ideas.

If you think all AGIs will coordinate with each other, nobody needs an edge. If you think humans will build lots of AI systems, many technically unable to coordinate with each other (from mechanisms similar to firewalls/myopia/sparsity) then the world takeover requires an edge. An edge such that the (coalition of hostile AIs working together) wins the war against (humans plus their AIs).

This can get interesting if you think there might be diminishing returns to intelligence, which could mean that the (humans + their AI) faction might have a large advantage if the humans start with far more resources, as they do now.

Excellent post, big upvote.

We need more discussion of the cognitive biases affecting the alignment debate. The eye needs to see its own flaws in order to make progress fast enough to survive AGI.

It's curious that I disagree with you on many particulars, but agree with you on the main important points.

Outside-view is overrated

Agreed on the main logic: it's tough to guess how many opinions are just duplicates. Safer? Maybe, but we also overvalue "having your own opinion" even if it's not informed and well-thought-out.

Arguments about P(doom) are filtered for nonhazardousness

Agreed, and it's important. But I think the cause is mostly different: most people with a low estimate are succumbing to motivated reasoning. They very much want to believe they and everything they love won't die. So any alternative feels better, and they wind up believing it.

Confusion about the problem often leads to useless research

Very much agreed. This is also true of every other scientific and engineering field I know about. People like doing research more than they like figuring out what research is important. To your specific points:

What are human values?

Doesn't matter. The only human values that matter are the ones of the people in charge of the first ASI.

Aligned to whom?

I think this matters an awful lot, because there are sadists and sociopaths in positions of power. As long as the team in charge of ASI has a positive empathy - sadism balance, we'll like their utopia just fine, at least after they get time to think it through.

What does it mean for something to be an optimizer?

Optimization doesn't matter. What matters is pursuing goals more competently than humans can.

Okay, unaligned ASI would kill everyone, but how?

Agreed that it doesn't matter. You can keep a smarter thing contained or controlled for a while, but eventually it will outsmart you and do whatever it wants.

What about multipolar scenarios?

Disagree. Value handshakes aren't practically possible. They go to war and you die as collateral damage. More importantly, it doesn't matter which of those is right. Multipolar scenarios are worse, not better.

What counts as AGI, and when do we achieve that?

What counts is the thing that can outsmart you. When we achieve it does matter, but only a little for time to plan and prepare. How we achieve that matters a lot, because technical alignment plans are specific to AGI designs.

Thanks for the post on an important topic!

Optimization doesn't matter. What matters is pursuing goals more competently than humans can.

Can you say more about what causes you to believe this?

If the AGI winds up just wanting a lot of paperclips, and also wanting a lot of stamps, it may not have a way to decide exactly how much of each to go for, or when "a lot" has been achieved. But there's a common obstacle to those goals: humans saying "hell no" and shutting it down when it starts setting up massive paperclip factories. Therefore it has a new subgoal: prevent humans from interfering with it. That probably involves taking over the world or destroying humanity.

If the goal is strictly bounded at some millions of tons of paperclips and stamps, then negotiating with humanity might make more sense. But if the vague goal is large, it implies all of the usual dangers to humanity because they're incompatible with how we want to use the earth and whether we want an ASI with strange goals making paperclips on the moon.

Note that beyond not-being-mentioned, such arguments are also anthropically filtered against: in worlds where such arguments have been out there for longer, we died a lot quicker, so we’re not there to observe those arguments having been made.

This anthropic analysis doesn't take into account past observers (see this post).

My current belief is that you do make some update upon observing that you exist; you just don't update as much as if we were somehow able to survive and observe unaligned AI taking over. I do agree that "no update at all, because you can't see the counterfactual" is wrong, but anthropics is still somewhat filtering your evidence; you should update less.

(I don't have my full reasoning for {why I came to this conclusion} fully loaded rn, but I could probably do so if needed. Also, I only skimmed your post, sorry. I have a post on updating under anthropics with actual math I'm working on, but unsure when I'll get around to finishing it.)