All of Andrew_Critch's Comments + Replies

Encultured AI, Part 1: Enabling New Benchmarks

> First, great news on founding an alignment organization on your own.

Actually I founded it with my cofounder, Nick Hay!
https://www.encultured.ai/#team

«Boundaries», Part 2: trends in EA's handling of boundaries

Thanks for this prediction; by default it's correct!  But, my mind could be changed on this.  Did you voice this as:
(a) a disappointment (e.g., "Ah, too bad Critch isn't trying to address pivotal acts here"),
(b) an expectation of what I should be aiming for (e.g., "Probably it's too much to write a sequence about boundaries that's also trying to address pivotal acts."), or
(c) neither / just a value-neutral prediction?

2Raemon2d
Oh, I meant: I predict you will write a post comparing boundaries and pivotal acts, but it will be unconvincing to me that boundaries are a particularly key term in how to think about pivotal acts.
«Boundaries», Part 2: trends in EA's handling of boundaries

... From your perspective, which section do you think would be most interesting to do the deep dive on?  (I.e. the top of the "T"?)

(Interested to hear from others as well.)

3Martín Soto3d
I'd like to hear more about how the boundaries framework can be applied to Resistance from AI Labs to yield new insights or at least a more convenient framework. More concretely, I'm not exactly sure which boundaries you refer to here: my main issue is that, for now, I agree with Noosphere89 [https://www.lesswrong.com/users/sharmake-farah]'s comment: the main reason is just commonsense "not willing to sacrifice profit". And this can certainly be conceptualized as "not willing to cross certain boundaries" (overstepping the objectives of a usual business, reallocating boundaries of internal organization, etc.), but I don't see how these shed any more light than the commonsense considerations already do. To be clear, I know you discuss this in more depth in your posts on pivotal acts / processes, but I'm curious as to how explicitly applying the boundaries framework could clarify things.
2Raemon3d
Short answer: maybe Work/Life balance? Probably a reasonable combination of "meaty/relevant, without being too controversial an example". Longer answer: I'm not actually sure. While thinking about the answer, I notice part of the thing is that Expansive Thinking, Niche Finding and Work-Life balance each introduce somewhat different frames, and maybe another issue was I got sort of frame-fatigued by the time I got to Work/Life/Balance.
«Boundaries», Part 2: trends in EA's handling of boundaries

Thanks for the suggestion!  Maybe I'll make a T-shaped version someday :)

Unifying Bargaining Notions (1/2)

Thanks for signal-boosting the coco solution!  I'm actually planning a workshop near CMU this fall with Adam Kalai :)

What's next for instrumental rationality?

Re: Jeffrey's decision theory, it's not multi-agent, which is a huge limitation.  Otherwise I'd agree with you.

Re: game theory, you're right that it's been around for a while, but it's more "tried and false" than "tried and true".  Basically, people in geopolitics (both the study, and the activity) know by now that Nash equilibria and even correlated equilibria are not good models of how powerful entities interact, and psychologists know they're not good models of how individuals interact.  (In fact, as early as 1967, Aumann, Harsanyi,  ... (read more)

1cubefox12d
Regarding game theory: The examples you give are about game theory not describing actual behavior very well. But I assume we want here to use game theory as a theory of (multi-agent instrumental) rationality. So in our case it has to describe how people should interact, not necessarily how they do interact. Right? Of course, if people do presumably interact rationally in certain cases, while game theory describes something else, then it is both normatively and descriptively inadequate. I'm not sure whether your examples are such cases. But there are others. For example, both game theory and decision theory seem to recommend not to go voting in a democracy. In the former case because it seems to be a prisoner's dilemma, in the latter because the expected utility of voting is very low. Voting being irrational seems highly counterintuitive, especially if you haven't already been "brainwashed" with those theories. They seem to miss some sort of Kantian "but if everyone did not vote" reasoning. That seems to me somewhat more excusable for decision theory, since it is not multi-agentic in the first place. But game theory does indeed also seem more "tried and false" to me. Though some would bite the bullet and say voting is in fact irrational.
Hiring Programmers in Academia

Jeff, are they working to reduce existential risk?  If so, this is totally the kind of thing that BERI supports :)

AI Forecasting: One Year In

Jacob, good to see you thinking seriously about forecasting risks.  At SPARC meetings you argued that AGI was 80 years away, and maintained that view after considerable debate.  Is "80 years away" still your position, or have you updated?

Nonprofit Boards are Weird

To add to the content a bit: I haven't liked it when I've seen Boards totally unwilling to replace the CEO, e.g., due to a promise of loyalty to never replace them, or feeling socially outranked by the CEO. A board that feels that way should, in my opinion, focus on gracefully replacing itself with new members who would not feel too socially-or-morally-outranked-by-the-CEO to replace the CEO; otherwise there is basically nothing holding the CEO accountable for currently held resources (only future reputation).

Nonprofit Boards are Weird

Great post! Nice to have this out in the open; I too have observed numerous non-profit boards with the properties you describe (>5?) and agree with your recommendations for what the board should prioritize, including specialization, protecting the Board's authority and the CEO role (for future CEOs), and mainly focussing on evaluating the CEO.

2Andrew_Critch1mo
To add to the content a bit: I haven't liked it when I've seen Boards totally unwilling to replace the CEO, e.g., due to a promise of loyalty to never replace them, or feeling socially outranked by the CEO. A board that feels that way should, in my opinion, focus on gracefully replacing itself with new members who would not feel too socially-or-morally-outranked-by-the-CEO to replace the CEO; otherwise there is basically nothing holding the CEO accountable for currently held resources (only future reputation).
Safetywashing

This is why we need to say "existential safety" or "x-safety" more often.  "Long-term risk" is no longer an appropriate synonym for existential risk, though "large-scale risk" is still fitting.

7acylhalide1mo
Agreed! It could also act as a filter against people who are aware at some level of the risks, but are mentally compartmentalising them too much to be capable of saying that "yes AI could kill everyone on earth".
7ThomasWoodside1mo
I also think it works better than "alignment", especially when used among people who may or may not care about x-risk. I've found "alignment" can be very slippery and it can sometimes be improved without any corresponding clear reduction in x-risk.
LessWrong Has Agree/Disagree Voting On All New Comment Threads

... and, I'd go so far as to bet that the large amount of agreement with your comment here is representative of a bunch of users who would feel similarly, but I'm putting this in a separate comment so it accrues a separate agree/disagree score.  If lots of people disagree, I'll update :)

LessWrong Has Agree/Disagree Voting On All New Comment Threads

I agree with the sentiment here, but I think you have too little faith in some people's willingness to be disagreeable... especially on LessWrong!  Personally I'd feel fine/great about having a high karma and a low net-agreement score, because it means I'm adding a unique perspective to the community that people value.

6Andrew_Critch1mo
... and, I'd go so far as to bet that the large amount of agreement with your comment here is representative of a bunch of users who would feel similarly, but I'm putting this in a separate comment so it accrues a separate agree/disagree score. If lots of people disagree, I'll update :)
LessWrong Has Agree/Disagree Voting On All New Comment Threads

I love this change, for most of the same reasons as Ben.  Thanks, LessWrong team!  Some ideas for further ways to empower finer-grained epistemics at the community level:

  1. (additional metrics) I think it'd be nice to have a drop-down or hover-over to see more fine-grained statistics on a post or comment, such as:
    1. total_upvotes := the total number of upvotes (ignoring downvotes)
    2. total_downvotes
    3. voting_activity := total_upvotes + total_downvotes
    4. voting_controversy := min(total_upvotes, total_downvotes)
    5. total_agreement
    6. total_disagreement
    7. agreement_activi
... (read more)
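To make the intent of these statistics concrete, here is a minimal sketch of how they might be computed from raw vote counts. The field and property names are illustrative only (this is not an actual LessWrong schema), and the items beyond the truncated list entry are assumptions filled in by analogy.

from dataclasses import dataclass

@dataclass
class VoteStats:
    total_upvotes: int       # total number of upvotes (ignoring downvotes)
    total_downvotes: int     # total number of downvotes
    total_agreement: int     # total number of "agree" votes
    total_disagreement: int  # total number of "disagree" votes

    @property
    def voting_activity(self) -> int:
        return self.total_upvotes + self.total_downvotes

    @property
    def voting_controversy(self) -> int:
        return min(self.total_upvotes, self.total_downvotes)

    # The remaining items are assumed by analogy with the voting metrics above,
    # since the original list is truncated.
    @property
    def agreement_activity(self) -> int:
        return self.total_agreement + self.total_disagreement

    @property
    def agreement_controversy(self) -> int:
        return min(self.total_agreement, self.total_disagreement)

stats = VoteStats(total_upvotes=12, total_downvotes=3, total_agreement=7, total_disagreement=6)
print(stats.voting_activity, stats.voting_controversy)  # 15 3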
LessWrong Has Agree/Disagree Voting On All New Comment Threads

I mostly agree with Ben here, though I think Adam's preference could be served by having a few optional sort orders available to the user on a given page, like "Sort by most agreement" or "Sort by most controversial".  Without changing the semantics of what you have now, you could even allow the user to enter a custom sorting function (Airtable-style), like "2*karma + 3*(agreement + disagreement)" and sort by that.  These could all be hidden under a three-dots menu dropdown to avoid clutter.
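As a minimal sketch of the custom-sorting idea (hypothetical comment records and field names, not the actual LessWrong data model):

# Hypothetical comment records; field names are illustrative.
comments = [
    {"id": "a", "karma": 12, "agreement": 5, "disagreement": 1},
    {"id": "b", "karma": 4,  "agreement": 9, "disagreement": 8},
    {"id": "c", "karma": 20, "agreement": 0, "disagreement": 0},
]

# A user-chosen weighting, e.g. "2*karma + 3*(agreement + disagreement)".
def custom_sort_key(c):
    return 2 * c["karma"] + 3 * (c["agreement"] + c["disagreement"])

# "Sort by most controversial" could instead key on min(agreement, disagreement).
for c in sorted(comments, key=custom_sort_key, reverse=True):
    print(c["id"], custom_sort_key(c))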

I could imagine this sort of fix mostly solving the problem for readers, but so far at least I've been most pained by this while voting. The categories "truth-tracking" and "true" don't seem cleanly distinguishable to me—nor do e.g. "this is the sort of thing I want to see on LW" and "I agree"—so now I experience type error-ish aversion and confusion each time I vote.

Slow motion videos as AI risk intuition pumps

For a person at a starting point of the form {AGI doesn't pose a risk / I don't get it}, I'd say this video+argument pushes thinking in a more robustly accurate direction than most brief-and-understandable arguments I've seen.  Another okay brief-and-understandable argument is the analogy "humans don't respect gorillas or ants very much, so why assume AI will respect humans?", but I think that argument smuggles in lots of cognitive architecture assumptions that are less robustly true across possible futures, by comparison to the speed advantage argument (which seems robustly valid across most futures, and important).

0AllAmericanBreakfast2mo
It sounds like you're advocating starting with the slow-motion camera concept, and then graduating into brainstorming AGI attack vectors and defenses until the other person becomes convinced that there are a lot of ways to launch a conclusive humanity-ending attack and no way to stop them all. My concern with the overall strategy is that the slow-motion camera argument may promote a way of thinking about these attacks and defenses that becomes unmoored from the speed at which physical processes can occur, and the accuracy with which they can be usefully predicted even by an AGI that's extremely fast and intelligent. Most people do not have sufficient appreciation for just how complex the world is, how much processing power it would take to solve NP-hard problems, or how crucial the difference is between 95% right and 100% right in many cases. If your objective is to convince people that AGI is something to take seriously as a potential threat, I think your approach would be accuracy-promoting if it moves people from "I don't get it/no way" to "that sounds concerning - worth more research!" If it moves people to forget or ignore the possibility that AGI might be severely bottlenecked by the speed of physical processes, including the physical processes of human thought and action, then I think it would be at best neutral in its effects on people's epistemics. However, I do very much support and approve of the effort to find an accuracy-promoting and well-communicated way to educate and raise discussion about these issues. My question here is about the specific execution, not the overall goal, which I think is good.
Slow motion videos as AI risk intuition pumps

I'd say I've tried it with around 30 people?  With around 15 I showed the video, and with around 15 I didn't.  In all cases they seemed more thoughtful once I made the (humans:AI)::(plants:humans) analogy, and when I showed the video they seemed to spend considerably more time generating independent thoughts of their own about how things could go wrong.

Of course, speed isn't the only thing that matters, isn't necessary, isn't sufficient, etc. etc..  But it's a big deal in a lot of scenarios, and it helps to get people thinking about it.

AGI Ruin: A List of Lethalities

Eliezer, thanks for sharing these ideas so that more people can be on the lookout for failures.  Personally, I think something like 15% of AGI dev teams (weighted by success probability) would destroy the world more-or-less immediately, and I think it's not crazy to think the fraction is more like 90% or higher (which I judge to be your view).

FWIW, I do not agree with the following stance, because I think it exposes the world to more x-risk:

So far as I'm concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering

... (read more)
1acylhalide2mo
I don't think the inferential distance to pivotal act thinking is that high even if you manage to censor it out at a community level. If you do censor, what you will get is a lot of people doing pivotal act thinking but doing it badly, because they can't build off of each other's work. Whether this is net good or net bad I'm uncertain. Hopefully the AI lab actually executing pivotal acts does better thinking on it than the average LWer. That still doesn't mean they'll do it enough or that they wouldn't benefit from a public body of work on what good pivotal acts look like.

a considerable fraction of the remaining AI x-risk facing humanity stems from people pulling desperate (unsafe) moves with AGI to head off other AGI projects

In your post “Pivotal Act” Intentions, you wrote that you disagree with contributing to race dynamics by planning to invasively shut down AGI projects because AGI projects would, in reaction, try to maintain

the ability to implement their own pet theories on how safety/alignment should work, leading to more desperation, more risk-taking, and less safety overall.

Could you give some kind of very rough est... (read more)

“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

John, it seems like you're continuing to make the mistake-according-to-me of analyzing the consequences of a pivotal act without regard for the consequences of the intentions leading up to the act.  The act can't come out of a vacuum, and you can't build a project compatible with the kind of invasive pivotal acts I'm complaining about without causing a lot of problems leading up to the act, including triggering a lot of fear and panic for other labs and institutions.  To summarize from the post title: pivotal act intentions directly have negative consequences for x-safety, and people thinking about the acts alone seem to be ignoring the consequences of the intentions leading up to the act, which is a fallacy.

5johnswentworth2mo
I see the argument you're making there. I still think my point stands: the strategically relevant question is not whether unilateral pivotal act intentions will cause problems, the question is whether aiming for a unilateral pivotal act would or would not reduce the chance of human extinction much more than aiming for a multilateral pivotal act. The OP does not actually attempt to compare the two, it just lists some problems with aiming for a unilateral pivotal act. I do think that aiming for a unilateral act increases the chance of successfully executing the pivotal act by multiple orders of magnitude, even accounting for the part where other players react to the intention, and that completely swamps the other considerations.
4Ben Pace2mo
Just as a related idea, in my mind, I often do a kind of thinking that HPMOR!Harry would call “Hufflepuff Bones”, where I look for ways a problem is solvable in physical reality at all, before considering ethical and coordination and even much in the way of practical concerns.
Intergenerational trauma impeding cooperative existential safety efforts

Writing this post as if it's about AI risk specifically seems weirdly narrow.

I disagree.  Parts 2-5 wouldn't make sense to argue for a random other cause area that people go to college hoping to revolutionize.  Parts 2-5 are about how AI is changing rapidly, and going to continue changing rapidly, and those changes result in changes to discourse, such that it's more-of-a-mistake-than-for-other-areas to treat humanity as a purely static entity that either does or doesn't take AI x-risk seriously enough.

By contrast, animal welfare is another really... (read more)

I agree parts 2-5 wouldn't make sense for all the random cause areas, but they would for a decent chunk of them. CO2-driven climate change, for example, would have been an excellent fit for those sections about 10 years ago.

That said, insofar as we're mainly talking about level of discourse, I at least partially buy your argument. On the other hand, the OP makes it sound like you're arguing against pessimism about shifting institutions in general, which is a much harder problem than discourse alone (as evidenced by the climate change movement, for instance).

“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

Eliezer, from outside the universe I might take your side of this bet.  But I don't think it's productive to give up on getting mainstream institutions to engage in cooperative efforts to reduce x-risk.

A propos, I wrote the following post in reaction to positions-like-yours-on-this-issue, but FYI it's not just you (maybe 10% you though?):
https://www.lesswrong.com/posts/5hkXeCnzojjESJ4eB 

1berglund2mo
The link doesn't work. I think you are linking to a draft version of the post or something.
"Tech company singularities", and steering them to reduce x-risk

Yep, you got it!  The definition is meant to be non-recursive and grounded in 2022-level industrial capabilities.  This definition is a bit unsatisfying insofar as 2022 is somewhat arbitrary, except that I don't think the definition would change much if we replaced 2022 by 2010.

I decided not to get into these details to avoid bogging down the post with definitions, but if a lot of people upvote you on this I will change the OP.

Thanks for raising this!

5michaelkeenan3mo
Looks like it's fixed on the EA Forum version but not the LW version.
"Tech company singularities", and steering them to reduce x-risk

I agree this is an important question.  From the post:

given the choice to do so — in the form of agreement among its Board and CEO — with around one year of effort following the choice. 

I.e., in the definition, the "company" is considered to have "chosen" once the Board and CEO have agreed to do it.  If the CEO and Board agree and make the choice but the company fails to do the thing — e.g., because the employees refuse to go along with the Board+CEO decision — then the company has failed to execute on its choice, despite "effort" (presumabl... (read more)

"Tech company singularities", and steering them to reduce x-risk

(I originally posted this reply to the wrong thread)

tech companies are much, much better at steering you than you are at steering them. So in the AI policy space, people mostly work on trying to explain AI risk to decisionmakers in an honest and persuasive way, not by relabelling tech companies (which can be interpreted or misinterpreted as pointing fingers).

I agree with this generally.

Slack gives you space to notice/reflect on subtle things

+1 to John and Ray for this; my experience is very similar to John's. Here's a relevant old post from me, where I was trying to gesture at the importance of a cluster of things around or similar to your #3, as distinct from #1 and #2; http://acritch.com/boredom/ (Title: Boredom as Exploratory Overhead Cost.)

Spend twice as much effort every time you attempt to solve a problem

Nice post!  I think something closer to 1+√2 (≈ 2.41) would be a better multiplier than two.  Reason:

Instead of minimizing the upper bound of total effort (b^2d−1)/(b-1), it makes sense to also consider the lower bound, (bd−1)/(b-1), which is achieved when d is a power of b. We can treat the "expected" effort (e.g., if you have a uniform improper prior on d) as landing in the middle of these two numbers, i.e.,

((b^2d−1)/(b-1) + (bd−1)/(b-1)) / 2  =  (b^2d + bd − 2) / (2(b-1))  ≈  d(b^2 + b) / (2(b-1)) for d large.

This is minimized where b = 1+√(2−2/d), which approaches b=1+√2 for d large.  If you squint at your sees... (read more)
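As a rough numerical sanity check (a sketch assuming the midpoint formula above; not code from the original comment), scanning values of b for several difficulties d shows the minimizer approaching 1+√2:

import numpy as np

# Expected total effort under the midpoint approximation:
# ((b^2*d - 1)/(b - 1) + (b*d - 1)/(b - 1)) / 2
def expected_effort(d, b):
    return ((b**2 * d - 1) / (b - 1) + (b * d - 1) / (b - 1)) / 2

bs = np.linspace(1.5, 4.0, 10001)
for d in [10, 1e3, 1e6, 1e9]:
    best_b = bs[np.argmin(expected_effort(d, bs))]
    print(f"d = {d:>12}: minimizing b ≈ {best_b:.4f}")

print("1 + sqrt(2) =", 1 + np.sqrt(2))  # ≈ 2.4142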

1Jsevillamol1y
Nicely done! I think this improper prior approach makes sense. I am a bit confused on the step when you go from an improper prior to saying that the "expected" effort would land in the middle of these numbers. This is because the continuous part of the total effort spent vs doubling factor is concave, so I would expect the "expected" effort to be weighted more in favor of the lower bound. I tried coding up a simple setup where I average the graphs across a space of difficulties to approximate the "improper prior" but it is very hard to draw a conclusion from it. I think the graph suggests that the asymptotic minimum is somewhere above 2.5 but I am not sure at all.

[Graph: doubling factor (x-axis) vs expected total effort spent (y-axis), averaged across 1e5 difficulty levels uniformly spaced between d=2 and d=1e6.]

Also I guess it is unclear to me whether a flat uninformative prior is best, vs an uninformative prior over logspace of difficulties. What do you think about both of these things?

Code for the graph:

import numpy as np
import matplotlib.pyplot as plt
import math

effort_spent = lambda d, b: (b**(np.ceil(math.log(d, b)) + 1) - 1) / (b - 1)

ds = np.linspace(2, 1000000, 100000)
hist = np.zeros(shape=(1000,))
for d in ds:
    bs = np.linspace(1.1, 5, 1000)
    hist += np.vectorize(lambda b: effort_spent(d, b))(bs) / len(ds)
plt.plot(bs, hist)
Power dynamics as a blind spot or blurry spot in our collective world-modeling, especially around AI

Sorry for the slow reply!

I feel like you are lumping together things like "bargaining in a world with many AIs representing diverse stakeholders" with things like "prioritizing actions on the basis of how they affect the balance of power."

Yes, but not a crux for my point.  I think this community has a blind/blurry spot around both of those things (compared to the most influential elites shaping the future of humanity).   So, the thesis statement of the post does not hinge on this distinction, IMO.

I would prefer keep those things separate.

Yep, and... (read more)

Let's define:

  • X = thinking about the dynamics of conflict + how they affect our collective ability to achieve things we all want; prioritizing actions based on those considerations
  • Y = thinking about how actions shift the balance of power + how we should be trying to shift the balance of power; prioritizing actions based on those considerations

I'm saying:

  • I think the alignment community traditionally avoids Y but does a lot of X.
  • I think that the factors you listed (including in the parent) are mostly reasons we'd do less Y.
  • So I read you as mostly making a cas
... (read more)
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

> Failure mode: When B-cultured entities invest in "having more influence", often the easiest way to do this will be for them to invest in or copy A'-cultured-entities/processes.  This increases the total presence of A'-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values.  Moreover, the A' culture has an incentive to trick the B culture(s) into thinking A' will not take over the world, but eventually, A' wins.

> In other words, the humans and human-aligned institutions no

... (read more)
6Ben Pace1y
Sounds great! I was thinking myself about setting aside some time to write a summary of this comment section (as I see it).
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

> My prior (and present) position is that reliability meeting a certain threshold, rather than being optimized, is a dominant factor in how soon deployment happens.

I don't think we can get to convergence on many of these discussions, so I'm happy to just leave it here for the reader to think through.

Yeah I agree we probably can't reach convergence on how alignment affects deployment time, at least not in this medium (especially since a lot of info about company policies / plans / standards are covered under NDAs), so I also think it's good to leave this... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

> Both [cultures A and B] are aiming to preserve human values, but within A, a subculture A' develops to favor more efficient business practices (nihilistic power-maximizing) over preserving human values.

I was asking you why you thought A'  would effectively outcompete B (sorry for being unclear). For example, why do people with intrinsic interest in power-maximization outcompete people who are interested in human flourishing but still invest their money to have more influence in the future?

 

Ah! Yes, this is really getting to the crux of thing... (read more)

Failure mode: When B-cultured entities invest in "having more influence", often the easiest way to do this will be for them to invest in or copy A'-cultured-entities/processes.  This increases the total presence of A'-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values.  Moreover, the A' culture has an incentive to trick the B culture(s) into thinking A' will not take over the world, but eventually, A' wins.

I'm wondering why the easiest way is to copy A'---why was A' better at... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Thanks for the pointer to grace2020whose!  I've added it to the original post now under "successes in our agent-agnostic thinking".

But I also think the AI safety community has had important contributions on this front.

For sure, that is the point of the "successes" section.  Instead of "outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes" I should probably have said "outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes, and to my ... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Thanks for this synopsis of your impressions, and +1 to the two points you think we agree on.

I also read the post as implying or suggesting some things I'd disagree with:

As for these, some of them are real positions I hold, while some are not:

  • That there is some real sense in which "cooperation itself is the problem."

I don't hold that view.  The closest view I hold is more like: "Failing to cooperate on alignment is the problem, and solving it involves being both good at cooperation and good at alignment."

  • Relatedly, that cooperation plays a qual
... (read more)

Failing to cooperate on alignment is the problem, and solving it involves being both good at cooperation and good at alignment

Sounds like we are on broadly the same page. I would have said "Aligning ML systems is more likely if we understand more about how to align ML systems, or are better at coordinating to differentially deploy aligned systems, or are wiser or smarter or..." and then moved on to talking about how alignment research quantitatively compares to improvements in various kinds of coordination or wisdom or whatever. (My bottom line from doing ... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

The previous story tends to frame this more as a failure of humanity’s coordination, while this one frames it (in the title) as a failure of intent alignment. It seems like both of these aspects greatly increase the plausibility of the story, or in other words, if we eliminated or made significantly less bad either of the two failures, then the story would no longer seem very plausible.

Yes, I agree with this.

A natural next question is then which of the two failures would be best to intervene on, that is, is it more useful to work on intent alignment, or wo

... (read more)

For some reason when I express opinions of the form "Alignment isn't the most valuable thing on the margin", alignment-oriented folks (e.g., Paul here) seem to think I'm saying you shouldn't work on alignment

In fairness, writing “marginal deep-thinking researchers [should not] allocate themselves to making alignment […] cheaper/easier/better” is pretty similar to saying “one shouldn’t work on alignment.”

(I didn’t read you as saying that Paul or Rohin shouldn’t work on alignment, and indeed I’d care much less about that than about a researcher at CHAI argui... (read more)

Perhaps I should start saying "Guys, can we encourage folks to work on both issues please, so that people who care about x-risk have more ways to show up and professionally matter?", and maybe that will trigger less pushback of the form "No, alignment is the most important thing"... 

I think that probably would be true.

For some reason when I express opinions of the form "Alignment isn't the most valuable thing on the margin", alignment-oriented folks (e.g., Paul here) seem to think I'm saying you shouldn't work on alignment (which I'm not), which trigg

... (read more)
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

I think that the biggest difference between us is that I think that working on single-single alignment is the easiest way to make headway on that issue, whereas you expect greater improvements from some categories of technical work on coordination

Yes.

(my sense is that I'm quite skeptical about most of the particular kinds of work you advocate

That is also my sense, and a major reason I suspect multi/multi delegation dynamics will remain neglected among x-risk oriented researchers for the next 3-5 years at least.

If you disagree, then I expect the main disagr

... (read more)
4paulfchristiano1y
I was asking you why you thought A' would effectively outcompete B (sorry for being unclear). For example, why do people with intrinsic interest in power-maximization outcompete people who are interested in human flourishing but still invest their money to have more influence in the future? * One obvious reason is single-single misalignment---A' is willing to deploy misaligned AI in order to get an advantage, while B isn't---but you say "their tech is aligned with them" so it sounds like you're setting this aside. But maybe you mean that A' has values that make alignment easy, while B has values that make alignment hard, and so B's disadvantage still comes from single-single misalignment even though A''s systems are aligned? * Another advantage is that A' can invest almost all of their resources, while B wants to spend some of their resources today to e.g. help presently-living humans flourish. But quantitatively that advantage doesn't seem like it can cause A' to dominate, since B can secure rapidly rising quality of life for all humans using only a small fraction of its initial endowment. * Wei Dai has suggested [https://www.alignmentforum.org/posts/Sn5NiiD5WBi4dLzaB/agi-will-drastically-increase-economies-of-scale] that groups with unified values might outcompete groups with heterogeneous values since homogeneous values allow for better coordination, and that AI may make this phenomenon more important. For example, if a research-producer and research-consumer have different values, then the producer may restrict access as part of an inefficient negotiation process and so they may be at a competitive disadvantage relative to a competing community where research is shared freely. This feels inconsistent with many of the things you are saying in your story, but I might be misunderstanding what you are saying and it could be that some argument like like Wei Dai's is the best way to translate your concer
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

My best understanding of your position is: "Sure, but they will be trying really hard. So additional researchers working on the problem won't much change their probability of success, and you should instead work on more-neglected problems."

That is not my position if "you" in the story is "you, Paul Christiano" :)  The closest position I have to that one is: "If another Paul comes along who cares about x-risk, they'll have more positive impact by focusing on multi-agent and multi-stakeholder issues or 'ethics' with AI tech than if they focus on intent... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Carl, thanks for this clear statement of your beliefs.  It sounds like you're saying (among other things) that American and Chinese cultures will not engage in a "race-to-the-bottom" in terms of how much they displace human control over the AI technologies their companies develop.  Is that right?  If so, could you give me a % confidence on that position somehow?  And if not, could you clarify?

To reciprocate: I currently assign a ≥10% chance of a race-to-the-bottom on AI control/security/safety between two or more cultures this century, ... (read more)

The US and China might well wreck the world  by knowingly taking gargantuan risks even if both had aligned AI advisors, although I think they likely wouldn't.

But what I'm saying is really hard to do is to make the scenarios in the OP (with competition among individual corporate boards and the like) occur without extreme failure of 1-to-1 alignment (for both companies and governments). Competitive pressures are the main reason why AI systems with inadequate 1-to-1 alignment would be given long enough leashes to bring catastrophe. I would cosign Vanessa... (read more)

Another (outer) alignment failure story

(I called the story an "outer" misalignment story because it focuses on the---somewhat improbable---case in which the intentions of the machines are all natural generalizations of their training objectives. I don't have a precise definition of inner or outer alignment and think they are even less well defined than intent alignment in general, but sometimes the meaning seems unambiguous and it seemed worth flagging specifically because I consider that one of the least realistic parts of this story.)

Thanks; this was somewhat helpful to my understanding, beca... (read more)

I currently can't tell if by "outer alignment failure" you're referring to the entire ecosystem of machines being outer-misaligned, or just each individual machine (and if so, which ones in particular), and I'd like to sync with your usage of the concept if possible (or at least know how to sync with it).

I'm saying each individual machine is misaligned, because each individual machine is searching over plans to find one that leads to an outcome that humans will judge as good in hindsight. The collective behavior of many machines each individually trying ma... (read more)

Another (outer) alignment failure story

Paul, thanks for writing this; it's very much in line with the kind of future I'm most worried about.

For me, it would be super helpful if you could pepper throughout the story mentions of the term "outer alignment" indicating which events-in-particular you consider outer alignment failures.  Is there any chance you could edit it to add in such mentions?  E.g., I currently can't tell if by "outer alignment failure" you're referring to the entire ecosystem of machines being outer-misaligned, or just each individual machine (and if so, which ones in particular), and I'd like to sync with your usage of the concept if possible (or at least know how to sync with it).

6paulfchristiano1y
I'd say that every single machine in the story is misaligned, so hopefully that makes it easy :) I'm basically always talking about intent alignment, as described in this post [https://ai-alignment.com/clarifying-ai-alignment-cec47cd69dd6]. (I called the story an "outer" misalignment story because it focuses on the---somewhat improbable---case in which the intentions of the machines are all natural generalizations of their training objectives. I don't have a precise definition of inner or outer alignment and think they are even less well defined than intent alignment in general, but sometimes the meaning seems unambiguous and it seemed worth flagging specifically because I consider that one of the least realistic parts of this story.)
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

I don't understand the claim that the scenarios presented here prove the need for some new kind of technical AI alignment research.

I don't mean to say this post warrants a new kind of AI alignment research, and I don't think I said that, but perhaps I'm missing some kind of subtext I'm inadvertently sending?

I would say this post warrants research on multi-agent RL and/or AI social choice and/or fairness and/or transparency, none of which are "new kinds" of research (I promoted them heavily in my preceding post), and none of which I would call "alignment re... (read more)

From your reply to Paul, I understand your argument to be something like the following:

  1. Any solution to single-single alignment will involve a tradeoff between alignment and capability.
  2. If AIs systems are not designed to be cooperative, then in a competitive environment each system will either go out of business or slide towards the capability end of the tradeoff. This will result in catastrophe.
  3. If AI systems are designed to be cooperative, they will strike deals to stay towards the alignment end of the tradeoff.
  4. Given the technical knowledge to design c
... (read more)
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

> The objective of each company in the production web could loosely be described as "maximizing production'' within its industry sector.

Why does any company have this goal, or even roughly this goal, if they are aligned with their shareholders?

It seems to me you are using the word "alignment" as a boolean, whereas I'm using it to refer to either a scalar ("how aligned is the system?") or a process ("the system has been aligned, i.e., has undergone a process of increasing its alignment").   I prefer the scalar/process usage, because it seems to me t... (read more)

I think I disagree with you on the tininess of the advantage conferred by ignoring human values early on during a multi-polar take-off.  I agree the long-run cost of supporting humans is tiny, but I'm trying to highlight a dynamic where fairly myopic/nihilistic power-maximizing entities end up quickly out-competing entities with other values, due to, as you say, bargaining failure on the part of the creators of the power-maximizing entities.



Right now the United States has a GDP of >$20T, US plus its NATO allies and Japan >$40T, the PRC >$14T,... (read more)

1interstice1y
Is the following scenario a good example of the sort of problem you have in mind? Say you have two advanced ML systems with values that are partially, but not entirely, aligned with humanity: their utility function is 0.9 * (human values) + 0.1 * (control of resources). These two ML systems have been trained with advanced RL, in such a fashion that, when interacting with other powerful systems, they learn to play Nash equilibria. The only Nash equilibrium of their interaction is one where they ruthlessly compete for resources, making the Earth uninhabitable in the process. So both systems are "pretty much aligned", but their joint interaction is radically unaligned. If this seems like a reasonable example, two thoughts: A) I think other people in this discussion might be envisioning 'aligned AI' as looking more like an approval-directed agent, rather than a system trained with RL on a proxy for the human utility function. Crucially, in this paradigm the system's long-term planning and bargaining are emergent consequences of what it predicts an (amplified) human would evaluate highly, they're not baked into the RL algorithm itself. This means it would only try to play a Nash equilibrium if it thinks humans would value that highly, which, in this scenario, they would not. In approval-directed AI systems, or more generally systems where strategic behavior is an emergent consequence of some other algorithm, bargaining ability should rise in tandem with general capability, making it unlikely that very powerful systems would have 'obvious' bargaining failures. B) It seems that systems that are bad at bargaining would also be worse at acquiring resources. For instance, maybe the Nash equilibrium of the above interaction of two RL agents would actually be more like 'try to coordinate a military strike against the other AI as soon as possible', leaving both systems crippled, or to a unipolar scenario(which would be OK given the systems' mostly-aligned utility functions). T
7paulfchristiano1y
Quantitatively I think that entities without instrumental resources win very, very slowly. For example, if the average savings rate is 99% and my personal savings rate is only 95%, then by the time that the economy grows 10,000x my share of the world will have fallen by about half. The levels of consumption needed to maintain human safety and current quality of life seems quite low (and the high-growth during which they have to be maintained is quite low). Also, typically taxes transfer (way more) than that much value from high-savers to low-savers. It's not clear to me what's happening with taxes in your story. I guess you are imagining low-tax jurisdictions winning out, but again the pace at which that happens is even slower and it is dwarfed by the typical rate of expropriation from war. From my end it feels like the big difference is that quantitatively I think the overhead of achieving human values is extremely low, so the dynamics you point to are too weak to do anything before the end of time (unless single-single alignment turns out to be hard). I don't know exactly what your view on this is. If you agree that the main source of overhead is single-single alignment, then I think that the biggest difference between us is that I think that working on single-single alignment is the easiest way to make headway on that issue, whereas you expect greater improvements from some categories of technical work on coordination (my sense is that I'm quite skeptical about most of the particular kinds of work you advocate). If you disagree, then I expect the main disagreement is about those other sources of overhead (e.g. you might have some other particular things in mind, or you might feel that unknown-unknowns are a larger fraction of the total risk, or something else). Could you explain the advantage you are imagining? Some candidates, none of which I think are your view: * Single-single alignment failures---e.g. it's easier to build a widget-maximizing corpora
4paulfchristiano1y
I think this is an indication of the system serving some people (e.g. capitalists, managers, high-skilled labor) better than others (e.g. the median line worker). That's a really important and common complaint with the existing economic order, but I don't really see how it indicates a Pareto improvement or is related to the central thesis of your post about firms failing to help their shareholders. (In general wage labor is supposed to benefit you by giving you money, and then the question is whether the stuff you spend money on benefits you.))

If trillion-dollar tech companies stop trying to make their systems do what they want, I will update that marginal deep-thinking researchers should allocate themselves to making alignment (the scalar!) cheaper/easier/better instead of making bargaining/cooperation/mutual-governance cheaper/easier/better.  I just don't see that happening given the structure of today's global economy and tech industry.

In your story, trillion-dollar tech companies are trying to make their systems do what they want and failing. My best understanding of your position is: "... (read more)

It seems to me you are using the word "alignment" as a boolean, whereas I'm using it to refer to either a scalar ("how aligned is the system?") or a process ("the system has been aligned, i.e., has undergone a process of increasing its alignment").   I prefer the scalar/process usage, because it seems to me that people who do alignment research (including yourself) are going to produce ways of increasing the "alignment scalar", rather than ways of guaranteeing the "perfect alignment" boolean.  (I sometimes use "misaligned" as a boolean due to it

... (read more)
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

These management assistants, DAOs etc are not aligned to the goals of their respective, individual users/owners.

How are you inferring this?  From the fact that a negative outcome eventually obtained?  Or from particular misaligned decisions each system made?  It would be helpful if you could point to a particular single-agent decision in one of the stories that you view as evidence of that single agent being highly misaligned with its user or creator.  I can then reply with how I envision that decision being made even with high single-a... (read more)

How are you inferring this?  From the fact that a negative outcome eventually obtained?  Or from particular misaligned decisions each system made?

I also thought the story strongly suggested single-single misalignment, though it doesn't get into many of the concrete decisions made by any of the systems so it's hard to say whether particular decisions are in fact misaligned.

The objective of each company in the production web could loosely be described as "maximizing production'' within its industry sector.

Why does any company have this goal, or eve... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

I hadn't read it (nor almost any science fiction books/stories) but yes, you're right!  I've now added a callback to Autofac after the "factorial DAO" story.  Thanks.

Some AI research areas and their relevance to existential safety

Good to hear!

If I read that term ["AI existential safety"] without a definition I would assume it meant "reducing the existential risk posed by AI." Hopefully you'd be OK with that reading. I'm not sure if you are trying to subtly distinguish it from Nick's definition of existential risk or if the definition you give is just intended to be somewhere in that space of what people mean when they say "existential risk" (e.g. the LW definition is like yours).

Yep, that's my intention.  If given the chance I'd also shift the meaning of "existential risk" a b... (read more)

Some AI research areas and their relevance to existential safety

The OP's conclusion seems to be that social AI alignment should be the main focus. Personally, I'm less convinced. It would be interesting to see more detailed arguments about the above parameters that support or refute this thesis.

Thanks for the feedback, Vanessa.  I've just written a follow-up post to better illustrate a class of societal-scale failure modes ("unsafe robust agent-agnostic processes") that constitutes the majority of the probability mass I currently place on human extinction precipitated by transformative AI advancements (especially ... (read more)

Some AI research areas and their relevance to existential safety

My actual thought process for believing GDPR is good is not that it "is a sample from the empirical distribution of governance demands", but that it intializes the process of governments (and thereby the public they represent) weighing in on what tech companies can and cannot design their systems to reason about, and more specifically the degree to which systems are allowed to reason about humans.  Having a regulatory structure in place for restricting access to human data is a good first step, but we'll probably also eventually want restrictions for ... (read more)
