Convince me that humanity *isn’t* doomed by AGI

My problem is that most of the scenarios I see being discussed are dependent on a long chain of assumptions being true and they often seem to ignore that many things could go wrong, invalidating the full thing: you don't need to be wrong in all those steps, one of them is just enough.

This feels a bit like it might be shifting the goalposts; it seemed like your previous comment was criticizing a specific argumentative step ("reasons not to believe in doom: [...] Orthogonality of intelligence and agency"), rather than just pointing out that there were many argumentative steps.

Anyway, addressing the point about there being many argumentative steps: I partially agree, although I'm not very convinced since there seems to be significant redundancy in arguments for AI risk (e.g., multiple fuzzy heuristics suggesting there's risk, multiple reasons to expect misalignment, multiple actors who could be careless, multiple ways misaligned AI could gain influence under multiple scenarios).

The different AGIs might find it hard/impossible to coordinate. The different AGIs might even be in conflict with one another

Maybe, although here are six reasons to think otherwise:

  • There are reasons to think they will have an easy time coordinating:
    • (1) As mentioned, a very plausible scenario is that many of these AI systems will be copies of some specific model. To the extent that the model has goals, all these copies of any single model would have the same goal. This seems like it would make coordination much easier.
    • (2) Computer programs may be able to give credible signals through open-source code, facilitating cooperation.
    • (3) Focal points of coordination may come up and facilitate coordination, as they often do with humans.
    • (4) If they are initially in conflict, this will create competitive selection pressures for well-coordinated groups (much like how coordinated human states arise from anarchy).
    • (5) They may coordinate due to decision theoretic considerations.
    • (Humans may be able to mitigate coordination earlier on, but this gets harder as their number and/or capabilities grow.)
  • (6) Regardless, they might not need to (widely) coordinate; overwhelming numbers of uncoordinated actors may be risky enough (especially if there is some local coordination, which seems likely for the above reasons).

Convince me that humanity *isn’t* doomed by AGI
  1. Orthogonality of intelligence and agency. I can envision a machine with high intelligence and zero agency, I haven't seen any convincing argument yet of why both things must necessarily go together

Hm, what do you make of the following argument? Even assuming (contestably) that intelligence and agency don't in principle need to go together, in practice they'll go together because there will appear to be strong economic or geopolitical incentives to build systems that are both highly intelligent and highly agentic (e.g., AI systems that can run teams, companies, or militaries). (And even if some AI developers are cautious enough to not build such systems, less cautious AI developers will, in the absence of strong coordination.)

Also, (2) and (3) seem like reasons why a single AI system may be unable to disempower humanity. Even if we accept that, how relevant will these points be when there is a huge number of highly capable AI systems (which may happen because of the ease and economic benefits of replicating highly capable AI systems)? Their numbers might make up for their limited knowledge and limited plans.

(Admittedly, in these scenarios, people might have significantly more time to figure things out.)

Or as Paul Christiano puts it (potentially in making a different point):

At the same time, it becomes increasingly difficult for humans to directly control what happens in a world where nearly all productive work, including management, investment, and the design of new machines, is being done by machines. We can imagine a scenario in which humans continue to make all goal-oriented decisions about the management of PepsiCo but are assisted by an increasingly elaborate network of prosthetics and assistants. But I think human management becomes increasingly implausible as the size of the world grows (imagine a minority of 7 billion humans trying to manage the equivalent of 7 trillion knowledge workers; then imagine 70 trillion), and as machines’ abilities to plan and decide outstrip humans’ by a widening margin. In this world, the AIs that are left to do their own thing outnumber and outperform those which remain under close management of humans.

We're already in AI takeoff

For people curious about additional discussion of this, Eric Drexler has a closely related comment (focusing on corporate "superhumans" rather than ideological "superhumans"):

[...] The proper (and to some extent actual) goal of firms is widely considered to be maximizing share value, but this is manifestly not the same as maximizing shareholder value -- or even benefiting shareholders. [...] Equating share value with shareholder value is obviously wrong (even when considering only share value!) and is potentially lethal. This conceptual error both encourages complacency regarding the alignment of corporate behavior with human interests and undercuts efforts to improve that alignment.

Why I'm co-founding Aligned AI

So we need a way to have alignment deployed throughout the algorithmic world before anyone develops AGI. To do this, we'll start by offering alignment as a service for more limited AIs.

I'm tentatively fairly excited about some version of this, so I'll suggest some tweaks that can hopefully be helpful for your success (or for the brainstorming of anyone else who's thinking about doing something similar in the future).

We will refine and develop this deployment plan, depending on research results, commercial opportunities, feedback, and suggestions.

I suspect there'd be much better commercial/scaling opportunities for a somewhat similar org that offered a more comprehensive, high-quality package of "trustworthy AI services"--e.g., addressing bias, privacy issues, and other more mainstream concerns along with safety/alignment concerns. Then there'd be less of a need to convince companies about paying for some new service--you would mostly just need to convince them that you're the best provider of services that they're already interested in. (Cf. ethical AI consulting companies that already exist.)

(One could ask: But wouldn't the extra price be the same, whether you're offering alignment in a package or separately? Not necessarily--IP concerns and transaction costs incentivize AI companies to reduce the number of third parties they share their algorithms with.)

As an additional benefit, a more comprehensive package of "trustworthy AI services" would be directly competing for consumers with companies like the AI consulting company mentioned above. This might pressure those companies to start offering safety/alignment services--a mechanism for broadening adoption that isn't available to an org that only provides alignment services.

[From the website] We are hiring AI safety researchers, ML engineers and other staff.

Related to the earlier point, given that commercial opportunities are a big potential bottleneck (in other words, given that selling limited alignment services might be as much of a communications and persuasion challenge as it is a technical challenge), my intuition would be to also put significant emphasis on hiring people who will kill it at persuasion: people who are closely familiar with the market and regulatory incentives faced by relevant companies, people with sales and marketing experience, people with otherwise strong communications skills, etc. (in addition to the researchers and engineers).

What failure looks like

A more recent clarification from Paul Christiano, on how Part 1 might get locked in / how it relates to concerns about misaligned, power-seeking AI:

I also consider catastrophic versions of "you get what you measure" to be a subset/framing/whatever of "misaligned power-seeking." I think misaligned power-seeking is the main way the problem is locked in.

My Overview of the AI Alignment Landscape: Threat Models

I'm still pretty confused by "You get what you measure" being framed as a distinct threat model from power-seeking AI (rather than as another sub-threat model). I'll try to address two defenses of that (of framing them as distinct threat models) which I interpret this post as suggesting (in the context of my earlier comment on the overview post). Broadly, I'll be arguing that power-seeking AI is necessary for "you get what you measure" concerns to be plausible motivators for x-risk-focused people to pursue alignment research, so "you get what you measure" concerns are best thought of as a sub-threat model of power-seeking AI.

(Edit: OK, an aspect of "you get what you measure" concerns--the emphasis on something like "sufficiently strong optimization for some goal is very bad for different goals"--is a tweaked framing of power-seeking AI risk in general, rather than a subset. I think the aspect of "you get what you measure" concerns that's a subset of power-seeking AI risk is the emphasis on goal misspecification as the cause of misalignment. Either way, not a distinct threat model.)

Lock-in: Once we’ve noticed problems, how difficult will they be to fix, and how much resistance will there be? For example, despite the clear harms of CO2 emissions, fossil fuels are such an indispensable part of the economy that it’s incredibly hard to get rid of them. A similar thing could happen if AI systems become an indispensable part of the economy, which seems pretty plausible given how incredibly useful human-level AI would be. As another example, imagine how hard it would be to ban social media, if we as a society decided that this was net bad for the world.

Unless I'm missing something, this is just an argument for why AI might get locked in--not an argument for why misaligned AI might get locked in. AI becoming an indispensable part of the economy isn't a long-term problem if people remain capable of identifying and fixing problems with the AI. So we still need an additional lock-in mechanism (e.g. the initially deployed, misaligned AI being power-seeking) to have trouble. (If we're wondering how hard it will be to fix/improve non-power-seeking AI after it's been deployed, the difficulty of banning social media doesn't seem like a great analogy; a more relevant analogy would be the difficulty of fixing/improving social media after it's been deployed. Empirically, this doesn't seem that hard. For example, YouTube's recommendation algorithm started as a click-maximizer, and YouTube has already modified it to learn from human feedback (!).)

See Sam Clarke’s excellent post for more discussion of examples of lock-in.

I don't think Sam Clarke's post (which I'm also a fan of) proposes any lock-in mechanisms that (a) are plausible reasons for x-risk-focused people to pursue AI alignment work (and therefore be in the scope of this post) and (b) do not depend on AI being power-seeking. Clarke proposes five mechanisms by which Part 1 of "What Failure Looks Like" could get locked in -- addressing each of these in turn (in the context of his original post):

  • (1) short-term incentives and collective action -- arguably fails condition (a) or fails condition (b); if we don't assume AI will be power-seeking, then I see no reason why these difficulties would get much worse in hundreds of years than they are now, i.e. no reason why this on its own is a lock-in mechanism.
  • (2) regulatory capture -- the worry here is that the companies controlling AI might have and permanently act on bad values; this arguably fails condition (a), because if we're mainly worried about AI developers being bad, then focusing on intent alignment doesn't make that much sense.
  • (3) genuine ambiguity -- arguably fails condition (a) or fails condition (b); if we don't assume AI will be power-seeking, then I see no reason why these difficulties would get much worse in hundreds of years than they are now, i.e. no reason why this on its own is a lock-in mechanism.
  • (4) dependency and deskilling -- addressed above
  • (5) [AI] opposition to [humanity] taking back influence -- clearly fails condition (b)

So I think there remains no plausible alignment-relevant threat model for "You get what you measure" that doesn't fall under "power-seeking AI." (And even partial deference to Paul Christiano wouldn't imply that there is, since "What Failure Looks Like" Part 1 features power-seeking AI.)

Zvi’s Thoughts on the Survival and Flourishing Fund (SFF)

Thanks for the thoughtful reply--lots to think more about.

Zvi’s Thoughts on the Survival and Flourishing Fund (SFF)

(One view from which (political) power-seeking seems much less valuable is if we assume that, on the margin, this kind of power isn't all that useful for solving key problems. But if that were the crux, I'd have expected the original criticism to emphasize the (limited) benefits of power-seeking, rather than its costs.)

Zvi’s Thoughts on the Survival and Flourishing Fund (SFF)

In my model, one should be deeply skeptical whenever the answer to ‘what would do the most good?’ is ‘get people like me more money and/or access to power.’ One should be only somewhat less skeptical when the answer is ‘make there be more people like me’ or ‘build and fund a community of people like me.’ [...] I wish I had a better way to communicate what I find so deeply wrong here

I'd be very curious to hear more fleshed-out arguments here, if you/others think of them. My best guess about what you have in mind is that it's a combination of the following (lumping all the interventions mentioned in the quoted excerpt into "power-seeking"):

  1. People have personal incentives and tribalistic motivations to pursue power for their in-group, so we're heavily biased toward overestimating its altruistic value.
  2. Seeking power occupies resources/attention that could be spent figuring out how to solve problems, and figuring out how to solve problems is very valuable.
  3. Figuring out how to solve problems isn't just very valuable. It's necessary for things to go well, so just/mainly doing power-seeking makes it way too easy for us to get the mistaken impression that we're making progress and things are going well, while a crucial input into things going well (knowing what to do with power) remains absent.
  4. Power-seeking attracts leeches (which wastes resources and dilutes relevant fields).
  5. Power-seeking pushes people's attention away from object-level discussion and learning. (This is different from (3) in that (3) is about how power-seeking impacts a specific belief, while this point is about attention.)
  6. Power-seeking makes a culture increasingly value power for its own sake (i.e. "power corrupts"?), which is bad for the usual reasons that value drift is bad.

If that's it (is it?), then I'm more sympathetic than I was before writing out the above, but I'm still skeptical:

  • Re: 1: Speaking of object-level arguments, object-level arguments for the usefulness of power and field growth seem very compelling (and simple enough to significantly reduce room for bias).
    • The arguments I find most compelling are:
      • "It seems very plausible that AI will be at least as transformative as the agricultural revolution this century, and there are only ~50-200 people total (depending on how we count) working full-time on improving the long-term impacts of this transition. Oh man."
      • And the instrumental convergence arguments for power-seeking (which seem central enough to AI safety concerns for "power-seeking is bad because we've got to figure out AI safety" to be a really weird position).
  • 4 only seems like a problem with poorly executed power-seeking (although maybe that's hard to avoid?).
  • 2-5 and 6 seem to be horrific problems mostly just if power-seeking is the main activity of the community, rather than one of several activities. My current sense is that people tend to neither explicitly nor implicitly endorse having power-seeking be the main activity of the community as a whole (it's the main activity of some organizations, sure, but that's just specialization).*

*Maybe then we should worry about the above problems at the organizational level? But these concerns seem mostly cultural, and there seems to be much cultural diffusion / cross-influence across organizations in the movement, so individual organizations are much less vulnerable to these problems than the movement as a whole.

Is "gears-level" just a synonym for "mechanistic"?

Yeah, my impression is that "mechanistic" is often used in the social sciences to refer to an idea very similar to "gears-level." E.g., as discussed in this highly cited overview (with emphasis added):

The idea that science aims at providing mechanistic explanations of phenomena has a long history (Bechtel 2006) [...]. In philosophy of science, mechanistic explanation has been mainly discussed in the context of biological sciences [...] whereas in the social sciences the idea has been mostly discussed by social scientists themselves (Abbott 2007, Elster 1989, Elster 2007, Gross 2009, Hedström 2005, Hedström & Swedberg 1998, Manicas 2006, Mayntz 2004, Morgan & Winship 2007, Schmidt 2006, Tilly 2001, Wikström 2006). [...] In both contexts, the development of the idea of mechanistic explanation has been partly motivated by the shortcomings of the once hegemonic covering-law account of explanation (Hempel 1965). The basic idea of mechanistic explanation is quite simple: At its core, it implies that proper explanations should detail the ‘cogs and wheels’ of the causal process through which the outcome to be explained was brought about.
