David Scott Krueger (formerly: capybaralet)

I'm more active on Twitter than LW/AF these days: https://twitter.com/DavidSKrueger

https://www.davidscottkrueger.com/
 

Comments

Thoughts on Gradual Disempowerment
David Scott Krueger (formerly: capybaralet), 8d

Thanks!

> Do you think that, absent AI power-seeking, this dynamic is highly likely to lead to human disempowerment? (If so, then I disagree.)

As a sort-of answer, I would just say that I am concerned that people might knowingly and deliberately build power-seeking AIs and hand over power to them, even if we have the means to build AIs that are not power-seeking.

> I said "absent misalignment", and I think your story involves misalignment?

It does not.  The point of my story is: "reality can also just be unfriendly to you".  There are trade-offs, and so people optimize for selfish, short-term objectives. You could argue people already do that, but cranking up the optimization power without fixing that seems likely to be bad.

My true objection is more that I think we will see extreme safety/performance trade-offs due to technical inadequacies -- i.e. (roughly) the alignment tax is large (although I don't like that framing).  In that case, you have misalignment despite also having a solution to alignment: competitive pressures prevent people from adopting the solution.

Thoughts on Gradual Disempowerment
David Scott Krueger (formerly: capybaralet), 24d

(I’ve only read the parts I’m responding to)
 

> My high-level view is that the convincing versions of gradual disempowerment either rely on misalignment or result [from] power concentration among humans.

It feels like this statement should be qualified more; later it is stated that GD isn’t “similarly plausible to the risks from power-seeking AI or AI-enabled coups”, but this is holding GD to a higher bar; the relevant bar would seem to be “is plausible enough to be worth considering”. 

“Rely[ing] on misalignment” is also an extremely weak condition: I claim that current systems are not aligned, and gradual disempowerment dynamics are already at play (cf. the AI “arms race”).

The analysis of economic disempowerment seems to take place in a vacuum, ignoring one of the main arguments we make, which is that different forms of disempowerment can mutually reinforce each other.  The most concerning version of this, I think, is not just “we don't get UBI”, but rather that the memes that say “it's good to hand over as much power as quickly as possible to AI” win the day. 

The analysis of cultural disempowerment goes one step “worse”, arguing that “If humans remain economically empowered (in the sense of having much more money than AI), I think they will likely remain culturally empowered.”  I think we agree that a reasonable model here is one where cultural and economic empowerment are tightly coupled, but I don’t see why that means they won’t both go off the rails.  You seem to think that they are almost guaranteed to feed back on each other in a way that maintains human power, but I think it can easily go the opposite way.

Regarding political disempowerment, you state: “It’s hard to see how those leading the state and the top AI companies could be disempowered, absent misalignment.” Personally, I find this quite easy. Insufficient elite coordination is one mechanism (discussed below). But reality can also just be unfriendly to you and force you to make choices about how you prioritize long-term vs. short-term objectives, leading people to accept deals like: “I'll be rich and powerful for the next hundred years, and then my AI will take over my domain and do as it pleases”. Furthermore, if more people take such deals, this creates pressure for others to do so as well, since you need to get power in the short-term in order to remain “solvent” in the long term, even if you aren’t myopic yourself.  I think this is already happening; the AI arms race is burning the commons every day; I don’t expect it to stop.

Regarding elite coordination, I also looked at the list under the heading “Sceptic: Why don’t the elites realise what’s happening and coordinate to stop it?” Another important reason not mentioned is that cooperating usually produces a bargaining game where there is no clearly correct way to split the proceeds of the cooperation.

The many paths to permanent disempowerment even with shutdownable AIs (MATS project summary for feedback)
David Scott Krueger (formerly: capybaralet), 1mo

Yeah I roughly agree.

EtA: I might say, e.g., that algorithmic trading and marketing (which are older) are already doing this, but it's a bit subjective and uncertain.

The many paths to permanent disempowerment even with shutdownable AIs (MATS project summary for feedback)
David Scott Krueger (formerly: capybaralet), 1mo

> and to date, this fact has little to do with AI.

This seems incorrect over the last couple of years.  But it also seems incorrect historically if you broaden from AI to "information processing and person modelling technologies that help turn money into influence".

But more generally, GD can be viewed as a continuation of historical trends or not.  I think I'm more in the "continuation" camp vs. e.g. Duvenaud, who would stress that things change once humans become redundant.

4. Existing Writing on Corrigibility
David Scott Krueger (formerly: capybaralet), 2mo

Seems to be missing old stuff by Stuart Armstrong (?)

the void
David Scott Krueger (formerly: capybaralet), 3mo

Some further half-baked thoughts:


One thing that is still not clear (both in reality, and per this article) is the extent to which we should view a model as having a coherent persona/goal. 

This is a tiny bit related to the question of whether models are strictly simulators, or if some personas / optimization daemons "take on a life of their own", and e.g.:
1) bias the model towards simulating them and/or
2) influence the behavior of other personas

It seems like these things do in fact happen, and the implication is that the "simulator" viewpoint becomes less accurate over time.

Why?

  • There needs to be some prior distribution over personas.
  • Empirically, post-training seems to concentrate the prior over personas on some default persona (although it's unclear what to make of this).
  • It seems like alignment faking, exploration/gradient hacking, and implicit meta-learning type effects are likely to be sensitive to the goals of whichever personas are active, and to lead the model to preferentially update in a way that serves the goals of those personas.
  • To the extent that different personas are represented in the prior (or conjured during post-training), the ones that more aggressively use such strategies to influence training updates would gain relatively more influence. 
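
As a toy sketch of the selection dynamic in the last two bullets (entirely my own illustration: the update rule, the persona names, and the "influence_on_updates" numbers are made up, not measured from any model):

```python
# Toy replicator-style dynamic: personas that more aggressively bias training
# updates in their own favour gain relative weight in the "prior" over personas.
# All numbers and the update rule are made-up illustrative assumptions.
personas = {
    "default_assistant": {"weight": 0.90, "influence_on_updates": 0.01},
    "agentic_persona":   {"weight": 0.10, "influence_on_updates": 0.05},
}

for step in range(100):
    # Each persona's weight grows in proportion to how strongly it steers
    # updates toward itself, then the weights are renormalized.
    for p in personas.values():
        p["weight"] *= 1.0 + p["influence_on_updates"]
    total = sum(p["weight"] for p in personas.values())
    for p in personas.values():
        p["weight"] /= total

for name, p in personas.items():
    print(f"{name}: {p['weight']:.3f}")
```

The only point is the qualitative one: any persistent asymmetry in how strongly personas influence training updates compounds into a large shift in which persona dominates.
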
the void
David Scott Krueger (formerly: capybaralet), 3mo

This was an interesting article; however, taking a cynical/critical lens, it seems like "the void" is just... underspecification causing an inner alignment failure?  The post has this to say on the topic of inner alignment:

> And one might notice, too, that the threat model – about inhuman, spontaneously generated, secret AI goals – predates Claude by a long shot. In 2016 there was an odd fad in the SF rationalist community about stuff kind of like this, under the name “optimization demons.” Then that discourse got sort of refurbished, and renamed to “inner alignment” vs. “outer alignment.”

This is in the context of mocking these concerns as delusional self-fulfilling prophecies.

I guess the devil is in the details, and the point of the post is more to dispute the framing and ontology of the safety community, which I found useful.  But it does seem weirdly uncharitable in how it does so.

What happens if you present 500 people with an argument that AI is risky?
David Scott Krueger (formerly: capybaralet), 5mo

> In the big round (without counterarguments), arguments pushed people upward slightly more:

(more than downward -- not more than previous surveys)

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
David Scott Krueger (formerly: capybaralet), 7mo

First, RE the role of "solving alignment" in this discussion, I just want to note that:
1) I disagree that alignment solves gradual disempowerment problems.
2) Even if it would that does not imply that gradual disempowerment problems aren't important (since we can't assume alignment will be solved).
3) I'm not sure what you mean by "alignment is solved"; I'm taking it to mean "AI systems can be trivially intent aligned".  Such a system may still say things like "Well, I can build you a successor that I think has only a 90% chance of being aligned, but will make you win (e.g. survive) if it is aligned.  Is that what you want?" and people can respond with "yes" -- this is the sort of thing that probably still happens IMO.
4) Alternatively, you might say we're in the "alignment basin" -- I'm not sure what that means, precisely, but I would operationalize it as something like "the AI system is playing a roughly optimal CIRL game".  It's unclear how good of performance that can yield in practice (e.g. it can't actually be optimal due to compute limitations), but I suspect it still leaves significant room for fuck-ups.
5) I'm more interested in the case where alignment is not "perfectly" "solved", and so there are simply clear and obvious opportunities to trade-off safety and performance; I think this is much more realistic to consider.
6) I expect such trade-off opportunities to persist when it comes to assurance (even if alignment is solved), since I expect high-quality assurance to be extremely costly.  And it is irresponsible (because it's subjectively risky) to trust a perfectly aligned AI system absent strong assurances.  But of course, people who are willing to YOLO it and just say "seems aligned, let's ship" will win.  This is also part of the problem...


My main response, at a high level:
Consider a simple model:

  • We have 2 human/AI teams in competition with each other, A and B.
  • A and B both start out with the humans in charge, and then decide whether the humans should stay in charge for the next week.
  • Whichever group has more power at the end of the week survives.
  • The humans in A ask their AI to make A as powerful as possible at the end of the week.
  • The humans in B ask their AI to make B as powerful as possible at the end of the week, subject to the constraint that the humans in B are sure to stay in charge.

I predict that group A survives, but the humans are no longer in power.  I think this illustrates the basic dynamic.  EtA: Do you understand what I'm getting at?  Can you explain what you think is wrong with thinking of it this way?
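
To make this concrete, here is a minimal toy simulation of the model (the weekly growth multipliers, and the assumption that an AI optimizing power without the "humans stay in charge" constraint compounds power faster, are made-up illustrative assumptions, not something argued for above):

```python
# Toy version of the two-team model above. All numbers are made up;
# the only assumption doing work is that dropping the "humans stay in
# charge" constraint lets power compound faster.
UNCONSTRAINED_GROWTH = 1.5  # weekly power multiplier for team A's AI
CONSTRAINED_GROWTH = 1.2    # weekly power multiplier for team B's AI

def simulate(weeks: int = 4) -> None:
    power = {"A": 1.0, "B": 1.0}
    humans_in_charge = {"A": True, "B": True}

    for week in range(1, weeks + 1):
        # Team A hands control to its AI, since that maximizes end-of-week power.
        humans_in_charge["A"] = False
        power["A"] *= UNCONSTRAINED_GROWTH
        # Team B keeps humans in charge as a hard constraint.
        power["B"] *= CONSTRAINED_GROWTH

        survivor = "A" if power["A"] >= power["B"] else "B"
        print(f"week {week}: power={power}, survivor: {survivor}, "
              f"humans in charge of survivor: {humans_in_charge[survivor]}")

simulate()
```

However you set the numbers, as long as the unconstrained team grows faster, the surviving team is the one whose humans are no longer in charge.
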

 

Responding to some particular points below:
 

> Sure, but these things don't result in non-human entities obtaining power right?

Yes, they do; they result in bureaucracies and automated decision-making systems obtaining power.  People were already having to implement and interact with stupid automated decision-making systems before AI came along.

> Like usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don't see why these mechanisms would on net transfer power from human control of resources to some other control of resources in the long run. To consider the most extreme case, why would these mechanisms result in humans or human appointed successors not having control of what compute is doing in the long run?

My main claim was not that these are mechanisms of human disempowerment (although I think they are), but rather that they are indicators of the overall low level of functionality of the world.

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
David Scott Krueger (formerly: capybaralet), 7mo

I think we disagree about:
1) The level of "functionality" of the current world/institutions.
2) How strong and decisive competitive pressures are and will be in determining outcomes.

I view the world today as highly dysfunctional in many ways: corruption, coordination failures, preference falsification, coercion, inequality, etc. are rampant.  This state of affairs both causes many bad outcomes and many aspects are self-reinforcing.  I don't expect AI to fix these problems; I expect it to exacerbate them.

I do believe it has the potential to fix them; however, I think the use of AI for such pro-social ends is not going to be sufficiently incentivized, especially on short time-scales (e.g. a few years), and we will instead see a race to the bottom that encourages highly reckless, negligent, short-sighted, selfish decisions around AI development, deployment, and use.  The current AI arms race is a great example -- companies and nations all view it as more important that they be the ones to develop ASI than to do it carefully or put effort into cooperation/coordination.

Given these views:
1) Asking AI for advice instead of letting it make decisions directly seems unrealistically uncompetitive.  When we can plausibly simulate human meetings in seconds, it will be organizational suicide to take hours to weeks to let the humans make an informed and thoughtful decision.
2) The idea that decision-makers who "think a governance structure will yield total human disempowerment" will "do something else" also seems quite implausible.  Such decision-makers will likely struggle to retain power.  Decision-makers who prioritize their own "power" (and feel empowered even as they hand off increasing decision-making to AI) and their immediate political survival above all else will be empowered.

Another feature of the future which seems likely, and whose beginnings can already be witnessed, is the gradual emergence and ascendance of pro-AI-takeover and pro-arms-race ideologies, which endorse the more competitive moves of rapidly handing off power to AI systems in insufficiently cooperative ways.

Wikitag Contributions

Consequentialism (10 years ago, +50/-38)

Posts

Detecting High-Stakes Interactions with Activation Probes (2mo)
Upcoming workshop on Post-AGI Civilizational Equilibria (3mo)
A review of "Why Did Environmentalism Become Partisan?" (5mo)
Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development (7mo)
A Sober Look at Steering Vectors for LLMs (10mo)
Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception? (1y)
An ML paper on data stealing provides a construction for "gradient hacking" (1y)
[Link Post] "Foundational Challenges in Assuring Alignment and Safety of Large Language Models" (1y)
Testing for consequence-blindness in LLMs using the HI-ADS unit test. (2y)
"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) (2y)