David Scott Krueger (formerly: capybaralet)

I'm more active on Twitter than LW/AF these days: https://twitter.com/DavidSKrueger

https://www.davidscottkrueger.com/
 

Comments

4. Existing Writing on Corrigibility
David Scott Krueger (formerly: capybaralet) · 12d

Seems to be missing old stuff by Stuart Armstrong (?)

the void
David Scott Krueger (formerly: capybaralet) · 24d

Some further half-baked thoughts:


One thing that is still not clear (both in reality, and per this article) is the extent to which we should view a model as having a coherent persona/goal. 

This is a tiny bit related to the question of whether models are strictly simulators, or if some personas / optimization daemons "take on a life of their own", and e.g.:
1) bias the model towards simulating them and/or
2) influence the behavior of other personas

It seems like these things do in fact happen, and the implication is that the "simulator" viewpoint becomes less accurate over time.

Why?

  • There needs to be some prior distribution over personas.
  • Empirically, post-training seems to concentrate the prior over personas on some default persona (although it's unclear what to make of this).
  • It seems like alignment faking, exploration/gradient hacking, and implicit meta-learning type effects are likely to be sensitive to the goals of whichever personas are active, and to lead the model to preferentially update in ways that serve those personas' goals.
  • To the extent that different personas are represented in the prior (or conjured during post-training), the ones that more aggressively use such strategies to influence training updates would gain relatively more influence (a toy sketch of this dynamic is below).
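
The last two bullets can be made concrete with a deliberately crude toy model. Everything in it (the persona names, the rates, the update rule) is a hypothetical assumption for illustration, not a claim about real training dynamics:

```python
import numpy as np

# A toy model of the last two bullets: a "prior over personas" that training
# keeps reshaping. Each persona has an "aggressiveness" for how strongly it
# biases updates (alignment faking, exploration/gradient hacking, etc.) toward
# itself. All names and numbers are illustrative assumptions.
personas = ["default-assistant", "sycophant", "agentic-optimizer"]
prior = np.array([0.90, 0.08, 0.02])        # post-training concentrates mass on a default
aggressiveness = np.array([0.0, 0.3, 1.0])  # how hard each persona steers updates toward itself

for step in range(500):
    # Expected pull on this update ~ (how often the persona is active) x
    # (how aggressively it exploits the update when it is active).
    nudge = 0.01 * prior * aggressiveness
    prior = (prior + nudge) / (prior + nudge).sum()

for name, p in zip(personas, prior):
    print(f"{name}: {p:.3f}")
# The aggressive persona ends up with most of the prior mass despite starting
# with very little -- the sense in which the pure "simulator" picture degrades.
```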
the void
David Scott Krueger (formerly: capybaralet) · 26d

This was an interesting article. However, taking a cynical/critical lens, it seems like "the void" is just... underspecification causing an inner alignment failure?  The post has this to say on the topic of inner alignment:

And one might notice, too, that the threat model – about inhuman, spontaneously generated, secret AI goals – predates Claude by a long shot. In 2016 there was an odd fad in the SF rationalist community about stuff kind of like this, under the name “optimization demons.” Then that discourse got sort of refurbished, and renamed to “inner alignment” vs. “outer alignment.”

This is in the context of mocking these concerns as delusional self-fulfilling prophecies.

I guess the devil is in the details, and the point of the post is more to dispute the framing and ontology of the safety community, which I found useful.  But it does seem weirdly uncharitable in how it does so.

What happens if you present 500 people with an argument that AI is risky?
David Scott Krueger (formerly: capybaralet) · 3mo

In the big round (without counterarguments), arguments pushed people upward slightly more:

(more than downward -- not more than previous surveys)

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
David Scott Krueger (formerly: capybaralet) · 5mo

First, RE the role of "solving alignment" in this discussion, I just want to note that:
1) I disagree that alignment solves gradual disempowerment problems.
2) Even if it would that does not imply that gradual disempowerment problems aren't important (since we can't assume alignment will be solved).
3) I'm not sure what you mean by "alignment is solved"; I'm taking it to mean "AI systems can be trivially intent aligned".  Such a system may still say things like "Well, I can build you a successor that I think has only a 90% chance of being aligned, but will make you win (e.g. survive) if it is aligned.  Is that what you want?" and people can respond with "yes" -- this is the sort of thing that probably still happens IMO.
4) Alternatively, you might say we're in the "alignment basin" -- I'm not sure what that means, precisely, but I would operationalize it as something like "the AI system is playing a roughly optimal CIRL game".  It's unclear how good of performance that can yield in practice (e.g. it can't actually be optimal due to compute limitations), but I suspect it still leaves significant room for fuck-ups.
5) I'm more interested in the case where alignment is not "perfectly" "solved", and so there are simply clear and obvious opportunities to trade off safety and performance; I think this is much more realistic to consider.
6) I expect such trade-off opportunities to persist when it comes to assurance (even if alignment is solved), since I expect high-quality assurance to be extremely costly.  And it is irresponsible (because it's subjectively risky) to trust a perfectly aligned AI system absent strong assurances.  But of course, people who are willing to YOLO it and just say "seems aligned, let's ship" will win.  This is also part of the problem...


My main response, at a high level:
Consider a simple model:

  • We have 2 human/AI teams in competition with each other, A and B.
  • A and B both start out with the humans in charge, and then decide whether the humans should stay in charge for the next week.
  • Whichever group has more power at the end of the week survives.
  • The humans in A ask their AI to make A as powerful as possible at the end of the week.
  • The humans in B ask their AI to make B as powerful as possible at the end of the week, subject to the constraint that the humans in B are sure to stay in charge.

I predict that group A survives, but the humans are no longer in power.  I think this illustrates the basic dynamic.  EtA: Do you understand what I'm getting at?  Can you explain what you think is wrong with thinking of it this way?
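
A throwaway simulation of this toy model, just to make the dynamic explicit. Here "power" is a single scalar and the cost of keeping humans in charge is modeled as a fixed weekly growth penalty; both are assumptions for illustration only:

```python
import random

# Toy version of the two-team model above. "Power" is one number, and keeping
# humans in charge costs a fixed fraction of each week's growth. All numbers
# are illustrative assumptions.
random.seed(0)
OVERSIGHT_DRAG = 0.05   # growth given up each week so humans stay in charge
power = {"A": 1.0, "B": 1.0}

for week in range(52):
    base = random.uniform(1.10, 1.30)          # this week's available growth
    power["A"] *= base                         # A: maximize power, unconstrained
    power["B"] *= base * (1 - OVERSIGHT_DRAG)  # B: same, minus the oversight cost

winner = max(power, key=power.get)
print(f"winner after a year: {winner}")                    # always A in this model
print(f"power ratio A/B: {power['A'] / power['B']:.1f}")   # ~(1/0.95)**52 ~ 14x
# The team that survives is the one whose AI was never asked to keep humans in
# charge: the constraint functions purely as a competitive handicap here.
```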

 

Responding to some particular points below:
 

Sure, but these things don't result in non-human entities obtaining power right? 

Yes, they do; they result in bureaucracies and automated decision-making systems obtaining power.  People were already having to implement and interact with stupid automated decision-making systems before AI came along.

Like usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don't see why these mechanisms would on net transfer power from human control of resources to some other control of resources in the long run. To consider the most extreme case, why would these mechanisms result in humans or human appointed successors not having control of what compute is doing in the long run?

My main claim was not that these are mechanisms of human disempowerment (although I think they are), but rather that they are indicators of the overall low level of functionality of the world.  




 

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
David Scott Krueger (formerly: capybaralet) · 5mo

I think we disagree about:
1) The level of "functionality" of the current world/institutions.
2) How strong and decisive competitive pressures are and will be in determining outcomes.

I view the world today as highly dysfunctional in many ways: corruption, coordination failures, preference falsification, coercion, inequality, etc. are rampant.  This state of affairs both causes many bad outcomes and many aspects are self-reinforcing.  I don't expect AI to fix these problems; I expect it to exacerbate them.

I do believe it has the potential to fix them; however, I think the use of AI for such pro-social ends is not going to be sufficiently incentivized, especially on short time-scales (e.g. a few years), and we will instead see a race-to-the-bottom that encourages highly reckless, negligent, short-sighted, selfish decisions around AI development, deployment, and use.  The current AI arms race is a great example -- companies and nations all view it as more important that they be the ones to develop ASI than to do it carefully or put effort into cooperation/coordination.

Given these views:
1) Asking AI for advice instead of letting it take decisions directly seems unrealistically uncompetitive.  When we can plausibly simulate human meetings in seconds it will be organizational suicide to take hours-to-weeks to let the humans make an informed and thoughtful decision. 
2) The idea that decision-makers who "think a governance structure will yield total human disempowerment" will "do something else" also seems quite implausible.  Such decision-makers will likely struggle to retain power.  Decision-makers who prioritize their own "power" (and feel empowered even as they hand off increasing decision-making to AI) and their immediate political survival above all else will be empowered.

Another feature of the future which seems likely, and which can already be witnessed beginning, is the gradual emergence and ascendance of pro-AI-takeover and pro-arms-race ideologies, which endorse the more competitive move of rapidly handing off power to AI systems in insufficiently cooperative ways.

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
David Scott Krueger (formerly: capybaralet) · 5mo

FYI, this thought experiment is described in ARCHES: https://acritch.com/papers/arches.pdf

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
David Scott Krueger (formerly: capybaralet) · 6mo

I think it's a bit sad that this comment is being so well-received -- it's just some opinions without arguments from someone who hasn't read the paper in detail.  

Evaluating the historical value misspecification argument
David Scott Krueger (formerly: capybaralet) · 6mo
  1. There are 2 senses in which I agree that we don't need full on "capital V value alignment":
    1. We can build things that aren't utility maximizers (e.g. consider the humble MNIST classifier)
    2. There are some utility functions that aren't quite right, but are still safe enough to optimize in practice (e.g. see "Value Alignment Verification", but see also, e.g. "Defining and Characterizing Reward Hacking" for negative results)
  2. But also:
    1. Some amount of alignment is probably necessary in order to build safe agenty things (the more agenty, the higher the bar for alignment, since you start to increasingly encounter perverse instantiation-type concerns -- CAVEAT: agency is not a unidimensional quantity, cf. "Harms from Increasingly Agentic Algorithmic Systems").
    2. Note that my statement was about the relative requirements for alignment in text domains vs. real-world.  I don't really see how your arguments are relevant to this question.

Concretely, in domains with vision, we should probably be significantly more worried that an AI system learns something more like an adversarial "hack" on its values, leading to behavior that significantly diverges from things humans would endorse.
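
To make the "adversarial hack on its values" picture mechanically concrete, here is a minimal sketch under made-up assumptions: `value_model` is a hypothetical stand-in for whatever learned proxy of human values gets optimized against, and plain gradient ascent on the input image is used to find something the proxy scores highly (which a human would typically not endorse):

```python
import torch
import torch.nn as nn

# Minimal sketch: `value_model` stands in for a learned proxy of human values
# in a vision domain (untrained here, purely for illustration). We do gradient
# ascent on the *input image* to find something the proxy scores highly.
torch.manual_seed(0)

value_model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

image = torch.zeros(1, 3, 32, 32, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    score = value_model(image).mean()
    (-score).backward()               # ascend the proxy's score
    optimizer.step()
    image.data.clamp_(0.0, 1.0)       # keep it a valid image

print(f"proxy score of optimized image: {value_model(image).item():.2f}")
# The high-dimensional image space gives the optimizer far more room to find
# such "value hacks" than a text interface would -- the asymmetry between the
# chatbot and robot cases discussed here.
```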
 

Evaluating the historical value misspecification argument
David Scott Krueger (formerly: capybaralet) · 7mo

OTMH, I think my concern here is less:

  • "The AI's values don't generalize well outside of the text domain (e.g. to a humanoid robot)"

    and more:
  • "The AI's values must be much more aligned in order to be safe outside the text domain"

    I.e. if we model an AI and a human as having fixed utility functions over the same accurate world model, then the same AI might be safe as a chatbot, but not as a robot.

    This would be because the richer domain / interface of the robot creates many more opportunities to "exploit" whatever discrepancies exist between AI and human values in ways that actually lead to perverse instantiation.

     
Posts

5 · capybaralet's Shortform (Ω) · 5y · 50 comments
23 · Upcoming workshop on Post-AGI Civilizational Equilibria · 1mo · 0 comments
24 · A review of "Why Did Environmentalism Become Partisan?" · 3mo · 0 comments
164 · Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development (Ω) · 6mo · 65 comments
39 · A Sober Look at Steering Vectors for LLMs (Ω) · 8mo · 0 comments
19 · Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception? (Q, Ω) · 10mo · 7 comments
21 · An ML paper on data stealing provides a construction for "gradient hacking" · 1y · 1 comment
70 · [Link Post] "Foundational Challenges in Assuring Alignment and Safety of Large Language Models" (Ω) · 1y · 2 comments
25 · Testing for consequence-blindness in LLMs using the HI-ADS unit test. · 2y · 2 comments
112 · "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) (Ω) · 2y · 49 comments
20 · What organizations other than Conjecture have (esp. public) info-hazard policies? (Q, Ω) · 2y · 1 comment

Wikitag Contributions

Consequentialism · 10y · (+50/-38)