David Scott Krueger (formerly: capybaralet)

I'm more active on Twitter than LW/AF these days: https://twitter.com/DavidSKrueger

Bio from https://www.davidscottkrueger.com/:
I am an Assistant Professor at the University of Cambridge and a member of Cambridge's Computational and Biological Learning lab (CBL). My research group focuses on Deep Learning, AI Alignment, and AI safety. I’m broadly interested in work (including in areas outside of Machine Learning, e.g. AI governance) that could reduce the risk of human extinction (“x-risk”) resulting from out-of-control AI systems. Particular interests include:

  • Reward modeling and reward gaming
  • Aligning foundation models
  • Understanding learning and generalization in deep learning and foundation models, especially via “empirical theory” approaches
  • Preventing the development and deployment of socially harmful AI systems
  • Elaborating and evaluating speculative concerns about more advanced future AI systems
     

Comments
  1. There are 2 senses in which I agree that we don't need full on "capital V value alignment":
    1. We can build things that aren't utility maximizers (e.g. consider the humble MNIST classifier)
    2. There are some utility functions that aren't quite right, but are still safe enough to optimize in practice (e.g. see "Value Alignment Verification", but see also, e.g. "Defining and Characterizing Reward Hacking" for negative results)
  2. But also:
    1. Some amount of alignment is probably necessary in order to build safe agenty things (the more agenty, the higher the bar for alignment, since you start to increasingly encounter perverse instantiation-type concerns -- CAVEAT: agency is not a unidimensional quantity, cf. "Harms from Increasingly Agentic Algorithmic Systems").
    2. Note that my statement was about the relative requirements for alignment in text domains vs. real-world.  I don't really see how your arguments are relevant to this question.

Concretely, in domains with vision, we should probably be significantly more worried that an AI system learns something more like an adversarial "hack" on its values, leading to behavior that significantly diverges from what humans would endorse.
 

OTMH, I think my concern here is less:

  • "The AI's values don't generalize well outside of the text domain (e.g. to a humanoid robot)"

    and more:
  • "The AI's values must be much more aligned in order to be safe outside the text domain"

    I.e. if we model an AI and a human as having fixed utility functions over the same accurate world model, then the same AI might be safe as a chatbot, but not as a robot.

    This would be because the richer domain / interface of the robot creates many more opportunities to "exploit" whatever discrepancies exist between AI and human values in ways that actually lead to perverse instantiation.

     

Two things that strike me:

  1. The claim that "There are three kinds of genies:  Genies to whom you can safely say 'I wish for you to do what I should wish for'; genies for which no wish is safe; and genies that aren't very powerful or intelligent." only seems true under a very conservative notion of what it means for a wish to be "safe" (which may be appropriate in some cases).  It's a very black-and-white account -- certainly there ought to be a continuum of genies with different safety/performance trade-offs resulting from their varying capabilities and alignment properties.
  2. The final 3 paragraphs of the linked post on Artificial Addition seem to suggest that deep learning-style approaches to teaching AI systems arithmetic are not promising.  I also recall that EY and others thought deep learning wouldn't work for capabilities either -- an argument that has mostly been falsified.  It seems like the same argument was being used to illustrate a core alignment difficulty in this post, but that's not entirely clear to me.

This comment made me reflect on what fragility of values means.

To me this point was always most salient when thinking about embodied agents, which may need to reliably recognize something like "people" in their environment (in order to instantiate human values like "try not to hurt people") even as the world changes radically with the introduction of various forms of transhumanism.

I guess it's not clear to me how much progress we make towards that with a system that can do a very good job with human values when restricted to the text domain.  Plausibly we just translate everything into text and are good to go?  It makes me wonder where we're at with adversarial robustness of vision-language models, e.g.

OK, so it's not really just your results?  You are aggregating across these studies (and presumably ones of "Westerners" as well)?  I do wonder how directly comparable things are... Did you make an effort to translate an existing study (or questions from existing studies), or are the questions just independently conceived and formulated?

Not necessarily fooling it, just keeping it ignorant.  I think such schemes can plausibly scale to very high levels of capabilities, perhaps indefinitely, since intelligence doesn't give one the ability to create information from thin air...

This is a super interesting and important problem, IMO.  I believe it already has significant real-world practical consequences, e.g. powerful people find it difficult to avoid being surrounded by sycophants: even if they really don't want to be, that's just an extra constraint for the sycophants to satisfy ("don't come across as sycophantic")!  I am inclined to agree that avoiding power differentials is the only way to really avoid these perverse outcomes in practice, and I think this is a good argument in favor of doing so.

--------------------------------------
This is also quite related to some (old, unpublished) work I did with Jonathan Binas on "bounded empowerment".  I've invited you to the Overleaf (it needs some clean-up, but I've also asked Jonathan about putting it on arXiv).
 
To summarize: Let's consider this in the case of a superhuman AI, R, and a human H.  The basic idea of that work is that R should try and "empower" H, and that (unlike in previous works on empowerment), there are two ways of doing this:
1) change the state of the world (as in previous works)
2) inform H so they know how to make use of the options available to them to achieve various ends (novel!)

If R has a perfect model of H and the world, then you can just compute how to effectively do these things (it's wildly intractable, ofc).  I think this would still often look "patronizing" in practice, and/or maybe just lead to totally wild behaviors (hard to predict this sort of stuff...), but it might be a useful conceptual "lead".
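
For background, the standard empowerment objective from the literature (this is just the textbook single-channel definition, not the "bounded" formulation of that draft) is the channel capacity from an agent's action sequences to its future states:

$$\mathcal{E}(s_t) \;=\; \max_{p(a_t, \ldots, a_{t+n-1})} \; I\big(A_t, \ldots, A_{t+n-1};\, S_{t+n} \mid s_t\big)$$

The proposal above would have R act so as to increase something like this quantity for H, with channel (2) -- informing H -- being the novel ingredient.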

Random thought OTMH: Something which might make it less "patronizing" is if H were to have well-defined "meta-preferences" about how such interactions should work that R could aim to respect.  

What makes you say this: "However, our results suggest that students are broadly less concerned about the risks of AI than people in the United States and Europe"? 

This activation function was introduced in one of my papers from 10 years ago ;)

See Figure 2 of https://arxiv.org/abs/1402.3337
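
For convenience, here is a minimal sketch of the thresholded-rectified ("TRec") activation described there, assuming that is the function being referred to; the function name and threshold value below are illustrative rather than taken from the paper:

```python
import numpy as np

def trec(x, theta=1.0):
    # Thresholded-rectified (TRec) activation: pass inputs through unchanged
    # when they exceed the threshold, and output zero otherwise.
    return np.where(x > theta, x, 0.0)

# Values at or below theta are zeroed; values above are kept as-is.
print(trec(np.array([-2.0, 0.5, 1.5, 3.0])))  # [0.  0.  1.5 3. ]
```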
