David Scott Krueger (formerly: capybaralet)

I'm more active on Twitter than LW/AF these days: https://twitter.com/DavidSKrueger

Bio from https://www.davidscottkrueger.com/:
I am an Assistant Professor at the University of Cambridge and a member of Cambridge's Computational and Biological Learning lab (CBL). My research group focuses on Deep Learning, AI Alignment, and AI safety. I’m broadly interested in work (including in areas outside of Machine Learning, e.g. AI governance) that could reduce the risk of human extinction (“x-risk”) resulting from out-of-control AI systems. Particular interests include:

  • Reward modeling and reward gaming
  • Aligning foundation models
  • Understanding learning and generalization in deep learning and foundation models, especially via “empirical theory” approaches
  • Preventing the development and deployment of socially harmful AI systems
  • Elaborating and evaluating speculative concerns about more advanced future AI systems

Wiki Contributions


You could try to do tests on data that is far enough from the training distribution that it won't generalize in a simple immitative way there, and you could do tests to try and confirm that you are far enough off distribution.  For instance, perhaps using a carefully chosen invented language would work.

I don't disagree... in this case you don't get agents for a long time; someone else does though.

I meant "other training schemes" to encompass things like scaffolding that deliberately engineers agents using LLMs as components, although I acknowledge they are not literally "training" and more like "engineering".

I would look at the main FATE conferences as well, which I view as being: FAccT, AIES, EEAMO.

I found this thought provoking, but I didn't find the arguments very strong.

(a) Misdirected Regulations Reduce Effective Safety Effort; Regulations Will Almost Certainly Be Misdirected

(b) Regulations Generally Favor The Legible-To-The-State

(c) Heavy Regulations Can Simply Disempower the Regulator

(d) Regulations Are Likely To Maximize The Power of Companies Pushing Forward Capabilities the Most

Briefly responding:
a) The issue in this story seems to be that the company doesn't care about x-safety, not that they are legally obligated to care about face-blindness.
b) If governments don't have bandwidth to effectively vet small AI projects, it seems prudent to err on the side of forbidding projects that might pose x-risk. 
c) I do think we need effective international cooperation around regulation.  But even buying 1-4 years time seems good in expectation.
d) I don't see the x-risk aspect of this story.

This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

Are you claiming this would happen even given infinite capacity?
If so, can you perhaps provide a simple+intuitive+concrete example?

I skimmed this.  A few quick comments:
- I think you characterized deceptive alignment pretty well.  
- I think it only covers a narrow part of how deceptive behavior can arise. 
- CICERO likely already did some of what you describe.

So let us specify a probability distribution over the space of all possible desires. If we accept the orthogonality thesis, we should not want this probability distribution to build in any bias towards certain kinds of desires over others. So let's spread our probabilities in such a way that we meet the following three conditions. Firstly, we don't expect Sia's desires to be better satisfied in any one world than they are in any other world. Formally, our expectation of the degree to which Sia's desires are satisfied at  is equal to our expectation of the degree to which Sia's desires are satisfied at , for any . Call that common expected value ''. Secondly, our probabilities are symmetric around . That is, our probability that  satisfies Sia's desires to at least degree  is equal to our probability that it satisfies her desires to at most degree .  And thirdly, learning how well satisfied Sia's desires are at some worlds won't tell us how well satisfied her desires are at other worlds.  That is, the degree to which her desires are satisfied at some worlds is independent of how well satisfied they are at any other worlds.  (See the appendix for a more careful formulation of these assumptions.) If our probability distribution satisfies these constraints, then I'll say that Sia's desires are 'sampled randomly' from the space of all possible desires.

This is a characterization, and it remains to show that there exist distributions that fit it (I suspect there are not, assuming the sets of possible desires and worlds are unbounded).

I also find the 3rd criteria counterintuitive.  If worlds share features, I would expect these to not be independent.

I think it might be more effective in future debates at the outset to: 
* Explain that it's only necessary to cross a low bar (e.g. see my Tweet below).  -- This is a common practice in debates.
* Outline the responses they expect to hear from the other side, and explain why they are bogus.  Framing: "Whether AI is an x-risk has been debated in the ML community for 10 years, and nobody has provided any compelling counterarguments that refute the 3 claims (of the Tweet).  You will hear a bunch of counter arguments from the other side, but when you do, ask yourself whether they are really addressing this.  Here are a few counter-arguments and why they fail..." -- I think this could really take the wind out of the sails of the opposition, and put them on the back foot.

I also don't think Lecun and Meta should be given so much credit -- Is Facebook really going to develop and deploy AI responsibly?
1) They have been widely condemned for knowingly playing a significant role in the Rohingya genocide, have acknowledged that they failed to act to prevent Facebook's role in the Rohingya genocide, and are being sued for $150bn for this.  
2) They have also been criticised for the role that their products, especially Instagram, play in contributing to mental health issues, especially around body image in teenage girls.  

More generally, I think the "companies do irresponsible stuff all the time" point needs to be stressed more.  And one particular argument that is bogus is the "we'll make it safe" -- x-safety is a common good, and so companies should be expected to undersupply it.  This is econ 101.

Load More