David Scott Krueger (formerly: capybaralet)

I'm more active on Twitter than LW/AF these days: https://twitter.com/DavidSKrueger

Bio from https://www.davidscottkrueger.com/:
I am an Assistant Professor at the University of Cambridge and a member of Cambridge's Computational and Biological Learning lab (CBL). My research group focuses on Deep Learning, AI Alignment, and AI safety. I’m broadly interested in work (including in areas outside of Machine Learning, e.g. AI governance) that could reduce the risk of human extinction (“x-risk”) resulting from out-of-control AI systems. Particular interests include:

  • Reward modeling and reward gaming
  • Aligning foundation models
  • Understanding learning and generalization in deep learning and foundation models, especially via “empirical theory” approaches
  • Preventing the development and deployment of socially harmful AI systems
  • Elaborating and evaluating speculative concerns about more advanced future AI systems
     

Comments

Really interesting point!  

I introduced this term in my slides that included "paperweight" as an example of an "AI system" that maximizes safety.  

I sort of still think it's an OK term, but I'm sure I will keep thinking about this going forward and hope we can arrive at an even better term.

You could try to do tests on data that is far enough from the training distribution that the model won't generalize in a simple imitative way there, and you could run tests to try to confirm that you really are far enough off distribution.  For instance, perhaps using a carefully chosen invented language would work.

I don't disagree... in this case you don't get agents for a long time; someone else does though.

I meant "other training schemes" to encompass things like scaffolding that deliberately engineers agents using LLMs as components, although I acknowledge they are not literally "training" and more like "engineering".

I would also look at the main FATE conferences, which I view as being FAccT, AIES, and EAAMO.

I found this thought provoking, but I didn't find the arguments very strong.

(a) Misdirected Regulations Reduce Effective Safety Effort; Regulations Will Almost Certainly Be Misdirected

(b) Regulations Generally Favor The Legible-To-The-State

(c) Heavy Regulations Can Simply Disempower the Regulator

(d) Regulations Are Likely To Maximize The Power of Companies Pushing Forward Capabilities the Most

Briefly responding:
a) The issue in this story seems to be that the company doesn't care about x-safety, not that they are legally obligated to care about face-blindness.
b) If governments don't have bandwidth to effectively vet small AI projects, it seems prudent to err on the side of forbidding projects that might pose x-risk. 
c) I do think we need effective international cooperation around regulation.  But even buying 1-4 years of time seems good in expectation.
d) I don't see the x-risk aspect of this story.

This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

Are you claiming this would happen even given infinite capacity?
If so, can you perhaps provide a simple+intuitive+concrete example?
 

I skimmed this.  A few quick comments:
- I think you characterized deceptive alignment pretty well.  
- I think it only covers a narrow part of how deceptive behavior can arise. 
- CICERO likely already did some of what you describe.

So let us specify a probability distribution over the space of all possible desires. If we accept the orthogonality thesis, we should not want this probability distribution to build in any bias towards certain kinds of desires over others. So let's spread our probabilities in such a way that we meet the following three conditions. Firstly, we don't expect Sia's desires to be better satisfied in any one world than they are in any other world. Formally, our expectation of the degree to which Sia's desires are satisfied at w is equal to our expectation of the degree to which Sia's desires are satisfied at w′, for any worlds w and w′. Call that common expected value 'N'. Secondly, our probabilities are symmetric around N. That is, our probability that w satisfies Sia's desires to at least degree N+m is equal to our probability that it satisfies her desires to at most degree N−m.  And thirdly, learning how well satisfied Sia's desires are at some worlds won't tell us how well satisfied her desires are at other worlds.  That is, the degree to which her desires are satisfied at some worlds is independent of how well satisfied they are at any other worlds.  (See the appendix for a more careful formulation of these assumptions.) If our probability distribution satisfies these constraints, then I'll say that Sia's desires are 'sampled randomly' from the space of all possible desires.
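To make sure I'm reading the three conditions correctly, here is a sketch formalization in my own notation (D_w, N, and m are my placeholders; the appendix may formalize things differently):

```latex
% A sketch formalization of the three conditions, in my own notation
% (the post's appendix may use different symbols).
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
Let $D_w$ denote the degree to which Sia's desires are satisfied at world $w$,
treated as a random variable under our probability distribution over possible desires.
\begin{align*}
\text{(1) Equal expectations:} \quad & \mathbb{E}[D_w] = \mathbb{E}[D_{w'}] = N
  && \text{for all worlds } w, w' \\
\text{(2) Symmetry about } N\text{:} \quad & \Pr(D_w \ge N + m) = \Pr(D_w \le N - m)
  && \text{for all } m \ge 0 \\
\text{(3) Independence:} \quad & \text{the collection } \{D_w\}_{w} \text{ is mutually independent.}
\end{align*}
\end{document}
```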


This is a characterization, and it remains to show that there exist distributions that fit it (I suspect there are none, assuming the sets of possible desires and worlds are unbounded).

I also find the third criterion counterintuitive.  If worlds share features, I would expect the degrees to which Sia's desires are satisfied at those worlds not to be independent.
