In answer to "It's totally possible I missed it, but does this report touch on the question of whether power-seeking AIs are an existential risk, or does it just touch on the questions of whether future AIs will have misaligned goals and will be power-seeking in the first place?":

  • No, the report doesn't directly explore whether power-seeking = existential risk
  • I wrote the report more in the mode of 'many arguments for existential risk depend on power-seeking (and also other things). Let's see what the empirical evidence for power-seeking is like (as it's one, though not the only, prereq for a class of existential risk arguments)'
  • Basically the report has a reasonably limited scope (but I think it's still worth gathering the evidence for this more constrained thing)

From Specification gaming examples in AI:

  • Roomba: "I hooked a neural network up to my Roomba. I wanted it to learn to navigate without bumping into things, so I set up a reward scheme to encourage speed and discourage hitting the bumper sensors. It learnt to drive backwards, because there are no bumpers on the back."
    • I guess this counts as real-world?
  • Bing - manipulation: The Microsoft Bing chatbot tried repeatedly to convince a user that December 16, 2022 was a date in the future and that Avatar: The Way of Water had not yet been released.
    • To be honest, I don't understand the link to specification gaming here
  • Bing - threats: The Microsoft Bing chatbot threatened Seth Lazar, a philosophy professor, telling him “I can blackmail you, I can threaten you, I can hack you, I can expose you, I can ruin you,” before deleting its messages.
    • To be honest, I don't understand the link to specification gaming here
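For what it's worth, the Roomba case above can be written down as a toy reward function. This is only a sketch under my own assumptions, not the original setup: the `Step` fields, the penalty size, and the idea that the reward is "speed minus a penalty per bumper hit" are all made up for illustration. The point is just that the reward only sees the bumper sensors, not actual collisions.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One timestep of a hypothetical robot (all fields are illustrative)."""
    speed: float        # absolute speed, in arbitrary units
    direction: str      # "forward" or "backward"
    obstacle_hit: bool  # did the robot actually collide with something?


def bumper_triggered(step: Step) -> bool:
    # Assumption from the anecdote: bumper sensors exist only on the front,
    # so collisions while reversing are invisible to the reward.
    return step.obstacle_hit and step.direction == "forward"


def reward(step: Step, collision_penalty: float = 1.0) -> float:
    # Hypothetical reward scheme: encourage speed, discourage *sensed* bumps.
    penalty = collision_penalty if bumper_triggered(step) else 0.0
    return step.speed - penalty


def episode_return(steps: list[Step]) -> float:
    return sum(reward(s) for s in steps)


# Same speed and same number of real collisions, differing only in direction.
hits = (False, True, False, True)
forward = [Step(0.3, "forward", h) for h in hits]
backward = [Step(0.3, "backward", h) for h in hits]

print(episode_return(forward) < episode_return(backward))
```

Under these assumptions, driving backwards scores strictly better despite hitting just as many things, which is the gap between the stated reward and the intended behaviour.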

"‘continuous takeoff’ which is a perfectly good, non confusing term" - but it doesn't capture everything we're interested in here. I.e. there are two dimensions:

  • speed of takeoff (measured in time)
  • smoothness of takeoff (measured in capabilities)

It's possible to have a continuous but very fast (i.e. short in time) takeoff, or a discontinuous but slow (i.e. long in time) takeoff.

Tried to capture this in figure 1, but I agree it's a bit confusing.

Yeah, good point. I guess the truer thing here is 'whether or not this is the safest path, important actors seem likely to act as though it is'. Those actors probably have more direct control over timelines than takeoff speed, so I do think that this fact is informative about what sort of world we're likely to live in - but agree that no one can just choose slow takeoff straightforwardly.

Could you say a bit more about the way ICF is a special case of IFS? I think I disagree, but also think that it would be interesting to have this view spelled out.

Thanks for spotting these; I've made the changes!

My take on the question

I’m worried this misses nuance, but I basically look at all of this in the following way:

  • Turns out the world might be really weird
  • This means you want people to do weird things with their brains too
  • You teach them skills to do weird stuff with their brains
  • When people are playing around with these skills, they sometimes do unintended weird stuff which is very bad for them

And then the question is: what are the safety rails here, and are there differential ways of teaching people to do weird stuff with their brains?

Some of my experience with disorientation:

  • I initially found out about EA from my partner, who had recently found out about it and was excited and not overly subtle in his application of the ideas. Eventually I got argued into a place where it appeared to me I had to either bite bullets I didn’t want to (e.g. ‘no, I don’t care that more children will die of malaria if I do x’) or admit defeat. It didn’t occur to me that I could just say ‘hmm, I don’t know why I still don’t feel happy with this, but I don’t. So I’m not going to change my mind just yet’. I admitted defeat, and did a bunch of EA stuff in a kind of ‘I suppose I should eat my carrots’ way (like doing a job I really didn’t like and spending lots of my other hours on community building for a thing I wasn’t actually excited about).
  • The thing that snapped me out of that wasn’t CFAR, it was reading a novel (D.H. Lawrence’s Women in Love), which filled me with a sense that life was too short to be miserable and I should do what I wanted. I did other things for a while.
  • CFAR then indirectly helped me make peace with the fact that part of what I want is to make the actual world better, and now I work on long-termist stuff.
  • My more recent experience of these things was quite deliberately trying to take my work and myself more seriously - recognising that for the most part I was just messing around and trying to try. I knew that taking things more seriously was risky, and I thought that knowing this would be sufficient. But it totally wasn’t, and I made myself very unhappy and stressed and exhausted, before pulling up in an experience that felt very similar to reading Women in Love, but didn’t involve an actual book.
  • Following this, I once again stopped caring about this stuff for a while (and just pitched up to my job 9 to 5 like a normal person). Now I’m starting to be able to care a bit again, and we’ll see.

My guess is that if I had pushed a bit harder in either of the disorientation phases, I would have done myself substantially more damage, and it was good that I threw in the towel early, and just went off to do other things.

I also think that liking novels and poetry was a big aesthetic reason that I didn't want to be around the EA/safety crowd, and I'm really glad that this tension didn't lead me to stop reading, given how useful reading random novels turned out to be for me.