AXRP - the AI X-risk Research Podcast


I suppose we'll see if it remains there a few days from now

Less of a dialogue and more of a one-way interview / elicitation of models/cruxes but I nominate my interview with Quintin, where I basically wanted to understand what he thought and how he responded to my understanding of 'orthodox' AI alignment takes.

So I just skimmed the abstracts you linked so maybe I was too hasty there, but I'd want to see evidence that (a) a language model was representing concept C really well and (b) it's really relevant for alignment. I think those papers show something like "you can sort of model brain activations by language model activations" or "there's some embedding space for what brains are sort of doing in conversation" which seems like a different thing (unless the fit is good enough that you can reconstruct one from the other without loss of functionality, then I'm interested).

Your example seems to be the type of heuristic argument that, if given about AI, I'd expect would fail to compel many people (including you) on anything approaching a deep level.

I think people are often persuaded of things about AI by heuristic arguments, like "powerful AI will probably be able to reason well and have a decent model of the world because if you don't do that you can't achieve good outcomes" (ok that argument needs some tightening, but I think there's something that works that's only ~2x as long). I think it's going to be harder to persuade me of alignment-relevant stuff about AI with this sort of argument, because there are more ways for such arguments to fail IMO - e.g. the evolution argument relies on evolutionary pressure being ongoing.

Two meta points:

  • There are arguments that would convince me that we've made progress, and there are arguments that would convince me we've solved it. It's easier to get your hands on the first kind than the second.
  • It's easier for me to answer gallabytes' question than yours because I don't think the argument tactics I see are very good, so it's going to be hard to come up with one that I think is good! The closest that I can come is that "what if we tried to learn values" and "AI safety via debate" felt like steps forward in thought, even though I don't think they get very far.

Generally speaking, I suspect there's a tighter connection between our selection criteria in ML and the stuff models will end up "caring" about relative to the analogous case for natural selection. I think this for similar reasons that Quintin Pope alluded to in his essay about the evolutionary analogy.

For the record I'm not convinced of this enough to be optimistic about alignment, but I'm roughly at my budget for internet discussion/debate right now, so I'll decline to elaborate.

If you think you'd be persuaded that animals will end up caring about their offspring because of a heuristic argument about that type of behavior being selected for in-distribution, I'm not sure why you'd need a lot of evidence to be convinced the same will be true for AIs with regard to what we train them to care about.

Roughly because AI can change the distribution and change the selection pressure that gets applied to it. But also I don't think I need a lot of evidence in terms of likelihood ratio---my p(doom) is less than 99%, and people convince me of sub-1-in-100 claims all the time---I'm just not seeing the sort of evidence that would move me a lot.

Oh TBC I skipped the question because it didn't make sense, I'm more writing this in the frame of letting y'all know the question is ambiguous (even for things like "LessWronger", it's much more true of me than it is core to who I am) so that you can maybe disambiguate it in future if you want people to know how to answer.

For 1 it would depend on how alignment-relevant the concepts and values are. Also I wouldn't think of the papers you linked as much evidence here.

For 2, that would for sure do it, but it doesn't feel like much of a reduction.

3 sounds like it's maybe definitionally true? At the very least, I don't doubt it much.

Not sure how close you want it to be but how about this example: "animals will typically care about their offspring's survival and reproduction in worlds where their action space is rich enough for them to be helpful and too rich for them to memorize extremely simple heuristics, because if they didn't their genes wouldn't propagate as much". Not air-tight, and also I knew the stylized fact before I heard the argument so it's a bit unfair, but I think it's pretty good as it goes.
