Thomas Kwa

Doing alignment research with Vivek Hebbar's team at MIRI as well as independent projects.

Sequences

Catastrophic Regressional Goodhart

Comments

I didn't hit disagree, but IMO there are way more than "few research directions" that can be accessed without cutting-edge models, especially with all the new open-source LLMs.

  • All conceptual work: agent foundations, mechanistic anomaly detection, etc.
  • Mechanistic interpretability, which, interpreted broadly, could be 40% of empirical alignment work
  • Model control like the nascent area of activation additions

I've heard that evals, debate, prosaic work on honesty, and various other schemes need cutting-edge models, but in the past few weeks, as I've transitioned from mostly conceptual work to empirical work, I have found far more questions than I have time to answer using GPT-2- or AlphaStar-sized models. If alignment is hard, we'll want to understand the small models first.

Proposed exercise: write down 5 other ways the AI could manage to robustly survive.

The bounty remains open, but I'm no longer excited about this, for three reasons:

  • lack of evidence that glowfic is an important positive influence on rationality
  • Eliezer is now speaking in the public sphere (some would argue too much)
  • the generally increasing quality and decreasing weirdness of alignment research

Thanks, I agree. I would still make the weaker claim that more than half the people in alignment are very unlikely to change their career prioritization from Street Epistemology-style conversations, and that in general the person with more information / prior exposure to the arguments will be less likely to change their mind.

How do you think "agent" should be defined?

It's not just his fiction. Recently he went on what he thought was a low-stakes crypto podcast and was surprised that the hosts wanted to actually hear him out when he said we were all going to die soon:

I don't think we can take this as evidence that Yudkowsky or the average rationalist "underestimates more average people". On the Bankless podcast, Eliezer was not trying to do anything like exploring the beliefs of the podcast hosts; he was just explaining his views. And there have been attempts at outreach before. If Bankless was evidence that the world at large is interested in Eliezer's ideas and takes them seriously, then The Alignment Problem, Human Compatible, and the rejection of FDT by academic decision theory journals are stronger evidence against. It seems to me that the lesson we should draw is that alignment's time in the public consciousness arrived sometime in the last ~6 months.

I'm also not sure the techniques are asymmetric.

  • Have people with false beliefs tried e.g. Street Epistemology and found it to fail?
  • I think few of us in the alignment community are actually in a position to change our minds about whether alignment is worth working on. With a p(doom) of ~35%, I think it's unlikely that arguments alone push me below the ~5% threshold where working on AI misuse, biosecurity, etc. becomes competitive with alignment. And there are people with p(doom) of >85%.

That said, it seems likely that rationalists should be incredibly embarrassed for not realizing the potential of asymmetric weapons like Street Epistemology. I'd make a Manifold market for it, but I can't think of a good operationalization.

Prediction market for whether someone will strengthen our results or prove something about the nonindependent case:

https://manifold.markets/ThomasKwa/will-someone-strengthen-our-goodhar?r=VGhvbWFzS3dh

Downvoted: this is very far from a well-structured argument, and it doesn't give me intuitions I can trust either.

I'm fairly sure you can get a result something like "it's not necessary to put positive probability mass on two different functions that can't be distinguished by observing only s bits", so some functions can get zero probability, e.g. the XOR of any combination of at least s+1 bits.

edit: The proof is easy. Let f and g be two such indistinguishable functions that you place positive probability on, F be a random variable for the function, and F' be F but with all probability mass for g replaced by f. Then P(O | F = f) = P(O | F = g), so switching from F to F' changes neither the marginal P(O) nor the conditional entropy H(O | F). But this means I(F'; O) = H(O) − H(O | F') = I(F; O), and so F' does just as well. You don't lose any channel capacity switching to F'.
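A quick numerical sanity check of the mass-merging step, as a sketch in plain Python (the 3-hypothesis setup, the `mutual_info` helper, and the specific probabilities are my own illustration, not from the original argument): when two hypotheses induce identical observation distributions, moving all of one's probability mass onto the other leaves the mutual information between function and observation unchanged.

```python
import math

def mutual_info(prior, likelihood):
    """I(F; O) in bits, given prior P(F) and rows likelihood[i] = P(O | F = i)."""
    n_obs = len(likelihood[0])
    # Marginal P(O) by summing over hypotheses.
    p_o = [sum(prior[i] * likelihood[i][o] for i in range(len(prior)))
           for o in range(n_obs)]
    mi = 0.0
    for i, p_f in enumerate(prior):
        for o in range(n_obs):
            p_joint = p_f * likelihood[i][o]
            if p_joint > 0:  # zero-mass terms contribute nothing
                mi += p_joint * math.log2(p_joint / (p_f * p_o[o]))
    return mi

# Three hypotheses: f and g induce the *same* observation distribution
# (i.e. they are indistinguishable from observations); h differs.
like = [[0.5, 0.5],   # f
        [0.5, 0.5],   # g (identical row to f)
        [0.9, 0.1]]   # h
prior  = [0.3, 0.3, 0.4]   # F: positive mass on both f and g
merged = [0.6, 0.0, 0.4]   # F': all of g's mass moved onto f

# Merging indistinguishable hypotheses preserves I(F; O).
assert abs(mutual_info(prior, like) - mutual_info(merged, like)) < 1e-12
```

The equality holds because P(O) is unchanged (the merged rows are identical) and H(O | F) is a mass-weighted average of per-row entropies, which is also unchanged when mass moves between rows with equal entropy.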

  • Deep deceptiveness is not quite self-deception. I agree that there are some circumstances where defending against self-deception favors weight-based methods, but these seem uncommon.
  • I thought briefly about the Ilharco et al. paper and am very impressed by it as well.
  • Thanks for linking to the resources.

I don't have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems to favor activation vectors, though I have reasonably high uncertainty.
