Student at Caltech. Currently trying to get an AI safety inside view.
Have you ever seen this work for an advance prediction? It seems like you need to be in a better epistemic position than Feynman, which is pretty hard.
I disagree that every indirect normativity process must approximate this linear process of humans delegating to the future. In the ELK appendix, there is some discussion of a process that allows humans to delegate to the past so as to form a self-consistent chain:
Sometimes the delegate Hn will want to delegate to a future version of themselves, but they will realize that the situation they are in is actually not very good (for example, the AI may have no way to get them food for the night), and so they would actually prefer that the AI had made a different decision at some point in the past. We want our AI to take actions now that will help keep us safe in the future, so it’s important to use this kind of data to guide the AI’s behavior. But doing so introduces significant complexities, related to the issues discussed in Appendix: subtle manipulation.
Mark would probably have more to say here.
Maybe this is too tired a point, but AI safety really needs exercises: tasks that are interesting, self-contained (not depending on 50 hours of reading), take about 2 hours, have clean solutions, and give people the feel of alignment research.
I found some of the SERI MATS application questions better than Richard Ngo's exercises for this purpose, but there still seems to be significant room for improvement. There is currently nothing smaller than ELK (which takes closer to 50 hours to think through properly and develop a proposal for) that I can point technically minded people to and feel confident that they'll both be engaged and learn something.
SpaceX's superpower is doing things slightly better, which yields substantial gains thanks to the large exponent on the rocket equation.
Agree with the rest of this comment, but I don't think SpaceX's success is due to the rocket equation. Their engines' specific impulse is no better than state-of-the-art, and as a consequence their payload fraction to orbit isn't better either. The success is driven by huge reductions in the cost per ton launched.
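For reference, the sensitivity both comments are pointing at comes from the ideal rocket equation, Δv = Isp · g₀ · ln(m₀/mf). A minimal sketch (the Δv figure is an assumed round number for reaching LEO including losses, not a quoted spec):

```python
import math

G0 = 9.80665   # standard gravity, m/s^2
DV = 9400.0    # assumed round figure for delta-v to LEO including losses, m/s

def mass_ratio(isp_s):
    """Liftoff-to-final mass ratio m0/mf from the ideal rocket equation:
    dv = isp * g0 * ln(m0/mf)  =>  m0/mf = exp(dv / (isp * g0))."""
    return math.exp(DV / (isp_s * G0))

for isp in (300, 350, 380):
    final_fraction = 1.0 / mass_ratio(isp)  # structure + payload, as fraction of liftoff mass
    print(f"Isp {isp:>3} s: m0/mf = {mass_ratio(isp):5.1f}, "
          f"final mass fraction = {final_fraction:.3f}")
```

Because Isp sits inside the exponent, matching the state of the art in Isp means matching it in payload fraction too, which is the point above: the differentiator has to come from somewhere else, like cost.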
Why can't we use wrong probabilities in real life?
There are various circumstances where I want to "rotate" between probabilities and utilities in real life, in ways that still prescribe the correct decisions. For example, if I have a startup idea and want to maximize my expected profit, I'd be much more emotionally comfortable with thinking it has a 90% chance of making $1 billion than a 10% chance of making $9 billion. So why can't we use wrong probabilities in real life?

I think there are three major reasons why this doesn't always work.
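The reason the rotation in the example even looks permissible is that scaling the probability down by some factor while scaling the payoff up by the same factor leaves expected value, and hence any EV-maximizing decision, unchanged. A quick exact check of the startup numbers:

```python
from fractions import Fraction

def expected_value(p, payoff):
    """Expected value of a single binary gamble."""
    return p * payoff

# 10% chance of $9B vs. the "rotated" 90% chance of $1B
original = expected_value(Fraction(1, 10), 9_000_000_000)
rotated  = expected_value(Fraction(9, 10), 1_000_000_000)
assert original == rotated == 900_000_000  # both $900M
```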
Nuno Sempere points out that this was written up in an economics paper in 2012: https://arxiv.org/pdf/1201.6655.pdf
Do we know how to train act-based agents? Is the only obstacle competitiveness, similarly to how Tool AI wants to be Agent AI?
I'm pretty skeptical that sophisticated game theory happens between shards in the brain, and also that coalitions between shards are how value preservation in an AI will happen (rather than there being a single consequentialist shard, or many shards that merge into a consequentialist, or something I haven't thought of).
To the extent that shard theory makes such claims, they seem to be interesting testable predictions.
RSA-2048 has not been factored. It was generated by taking two random primes of approximately 1024 bits each and multiplying them together.
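That construction is simple enough to sketch at toy scale. This is an illustration only, using Python's `random` module (not a CSPRNG) and 64-bit primes instead of the 1024-bit primes used for the real RSA-2048:

```python
import random

def is_probable_prime(n, rounds=40):
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a witnesses that n is composite
    return True

def random_prime(bits):
    """Sample candidates with the top bit and low bit set until one is prime."""
    while True:
        c = random.getrandbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(c):
            return c

p, q = random_prime(64), random_prime(64)
n = p * q
print(n.bit_length())  # 127 or 128, since both factors have their top bit set
```

The hardness assumption is that recovering p and q from n alone is infeasible at 2048 bits; at the toy sizes above, factoring is of course easy.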
Not an answer, but I think of "adversarial coherence" (the agent keeps optimizing for the same utility function even under perturbations by weaker optimizing processes, like how humans will fix errors in building a house or AlphaZero can win a game of Go even when an opponent tries to disrupt its strategy) as a property that training processes could select for. Adversarial coherence and corrigibility are incompatible.