Zac Hatfield-Dodds

Technical staff at Anthropic, previously #3ainstitute; interdisciplinary, interested in everything; half a PhD in CS (learning / testing / verification), open sourcerer, more at zhd.dev

Comments

Fixed, thanks; it links to the Transformer Circuits thread, which includes the induction heads paper, SoLU, and Toy Models of Superposition.

"Okay, you have found that one-layer transformer without MLP approximates skip-trigram statistics, how it generalizes to the question 'does GPT-6 want to kill us all?"?

I understand this is more an illustration than a question, but I'll try answering it anyway because I think there's something informative about different perspectives on the problem :-)

Skip-trigrams are a foundational piece of induction heads, which are themselves a key mechanism for in-context learning. A Mathematical Framework for Transformer Circuits was published less than a year ago; IMO subsequent progress is promising, and mechanistic interpretability has been picked up by independent researchers and other labs (e.g. Redwood's project on GPT-2-small).
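
(For anyone who hasn't read the paper: a skip-trigram is a pattern "[source] ... [destination] -> [next]", where the source token can sit anywhere earlier in the context. Here's a minimal Python sketch of the statistic being approximated - the toy tokenisation and counting scheme are my own illustration, not the paper's exact formulation.)

```python
from collections import Counter

def skip_trigram_counts(tokens):
    """Count skip-trigrams (source, destination, next) over a token sequence:
    the source may occur anywhere before the destination, and 'next' is the
    token immediately following the destination."""
    counts = Counter()
    for i in range(1, len(tokens) - 1):
        dest, nxt = tokens[i], tokens[i + 1]
        for src in set(tokens[:i]):  # attention can look back to any earlier token
            counts[(src, dest, nxt)] += 1
    return counts

counts = skip_trigram_counts("the cat sat on the mat".split())
print(counts[("cat", "the", "mat")])  # 1, i.e. "cat ... the -> mat"
```

Roughly speaking, a one-layer attention-only model can only implement a soft version of tables like this; composing heads across layers is what gets you induction heads and richer in-context learning.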

Of course the skip-trigram result isn't itself an answer to the question of whether some very capable ML system is planning to deceive the operator or seize power, but I claim it's analogous to a lemma in a paper which establishes a field, and that said field is one of our most important tools for x-risk mitigation. This was even our hope at the time, though I expected both the research and the field-building to go more slowly - actual events are something like a 90th-percentile outcome relative to my expectations in October 2021.[1]

Finally, while I deeply appreciate theoretical/conceptual research as a complement to empirical and applied research and want both, how on earth is either meant to help alone? If we get a conceptual breakthrough but don't know how to build - and verify that we've correctly built - the thing, we're still screwed; conversely, if we get really good at building stuff and verifying our expectations but don't anticipate some edge case like FDT-based cooperation, then we're still screwed. Efforts which integrate both at least have a chance, if nobody else does something stupid first.


  1. I still think it's pretty unlikely (credible interval 0--40%) that we'll have good enough interpretability tools by the time we really really need them, but I don't see any mutually-exclusive options which are better. ↩︎

Anthropic is still looking for a senior software engineer

As last time, Anthropic would like to hire many people for research, engineering, business, and operations roles. More on that here, or feel free to ask me :-)

Happily, the world's experts regularly compile consensus reports via the Intergovernmental Panel on Climate Change (IPCC) for the UN Framework Convention on Climate Change, and the Sixth Assessment Report (AR6) is currently being finalized. While the full report runs to many thousands of pages, each working group's "Summary for Policymakers" is an easy - sometimes boring - read at tens of pages.

  1. https://www.ipcc.ch/report/ar6/wg1/ - The Physical Science Basis
  2. https://www.ipcc.ch/report/ar6/wg2/ - Impacts, Adaptation and Vulnerability
  3. https://www.ipcc.ch/report/ar6/wg3/ - Mitigation of Climate Change

I think you're asking about wg2 and wg3, in which case reading the SPMs might be useful - there are about 60 pages of them once you skip the frontmatter etc. Most shorter answers are going to be at best true but kinda useless; "reduce emissions in the most cost-effective politically-feasible way, and take appropriate local action to adapt to changing conditions" would be my one-sentence attempt.

I'm not surprised that if you investigate two-to-six-layer transformers trained on context-free grammars, you find they learn something very much like a tree. I also don't expect this result to generalize to larger models or more complex tasks, and so personally I find the paper plausible but uninteresting.

Please note that the Inverse Scaling Prize is not from Anthropic:

The Inverse Scaling Prize is organized by a group of researchers on behalf of the Fund for Alignment Research (FAR), including Ian McKenzie, Alexander Lyzhov, Alicia Parrish, Ameya Prabhu, Aaron Mueller, Najoung Kim, Sam Bowman, and Ethan Perez. Additionally, Sam Bowman and Ethan Perez are affiliated with Anthropic; Alexander Lyzhov, Alicia Parrish, Ameya Prabhu, Aaron Mueller, Najoung Kim, and Sam Bowman are affiliated with New York University. The prize pool is provided by the Future Fund.

Anthropic has provided some models as a validation set, but Anthropic isn't even the most common affiliation among the organizers!

(if funding would get someone excited to do a great job of this, I'll help make that happen)

I'd be especially excited if this debate produced an adversarial-collaboration-style synthesis document, laying out the various perspectives and cruxes. I think that collapsing onto an optimism/pessimism binary loses a lot of important nuance; but also that HAIST reading, summarizing, and clearly communicating the range of views on RLHF could help people holding each of those views more clearly understand each other's concerns and communicate with each other.

No, this is also easy to work around; language models are good at deobfuscation and you could probably even do it with edit-distance techniques. Nor do you have enough volume of discussion to hide from humans literally just reading all of it; nor is Facebook secure against state actors, nor is your computer secure. See also Security Mindset and Ordinary Paranoia.
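
(To illustrate the edit-distance point: classic Levenshtein distance already places lightly-obfuscated text very close to the original, so simple fuzzy matching goes a long way before you even need a language model. This is my own toy sketch, not a description of any real monitoring pipeline.)

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb
            ))
        prev = curr
    return prev[-1]

# Light "leetspeak" obfuscation barely moves the needle:
print(levenshtein("secret plan", "s3cr3t pl4n"))  # 3
```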

Bluntly: if you write it on LessWrong or the Alignment Forum, or send it to a particular known person, governments will get a copy if they care to. Cybersecurity against state actors is really, really, really hard. LessWrong is not capable of state-level cyberdefense.

If you must write it at all: do so with hardware which has been rendered physically unable to connect to the internet, and distribute only on paper, discussing only in areas without microphones. Consider authoring only on paper in the first place. Note that physical compromise of your home, workplace, and hardware is also a threat in this scenario.

(I doubt they care much, but this is basically what it takes if they do. Fortunately I think LW posters are very unlikely to be working with such high-grade secrets.)
