Zac Hatfield-Dodds

Technical staff at Anthropic, previously #3ainstitute; interdisciplinary, interested in everything; ongoing PhD in CS (learning / testing / verification), open sourcerer, more at zhd.dev

Posts

Sorted by New

2Zac Hatfield Dodds's Shortform

Ω

3y

Ω

3

117Simple probes can catch sleeper agents

Ω

6d

Ω

14

11Third-party testing as a key ingredient of AI policy

Ω

1mo

Ω

1

85Dario Amodei’s prepared remarks from the UK AI Safety Summit, on Anthropic’s Responsible Scaling Policy

Ω

6mo

Ω

1

286Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Ω

6mo

Ω

21

90Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust

Ω

7mo

Ω

23

181Anthropic's Core Views on AI Safety

Ω

1y

Ω

39

101Concrete Reasons for Hope about AI

Ω

1y

Ω

13

35In Defence of Spock

3y

5

2Zac Hatfield Dodds's Shortform

Ω

3y

Ω

3

Wiki Contributions

Comments

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Zac Hatfield-Dodds4d20

It's a sparse autoencoder because part of the loss function is an L1 penalty encouraging sparsity in the hidden layer. Otherwise, it would indeed learn a simple identity map!

Reply

Scenario planning for AI x-risk

Zac Hatfield-Dodds6d20

Tom Davidson's work on a compute-centric framework for takeoff speed is excellent, IMO.

Reply

What is the best way to talk about probabilities you expect to change with evidence/experiments?

Zac Hatfield-Dodds9d20

you CAN predict that there will be evidence with equal probability of each direction.

More precisely the expected value of upwards and downwards updates should be the same; it's nonetheless possible to be very confident that you'll update in a particular direction - offset by a much larger and proportionately less likely update in the other.

For example, I have some chance of winning. lottery this year, not much lower than if I actually bought a ticket. I'm very confident that each day I'll give somewhat lower odds (as there's less time remaining), but being credibly informed that I've won would radically change the odds such that the expectation balances out.

Reply

Tamsin Leake's Shortform

Zac Hatfield-Dodds1moΩ92611

I agree that there's no substitute for thinking about this for yourself, but I think that morally or socially counting "spending thousands of dollars on yourself, an AI researcher" as a donation would be an apalling norm. There are already far too many unmanaged conflicts of interest and trust-me-it's-good funding arrangements in this space for me, and I think it leads to poor epistemic norms as well as social and organizational dysfunction. I think it's very easy for donating to people or organizations in your social circle to have substantial negative expected value.

I'm glad that funding for AI safety projects exists, but the >10% of my income I donate will continue going to GiveWell.

Reply

2

1

'Empiricism!' as Anti-Epistemology

Zac Hatfield-Dodds2mo62

Trivially true to the extent that you are about equally likely to observe a thing throughout that timespan; and the Lindy Effect is at least regularly talked of.

But there are classes of observations for which this is systematically wrong: for example, most people who see a ship part-way through a voyage will do so while it's either departing or arriving in port. Investment schemes are just such a class, because markets are usually up to the task of consuming alpha and tend to be better when the idea is widely known - even Buffett's returns have oscillated around the index over the last few years!

Reply

Is anyone working on formally verified AI toolchains?

Answer by Zac Hatfield-DoddsMar 13, 2024122

Safety properties aren't the kind of properties you can prove; they're statements about the world, not about mathematical objects. I very strongly encourage anyone reading this comment to go read Leveson's Engineering a Safer World (free pdf from author) through to the end of chapter three - it's the best introduction to systems safety that I know of and a standard reference for anyone working with life-critical systems. how.complexsystems.fail is the short-and-quotable catechism.

I'm not really sure what you mean by "AI toolchain", nor what threat model would have a race-condition present an existential risk. More generally, formal verification is a research topic - there's some neat demonstration systems and they're used in certain niches with relatively small amounts of code and compute, simple hardware, and where high development times are acceptable. None of those are true of AI systems, or even libraries such as Pytorch.

For flavor, some of the most exciting developments in formal methods: I expect the Lean FRO to improve usability, and 'autoformalization' tricks like Proofster (pdf) might also help - but it's still niche, and "proven correct" software can still have bugs from under-specified components, incorrect axioms, or outright hardware issues (e.g. Spectre, Rowhammer, cosmic rays, etc.). The seL4 microkernel is great, but you still have to supply an operating system and application layer, and then ensure the composition is still safe. To test an entire application stack, I'd instead turn to Antithesis, which is amazing so long as you can run everything in an x86 hypervisor (with no GPUs).

(as always, opinions my own)

Reply

1

philh's Shortform

Zac Hatfield-Dodds2mo20

I think he's actually quite confused here - I imagine saying

Hang on - you say that (a) we can think, and (b) we are the instantiations of any number of computer programs. Wouldn't instantiating one of those programs be a sufficient condition of understanding? Surely if two things are isomorphic even in their implementation, either both can think, or neither.

(the Turing test suggests 'indistinguishable in input/output behaviour', which I think is much too weak)

Reply

Decaeneus's Shortform

Zac Hatfield-Dodds2mo20

See e.g. https://mschloegel.me/paper/schloegel2024sokfuzzevals.pdf

Fuzzing is a generally pretty healthy subfield, but even there most peer-reviewed papers in top venues are still are completely useless! Importantly, "a 'working' github repo" is really not enough to ensure that your results are reproducible, let alone ensure external validity.

Reply

Cryonics p(success) estimates are only weakly associated with interest in pursuing cryonics in the LW 2023 Survey

Zac Hatfield-Dodds2mo40

people’s subjective probability of successful restoration to life in the future, conditional on there not being a global catastrophe destroying civilization before then. This is also known as p(success).

This definition seems relevantly modified by the conditional!

You also seem to be assuming that "probability of revival" could be a monocausal explanation for cryonics interest, but I find that implausible ex ante. Monocausality approximately doesn't exist, and "is being revived good in expectation / good with what probability" are also common concerns. (CF)

Reply

Decaeneus's Shortform

Zac Hatfield-Dodds2mo20

Very little, because most CS experiments are not in fact replicable (and that's usually only one of several serious methodological problems).

CS does seem somewhat ahead of other fields I've worked in, but I'd attribute that to the mostly-separate open source community rather than academia per se.

Reply