## LESSWRONG

Zac Hatfield-Dodds

Researcher at Anthropic, previously #3ainstitute; interdisciplinary, interested in everything; ongoing PhD in CS (learning / testing / verification), open sourcerer, more at zhd.dev



MikkW's Shortform

Interesting question! It turns out that the Canadians are checking whether there's enough light to grow tomatoes on Mars.

Apparently mean insolation at the equator of Mars is about equal to that at 75° latitude on Earth, well inside the (ant)arctic circle... and while Earth has winters where the sun is fully below the horizon, Mars has weeks-to-months-long dust storms which block out most light.

So it's probably a wash; Antarctica is at least not much worse than Mars for light, while retaining all the other advantages of Earth like "air" and "water" and "accessibility".
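The insolation comparison can be sanity-checked with a back-of-envelope calculation. This is my own sketch, not the Canadian study's model: it uses a standard daily-mean top-of-atmosphere insolation formula with a sinusoidal declination approximation, and it ignores orbital eccentricity, atmospheric absorption, and dust (all of which make Mars look worse, not better).

```python
import math

def annual_mean_insolation(lat_deg, solar_const, obliquity_deg, n_days=360):
    """Annual-mean top-of-atmosphere insolation (W/m^2) on a horizontal
    surface, ignoring eccentricity, atmosphere, and dust."""
    lat = math.radians(lat_deg)
    total = 0.0
    for d in range(n_days):
        # Sinusoidal approximation of solar declination over one orbit.
        decl = math.radians(obliquity_deg) * math.sin(2 * math.pi * d / n_days)
        # Hour angle at sunset; clamping handles polar day / polar night.
        x = -math.tan(lat) * math.tan(decl)
        h0 = math.acos(max(-1.0, min(1.0, x)))
        total += (solar_const / math.pi) * (
            h0 * math.sin(lat) * math.sin(decl)
            + math.cos(lat) * math.cos(decl) * math.sin(h0)
        )
    return total / n_days

# Solar "constants" scale with 1/distance^2: ~1361 W/m^2 at Earth (1 AU),
# ~586 W/m^2 at Mars's mean distance of 1.524 AU.
mars_equator = annual_mean_insolation(0, 586, 25.2)    # Mars obliquity ~25.2 deg
earth_75 = annual_mean_insolation(75, 1361, 23.4)      # Earth obliquity ~23.4 deg
print(round(mars_equator), round(earth_75))  # the two come out within ~10%
```

Both figures land in the neighborhood of 180 W/m², consistent with the "Mars equator ≈ Earth at 75°" claim before any dust storms are accounted for.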

Latacora might be of interest to some AI Safety organizations

DeepMind specifically has Google's security people on call, which is to say the best that money can buy. For others, well, AI Safety Needs Great Engineers and Anthropic is hiring, including for security.

(opinions my own, you know the drill)

AI Safety Needs Great Engineers

EleutherAI has a whole project board dedicated to open-source ML, both replicating published papers and doing new research on safety and interpretability.

(opinions my own, etc)

Some real examples of gradient hacking

Probably since prehistory and certainly since antiquity we've had some 'mesa'/'runtime' understanding of heritability, in contrast to (presumably) all other animals.

No, not so much. See e.g. https://www.gwern.net/reviews/Bakewell

Like anything else, the idea of “breeding” had to be invented. That traits are genetically-influenced broadly equally by both parents subject to considerable randomness and can be selected for over many generations to create large average population-wide increases had to be discovered the hard way, with many wildly wrong theories discarded along the way. Animal breeding is a case in point, as reviewed by an intellectual history of animal breeding, Like Engend’ring Like, which covers mistaken theories of conception & inheritance from the ancient Greeks to perhaps the first truly successful modern animal breeder, Robert Bakewell (1725–1795).

Why did it take thousands of years to begin developing useful animal breeding techniques, a topic of interest to almost all farmers everywhere, a field which has no prerequisites such as advanced mathematics or special chemicals or mechanical tools, and seemingly requires only close observation and patience? ... What is most interesting is the intellectual history we can extract from it in terms of inventing heritability and as important, one of the inventions of progress in the gradual realization that selective breeding was even possible.

"Acquisition of Chess Knowledge in AlphaZero": probing AZ over time

I enjoyed the whole paper! It's just that "read sections 1 through 8" doesn't reduce the length much, and 5-6 have some nice short results that can be read alone :-)

"Acquisition of Chess Knowledge in AlphaZero": probing AZ over time

The paper is really only 28 pages plus lots of graphs in the appendices! If you want to skim, I'd suggest just reading the abstract and then sections 5 and 6 (pp. 16–21). But to summarize:

• Do neural networks learn the same concepts as humans, or at least human-legible concepts? A "yes" would be good news for interpretability (and alignment). Let's investigate AlphaZero and Chess as a case study!
• Yes, over the course of training AlphaZero learns many concepts (and develops behaviours) which have clear correspondence with human concepts.
• Low-level / ground-up interpretability seems very useful here. Learned summaries are also great for chess but rely on a strong ground-truth (e.g. "Stockfish internals").
• Details about where in the network and when in the training process things are represented and learned.

The analysis of differences in timing and order between human chess scholarship and AlphaZero training is pretty cool if you play chess: e.g. human experts have diversified their openings (not just 1.e4) since 1700, while AlphaZero narrows down from random play to pretty much the modern distribution over GM openings; and AlphaZero tends to learn material values before positional concepts and standard openings.
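The concept-probing methodology behind results like these can be illustrated with a minimal sketch. To be clear, this is a generic linear probe on synthetic data, not the paper's code or real AlphaZero activations: the concept name, dimensions, and training loop are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend network activations: 1000 chess positions, 256-dim hidden vectors.
acts = rng.normal(size=(1000, 256))
# Pretend human-concept labels (e.g. "has_mate_threat"), here linearly
# encoded in the activations plus noise -- the hypothesis a probe tests.
true_dir = rng.normal(size=256)
labels = (acts @ true_dir + rng.normal(scale=0.5, size=1000)) > 0

# A linear probe: logistic regression fit by plain gradient descent.
w = np.zeros(256)
for _ in range(500):
    z = np.clip(acts @ w, -30, 30)          # clip for numerical stability
    p = 1 / (1 + np.exp(-z))                # predicted probability of concept
    w -= 0.1 * acts.T @ (p - labels) / len(labels)

acc = np.mean((acts @ w > 0) == labels)
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy is evidence that the concept is linearly decodable from that layer; the paper runs this kind of test across layers and training checkpoints to see where and when concepts appear.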

Discussion with Eliezer Yudkowsky on AGI interventions

Sure, just remember that an experimental demonstration isn't enough - "Your proof must not include executing the model, nor equivalent computations".

Discussion with Eliezer Yudkowsky on AGI interventions

On a quick skim it looks like that fails both "not equivalent to executing the model" and the float32 vs ℝ problem.

It's a nice approach, but I'd also be surprised if it scales to maintain tight bounds on much larger networks.

Discussion with Eliezer Yudkowsky on AGI interventions

Ah, crux: I do think the floating-point matters! Issues of precision, underflow, overflow, and NaNs bedevil model training and occasionally deployment-time behavior. By analogy, if we deploy an AGI whose ideal mathematical form is aligned, we may still be doomed, even if it's plausibly our best option in expectation.
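Concretely, float32 arithmetic isn't even associative, so a proof over the reals need not transfer to the deployed computation. A textbook example (my illustration, not from the discussion):

```python
import numpy as np

# Over the reals, (a + b) + c == a + (b + c). Not so in float32:
a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(0.25)

left = (a + b) + c   # (1e8 - 1e8) + 0.25 -> exactly 0.25
right = a + (b + c)  # 0.25 is absorbed: -1e8 + 0.25 rounds back to -1e8
print(left, right)   # 0.25 vs 0.0
```

The spacing between adjacent float32 values near 1e8 is 8, so adding 0.25 to -1e8 is a no-op; the same rounding effects are why reordering reductions (e.g. across GPU threads) changes model outputs.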

Checkable meaning that I or someone I trust with this has to be able to check it! Maxwell's proposal is simple enough that I can reason through the whole thing, even over float32 rather than ℝ, but for more complex arguments I'd probably want it machine-checkable for at least the tricky numeric parts.