This is a linkpost for https://youtu.be/sISodZSxNvc

Nice little video: the audio is Neel Nanda explaining what mechanistic interpretability is and why he does it, illustrated by the illustrious Hamish Doodles. Excerpted from the AXRP episode.

(It's not technically animation I think, but I don't know what other single word to use for "pictures that move a bit and change")

7 comments

Or even better, finetuning an LLM to automate writing the code!

cyborgism, activate!

just don't use an overly large model.

For those reading (I imagine Sheikh knows about these already), some videos from the creator of that library:

A single word for this would be an animatic, probably.

I kinda guess that most people don't know what that means.

Here's a dumb idea: if you have a misaligned AGI, can you keep it inside a box and have it teach you some things about alignment, perhaps through some creative lies?