Video/animation: Neel Nanda explains what mechanistic interpretability is

DanielFilan


1 min read · 22nd Feb 2023 · 7 comments


Tags: AXRP, Interpretability (ML & AI), AI


This is a linkpost for https://youtu.be/sISodZSxNvc

Nice little video: the audio is Neel Nanda explaining what mechanistic interpretability is and why he does it, illustrated by the illustrious Hamish Doodles. Excerpted from the AXRP episode.

(It's not technically animation, I think, but I don't know what other single word to use for "pictures that move a bit and change".)


7 comments, sorted by top scoring

Sheikh Abdur Raheem Ali (1y)

There's lots of alpha in AI research distillers learning motion-canvas (motion-canvas/motion-canvas on GitHub: "Visualize Complex Ideas Programmatically") and making explainers.

adzcai (1y)

Or even better, finetuning an LLM to automate writing the code!

the gears to ascension (1y)

cyborgism, activate!

just don't use an overly large model.

the gears to ascension (1y)

For those reading (I imagine Sheikh knows about these already), some videos from the creator of that library:

novalinium (1y)

A single word for this would be an animatic, probably.

DanielFilan (1y)

I kinda guess that most people don't know what that means.

TinkerBird (1y)

Here's a dumb idea: if you have a misaligned AGI, can you keep it inside a box and have it teach you some things about alignment, perhaps through some creative lies?
