LESSWRONG

Developmental Interpretability

Jul 03, 2023 by Jesse Hoogland

Developmental interpretability is a new alignment research agenda that studies how structure forms in neural networks.

Developmental interpretability is grounded in Singular Learning Theory (SLT) and aims to build tools for detecting, locating, interpreting, and preventing the sudden onset of (dangerous) capabilities.
Towards Developmental Interpretability
Jesse Hoogland, Alexander Gietelink Oldenziel, Daniel Murfet, Stan van Wingerden