Developmental Interpretability

Developmental interpretability is a new alignment research agenda that studies how structure forms in neural networks.

It is grounded in Singular Learning Theory (SLT) and aims to build tools for detecting, locating, interpreting, and preventing the sudden onset of (dangerous) capabilities.

Towards Developmental Interpretability

Jesse Hoogland, Alexander Gietelink Oldenziel, Daniel Murfet, Stan van Wingerden