jacek

Characterizing stable regions in the residual stream of LLMs

by Jett Janiak, jacek, Chatrik, Giorgi Giglemiani, nlpet, and StefanHex

This research was completed for London AI Safety Research (LASR) Labs 2024. The team was supervised by @Stefan Heimersheim (Apollo Research). Find out more about the program and express interest in upcoming iterations here. This video is a short overview of the project presented on the final day of the...

Sep 26, 2024

Goodhart's Law in Reinforcement Learning

Characterizing stable regions in the residual stream of LLMs

Categorical-measure-theoretic approach to optimal policies tending to seek power

(My) self-referential reason to believe in free will

A warm-up for the AI governance project