Comments

Ran W

This reminds me of the paper Chris linked as well. I think there's very solid evidence for the relationship between the kind of meta-learning Transformers go through and Bayesian inference (e.g., see this, this, and this). The main question I've been thinking about is: what is a state for language, and how could it be useful if it were discovered in this way? For state-based RL/control tasks this seems relatively straightforward (e.g., see this and this), but it is much less clear for more abstract tasks. It'd be great to hear your thoughts!
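
To make the Bayesian-inference framing concrete, here is a toy sketch (my own illustration, not code from the linked papers): for i.i.d. Bernoulli sequences with a Beta prior over the coin bias, the Bayes-optimal in-context predictor is just the Beta-Bernoulli posterior predictive, and the claim in this line of work is that a meta-trained Transformer's in-context predictions approximate this kind of predictor.

```python
import numpy as np

# Toy illustration (my own example, not from the cited papers): for i.i.d.
# Bernoulli tokens with a Beta(a, b) prior over the coin bias, the
# Bayes-optimal in-context predictor is the Beta-Bernoulli posterior
# predictive. A meta-trained Transformer conditioned on the same context
# would be expected to approximate this quantity.

def bayes_posterior_predictive(context, a=1.0, b=1.0):
    """P(next token = 1 | context) under a Beta(a, b) prior on the bias."""
    heads = sum(context)
    return (a + heads) / (a + b + len(context))

rng = np.random.default_rng(0)
true_bias = 0.7
context = (rng.random(20) < true_bias).astype(int).tolist()
print(bayes_posterior_predictive(context))  # close to the posterior mean of the bias
```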

Ran W

Hi, thanks for sharing this interesting perspective on RL as a training process! Although it may seem to be only a matter of seeking vs. obeying and reward vs. cost, the effect on the reader's mind seems to be huge!

One thing that seems to be happening here, and that I have not fully digested, is the "intrinsicness" of rewards. In frameworks parallel to mainstream RL, such as active inference and the free energy principle, the policy is part of the agent's model, such that the agent "self-organizes" toward a characteristic state of the world. The policy can be constructed through reward or otherwise. However, the active inference literature has not settled how policies are constructed in real agents (discussions exist but don't close the case).
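
For concreteness, here is my paraphrase of one common formulation from the active inference literature (not anything from the post): the posterior over policies comes from the generative model itself, by scoring each policy with its expected free energy, rather than from an externally supplied reward signal.

$$
G(\pi) = \sum_{\tau} \mathbb{E}_{q(o_\tau, s_\tau \mid \pi)}\!\big[\ln q(s_\tau \mid \pi) - \ln p(o_\tau, s_\tau)\big],
\qquad
q(\pi) \propto \exp\!\big(-G(\pi)\big).
$$

Preferences enter only through the prior over observations inside the joint $p(o_\tau, s_\tau)$, which is what makes the "reward" intrinsic to the model rather than an external signal.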

How does this intrinsic perspective relate to the post, and to safety and alignment? I am still thinking about it. If you have any thoughts, please share!