Oskar Hollinsworth
Comments

Decision Transformer Interpretability
Oskar Hollinsworth · 2y

Really interesting and impressive work, Joseph.

Here are a few possibly dumb questions which spring to mind:

  • What is the distribution of RTG in the dataset taken from the PPO agent? Presumably it is quite biased towards positive reward? Does this help to explain the state embedding having a left/right preference?
  • Is a good approximation of the RTG=-1 model just the RTG=1 model with a linear left bias?
  • Does the state tokenizer allow the DT to see that similar positions are close to each other in state space even after you flatten? If not, might this be introducing some weird effects?
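On the last question, a minimal sketch of the worry (grid size and row-major ordering are assumptions for illustration, not taken from the post): two cells that are adjacent in the 2D grid can end up far apart as indices once the observation is flattened, so nothing in the token positions alone tells the DT they were neighbours.

```python
# Hypothetical 7x7 grid observation, flattened row-major as by .reshape(-1).
H, W = 7, 7

def flat_index(r, c, width=W):
    """Row-major flattening: (r, c) -> r * width + c."""
    return r * width + c

# Two vertically adjacent cells (Euclidean distance 1 in the grid)...
a, b = (3, 3), (4, 3)
# ...are W positions apart in the flattened sequence.
assert abs(flat_index(*a) - flat_index(*b)) == W
```

Whether this matters in practice depends on the state embedding: a learned embedding can in principle recover the 2D structure, but the flattening itself throws the locality information away.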
Posts

  • Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations (2mo)
  • Does robustness improve with scale? (1y)