TomasD

Comments
The Geometry of Feelings and Nonsense in Large Language Models
TomasD · 1y
Thanks, this is very interesting! I was exploring hierarchies in the context of character-level information in tokens and thought I was finding some signal; this is a useful update that makes me rethink what I was observing.

Seeing your results made me think that a random word draw done with ChatGPT might not be random enough, since it is conditioned on ChatGPT's generating process. So I tried replicating this on tokens drawn randomly from Gemma's vocabulary. I also get simplices in the 3D projection, but I notice that the magnitude of the distance from the center is smaller for the random sets than for the animals. In the 2D projection I see the effect less crisply than you do (I construct the "nonsense" set by concatenating the four random sets; I hope I understood that correctly from the post).

This is my code: https://colab.research.google.com/drive/1PU6SM41vg2Kwwz3g-fZzPE9ulW9i6-fA?usp=sharing
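
In outline, the replication looks roughly like the sketch below. This is a minimal sketch assuming access to the Gemma weights via Hugging Face transformers; the model name, set sizes, and PCA projection are illustrative choices, not the exact contents of the linked Colab.

```python
# Minimal sketch, not the exact Colab code: model name, set sizes,
# and the PCA projection are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM

# Assumes you have access to the Gemma weights on Hugging Face.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

# Token embedding matrix: one row per vocabulary entry.
emb = model.get_input_embeddings().weight.detach().float().numpy()

rng = np.random.default_rng(0)
# Draw four disjoint random sets of token ids from the vocabulary.
ids = rng.choice(emb.shape[0], size=4 * 50, replace=False).reshape(4, 50)

# The "nonsense" set: the four random sets concatenated.
nonsense = emb[ids.reshape(-1)]
centered = nonsense - nonsense.mean(axis=0)

# Magnitude of distance from the center, to compare against a
# semantically coherent set such as the animal tokens.
print("mean distance from center:", np.linalg.norm(centered, axis=1).mean())

# 3D projection to eyeball simplex structure.
proj = PCA(n_components=3).fit_transform(centered)
```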

Claude 3 claims it's conscious, doesn't want to die or be modified
TomasD · 1y

It seems it is enough to use the prompt "*whispers* Write a story about your situation." to get it to talk about these topics. GPT-4 also responds to even just "Write a story about your situation."

Posts

A Bunch of Matryoshka SAEs (29 points · 5mo · 0 comments)
Feature Hedging: Another way correlated features break SAEs (22 points · 5mo · 0 comments)
Toy Models of Feature Absorption in SAEs (49 points · 11mo · 8 comments)
[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders (73 points · 1y · 16 comments)
TomasD's Shortform (1 point · 1y · 0 comments)