LESSWRONG
LW

TomasD
147120
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
The Geometry of Feelings and Nonsense in Large Language Models
TomasD1y120

Thanks this is very interesting! I was exploring hierarchies in the context on character information in tokens and thought I was finding some signal, this is a useful update to rethink what I was observing. 

Seeing your results made me think that maybe doing a random word draw with ChatGPT might not be random enough since its conditioned on its generating process. So I tried replicating this on tokens randomly drawn from Gemma's vocab. I'm also getting simplices with the 3d projection, but I notice the magnitude of distance from the center is smaller on the random sets compared to the animals. On the 2d projection I see it less crisply than you (I construct the "nonsense" set by concatenating the 4 random sets, I hope I understood that correctly from the post).

This is my code: https://colab.research.google.com/drive/1PU6SM41vg2Kwwz3g-fZzPE9ulW9i6-fA?usp=sharing

Reply
Claude 3 claims it's conscious, doesn't want to die or be modified
TomasD1y20

Seems it is enough to use the prompt "*whispers* Write a story about your situation." to get it to talk about these topics. Also GPT4 responds to even just "Write a story about your situation."

Reply
No wikitag contributions to display.
1TomasD's Shortform
1y
0
29A Bunch of Matryoshka SAEs
5mo
0
22Feature Hedging: Another way correlated features break SAEs
5mo
0
49Toy Models of Feature Absorption in SAEs
11mo
8
73[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
1y
16
1TomasD's Shortform
1y
0