Noa Nabeshima's Shortform

by Noa Nabeshima
15th Jan 2025
AI Alignment Forum
Noa Nabeshima · 4mo

While language models today are plausibly trained with amounts of FLOP comparable to a human lifetime (see the back-of-envelope sketch below the second list), here are some differences:

  • Humans process much less data
  • Humans spend much more compute per datapoint
  • Human data includes the actions humans take and the results of those actions; language-model pretraining data includes this much less.

These might explain some of the strengths and weaknesses of language models:

  • LMs know many more things than humans, but often in shallower ways.
  • LMs seem less sample-efficient than humans (they spend less compute per datapoint, and they haven't been heavily optimized for sample efficiency yet).
  • LMs are worse at taking actions over time than humans.
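
For concreteness, here is a rough back-of-envelope version of "comparable FLOP, much less data, much more compute per datapoint." Every number below is an order-of-magnitude assumption, not a measurement: the brain-compute figure follows Carlsmith-style estimates, and the token/word counts are guesses.

```python
# Back-of-envelope comparison; all inputs are order-of-magnitude assumptions.
BRAIN_FLOP_PER_S = 1e15                    # assumed brain compute, FLOP/s
SECONDS_TO_AGE_30 = 30 * 365 * 24 * 3600   # ~9.5e8 seconds
human_flop = BRAIN_FLOP_PER_S * SECONDS_TO_AGE_30   # ~1e24 FLOP

LM_TRAIN_FLOP = 2e25   # assumed frontier pretraining run, FLOP
LM_TOKENS = 1e13       # assumed pretraining tokens
HUMAN_WORDS = 1e9      # assumed words heard/read by age 30

print(f"human lifetime compute  ~{human_flop:.1e} FLOP")            # ~9.5e23
print(f"LM/human compute ratio  ~{LM_TRAIN_FLOP / human_flop:.0f}x")# ~21x
print(f"human FLOP per word     ~{human_flop / HUMAN_WORDS:.1e}")   # ~9.5e14
print(f"LM FLOP per token       ~{LM_TRAIN_FLOP / LM_TOKENS:.1e}")  # ~2.0e12
print(f"LM/human data ratio     ~{LM_TOKENS / HUMAN_WORDS:.0e}x")   # ~1e4x
```

Under these assumptions the compute totals land within an order of magnitude or two of each other, while the data gap (~10,000x more tokens for the LM) and the per-datapoint gap (~1,000x more FLOP per word for the human) are three to four orders of magnitude, matching the bullets above.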
Noa Nabeshima · 8mo (edited)

TinyModel SAEs have these first entity and second entity latents.

E.g. if the story is 'Once upon a time Tim met his friend Sally.', Tim is the first entity and Sally is the second entity. The latents fire on all instances of first|second entity after the first introduction of that entity.

I think I at one point found an 'object owned by second entity' latent but have had trouble finding it again.

I wonder if LMs are generating these reusable 'pointers' and then doing computation with the pointers. For example, to track that an object is owned by the first entity, you just need to calculate which tokens are instances of the first entity, calculate when the first entity is shown to own an object, write 'owned by first entity' to the object token, and then broadcast that forward to later instances of the object. Then, if you have the tokens Tim|'s (and 's has calculated that the first entity is immediately before it), 's can, with a single attention head, look for objects owned by the first entity.

This means that the exact identity information of the object (e.g. ' hammer') and the exact identity information of the first entity (' Tim') don't need to be passed around in computations, you can just do much cheaper pointer calculations and grab the relevant identity information when necessary.
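
Here is a minimal toy illustration of that story, with hand-written features rather than TinyModel's actual latents. The feature basis is an assumption for exposition: dim 0 = "is first entity", dim 1 = "is second entity", dim 2 = "owned by first entity", dims 3+ = token identity.

```python
import numpy as np

# Toy residual stream for: "Tim met Sally. Tim dropped his hammer. Tim's ..."
tokens = ["Tim", "met", "Sally", ".", "Tim", "dropped", "his",
          "hammer", ".", "Tim", "'s"]
d = 8
resid = np.zeros((len(tokens), d))

# Entity latents fire on every instance of the first/second entity.
for i, t in enumerate(tokens):
    if t == "Tim":
        resid[i, 0] = 1.0   # first-entity latent
    if t == "Sally":
        resid[i, 1] = 1.0   # second-entity latent

# When the first entity is shown to own " hammer", write the
# 'owned by first entity' flag to the object token (position 7),
# alongside its identity feature.
resid[7, 2] = 1.0   # owned-by-first-entity flag
resid[7, 3] = 1.0   # " hammer" identity

# The 's token has computed that the first entity is immediately before it,
# so a single attention head can query the owned-by-first-entity direction.
query = np.zeros(d)
query[2] = 1.0
scores = resid @ query                        # attention scores per position
attn = np.exp(scores) / np.exp(scores).sum()  # softmax
print(tokens[int(attn.argmax())])             # -> hammer
```

Note that the query never touches the ' Tim' or ' hammer' identity features: the head retrieves the object by its pointer flag alone, and the identity information comes along for free in the attended value.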

This suggests a more fine-grained story for what duplicate name heads are doing in IOI (indirect object identification).

Noa Nabeshima · 8mo

Auto-interp is currently really really bad

I think o1 is the only model that performs decently at auto-interp, but it's very expensive: roughly $1 per latent label, so labeling an SAE with tens of thousands of latents runs into tens of thousands of dollars. This is frustrating to me.

Noa Nabeshima · 8mo

One barrier to SAE circuits is that it's currently hard to understand how attention-out SAE latents are computed. Even if you use integrated-gradients (IG) attribution patching to estimate which earlier latents are relevant to an attention-out SAE latent, that doesn't tell you how those latents interact inside the attention layer at all.
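
To make the complaint concrete, here is a sketch of what IG attribution patching computes. The function names and the differentiable `target_fn` (mapping earlier-layer latent activations through the attention layer to one attention-out latent's activation) are hypothetical stand-ins, not a real library API.

```python
import torch

def ig_attribution(latents_clean, latents_corrupt, target_fn, steps=32):
    """Attribute a downstream target to each earlier latent by integrating
    gradients along the straight line from corrupt to clean activations."""
    total_grad = torch.zeros_like(latents_clean)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Interpolated point between the corrupt and clean latent vectors.
        point = latents_corrupt + alpha * (latents_clean - latents_corrupt)
        point = point.detach().requires_grad_(True)
        target = target_fn(point)   # scalar: attention-out latent activation
        target.backward()
        total_grad += point.grad
    total_grad /= steps
    # Per-latent attribution: (clean - corrupt) * average gradient.
    return (latents_clean - latents_corrupt) * total_grad
```

The output is one scalar attribution per earlier latent; everything the attention layer does in between (which heads, which QK/OV interactions) is collapsed into that scalar, which is exactly the missing part.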
