Bart Bussmann's Shortform

by Bart Bussmann
13th Mar 2024
Bart Bussmann · 1y

According to this Nature paper, the Atlantic Meridional Overturning Circulation (AMOC), the "global conveyor belt", is likely to collapse this century (mean 2050, 95% confidence interval 2025–2095).

Another recent study finds that it is "on tipping course" and predicts that after collapse average February temperatures in London will decrease by 1.5 °C per decade (15 °C over 100 years). Bergen (Norway) February temperatures will decrease by 3.5 °C per decade (35 °C over 100 years). This is a temperature change about an order of magnitude faster than normal global warming (0.2 °C per decade), but in the other direction!

This seems like a big deal? Anyone with more expertise in climate science want to weigh in?

Bart Bussmann · 2mo

When working with SAE features, I've usually relied on a linear intuition: a feature firing with twice the strength has about twice the "impact" on the model. But while playing with an SAE trained on the final layer, I was reminded that the actual direct impact on the relative token probabilities grows exponentially with activation strength. While a feature's additive contribution to the logits is indeed linear in its activation strength, the ratio of the probabilities of two competing tokens is the exponential of the logit difference: P(A)/P(B) = exp(logit(A) − logit(B)).

If we have a feature that boosts logit(A) and not logit(B) and we multiply its activation strength by a factor of 5.0, this doesn't 5x its effect on P(A)/P(B), but rather raises its effect to the 5th power. If this feature caused token A to be three times as likely as token B before, it now makes this token 3^5 = 243 times as likely! This might partly explain why the lower activations for a feature are often less interpretable than the top activations. Their direct impact on the relative token probabilities is exponentially smaller.
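The arithmetic above can be checked in a few lines. This is a minimal sketch with hypothetical logit values (the baseline logits, the feature's contribution delta, and the 5× scaling factor are all illustrative, not from any real SAE):

```python
import math

def prob_ratio(logit_a, logit_b):
    """Ratio P(A)/P(B) under softmax; it depends only on the logit difference."""
    return math.exp(logit_a - logit_b)

# Hypothetical baseline: tokens A and B start with equal logits.
base_a = base_b = 2.0

# Suppose the feature adds delta = ln(3) to logit(A) at activation strength 1.0,
# making A three times as likely as B.
delta = math.log(3)
ratio_1x = prob_ratio(base_a + delta, base_b)

# Multiplying the activation by 5 adds 5 * delta to logit(A), so the
# probability ratio is raised to the 5th power, not multiplied by 5.
ratio_5x = prob_ratio(base_a + 5 * delta, base_b)

print(ratio_1x, ratio_5x)  # ≈ 3.0 and ≈ 243.0 == 3**5
```

The same holds for any baseline logits, since only the difference enters the exponential.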

Note that this only holds for the direct 'logit lens'-like effect of a feature. This makes the intuition mostly applicable to features in the final layers of a model, as the impact of earlier features is probably mostly mediated by their effect on later layers.
