shash42

Comments

Incorrect Baseline Evaluations Call into Question Recent LLM-RL Claims
shash42 · 3mo

Thanks, these are some great ideas. Another thing you might want to look into is shifting away from MCQs towards answer-matching evaluations: https://www.lesswrong.com/posts/Qss7pWyPwCaxa3CvG/new-paper-it-is-time-to-move-on-from-mcqs-for-llm
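
As a minimal sketch of the difference, with normalized string containment standing in for the LLM-based judge the linked paper actually proposes (all names here are illustrative):

```python
# Sketch: MCQ scoring vs. answer matching. A real answer-matching setup
# would prompt an LLM judge; normalized string containment stands in here.

def normalize(s: str) -> str:
    # Lowercase, trim, and collapse whitespace for a crude comparison.
    return " ".join(s.lower().strip().rstrip(".").split())

def score_mcq(model_choice: str, correct_choice: str) -> bool:
    # MCQ: credit for picking the right option letter, even if picked
    # by elimination or from cues in the distractors.
    return model_choice.strip().upper() == correct_choice.strip().upper()

def score_answer_matching(free_form_response: str, reference: str) -> bool:
    # Answer matching: the model must generate the answer itself;
    # the matcher checks it against the reference answer.
    return normalize(reference) in normalize(free_form_response)

print(score_mcq("B", "b"))                                      # True
print(score_answer_matching("The capital is Paris.", "Paris"))  # True
```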

Incorrect Baseline Evaluations Call into Question Recent LLM-RL Claims
shash42 · 5mo

Yes, that is a good takeaway!

Incorrect Baseline Evaluations Call into Question Recent LLM-RL Claims
shash42 · 5mo

Hey, thanks for checking. The Qwen2.5 MATH results are on the full MATH dataset, so they are not comparable here: the Spurious Rewards paper uses MATH500. The Hochlehnert et al. paper has results on MATH500, which is why we took the numbers from there.

I do agree that ideally we should re-evaluate all models on the same, more reliable evaluation setup. However, to the best of our knowledge, the papers have not released open-weight checkpoints. The most transparent way to fix all these issues is for papers to release sample-level outputs going forward, so it's easy for people to figure out what's going on.

All this said, in the end, our main point is only: if changing inference hyperparameters can give higher accuracy, is RL really improving "reasoning abilities"?
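
For concreteness, here is a sketch of what a sample-level release could look like, one JSON record per evaluated sample (the field names are illustrative, not any standard):

```python
# Sketch: writing sample-level eval outputs to JSONL so others can
# re-score or audit them. Field names are illustrative, not a standard.
import json

record = {
    "benchmark": "MATH500",
    "example_id": 17,
    "prompt": "<full prompt as sent to the model>",
    "completion": "<raw model output>",
    "sampling": {"temperature": 0.0, "top_p": 1.0, "max_tokens": 4096},
    "extracted_answer": "42",
    "correct": True,
}

with open("samples.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```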

Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
shash42 · 6mo

Plug, but model mistakes have been getting more similar as capabilities increase. This also suggests that the correlated failures appearing now will go away together.

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)
shash42 · 6mo

Various baselines have long been underrated in the interp literature, and now that we are re-realizing their importance, I'll bring up some results we found at MATS'23 that in hindsight should probably have received more attention: https://www.lesswrong.com/posts/JCgs7jGEvritqFLfR/evaluating-hidden-directions-on-the-utility-dataset

We found that linear probes are great for classification, but they mostly fit spurious correlations. That can still be fine if prediction is the end goal, such as when trying to identify deception. However, the directions found by a linear probe can't be used for steering or ablation.

What works really well (and is still under-explored) is the vector obtained by subtracting the class means, both for causal steering and for classification. LEACE showed that this is in fact the theoretically optimal linear erasure.
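
A minimal sketch of the idea on synthetic "activations" (numpy only; the data and scales are made up):

```python
# Sketch: the difference-of-means direction on synthetic activations,
# used for classification, steering, and erasure. All data is made up.
import numpy as np

rng = np.random.default_rng(0)
d = 64
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
pos = rng.normal(size=(200, d)) + 2.0 * true_dir  # class A activations
neg = rng.normal(size=(200, d)) - 2.0 * true_dir  # class B activations

# Difference of class means, normalized.
dom = pos.mean(axis=0) - neg.mean(axis=0)
dom /= np.linalg.norm(dom)

# Classification: project onto the direction, threshold at the midpoint.
midpoint = (pos.mean(axis=0) + neg.mean(axis=0)) / 2
acc = ((pos - midpoint) @ dom > 0).mean()
print(f"accuracy on class A: {acc:.2f}")

# Steering: push an activation along the direction.
steered = neg[0] + 4.0 * dom

# Erasure/ablation: remove the component along the direction.
ablated = pos - np.outer(pos @ dom, dom)
print(f"max |projection| after ablation: {np.abs(ablated @ dom).max():.1e}")
```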

Unsupervised methods (like PCA) are less good at prediction but still quite good for causal interventions. 

These results were only published as Figure 12 of the Representation Engineering paper, though in hindsight it might have helped to highlight them more prominently as a paper of their own. SAEs were just being 'discovered' around this time (I remember @Hoagy was working on this at MATS'23), so unfortunately we didn't benchmark them.

Log-linear Scaling is Worth the Cost due to Gains in Long-Horizon Tasks
shash42 · 6mo

Thanks! I fixed the last paragraph accordingly. I indeed wanted to say faster-than-linearly for the highest feasible k.

METR: Measuring AI Ability to Complete Long Tasks
shash42 · 6mo

These results empirically resolve for me why scaling will continue to be economically rational despite logarithmic gains in performance on (many) benchmarks: https://www.lesswrong.com/posts/dAYemKXz4JDFQk8QE/log-linear-scaling-is-worth-the-cost-due-to-gains-in-long
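
A toy calculation (my framing, not taken from either post) of why this can hold: if end-to-end success on an n-step task is roughly per-step reliability raised to the nth power, then small reliability gains multiply the feasible task horizon:

```python
# Toy illustration: modest per-step reliability gains translate into
# large gains in the task horizon achievable at a fixed success rate.
import math

target = 0.5  # required end-to-end success rate
for p in (0.99, 0.995, 0.999):  # per-step reliability (made-up values)
    horizon = math.log(target) / math.log(p)
    print(f"p = {p}: ~{horizon:.0f}-step tasks at {target:.0%} success")
# p = 0.99:  ~69 steps
# p = 0.995: ~138 steps
# p = 0.999: ~693 steps
```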

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
shash42 · 10mo

That makes sense. My higher-level concern with gradient routing (to some extent true for any other safety method) being used throughout training rather than after training is the alignment tax: it might lead to significantly lower performance and not get adopted in frontier models.

Evidence of this for gradient routing: people have tried various forms of modular training before [1], [2], and they never really caught on, because it's always better to train a combined model, which allows optimal sharing of parameters.

It's still a cool idea though, and I would be happy to see it work out :)

[1] Andreas, Jacob, et al. "Neural Module Networks." CVPR 2016.

[2] Ebrahimi, Sayna, et al. "Adversarial Continual Learning." ECCV 2020.

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
shash42 · 10mo

Thanks for pointing me to Figure 12; it alleviates my concern! I don't fully agree with RMU being a stand-in for ascent-based methods: targeted representation noising (as done in RMU) seems easier to reverse than loss-maximization methods (like TAR). Finally, just to clarify, I see SSD/Potion more as automated mechanistic interpretability methods than as finetuning-based ones. What I meant to say was that adding some retain-set finetuning on top (as done for gradient routing) might be needed to make them work for tasks like unlearning virology.

Posts

New Paper: It is time to move on from MCQs for LLM Evaluations · 3mo · 9 karma · 0 comments
An Alternative Way to Forecast AGI: Counting Down Capabilities · 4mo · 3 karma · 0 comments
Incorrect Baseline Evaluations Call into Question Recent LLM-RL Claims · 5mo · 66 karma · 7 comments
Log-linear Scaling is Worth the Cost due to Gains in Long-Horizon Tasks · 6mo · 16 karma · 2 comments
shash42's Shortform · 10mo · 2 karma · 0 comments
Evaluating hidden directions on the utility dataset: classification, steering and removal · 2y · 25 karma · 3 comments