
Taywon Min

I am a graduate student at KAIST, supervised by Kimin Lee. Please check out my homepage, https://mintaywon.github.io/, if you're interested!

Comments
Greedy-Advantage-Aware RLHF
Taywon Min · 2mo

Is it okay to compute the advantage function as a simple difference of value functions?
To my understanding, the advantage in PPO is not just the difference between value functions: it is estimated with GAE, which also depends on the rewards and value estimates of later steps in the sampled trajectory.
Shouldn't we necessarily use that estimator during PPO training?
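
For reference, and as a check on my own understanding, the GAE estimator I have in mind is (with $V$ the learned value function, $r_t$ the reward, and $\gamma$, $\lambda$ the usual discount and GAE parameters):

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}$$

Only in the $\lambda = 0$ case does this reduce to a one-step difference of value terms plus the immediate reward; for $\lambda > 0$ it also depends on rewards and value estimates further along the sampled trajectory.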

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Taywon Min · 3mo

But what if all of the insecure code samples contribute in some way?
My take on influence functions is that they are good at identifying unique samples that are distinct from the rest, but bad at estimating group effects, because they assume the training data are i.i.d.

Nevertheless, if one could find a smaller subset of the 6000 data points, say 1000 or fewer, that produces similar levels of misalignment, I think that would be an interesting finding.
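
To spell out what I mean by group effects: under the usual first-order approximation, the estimated effect of removing a group $G$ of training points is just the sum of individual influences,

$$\mathcal{I}(G) \;\approx\; \sum_{z \in G} \mathcal{I}(z),$$

which ignores interactions between the removed points, so a large set of individually unremarkable insecure-code samples could still matter a lot jointly.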

An X-Ray is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation
Taywon Min · 6mo

Thanks for the great work. I think multimodal sparse autoencoders are a promising direction. Do you think it is possible / worthwhile to train SAEs on VLA models like OpenVLA? I haven't seen any related work training or interpreting action models with SAEs, and I am curious about your thoughts.
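
Concretely, the kind of setup I have in mind is sketched below: a toy PyTorch SAE trained on placeholder activations standing in for hidden states hooked out of a VLA policy. This is only an illustrative sketch with made-up dimensions and hyperparameters, not OpenVLA's actual API.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal L1-penalized sparse autoencoder over model activations."""

    def __init__(self, d_model: int, d_hidden: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input activations
        return x_hat, f

    def loss(self, x):
        x_hat, f = self(x)
        recon = (x - x_hat).pow(2).mean()                    # reconstruction error
        sparsity = self.l1_coeff * f.abs().sum(-1).mean()    # L1 sparsity penalty
        return recon + sparsity

# Placeholder for activations cached from a VLA model's transformer blocks
# (in practice these would come from forward hooks on the policy network).
acts = torch.randn(4096, 768)  # [n_tokens, d_model], fake data for illustration

sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

for step in range(1000):
    batch = acts[torch.randint(0, acts.shape[0], (256,))]
    loss = sae.loss(batch)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The VLA-specific design choice would mostly be which activations to cache, e.g. a residual-stream layer just before the action decoding head versus a vision-language fusion layer.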
