Fernando Avalos — LessWrong

Wubbalubbaudbdub

Approximating Human Preferences Using a Multi-Judge Learned System

TL;DR: We present a conceptual discussion and a loose formalism of Expert Orchestration, with an emphasis on judges. We motivate the problem of finding the best way to combine multiple judges' scores and present a solution: learning the combining function. We then present the architecture and low-level details of the experiments...

Jul 31, 2025•19

[Linkpost] Interpretable Analysis of Features Found in Open-source Sparse Autoencoder (partial replication)

This was an up-skilling project I worked on over the past months. Although I don't think it is anything fancy or highly relevant to SAE research, I find it valuable because I learned a lot and refined my understanding of how mechinterp fits in the holistic, bigger...

Sep 9, 2024•6