Interesting post! I am a bit confused about the application of the local linearity principle to the tax bracket issue.
The content of "differentiable functions are locally linear" is something like: if you nudge an input by ε, the output changes by ~f'(x)·ε. If you double your nudge, you approximately double your effect (to first order). I enjoyed the small goods insurance and altruism examples, since local curvature is negligible at the relevant scale (for altruist utility, and relatively unimportant decisions).
But the tax bracket example doesn't really make sense to me. The feared bad outcome ("being pushed into a higher bracket makes me worse off") is compatible with local linearity (in the sense that the segments were linear), since just you could just have f'(x) < 0. What rules out the bad outcome is that marginal tax rates are positive and less than 100%, not anything about smoothness or local approximation.
Thanks for bringing this up: this was a pretty confusing part of the evaluation.
Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words (and it could optionally use the seed to throw a biased coin as well).
You’re right that the easiest way to solve this problem, as enforced in our grading, is to output an ordered pair without using the seed.
The main reason we didn’t enforce this very strictly in our grading is that we didn’t expect (and in fact empirically did not observe) LLMs actually hard-coding a single pair across all seeds. Given that, it would have been somewhat computationally expensive to explicitly penalize this in grading.
Do you have concrete examples of this working? Seems plausible, just curious about the current attack surface.
My intuition is reasoning fixes this well before models are robust in other ways—the OODness of the previous acceptance gets mitigated by model-generated CoT. RL on traces with more context switching would probably also help. I don’t think you need models with self-awareness and stable goals to solve this; seems like a more mundane training distribution issue.
Interesting though, and seems like a more general version of this problem is pretty hard (e.g. for agent decision making).