Most Current Model Organisms Leak: Perplexity Differencing Often Reveals Finetuning Objectives
Authors: Mohammad Abu Baker, Luca Baroni, Daniel Wilhelm
Paper: https://arxiv.org/abs/2605.00994
Code: https://github.com/z3research/ppldiff-paper
Twitter thread: https://x.com/m_shahoyi/status/2071892578476110136
Top-ranked revealing completions can be inspected here: https://z3research.org/

This post summarizes the paper and adds a few extra reflections in Discussion.

TL;DR
* We found that many current publicly available model organisms (MOs) "leak" instilled...