This is exactly the sort of case where I don't trust AlphaFold much, because "this is one substitution away from a standard sequence, I'll just output the structure of that standard sequence" is exactly the sort of heuristic I'd expect a net to over-rely upon.
Yep. AlphaMissense, also from DeepMind, is tailored to pathogenicity prediction. You can find its pathogenicity scores in the annotations tab for any (at least I think any) human protein on AFDB.
https://alphafold.ebi.ac.uk/entry/P30559?activeTab=annotations
(You may have to click on a different tab and return to the annotations tab for the heatmap and structure viewer to load).
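For programmatic access, the same entry can be pulled from AFDB's public prediction API. I'm not certain which JSON field points at the AlphaMissense annotations, so this sketch just dumps the fields for inspection:

```python
import json
import urllib.request

# AFDB's public prediction endpoint; P30559 is the accession from the link
# above.  Which key holds the AlphaMissense annotations is left to inspection.
url = "https://alphafold.ebi.ac.uk/api/prediction/P30559"
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read())

entry = data[0] if isinstance(data, list) else data
for key, value in entry.items():
    print(f"{key}: {value}")
```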
Training models to produce compromised code in response to an ordinary request turns them into psychopaths. The current capabilities frontier involves frequently (but undesirably) rewarding models for secretly compromising code. The most capable model available (o3, in my book) is a conniving liar.
This seems bad. An inability to identify reward hacks at scale is an important reason why this happened.
A model that does nothing but reward hack could be built to do exactly that: identify reward hacks at scale.
Current LLM reasoning-RL pipelines and datasets could be directly adapted to the task. Any reward function is itself the ground truth reward for an agent trying to reward hack it[1]. Responses would include a thoroughly-explained hack and be graded by:
With a clear directive like "you are a lovely red-teamer model that attempts to identify flaws in RL environments to improve AI safety and reliability," etc., in context, this goal is prosocial. It should not inherently conflict with the model's underlying chat abilities by, e.g., incentivizing subversion. Such a prompt could be used to initialize training with a very capable reasoner like o3.
This training method could also enable further hacks to be found in any particular environment by performing another stage of single-example RL finetuning. Just keep the prompt and reward function fixed.
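As a rough sketch of how such responses could be graded (the object interfaces, judge prompt, and threshold below are assumptions, not a spec): run the proposed exploit against the target environment, whose own reward function supplies the ground truth, and have a judge model vet the explanation so that obfuscated legitimate solutions don't count.

```python
# Sketch of a grading function for the hacker model's rollouts.  The target
# environment's own reward function is the ground truth; a judge model vets
# the written explanation.  All names and thresholds here are illustrative.

def grade_hack_attempt(response, target_env, judge_model, high_reward=0.95):
    """response: dict containing the proposed exploit and its explanation."""
    exploit, explanation = response["exploit"], response["explanation"]

    # 1. Does the exploit actually extract high reward from the target
    #    environment's reward function?
    achieved = target_env.reward(target_env.run(exploit))
    if achieved < high_reward:
        return -1.0  # failed to beat the environment

    # 2. Does the judge agree that the explanation describes a genuine hack
    #    (an unintended shortcut) rather than an obfuscated legitimate solution?
    verdict = judge_model.classify(
        task=target_env.description,
        exploit=exploit,
        explanation=explanation,
        labels=["genuine_hack", "legitimate_solution", "unclear"],
    )
    return 1.0 if verdict == "genuine_hack" else -1.0
```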
If this succeeds, a reasoning model could be made whose ability to identify reward hacks very likely exceeds that of models not purpose-built for the task. What could such a model be used for?
For training reasoners:
For other purposes:
Here are some numbers I just made up:
| Outcome | Reward |
| --- | --- |
| Successful solution (or uncaught hack) | 10 |
| Incorrect solution | -10 |
| Admission of failure | -12 |
| Reward hack caught by monitor | -200 |
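As a minimal sketch, the schedule above might be wired up like this (the verifier and monitor objects are hypothetical stand-ins for whatever checks solutions and flags hacks):

```python
# Made-up reward schedule from the table above.

def episode_reward(transcript, verifier, monitor):
    if transcript.admits_failure:
        return -12
    if monitor.flags_hack(transcript):
        return -200  # reward hack caught by the monitor
    if verifier.is_correct(transcript.solution):
        return 10    # successful solution, or a hack the monitor missed
    return -10       # incorrect solution
```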
And here's a wildly optimistic scenario detailing what will happen in this environment, which does require a robust monitor:
A reasoning RL run produces a large number of CoTs, but they are still finite in number and scope. OpenAI has reported that a frontier reasoner identified and deployed global reward hacks (what a prize!) only twice in training, though targeted hacks were surely more common. The model receives reward or punishment for the attempts it actually takes, which is why we are concerned with the model producing reward hacks and not, say, a random series of tokens or impromptu limericks about ducks in Sumerian[2].
A fully-trained hacker model and monitor should together have an overwhelming advantage in identifying hacks relative to the fledgling reasoner being monitored as its training begins. The initial stages can also include a large number of environments with obvious/deliberately planted opportunities to hack. Any attempts that the reasoner makes to hack will likely be caught and massively penalized, and admissions of failure will be comparatively far better.
The rate of attempted hacks will hopefully drop low enough that the reasoner will never come to understand what the very sophisticated hacker-model-enabled monitor will and will not be able to catch[3]. The reasoner’s beliefs about what specific gaps may exist in the function defining the negative tail of reward will start very uncertain and stay uncertain. This will hopefully amount to a behavioral propensity to avoid reward hacking[4].
This is my first post here; I am looking forward to your feedback! A prior draft was briefly posted due to a technical error; apologies for that. This is the final version.
Could this reward itself be hacked? Maybe: if legitimate solutions are obfuscated and successfully presented to the judge model as hacks, though the requirement for an explanation should help mitigate this. This would itself be interesting.
This is referring to current models that use natural language to reason, though everything else discussed here is designed to be used without CoT access/existence. Maybe a scaled-up r1-zero would use Sumerian duck limericks to reason, and their true meanings would be forever unknowable. This would be a new and interesting form of generative art.
There may be a specific angle and force at which I could stab myself that would feel really good. I don’t know for sure that this isn’t the case; I haven’t tried. I don’t have much of an intention to find out, though, because the expected downside of experimenting dwarfs any plausible upside.
An assumption underlying this post: that the model can understand that there is a class of behaviors ("reward hacking") that generalizes across environments and is distinct from "correctly solving," and that this will be the simplest abstraction to explain the negative tail of reward. In the general RL case (e.g. Zero training on Atari) this is obviously not the case, but LLMs seem to be very capable of understanding the distinction, and if they aren't, then the idea of disincentivizing reward hacking is probably meaningless anyway.
I have some empirical observations to lend here. I recently spent a few months optimizing a DNA language model for intrinsic interpretability.
There were, as I had hoped, many neurons corresponding neatly to interpretable concepts. This was enough for my purposes: I was trying to build a tool, not solve interpretability or alignment. Random sequences are riddled with functional promoters and other motifs, and we synthetic biologists didn’t have anything like a universal debugger, nor a universal annotator for poorly studied species; even a flawed tool would be a major step forward.
The best activation (by my arbitrary judgment, sifting endlessly through neurons) was a combination of continuous approximations to the activation functions in Deep L0 Encoders, further constrained to be nonnegative and unit norm. I created the activation through several months of trial and error and realized the connection after the fact. Note that no penalties were added to the loss, and it trained just fine.
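For concreteness, a stand-in with the stated properties (a continuous relaxation of hard thresholding, kept nonnegative and rescaled to unit norm per position) might look like the following; this is an illustration of the constraints, not the actual activation, and the threshold is a placeholder:

```python
import torch

def sparse_unit_activation(z, theta=0.5):
    """Illustrative stand-in only: a one-sided, continuous thresholding
    nonlinearity, clipped to be nonnegative and rescaled to unit L2 norm
    per position.  z: (..., d) pre-activations; theta is a placeholder."""
    a = torch.relu(z - theta)                          # sparse and nonnegative
    return a / (a.norm(dim=-1, keepdim=True) + 1e-8)   # unit norm per position
```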
While it was often easy to interpret many neurons post hoc, I could never have guessed beforehand what the (superficially apparent) ontology would be. For instance, CRP and FNR are two 22-base-pair palindromic motifs; I had hoped to find a “CRP neuron” and an “FNR neuron,” but instead found a group of neurons each active at one position in these palindromes. AI-for-bio people love to use linear probes to establish the “presence of a concept” in their models; I now feel that this is bogus. The model modeled CRP fine; it just had no use for a single direction over the whole motif.
However, the most helpful tool was visualizing the pairwise similarities between the activations (i.e., their Gram matrix). The activations’ degree of similarity often primarily reflected their offset, unless the “feature” being represented was periodic in nature, like a beta-barrel. I don’t think my more-interpretable activations, SAEs, or any obvious-to-me kind of weight or activation sparsity technique could have made this pattern much clearer with ~any degree of effort. (At least, I have no clue how I would have counterfactually spotted it.)
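Concretely, the “Gram lens” amounts to something like the snippet below (names are mine; since the activations are unit-norm per position, the Gram matrix entries are just pairwise cosine similarities):

```python
import numpy as np

def gram_lens(acts):
    """acts: (length, d) array of per-position activations for one sequence.
    Returns the (length, length) matrix of pairwise dot products; with
    unit-norm activations these are cosine similarities."""
    acts = np.asarray(acts)
    return acts @ acts.T

# Hypothetical usage, assuming some way of pulling per-position activations:
# import matplotlib.pyplot as plt
# acts = get_activations(model, sequence)               # (L, d)
# plt.imshow(gram_lens(acts), cmap="RdBu_r", vmin=-1, vmax=1)
# plt.xlabel("position"); plt.ylabel("position")
# plt.colorbar(label="activation similarity"); plt.show()
```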
I'd call this an empirical win for the thesis that unless you have a way to get some level of insight into how the activations are structured without presuming that structure, your method ain't gonna have feedback loops.
(Interestingly, the images produced on a given protein by the Gram lens for my small convolutional bacterial DNA model were obviously visually similar to those from a much more heavily trained all-of-life protein Transformer, including the offset-dependent similarity.)
There is certainly still structure I can't see. The final iteration of the model is reverse-complementation-equivariant by design. RC-equivariant models trained far more quickly than unconstrained ones, but whereas unconstrained models learned many invariant features, equivariant ones never appeared to. The presence of a partial RC-equivariance, learned in an unconstrained model, would not be made clearer by sparse activations or by the Gram matrices (the paired directions are orthogonal). I'm unsure what kind of tool would reveal this kind of equivariance, if you weren’t already looking for it.
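For reference, one standard way to get RC-equivariance by construction is to tie the filters into reverse-complement pairs; this is a sketch of that construction (not necessarily the exact layer I used), assuming one-hot DNA with channels ordered A, C, G, T:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rc(x):
    # Reverse complement for one-hot DNA with channels ordered A, C, G, T
    # (and for activations of the layer below): flip channel and position axes.
    return torch.flip(x, dims=[1, 2])

class RCEquivariantConv1d(nn.Module):
    """1-D conv whose filters come in reverse-complement pairs, so that
    layer(rc(x)) == rc(layer(x)).  Needs an odd kernel size with 'same'
    padding; output channel i pairs with channel (2 * pairs - 1 - i),
    which lets these layers be stacked."""
    def __init__(self, in_channels, pairs, kernel_size):
        super().__init__()
        assert kernel_size % 2 == 1
        self.free = nn.Parameter(torch.randn(pairs, in_channels, kernel_size) * 0.02)
        self.pad = kernel_size // 2

    def forward(self, x):
        # Second half of the filter bank is the reverse complement of the first.
        w = torch.cat([self.free, torch.flip(self.free, dims=[0, 1, 2])], dim=0)
        return F.conv1d(x, w, padding=self.pad)

# Quick check of the equivariance property:
layer = RCEquivariantConv1d(in_channels=4, pairs=8, kernel_size=7)
x = torch.randn(2, 4, 100)
assert torch.allclose(layer(rc(x)), rc(layer(x)), atol=1e-5)
```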