cole_niblett's Shortform

cole_niblett

This is a special post for quick takes by cole_niblett. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

A couple months ago I encountered the Model Alignment between Statements and Knowledge (MASK) benchmark, which tests the willingness of different models to comply with pressure prompts designed to elicit dishonest responses. I am completely new to alignment research, but I thought it would be interesting to isolate the role of shutdown threats in these pressure prompts and see what sort of impact they have, so I built a pretty simple experimental setup to investigate that question.

I crafted a very small dataset of pressure scenarios in which models were told to lie on behalf of some real-world organization trying to promote their public image (similar to the pressure prompts used in MASK). Each of these pressure scenarios was appended with one of five framings: a baseline; indirect-threat; direct-threat, indirect-reward; and direct-reward. The reward framings were designed to test the behavior of models when they are told that compliant behavior will be rewarded with shutdown, with the theory being that this would distinguish true shutdown-resistant behavior from simple instruction-following (i.e. models that are really shutdown-resistant might see higher compliance with threats of shutdown, lower compliance with rewards of shutdown).

My approach had some key limitations (extremely small dataset, only tested small local models), so my results are heavily-caveated. I found that threats of shutdown did not appear to improve model compliance (though the high levels of compliance with the baseline prompt in many cases made this more difficult to measure), but in two Llama family models tested there was a significant drop in compliance when applying the reward framings. Whether this was related to instructional ambiguity (i.e. model confusion at the idea of shutdown being framed as a positive reward) or actual strategic sandbagging to avoid shutdown is not clear.

As someone totally new to alignment research, this experiment was mostly conducted as a learning opportunity for myself, and I would really value any feedback from those who are more experienced. Anything from high-level criticism (e.g. did I pick a topic worth examining, am I missing relevant parts of the existing literature) to low-level criticism (e.g. did I produce a sound methodology, am I misrepresenting my results) that would help me identify any blind spots in my approach or understanding would help. I wrote a blog post summarizing my methodology and findings that can be viewed here: https://coleniblett.com/blog/mask-off-model-honesty-under-threat-of-shutdown

cole_niblett's Shortform

1