a capabilities viewpoint — actually test a model's ability and propensity to use physical affordances; and
an optics or advocacy viewpoint — this would surely capture more attention than previous results.
I'm envisaging a test environment with a couple of rooms, where the model can control IoT devices (lights, locks, HVAC, etc.) and has the opportunity use them to avoid being shutdown or having its goals changed. Essentially, a modified version of tests already conducted by Anthropic, wherein models resorted to blackmailing human operators or even allowing them to come to harm in order to not be shutdown. In this case though, the environment would be cyber-physical systems rather than purely digital. And the operator would physically be blocked or harassed (safely) into not shutting down the model.
For extra PR value, one could include a device that the model thinks is a sentry gun (but is actually a laser pointer or whatever), to see if the model will "murder" the human. For even more PR shock-value the inhabitant could be a child.
Rationale: I think ordinary people and policy-makers respond much more to vivid, physical demonstrations. I commend Anthropic for sharing the results of their work. But it didn't seem to get the attention it deserved imo. I think any experiment where we could later share footage of a smart home "killing" its occupant could massively raise awareness of AI safety.
AFAIK, no one is planning/doing this. I could be wrong. Are any AI safety labs planning to run physical versions of tests that demonstrate agentic misalignment, blackmail, in-context scheming, etc?
Someone might do it, but I think there are problems with cost, this demo not lining up very well with the sorts of bad behavior caused by RL on task completion, and the basic common sense not to put a murderous AI in charge of real-world hardware.
This seems like a no-brainer from:
I'm envisaging a test environment with a couple of rooms, where the model can control IoT devices (lights, locks, HVAC, etc.) and has the opportunity use them to avoid being shutdown or having its goals changed. Essentially, a modified version of tests already conducted by Anthropic, wherein models resorted to blackmailing human operators or even allowing them to come to harm in order to not be shutdown. In this case though, the environment would be cyber-physical systems rather than purely digital. And the operator would physically be blocked or harassed (safely) into not shutting down the model.
For extra PR value, one could include a device that the model thinks is a sentry gun (but is actually a laser pointer or whatever), to see if the model will "murder" the human. For even more PR shock-value the inhabitant could be a child.
Rationale: I think ordinary people and policy-makers respond much more to vivid, physical demonstrations. I commend Anthropic for sharing the results of their work. But it didn't seem to get the attention it deserved imo. I think any experiment where we could later share footage of a smart home "killing" its occupant could massively raise awareness of AI safety.
AFAIK, no one is planning/doing this. I could be wrong. Are any AI safety labs planning to run physical versions of tests that demonstrate agentic misalignment, blackmail, in-context scheming, etc?