I like this overall direction for how simple and robust it is. One challenge I see is that the latent capability for misalignment is still deeply ingrained in the model, and this could be abused by a bad actor even if the model itself doesn’t abuse it. For example, a user could make the model simulate a misaligned human/AI and use the simulated output to drive a local agent chassis. One way around this would be to simply not show misaligned output to the user, but this wouldn’t defend against cases where someone (e.g. a hacker, an employee) gets access to t...