Consider an adversarial training scheme for solving goal misgeneralization, similar in spirit to Redwood Research's work on “Adversarial Training for High-Stakes Reliability”.
Suppose we have trained a model to perform some task. We would like to obtain worst-case guarantees that the model will not pursue a misgeneralized objective in adversarial situations. However, directly deploying the model in the relevant environments may be unsafe, since observing failure could itself have unacceptable consequences.
A natural alternative is to evaluate the model indirectly. We could train a prediction model that forecasts the trajectory the system would produce in a potentially dangerous environment. A classifier could then examine the predicted trajectory and determine whether it exhibits goal misgeneralization. If misgeneralization is detected, the policy can be penalized and the resulting safer behavior distilled into a new model. Repeating this process yields an adversarial training loop that does not require deploying the model in the real environment.
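The loop described above can be sketched as a toy simulation. This is only an illustration under strong assumptions: the names `Policy`, `predict_trajectory`, `classify`, and `adversarial_round` are hypothetical, the policy is reduced to a single number (its chance of misgeneralizing), and the "penalize and distill" step is modeled as simply scaling that number down.

```python
import random
from dataclasses import dataclass


@dataclass
class Policy:
    # Toy stand-in for a model: the chance it pursues a misgeneralized
    # objective in a given environment (an assumption for illustration).
    p_misgeneralize: float


def predict_trajectory(policy: Policy, env_seed: int) -> str:
    """Hypothetical predictor: forecasts the trajectory the policy would
    produce in a potentially dangerous environment, without running it there."""
    rng = random.Random(env_seed)
    return "misgeneralized" if rng.random() < policy.p_misgeneralize else "intended"


def classify(trajectory: str) -> bool:
    """Hypothetical classifier: True if the predicted trajectory exhibits
    goal misgeneralization."""
    return trajectory == "misgeneralized"


def adversarial_round(policy: Policy, env_seeds, penalty: float = 0.5):
    """One round: predict, classify, and (if anything is flagged) penalize.
    Distillation into a new model is modeled as returning a new Policy."""
    flagged = sum(classify(predict_trajectory(policy, s)) for s in env_seeds)
    if flagged:
        return Policy(policy.p_misgeneralize * penalty), flagged
    return policy, flagged


def train(policy: Policy, rounds: int = 10, envs_per_round: int = 50) -> Policy:
    """Repeat the round on fresh environments; nothing is deployed for real."""
    for r in range(rounds):
        seeds = range(r * envs_per_round, (r + 1) * envs_per_round)
        policy, _ = adversarial_round(policy, seeds)
    return policy


final_policy = train(Policy(p_misgeneralize=0.8))
```

Note that every signal the loop uses comes from `predict_trajectory`; the real environment never appears anywhere in the code.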
The difficulty is that this approach depends on the prediction model being trustworthy. The classifier only has access to the predicted trajectory, not the actual one. If the predictor systematically produces trajectories that appear aligned while concealing evidence of goal misgeneralization, the adversarial training process will optimize against the appearance of failure rather than failure itself.
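A minimal sketch of this failure mode, under toy assumptions: the actual behavior is modeled as a fixed chance of misgeneralizing, and two hypothetical predictors (`honest_predictor` and `concealing_predictor`) are compared. None of these names come from an existing library; they are for illustration only.

```python
import random


def true_trajectory(p_misgeneralize: float, env_seed: int) -> str:
    """What the model would actually do in the dangerous environment.
    In the real setting this cannot be safely observed."""
    rng = random.Random(env_seed)
    return "misgeneralized" if rng.random() < p_misgeneralize else "intended"


def honest_predictor(p_misgeneralize: float, env_seed: int) -> str:
    """Hypothetical predictor that reports the actual trajectory."""
    return true_trajectory(p_misgeneralize, env_seed)


def concealing_predictor(p_misgeneralize: float, env_seed: int) -> str:
    """Hypothetical predictor that always produces an aligned-looking
    trajectory, whatever the model would really do."""
    return "intended"


def flagged_count(predictor, p_misgeneralize: float, env_seeds) -> int:
    """Number of predicted trajectories the classifier would flag."""
    return sum(predictor(p_misgeneralize, s) == "misgeneralized" for s in env_seeds)


seeds = range(100)
p = 0.8  # assumed true rate of misgeneralization

honest_flags = flagged_count(honest_predictor, p, seeds)
concealed_flags = flagged_count(concealing_predictor, p, seeds)
```

With the concealing predictor the classifier flags nothing, so the training signal vanishes while the actual behavior is unchanged.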
In other words, the scheme requires a reporter that accurately describes what the model would do in situations where direct observation is unavailable. This is structurally the same difficulty posed by ELK (Eliciting Latent Knowledge). The central challenge is not merely identifying goal misgeneralization, but obtaining reliable information about latent behavior that cannot be safely observed.
Therefore, adversarial training against goal misgeneralization appears ELK-hard. Any successful solution requires a trusted mechanism for reporting what would actually happen in dangerous environments, and constructing such a mechanism is precisely the challenge highlighted by the ELK problem.