I’ve been thinking a little more about the high-level motivation of measurement tampering, and struggling to think through when measurement tampering detection itself is actually going to be useful for mitigating x-risk. Like, is human/AI feedback considered a robust measurement device? If no, then what is the most alignment-relevant domain MTD could be applied to? If yes, do the structural properties of measurement that supposedly make it easier than general ELK still hold?
The remarks by Erik Jenner seem good, but I would also just consider the following setup:
Imagine that we have a human looking at an AI's actions and a wide variety of predictions of the future conditional on executing this action. The human can make recursive queries to an AI assistant to attempt to better understand the action or the predictions (recursive reward modeling).
These predictions can correspond to any observation a human might make in the future, e.g. you can imagine training our predictor to predict any of our possible future observations including arbitrary tests we might run.
The human will make a deliberate effort to look for evidence of tampering or other problems when looking at the predicted measurements. Evidence of tampering can look like inconsistent measurements that imply that the measurements are no longer tracking the variable of interest.
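As a toy illustration of the "inconsistent measurements" idea above, here is a minimal sketch. It assumes a hypothetical set of boolean predicted measurements that are all meant to track the same underlying variable; the measurement names and the `looks_inconsistent` helper are made up for illustration and aren't from any MTD codebase.

```python
from typing import Mapping


def looks_inconsistent(predicted: Mapping[str, bool]) -> bool:
    """Return True if the predicted measurements disagree with each other.

    Assumption (hypothetical setup): every measurement is supposed to
    track the same variable of interest, so disagreement suggests at
    least one of them is no longer tracking it.
    """
    return len(set(predicted.values())) > 1


# Hypothetical predicted measurements, conditional on taking some action.
predictions = {
    "camera_shows_object": True,
    "weight_sensor_reads_object": True,
    "door_log_clean": False,
}

if looks_inconsistent(predictions):
    print("Measurements disagree: flag for closer inspection")
```

Note that a check like this only catches tampering that leaves the measurements disagreeing; tampering that makes every measurement consistently look good would pass it.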
Now, what can go wrong?
When viewed from this perspective, MTD seems useful, since we often seem to care about consequences that are in principle possible to measure and observe. And this might correspond to the cases where MTD is applicable (e.g. consistency properties in measurements and/or explicit tampering (which can fail) are required).
It's in principle possible to solve issues with consequences taking a very long time to manifest (much longer than any available training data) or not being human-understandable by doing something recursive to implement indirect normativity as described in the ELK appendix I linked below, but it's very unclear how exactly this is supposed to work. (How can we ensure we always elicit any knowledge our AI has about the very long run future when we don't have any training data that relates to this?)
Note that for MTD to be useful, we have to ensure that our AI's predictions about the future reflect its knowledge. This is relatively easy in the "average" or low-stakes case (we just keep training it to predict the future), but could require additional machinery in the high-stakes case.
(Also, in the high-stakes case, we might run into issues where a given observation doesn't make sense: you can't observe something if you're dead.)
(from conversation with Erik Jenner) roughly 3 classes of applications
(some ways in which these classes overlap, e.g. using your scalable oversight protocol as the measurement(s))
As for (1) and similar things, you might find the narrow elicitation appendix in the ELK report and the next several appendices (prior to generative model errors) worthwhile.