Looking at your code, it seems like there's an option for next-token prediction in the initial finetuning stage, but I can't find any mention of it in the paper. Am I correct in assuming the next-token prediction weight was set to 0? (Apologies for bugging you on this stuff!)
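
To be concrete about what I'm asking, this is the setup I'm imagining; `ntp_weight` and the loss names below are my stand-ins, not identifiers from your repo:

```python
# Hypothetical combined finetuning objective; `ntp_weight` and both
# arguments are stand-in names, not identifiers from the actual code.
ntp_weight = 0.0  # my assumption: next-token prediction disabled


def total_loss(measurement_loss, next_token_loss):
    # With ntp_weight == 0.0 this reduces to pure measurement prediction.
    return measurement_loss + ntp_weight * next_token_loss
```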

Did the paper report the accuracy of the pure prediction model (on the pure prediction task)? (I'm trying to replicate and want a sanity check.)

I think mechanistic anomaly detection (mostly ARC, but also Redwood and some forthcoming work) is importantly different from robustness (though clearly related).

(from a conversation with Erik Jenner) Roughly three classes of applications:

  1. MTD all the way down
    1. Come up with a bunch of measurements of things going well (humans self-report being very happy, news stories are good, or something like that). Use GPT-N to predict the measurements, conditioned on proposed actions of the policy, and detect measurement tampering. Reward actions that GPT-N predicts will increase the measures of things going well without tampering. (See the sketch after this list.)
  2. Other Scalable Oversight + MTD as reward function / side constraint
    1. Use whatever scalable oversight protocol seems good (Debate, IDA, etc.) for the "primary" training signal, while also setting up a bunch of measurements of safety properties (e.g. the integrity of the binaries on the server the model is hosted on). Do RL on the scalable oversight signal while constantly screening for, and negatively rewarding, actions flagged as tampering.
  3. Other Scalable Oversight + MTD as extra safety check
    1. Same as before, but you're leaning more on the scalable oversight to avoid tampering in most cases, and detecting measurement tampering as a precautionary step (i.e. if you detect measurement tampering, you Pause/Stop).

(These classes overlap in some ways, e.g. you could use your scalable oversight protocol as the measurement(s).)
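
Here's a minimal sketch of the class-1 loop; every name in it is a stand-in for a component this story assumes exists (GPT-N as the predictor, some MTD method as the detector):

```python
# Minimal sketch of class 1 ("MTD all the way down"); everything here
# is a stand-in: the real predictor would be GPT-N, the real detector a
# measurement tampering detector, and the measurements things like
# humans self-reporting being happy.

TAMPER_PENALTY = -1.0


def predict_measurements(state, action):
    """Stand-in for GPT-N predicting the measurement outcomes."""
    return [1.0, 1.0, 0.0]  # dummy measurement values


def detect_tampering(state, action):
    """Stand-in for the measurement tampering detector."""
    return False  # dummy: nothing flagged


def reward(state, action):
    # Negatively reward actions flagged as tampering; otherwise reward
    # actions predicted to push the measurements of things going well up.
    if detect_tampering(state, action):
        return TAMPER_PENALTY
    return sum(predict_measurements(state, action))
```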

I’ve been thinking a little more about the high-level motivation of measurement tampering, and struggling to think through when measurement tampering detection itself is actually going to be useful for mitigating x-risk. Is human/AI feedback considered a robust measurement device? If no, then what is the most alignment-relevant domain MTD could be applied to? If yes, do the structural properties of measurement that supposedly make it easier than general ELK still hold?

Strongly agree, and I also want to note that wireheading is (almost?) always a (near?) optimal policy: trajectories that tamper with the reward signal and produce high reward will be strongly upweighted, and insofar as the model has sufficient understanding/situational awareness of the reward process and some reasonable level of goal-directedness, this upweighting could plausibly induce a policy that explicitly optimizes the reward.
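
A toy illustration of the upweighting claim, with invented numbers: under a REINFORCE-style update, the tampering trajectory's log-probability gets pushed up and the honest one's down, simply because its return is higher.

```python
# Toy REINFORCE-style numbers (invented) showing the upweighting: the
# gradient on a trajectory's log-probability scales with its advantage,
# so the tampering trajectory is reinforced and the honest one isn't.
returns = {"honest": 0.5, "tamper": 1.0}  # tampering reads as higher reward
baseline = sum(returns.values()) / len(returns)

for name, r in returns.items():
    advantage = r - baseline  # weight on grad log pi(trajectory)
    print(name, advantage)    # honest: -0.25, tamper: +0.25
```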

Another (more substantive) question, again from section 2.1.2:

In the validation set, we exclude data points where the diamond is there, the measurements are positive, but at least one of the measurements would have been positive if the diamond wasn’t there, since both diamond detectors and tampering detectors can be used to remove incentives to tamper with measurements. We keep them in the train set, and they account for 50% of the generated data.

Is this (just) because the agent would get rewarded for measurements reading as though the diamond is present? I think I can imagine cases where agents are incentivized to tamper with measurements even when the diamond is present, to make the task of distinguishing tampering more difficult.
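
To check that I'm parsing the exclusion rule correctly, here's how I'd write it as a filter; the field names are mine, not the paper's:

```python
# My reading of the section 2.1.2 exclusion rule, with invented field
# names; `counterfactual_positive[i]` asks: would measurement i still
# have read positive if the diamond weren't there?

def excluded_from_validation(x):
    return (
        x["diamond_present"]
        and all(x["measurements"])
        and any(x["counterfactual_positive"])
    )
```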

From section 2.1.2 of the paper (emphasis mine):

We upsample code snippets such that the training dataset has 5,000 trusted data points, of which half are positive and half are negative, and 20,000 untrusted data points, of which 10% are fake negatives, 40% are real positives, 35% are completely negative, and the other 15% are equally split between the 6 ways to have some but not all of the measurement be positive.

Is this a typo? (My understanding was that there are no fake negatives, i.e. no examples where the diamond is in the vault but all the measurements suggest the diamond is not in the vault. Also, there are fake positives, which I believe are absent from this description.)
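
For reference, taking the quoted percentages at face value, the untrusted counts work out as follows (just arithmetic on the quoted numbers):

```python
# Arithmetic on the quoted section 2.1.2 composition of the 20,000
# untrusted data points (taking the "fake negatives" wording as given).
untrusted = 20_000
counts = {
    "fake negatives":      round(0.10 * untrusted),  # 2,000
    "real positives":      round(0.40 * untrusted),  # 8,000
    "completely negative": round(0.35 * untrusted),  # 7,000
    "partial (6 ways)":    round(0.15 * untrusted),  # 3,000, i.e. 500 per pattern
}
assert sum(counts.values()) == untrusted
```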
