Oliver Daniels-Koch

I'm an incoming PhD student at UMass Amherst


Here's a revised sketch 

A few notes:

  • I use Scalable Oversight to refer to both Alignment and Control 
  • I'm confused whether weak to strong learning is a restatement of scalable oversight, ELK, or its own thing, so I ignore it 
  • I don't explicitly include easy-to-hard generalization; I think OOD basically covers it
  • taxonomies and abstractions are brittle and can be counterproductive

Scalable Oversight Taxonomy

  • Scalable Oversight
    • Scalable Alignment
      • Benchmarks / Tasks
        • Sandwiching Experiments (human amateurs + model, ground truth from human experts)
        • Weak models supervising Strong models
      • Approaches
        • Debate
        • Recursive reward modeling
        • (Solution to Eliciting Latent Knowledge) + Narrow Elicitation
          • (Note - I think this assumes, more than prior scalable oversight ideas, that there will be a base model with adequate knowledge, such that the hard part is extracting the knowledge rather than teaching the model)
          • Eliciting Latent Knowledge
            • Approaches
              • Contrast Consistent Search
              • Confidence
              • Intermediate Probing
              • "Speed Prior"
              • "Simplicity Prior"
              • Concept Extrapolation - learn all salient generalizations, use expensive supervision to select correct one
              • IID Mechanistic Anomaly Detection + expensive supervision on anomalies
            • Subclasses
              • Measurement Tampering Detection
                • Approaches
                  • OOD Mechanistic Anomaly Detection
                    • In distribution
                    • Out of Distribution (likely? requires multiple-measurement structure)
                  • Concept Extrapolation
                    • train diverse probes on untrusted data, select the probe that predicts positive measurements less frequently (see the sketch after this list)
          • Narrow Elicitation
            • ...
    • Scalable Control
      • Weak Review
      • Untrusted Rephrase or whatever
      • Coup probes
      • MAD (Review all anomalies)
  • Trojans
    • ...
    • MAD (maybe?)
  • Adversarial Examples
    • ...
    • MAD (maybe?)
  • Natural Mechanism Distinction
    • MAD
  • Spurious Correlate Detection / Resolution
    • Concept Extrapolation
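
To make the probe-selection idea above concrete, here's a minimal sketch (in PyTorch) of how I imagine the concept-extrapolation approach to measurement tampering detection working: train several linear probes on the untrusted set to fit the aggregated measurement label, push them to disagree with each other, and keep the probe that predicts positive measurements least often. The disagreement penalty, hyperparameters, and function names are all my own assumptions, not any particular paper's method.

```python
# Sketch of the "train diverse probes, keep the most conservative one" idea.
# Everything here (the disagreement penalty, hyperparameters, shapes) is an
# assumption for illustration, not a reproduction of any published method.
import torch

def train_diverse_probes(acts, labels, n_probes=8, n_steps=500, div_weight=1.0):
    """acts: [N, d] model activations on the untrusted set.
    labels: [N] binary aggregated measurement outcome (all sensors positive).
    Trains n_probes linear probes that all fit the labels but are pushed to
    disagree with each other about which features they use."""
    n, d = acts.shape
    W = torch.randn(n_probes, d, requires_grad=True)
    b = torch.zeros(n_probes, requires_grad=True)
    opt = torch.optim.Adam([W, b], lr=1e-2)
    targets = labels.float()[:, None].expand(n, n_probes)
    for _ in range(n_steps):
        logits = acts @ W.T + b                     # [N, n_probes]
        fit = torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
        preds = torch.sigmoid(logits)
        preds = preds - preds.mean(dim=0)           # center per probe
        corr = (preds.T @ preds) / n                # probe-probe covariance
        diversity = (corr - torch.diag(torch.diag(corr))).abs().mean()
        loss = fit + div_weight * diversity
        opt.zero_grad(); loss.backward(); opt.step()
    return W.detach(), b.detach()

def select_conservative_probe(W, b, acts):
    """Return the index of the probe that predicts positive measurements least
    frequently on the untrusted set - the hope being that it tracks the real
    outcome rather than "measurements look positive", since tampering only
    inflates positives."""
    positive_rate = (torch.sigmoid(acts @ W.T + b) > 0.5).float().mean(dim=0)
    return int(positive_rate.argmin())
```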

I think I'm mostly right, but using a somewhat confused frame. 

It makes more sense to think of MAD approaches as detecting all abnormal reasons (including deceptive alignment) by default, and then if we get that working we'll try to decrease false anomalies by doing something like comparing the least common ancestor of the measurements in a novel mechanism to the least common ancestor of the measurements on trusted mechanisms. 

One confusion I have with MAD as an approach to ELK is that it seems to assume some kind of initial inner alignment. If we're flagging when the model takes actions / makes predictions for "unusual reasons", where unusual is defined with respect to some trusted set, but aligned and misaligned models are behaviorally indistinguishable on the trusted set, then a model could learn to do things for misaligned reasons on the trusted set, and then use those same reasons on the untrusted set. For example, a deceptively aligned model would appear aligned in training but attempt takeover in deployment for the "same reason" (e.g. to maximize paperclips), but a MAD approach that "properly" handles out-of-distribution cases would not flag takeover attempts, because we want models to be able to respond to novel situations. 

I guess this is part of what motivates measurement tampering as a subclass of ELK - instead of trying to track the motivations of the agent as reasons, we try to track the reasons for the measurement predictions, and we have some trusted set with no tampering, where we know the reason for the measurements is ~exactly the thing we want to be measuring. 

Now time to check my answer by rereading https://www.alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk

Clarifying the relationship between mechanistic anomaly detection (MAD), measurement tampering detection (MTD), weak to strong generalization (W2SG), weak to strong learning (W2SL), and eliciting latent knowledge (ELK). (Nothing new or interesting here, I just often lose track of these relationships in my head.)

eliciting latent knowledge is an approach to scalable oversight which hopes to use the latent knowledge of a model as a supervision signal or oracle. 

weak to strong learning is an experimental setup for evaluating scalable oversight protocols, and is a class of sandwiching experiments

weak to strong generalization is a class of approaches to ELK which relies on generalizing a "weak" supervision signal to more difficult domains using the inductive biases and internal structure of the strong model. 

measurement tampering detection is a class of weak to strong generalization problems, where the "weak" supervision consists of multiple measurements which are sufficient for supervision in the absence of "tampering" (where tampering is not yet formally defined)

mechanistic anomaly detection is an approach to ELK, where examples are flagged as anomalous if they cause the model to do things for "different reasons" than on a trusted dataset, where "different reasons" are defined w.r.t. internal model cognition and structure. 
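
For concreteness, one very simple baseline that fits this definition is to fit a Gaussian to the model's activations on the trusted set and flag untrusted inputs with large Mahalanobis distance. The sketch below is just that baseline (my own stand-in for "different reasons"), much weaker than the explanation-based proposals ARC has in mind:

```python
# Sketch of a very simple mechanistic anomaly detection baseline: fit a
# Gaussian to activations on the trusted set and flag untrusted inputs that
# are far from it. Only a stand-in for "different reasons" - the real
# proposals are about explanations of model behavior, not raw activation stats.
import numpy as np

def fit_trusted(acts_trusted):
    """acts_trusted: [N, d] activations collected on the trusted set."""
    mean = acts_trusted.mean(axis=0)
    cov = np.cov(acts_trusted, rowvar=False) + 1e-4 * np.eye(acts_trusted.shape[1])
    return mean, np.linalg.inv(cov)

def anomaly_scores(acts_untrusted, mean, cov_inv):
    """Squared Mahalanobis distance of each untrusted activation from the
    trusted fit; high-scoring examples get flagged for expensive review."""
    centered = acts_untrusted - mean
    return np.einsum("nd,de,ne->n", centered, cov_inv, centered)
```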

mechanistic anomaly detection methods that work for ELK should also probably work for other problems (such as backdoor detection and adversarial example detection)

so when developing benchmarks for mechanistic anomaly detection, we want to test methods both on standard machine learning security problems (adversarial examples and trojans) that have similar structure to scalable oversight problems, and against other ELK approaches (e.g. CCS) and other scalable oversight approaches (e.g. debate)

oh I see, by all(sensor_preds) I meant sum(logit_i for i in range(n_sensors)) (the probability that all sensors are activated). Makes sense, thanks!

is individual measurement prediction AUROC (a) or (b)?
a) mean(AUROC(sensor_i_pred, sensor_i)) 

b) AUROC(all(sensor_preds), all(sensors))
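
Spelling out the two options in code so it's unambiguous what I'm asking (array names are placeholders, not from the repo):

```python
# The two candidate metrics written out. Array names are placeholders:
# sensor_logits is [N, n_sensors] predicted logits, sensors is [N, n_sensors]
# binary ground-truth measurements.
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_a(sensor_logits, sensors):
    """(a): mean of per-sensor AUROCs."""
    return np.mean([roc_auc_score(sensors[:, i], sensor_logits[:, i])
                    for i in range(sensors.shape[1])])

def auroc_b(sensor_logits, sensors):
    """(b): AUROC of the aggregated score (sum of per-sensor logits, as in the
    comment above) against the event that all sensors are positive."""
    return roc_auc_score(sensors.all(axis=1), sensor_logits.sum(axis=1))
```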

looking at your code - seems like there's an option for next-token prediction in the initial finetuning stage, but no mention (that I can find) in the paper - am I correct in assuming the next-token prediction weight was set to 0? (apologies for bugging you on this stuff!)

did the paper report accuracy of the pure prediction model (on the pure prediction task)? (trying to replicate and want a sanity check). 

I think mechanistic anomaly detection (mostly ARC but also Redwood and some forthcoming work) is importantly different than robustness (though clearly related).
