I think what you're saying about modifying samples is that we have a dense reward here, i.e. logP(x'|x) - logP(x'), that can be cheaply evaluated for any token-level change? This makes exploration faster compared to the regular critique/debate setting where dense rewards can only be noisily estimated as e.g. in an advantage function.
This is not what I meant. Regularization terms (e.g. KL divergence) are almost always dense anyway. I was pointing out that you have the classic RL problem of only having one "real" reward information (y'|x') per trajectory.
About the ontology mismatch stuff, I think the core idea is not logP(x'|x) - logP(x'), but the idea about how you use sample-level exploration to spot contradictions and use that to update the predictor of y. I think this is a potentially cool and under-explored idea (though I'm not sure if it is applicable beyond the kind of "near-miss false negative results" from your code example). A follow-up post I would find very interesting would flesh out exactly how this training works, detail a broader range of situations where it is useful, and demonstrate that it works in a toy setting.
Agreed. I'd say current theoretical ML is some weak version of "theoretically-sound arguments about protocols and AIs", but it might become better in the future.
Mmh, logP(x'|x) - logP(x') is quite a crazy beast when I think about it. For instance, if it's obvious from x'<t and x'<t|x that the next word is "the" (logP ≈ 0), and that for both the logP of seeing "avocado" is -10, then you get no penalty from outputting "avocado"? Is the main reason why you don't get gibberish that P(x') is higher entropy than P(x'|x), and therefore the network is rewarded for choosing one of the few high prob tokens in x'|x? Or is it mostly that you will choose y' so that P(y|x') can only be high for non-gibberish looking text?
Also, since you have to use rejection sampling / RL (since you only get a single y'|x' per sample), you probably don't get better exploration properties than regular RL? The main advantage is that you can modify samples instead of making changes to policies?
I like this proposal as an alternative to regular critiques, and I'm curious how this would work for more realistic tasks. I had never seen the P(x'|x)/P(x') idea in action, and I'm slightly surprised by how well it works.
But I think it's not very useful for ambitious ELK (extracting superhuman knowledge), since in most ambitious situations, the best way to fix the code is extremely unlikely in the human prior - most ways to fix experiments are not "changes human might have done when rewriting/paraphrasing the code", and therefore logP(x'|x) will be tiny for the kind of x' you are looking for. So even if the AI was actually trying its best to maximize your metric, I argue it would often not help you.
The proposal also has the usual exploration hacking issues (e.g. if there is a flaw humans will never know about, the y predictor and the x' predictor can "agree" to never reveal it, and they would never get caught), but this also plagues all other scalable oversight techniques, so it's not a weakness of this approach in particular.
Agreed about the red line. It's probably the main weakness of the eval-then-stop strategy. (I think progress in elicitation will slow down fast enough that it won't be a problem given large enough safety margins, but I'm unsure about that.)
I think that the data from evals could provide a relatively strong ground on which to ground a pause, even if that's not what labs will argue for. I think it's sensible for people to argue that it's unfair and risky to let a private actor control a resource which could cause catastrophes (e.g. better bioweapons or mass cyberattacks, not necessarily takeover), even if they could build good countermeasures (especially given a potentially small upside relative to the risks). I'm not sure if that's the right thing to do and argue for, but surely this is the central piece of evidence you will rely on if you want to argue for a pause?
Agreed that it's sad if AI labs control what evals are being run. Doesn't seem to be the case in practice (even when the AI lab runs the eval themselves, what you should measure is usually decided elsewhere + there are many mostly independent orgs).
We don't have good ways to conduct evals
I think this is relatively weak. Consider the baseline elicitation technique where domain expert + LLM expert build a supervised training set of LLM-friendly CoT (+tools) and then do some version of pass@k, and labs check what users are using models for. There are 3 ways in which that could fail:
I think people massively over-index on prompting being difficult. Fine-tuning is such a good capability elicitation strategy!
I don't get the stance on evals.
Example consequence: Why Conjecture is not sold on evals being a good idea. Even under the assumption that evals add marginal safety if they are applied during the training of frontier models from AGI companies, evals labs could still be negative. This comes from the fact that eval labs are part of the AGI orthodoxy ecosystem. They add marginal safety but nowhere near a satisfactory level. If eval labs become stronger, this might make them appear as sufficient safety interventions and lock us in the AGI orthodoxy worldview.
My guess is that:
With which points do you disagree?
The goal of the second experiment is the same, expect it should produce a stegotext with young/old instead of red/blue.
The only thing that is changed is that instead of using regular RL (which we can't do because of limitations of the OAI fine-tuning API), we do this select then fine-tune procedure, which should be roughly equivalent to RL on the reward "1 if the GPT-3.5 young/old (fine-tuned) classifier is correct and GPT-4 doesn't think it's obvious if it's young or old, 0 otherwise".
The key point I want to make with the idea that text primarily functions as reports on processes is that this is what we’re observing with a language model’s outputs. One way to think about it is that our prompt shines a lens onto some specific part of a simulation.
Possibly true in the limit of perfect simulators, almost surely false with current LLMs?
A key fact is that LLMs are better at solving some tasks with CoT than they are at solving tasks 0-shot when fine-tuned to do it, and this is also true when CoT are paraphrased, or when you write part of the CoT by hand. This means that the tokens generated are actually a huge part of why the model is able to do the task in the first place, and I think it's not a stretch to say that, in these conditions, the LLMs is actually using the generated text as the substrate for reasoning (and using it in the same way a human thinking out loud would).
But I agree with the general "skeptic of LLM justifications by default" vibe. I just think it's not justified for many tasks, and that LLMs performing badly when fine-tuned for many sorts of tasks is good evidence that LLMs are not hiding huge capabilities and are usually "just doing their best", at least for tasks people care enough about that they trained models to be good at them - e.g. all tasks in the instruction fine-tuning dataset.
I also agree with being very skeptical of few-shot prompting. It's a prompting trick, it almost never actual in-context learning.
Haven't run such experiments.
Note that you can lock a model such that it remains locked even with few-shot prompting by just training on such examples (training a model to ignore the few-shot examples is easy).
Also, few-shot prompting sometimes suck at eliciting capabilities even in the wild. For instance, few-shot prompting sucks at eliciting sorting abilities from GPT-3, and few-shot prompting with random labels ""works"", which means models can't learn a counter-intuitive task on a dataset that hints at another task.
What I am currently running are experiments where you lock a model, and then measure if fine-tuning for many epochs on very few high-quality samples (e.g. 8) unlocks it. For this setup, it's unclear how you'd lock if you don't know in advance what the few high-quality samples will be. Preliminary results suggest that it's very easy to unlock pwd-locked models with few samples.