Yep, happy to chat!
Yep. For empirical work I'm in favor of experiments with more informed + well-trained human judges who engage deeply etc, and having a high standard for efficacy (e.g. "did it get the correct answer with very high reliability") as opposed to "did it outperform a baseline by a statistically significant margin" where you then end up needing high n and therefore each example needs to be cheap / shallow
IMO the requirements are a combination of stability and compactness - these trade off against each other, and the important thing is the rate at which you get evidence for which debater is dishonest while exploring the tree.
iiuc, the stability definition used here is pretty strong - says that the error in the parent is smaller than the largest error across the children. So any argument structure where errors can accumulate (like a conjunctive argument, or a proof which requires all the steps to be correct) is out.
I was really impressed by the technical work in this paper. Getting to a formalization of the problem setup and the protocol that allows you to prove meaningful things is a big accomplishment.
However, as the authors mention above, I don't think this should be a substantial update about whether obfuscated arguments are a problem for recursive decomposition approaches to scalable oversight. (I think the discussion in this post is fine, but I think the title of the paper "avoiding obfuscation with prover-estimator debate" is a bit misleading. I believe the authors are going to change this in v2.)
I'm excited about more empirical work on making debate protocols work in practice. I feel a bit less excited about pure theory work, but I think work that mixes experimentation and theory could be helpful.
I think there are broadly two classes of hope about obfuscated arguments:
(1.) In practice, obfuscated argument problems rarely come up, due to one of:
(2.) We can create a protocol that distinguishes between cases where:
This is the way (apart from argument size) in which the primality test example differs from the obfuscated factorization example: the debaters have some high-level mathematical concepts which allow them to quickly determine whether some proposed lemma is correct.
This wouldn't get us to full ELK (bc maybe models still know things they have no human-understandable arguments for), but would at least expand the class of honest arguments that we can trust to include ones that are large + unstable in human-understandable form but where the debaters do have a faster way of identifying which subtree to go down.
I'm glad you brought this up, Zac - seems like an important question to get to the bottom of!
METR is somewhat capacity constrained and we can't currently commit to e.g. being available on a short notice to do thorough evaluations for all the top labs - which is understandably annoying for labs.
Also, we don't want to discourage people from starting competing evaluation or auditing orgs, or otherwise "camp the space".
We also don't want to accidentally safety-wash -that post was written in particular to dispel the idea that "METR has official oversight relationships with all the labs and would tell us if anything really concerning was happening"
All that said, I think labs' willingness to share access/information etc is a bigger bottleneck than METR's capacity or expertise. This is especially true for things that involve less intensive labor from METR (e.g. reviewing a lab's proposed RSP or evaluation protocol and giving feedback, going through a checklist of evaluation best practices, or having an embedded METR employee observing the lab's processes - as opposed to running a full evaluation ourselves).
I think "Anthropic would love to pilot third party evaluations / oversight more but there just isn't anyone who can do anything useful here" would be a pretty misleading characterization to take away, and I think there's substantially more that labs including Anthropic could be doing to support third party evaluations.
If we had a formalized evaluation/auditing relationship with a lab but sometimes evaluations didn't get run due to our capacity, I expect in most cases we and the lab would want to communicate something along the lines of "the lab is doing their part, any missing evaluations are METR's fault and shouldn't be counted against the lab".
This is a great post, very happy it exists :)
Quick rambling thoughts:
I have some instinct that a promising direction might be showing that it's only possible to construct obfuscated arguments under particular circumstances, and that we can identify those circumstances. The structure of the obfuscated argument is quite constrained - it needs to spread out the error probability over many different leaves. This happens to be easy in the prime case, but it seems plausible it's often knowably hard. Potentially an interesting case to explore would be trying to construct an obfuscated argument for primality testing and seeing if there's a reason that's difficult. OTOH, as you point out,"I learnt this from many relevant examples in the training data" seems like a central sort of argument. Though even if I think of some of the worst offenders here (e.g. translating after learning on a monolingual corpus) it does seem like constructing a lie that isn't going to contradict itself pretty quickly might be pretty hard.
I'd be surprised if this was employee-level access. I'm aware of a red-teaming program that gave early API access to specific versions of models, but not anything like employee-level.
Also FWIW I'm very confident Chris Painter has never been under any non-disparagement obligation to OpenAI
I signed the secret general release containing the non-disparagement clause when I left OpenAI. From more recent legal advice I understand that the whole agreement is unlikely to be enforceable, especially a strict interpretation of the non-disparagement clause like in this post. IIRC at the time I assumed that such an interpretation (e.g. where OpenAI could sue me for damages for saying some true/reasonable thing) was so absurd that couldn't possibly be what it meant.
[1]
I sold all my OpenAI equity last year, to minimize real or perceived CoI with METR's work. I'm pretty sure it never occurred to me that OAI could claw back my equity or prevent me from selling it. [2]
OpenAI recently informally notified me by email that they would release me from the non-disparagement and non-solicitation provisions in the general release (but not, as in some other cases, the entire agreement.) They also said OAI "does not intend to enforce" these provisions in other documents I have signed. It is unclear what the legal status of this email is given that the original agreement states it can only be modified in writing signed by both parties.
As far as I can recall, concern about financial penalties for violating non-disparagement provisions was never a consideration that affected my decisions. I think having signed the agreement probably had some effect, but more like via "I want to have a reputation for abiding by things I signed so that e.g. labs can trust me with confidential information". And I still assumed that it didn't cover reasonable/factual criticism.
That being said, I do think many researchers and lab employees, myself included, have felt restricted from honestly sharing their criticisms of labs beyond small numbers of trusted people. In my experience, I think the biggest forces pushing against more safety-related criticism of labs are:
(1) confidentiality agreements (any criticism based on something you observed internally would be prohibited by non-disclosure agreements - so the disparagement clause is only relevant in cases where you're criticizing based on publicly available information)
(2) labs' informal/soft/not legally-derived powers (ranging from "being a bit less excited to collaborate on research" or "stricter about enforcing confidentiality policies with you" to "firing or otherwise making life harder for your colleagues or collaborators" or "lying to other employees about your bad conduct" etc)
(3) general desire to be researchers / neutral experts rather than an advocacy group.
To state what is probably obvious: I don't think labs should have non-disparagement provisions. I think they should have very clear protections for employees who wanted to report safety concerns, including if this requires disclosing confidential information. I think something like the asks here are a reasonable start, and I also like Paul's idea (which I can't now find the link for) of having labs make specific "underlined statements" to which employees can anonymously add caveats or contradictions that will be publicly displayed alongside the statements. I think this would be especially appropriate for commitments about red lines for halting development (e.g. Responsible Scaling Policies) - a statement that a lab will "pause development at capability level x until they have implemented mitigation y" is an excellent candidate for an underlined statement
Regardless of legal enforceability, it also seems like it would be totally against OpenAI's interests to sue someone for making some reasonable safety-related criticism.
I would have sold sooner but there are only intermittent opportunities for sale. OpenAI did not allow me to donate it, put it in a DAF, or gift it to another employee. This maybe makes more sense given what we know now. In lieu of actually being able to sell, I made a legally binding pledge in Sep 2021 to donate 80% of any OAI equity.
I can write more later, but here's a relevant doc I wrote as part of discussion with Geoffrey + others. Maybe the key point from there is that I don't think this protocol solves the examples given in the original post describing obfuscated arguments. But yeah, I was always thinking this was a completeness problem (the original post poses the problem as distinguishing a certain class of honest arguments from dishonest obfuscated arguments - not claiming you can't get soundness by just ignoring any arguments that are plausibly obfuscated.)