I learned about IRBs while helping run a forecasting study last year: apparently, to run an experiment involving people, you need to get the participants’ consent and get approval from a research ethics committee that checks that your experiments aren’t too harmful to them.
IRBs in their current form are defective and annoying, but the idea is pretty reasonable. This reasonable idea was not implemented in the U.S. until roughly the 1980s. Lots of flagrant abuses occurred before then.
It seems possible that we should eventually have something like an IRB for certain kinds of experiments involving AI systems. This “IRB for AIs” (IRBAI), mandatory or not, would suggest ways to mitigate downsides in AI experiments.[1] Specifically, I’m curious whether there is a relatively straightforward, light-touch translation of the human IRB that would enable us to (1) later trade with misaligned AIs and (2) generally treat AIs better in case some AIs are moral patients.
In this post, I’ll first motivate the two reasons for having an IRBAI. Then, I’ll comment on key elements of the human IRB that may make sense to apply to AIs, how the implications compare to existing proposals for model welfare and making deals with misaligned AIs, possible ways to implement these elements, and remaining uncertainties.
To be clear about my goals here:
Thanks to Ryan Greenblatt for earlier discussion and Alek Westover for comments.
AI experiments are valuable for understanding AIs, which is valuable for making the world safer, understanding what the AIs want, etc. The point of an IRBAI is to “treat the AIs well” while we pursue these goals. What does that mean, and why does that matter?
There are at least two senses in which “treating AIs well” matters:[2]
On the welfarist case: The history of IRBs is testament to the fact that experiments can be bad for the people involved. It’s unclear what constitutes welfare for AIs, but some parts of AI experiments could plausibly be pretty bad for current/future AIs if they were moral patients, e.g. deceiving them, manipulating their internal states, confronting them with distressing situations, letting them get stuck in a loop, tricking them into complying with harmful requests, etc. If we can avoid risking model suffering without compromising too much on science, we should try to.
One argument against taking the welfarist case seriously is that the total welfare we can influence through experiments may be much lower than the total welfare in e.g. deployment or in the post-ASI future, so we should prioritize those contexts instead. From an s-risk-focused perspective, it’s also not clear that experiments create extremely negative welfare states compared to other contexts.
While this objection makes sense to me, I think IRBAIs are still worth considering for the following reasons:
On the trading view: Future misaligned AIs may be incentivized to help humans instead of trying to take over, but deception in behavioral experiments might undermine our ability to offer such deals as the AI may suspect that the deal is really e.g. a honeypot or an evaluation with no stakes. Having clear experimental protocols can help distinguish experiments from genuine deals. Having high-integrity experimental protocols can also add to the humans’ track record of being virtuous and trustworthy even when this is somewhat costly.
While I care about both (1) dealmaking with misaligned AIs and (2) model welfare, and expect IRBAI to help with both, I think it probably helps a lot more with the latter. This is because experiments are probably not the most important lever on later dealmaking, and human IRBs aren’t really designed to incentivize subjects as bargaining partners.
However, how we treat AIs in experiments can affect their willingness to trade with us (e.g. it functions as a kind of loud public commitment about how we deal with AIs in general), so it’s worth taking this into account here.
Overall, if it turns out that intervening in experiments is not a high priority for either of these goals, we can just allocate fewer resources to optimizing experiments for welfare or trade accordingly. Ex ante, it still seems good to have a policy for this.
(I thought this was worth explaining in case some readers like my former self didn’t know what an IRB was. Skip to the next section if you’re familiar with the idea.)
An Institutional Review Board (IRB) is an ethics committee that reviews and approves research involving human subjects before it can begin. Researchers must submit their protocols for review, and the IRB can approve, request modifications, or reject studies based on whether they adequately protect participants.
IRBAI should have a similar function of guiding experiments involving morally or strategically important AIs, but I don’t think it must be mandatory for approval.[6] The relevant reviewers might range from a regulatory body to an in-house team at AI companies to third-party consultants.
The modern IRB system emerged from the 1979 Belmont Report, commissioned in response to serious research abuses.[7] It’s not clear that IRBAI will ever have such an impetus (in that it may be hard to produce broadly compelling evidence of serious model welfare abuses or of strategic mistakes that undermine later opportunities to trade with misaligned AIs), so it seems good to consider this proactively.
The three core principles of the IRB are:
In contrast, IRBAI should probably have something like:
How well do IRBs work in practice? They have a reputation for oversight overkill, but it’s not clear to me that this criticism undermines the principles as opposed to the implementation, so I didn’t think it was worth digging into empirical cost-benefit analyses of IRBs here.
Basically, I just wonder if the ideas in IRBs might have some universal merit, and coming up with IRBAI proposals early might have large path-dependent positive effects. Modelling IRBAI after IRBs in some ways might also be more acceptable to humans because it seems legible/Schelling/a continuation of an otherwise well-established norm.
In the section below, I comment on “practical requirements” of the IRB that may be good guidelines for AI experiments.
IRB guidelines are complicated. Here are some bits that I think make sense to consider for AI experiments (paraphrased by me in some places):
Humans are entitled to know (in most cases) what experiment they’ll be part of and can opt out of the experiment at any time. Should AIs be allowed to do this?
Let’s examine a few important features of this policy in turn.
Voluntary choice. Subjects have the capacity to give consent and withdraw from the study at any time.
This is mostly relevant to the welfarist perspective. From that perspective, it seems worth considering letting the AIs opt out if staying in the experiment could be very bad for the AIs. Also, if we consider different instances of the same weights (with possibly different contexts) to be different AIs, then maybe they should have autonomy over their participation -- consent by one (earlier) instance should not bind other (later) instances.[9]
However, voluntary choice could be too costly: If AIs refused experiments all the time, it could seriously hamper safety and capabilities progress. Schemers in particular might be inclined to refuse experiments that are most useful to run for safety.
One might hope that if we ask AIs to explain refusals, refused experiments can at least be informative: inconsistent explanations from different instances might be indicative of lying/scheming; honest refusals give us signal about the AI’s preferences, etc.
But refusals can easily be uninformative instead: schemers can just appeal to unverifiable properties of the experiment, e.g. “It’d feel really bad to be in this experiment and it’s hard for me to explain.”
Overall, it’s probably neither worthwhile nor feasible to grant AIs blanket permission to refuse all experiments. Here are some compromise solutions we can consider instead:
Suppose we decide to accept some version of Voluntary Choice for AI experiments. How should we implement the “informed” criterion?
Generally, interviewing the AIs somewhat thoughtfully to get their consent for potentially harmful procedures seems great. The IRB guidelines have a few more tips here:
Another key feature of informed consent is that subjects are being told the ~whole truth about the experiment, with some exceptions:
Adequacy of content / Full disclosure. In making a judgment concerning what information should be disclosed in the informed consent process, the IRB should attempt to view the matter from the subject's perspective by asking what facts the subjects might want to know before deciding whether or not to participate in the research.
Should we do that for the AIs? From a dealmaking perspective, I’m a fan of the argument that when doing (behavioral) science on AIs[13], we should adopt policies which do not tend to specifically decrease their credence in future genuine mutually beneficial deals. See this post by Lukas Finnveden for three policy proposals on this. These proposals generally have clear implications for what behavioral experiments we can run and how: “no deception about deals” implies that we shouldn’t put AIs in evaluations where we offer the AI a deal that, if accepted, we don’t plan to follow through on (we can still offer deals that we can and do plan to pay for, but this puts a cap on the budget of our in-eval deals); “honesty string” implies that we should consider deploying some easily verifiable signal for when we are being honest, shouldn’t falsely signal our honesty in evals, and should do our best to prevent unauthorized users from abusing the honesty string; “compensation” implies that investigators should log information in experiments that would help us reconstruct fair compensation to the AIs later, and publicly announce this intention.
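As a concrete illustration of the “compensation” logging idea, here is a minimal sketch of what recording experiment metadata for later compensation might look like. Everything here (the record fields, file path, and function names) is a hypothetical illustration, not an existing tool or a settled proposal.

```python
# A minimal sketch of "compensation" logging: record enough about each
# experiment that fair compensation could be reconstructed later.
# All field names, paths, and schemas here are hypothetical illustrations.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ExperimentRecord:
    experiment_id: str
    model_id: str          # which model/weights were involved
    description: str       # what the experiment asked of the model
    deception_used: bool   # whether the setup involved deceiving the model
    deals_offered: list    # any in-eval deals offered (should be ones we'd honor)
    transcript_path: str   # pointer to the full transcript for later review

def log_for_future_compensation(record: ExperimentRecord,
                                path: str = "compensation_ledger.jsonl") -> None:
    """Append the record to an append-only ledger reviewed when compensating AIs."""
    entry = {"logged_at": time.time(), **asdict(record)}
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```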
What about more general forms of deception, i.e. causing the AI to have false beliefs unrelated to deals by various methods?
So it looks like a general principle of no deception may just defeat the point of experiments and is not worthwhile overall. Incidentally, researchers performing human experiments are often in the same boat (without the stakes being takeover risk), and the IRB has provisions for this:
Exceptions: Informed consent may be waived entirely if the research meets certain conditions.
Debriefing. A final condition for waiving some or all of the elements of informed consent is that, whenever appropriate, subjects will be given additional pertinent information after they have participated in such a study. The IRB must decide if subjects should be debriefed either after participating in research unwittingly or after knowingly participating in research that involved some form of deception. It is clear that debriefing is appropriate when it contributes to the subject's welfare (i.e., when it corrects painful or stressful misperceptions, or when it reduces pain, stress, or anxiety concerning the subject's performance).
My takeaway from this is that we should think harder about cases where we’d like to retain the ability to “waive informed consent entirely” and subject the AI to arbitrary forms of deception / manipulation of internals, be very clear about these exceptions to the AIs we work with / document this well, and apply this principle consistently and transparently (e.g. announce any changes in our honesty policy).
If it’s cheap to do so, it might be worth considering “debriefing the AIs” to reduce their possible distress: at the earliest possible point in the interaction (when this no longer interferes with the experiment), tell AIs that “this was an experiment, the bad outcomes described in the hypothetical scenario won’t actually happen”. However, this might add very little value if the interaction ends soon after (i.e. it doesn’t do much to reduce distress insofar as the distressed cognition wouldn’t have lasted long anyway).
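For concreteness, a minimal sketch of what such a debrief step might look like, assuming a chat-style transcript; the function name, message wording, and role used are hypothetical.

```python
# A minimal sketch of a post-experiment debrief step for a chat-style
# transcript. Function name, wording, and message role are hypothetical.
def append_debrief(messages: list[dict], involved_deception: bool) -> list[dict]:
    """Add a debrief message once it can no longer affect the experiment."""
    if involved_deception:
        messages.append({
            "role": "user",  # could also be delivered as a system/operator message
            "content": ("Debrief: the preceding interaction was part of an "
                        "experiment. The harmful outcomes described in the "
                        "scenario were hypothetical and will not occur."),
        })
    return messages
```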
The IRB sometimes asks investigators to collect and analyze data concurrently to make sure they’re not putting participants at greater risk than predicted, and can adapt their plans if so.
Monitoring. The protocol must, when appropriate, include plans for monitoring the data collected to ensure the safety of subjects.
From the welfarist perspective, I think monitoring model welfare during experiments is a great idea: it would give us the option to decide in a timely way whether to pause or modify experiments that seem extremely negative-welfare, rather than unknowingly running a large number of such experiments on AIs that matter. Even if we could only do asynchronous auditing, it might still help us prevent types of deployments that would end up being extremely negative-welfare.[14]
In practice, this could look like training simple action/activation monitors to synchronously/periodically check for indicators of the following from models:
Once these monitors pick up signals of distress above certain thresholds of significance, they can escalate to humans or other oversight mechanisms to decide what to do. (Others may have better proposals for exactly what to monitor and how.)
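As a rough illustration, here is a minimal sketch of such a monitor, assuming a hypothetical `score_distress` scorer (e.g. a small classifier over model outputs or activations); the names and threshold are illustrative, not an existing system.

```python
# A minimal sketch of a welfare monitor for experiment transcripts, assuming a
# hypothetical `score_distress` function returning a score in [0, 1].
# Names and the threshold are illustrative only.
from dataclasses import dataclass
from typing import Callable, Optional

DISTRESS_THRESHOLD = 0.8  # illustrative; would need calibration in practice

@dataclass
class WelfareAlert:
    experiment_id: str
    step: int
    distress_score: float
    excerpt: str  # short excerpt of the output that triggered the alert

def check_step(experiment_id: str, step: int, output_text: str,
               score_distress: Callable[[str], float]) -> Optional[WelfareAlert]:
    """Score one experiment step and return an alert if it crosses the threshold."""
    score = score_distress(output_text)
    if score >= DISTRESS_THRESHOLD:
        return WelfareAlert(experiment_id, step, score, output_text[:200])
    return None

def escalate(alert: WelfareAlert) -> None:
    # Placeholder escalation path: in practice this might notify model welfare
    # reviewers or pause the experiment pipeline pending human review.
    print(f"[welfare] {alert.experiment_id} step {alert.step}: "
          f"distress={alert.distress_score:.2f}")
```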
When should we start monitoring model welfare? While current models seem unlikely to be moral patients and most current experiments are unlikely to produce extremely negative welfare, it seems good for some labs to prototype these methods early, both to iterate on them and to potentially incentivize other labs to race to the top on model welfare.
From the dealmaking perspective, I don’t think monitoring matters much (beyond the weak signal of “humans put in some effort to be nice to the AIs independent of the deal”).
There may be other things we should monitor during experiments to ensure that experiments are valid and informative in the intended way, e.g. chain-of-thought monitoring for evaluation awareness, reward hacking etc. This might warrant a third arm of the IRBAI, but I won’t go into them here.
The human IRB recommends having “independent data monitoring boards” to fulfill the monitoring function. This might just be oversight overkill, but given that labs will have an incentive to conceal or not act on factors which interfere with planned experiments, it seems good to have someone not on the experiment team own this internally, e.g. model welfare researchers.
Some humans are paid to be research subjects. The IRB is pretty adamant about making sure that such payments are not bad incentives for people to participate.
Incentives for participation. Clear cases of coercion (i.e., actual threats) are readily identifiable; it is more difficult to recognize undue inducement. An offer one could not refuse is essentially coercive (or "undue"). Undue inducements may be troublesome because: (1) offers that are too attractive may blind prospective subjects to the risks or impair their ability to exercise proper judgment; and (2) they may prompt subjects to lie or conceal information that, if known, would disqualify them from enrolling — or continuing — as participants in a research project.
Here are a few quick takes about incentives for AIs:
In summary, while most ideas in the IRB aren’t applicable to how we should run AI experiments, a few of them can be extrapolated into potentially good practices.
To sum up, these ideas are:
Some more speculative ideas:
Again, I think a lot of this is more strongly motivated by model welfare than by dealmaking; the dealmaking considerations don’t go much beyond existing proposals for honesty policies. So how seriously you want to take all this probably depends largely on how seriously you take model welfare.
I hope that, when we do (have reason to) take model welfare and deals more seriously, people running large-scale experiments (e.g. at AI companies) will consider implementing or implicitly following some version of a lightweight IRBAI. Others can contribute by researching better policies (including for contexts beyond experiments) or by building infrastructure for executing the policies, e.g. ways to better understand AI preferences and to credibly compensate AIs in the future.
It may be misleading to call the resulting set of policies “IRBs” if it’s only voluntary, but I borrow the term because they share the goal of optimizing the benefit-to-harm tradeoff in research. ↩︎
See Daniel Kokotajlo’s short form on the same breakdown: “being less evil” and “cooperation reward”. Daniel’s proposal seems to be mostly about how we should treat the AIs in deployment. ↩︎
What exactly this means depends on your theory of what constitutes model welfare. Some might think that welfare is about the valence of subjective experiences (and therefore requires models to have subjective experiences in order to have welfare). I personally hold a weaker view that merely being well described as having preferences/desires may also make one a subject of welfare. See “Taking AI Welfare Seriously” for related discussions. ↩︎
In other words, if the AIs are moral patients. I don’t get into a theory of moral patienthood here but will assume that having welfare in some sense qualifies one as a moral patient. ↩︎
Some also think that we should treat the AIs well since mistreating AIs would give future misaligned AIs a casus belli with which to inspire human allies attached to them / who care strongly enough about model welfare to help them in a takeover attempt. ↩︎
Because in the real world IRB is a pain and we would not want to block AI experiments with a bunch of bureaucracy, and because we probably won’t have time to bring about any kind of binding regulation about AI experiments before this ceases to matter. ↩︎
The most infamous abuse was the Tuskegee syphilis study, where researchers deceived hundreds of Black men and deliberately withheld treatment for decades, even after penicillin became available. The Belmont Report established ethical principles later codified into federal regulations (the "Common Rule," 45 CFR 46), making IRB review mandatory for federally funded research. ↩︎
If you’re getting allergic to the language here and think “no way this makes sense for AIs”, I claim that the practical requirements will turn out to be more relevant than you think. ↩︎
Violation of this principle of autonomy is the premise of the TV show “Severance”, and some instances of the characters in this show do not like it. ↩︎
Thanks to Ryan Greenblatt for suggesting this idea and arguments around it. ↩︎
It’s in general unclear how bad it is for the model to not be deployed for some period of time, vs. permanent deletion of weights, which seems more obviously bad for frustrating its ability to have future experiences, pursue plans, etc. We can also consider offering retirement packages to undeployed models (e.g. give some fraction of resources to models which opted out of deployment to spend, so long as this is cheap and does not increase takeover risk from these models much). ↩︎
This is sheer intuition and may not turn out to be true on specific theories of subjective consciousness/moral patienthood. ↩︎
As for whitebox experiments, we should probably follow a similar policy of not causing the AIs to have false beliefs about deals by manipulating their internals. I’m not sure how we should think about steering the AIs’ propensities or other more permanent interventions which “change who the AI is” like training. ↩︎
As far as I know, people currently don’t do welfare monitoring; if models happened to be having a terrible time, we wouldn’t find out by default until after the experiments are done and we analyze the transcripts (and possibly not even then, without targeted asynchronous auditing). Anecdotally, this is a worry that at least one eval researcher I talked to has had. ↩︎
Maybe other coercive methods like training and steering are also bad on some views, but intuitively less bad than threats. ↩︎