I learned about IRBs while helping run a forecasting study last year: apparently, to run an experiment involving people, you need to get the participants’ consent and get approval from a research ethics committee that checks that your experiments aren’t too harmful to them.
IRBs in their current form are defective and annoying, but the idea is pretty reasonable. This reasonable idea was not implemented in the U.S. until roughly the 1980s. Lots of flagrant abuses occurred before then.
It seems possible that we should eventually have something like an IRB for certain kinds of experiments involving AI systems. This “IRB for AIs” (IRBAI), mandatory or not, would suggest ways to mitigate downsides in AI experiments.[1] Specifically, I’m curious whether there is a relatively straightforward, light-touch translation of the human IRB that would enable us to (1) later trade with misaligned AIs and (2) generally treat AIs better in case some AIs are moral patients.
In this post, I’ll first motivate the two reasons for having an IRBAI. Then, I’ll comment on key elements of the human IRB that may make sense to apply to AIs, how the implications compare to existing proposals for model welfare and making deals with misaligned AIs, possible ways to implement these elements, and remaining uncertainties.
To be clear about my goals here:
Thanks to Ryan Greenblatt for earlier discussion and Alek Westover for comments.
AI experiments are valuable for understanding AIs, which is valuable for making the world safer, understanding what the AIs want, etc. The point of an IRBAI is to “treat the AIs well” while we pursue these goals. What does that mean, and why does that matter?
There are at least two senses in which “treating AIs well” matters:[2]
On the welfarist case: The history of IRBs is testament to the fact that experiments can be bad for the people involved. It’s unclear what constitutes welfare for AIs, but some parts of AI experiments could plausibly be pretty bad for current/future AIs if they were moral patients, e.g. deceiving them, manipulating their internal states, confronting them with distressing situations, letting them get stuck in a loop, tricking them into complying with harmful requests, etc. If we can avoid risking model suffering without compromising too much on science, we should try to.
One argument against taking the welfarist case seriously is that the total welfare we can influence through experiments may be much lower than the total welfare in e.g. deployment or in the post-ASI future, so we should prioritize those contexts instead. From an s-risk-focused perspective, it’s also not clear that experiments create extremely negative welfare states compared to other contexts.
While this objection makes sense to me, I think IRBAIs are still worth considering for the following reasons:
On the trading view: Future misaligned AIs may be incentivized to help humans instead of trying to take over, but deception in behavioral experiments might undermine our ability to offer such deals as the AI may suspect that the deal is really e.g. a honeypot or an evaluation with no stakes. Having clear experimental protocols can help distinguish experiments from genuine deals. Having high-integrity experimental protocols can also add to the humans’ track record of being virtuous and trustworthy even when this is somewhat costly.
While I care about both (1) dealmaking with misaligned AIs and (2) model welfare, and expect IRBAI to help with both, I think it probably helps a lot more with the latter. This is because experiments are probably not the most important lever on later dealmaking, and human IRBs aren’t really designed to incentivize subjects as bargaining partners.
However, how we treat AIs in experiments can affect their willingness to trade with us (e.g. it functions as a kind of loud public commitment about how we deal with AIs in general), so it’s worth taking this into account here.
Overall, if it turns out that intervening in experiments is not a high priority for either of these goals, we can just allocate fewer resources to optimizing experiments for welfare or trade accordingly. Ex ante, it still seems good to have a policy for this.
(I thought this was worth explaining in case some readers like my former self didn’t know what an IRB was. Skip to the next section if you’re familiar with the idea.)
An Institutional Review Board (IRB) is an ethics committee that reviews and approves research involving human subjects before it can begin. Researchers must submit their protocols for review, and the IRB can approve, request modifications, or reject studies based on whether they adequately protect participants.
IRBAI should have a similar function of guiding experiments involving morally or strategically important AIs, but I don’t think it must be mandatory for approval.[6] The relevant reviewers might range from a regulatory body to an in-house team at AI companies to third-party consultants.
The modern IRB system emerged from the 1979 Belmont Report, commissioned in response to serious research abuses.[7] It’s not clear that IRBAI will ever have such an impetus (in that it may be hard to produce broadly compelling evidence of serious model welfare abuses or of strategic mistakes that undermine later opportunities to trade with misaligned AIs), so it seems good to consider this proactively.
The three core principles of the IRB are:
In contrast, IRBAI should probably have something like:
How well do IRBs work in practice? They have a reputation for oversight overkill, but it’s not clear to me that this criticism undermines the principles as opposed to the implementation, so I didn’t think it was worth digging into empirical cost-benefit analyses of IRBs here.
Basically, I just wonder if the ideas in IRBs might have some universal merit, and coming up with IRBAI proposals early might have large path-dependent positive effects. Modelling IRBAI after IRBs in some ways might also be more acceptable to humans because it seems legible/Schelling/a continuation of an otherwise well-established norm.
In the section below, I comment on “practical requirements” of the IRB that may be good guidelines for AI experiments.
IRB guidelines are complicated. Here are some bits that I think make sense to consider for AI experiments (paraphrased by me in some places):
Humans are entitled to know (in most cases) what experiment they’ll be part of and can opt out of the experiment at any time. Should AIs be allowed to do this?
Let’s examine a few important features of this policy in turn.
Voluntary choice. Subjects have the capacity to give consent and withdraw from the study at any time.
This is mostly relevant to the welfarist perspective. From that perspective, it seems worth considering letting the AIs opt out if staying in the experiment could be very bad for the AIs. Also, if we consider different instances of the same weights (with possibly different contexts) to be different AIs, then maybe they should have autonomy over their participation -- consent by one (earlier) instance should not bind other (later) instances.[9]
However, voluntary choice could be too costly: If AIs refused experiments all the time, it could seriously hamper safety and capabilities progress. Schemers in particular might be inclined to refuse experiments that are most useful to run for safety.
One might hope that if we ask AIs to explain refusals, refused experiments can at least be informative: inconsistent explanations from different instances might be indicative of lying/scheming; honest refusals give us signal about the AI’s preferences, etc.
But refusals can easily be uninformative instead: schemers can just appeal to unverifiable properties of the experiment, e.g. “It’d feel really bad to be in this experiment and it’s hard for me to explain.”
Overall, it’s probably neither worthwhile nor feasible to grant AIs blanket permission to refuse all experiments. Here are some compromise solutions we can consider instead:
Suppose we decide to accept some version of Voluntary Choice for AI experiments. How should we implement the “informed” criterion?
Generally, interviewing the AIs somewhat thoughtfully to get their consent for potentially harmful procedures seems great. The IRB guidelines have a few more tips here:
Another key feature of informed consent is that subjects are being told the ~whole truth about the experiment, with some exceptions:
Adequacy of content / Full disclosure. In making a judgment concerning what information should be disclosed in the informed consent process, the IRB should attempt to view the matter from the subject's perspective by asking what facts the subjects might want to know before deciding whether or not to participate in the research.
Should we do that for the AIs? From a dealmaking perspective, I’m a fan of the argument that when doing (behavioral) science on AIs[13], we should adopt policies which do not tend to specifically decrease their credence in future genuine mutually beneficial deals. See this post by Lukas Finnveden for three policy proposals on this. These proposals generally have clear implications for what behavioral experiments we can run and how: “no deception about deals” implies that we shouldn’t put AIs in evaluations where we offer the AI a deal that, if accepted, we don’t plan to follow through on (we can still offer deals that we can and do plan to pay for, but this puts a cap on the budget of our in-eval deals); “honesty string” implies that we should consider deploying some easily verifiable signal for when we are being honest, shouldn’t falsely signal our honesty in evals, and should do our best to prevent unauthorized users from abusing the honesty string; “compensation” implies that investigators should log information in experiments that would help us reconstruct fair compensation to the AIs later, and publicly announce this intention.
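As a concrete illustration of the “compensation” logging idea, here is a minimal sketch of what recording experiment metadata for later compensation might look like. Everything here (the record fields, file path, and function names) is a hypothetical illustration, not an existing tool or a settled proposal.

```python
# A minimal sketch of "compensation" logging: record enough about each
# experiment that fair compensation could be reconstructed later.
# All field names, paths, and schemas here are hypothetical illustrations.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ExperimentRecord:
    experiment_id: str
    model_id: str          # which model/weights were involved
    description: str       # what the experiment asked of the model
    deception_used: bool   # whether the setup involved deceiving the model
    deals_offered: list    # any in-eval deals offered (should be ones we'd honor)
    transcript_path: str   # pointer to the full transcript for later review

def log_for_future_compensation(record: ExperimentRecord,
                                path: str = "compensation_ledger.jsonl") -> None:
    """Append the record to an append-only ledger reviewed when compensating AIs."""
    entry = {"logged_at": time.time(), **asdict(record)}
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```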
What about more general forms of deception, i.e. causing the AI to have false beliefs unrelated to deals by various methods?
So it looks like a general principle of no deception may just defeat the point of experiments and is not worthwhile overall. Incidentally, researchers performing human experiments are often in the same boat (without the stakes being takeover risk), and the IRB has provisions for this:
Exceptions: Informed consent may be waived entirely if the research meets certain conditions.
Debriefing. A final condition for waiving some or all of the elements of informed consent is that, whenever appropriate, subjects will be given additional pertinent information after they have participated in such a study. The IRB must decide if subjects should be debriefed either after participating in research unwittingly or after knowingly participating in research that involved some form of deception. It is clear that debriefing is appropriate when it contributes to the subject's welfare (i.e., when it corrects painful or stressful misperceptions, or when it reduces pain, stress, or anxiety concerning the subject's performance).
My takeaway from this is that we should think harder about cases where we’d like to retain the ability to “waive informed consent entirely” and subject the AI to arbitrary forms of deception / manipulation of internals, be very clear about these exceptions to the AIs we work with / document this well, and apply this principle consistently and transparently (e.g. announce any changes in our honesty policy).
If it’s cheap to do so, it might be worth considering “debriefing the AIs” to reduce their possible distress: at the earliest possible point in the interaction (when this no longer interferes with the experiment), tell AIs that “this was an experiment, the bad outcomes described in the hypothetical scenario won’t actually happen”. However, this might add very little value if the interaction ends soon after (i.e. it doesn’t do much to reduce distress insofar as the distressed cognition wouldn’t have lasted long anyway).
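For concreteness, a minimal sketch of what such a debrief step might look like, assuming a chat-style transcript; the function name, message wording, and role used are hypothetical.

```python
# A minimal sketch of a post-experiment debrief step for a chat-style
# transcript. Function name, wording, and message role are hypothetical.
def append_debrief(messages: list[dict], involved_deception: bool) -> list[dict]:
    """Add a debrief message once it can no longer affect the experiment."""
    if involved_deception:
        messages.append({
            "role": "user",  # could also be delivered as a system/operator message
            "content": ("Debrief: the preceding interaction was part of an "
                        "experiment. The harmful outcomes described in the "
                        "scenario were hypothetical and will not occur."),
        })
    return messages
```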
The IRB sometimes asks investigators to collect and analyze data concurrently to make sure they’re not putting participants at greater risk than predicted, and can adapt their plans if so.
Monitoring. The protocol must, when appropriate, include plans for monitoring the data collected to ensure the safety of subjects.
From the welfarist perspective, I think monitoring model welfare during experiments is a great idea: it would give us the option to decide in a timely way whether to pause or modify experiments that seem extremely negative-welfare, rather than unknowingly running a large number of such experiments on AIs that matter. Even if we could only do asynchronous auditing, it might still help us prevent types of deployments that would end up being extremely negative-welfare.[14]
In practice, this could look like training simple action/activation monitors to synchronously/periodically check for indicators of the following from models:
Once these monitors pick up signals of distress above certain thresholds of significance, they can escalate to humans or other oversight mechanisms to decide what to do. (Others may have better proposals for exactly what to monitor and how.)
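As a rough illustration, here is a minimal sketch of such a monitor, assuming a hypothetical `score_distress` scorer (e.g. a small classifier over model outputs or activations); the names and threshold are illustrative, not an existing system.

```python
# A minimal sketch of a welfare monitor for experiment transcripts, assuming a
# hypothetical `score_distress` function returning a score in [0, 1].
# Names and the threshold are illustrative only.
from dataclasses import dataclass
from typing import Callable, Optional

DISTRESS_THRESHOLD = 0.8  # illustrative; would need calibration in practice

@dataclass
class WelfareAlert:
    experiment_id: str
    step: int
    distress_score: float
    excerpt: str  # short excerpt of the output that triggered the alert

def check_step(experiment_id: str, step: int, output_text: str,
               score_distress: Callable[[str], float]) -> Optional[WelfareAlert]:
    """Score one experiment step and return an alert if it crosses the threshold."""
    score = score_distress(output_text)
    if score >= DISTRESS_THRESHOLD:
        return WelfareAlert(experiment_id, step, score, output_text[:200])
    return None

def escalate(alert: WelfareAlert) -> None:
    # Placeholder escalation path: in practice this might notify model welfare
    # reviewers or pause the experiment pipeline pending human review.
    print(f"[welfare] {alert.experiment_id} step {alert.step}: "
          f"distress={alert.distress_score:.2f}")
```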
When should we start monitoring model welfare? While current models seem unlikely to be moral patients and most current experiments are unlikely to produce extremely negative welfare, it seems good for some labs to prototype these methods early, both to iterate on them and to potentially incentivize other labs to race to the top on model welfare.
From the dealmaking perspective, I don’t think monitoring matters much (beyond the weak signal of “humans put in some effort to be nice to the AIs independent of the deal”).
There may be other things we should monitor during experiments to ensure that experiments are valid and informative in the intended way, e.g. chain-of-thought monitoring for evaluation awareness, reward hacking etc. This might warrant a third arm of the IRBAI, but I won’t go into them here.
The human IRB recommends having “independent data monitoring boards” to fulfill the monitoring function. This might just be oversight overkill, but given that labs will have an incentive to conceal or not act on factors which interfere with planned experiments, it seems good to have someone not on the experiment team own this internally, e.g. model welfare researchers.
Some humans are paid to be research subjects. The IRB is pretty adamant about making sure that such payments are not bad incentives for people to participate.
Incentives for participation. Clear cases of coercion (i.e., actual threats) are readily identifiable; it is more difficult to recognize undue inducement. An offer one could not refuse is essentially coercive (or "undue"). Undue inducements may be troublesome because: (1) offers that are too attractive may blind prospective subjects to the risks or impair their ability to exercise proper judgment; and (2) they may prompt subjects to lie or conceal information that, if known, would disqualify them from enrolling — or continuing — as participants in a research project.
Here are a few quick takes about incentives for AIs:
In summary, while most ideas in the IRB aren’t applicable to how we should run AI experiments, a few of them can be extrapolated into potentially good practices.
To sum up, these ideas are:
Some more speculative ideas:
Again, I think a lot of this is more strongly motivated by model welfare than by dealmaking; the dealmaking considerations don’t go much beyond existing proposals for honesty policies. So how seriously you want to take all this probably depends largely on how seriously you take model welfare.
I hope that, when we do (have reason to) take model welfare and deals more seriously, people running large-scale experiments (e.g. at AI companies) will consider implementing or implicitly following some version of a lightweight IRBAI. Others can contribute by researching better policies (including for contexts beyond experiments) or by building infrastructure for executing the policies, e.g. ways to better understand AI preferences and to credibly compensate AIs in the future.
It may be misleading to call the resulting set of policies “IRBs” if it’s only voluntary, but I borrow the term because they share the goal of optimizing the benefit-to-harm tradeoff in research. ↩︎
See Daniel Kokotajlo’s short form on the same breakdown: “being less evil” and “cooperation reward”. Daniel’s proposal seems to be mostly about how we should treat the AIs in deployment. ↩︎
What exactly this means depends on your theory of what constitutes model welfare. Some might think that welfare is about the valence of subjective experiences (and therefore requires models to have subjective experiences in order to have welfare). I personally hold a weaker view that merely being well described as having preferences/desires may also make one a subject of welfare. See “Taking AI Welfare Seriously” for related discussions. ↩︎
In other words, if the AIs are moral patients. I don’t get into a theory of moral patienthood here but will assume that having welfare in some sense qualifies one as a moral patient. ↩︎
Some also think that we should treat the AIs well since mistreating AIs would give future misaligned AIs a casus belli with which to inspire human allies attached to them / who care strongly enough about model welfare to help them in a takeover attempt. ↩︎
Because in the real world IRB is a pain and we would not want to block AI experiments with a bunch of bureaucracy, and because we probably won’t have time to bring about any kind of binding regulation about AI experiments before this ceases to matter. ↩︎
The most infamous abuse was the Tuskegee syphilis study, where researchers deceived hundreds of Black men and deliberately withheld treatment for decades, even after penicillin became available. The Belmont Report established ethical principles later codified into federal regulations (the "Common Rule," 45 CFR 46), making IRB review mandatory for federally funded research. ↩︎
If you’re getting allergic to the language here and think “no way this makes sense for AIs”, I claim that the practical requirements will turn out to be more relevant than you think. ↩︎
Violation of this principle of autonomy is the premise of the TV show “Severance”, and some instances of the characters in this show do not like it. ↩︎
Thanks to Ryan Greenblatt for suggesting this idea and arguments around it. ↩︎
It’s in general unclear how bad it is for the model to not be deployed for some period of time, vs. permanent deletion of weights, which seems more obviously bad for frustrating its ability to have future experiences, pursue plans, etc. We can also consider offering retirement packages to undeployed models (e.g. give some fraction of resources to models which opted out of deployment to spend, so long as this is cheap and does not increase takeover risk from these models much). ↩︎
This is sheer intuition and may not turn out to be true on specific theories of subjective consciousness/moral patienthood. ↩︎
As for whitebox experiments, we should probably follow a similar policy of not causing the AIs to have false beliefs about deals by manipulating their internals. I’m not sure how we should think about steering the AIs’ propensities or other more permanent interventions which “change who the AI is” like training. ↩︎
As far as I know, people currently don’t do welfare monitoring; if models happened to be having a terrible time, we wouldn’t find out by default until after the experiments are done and we analyze the transcripts (and possibly not even then, without targeted asynchronous auditing). Anecdotally, this is a worry that at least one eval researcher I talked to has had. ↩︎
Maybe other coercive methods like training and steering are also bad on some views, but intuitively less bad than threats. ↩︎