I raised a similar proposal to various people a while ago.
The strongest objection I'm aware of is something like:
You actually want evaluators to have as much skin in the game as other employees so that when they take actions that might shut the company down or notably reduce the value of equity, this is a costly signal.
Further, it's good if evaluators are just considered normal employees and aren't separated out in any specific way. Then, other employees at the company will consider these evaluators to be part of their tribe and will feel some loyalty. (This also reduces the chance that risk evaluators feel like they are part of a rival "AI safety" tribe.) This probably has a variety of benefits in terms of support from the company. For example, when evaluators make a decision which is very costly for the company, it is more likely to be respected by other employees.
A key question is how much of the problem comes from evaluators within the company sounding the alarm versus the rest of the company (and the world) not respecting this alarm. And of course, how much the conflict of interest (and tribal-ish affiliation with the company) will alter the judgment of evaluators in a problematic way.
If I were confident that all risk evaluators were incorruptible, high-integrity, total altruists (to the world or something even more cosmopolitan), then I think the case for them getting compensated normally with equity is pretty good. This would allow them to perform a costly signal later and would have other benefits. Though perhaps the costly signal is less strong if it is clear these people don't care about money (but this itself might lend some credibility).
Given my understanding of the actual situation at AI companies, I think AI companies should ideally pay risk evaluators in cash.
I have more complex views about the situation at each of the different major AI labs and what is highest priority for ensuring good risk evaluations.
You actually want evaluators to have as much skin in the game as other employees so that when they take actions that might shut the company down or notably reduce the value of equity, this is a costly signal.
Further, it's good if evaluators are just considered normal employees and aren't separated out in any specific way. Then, other employees at the company will consider these evaluators to be part of their tribe and will feel some loyalty. (This also reduces the chance that risk evaluators feel like they are part of a rival "AI safety" tribe.) This probably has a variety of benefits in terms of support from the company. For example, when evaluators make a decision which is very costly for the company, it is more likely to be respected by other employees.
This situation seems backwards to me. Like, presumably the ideal scenario is that a risk evaluator estimates the risk in an objective way, and then the company takes (hopefully predefined) actions based on that estimate. The outcome of this interaction should not depend on social cues like how loyal they seem, or how personally costly it was for them to communicate that information. To the extent it does, I think this is evidence that the risk management framework is broken.
Like, presumably the ideal scenario is that a risk evaluator estimates the risk in an objective way, and then the company takes (hopefully predefined) actions based on that estimate. The outcome of this interaction should not depend on social cues like how loyal they seem, or how personally costly it was for them to communicate that information. To the extent it does, I think this is evidence that the risk management framework is broken.
I agree with all of this, but I don't expect us to live in an ideal world with a non-broken risk management framework, and we're making decisions on that margin.
I also think getting predefined actions right is somewhat tricky even in a pretty ideal setup, but I agree you can get reasonably close.
Note that I don't necessarily endorse the arguments I quoted (this is just the strongest objection I'm aware of), and as a bottom line, I think you should pay risk evaluators in cash.
Why not pay them nothing until they are shown to be correct with sufficiently credible proof, and then pay out a huge prize?
This could exclude competent evaluators without other income--this isn't Dath Ilan, where a bank could evaluate evaluators and front them money at interest rates that depended on their probability of finding important risks--and their shortage of liquidity could provide a lever for distorting their incentives.
On Earth, if someone's working for you, and you're not giving them a salary commensurate with the task, there's a good chance they are getting compensation in other ways (some of which might be contrary to your goals).
Your third paragraph mentions "all AI company staff" and the last refers to "risk evaluators" (i.e. "everyone within these companies charged with sounding the alarm"). Are these groups roughly the same or is the latter subgroup significantly smaller?
I personally think it's most important to have at least some technical employees who have the knowledge/expertise to evaluate the actual situation, and who also have the power to do something about it. I'd want this to include some people whose primary job is more like "board member" and some people whose primary job is more like "alignment and/or capabilities researcher."
But there is a sense in which I'd feel way more comfortable if all technical AI employees (alignment and capabilities), and policy / strategy people, didn't have profit equity, so they didn't have an incentive to optimize against what was safe. That way, there are just a lot of eyes on the problem, and the overall egregore steering the company has one fewer cognitive distortion to manage.
This might be too expensive (OpenAI and Anthropic have a lot of money, but, like, that doesn't mean they can just double everyone's salary)
An idea that occurs to me is to construct "windfall equity", which only pays out if the company or world generates AI that is safely, massively improving the world.
I think the latter group is much smaller. I'm not sure who exactly has the most influence over risk evaluation, but the most obvious examples are company leadership and safety staff/red-teamers. From what I hear, even those currently receive equity (which seems corroborated by job listings, e.g. Anthropic, DeepMind, OpenAI).
It's a good point that this is an extreme conflict of interest. But isn't one obvious reason these companies design their incentive structures this way that they don't have the cash lying around to "simply" compensate people in cash instead?
E.g. here's ChatGPT:
Q: Why do companies like OpenAI partly pay their employees in equity?
ChatGPT: Companies like OpenAI partly pay employees in equity to align the employees' incentives with the company's long-term success. By offering equity, companies encourage employees to take ownership of the business's outcomes, fostering a sense of responsibility and motivation to contribute to its growth. Additionally, equity compensation can attract top talent by offering potential future financial rewards tied to the company’s success, especially for startups or high-growth firms. It also helps conserve cash, which is valuable for companies in early stages or those heavily investing in research and development.
It would indeed be good if AI safety research was paid for in cash, not equity. But doesn't this problem just reduce to there being far less interest in (and cash for) safety research relative to capability research than we would need for a good outcome?
This is not obviously true for the large AI labs, which pay their mid-level engineers/researchers something like $800-900k/year, with ~2/3 of that being equity. If you have a thousand such employees, that's an extra $600m/year in cash. It's true that in practice the equity often ends up getting sold for cash later by the employees themselves (e.g. in tender offers/secondaries), but paying in equity is sort of like deferring the sale of that equity for cash. (Which also lets you bake in assumptions about growth in the value of that equity, etc...)
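For what it's worth, the back-of-envelope arithmetic here is simple enough to write out explicitly (using the illustrative figures from this comment, not actual company data):

```python
# Rough estimate of the extra cash needed to replace equity compensation,
# using the illustrative figures from the comment above.

employees = 1_000        # hypothetical count of mid-level engineers/researchers
total_comp = 900_000     # upper end of the ~$800-900k/year total comp range
equity_fraction = 2 / 3  # portion of total comp currently paid in equity

extra_cash_per_year = employees * total_comp * equity_fraction
print(f"${extra_cash_per_year / 1e6:.0f}M/year")  # → $600M/year
```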
But only a small fraction work on evaluations, so the increased cost is much smaller than you make out.
Oh, that's true, I sort of lost track of the broader context of the thread. Though then the company needs to very clearly define who's responsible for doing the risk evals and making go/no-go/etc. calls based on their results... and how much input do they accept from other employees?
I think this is important to define anyway! (And likely pretty obvious.) This would create a lot more friction for someone to take on such a role, though, or to move out of it.
It would be expensive, but it's not a hard constraint. OpenAI could almost certainly raise another $600M per year if they wanted to (they're allegedly already losing $5B per year as it is).
Also the post only suggests this pay structure for a subset of employees.
Early discussion of AI risk often focused on debating the viability of various elaborate safety schemes humanity might someday devise—designing AI systems to be more like “tools” than “agents,” for example, or as purely question-answering oracles locked within some kryptonite-style box. These debates feel a bit quaint now, as AI companies race to release agentic models they barely understand directly onto the internet.
Why do you call current AI models "agentic"? It seems to me they are more like tool AI or oracle AI...
I just meant there are many teams racing to build more agentic models. I agree current ones aren't very agentic, though whether that's because they're meaningfully more like "tools" or just still too stupid to do agency well or something else entirely, feels like an open question to me; I think our language here (like our understanding) remains confused and ill-defined.
I do think current systems are very unlike oracles though, in that they have far more opportunity to exert influence than the prototypical imagined oracle design—e.g., most have I/O with ~any browser (or human) anywhere, people are actively experimenting with hooking them up to robotic effectors, etc.
Personally, I suspect the alignment problem is hard. But even if it turns out to be easy, survival may still require getting at least the absolute basics right; currently, I think we're mostly failing even at that.
Early discussion of AI risk often focused on debating the viability of various elaborate safety schemes humanity might someday devise—designing AI systems to be more like “tools” than “agents,” for example, or as purely question-answering oracles locked within some kryptonite-style box. These debates feel a bit quaint now, as AI companies race to release agentic models they barely understand directly onto the internet.
But a far more basic failure, from my perspective, is that at present nearly all AI company staff—including those tasked with deciding whether new models are safe to build and release—are paid substantially in equity, the value of which seems likely to decline if their employers stop building and releasing new models.
As a result, it is currently the case that roughly everyone within these companies charged with sounding the alarm risks personally losing huge sums of money if they do. This extreme conflict of interest could be avoided simply by compensating risk evaluators in cash instead.