Cross-posted from my personal blog.

I have an intuition that the gap between short term ML alignment problems and long term AGI alignment problems is smaller than is usually assumed.

The writing I usually see on the subject rarely takes this perspective. Instead, I see a few kinds of things get written:

  • People who work on long term AGI alignment saying short term ML alignment doesn’t help their use case
  • People who work on short term ML alignment, frustrated that they can’t get companies to implement their work
  • People who talk about the incentives of business structures and draw some parallels to long term AGI alignment, for example Immoral Mazes and Mesa Optimizers. Maybe I’m projecting, as I’m not sure how much this parallel is being drawn explicitly.


This seems to me like a strong throughline, especially for a Christiano-style amplification path. That is, we already observe hierarchical self-referential systems that encounter supervision problems for aligning their stated objectives (which may or may not be honest) with their pursued objectives (which may or may not be aligned) by using semi-aligned components (employees) and indirect supervision (middle management).

There has been significant discussion of the analogy between corporations and superintelligences and I don’t mean to rehash this. Certainly a superintelligent AGI may have many properties that no corporation of humans has, and one of the most likely and concerning of those properties is operating on a faster serialized time scale.


But I think it’s hard to deny that corporations are a form of very weak superintelligence that exists in the real world. And I think that means they are an opportunity to experiment with methods for alignment in superintelligences in the real world.


Or to phrase it differently, I don’t think “being able to run a good corporation” is a solution to the problem of AGI alignment, but I DO think that a solution to the problem of AGI alignment, probably in general but definitely in Christiano-style amplification, should contain a solution to being able to run a company that doesn’t experience value drift. I think this philosophy, of attempting to solve simplified sub-problems, is pretty well established as a good idea.

Unfortunately, I don’t really know that much about business or management. And I don’t know that much about the kind of selection pressures companies undergo, especially around initial funding. So I’m going to try to write a little bit about some toy models and hope this train of thought is helpful for someone.


A Toy Model of Businesses

A company starts with a founder. The founder has an idea which will create value, and an idea about how to sell that value and capture some of it for themself.

The founder then needs to create a company to make the idea happen. Creating a company requires a bunch of capital to get tools and employees. Often in order to acquire that capital, the founder needs to find an investor.

The founder needs to convince the investor of a few things:

  • The idea will create value
  • There will be ENOUGH value for them to split it and make better than market returns
  • The company can make the idea happen
  • The founder can make the company happen, given enough capital

By appealing to the investor the founder is letting the company be less aligned with the idea, because the investor only cares about the return on investment. If the idea changes so that it is harder to sell the value created, this may be a big win for the founder or society at large, but will be a loss for the investor.


Assuming this deal works out, the founder needs to create the company by hiring employees to do subtasks that will help make the idea real.

These employees have a number of possible motivations, and probably a mix of all of them:

  1. To capture some of the value from selling the idea
  2. To capture some of the capital from the investor
  3. To benefit directly from the idea and its value
  4. To support the efforts of the founder
  5. To support the efforts of the investor
  6. To support the efforts of other employees
  7. To send signals to external parties (for example governments, other companies)
  8. To create value from a different idea


From the perspective of the founder, motivations 3 and 4 are probably aligned (since in 3 they may pay/work for lower salaries), and motivations 1 and 8 can be actively harmful (less value remains for the founder to capture).

From the perspective of the investor, motivations 4, 5, and 8 are aligned, and motivations 1 and 2 are actively harmful.

From the perspective of society at large, motivations 3 and 8 are aligned, while motivation 7 is suspect.

From the perspective of employee self-interest, motivations 1, 2, and 7 are aligned.

From the perspective of employee reputation, motivations 6 and 8 are best aligned while 1 and 2 may be less aligned.
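
As a rough sketch, the judgments above can be collected into a small lookup table. The names `MOTIVATIONS`, `ALIGNMENT`, and `assess` are my own illustrative choices, not part of the toy model, and the table records only what the paragraphs above state. Any motivation a perspective does not mention is treated as neutral, which is an assumption.

```python
# Motivations in the order they are listed above, numbered 1-8.
MOTIVATIONS = {
    1: "capture value from selling the idea",
    2: "capture capital from the investor",
    3: "benefit directly from the idea and its value",
    4: "support the founder",
    5: "support the investor",
    6: "support other employees",
    7: "send signals to external parties",
    8: "create value from a different idea",
}

# perspective -> (aligned, harmful / suspect / less aligned), as stated in the text.
ALIGNMENT = {
    "founder": ({3, 4}, {1, 8}),
    "investor": ({4, 5, 8}, {1, 2}),
    "society": ({3, 8}, {7}),
    "employee self-interest": ({1, 2, 7}, set()),
    "employee reputation": ({6, 8}, {1, 2}),
}


def assess(motives):
    """Split a set of motivation numbers into aligned / harmful / neutral per perspective."""
    motives = set(motives)
    report = {}
    for perspective, (aligned, harmful) in ALIGNMENT.items():
        report[perspective] = {
            "aligned": sorted(motives & aligned),
            "harmful": sorted(motives & harmful),
            # Assumption: anything the text does not mention counts as neutral.
            "neutral": sorted(motives - aligned - harmful),
        }
    return report
```

For example, `assess({1, 4, 6, 8})` describes a team driven by motivations 1, 4, 6, and 8. From the founder's perspective it reports 4 as aligned, 1 and 8 as harmful, and 6 as neutral.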


So let’s look at some (hypothetical) real world events in the context of this toy model.

A team of engineers sees a way to increase engagement on their component in a way that differs from the plan.

The engineers see this as a way to support the core idea (4), and they see their own idea (8) adding value as a result. They may also be motivated by increasing their contribution, so they can increase their share of the value (1).

Once they have this idea, gathering together around it is supported by 6.

Importantly though, this only satisfies 8 IF optimizing for engagement (the imperfect proxy they are given) actually adds value to the idea (3).

Otherwise, the change might actually damage the company’s overall value proposition by overoptimizing a piece. 

This individual team often doesn’t understand the core value proposition. For something like Facebook clicks, it’s not clear that even the founder understands the core value proposition.


Supervision in this case could work. If you have a high level executive who cares (because of his investment incentives) about how much value the company overall generates, AND that executive has the ability to measure whether the change overall adds value (for example, by doing UX research and getting a holistic picture of what the value is), the executive could observe that “build a tool that finds posts that are fairly clickable” adds value but “build a tool that finds posts that are MAXIMALLY clickable” reduces value.

This is at least a much shorter, easier to verify statement than the full details of differences between possible tool implementations.

My impression is that this doesn’t happen in practice, which could be why clickbait is such a problem?

Are there real life examples of anti-goodharting based on user studies? What I have seen in practice is usually more like measuring a range of different types of clickiness, and hoping that broadening in this way reduces over-optimization.

An ethics researcher discovers that part of the software that implements the company’s idea actually destroys some value. They bring this concern to their manager.

The ethics researcher is trying to enhance 3 and 4, possibly directly in response to the previous case. In this case, they are working directly against 6, and creating a zero sum game with other employees for 1, because credit assignment would mean reducing how much of the value the people who wrote this tool are capturing. However, if the value destruction is a secret, then the investor may not care, because it won’t affect how well the company can sell itself.

In this case, the founder’s original vision requires the work of the ethics researcher, but this works directly against the investor’s interests (by reducing external measures of value) and creates a zero sum conflict among employees.

Since incentives are working against them, their manager quiets the researcher and the idea creates less value.


Can supervision fix this?

If our aligned executive comes back, he could be convinced that the implementation may destroy some value. If this identifies a case like the first scenario above (the engineering team), where the value destruction comes from changes to the plan that misalign the company, he could intervene, and the researcher would be a valuable part of the oversight and summarization process.


But if the issue identified is part of the original plan, the executive may not know what alternative paths the team could take, whether those alternatives would actually destroy less value vs. this being a discovery about the idea fundamentally containing less value, or how much cost the alternatives would have. In particular, because the researcher is not the person who originally made the product in the first place, it’s not clear that ANYONE would have a clear picture of how much effort it would take to fix.


Imagine Facebook showing users posts that will get reactions. Reactions are a bad proxy for post quality, and optimizing hard for them Goodharts away whatever value they have. But Facebook has to show posts in some way! I can’t imagine that hundreds of engineers at Facebook would be happy to say “yeah, let’s go back to reverse chronological ordering and find new jobs.” And finding a better way may just be unachievable.


It seems to me that it would be hard to summarize this situation in a way that lets an executive supervisor easily set incentives for other teams in response. In particular, it might be hard to distinguish between researchers who point out real gaps in the product and those who simply point at politically weak teams in hope of stealing their share of value.


If that is the common outcome, then you may see ethics researchers or other whistleblowers leaving the company on bad terms, because they are unable to contribute value.


An executive does their best to incentivize projects that create real value, and to stop projects under them that destroy real value in order to capture more of what remains.

Whether this executive is following their incentives from 1 depends on the exact tradeoff between productivity and corruption they face, and potentially depends on contingent factors about the investor.


Say there are opportunities to seize X shares by destroying Y value with an opportunity cost of creating Z value spread across all shares.

Say the executive has 1% of shares.

If value creation (the opportunity cost Z) can add 1% to company value, that is worth 0.01% directly.

If the executive can seize all of another executive’s shares (going to 2%) by destroying 1% of company value, that is instead worth 0.98% directly.

If value creation can add 5% to company value, it is worth 0.05% directly.

If the executive can expand his influence slightly (going up to 1.1% of shares) at high cost (destroying 5% of company value) this is worth 0.045% directly.
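
The four figures above can be checked with a few lines of Python. This is only a sketch: `stake_gain` is a hypothetical helper, and it assumes the executive's payoff is simply their share fraction times total company value, measured against the company's original value.

```python
def stake_gain(share_before, share_after, value_change):
    """Change in the executive's holding, as a fraction of original company value.

    share_before / share_after: fraction of shares held before and after the action.
    value_change: fractional change in company value (0.05 is +5%, -0.01 is -1%).
    """
    return share_after * (1 + value_change) - share_before


# The four cases from the text, for an executive who starts with 1% of shares.
scenarios = {
    "create 1% value": stake_gain(0.01, 0.01, 0.01),
    "seize another 1% of shares, destroy 1% value": stake_gain(0.01, 0.02, -0.01),
    "create 5% value": stake_gain(0.01, 0.01, 0.05),
    "grow to 1.1% of shares, destroy 5% value": stake_gain(0.01, 0.011, -0.05),
}

for name, gain in scenarios.items():
    print(f"{name}: {gain * 100:.3f}% of original company value")
```

Under these assumptions, grabbing beats creating in the first pair (0.98% against 0.01%), while creating narrowly beats grabbing in the second pair (0.05% against 0.045%).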


How able the investor is to notice value destruction, how much the investor (or board) wants to punish value grabbing, how easy value creation is, how easy value grabbing is, and how much influence over punishments value grabbing can provide are all knobs that could get aligned executives promoted or removed from the company. I have no idea what sorts of numbers are reasonable or what kind of stable equilibrium might result.

That is, can this kind of supervision compete in the toy model market? Unclear.

6 comments

After writing this I came up with the following summary that might be much cleaner:

In the intelligence amplification/distillation framework, it seems like "hire a team of people" is a possible amplification step, in which case running a company is a special case of the framework. Therefore I'd expect to see useful analogies from advice for running a company that would apply directly to amplification safety measurements.

I also meant to mention the analogy between UX research and the Inception network architecture having middle layers that directly connect to the output, but I forgot.

Given Foom-style hard takeoff background assumptions, once we actually have aligned AGI, most other problems rapidly become a non-issue. But suppose you threw a future paper on AI safety through a wormhole to an alternate world that didn't have computers or something. Would it help them align corporations?

Likely not a lot. Many AI alignment techniques assume that you can write arbitrary mathematical expressions into the AI's utility function. Some assume that the AI is capable of a quantity or quality of labour that humans could not reasonably produce. Some assume that the AI's internal thoughts can be monitored. Some assume the AI can be duplicated. The level of neurotechnology required to do some of these things is so high that if you can do it, you can basically write arbitrary code in neurons. This is just AI alignment again, but with added difficulty and ethical questionability.


Structuring a CPU so it just does addition, and doesn't get bored or frustrated and deliberately tamper with the numbers, is utterly trivial.

Arranging a social structure where people add numbers and are actually incentivised to get the answer right, not so easy. Especially if the numbers they are adding control something the workers care about.

I think this misunderstands my purpose a little bit.

My point isn't that we should try to solve the problem of how to run a business smoothly. My point is that if you have a plan to create alignment in AI of some kind, it is probably valuable to ask how that plan would work if you applied it to a corporation.

Creating a CPU that doesn't lie about addition is easy, but most ML algorithms will make mistakes outside of their training distribution, and thinking of ML subcomponents as human employees is an intuition pump for how or whether your alignment plan might interact.

Modelling an AI as a group of humans is just asking for an anthropomorphized and probably wrong answer. The human brain easily anthropomorphizes by default; that's a force you have to actively work against, not encourage.

Humans have failure modes like getting bored of doing the same thing over and over again, and stopping paying attention. AIs can overfit the training data and produce useless predictions in practice.

Another way of seeing this is to consider two different AI designs, maybe systems with 2 different nonlinearity functions, or network sizes or whatever. These two algorithms will often do different things. If the algorithms get "approximated" into the same arrangement of humans, the human based prediction must be wrong for at least one of the algorithms.

The exception to this is approaches like IDA, which use AIs trained to imitate humans and so will probably be quite human-like.

Take an example of an aligned AI system, and describe what the corresponding arrangement of humans would even be. Say, take a satisficer agent with an impact penalty. This is an agent that gets 1 reward if the reward button is pressed at least once, and is penalised in proportion to the difference between the real world and the hypothetical where it did nothing. How many people does this AI correspond to, and how are the people arranged into a corporation?

I think you're misunderstanding my analogy.

I'm not trying to claim that if you can solve the (much harder and more general) problem of AGI alignment, then you should be able to solve the (simpler, specific) case of corporate incentives.

It's true that many AGI architectures have no clear analogy to corporations, and if you are using something like a satisficer model with no black-box subagents, this isn't going to be a useful lens.

But many practical AI schema have black-box submodules, and some formulations like mesa-optimization or supervised amplification-distillation explicitly highlight problems with black box subagents.

I claim that an employee that destroys documentation so that they become irreplaceable to a company is a misaligned mesa-optimizer. Then I further claim that this suggests:

  • Company structures contain existing research on misaligned subagents. It's probably worth doing a literature review to see if some of those structures have insights that can be translated.
  • Given a schema for aligning sub-agents of an AGI, either the schema should also work on aligning employees at a company or there should be a clear reason it breaks down.
    • If the analogy applies, one could test the alignment schema by actually running such a company, which is a natural experiment that isn't safely accessible for AI projects. This doesn't prove that the schema is safe, but I would expect aspects of the problem to be easier to understand via natural experiment than via doing math on a whiteboard.

"Company structures contain existing research on misaligned subagents. It's probably worth doing a literature review to see if some of those structures have insights that can be translated."

"Principal-agent problem" seems like a relevant keyword.

Also, how does nature solve this problem? How are genes aligned with the cell as a whole, cells with the multicellular organism, ants with the anthill?

Though I suspect that most (all?) solutions would be ethically and legally unacceptable for humans. They would translate as "if the company fails, all employees are executed" and similar.