cross-posted from my personal blog
I have an intuition that the gap between short term ML alignment problems and long term AGI alignment problems is smaller than is usually assumed.
The writing I see on this subject rarely takes that perspective. Instead, I see a few kinds of things actually get written:
- People who work on long term AGI alignment saying short term ML alignment doesn’t help their use case
- People who work on short term ML alignment frustrated that they can't get companies to implement their work
- People who talk about the incentives of business structures and draw some parallels to long term AGI alignment, for example Immoral Mazes and Mesa Optimizers. Maybe I’m projecting, as I’m not sure how much this parallel is being drawn explicitly.
This seems to me like a strong throughline, especially for a Christiano-style amplification path. That is, we already observe hierarchical self-referential systems that run into supervision problems when trying to align their stated objectives (which may or may not be honest) with their pursued objectives (which may or may not be aligned), using semi-aligned components (employees) and indirect supervision (middle management).
There has been significant discussion of the analogy between corporations and superintelligences and I don’t mean to rehash this. Certainly a superintelligent AGI may have many properties that no corporation of humans has, and one of the most likely and concerning of those properties is operating on a faster serialized time scale.
But I think it’s hard to deny that corporations are a form of very weak superintelligence that exists in the real world. And I think that means they are an opportunity to experiment with methods for alignment in superintelligences in the real world.
Or to phrase it differently, I don’t think “being able to run a good corporation” is a solution to the problem of AGI alignment, but I DO think that a solution to the problem of AGI alignment, probably in general but definitely in Christiano-style amplification, should contain a solution to being able to run a company that doesn’t experience value drift. I think this philosophy, of attempting to solve simplified sub-problems, is pretty well established as a good idea.
Unfortunately, I don’t really know that much about business or management. And I don’t know that much about the kind of selection pressures companies undergo, especially around initial funding. So I’m going to try to write a little bit about some toy models and hope this train of thought is helpful for someone.
A Toy Model of Businesses
A company starts with a founder. The founder has an idea which will create value, and an idea about how to sell that value and capture some of it for themself.
The founder then needs to create a company to make the idea happen. Creating a company requires a bunch of capital to get tools and employees. Often in order to acquire that capital, the founder needs to find an investor.
The founder needs to convince the investor of a few things:
- The idea will create value
- There will be ENOUGH value for them to split it and make better than market returns
- The company can make the idea happen
- The founder can make the company happen, given enough capital
By appealing to the investor the founder is letting the company be less aligned with the idea, because the investor only cares about the return on investment. If the idea changes so that it is harder to sell the value created, this may be a big win for the founder or society at large, but will be a loss for the investor.
Assuming this deal works out, the founder needs to create the company by hiring employees to do subtasks that will help make the idea real.
These employees have a number of possible motivations, and probably act from a mix of all of them:
1. To capture some of the value from selling the idea
2. To capture some of the capital from the investor
3. To benefit directly from the idea and its value
4. To support the efforts of the founder
5. To support the efforts of the investor
6. To support the efforts of other employees
7. To send signals to external parties (for example governments, other companies)
8. To create value from a different idea
From the perspective of the founder, motivations 3 and 4 are probably aligned (employees motivated by 3 may accept lower salaries), and motivations 1 and 8 can be actively harmful (less value remains for the founder to capture).
From the perspective of the investor, motivations 4, 5, and 8 are aligned, and motivations 1 and 2 are actively harmful.
From the perspective of society at large, motivations 3 and 8 are aligned, while motivation 7 is suspect.
From the perspective of employee self-interest, motivations 1, 2, and 7 are aligned.
From the perspective of employee reputation, motivations 6 and 8 are best aligned while 1 and 2 may be less aligned.
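These perspective-by-motivation relationships can be encoded as a small sketch (the motivation labels are my paraphrases, and the aligned/harmful sets are taken directly from the statements above):

```python
# Toy encoding of the eight employee motivations and which of them each
# perspective considers aligned or actively harmful, as stated above.

MOTIVATIONS = {
    1: "capture some of the value from selling the idea",
    2: "capture some of the capital from the investor",
    3: "benefit directly from the idea and its value",
    4: "support the efforts of the founder",
    5: "support the efforts of the investor",
    6: "support the efforts of other employees",
    7: "send signals to external parties",
    8: "create value from a different idea",
}

ALIGNED = {
    "founder": {3, 4},
    "investor": {4, 5, 8},
    "society": {3, 8},
    "employee_self_interest": {1, 2, 7},
    "employee_reputation": {6, 8},
}

HARMFUL = {
    "founder": {1, 8},
    "investor": {1, 2},
    "society": {7},
}

def aligned_with_all(*perspectives):
    """Motivations that every listed perspective considers aligned."""
    return set.intersection(*(ALIGNED[p] for p in perspectives))

def contested():
    """Motivations some perspective calls aligned and another calls harmful."""
    all_aligned = set.union(*ALIGNED.values())
    all_harmful = set.union(*HARMFUL.values())
    return all_aligned & all_harmful
```

One thing this makes easy to see: only motivation 3 is aligned for both the founder and society, while motivation 8 is simultaneously aligned (investor, society, reputation) and harmful (founder), so conflict over it is baked in.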
So let's look at some (hypothetical) real world events in the context of this toy model.
A team of engineers sees a way to increase engagement on their component in a way that differs from the plan.
The engineers see this as a way to support the founder's core idea (4), and as their own idea (8) adding value. They may also be motivated by increasing their contribution, so they can capture a larger share of the value (1).
Once they have this idea, gathering together around it is supported by 6.
Importantly though, this only satisfies 8 IF optimizing for engagement (the imperfect proxy they are given) actually adds value to the idea (3).
Otherwise, the change might actually damage the company’s overall value proposition by overoptimizing a piece.
This individual team often doesn’t understand the core value proposition. For something like Facebook clicks, it’s not clear that even the founder understands the core value proposition.
Supervision in this case could work. If a high level executive cares (because of their investment incentives) about how much value the company overall generates, AND that executive has the ability to measure whether the change adds value overall (for example, by doing UX research and getting a holistic picture of what the value is), then the executive could observe that "build a tool that finds posts that are fairly clickable" adds value but "build a tool that finds posts that are MAXIMALLY clickable" reduces it.
This is at least a much shorter, easier to verify statement than the full details of differences between possible tool implementations.
My impression is that this doesn’t happen in practice, which could be why clickbait is such a problem?
Are there real life examples of anti-goodharting based on user studies? What I have seen in practice is usually more like measuring a range of different types of clickiness, and hoping that broadening in this way reduces over-optimization.
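The "broadening" practice described above can be sketched concretely. This is a hypothetical scoring rule, not any real ranking system: combine several click-proxy metrics and cap each one, so that maximizing a single proxy pays off less than being moderately good on all of them. The metric names and the cap value are illustrative assumptions.

```python
# Sketch of broadening across proxies to blunt over-optimization:
# average several clickiness proxies, capping each one so that extreme
# scores on a single proxy stop paying off past the cap.

def broadened_score(metrics, cap=0.9):
    """metrics: dict of proxy name -> score in [0, 1].

    Capping each proxy limits the payoff from maximizing any one of them;
    averaging rewards posts that are broadly good instead.
    """
    capped = [min(v, cap) for v in metrics.values()]
    return sum(capped) / len(capped)

# A post fully optimized for one proxy (clickbait-style)...
clickbait = {"clicks": 1.0, "dwell_time": 0.2, "shares": 0.3}
# ...vs. a post that is moderately good on every proxy.
solid = {"clicks": 0.7, "dwell_time": 0.7, "shares": 0.7}
```

Under this rule the broadly-good post outscores the single-proxy one, which is the hoped-for effect; whether real deployments achieve this is exactly the open question above.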
An ethics researcher discovers that part of the software that implements the company’s idea actually destroys some value. They bring this concern to their manager.
The ethics researcher is trying to enhance 3 and 4, possibly directly in response to the previous case. In this case, they are working directly against 6, and creating a zero sum game with other employees for 1, because credit assignment would mean reducing how much of the value the people who wrote this tool are capturing. However, if the value destruction is a secret, then the investor may not care, because it won’t affect how well the company can sell itself.
In this case, the founder’s original vision requires the work of the ethics researcher, but this works directly against the investor’s interests (by reducing external measures of value) and creates a zero sum conflict among employees.
Since incentives are working against the researcher, their manager quiets them and the idea creates less value.
Can supervision fix this?
If our aligned executive comes back, they could be convinced that the implementation may destroy some value. If this identifies a situation like the first case above, where the value destruction comes from changes to a plan that misalign the company, they could intervene, and the researcher would be a valuable part of the oversight and summarization process.
But if the issue identified is part of the original plan, the executive may not know what alternative paths the team could take, whether those alternatives would actually destroy less value vs. this being a discovery that the idea fundamentally contains less value, or how much the alternatives would cost. In particular, because the researcher is not the person who originally built the product, it's not clear that ANYONE would have a clear picture of how much effort it would take to fix.
Imagine Facebook showing users posts that will get reactions. Reactions are a bad proxy for post quality, and optimizing for them Goodharts any value they have far too hard. But Facebook has to show posts in some way! I can’t imagine that hundreds of engineers at Facebook would be happy to say “yeah, let’s go back to reverse chronological ordering and find new jobs.” And finding a better way may just be unachievable.
It seems to me that it would be hard to summarize this situation, so that an executive supervisor could easily set incentives for other teams in response. In particular, it might be hard to distinguish between researchers who point out real gaps in the product and those who simply point at politically weak teams in hope of stealing their share of value.
If that is the common outcome, then you may see ethics researchers or other whistleblowers leaving the company on bad terms, because they are unable to contribute value.
An executive does their best to incentivize projects that create real value, and to stop projects under them that destroy real value in order to capture more of the remaining value.
Whether this executive is following their incentives from 1 depends on the exact tradeoff between productivity and corruption they face, and potentially depends on contingent factors about the investor.
Say there are opportunities to seize X shares by destroying Y value with an opportunity cost of creating Z value spread across all shares.
Say the executive has 1% of shares.
If value creation (the opportunity cost Z) can add 1% to company value, that is worth 0.01% directly.
If the executive can seize all of another executive's shares (going to 2%) by destroying 1% of company value, that is instead worth 0.98% directly.
If value creation can add 5% to company value, it is worth 0.05% directly.
If the executive can expand their influence slightly (going up to 1.1% of shares) at high cost (destroying 5% of company value), this is worth 0.045% directly.
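The arithmetic in these examples can be written out as a short sketch (payoffs are expressed as a fraction of the original company value accruing to the executive; the numbers are the ones from the example above):

```python
# Toy payoff arithmetic for the executive's choice between creating value
# for the whole company and grabbing shares while destroying value.

def create_value(own_share, value_added):
    """Direct payoff from growing the whole company by value_added,
    holding own_share of the shares."""
    return own_share * value_added

def grab_value(share_before, share_after, value_destroyed):
    """Direct payoff from seizing shares while shrinking the company:
    new stake in the shrunken company minus the old stake."""
    return share_after * (1 - value_destroyed) - share_before

# Executive holds 1% of shares.
a = create_value(0.01, 0.01)        # add 1% of value       -> 0.0001  (0.01%)
b = grab_value(0.01, 0.02, 0.01)    # seize to 2%, burn 1%  -> 0.0098  (0.98%)
c = create_value(0.01, 0.05)        # add 5% of value       -> 0.0005  (0.05%)
d = grab_value(0.01, 0.011, 0.05)   # seize to 1.1%, burn 5% -> 0.00045 (0.045%)
```

The striking feature is the scale gap: cheap share-grabbing beats even generous value creation by more than an order of magnitude, which is why the knobs discussed next (detection, punishment, relative ease) matter so much.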
How able the investor is to notice value destruction, how much the investor (or board) wants to punish value grabbing, how easy value creation is, how easy value grabbing is, and how much influence over punishments can be provided by value grabbing, are all a bunch of knobs that could get aligned executives promoted or removed from the company. I have no idea what sorts of numbers are reasonable or what kind of stable equilibrium might result.
That is, can this kind of supervision compete in the toy model market? Unclear.