The “???” in the row below “Not-so-local modification process” for the corporation case should perhaps be something like “Culture and process”?
A corporation is always focused on generating profits. It might burn more than it makes during certain growth spurts, but it generally holds that a corporation has profit as a primary goal; every other goal is stacked on top of this first premise.
Its analogue is not drugs or time spent with friends; it is more like air. A corporation needs to supply wages to its cells, i.e. its workers, much as our body needs to supply oxygen. We can hold our breath and go fishing, but we do so on borrowed air; it will run out eventually.
A corporation is a super-organism, and every employee is a cell. If it uses AI (as in trading or car manufacturing), the dynamic changes slightly: the robot becomes a tool that needs to be maintained by a cell, similar to the algae that corals keep, supplied with light (electricity) and shielded from predators (laws forbidding AI trading).
There are three main ways to try to understand and reason about powerful future AGI agents:
I think it’s valuable to try all three approaches. Today I'm exploring strategy #3, building an extended analogy between:
The Analogy
Agent:
- Human corporation with a lofty humanitarian mission
- Human who claims to be a good person with altruistic goals
- AGI trained in our scenario
Often (but not always) well-described by: “First, prune away options that clearly/obviously conflict with the internally-represented goals and principles. Then, of the remaining options, choose the one that maximizes happiness/status/wealth/power.” (A toy sketch of this two-step procedure appears just after the Analogy entries below.)
A corporation obsessed with its own stock price. More generally perhaps, an organization obsessed with (relatively short-term) profits/power/brand/etc. (n.b. Isn’t this basically most corporations?)
More generally perhaps, an employee who is well-described as optimizing for some combination of relatively short-term things likely to be connected to their brain’s reward circuitry: promotions, approval of their manager and peers, good times with friends, etc.
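To make the quoted “prune, then maximize” procedure concrete, here is a minimal sketch in Python. It is my own illustration, not code from this post; Option, clearly_violates_principles, and payoff are made-up names, and payoff stands in for whatever mix of happiness/status/wealth/power the agent is tracking.

```python
# Minimal sketch (hypothetical names) of the quoted decision procedure:
# principles act as a hard filter on *clear* violations, and only the
# surviving options compete on payoff.
from dataclasses import dataclass

@dataclass
class Option:
    name: str
    clearly_violates_principles: bool  # only obvious conflicts get pruned
    payoff: float                      # stand-in for happiness/status/wealth/power

def choose(options: list[Option]) -> Option:
    # Step 1: prune options that clearly/obviously conflict with the
    # internally-represented goals and principles.
    permitted = [o for o in options if not o.clearly_violates_principles]
    # Step 2: of the remaining options, pick the payoff maximizer.
    return max(permitted, key=lambda o: o.payoff)

options = [
    Option("defraud investors", clearly_violates_principles=True, payoff=10.0),
    Option("juice short-term metrics with aggressive marketing", False, 6.0),
    Option("quietly do the right-but-unrewarded thing", False, 1.0),
]
print(choose(options).name)  # -> juice short-term metrics with aggressive marketing
```

Unlike a pure expected-utility maximizer, this agent treats its principles as a filter rather than as one more term in the utility, so options that conflict with them only subtly survive to step 2.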
What happens when training incentives conflict with goals/principles
Suppose at time t, Agent-3 has goals/principles X. Suppose that Agent-3 is undergoing training, and X is substantially suboptimal for performing well / scoring highly in that training environment. What happens? This appendix attempts to describe various possibilities.
Consider a powerful general agent (such as any of the three described in the Analogy) whose behavioral goals/principles and internally-represented goals/principles coincide at time t.
Now let’s further suppose that there is some sort of conflict between the behavioral goals/principles and the local modification process (the training process in the case of the AGI; a few years’ worth of learning and growing in the case of the human and the corporation). For example, perhaps the corporation is reinforced primarily for producing profits and PR wins; perhaps the human is reinforced primarily for winning the approval and admiration of their peers; perhaps the AGI is reinforced primarily for accomplishing various difficult tasks in some training environment while appearing, on brief inspection by some previous-generation LLM or human raters, to follow the Spec.
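For intuition about the kind of pressure this sets up, here is a toy sketch, again my own illustration rather than a model from this document: a two-action agent whose policy weights an internal goal against a training-reward heuristic, updated by a simple policy-gradient rule on the training reward alone. All names and numbers are made up.

```python
# Toy model (hypothetical): action 0 best serves the agent's internally
# represented goal X, action 1 scores highest under the training reward.
# Reinforcing only the training reward gradually shifts the policy's
# weighting away from X.
import numpy as np

rng = np.random.default_rng(0)

goal_score = np.array([1.0, 0.0])    # how well each action serves goal X
train_reward = np.array([0.2, 1.0])  # what the training environment reinforces

# Policy: softmax over a weighted mix of the two scores.
# theta = (weight on goal X, weight on the training-reward heuristic)
theta = np.array([2.0, 0.5])         # starts out mostly goal-directed

def action_probs(theta):
    logits = theta[0] * goal_score + theta[1] * train_reward
    e = np.exp(logits - logits.max())
    return e / e.sum()

lr = 0.05
feats = np.stack([goal_score, train_reward], axis=1)  # per-action features
for step in range(2000):
    p = action_probs(theta)
    a = rng.choice(2, p=p)
    r = train_reward[a]                # only the training reward is reinforced
    grad_logp = feats[a] - p @ feats   # REINFORCE gradient of log pi(a)
    theta += lr * r * grad_logp

print("final weights (goal, training heuristic):", theta.round(2))
print("P(goal-serving action):", action_probs(theta)[0].round(3))
```

In this toy setting the reinforced heuristic gradually crowds out the original goal; whether anything analogous happens to Agent-3’s internally-represented goals/principles under training is exactly the question this appendix is asking.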
What can happen? Some combination of the following possibilities, at least:
Appendix: Three important concepts/distinctions
A standard way to think about powerful general agents is the expected utility maximization (EU-max) model. Here are three concepts/distinctions that help articulate several ways in which we think future AGIs (and present-day agents) are different from what the EU-max model would naively imply.
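For reference, the EU-max model in its standard textbook form (notation mine, not the post’s): the agent is modeled as choosing

$$a^{*} \;=\; \underset{a \in \mathcal{A}}{\arg\max}\; \mathbb{E}\big[U(o) \mid a\big] \;=\; \underset{a \in \mathcal{A}}{\arg\max}\; \sum_{o} P(o \mid a)\, U(o),$$

where $\mathcal{A}$ is the set of available actions, $o$ ranges over outcomes, and $U$ is a single utility function. The model thus assumes one context-independent, stable utility function; roughly speaking, each of the three distinctions below relaxes one of those assumptions.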
Goals vs. Principles
Contextually activated goals/principles
Stability and/or consistency of goals/principles