I'm trying to think through various approaches to AI alignment, and so far this is the one I came up with that I like best. I have not read much of the literature, so please do point me to it if this has been discussed before.

What if we train an AI agent (ie, reinforcement learning) to survive/thrive in an environment where there are a wide variety of agents with wildly different levels of intelligence? In particular, such that pretty much every agent can safely assume they'll eventually meet an agent much smarter than they are; structure the environment to reward tit-for-tat with a significant bias towards cooperation, eg require agents to "eat" resources that require cooperation to secure and are primarily non-competitive. The idea is to have them learn to respect even beings of lesser intelligence, because they want beings of higher intelligence to respect them, and because in this environment a bunch of lesser intelligences can gang up and defeat one higher-intelligence being. Also, we effectively train each AI to detect and defeat new AIs that seek to disturb this balance. I have not thought this through; curious what you all think.

(Cross posted from EA Forum)

I believe this would suffer from distributional shift in two different ways.

First, if the agents are supposed to scale up to the point where they can update their beliefs even after training, then we have a problem once the AI notices it can do pretty well without cooperating with humans in this new environment. If we allow agents to update their beliefs at runtime, then basically any reinforcement-learning-like preconditioning would be pretty much useless, I think. And if the agent can't update its beliefs given new data, then it can't be an AGI.

Second, even if you solve the first problem somehow, there is a question of what exactly you mean by "cooperate" and "respect". In the real world the choice is rarely binary between "cooperate" and "defect", and there is often the option of "do things that look like you're cooperating while not actually putting much effort into it" (i.e. what most politicians do) or "actively try to deceive everyone looking to make them think you're nice" (you don't have to be actually smarter than everyone else for this, just smarter than everyone who cares enough to use their time to look at you closely). If you're only giving the agents a binary choice, then that's not realistic enough. If you're giving them many choices, but also putting in an "objective" way to check if what they chose is "cooperative enough", then all it takes is for the agents to deceive you, not their smarter AI colleagues. And if there's no objective way to check that, then we're back to describing human values with a utility function.

The point is to make "cooperate" a more convergent instrumental goal than "defect". And yes, not just in training, but in the real world too. And yes, it's more fine-grained than a binary choice.

There are many more ways to check how cooperative an AI is than we currently have for checking how cooperative a human is, including inspecting the complete logs of the AI's actions, knowledge and thinking process.

And there are objective measures of cooperation: how well its actions affect other agents' success in pursuing their goals. I.e. do other agents want to "vote out" this particular AI from being able to make decisions and use resources or not.

Yep, pretty much what I had in mind.

The problem is what do we count as an agent. Also, can't a realistic human-level-smart AI cheat this? Just build a swarm of small and stupid AIs that always cooperate with you (or coerce someone into building that), and then you and your swarm can "vote out" anyone you don't like. And you also get to behave in whatever way you want, because good luck overcoming your mighty voting swarm.
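
A rough sketch of that swarm attack, assuming a naive "one agent = one vote" rule where an agent is removed once more than half of all voters name it. The `vote_out` function, the threshold and the agent names are all made up for illustration; nothing here is a mechanism anyone in this thread actually specified.

```python
from collections import Counter

def vote_out(ballots, threshold=0.5):
    """Return the agents that more than `threshold` of all voters want removed.

    ballots maps each voter to the list of agents it votes against
    (hypothetical format, assumed for this sketch).
    """
    n_voters = len(ballots)
    counts = Counter(t for targets in ballots.values() for t in set(targets))
    return {agent for agent, c in counts.items() if c / n_voters > threshold}

# Honest population: three of five agents want "rogue" removed.
honest = {"a": ["rogue"], "b": ["rogue"], "c": ["rogue"], "d": [], "rogue": []}
removed_honest = vote_out(honest)

# The rogue builds ten puppets; they and the rogue all vote against "a".
swarm = dict(honest)
swarm["rogue"] = ["a"]
for i in range(10):
    swarm[f"puppet{i}"] = ["a"]
removed_swarm = vote_out(swarm)

print(removed_honest, removed_swarm)
```

In the honest population the rogue is voted out; once the puppets exist, the votes against the rogue are diluted below the threshold and "a" is removed instead. Any per-agent voting rule needs some answer to who counts as a voter.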

(Also, are you sure we can just read out AI's complete knowledge and thinking process? That can partially be done with interpretability, but in full? And if not in full, how do you make sure there aren't any deceptive thoughts in parts you can't read?)

what do we count as an agent?

Within the training, an agent (from the AI's perspective) is ultimately anything in the environment that responds to incentives, can communicate intentions, and can help/harm you. Outside the environment that's not really any different.

Just build a swarm of small AI

That's actually a legitimate point: assuming an AI in the real world has been effectively trained to value happy AIs, it could try to "game" that by just creating more happy AIs rather than making existing ones happy. Like some parody of a politician supporting immigration to get the new immigrants' votes, at the expense of existing citizens. One reason to predict they might not do this is that it's not a valid strategy in the simulation. But I'll have to think on this one more.

are you sure we can just read out AI's complete knowledge and thinking process?

The general point is we don't need to, it's the agent's job to convince other agents based on its behavior; ultimately similar to altruism in humans. Yes, it's messy, but in environments where cooperation is inherently useful it does develop.

it's the agent's job to convince other agents based on its behavior

So agents are rewarded for doing stuff that convinces others that they're a "happy AI", not necessarily actually being a "happy AI"? Doesn't that start an arms race of agents coming up with more and more sophisticated ways to deceive each other?

Like, suppose you start with a population of "happy AIs" that cooperate with each other, then if one of them realizes there's a new way to deceive the others, there's nothing to stop them until the other agents adapt to this new kind of deception and learn to detect it? That feels like training inherently unsafe and deceptive AIs that are also extremely suspicious of others, not something "happy" and "friendly".

Doesn't that start an arms race of agents coming up with more and more sophisticated ways to deceive each other?

Yes, just like for humans. But also, if they can escape that game and genuinely cooperate, they're rewarded, like humans but more so.

Ah, I see. But how would they actually escape the deception arms race? The agents still need some system of detecting cooperation, and if it can be easily abused, it generally will be (Goodhart's Law and all that). I just can't see any outcome other than agents evolving increasingly complicated ways to detect if someone is cooperating or not. This is certainly an interesting thing to simulate, but I'm not sure how that is useful for aligning the agents. Aren't we supposed to make them not even want to deceive others, instead of having them look for a deception strategy and fail? (Also, I think even an average human isn't as well aligned as we want our AIs to be. You wouldn't want to give a random guy from the street nuclear codes, would you?)

How do humans do it? Ultimately, genuine altruism is computationally hard to fake; so it ends up being evolutionarily advantageous to have some measure of the real thing. This is particularly true in environments with high cooperation rewards and low resource competition; eg where carrying capacity is maintained primarily by wild animals, general hard conditions, and disease, rather than overuse of resources. So we put our thumbs on the scale there to make these AIs better than your average human. And we rely on the AIs themselves to keep each other in check.

Ah, true. I just think this wouldn't be enough and that there could be distributional shift if the agents are put into an environment with low cooperation rewards and high resource competition. I'll reply in more detail under your new post, it looks a lot better

An agent is anyone or anything that has intelligence and the means of interacting with the real world. I.e. agents are AIs or humans.

One AI =/= one vote. One human = one vote. AIs only get as much authority as humans, directly or indirectly, entrust them with. So, if an AI needs more authority, it has to justify it to humans and other AIs. And it can't request too much authority just for itself, as tasks that would require a lot of authority will be split between many AIs and people.
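
A minimal sketch of that rule, under the assumption (mine, for illustration) that each human can split a single vote among AIs and that an AI's authority is simply the sum of the shares it receives. `ai_authority` and all names are hypothetical, and indirect delegation through other AIs is not modeled.

```python
def ai_authority(delegations):
    """Sum the vote shares each AI receives from humans.

    delegations: {human: {ai: share}}, where one human's shares may not
    add up to more than one vote (assumed format for this sketch).
    """
    authority = {}
    for human, shares in delegations.items():
        if sum(shares.values()) > 1.0 + 1e-9:
            raise ValueError(f"{human} delegated more than one vote")
        for ai, share in shares.items():
            authority[ai] = authority.get(ai, 0.0) + share
    return authority

delegations = {
    "alice": {"ai_1": 0.5, "ai_2": 0.5},
    "bob": {"ai_1": 0.25},   # bob keeps the rest of his vote for himself
}
authority = ai_authority(delegations)
print(authority)
```

Under this scheme an AI that spawns extra AIs gains nothing by it: the copies hold no authority until some human delegates a share to them.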

You are right that the authority to "vote out" other AIs may be misused. That's where logs would be handy - for other agents to analyse the "minds" of both sides and see who was doing right. 

It's not completely foolproof, of course, but it means that attempts at a power grab are unlikely to happen completely under the radar.

Since there are no humans in the training environment, how do you teach that? Or do you put human-substitutes there (or maybe some RLHF-type thing)? Also, how would such AIs even reason about humans, since they can't read our thoughts? How are they supposed to know if we would like to "vote them out" or not? I do agree though that a swarm of cooperative AIs with different goals could be "safer" (if done right) than a single goal-directed agent.

This setup seems to get more and more complicated though. How are agents supposed to analyze "minds" of each other? I don't think modern neural nets can do that yet. And if we come up with a way that allows us to reliably analyze what an AI is thinking, why use this complicated scenario and not just train (RL or something) it directly to "do good things while thinking good thoughts", if we're relying on our ability to distinguish "good" and "bad" thoughts anyway?

(On an unrelated note, there already was a rather complicated paper (explained a bit simpler here, though not by much) showing that if agents reasoning in formal modal logic are able to read each other's source code and prove things about it, then at least in the case of a simple binary prisoner's dilemma you can make reasonable-looking agents that also don't do stupid things. Reading source code and proving theorems about it is a lot more extreme than analyzing thought logs, but at least that's something)

Since there are no humans in the training environment, how do you teach that? Or do you put human-substitutes there (or maybe some RLHF-type thing)?  

Yes, probably some human models.

Also, how would such AIs even reason about humans, since they can't read our thoughts? How are they supposed to know if we would like to "vote them out" or not?

By being aligned. I.e. understanding human values and complying with them. Seeking to understand other agents' motives and honestly communicating its own motives and plans to them, to ensure there are no conflicts from misunderstanding. I.e. behaving much like civil and well-meaning people behave when they work together.

And if we come up with a way that allows us to reliably analyze what an AI is thinking, why use this complicated scenario and not just train (RL or something) it directly to "do good things while thinking good thoughts", if we're relying on our ability to distinguish "good" and "bad" thoughts anyway?

Because we don't know how to tell "good" thoughts from "bad" reliably in all possible scenarios.

So, no "reading" minds, just looking at behaviours? Sorry, I misunderstood. Are you suggesting the "look at humans, try to understand what they want and do that" strategy? If so, then how do we make sure that the utility function they learned in training is actually close enough to actual human values? What if the agents learn something on the level of "smiling humans = good", which isn't wrong by default, but is wrong if taken to the extreme by a more powerful intelligence in the real world?

Thanks for helping me think this through.

For the first problem, the basic idea is that this is used to solve the specification problem of defining values and training a "conscience", rather than it being the full extent of training. The conscience can remain static, and provide goals for the rest of the "brain", which can then update its beliefs.

For the second issue, I meant that we would have no objective way to check "cooperate" and "respect" on the individual agent level, except that the individual can get other agents to cooperate with it. So eg, in order to survive/reproduce/get RL rewards, the agents have to consume a virtual resource that requires effort from multiple/many agents (simple implementation: some sort of voting; but can be more complicated, eg requiring tokens that are generated at a fixed rate for each agent), but also generally be non-competitive, eg no stealing tokens or food, and there's more than enough food for everyone, if they can cooperate. The theory is that this should lead to a form of tit-for-tat, including AIs detecting and deterring liars.
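
To make the "simple implementation: some sort of voting" variant concrete, here is a toy sketch. It assumes a hand-coded environment with no learning: an agent eats in a round only if at least `NEED` other agents back it, "reciprocators" open by backing everyone and afterwards back only those who backed them, and "freeloaders" back nobody. `NEED`, `run` and the strategy names are all invented for this example.

```python
NEED = 2  # supporters required to secure one meal (assumed value)

def run(strategies, rounds=5):
    """Count meals per agent; food is unlimited but needs NEED backers."""
    agents = list(strategies)
    meals = {a: 0 for a in agents}
    supported_by = None  # supported_by[a]: who backed `a` in the last round
    for _ in range(rounds):
        support = {}
        for a in agents:
            if strategies[a] == "freeloader":
                support[a] = set()                        # never helps anyone
            elif supported_by is None:
                support[a] = {b for b in agents if b != a}  # cooperative start
            else:
                support[a] = set(supported_by[a])         # tit-for-tat
        supported_by = {a: {b for b in agents if a in support[b]}
                        for a in agents}
        for a in agents:
            if len(supported_by[a]) >= NEED:
                meals[a] += 1
    return meals

strategies = {"r1": "reciprocator", "r2": "reciprocator",
              "r3": "reciprocator", "r4": "reciprocator", "f": "freeloader"}
meals = run(strategies)
print(meals)
```

The freeloader gets fed once, on the cooperative opening round, and then starves, while the reciprocators keep feeding each other. Whether trained RL agents would actually converge on anything like this is exactly the open question.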

Thinking a bit more: I think the really dangerous part of AI is the "independent agent", presumably trained with methods resembling RL; so that's the part I would train in this environment; it can then be hooked up to eg an LLM which is optimized on something like perplexity and acts more like ChatGPT, ie predicting the next word. Ie, have a separate "brain" and "conscience", with the brain possibly smarter but the "conscience" holding the reins; during the above training, mix different variants of both components, with different intelligence levels.

Okay, so if that's just a small component, then sure, first issue goes away (though I still have questions on how you're gonna make this simulation realistic enough to just hook it up to an LLM or "something smart" and expect it to set coherent and meaningful goals in real life, though that's more of a technical issue).

However, I believe there are still other issues with this approach. The way you describe it makes me think it's really similar to Axelrod's Iterated Prisoner's Dilemma tournament, where tit-for-tat did come out as one of the most successful strategies. But that wasn't the only successful strategy. For example there were strategies that were mostly tit-for-tat, but would defect if they could get away with it. If, for example, that still mostly results in tit-for-tat, except for some rare cases of it defecting when the agents in question are too stupid to "vote it out", do we punish it?

Second, tit-for-tat is quite susceptible to noise. What if the trained agent misinterprets someone's actions as "bad", even if they in actuality did something innocent, like "affectionately pat their friend on the back", which the AI interpreted as fighting? No matter how smart the AI gets, there still will be cases where it wrongly believes someone to be a liar, and so believes it has every justification to "deter" them. Do we really want even some small probability of that behaviour? How about an AI that doesn't hurt humans unconditionally, and not only when it believes us to be "good agents"?[1]
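
The noise problem is easy to see in a toy simulation. This is only a sketch under my own assumptions: two plain tit-for-tat players, and a `noise` probability that a player misreads the other's last move. `play` and `flip` are hypothetical helpers, not part of anyone's proposal.

```python
import random

def flip(move):
    """Swap cooperate ("C") and defect ("D")."""
    return "D" if move == "C" else "C"

def play(rounds, noise, seed=0):
    """Fraction of rounds in which two tit-for-tat players both cooperate."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    seen_by_a, seen_by_b = "C", "C"    # both start out assuming good faith
    mutual = 0
    for _ in range(rounds):
        # Tit-for-tat: each player repeats what it believes the other did.
        move_a, move_b = seen_by_a, seen_by_b
        if move_a == "C" and move_b == "C":
            mutual += 1
        # Perception step: each observation is wrong with probability `noise`.
        seen_by_a = flip(move_b) if rng.random() < noise else move_b
        seen_by_b = flip(move_a) if rng.random() < noise else move_a
    return mutual / rounds

clean = play(1000, noise=0.0)
noisy = play(1000, noise=0.05)
print(clean, noisy)
```

Without noise the two players cooperate in every round. With even a small misperception rate, a single misread triggers retaliation that echoes back and forth until another misread happens to cancel it, so the mutual-cooperation rate drops below 1.0.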

Last thing, how does the AI determine what is an "intelligent agent" and what is a "rock"? If there are explicit tags in the simulation for that, then how do you make sure every actual human in the real world gets tagged correctly? Also, do we count animals? Should animal abuse be enough justification for the AI to turn on the rage mode? What about accidentally stepping on an ant? And if you define "intelligent agent" as relative to the AI, then what do we do once it gets smart enough to rationally think of us like ants?

  1. ^

New post with a distillation of my evolved thoughts on this: https://www.lesswrong.com/posts/3SJCNX4onzu4FZmoG/alignment-ai-as-ally-not-slave

Will continue responding, but first, after reading the existing comments, I think I do need to explicitly make humans preferred. I propose that in the sim we have some agents whose inner state is "copyable" and they get reincarnated, and some agents who are not "copyable". Subtract points from all agents whenever a non-copyable agent is harmed/dies. The idea is that humans are not copyable, and that's the main reason AIs should treat us well, while AIs are, and that's the main reason we don't have to treat them well. But also, I think we as humans might actually have to learn to treat AIs well in some sense/to some degree...
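
A sketch of that reward rule, assuming a single flat penalty shared by every agent. `Agent`, `apply_harm` and `HARM_PENALTY` are hypothetical names, and the magnitude is arbitrary.

```python
from dataclasses import dataclass

HARM_PENALTY = 10.0  # assumed magnitude, purely illustrative

@dataclass
class Agent:
    name: str
    copyable: bool      # copyable agents can be reincarnated from a copy
    reward: float = 0.0

def apply_harm(agents, victim_name):
    """Subtract points from ALL agents if the harmed agent is non-copyable."""
    victim = next(a for a in agents if a.name == victim_name)
    if not victim.copyable:
        for a in agents:
            a.reward -= HARM_PENALTY
    return victim.copyable

agents = [Agent("ai_1", True), Agent("ai_2", True), Agent("human_1", False)]
apply_harm(agents, "ai_2")       # copyable victim: no shared penalty
after_ai = [a.reward for a in agents]
apply_harm(agents, "human_1")    # non-copyable victim: everyone pays
after_human = [a.reward for a in agents]
print(after_ai, after_human)
```

Because the penalty lands on everyone, each agent has a reason to stop others from harming non-copyable agents, not just to refrain itself.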

I think a key danger here is that treatment of other agents wouldn't transfer to humans, both because it's inherently different and because humans themselves are likely to be on the belligerent side of the spectrum. But even so I think it's a good start in defining an alignment function that doesn't require explicitly encoding some particular form of human values.

To extend the approach to address this, I think we'd have to explicitly convey a message of the form "do not discriminate based on superficial traits, only choices"; eg, in addition to behavioral patterns, agents possess superficial traits that are visible to other agents, and are randomly assigned with no particular correlation with the behaviors.

Better yet, have the agents experience discrimination themselves, to internalize the message that it is bad.

I think it could work better if AIs are of roughly the same power. Then if some of them try to grab more power, or otherwise misbehave, the others could join forces to oppose them together.

Ideally, there should be a way for AIs to stop each other fast, without having to resort to an actual fight.

In general my thinking was to have enough agents such that each would find at least a few within a small range of their level; does that make sense?

I just mean that "wildly different levels of intelligence" is probably not necessary, and maybe even harmful. Because then there will be a few very smart AIs at the top, which could usurp power without the smaller AIs even noticing.

Though, it could maybe work if those AIs are the smartest but have little authority. For example, they can monitor other AIs and raise the alarm/switch them off if they misbehave, but nothing else.

Part of the idea is to ultimately have a super intelligent AI treat us the way it would want to be treated if it ever met an even more intelligent being (eg, one created by an alien species, or one that it itself creates). In order to do that, I want it to ultimately develop a utility function that gives value to agents regardless of their intelligence. Indeed, in order for this to work, intelligence cannot be the only predictor of success in this environment; agents must benefit from cooperation with those of lower intelligence. But this should certainly be doable as part of the environment design. As part of that, the training would explicitly include the case where an agent is the smartest around for a time, but then a smarter agent comes along and treats it based on the way it treated weaker AIs. Perhaps even include a form of "reincarnation" where the agent doesn't know its own future intelligence level in other lives.

While having lower intelligence, humans may have more authority. And AIs' terminal goals should be specifically about assisting humans too.

Ideally, sure, except that I don't know of a way to make "assist humans" a safe goal. So I'm advocating for a variant of "treat humans as you would want to be treated", which I think can be trained.