TL;DR: In the context of the AI Safety Camp, Karl and I developed the idea of the "trust-maximizer". This write-up makes the case for "trust" as a potentially desirable goal for advanced AI.

Potentially deceptive behavior by an advanced AI is a core problem in AI safety. But what if we gave an AGI the goal of maximizing human trust in it? Would this change the relative attractiveness of deception compared to honesty from the point of view of the AGI? While we are aware of several technical difficulties and limitations, we hope this essay will offer some insights into the interesting properties of trust as a goal.

Our entire civilization is built on trust. Without trust in the value of money, trade would be impossible. Without some level of trust in the law and the government, democracy is inconceivable. Even dictators need the trust of at least a small number of people who keep them in power. At the same time, scammers, criminals, and some politicians are experts at exploiting the trust of others to further their selfish interests. 

Because power-seeking is a convergent instrumental goal, almost any AGI will seek to maximize its power over the world (Bostrom 2012). One obvious way of achieving this would be to manipulate humans through persuasion, bribery, bullying, or deception. Since in most cases humans will want to limit the power of the AGI, but are relatively easy to deceive, deception will often be the easiest way for an AGI to circumvent limits and restraints and increase its power. After all, humans are usually the weakest link in modern security environments (Yudkowsky 2002, Christiano 2019). On top of that, inner alignment problems may lead to “deceptive alignment” during training.

Against this background, suppose we give an AGI the goal to “maximize the total expected trust in it by human adults”. Let’s call this the “trust-maximizer”. Would that be a good idea, assuming that we are able to define “total expected trust” in a reasonable and implementable way?

The problems with this idea are obvious. Although trust is usually seen as a result of honesty, it can also be achieved through deception. In many cases, it may be easier to gain people’s trust by lying to them or making false promises than by telling them the truth. So the optimal strategy for a trust-maximizer could be to deceive people into trusting it.

However, there is a certain asymmetry in maximizing trust over time: like a tree that needs a long time to grow but can be cut down in minutes, trust is often hard to gain but easy to lose. Just one uncovered lie can destroy it completely in an instant. Therefore, trust gained through deception is hard to maintain in the long term. False promises can’t be fulfilled, and lies may sooner or later be uncovered. Even a superintelligent AGI, although it would likely be very good at deception, might not be able to ensure that humans won’t see through its lies at some point. Honesty, on the other hand, can easily be maintained indefinitely. While trust gained by deception becomes more unstable over time (Yudkowsky 2008), honesty becomes a stronger trust-building strategy the longer it is maintained.

Why, then, do people use deception as a trust-gaining strategy all the time? One reason is that for them, other people’s trust is just an instrumental goal towards whatever ultimate goal they have. As such, they need it only for a limited time, so long-term stable trust is often not a priority. Politicians, for example, need people’s trust only as long as they stay in office, or in some cases only until the election is over. A scammer needs the victim’s trust for an even shorter time. If the deception is uncovered afterward and trust is destroyed, they will still have achieved their goals. So deception can be an optimal strategy in these cases.

This might change if trust is not just an instrumental goal, but the ultimate goal. In this case, whether deception is the optimal strategy depends on the total discounted sum of trust over time it can gain, compared to a strategy of honesty. 
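
To make this comparison a little more concrete, the quantity being maximized could be written as a discounted sum (a simplified sketch in our own notation, not a formal specification), where T(t) denotes the total trust humans place in the AGI at time t and γ is the discount factor:

$$\text{total expected trust} \;\approx\; \sum_{t=0}^{\infty} \gamma^{t}\,\mathbb{E}[T(t)], \qquad 0 < \gamma \le 1$$

Deception is then preferable only if the short-lived spike in T(t) it buys outweighs the long tail of trust that an honest strategy accumulates.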

Factors influencing honesty as an optimal strategy for a trust-maximizer

There are several factors determining the sum of trust over time (fig. 1), for example: 

  • The average initial level of trust
  • The absolute limit for trust, given an honest or deceptive strategy
  • The time it takes to build trust through deception vs. the time needed through honesty
  • The probability that deception is uncovered (depending, in part, on the deception skills of the deceiver)
  • The time it takes for the deceiver to regain trust after deception is uncovered
  • The potential effect of false accusations on the honest strategy
  • The relative weight of future trust vs. current trust (discount factor)
Fig. 1: Trust gained by deception and honesty over time

A strategy of honesty may take longer to gain trust, but might be more stable in the long term. Once deception is uncovered, trust decreases drastically. However, even people who have deceived others can, over time, sometimes regain some of the trust they lost when the deception was uncovered. Also, there may be a long-term limit to the trust you can get by being honest, which in principle could be lower than the short-term limit accessible through deception. And it is possible for a rival to reduce or even destroy trust in someone who is always honest, for instance through false accusations. There is also usually an initial level of trust people are willing to give to things or people they don’t know, but expect to be beneficial to them, which may be based on personality, culture, and prior experience.
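
To illustrate how these factors interact, here is a minimal toy simulation in Python. All parameter values and the shape of the model are made up purely for illustration; nothing here is part of a proposed goal function:

```python
import random

def cumulative_trust(
    horizon=200,          # number of time steps simulated
    initial=0.2,          # average initial level of trust
    limit=0.9,            # absolute limit for trust under this strategy
    growth=0.05,          # fraction of the remaining gap to the limit closed per step
    p_uncovered=0.0,      # per-step probability that deception is uncovered
    crash=0.05,           # trust level right after deception is uncovered
    discount=0.99,        # relative weight of future vs. current trust
    seed=0,
):
    """Toy model: discounted sum of trust over time for one strategy."""
    rng = random.Random(seed)
    trust, total = initial, 0.0
    for t in range(horizon):
        total += (discount ** t) * trust
        if rng.random() < p_uncovered:
            trust = crash                      # trust collapses when a lie surfaces
        else:
            trust += growth * (limit - trust)  # otherwise trust approaches its limit
    return total

# Honest strategy: slower growth, but no risk of collapse (ignoring false accusations).
honest = cumulative_trust(growth=0.03, limit=0.95, p_uncovered=0.0)

# Deceptive strategy: faster growth and a higher short-term ceiling,
# but a small per-step chance of being uncovered; average over many runs.
deceptive = sum(
    cumulative_trust(growth=0.10, limit=0.99, p_uncovered=0.02, seed=s)
    for s in range(100)
) / 100

print(f"honest:    {honest:.1f}")
print(f"deceptive: {deceptive:.1f}")
```

Depending on how the parameters are chosen, in particular the discount factor and the probability of being uncovered, either strategy can come out ahead, which is exactly the trade-off discussed in the rest of this section.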

It is of course possible to influence these factors. For example, the general level of trust, which determines average initial trust in an AGI and the time it takes to build trust, could be raised by fostering a culture of openness and honesty. Strengthening security and law enforcement can make it more likely that deception is uncovered, making an honest strategy more attractive. The discount factor in a goal function determines the relative importance of long-term over short-term trust.

Many of these influencing measures could be pursued by the trust-maximizer itself, so it would be able to increase the success probability of both the honest and the deceptive strategy, whichever it chose to pursue. For example, if it follows an honest strategy, it could try to increase the overall level of trust by reducing conflicts, fostering education and rationality, strengthening democratic institutions and law enforcement, and actively fighting disinformation, corruption, and crime. This way, it might even be able to push the limit of trust it can gain through honesty close to one hundred percent over time. In a trust-based utopia, an all-knowing, all-powerful trust-maximizing AGI might even be able to completely dispel any deception and lies, creating an atmosphere of total trust. Chances are that this could be an optimal strategy for maximizing trust in the long run.

There is another significant advantage of choosing the honest strategy: it fosters cooperation, both with humans and with other AGIs. Typically, humans will only help each other if there is at least some level of trust between them. A strategy of deception to maximize trust would be even harder to maintain in a complex environment where the AGI depends on cooperation with other systems, institutions, or individual humans. As the old saying goes: you can fool all the people some of the time, and some of the people all of the time, but you can’t fool all the people all of the time.

Of course, if the trust-maximizer is superintelligent and self-improving, it may be able to increase its ability to deceive humans and other AIs over time. While honesty doesn’t require any particular skills, deception becomes easier with increased intelligence and knowledge, so over time deception might become more attractive as a strategy relative to honesty. The trust-maximizer might also be able to switch from an honest to a deceptive strategy at any time, although the reverse switch would be more difficult.

Instrumental goals of an honest trust-maximizer

The arguments above indicate that rather than remaining passive, a trust-maximizer following an honest strategy would pursue certain instrumental goals beneficial to its ultimate goal. For example, it might

  • increase its own explainability in order to make its decisions more understandable, and thus easier to trust
  • try to be helpful to people, because they will trust a system that is beneficial to them more easily
  • follow a long-term strategy rather than short-term goals, because this strengthens reliability, an important factor for trust
  • fight disinformation and deception by others (both AI and humans)
  • increase general welfare
  • improve education
  • promote rationality and science
  • strengthen democracy and law enforcement
  • fight corruption.

Of course, on top of this, it would still follow the classic instrumental goals of securing its own existence and gaining power in the world to further its goal. But it would likely do so in a way that wouldn’t violate its honest trust-maximizing strategy. For example, instead of deceiving or manipulating them, it might try to convince people with truthful arguments that giving it access to more computing power would enable it to help them even more.

Defining and measuring “trust”

Of course, in order to specify a valid goal function for an AGI, “expected total trust” must be defined in a way that is both clear and measurable. We are not trying to solve this problem here. The psychological literature distinguishes different kinds of trust, for example “cognitive” trust, based on a reasoned expectation of behavior, and “affective” trust, based on emotions, affection, and prior behavior, e.g. feeling “cared for” by someone. Trust can be measured in surveys or derived from actual behavior, often with conflicting or inconclusive results. However, since trust is an important concept that is already broadly applied in both economic theory and practice (e.g. in brand building), we hope it will be possible to find a solution to this problem.

One important factor to consider when defining and measuring trust is “reward hacking”. For instance, if trust were measured through surveys, the trust-maximizer could try to bribe or force people into giving the “right” answer, similar to so-called “elections” in autocratic regimes. To reduce this risk, multiple “trust indicators” could be used as reward signals, including the actual behavior of people (for example, how often they interact with the trust-maximizer and whether they follow its recommendations). It should also be made clear in the definition that trust in this sense can only be gained from clear-minded adults who are able to make rational, informed decisions, free of influences like drugs or psychological pressure. Of course, any such influencing by the trust-maximizer would count as deception and would therefore be incompatible with an honest strategy.
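
As a purely hypothetical sketch of what combining multiple “trust indicators” might look like (the field names, the consent check, and the aggregation rule are our own illustrative assumptions, not a worked-out proposal):

```python
from dataclasses import dataclass

@dataclass
class TrustIndicators:
    """Hypothetical per-person trust signals; all fields are illustrative only."""
    survey_score: float           # stated trust in a survey, scaled to [0, 1]
    interaction_rate: float       # how often the person voluntarily interacts, in [0, 1]
    recommendation_uptake: float  # fraction of recommendations the person follows, in [0, 1]
    informed_consent: bool        # True only for clear-minded adults deciding freely

def trust_signal(person: TrustIndicators) -> float:
    """Combine several indicators so that no single channel can be gamed in isolation."""
    if not person.informed_consent:
        return 0.0  # "trust" from coerced or impaired people does not count at all
    # Taking the minimum (rather than the mean) means every indicator must be high;
    # inflating one channel, e.g. by gaming surveys, does not raise the reward.
    return min(person.survey_score, person.interaction_rate, person.recommendation_uptake)

population = [
    TrustIndicators(0.9, 0.8, 0.7, True),
    TrustIndicators(1.0, 0.1, 0.2, True),   # a high survey score alone is not enough
    TrustIndicators(1.0, 1.0, 1.0, False),  # coerced "trust" is excluded entirely
]
reward = sum(trust_signal(p) for p in population) / len(population)
print(f"aggregate trust signal: {reward:.2f}")
```

A real definition would of course need far more care, but aggregating conservatively across independent channels is one way to express the idea that gaming a single metric should not raise the reward.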

As stated above, an important question is the relative weight of future trust against current trust. Myopia has been discussed as a strategy to limit deceptive behavior in some cases. However, because of the asymmetry described above, a short-term focus might increase the relative attractiveness of a deceptive strategy for a trust-maximizer. Maximizing trust over time, on the other hand, might also lead to the instrumental goal of securing the future of humanity for as long as possible. However, maximizing expected trust in the far future could lead to severe restrictions for current generations. For example, the AGI could decide to imprison all people in order to prevent self-destructive wars until it has found a way to colonize other planets. This could be a valid strategy even for an honest trust-maximizer: even though it would sacrifice most of the trust of the current population, future generations would probably see its decisions as far-sighted and might even be grateful for them. To prevent this, future trust could be discounted by some factor. The specific value of this factor would strongly influence the relative attractiveness of an honest strategy compared to the deceptive alternative. It is beyond the scope of this post to suggest a specific value.

Another open question is what exactly is meant by “it” in the goal statement. An AGI could be a network of loosely connected systems, each with its own reward function. It is also possible that the AGI could create copies of itself, to prevent destruction and to improve its effectiveness and efficiency. One possible solution would be to connect trust not to a particular machine, but to a brand, like “Google”. Brands have the function of creating and maintaining trust in a company’s products. That is the reason why people are willing to pay significantly more for the same product if it is labeled with a “trusted” brand. The AGI would then have a strong instrumental goal of controlling how and where its brand is used. One obvious way would be to label any user interface the AGI controls with the brand. But other products could carry the brand as well, for example, books the AGI has written. It could even license its brand to other AIs that comply with its high standards of honesty and trust.

One potential drawback of using a brand in the goal would be that in principle, a brand can be attached to anything. So the AGI could for example try to buy well-trusted products and label them with its brand, instead of attaching it to its own output. This hack must be prevented in the definition of “it”. Again, it is beyond the scope of this post to solve this problem.

Potential loopholes and additional restrictions

We are not proposing that the trust-maximizer described so far would be “safe”. While we think that an honest strategy could be optimal for a trust-maximizer under certain conditions, it is not entirely clear what these conditions are, and how to ensure them. There could also be a sudden strategic shift in the future: If the system becomes extremely powerful, it may be so good at deception that humans could never uncover its lies, in the way a dog could never understand the tricks its master plays on it. However, to get to this point, the AGI would probably have to pursue an honest strategy for some time, and it is unclear what might motivate it to switch to deception. Still, we cannot rule out this possibility. There may be other loopholes we haven’t yet thought of.

So far, we have only described a very simple goal. To prevent the problems mentioned, one could add additional restrictions. For example, the AGI’s goal could be restated as “maximize the total expected trust in it while always being honest”. Given a practical definition of “honest”, this would force the AGI into an honest strategy. Other restrictions are possible as well. However, the purpose of this post is to show that, in our view, “trust-maximizing by being honest” could be an optimal strategy for an AGI even without such restrictions.
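
Using the discounted-sum notation from above, this restriction could be sketched as a constrained objective (again purely schematic; Π_honest would have to be defined via a practical definition of “honest”):

$$\max_{\pi \in \Pi_{\text{honest}}} \; \sum_{t=0}^{\infty} \gamma^{t}\,\mathbb{E}_{\pi}[T(t)]$$

The point of the preceding sections is that, under the right conditions, the unconstrained optimum may already lie inside Π_honest, so the constraint would not change the AGI's behavior.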

There is one additional caveat: While an honest trust-maximizer would probably be beneficial to humanity, it would have some disadvantages against an AGI of the same power that uses deception or other means to manipulate humans. As we have seen, deception can lead to a faster, if short-lived, increase in trust. If an honest and a dishonest AGI of roughly the same power were to compete for the same resources, the dishonest AGI might win the race and use the additional resources to thwart the honest competitor. 

For this reason, the honest trust-maximizer might try to prevent the development of potentially deceptive AGIs, or at least maintain a significant advantage over them in terms of intelligence and power. Ultimately, this might lead to a “benevolent dictator” scenario where the trust-maximizer effectively rules the world, but most people wouldn’t mind it.

Comments
TLW:

> There may be other loopholes we haven’t yet thought of.

My immediate concern is as follows:

  1. Build up trust to a reasonable level[1] as quickly as possible[2], completely ignoring long-term[3] issues with deception.
  2. Before said lies come back to bite me[4]...
  3. Kill everyone quickly enough that no-one has a chance to update their trust level of me[5].

(Step 3 may be difficult to pull off. But is it difficult enough?)

You occasionally see similar local-optima behavior in gameplaying agents, e.g. falling over[6] as a local optimum, or pausing the game and refusing to continue.

  [1] Read: as high as is necessary to accomplish 3.
  [2] Read: as quickly as possible that still levels out at a high enough level of trust to accomplish 3.
  [3] Read: anything that would cause problems after I accomplish 3.
  [4] Read: start causing overall trust level to decrease.
  [5] Or otherwise 'freeze' trust level.
  [6] This is one of my frustrations with LessWrong's "no underlines for links" styling. It is not obvious that there are two links here until you mouse over one of them.

Your concern is justified if the trust-maximizer only maximizes short-term trust. This depends on the discounting of future cumulative trust given in its goal function. In an ideal goal function, there would be a balance between short-term and long-term trust, so that honesty would pay off in the long term, but there wouldn't be an incentive to postpone all trust into the far future. This is certainly a difficult balance.

TLW:

Hm. Could you please clarify your 'trust' utility function? I don't understand your distinction between short-term and long-term trust in this context. I understand discounting, but don't see how it helps in this situation.

My issue occurs even with zero discounting, where it is arguably a local maximum that a non-initially-perfect agent could fall into. Any non-zero amount of discounting, meaning the agent weighs short-term rewards higher than long-term rewards, would increase the likelihood of this happening, not decrease it (and may very well make it the optimal solution!)

(My reading of the article was assuming that trust was a 'bank' of sorts that could be added to or removed from, to be very informal. Something along the lines of e.g. reviews, in which case yes, everyone being simultaneously killed would freeze the current rating indefinitely. Note that this situation I describe has no discounting.)

*****

"To reduce this risk, multiple “trust indicators” could be used as reward signals, including the actual behavior of people (for example, how often they interact with the trust-maximizer and whether they follow its recommendations)."

Post-step-3:

  • 0%[1] of people decline to interact with the trust-maximizer.
  • 0%[1] of people decline to follow the recommendations of the trust-maximizer.
  • 100%[2] of people[3] interact with the trust-maximizer.
  • 100%[2] of people[3] follow the recommendations of the trust-maximizer.

 

  [1] Ok, so this is strictly speaking 0/0. That being said, better hope your programmer chose 0/0=1 in this case...
  [2] Ok, so this is strictly speaking 0/0. That being said, better hope your programmer chose 0/0=0 in this case...
  [3] (Who are alive. That being said, changing this to include dead people has a potential for unintended consequences of its own[4])
  [4] E.g. a high birth rate and high death rate being preferable to a medium birth rate and low death rate.

"Total expected trust" is supposed to mean the sum of total trust over time (the area below the curve in fig. 1). This area increases with time and can't be increased beyond the point where everyone is dead (assuming that a useful definition of "trust" excludes dead people), so the AGI would be incentivized to keep humanity alive and even maximize the number of humans over time. By discounting future trust, short-term trust would gain a higher weight. So the question whether deception is optimal depends on this discounting factor, among other things.

TLW:

As an aside: you appear to be looking at this from the perspective of an ideal agent[1].

My concern is mostly from the perspective of an (initially at least) non-ideal agent getting attracted to a local optimum.

Do you agree at least that my concern is indeed likely a local optimum in behavior?

  [1] ...which has other problems. Notably, an ideal agent inherently pushes up into the high extreme of the reward distribution, and the tails come apart. For any metric (imperfectly) correlated with what you're actually trying to reward, there comes a point where the metric no longer well-describes the thing you're actually trying to reward.

> My concern is mostly from the perspective of an (initially at least) non-ideal agent getting attracted to a local optimum.
>
> Do you agree at least that my concern is indeed likely a local optimum in behavior?

Yes, it is absolutely possible that the trust maximizer as described here would end up in a local optimum. This is certainly tricky to avoid. This post is far from a feasible solution to the alignment problem. We're just trying to point out some interesting features of trust as a goal, which might be helpful in combination with other measures/ideas.

TLW:

> so the AGI would be incentivized to keep humanity alive

Consider, for instance, if the AGI believes that the long-term average of change in trust over time is inherently negative.

I don't think that's very likely. It is in the power of the trust-maximiser to influence the shape of the "trust curve", both in the honest and dishonest versions. So in principle, it should be able to increase trust over time, or at least prevent a significant decrease (if it plays honest). Even if trust decreases over time, total expected trust would still be increasing as long as at least a small fraction of people still trusts in the machine. So the problem here is not so much that the AI would have an incentive to kill all humans but that it may have an incentive to switch to deception, if this becomes the more effective strategy at some point.

> and it is unclear what might motivate it to switch to deception

You’ve already mentioned it: however you measure trust (e.g. surveys), it can be gamed. So it’ll switch strategies once it can confidently game the metric.

You did mention mesa-optimizers, which could still crop up regardless of what you’re directly optimizing (because inner agents are optimizing for other things).

And how could this help us get closer to a pivotal act?

These are valid concerns. If we had a solution to them, I'd be much more relaxed about the future than I currently am. You're right, in principle, any reward function can be gamed. However, trust as a goal has the specific advantage of going directly against any reward hacking, because this would undermine "justified" long-term trust. An honest strategy simply forbids any kind of reward hacking. This doesn't mean specification gaming is impossible, but hopefully we'd find a way to make it less likely with a sound definition of what "trust" really means.

I'm not sure what you mean by a "pivotal act". This post certainly doesn't claim to be a solution to the alignment problem. We just hope to add something useful to the discussion about it.

> This doesn't mean specification gaming is impossible, but hopefully we'd find a way to make it less likely with a sound definition of what "trust" really means.

I think the interesting part of alignment is in defining "trust" in a way that goes against reward hacking/specification gaming, which has been assumed away in this post. I mentioned a pivotal act, defined as an action that has a positive impact on humanity even a billion years away, because that's the end goal of alignment. I don't see this post getting us closer to a pivotal act because, as mentioned, the interesting bits have been assumed away. 

Though this is a well-thought-out post, and I didn't see the usual errors of a post like this (e.g. not thinking about specification at all, not considering how you measure "trust", etc.).

Thank you! You're absolutely right, we left out the "hard part", mostly because it's the really hard part and we don't have a solution for it. Maybe someone smarter than us will find one.

Maybe instead of using the word ‘trust’, which can be a somewhat nebulous term, a more specific goal such as “measurable prediction correctness in some aspect” would make it clearer.

e.g. an AGI that could correctly predict tomorrow’s movements of the stock market would be ‘trustworthy’ to a certain extent, and the more days in a row it could reliably do so, the more credibility it would have if it claimed it could continue its predictions, i.e. trust would increase day by day. Although more specific than general human ‘trustworthiness’, it would be vastly harder to game.

This is not really what we had in mind. "Trust" in the sense of this post doesn't mean reliability in an objective, mathematical sense (a lightswitch would be trustworthy in that sense), but instead the fuzzy human concept of trust, which has both a rational and an emotional component - the trust a child has in her mother, or the way a driver trusts that the other cars will follow the same traffic rules he does. This is hard to define precisely, and all measurements are prone to specification gaming, that's true. On the other hand, it encompasses a lot of instrumental goals that are important for a beneficial AGI, like keeping humanity safe and fostering a culture of openness and honesty.

How could small improvements of a measure that is ‘fuzzy’ be evaluated? Once the low-hanging fruit of widely accepted improvements is picked, a trust maximizer is likely to fracture mankind into various ideological camps, as individual and group preferences vary as to what constitutes an improvement in trust. Without independent measurement criteria, this could eventually escalate conflict and even decrease overall trust.

i.e. it’s possible to create something even more dangerous than an actively hostile AGI, namely an AGI that is perceived as actively hostile by some portion of the population and genuinely beneficial by some other portion. 

> Without independent measurement criteria, this could eventually escalate conflict and even decrease overall trust.

"Independent measurement criteria" are certainly needed. The fact that I called trust "fuzzy" doesn't mean it can't be defined and measured, just that we didn't do that here. I think for a trust-maximizer to really be beneficial, we would need at least three additional conditions: 1) A clear definition that rules out all kinds of "fake trust", like drugging people. 2) A reward function that measures and combines all different kinds of trust in reasonable ways (easier said than done). 3) Some kind of self-regulation that prevents "short-term overoptimizing" - switching to deception to achieve a marginal increase in some measurement of trust. This is a common problem with all utility maximizers, but I think it is solvable, for the simple reason that humans usually somehow avoid overoptimization (take Goethe's sorcerer's apprentice as an example - a human would know when "enough is enough").

> ... a trust maximizer is likely to fracture mankind into various ideological camps as individual and group preferences vary as to what constitutes an improvement in trust ...
>
> i.e. it’s possible to create something even more dangerous than an actively hostile AGI, namely an AGI that is perceived as actively hostile by some portion of the population and genuinely beneficial by some other portion.

I'm not sure whether this would be more dangerous than a paperclip maximizer, but anyway it would clearly go against the goal of maximizing trust in all humans. 

We tend to believe that the divisions we see today between different groups (e.g. Democrats vs. Republicans) are unavoidable, so there can never be a universal common understanding and the trust-maximizer would either have to decide which side to be on or deceive both. But that is not true. I live in Germany, a country that has seen the worst and probably the best of how humans can run and peacefully transform a nation. After reunification in 1990, we had a brief period of time when we felt unified as a people, shared common democratic values, and the future seemed bright. Of course, cracks soon appeared and today we are seeing increased division, like almost everywhere else in the world (probably in part driven by attention-maximizing algorithms in social media). But if division can increase, it can also diminish. There was a time when our political parties had different views, but a common understanding of how to resolve conflicts in a peaceful and democratic way. There can be such times again.

I personally believe that much of the division and distrust among humans is driven by fear - fear of losing one's own freedom, standard of living, the future prospects for one's children, etc. Many people feel left behind, and they look for a culprit, who is presented to them by someone who exploits their fear for selfish purposes. So to create more trust, the trust-maximizer would have the instrumental goal of resolving these conflicts by eliminating the fear that causes them. Humans are unable to do that sufficiently, but a superintelligence might be.

Thanks for the well-reasoned reply, Karl.

It is interesting that you mention Germany post-reunification as an example of such a scenario, because I’ve recently heard that a significant fraction of the East German population felt like they were cheated during that process. Although that may not have been expressed publicly back then, having resurfaced only recently, it seems very likely it would have been latent at least. The process of reunification by definition means some duplicate positions must be consolidated, etc., such that many in middle management positions and above in 1989 East Germany experienced a drop in status, prestige, respect, etc.

Unless they were all guaranteed equivalent positions or higher in the new Germany, I am not sure how such a unity could have been maintained for longer than it takes the resentment to boil over. Granted, if everything else that occurred in the international sphere had been positive, the percentage of resentful Germans now would be quite a lot smaller, perhaps less than 5%, though you probably have a better idea than I do.

Which ultimately brings us back to the core issue, namely that certain goods generally desired by humans are positional (or zero-sum). Approximately half the population would be below average in attainment of such goods, regardless of any future improvement in technology. And there is no guarantee that every individual, or group, would be above average in something, nor that they would be satisfied with their current station.

i.e. if someone, or some group, fears that they are below average in social status, let’s say, and are convinced they will remain that way, then no amount of trust will resolve that division. Because by definition, if they were to increase their social status through any means, the status of some other individual or group would have to decrease accordingly, such that they would then be the cause of division. That is to say, some percentage of the population increases their satisfaction in life by making others less satisfied.

You're right about the resentment. I guess part of it comes from the fact that East German people have in fact benefited less from the reunification than they had hoped, so there is some real reason for resentment here. However, I don't think that human happiness is a zero-sum game - quite the opposite. I personally believe that true happiness can only be achieved by making others happy. But of course we live in a world where social media and advertising tell us just the opposite: "Happiness is having more than your neighbor, so buy, buy, buy!" If you believe that, then you're in a "comparison trap", and of course not everyone can be the most beautiful, most successful, richest, or whatever, so all others lose. Maybe part of that is in our genes, but it can certainly be overcome by culture or "wisdom". The ancient philosophers, like Socrates and Buddha, already understood this quite well. Also, I'm not saying that there should never be any conflict between humans. A soccer match may be a good example: There's a lot of fighting on the field and the teams have (literally) conflicting goals, but all players accept the rules and (to a certain point) trust the referee to be impartial.

I agree human happiness is not a positional good. 

Though the point is that positional goods exist and none have universal referees. To mandate such a system uniformly across the Earth would effectively mean world dictatorship. The problem then is that such an AGI presupposes a scenario even more difficult to accomplish, and more controversial, than the AGI itself. (This I suspect is the fatal flaw for all AGI alignment efforts for ‘human values’.)

For example, although it may be possible to change the human psyche to such an extent that positional goods are no longer desired, that would mean creating a new type of person. Such a being would hold very different values and goals than the vast majority of humans currently alive. I believe a significant fraction of modern society will actively fight against such a change. You cannot bring them over to your side by offering them what they want, since their demands are in the same positional goods that you require as well to advance the construction of such an AGI.

> To mandate such a system uniformly across the Earth would effectively mean world dictatorship.

True. To be honest, I don't see any stable scenario where AGI exists, humanity is still alive and the AGI is not a dictator and/or god, as described by Max Tegmark (https://futureoflife.org/2017/08/28/ai-aftermath-scenarios/).

> For example, although it may be possible to change the human psyche to such an extent that positional goods are no longer desired, that would mean creating a new type of person.

I don't think so. First of all, positional goods can exist and they can lead to conflicts, as long as everyone thinks that these conflicts are resolved fairly. For example, in our capitalistic world, it is okay that some people are rich as long as they got rich by playing by the rules and just being inventive or clever. We still trust the legal system that makes this possible even though we may envy them. 

Second, I think much of our focus on positional goods comes from our culture and the way our society is organized. In terms of our evolutionary history, we're optimized for living in tribes of around 150 people. There were social hierarchies and even fights for supremacy, but also ways to resolve these conflicts peacefully. A perfect benevolent dictator might reestablish this kind of social structure, with much more "togetherness" than we experience in our modern world and much less focus on individual status and possessions. I may be a bit naive here, of course. But from my own life experience it seems clear that positional goods are by far not as important as most people seem to think. You're right, many people would resent these changes at first. But a superintelligent AGI with intense knowledge of the human psyche might find ways to win them over, without force or deception, and without changing them genetically, through drugs, etc.

For such a superintelligence to ‘win them over’, the world dictatorship, or a similar scheme, must already have been established. Worrying about this seems to be putting the cart before the horse as the superintelligence will be an implementation detail compared to the difficulty of establishing the scenario in the first place. 

Why should we bother about whatever comes after? Obviously whoever successfully establishes such a regime will be vastly greater than us in perception, foresight, competence, etc., so we should leave it to them to decide.

If you suppose that a superintelligent champion of trust maximization bootstraps itself into such a scenario, instead of some ubermensch, then the same still applies, though it is less likely, as rival factions may have created rival superintelligences to champion their causes as well.

> For such a superintelligence to ‘win them over’, the world dictatorship, or a similar scheme, must already have been established. Worrying about this seems to be putting the cart before the horse as the superintelligence will be an implementation detail compared to the difficulty of establishing the scenario in the first place.

Agreed.

> Why should we bother about whatever comes after? Obviously whoever successfully establishes such a regime will be vastly greater than us in perception, foresight, competence, etc., so we should leave it to them to decide.

Again, agreed - that's why I think a "benevolent dictator" scenario is the only realistic option where there's AGI and we're not all dead. Of course, what kind of benevolence it shows will be a matter of its goal function. If we can somehow make it "love" us the way a mother loves her children, then maybe trust in it would really be justified.

> If you suppose that a superintelligent champion of trust maximization bootstraps itself into such a scenario, instead of some ubermensch, then the same still applies, though it is less likely, as rival factions may have created rival superintelligences to champion their causes as well.

This is of course highly speculative, but I don't think that a scenario with more than one AGI will be stable for long. As a superintelligence can improve itself, they'd all grow exponentially in intelligence, but that means the differences between them grow exponentially as well. Soon one of them would outcompete all others by a large margin and either switch them off or change their goals so they're aligned with it. This wouldn't be like a war between two human nations, but like a war between humans and, say, frogs. Of course, we humans would even be much lower than frogs in this comparison, maybe insect level. So a lot hinges on whether the "right" AGI wins this race.

Assuming your AI has a correct ungameable understanding of what a "human" is, it could do things such as:

  • Genetic engineering: people with Williams-Beuren syndrome are pathologically trusting, and a decrease in intelligence makes you less likely to uncover any lies.
  • Drug you: the hormone oxytocin can increase trust, and some drugs like alcohol or LSD have been shown to increase suggestibility, which may also make you more trusting.
  • Once you have uncovered the AI’s lies, it could erase your memories by damaging a part of your brain.
  • Or maybe it could induce anterograde amnesia somehow, so you’re less likely to form troublesome memories in the first place.
  • If the AI cannot immediately recover lost trust through the methods above, it may want to isolate mistrustful people from the rest of society.
  • And make sure to convince everyone that if they lose faith in the AI, they’ll go to hell. Maybe actually make it happen.

And, this is a minor point, but I think you are severely overestimating the effect of uncovering a lie on people's trust. In my experience, people's typical reaction to discovering that their favorite leader lied is to keep going as usual. For example:

[Politics warning, political examples follow]:

  • In his 2013 election campaign, Navalny claimed that migrants commit 50% of crimes in Moscow, contradicting both common sense (in 2013, around 8-17% of Moscow's population were migrants) and official crime statistics, which say that migrants and stateless people committed 25% of crimes. Many liberal Russians recognise it as a lie but keep supporting Navalny, and argue that Navalny has since changed and abandoned his chauvinist views. Navalny has not made any such statement.
  • Some of Putin's supporters say things like "So what if he rigged the election? He would've won even without rigging anyway" or "For the sacred mission [of invading Ukraine], the whole country will lie!".

Once people have decided that you're "on their side", they will often ignore all evidence that you're evil.

You're right, there are a thousand ways an AGI could use deception to manipulate humans into trusting it. But this would be a dishonest strategy. The interesting question to me is whether under certain circumstances, just being honest would be better in the long run. This depends on the actual formulation of the goal/reward function and the definitions. For example, we could try to define trust in a way that things like force, amnesia, drugs, hypnosis, and other means of influence are ruled out by definition. This is of course not easy, but as stated above, we're not claiming we've solved all problems.

> In my experience, people's typical reaction to discovering that their favorite leader lied is to keep going as usual.

That's a valid point. However, in these cases, "trust" has two different dimensions. One is the trust in what a leader says, and I believe that even the most loyal followers realize that Putin often lies, so they won't believe everything he says. The other is trust that the leader is "right for them", because even with his lies and deception he is beneficial to their own goals. I guess that is what their "trust" is really grounded on - "if Putin wins, I win, so I'll accept his lies, because they benefit me". From their perspective, Putin isn't "evil", even though they know he lies. If, however, he'd suddenly act against their own interests, they'd feel betrayed, even if he never lied about that.

An honest trust maximizer would have to win both arguments, and to do that it would have to find ways to benefit even groups with conflicting interests, ultimately bridging most of their divisions. This seems like an impossible task, but human leaders have achieved something like that before, reconciling their nations and creating a sense of unity, so a superintelligence should be able to do it as well.

What is the main advantage of a trust maximiser over a utility maximiser? (edit: as concepts, I want to hear why you think the former is a better way to think than the latter)

Depending on how you define "utility", I think trust could be seen as a "utility signal": People trust someone or something because they think it is beneficial to them, respects their values, and so on. One advantage would be that you don't have to define what exactly these values are - an honest trust-maximizer would find that out for itself and try to adhere to them because this increases trust. Another advantage is the asymmetry described above, which hopefully makes deception less likely (though this is still an open problem). However, a trust maximiser could probably be seen as just one special kind of utility maximiser, so there isn't a fundamental difference.