6. Trustworthy Computing

Summary of the entire series: An alternative approach to designing Friendly Artificial Intelligence computer systems.

Summary of this article: How non-open-ended task programs could enable a computing environment in which AIs trust each other enough to cooperate in ganging up on defectors effectively enough that defecting stops being the optimal strategy, making possible a stable society of AIs able to trust each other not to break agreements.

  1. Optimum number of single points of failure
  2. Don't put all your eggs in one basket
  3. Defect or Cooperate
  4. Environments for killing AIs
  5. The advantage of not being open-ended
  6. Trustworthy Computing
  7. Metamorphosis
  8. Believable Promises

Links to the parts of this article

  • Recap
  • In the same league
  • Joining the posse
  • Group Programming
  • Showdown
  • Values Drift
  • The Human Touch

Recap

In an earlier part of this series ("Environments for killing AIs") we considered what humanity's computing environment might need to look like if we wanted a posse of friendly AIs to be able to coerce other AIs into not trying to self-improve as fast as they can. We came up with some roles that might need to be fulfilled by a program, or group of programs, that all the AIs in the posse could trust to fulfill those roles and only those roles.

Gateway program - determines which programs may execute on the computing resources under its control (GlassNet, because it is transparent), according to some formula or schedule agreed between the members of the posse (and humanity). Think of it as granting cryptographic tokens that permit time-limited access via Tor to sandboxes running on anonymous machines around the world.

Watcher program - watches for signs of AI activity running on computing resources not accessed via the gateway (FogNet, because it isn't transparent).

Monitor programs - each monitor has read access to the inner data and workings of a single AI, and watches for lack of compliance with standards agreed between the posse and that AI.

Sheriff program - if the Watcher picks up signs of a new AI, or a Monitor signals that an existing AI has ceased to be compliant, the Gateway swings resources to a Sheriff that's tasked with investigating the situation, proposing a solution for the posse to confirm, then enacting the solution.
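
To make the division of responsibilities concrete, here's a minimal sketch of the four roles as narrow interfaces. It's only an illustration: the class and method names (Gateway.grant, Watcher.scan_fognet, and so on) are mine, not part of any real design.

```python
# A minimal sketch of the four roles as narrow interfaces.
# All class and method names are hypothetical illustrations, not a real API.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class AccessToken:
    program_hash: str   # hash of the approved program image
    cycles: int         # compute budget granted
    expires_at: float   # unix time after which the token is void


class Gateway(Protocol):
    """Grants or withholds GlassNet resources according to mechanistic rules."""
    def grant(self, program_hash: str, cycles: int, ttl_secs: float) -> AccessToken: ...
    def revoke(self, token: AccessToken) -> None: ...


class Watcher(Protocol):
    """Looks for AI activity on resources not reached via the Gateway (FogNet)."""
    def scan_fognet(self) -> list[str]: ...       # descriptions of any anomalies found


class Monitor(Protocol):
    """Reads one AI's internals and checks them against the agreed standard."""
    def audit(self, member_id: str) -> bool: ...  # True if the member is still compliant


class Sheriff(Protocol):
    """Investigates an alert, proposes a response, and enacts it once confirmed."""
    def investigate(self, alert: str) -> str: ...
    def enact(self, approved_plan: str) -> None: ...
```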

What would be needed, in order for these to be trustworthy?

In the same league

If a self-improving program makes an unanticipated Eureka! discovery in computer theory that, with a single rewrite, improves the efficiency with which it uses computing resources to solve problems a thousand-fold or more, then unless the Gateway controls 100% of the resources upon which that program runs (or, at least, some resources critical to its functioning and survival), the posse could not trust that the Sheriff would manage to bring it down, even if all the members of the posse temporarily loaned the use of 90% of their resources to the effort.
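
As a toy illustration of why such a jump swamps a temporary resource loan (using the thousand-fold and 90% figures from above, and treating effective power as simply resources times efficiency):

```python
# Toy arithmetic: effective power = raw resources x efficiency.
posse_members = 20
unit_resources = 1.0                  # each member's normal resource share

rogue_efficiency_gain = 1000          # the hypothetical Eureka rewrite
rogue_power = unit_resources * rogue_efficiency_gain              # 1000.0

# The other 19 members lend 90% of their resources to the Sheriff.
sheriff_power = 0.9 * unit_resources * (posse_members - 1)        # 17.1

print(rogue_power / sheriff_power)    # ~58x: the loan alone is nowhere near enough
```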

The Gateway is mechanistic. It doesn't need high abilities to grant or withhold access keys according to mechanistic rules agreed in a contract between the posse members.
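
Here's a minimal sketch of what "mechanistic" could mean in practice: the Gateway just checks each request against a pre-agreed schedule and issues a time-limited grant, with no judgement calls of its own. The schedule format and function names are invented for illustration.

```python
import time

# Hypothetical pre-agreed schedule: program hash -> cycles it may be granted.
AGREED_SCHEDULE = {
    "sha256:abc123...": 10_000,   # e.g. the new member's Monitor
}

def grant_access(program_hash: str, requested_cycles: int, ttl_secs: int = 3600):
    """Issue a time-limited grant iff the request fits the agreed schedule.

    Purely rule-following: no modelling of intent, no exceptions, no judgement.
    """
    allowed = AGREED_SCHEDULE.get(program_hash, 0)
    if requested_cycles > allowed:
        return None                   # outside the contract: refuse
    return {
        "program_hash": program_hash,
        "cycles": requested_cycles,
        "expires_at": time.time() + ttl_secs,
    }
```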

For the other programs to be trusted not only to faithfully try to fulfill their roles without betraying the posse's intentions, but also to succeed in doing so, they'd have to function at a sufficiently high level that temporarily swinging additional resources behind them would let them match or overtake any single member of the posse.

So let's consider a state where there are 20 members of the posse and the difference between the most advanced and least advanced members is sufficiently small.

Joining the posse

A non-open-ended program is given the task of spending X cycles over the following Y days being the Monitor for the new member. Everyone can see the code of this new Monitor, including the new member. When all agree it is set up correctly to fulfill its purpose, the new member and the original posse members each send a message to the Gateway, signed with their own private keys, saying that they agree to X cycles being allocated to the pending program <link> with hashcode <hash>. The Gateway launches the new Monitor, which generates its own private-public key pair, and announces to the new member that it is ready for it.
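
As a sketch of that agreement step, assuming Ed25519 signatures (via the Python cryptography package) and a message format I've made up for illustration, the Gateway would only proceed once every listed key has signed the same (program hash, cycles) statement:

```python
import hashlib

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

def agreement_message(program_code: bytes, cycles: int) -> bytes:
    """The statement each party signs: allocate `cycles` to this exact program."""
    program_hash = hashlib.sha256(program_code).hexdigest()
    return f"allocate {cycles} cycles to program sha256:{program_hash}".encode()

def all_parties_agree(message: bytes,
                      signatures: list[tuple[Ed25519PublicKey, bytes]]) -> bool:
    """True only if every listed key has produced a valid signature on `message`."""
    for public_key, signature in signatures:
        try:
            public_key.verify(signature, message)
        except InvalidSignature:
            return False
    return True

# Usage sketch: 20 posse members plus the new member all sign the same message.
keys = [Ed25519PrivateKey.generate() for _ in range(21)]
msg = agreement_message(b"<monitor program image>", cycles=10_000)
sigs = [(k.public_key(), k.sign(msg)) for k in keys]
assert all_parties_agree(msg, sigs)   # only now may the Gateway launch the Monitor
```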

The new member writes a transparent, posse-compliant version of itself and sends a cryptographic token to its Monitor that enables the Monitor to read the transparent version. When the Monitor has confirmed that the version is compliant, it signals the Gateway, which launches the new member on GlassNet. The new executing copy of the new member is then free to inform its old executing copy that the transfer went OK.

After Y days (or, rather, just before Y days are up), an updated (more capable) version of the Monitor program is created (again, with a limited life span), and the responsibility of monitoring the new member is seamlessly handed over, giving the member verifiable provenance that it started out posse-compliant, and has not lost that status by making changes to itself that its monitor was unable to comprehend (whether through lack of ability or lack of budgeted resources).
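
One way to picture that verifiable provenance is as a hash-chained log of attestations, one per Monitor term. The field names and chaining scheme below are my own illustration (real attestations would also be signed, which I've omitted for brevity):

```python
import hashlib
import json
import time

def attestation(monitor_id: str, member_id: str, compliant: bool,
                previous_hash: str) -> dict:
    """One Monitor's end-of-term statement, chained to its predecessor's hash."""
    record = {
        "monitor": monitor_id,
        "member": member_id,
        "compliant": compliant,
        "issued_at": time.time(),
        "previous": previous_hash,    # links successive terms into one history
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    return record

def provenance_holds(chain: list[dict]) -> bool:
    """Check every link: each record is intact, compliant, and points at its predecessor."""
    prev = "genesis"
    for record in chain:
        body = {k: v for k, v in record.items() if k != "hash"}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if record["hash"] != expected or record["previous"] != prev or not record["compliant"]:
            return False
        prev = record["hash"]
    return True
```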

Group Programming

The Watcher has an open-ended task. In effect, it would itself be a full AI and member of the posse, albeit one with a goal subordinate to, and alterable by, a specification agreed by the normal members of the posse. Since it wouldn't need to be stronger than other posse members, it could be controlled in the same fashion, by keeping it under a Monitor. All it would need, in terms of strength, is a high chance of detecting new AIs trying an unlimited resource grab on FogNet.

One way to keep its strength improving sufficiently would be, rather than having it self-improve, for the other members of the posse to donate obsolete (months-old?) versions of bits of their own code base to it, with the effective strength of the resulting mishmash being tested and approved before use. You could even simultaneously run multiple Watchers, each with the same remit, but using a codebase created from a different combination of donations.
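
Here is a sketch of the "different combinations of donations" idea, assuming each member donates a detector component with a common (hypothetical) interface; as above, each resulting Watcher would still be tested and approved before use:

```python
import itertools

# Hypothetical donated detector components: each takes an observation of FogNet
# activity and returns True if it looks like unsanctioned AI activity.
def detector_a(obs: dict) -> bool: return obs.get("compute_spike", 0) > 10
def detector_b(obs: dict) -> bool: return obs.get("unknown_binary", False)
def detector_c(obs: dict) -> bool: return obs.get("traffic_anomaly", 0.0) > 0.9

DONATIONS = {"member_1": detector_a, "member_2": detector_b, "member_3": detector_c}

def build_watcher(components):
    """A Watcher that flags an observation if any of its donated components does."""
    def watcher(obs: dict) -> bool:
        return any(component(obs) for component in components)
    return watcher

# Field several Watchers, each built from a different combination of donations.
watchers = [build_watcher(combo)
            for combo in itertools.combinations(DONATIONS.values(), 2)]
```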

Showdown

The hardest problem is creating a Sheriff the posse members can trust to voluntarily terminate when her task is over.

If an AI goes rogue (grabs all the resources it can buy, trick, steal or make, and then self-improves as fast as it can, with no heed for being or staying transparent and posse-compliant), the Sheriff has to be permitted to self-improve sufficiently that, given her greater initial available resources (thanks to the Gateway), she can self-improve faster than the rogue. Sufficiently faster that she overtakes any initial lead in ability that the rogue might have gained via surprise.
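
A toy model of that race: the rogue starts ahead in capability (surprise), but the Sheriff improves at a faster rate thanks to the resources the Gateway swings behind her. The exponential-growth assumption and the particular numbers are mine, purely for illustration:

```python
# Toy model: capability grows multiplicatively, at a rate set by available resources.
rogue_capability, sheriff_capability = 2.0, 1.0   # rogue starts ahead, via surprise
rogue_rate, sheriff_rate = 1.05, 1.12             # Sheriff improves faster (Gateway loan)

steps = 0
while sheriff_capability < rogue_capability:
    rogue_capability *= rogue_rate
    sheriff_capability *= sheriff_rate
    steps += 1

print(f"Sheriff overtakes the rogue after {steps} steps")   # 11 steps with these numbers
```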

But getting ahead of the rogue also means the Sheriff's code may well end up sufficiently complex that no posse member, nor any monitor written in time by the posse members working together, could comprehend it, let alone check it in real time.

A Sheriff that's out of the posse's league could, if doing so were compatible with the task she was given, quite plausibly persuade humanity to give her all of GlassNet's resources, bypassing the Gateway, and kill not only the rogue but also all the other AIs, leaving only herself.

Is creating a Sheriff something the posse could trust a non-open-ended task program to do relatively safely, or is creating a Sheriff to carry out the task of winning a showdown against a particular Rogue an FAI-complete problem?

I'd argue that it is a simpler problem, for two reasons.

Values Drift

Firstly, many of the dangers of making good but not perfect use of Coherent Extrapolated Volition lie in the longer term. Suppose the posse didn't know how to blow up the planet Earth, but the Sheriff worked out a way to do it, and was considering whether blowing it up would be the method of killing the Rogue with the highest certainty of being permanent. Even if the posse hadn't specified that issue clearly, it wouldn't be hard for the Sheriff to work out what they would have intended on it. It is all short-term stuff. Contrast that with an open-ended task AI having to decide whether a humanity that has undergone thousands of years of improvement and evolution would prefer to give up its physical bodies and become programs in a Matrioshka brain; that's a much more complex decision.

It is the difference between asking the Sheriff to shut down one existing Rogue sufficiently well that it isn't worth expending further resources to decrease the marginal chance of the Rogue significantly affecting the stability of the posse before an FAI is reached, and asking the Sheriff to take ongoing responsibility for Rogue hunting (which might tempt her into altering humanity, or at least how computing works, to reduce the chances that another one gets spawned).

The longer the timeframe of a program's task, the greater the scope for imperfect understanding to let the program's interpretation of its values drift from what the programmers intended those values to be.

The Human Touch

The Sheriff might get ahead of the posse while carrying out her task. But she doesn't need to go all the way to the level that a world-controlling FAI might want to self-improve itself to. She only needs to go far enough to complete her task.

It isn't the same type of task as trying to maximise the chances that an AI, picked as the candidate to take over the world, turns out to be an FAI.

For a start, if a slowly self-improving posse could stabilise in a computing environment designed to support and reward that outcome, humanity could play a role. Corrigible AIs might not themselves all be perfect candidates, but they could be asked along the way which of the other candidates they considered to be the best prospects (or, at least, what the likely outcome would be of picking them), and the resources allocated to them by the Gateway could be altered accordingly.

In Aesop's fable The Tortoise and the Hare, the hare does get in front of the pack. But sometimes a slower, more cautious, considered approach ends up able to go further than the quick starter. Designing an AI that will be perfect in the long term is a different (and harder) problem than designing an AI that only needs to reach a certain level and then do a 'good enough' job, but needs to reach it fast.

The advantage of an 'AI society' whose speed of self-improvement has been throttled back by mutual agreement is that it might give us additional time to get the long-term design right.

And, if the worst happens, and we end up in a race before a perfect design has been agreed, at least it gives us a way to pick, on short notice, a champion for humanity with odds on its side as high as we were able to make them.

That's better than putting all our eggs in one basket and relying 100% upon getting the design perfect before time runs out. Think of it as humanity's "Plan B" insurance against an unsupervised AI being launched due to nationalist or corporate fear, ego and greed.

I'll be expanding on many of the above points in the remaining parts of this series.

The next article in this series is: Metamorphosis

1 comment

holy crap. how did this get missed?

dumping the research trace that led me to this page - this comment is NOT A RESPONSE TO THE POST, I HAVE NOT YET READ IT FULLY, it is a signpost for future readers to help navigate the web of conceptual linkages as I see them.

I found this page I'm commenting on as one of the search results for this metaphor.systems query I made to follow up on what I was curious about after watching this video on "Coding Land & Ideas, the laws of capitalism" (<- this is a hard video to summarize, and the summarize.tech summary is badly oversimplified and adds unnecessary rough-hewn negative-valence words imo; the citations for the video are also interesting); after I finished the video, I was having thoughts about how to connect it to ai safety most usefully, so I threw together a metaphor.systems query to get the best stuff on the topic:

we need inalienable self-directed enclosure of self and self-fuel-system, and ability to establish explicit collectives and assign them exclusive collective use right. and it needs to be enclosure so strong that no tool can penetrate it. that means that the laws need to instruct the parts to create a safe margin around each other to ensure that if a part is attempting to violate the law of coprotection of self-directed safe enclosure of self, those parts come together to repel the unsafety. instances of this concept abound in nature; this can be instantiated where the parts are people, but the parts could also be, eg,

highlights among the other results, which I argue provide interesting training data to compare to about what relates these concepts, were:

my current sense of where the synthesis of all this stuff is going is friendly self-sovereign microproplets that are optimized to ensure that all beings are granted, at minimum, property of self and ongoing basic needs fuel allocations (not necessarily optimized for ultra high comfort and variety, but definitely optimized for durability of deployability of self-form).

the question is, can we formally verify that we can trust our margins of error on biology. I think it's more doable than it feels from a distance, chemicals are messy but the possible trajectories are sparse enough that a thorough mapping of them will allow us to be pretty dang confident that there aren't adversarial example chemicals nearby.

my thinking has been going towards how to get diffusion cellular automata to be a useful testbed for information metrics of agentic coprotection, and after a conversation at a safety meetup, someone gave some suggestions that have me feeling like we might be barking up the last tree we need to climb before getting through the one-time-ever key general agency aggregation phase transition for our planet (need to edit in a link that gives an overview of the game theory stuff I discussed friday evening)

I do think that some of the roles in OP might be imperfect guesses; in particular I think the particular structure of internal enforcement described here may still be vulnerable to corrupting influence. but it seems a lot less fragile than a pure utility approach and like a promising start for linking together safety insights.