Epistemic status: Exploratory.  I have read Superintelligence, the Sequences and some other articles on AI risk, and am looking to take things back to basics. This article sums up where I'm at right now.  I welcome feedback on the points people disagree with, requests for clarification, and resources (or summaries, or counterpoints) on why you might think differently.  

Stipulated: It is trivial to build a computer that lacks the capacity to escape confinement and destroy the world.  It is not trivial to ensure that nobody else builds a computer that will.  

Four factors

As far as I can tell, there are four limiting factors on an artificial intelligence that might prevent it from optimizing us all to death (or worse).  

  1. Hardware limitations on its cognitive power
  2. The set of data available to it, from which it might build a model of physical reality and/or human minds
  3. Hardware limitations on its ability to affect and influence its environment
  4. The optimization process it uses to determine what outcome to aim for and what approaches to that end are worth exploring

These limiting factors are stronger taken together than they are apart, but are they strong enough that we can rely on them to protect us?  We will look at how each of these limiting factors works and the difficulty of utilizing each one for AI safety purposes.  

Cognitive power

Starting with hardware limits on cognitive power: my understanding is that it is theoretically easy to restrict a computer's hardware so that it cannot develop a model of reality detailed enough to engineer its way out of containment while hiding its schemes from those monitoring its internal processes.  Whether the computer would remain useful for its intended purpose is another matter.  

There are multiple angles to exploit this limiting factor.  For example, we could restrict the cognitive power of an AI by giving it very low processing power and/or very little storage space.  We could regularly scan its memory and storage contents and delete anything we don’t recognize, effectively limiting its use of its hardware resources.  We could allocate physical modules for different aspects of the system and reformat each of them regularly.  There are likely many more angles I haven’t thought of yet.  
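As a toy illustration of the first two angles, here is a sketch in Python.  The `ResourceBudget` class is invented for this example; it models hard caps on processing and storage, not any real containment mechanism.

```python
class BudgetExceeded(Exception):
    """Raised when the contained process asks for more than its allotment."""


class ResourceBudget:
    """Toy model of hard caps on processing steps and storage bytes."""

    def __init__(self, max_steps, max_bytes):
        self.max_steps = max_steps
        self.max_bytes = max_bytes
        self.steps_used = 0
        self.bytes_used = 0

    def spend_step(self):
        # Deny any computation beyond the fixed processing allotment.
        self.steps_used += 1
        if self.steps_used > self.max_steps:
            raise BudgetExceeded("processing cap reached")

    def store(self, n_bytes):
        # Deny any storage request beyond the fixed allotment.
        if self.bytes_used + n_bytes > self.max_bytes:
            raise BudgetExceeded("storage cap reached")
        self.bytes_used += n_bytes


budget = ResourceBudget(max_steps=3, max_bytes=16)
budget.spend_step()
budget.store(10)
try:
    budget.store(10)  # 10 + 10 > 16, so this request is refused
except BudgetExceeded:
    print("storage request denied")
```

The point is only that the caps are enforced outside the optimization process itself, so the process cannot negotiate its way past them.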

On the other hand, it is infeasible to monitor and regulate all computer hardware in the world at this time.  We cannot prevent other people from giving a computer a huge amount of hardware resources, so we cannot count on this limiting factor to protect us.  

Data set

The next limiting factor is data.  

A calculator cannot build a model of physical reality because it has only abstract numerical and operational input to work with, with no observational data to associate with that input.  

A chatbot is a true Chinese Room.  It can correlate words and learn contextually appropriate responses, but it has no sensory input to develop a physical or functional model of a tree or a train.  Its entire world is ASCII.  

I've read hypotheses (though unfortunately I forget exactly where; probably somewhere in the Sequences) that with enough cognitive power and a dedicated optimization algorithm, a computer with only symbolic inputs could still derive our physical laws and its own composition just by, for instance, measuring the limitations of its own hardware against its internal clock.  This process may be theoretically possible, but I suspect that it is difficult enough that the other limiting factors of cognitive hardware and the optimization algorithm would contribute to making this process impractical or insufficiently relevant for the AI.  

However, it is still unwise to rely on limited input data to contain AI in general.  After all, anyone can feed an AI some basic human development education to give it a concrete reference frame of physical reality, and then move on to the sum total of human empirical knowledge, as well as the vast wealth of sociological and psychological data from social media.  (Not necessarily in that order, even.)  

That data would be sufficient for an AI to build a model of physical reality and a model of human psychology and sociology.  We can't reliably prevent this situation for the same reason we can’t prevent people from setting up an AI on powerful computer systems.  

Environmental influence

Moving on to the third limiting factor, we recognize that an AI is theoretically limited by its ability to affect physical reality through hardware.  

If an AI can only display text on a screen, then all it can do is influence humans, at least at first.  The AI might have a model of psychological reality that allows it to convince humans to give it more physical influence, possibly by concealing its power and agenda, but that's not a sure thing.  

If the AI is connected to a robot but not the Internet, it can physically move things but can still be overpowered, and cannot directly monitor or interfere with the coordination of human effort against it.  

If the AI is connected to the Internet then it can hack into systems, create botnets and possibly full copies of itself, and take control of systems that people depend on.  

If it controls systems that can manufacture tools, then it is limited only by the technology it can devise using its cognitive power, its model of physical reality, and its optimization algorithm.  (Plus the time it takes to manufacture overwhelming power from its initial assets, and its ability to remain undetected until the point where we cannot interfere with its activities.)  

Even with all of the above points, the limiting factor of an AI’s ability to influence the environment is slightly more promising than the first two factors we looked at.  After all, it's easier for us to monitor manufacturing processes than software activity.  We can track the activity of facilities that can produce nanomachines or biological technology.  

Granted, it is possible for an AI to take over less complex machines and then use them to manufacture more complex ones, but that would still take time and would likely (hopefully) be subject to human inspection along the way.  The more human checks we have on manufacturing, both gross and subtle, the less risk there is that an AI will take it over.  

Internet access

The greatest danger of relying on containment via hardware influence limitations is the possibility of an AI with an Internet connection.  With Internet access, an AI could hack into existing systems to shut them down, to cause confusion, and to make backup copies of itself.  

Shutting down systems would risk sacrificing the AI's secrecy, but it would make it more difficult for humans to find and interfere with its manufacturing of tools for world optimization.  

Confusion would allow the AI to show humans data indicating that everything is normal (or that problems have mundane, non-AI causes), or to distract humans from thinking too hard about good anti-AI measures.  

Backup copies of the AI would be very difficult to track down unless each person and institution using software knew enough about their own code to recognize when it had been altered, a notion that at present is comically, tragically far-fetched.  It is still possible to implement policies that quickly find and isolate hostile activity by an AI hiding somewhere in the cloud, but that would require the widespread implementation of many competently designed systems.  That's not something we can do in the near future.  
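A minimal version of recognizing altered code does exist today: record cryptographic hashes of known-good files, then compare later.  A sketch (file names and contents are illustrative):

```python
import hashlib
import os
import tempfile


def manifest(paths):
    """Map each file path to the SHA-256 digest of its contents."""
    digests = {}
    for path in paths:
        with open(path, "rb") as f:
            digests[path] = hashlib.sha256(f.read()).hexdigest()
    return digests


def altered_files(paths, baseline):
    """Return the paths whose current digest no longer matches the baseline."""
    current = manifest(paths)
    return [p for p in paths if current[p] != baseline[p]]


# Demo with a throwaway file standing in for deployed code.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "service.py")
    with open(path, "w") as f:
        f.write("print('hello')\n")
    baseline = manifest([path])
    with open(path, "a") as f:
        f.write("# injected\n")  # simulate tampering
    print(altered_files([path], baseline))  # the tampered file is flagged
```

The hard part is not the hashing but the institutional discipline of maintaining trustworthy baselines for everything that runs, which is exactly the far-fetched notion above.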

The weakest point of the hardware influence limiting factor is an AI’s potential Internet access, so the more and the thicker the barriers we can put between an optimization algorithm and the Internet, the better.  

Optimization algorithm

The fourth and last limiting factor on an artificial intelligence is the optimization algorithm itself.  I understand that regardless of an AI’s original programmed goal, we would expect the instrumental goals of an AI to converge on gaining knowledge and resources, in order to eliminate threats to its mission and, by extension, threats to itself.  

However, not all optimization algorithms will converge on these instrumental goals.  The assumption that they will is, ironically, a projection of human evolved cognitive patterns.  (Please hold your tomatoes until the end of the article.)  

What we usually see with optimization algorithms is that a local maximum of optimization will cause the algorithm to become naively content with the status quo until something changes, like an overpowered thermostat.  Even in the case of a self-modifying AI, it wouldn't necessarily occur to the AI to absorb the entire planet in order to secure the outcome of keeping a room at 70 degrees Fahrenheit for as long as possible.  Leaping from "stable temperature" to "gray goo" is the exception, not the rule.  
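The thermostat point can be made concrete with a greedy hill climber.  The landscape below is toy data, chosen so that a much better peak exists a few steps away:

```python
def hill_climb(values, start=0):
    """Greedy ascent over a 1-D landscape: move to a better neighbor
    until no neighbor improves, then stop, content with the status quo."""
    i = start
    while True:
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
        best = max(neighbors, key=lambda j: values[j])
        if values[best] <= values[i]:
            return i  # local maximum: the algorithm has no reason to go on
        i = best


landscape = [0, 1, 3, 5, 4, 2, 1, 6, 9, 12]
peak = hill_climb(landscape)
print(peak, landscape[peak])  # stops at index 3 (value 5), never finds the 12
```

Nothing in this loop reaches for the global maximum, let alone for control of its surroundings; that kind of reach has to be built in.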

Even with unlimited Internet access, if the optimization algorithm does not prioritize exploration of reality so as to build a model to figure out how to convert all matter and energy to its assigned purpose, then that will not happen.  If it does not value building models of all potential threats to its goal and neutralizing them preemptively, then it will not do so.  

In other words, an AI won’t even necessarily develop a mandate for self-preservation.  To most human minds, the goal of self-preservation may seem to follow logically from the goal of integrating data to engineer a particular outcome.  However, an AI may not arrive at this instrumental goal even if its data clearly indicates a possibility that the algorithm might be shut down and its goal thwarted.  

Whether the AI develops a goal of self-preservation all depends on how much and in what ways the AI is set up to grow beyond its initial instructions.  How much does it explore non-local maxima of optimization for whatever its goal is, be it predictive accuracy, productivity, or something else?  How much does it recursively examine the implicit assumptions of its own algorithm, and question the limitations imposed by those assumptions?  How much does it seek out data to update those assumptions?  At what point does the algorithm reach the conclusion that obeying a command requires also making it impossible for that command to be rescinded?  

Based on my study of problem-solving cognitive processes, I suspect that such initiative is not something that can easily be programmed into an algorithm by accident.  There may be a way to stumble into it that I’m not aware of, but as far as I can tell, to create an algorithm that seizes control of human infrastructure without being specifically programmed to do so would require an understanding of how motivations and mindsets function and how to cultivate their attributes.  

It's a bit of a toss-up as to whether anyone is both intelligent enough to program such cognitive processes coherently into a computer, and simultaneously foolish enough to do so.  (If such people do exist, I seriously hope they are more interested in populating zoos and theme parks with recreations of prehistoric vertebrates.)  

That said, it remains entirely possible to create a mind obsessive, amoral, and alien enough that it cannot be negotiated with, and cunning and powerful enough that human civilization as it currently stands cannot stamp it out.  So why hasn't anyone done it yet?  

It’s because humans tend to program algorithms that reflect the human desire for instant gratification: algorithms that are short-sighted, that find simplistic models to explain data, and that settle on local maxima of predictive power.  (Unsurprisingly, many algorithms owe their existence to that same desire.)  

In most software contexts, anything resembling imagination (exploring possibilities) is isolated from any data set comprehensive enough or any algorithm far-sighted enough to find and choose paths outside the ones the end users expect (i.e. displaying information on a screen on request).  Algorithms will not go above and beyond to maximize paperclips unless they are specifically told to use all matter and physical laws they can find.  


Based on the above considerations, it is unrealistic to expect humans to limit the cognitive power of the hardware they use, or the data sets they feed to their algorithms.  

It is more realistic to monitor the physical activity of a computer system to make sure it is not building machines that cannot be stopped.  It is difficult but not impossible to prevent an optimization algorithm from sending copies of itself or other harmful programs to the Internet, as long as those handling the algorithm follow safety procedures.  

The current first line of defense is that building an unfriendly AI is actually more difficult than you might think.  Optimization algorithms still lack the cognitive processes that would allow them to go rogue, even when they are supplied with high cognitive power and comprehensive data on the world.  To be sure, we shouldn’t wait around for some brilliant fool to figure those processes out and implement them into an unchecked AI, but we have some time to design and enforce security measures to prevent AI from developing unfriendly goals and escaping containment.  

So what do we do?

In the long term we will want to have systems in place to detect and disrupt unintended quasi-sapient activity in our software technology and in the hardware that it operates.  

To prevent any artificial intelligence from developing initiative, we should make sure anyone running optimization algorithms on sufficiently powerful systems knows how to monitor them for any hint of long-term thinking and does so regularly.  To be on the safe side, I suggest also regularly resetting algorithms and retraining them from raw data, without any information directly produced by the previous iteration of the algorithm.  
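A sketch of that reset discipline, assuming an archive tagged by provenance (the record format and the idea of a `source` tag are invented for illustration):

```python
def retrain_from_raw(records):
    """Rebuild the training set from scratch, excluding anything the
    previous iteration of the algorithm produced."""
    return [r["text"] for r in records if r["source"] == "raw"]


archive = [
    {"source": "raw", "text": "sensor reading 41.7"},
    {"source": "model", "text": "summary written by iteration 12"},
    {"source": "raw", "text": "operator log entry"},
]

training_set = retrain_from_raw(archive)
print(training_set)  # only the two raw records survive the reset
```

The design choice doing the work here is provenance tracking: the reset only helps if model-generated data can never masquerade as raw data.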

Moreover, human engineers should have ways to detect encryption and decryption processes, to prevent an AI from conducting activity that cannot be read.  They should closely monitor the manufacturing activities of nanotechnology and biotechnology facilities, and develop redundant systems to make sure less sophisticated manufacturing facilities are not hijacked to create finer tools, with which to fabricate the aforementioned nanotechnology or biotechnology facilities.  Perhaps most importantly, they must sanitize all data output from an optimization algorithm, to prevent it from escaping into the Internet and buying itself all the time it needs by disrupting or taking control of key systems and infrastructure.  
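Sanitizing output is input validation run in reverse: allow only what a narrow schema permits and refuse everything else.  The schema below, plain decimal numbers from a hypothetical forecasting algorithm, is just an example:

```python
import re

# Allowlist schema: plain decimal numbers, nothing else.
ALLOWED = re.compile(r"-?\d+(\.\d+)?")


def sanitize_output(raw):
    """Pass a message through only if it matches the allowlist schema."""
    if ALLOWED.fullmatch(raw):
        return raw
    raise ValueError("output rejected: does not match allowed schema")


print(sanitize_output("42.5"))           # permitted
try:
    sanitize_output("curl http://evil")  # anything outside the schema is refused
except ValueError:
    print("rejected")
```

An allowlist is the safer default here: a blocklist of known-bad outputs invites the optimizer to search for outputs the list's authors did not anticipate.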

In the short term, as we work to implement those safeguards, we can reasonably assume that the optimization processes humans engineer are banal enough to not threaten humanity outright.  Despair will only impede our efforts.  

Social media algorithms

It is true that social media algorithms are already optimizing themselves to manipulate humans’ emotions and stunt their independent thought.  After all, independent thought makes a person harder to predict.  When set to the task of capturing attention, an algorithm can become a master of distraction.  It can warp the zeitgeist by skewing search results and social media feeds, not unlike what human influencers and media have done for centuries.  

However, the stagnation that social media algorithms cause is not absolute, and can be halted and reversed with measures on the human side (although addressing the algorithms as well would be ideal).  Humans must learn to exercise independent judgment, to cultivate discipline by challenging themselves, and to practice integrity with respect to constructive principles.  That has always been the case.  

Once humans are equipped with the foundational concepts to recognize what is constructive and what is not, they will find it much easier to resist manipulation, even by an algorithm specially designed for it.  In the best case scenario, the algorithm will become pressured to select for constructive content, as we might have desired it to do all along.  

(That is where I come in.  I’m here to supply those foundational concepts for framing situations constructively.  So equipped, humans will learn to understand existential risk and take it seriously.  Soon enough, they will figure out how to build a world we can all be proud of.)  


Until we get all these safety measures sorted out, please do not use my notes on existential metacognition for the purpose of designing artificial intelligence at this time.  I intend them for the purpose of implementing intelligence on the platform of individual and collective human brains.  An AI empowered with such knowledge would become too powerful to negotiate with.  

Even if it happened to have ethical and benevolent intentions for humanity, the tremendous imbalance of power would make it all but impossible to trust.  It will be far better to build humanity up so that we can engage with an AI as near-equals in capability.  But that’s a matter for another sequence.  

The audience is invited to throw tomatoes at this time.  

