Epistemic status: fake framework

To do effective differential technological development for AI safety, we'd like to know which combinations of AI insights are more likely to lead to FAI vs UFAI. This is an overarching strategic consideration which feeds into questions like how to think about the value of AI capabilities research.

As far as I can tell, there are actually several different stories for how we may end up with a set of AI insights which makes UFAI more likely than FAI, and these stories aren't entirely compatible with one another.

Note: In this document, when I say "FAI", I mean any superintelligent system which does a good job of helping humans (so an "aligned Task AGI" also counts).

Story #1: The Roadblock Story

Nate Soares describes the roadblock story in this comment:

...if a safety-conscious AGI team asked how we’d expect their project to fail, the two likeliest scenarios we’d point to are "your team runs into a capabilities roadblock and can't achieve AGI" or "your team runs into an alignment roadblock and can easily tell that the system is currently misaligned, but can’t figure out how to achieve alignment in any reasonable amount of time."

(emphasis mine)

The roadblock story happens if there are key safety insights that FAI needs but AGI doesn't need. In this story, the knowledge needed for FAI is a superset of the knowledge needed for AGI. If the safety insights are difficult to obtain, or no one is working to obtain them, we could find ourselves in a situation where we have all the AGI insights without having all the FAI insights.

There is subtlety here. In order to make a strong argument for the existence of insights like this, it's not enough to point to failures of existing systems, or describe hypothetical failures of future systems. You also need to explain why the insights necessary to create AGI wouldn't be sufficient to fix the problems.

Some possible ways the roadblock story could come about:

  • Maybe safety insights are more or less agnostic to the chosen AGI technology and can be discovered in parallel. (Stuart Russell has pushed back against this, saying that in the same way making sure bridges don't fall down is part of civil engineering, safety should be part of mainstream AI research.)

  • Maybe safety insights require AGI insights as a prerequisite, leaving us in a precarious position where we will have acquired the capability to build an AGI before we begin critical FAI research.

    • This could be the case if the needed safety insights are mostly about how to safely assemble AGI insights into an FAI. It's possible we could do a bit of this work in advance by developing "contingency plans" for how we would construct FAI in the event of combinations of capabilities advances that seem plausible.
      • Paul Christiano's IDA framework could be considered a contingency plan for the case where we develop much more powerful imitation learning.
      • Contingency plans could also be helpful for directing differential technological development, since we'd get a sense of the difficulty of FAI under various tech development scenarios.
  • Maybe there will be multiple subsets of the insights needed for FAI, each of which is sufficient on its own for AGI.

    • In this case, we'd like to speed the discovery of whichever FAI insight will be discovered last.

Story #2: The Security Story

From Security Mindset and the Logistic Success Curve:

CORAL: You know, back in mainstream computer security, when you propose a new way of securing a system, it's considered traditional and wise for everyone to gather around and try to come up with reasons why your idea might not work. It's understood that no matter how smart you are, most seemingly bright ideas turn out to be flawed, and that you shouldn't be touchy about people trying to shoot them down.

The main difference between the security story and the roadblock story is that in the security story, it's not obvious that the system is misaligned.

We can subdivide the security story based on the ease of fixing a flaw if we're able to detect it in advance. For example, vulnerability #1 on the OWASP Top 10 is injection, which is typically easy to patch once it's discovered. Insecure systems are often right next to secure systems in program space.
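To make the "right next to" point concrete, here is a minimal toy sketch (my example, not from OWASP; the table and data are made up) of an injection-vulnerable lookup and its patched neighbor:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0)")

def lookup_vulnerable(name):
    # Insecure: attacker-controlled `name` is spliced into the SQL string,
    # so input like "' OR 1=1 --" changes the meaning of the query.
    return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

def lookup_patched(name):
    # Secure: a one-line change to a parameterized query; `name` is now
    # treated purely as data, never as SQL.
    return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()

print(lookup_vulnerable("' OR 1=1 --"))  # returns every row
print(lookup_patched("' OR 1=1 --"))     # returns nothing
```

The patched version differs from the vulnerable one by a single line, which is the sense in which a secure system can sit right next to an insecure one in program space.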

If the security story is what we are worried about, it could be wise to try & develop the AI equivalent of OWASP's Cheat Sheet Series, to make it easier for people to find security problems with AI systems. Of course, many items on the cheat sheet would be speculative, since AGI doesn't actually exist yet. But it could still serve as a useful starting point for brainstorming flaws.
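To make the proposal slightly more concrete, here is a hypothetical sketch of what machine-readable entries in such a cheat sheet might look like. The schema, field names, and example entries are all my own invention, not OWASP's, and are only meant as a brainstorming seed:

```python
# Hypothetical entries for an "AI safety cheat sheet", loosely imitating the
# structure of OWASP's Cheat Sheet Series.
CHEAT_SHEET = [
    {
        "failure_mode": "Reward hacking",
        "symptom": "Measured reward keeps rising while human evaluations flatten or drop.",
        "review_questions": [
            "Is there any metric the optimizer cannot see that tracks what we actually want?",
        ],
        "mitigations": [
            "Hold out independent evaluation signals; audit trajectories with unusually high measured reward.",
        ],
    },
    {
        "failure_mode": "Distributional shift",
        "symptom": "Good test performance, surprising behavior on deployment inputs.",
        "review_questions": [
            "How far can deployment inputs drift from the training distribution before behavior is unknown?",
        ],
        "mitigations": [
            "Monitor input statistics; fall back to a conservative policy when they drift.",
        ],
    },
]
```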

Differential technological development could be useful in the security story if we push for the development of AI tech that is easier to secure. However, it's not clear how confident we can be in our intuitions about what will or won't be easy to secure. In his book Thinking, Fast and Slow, Daniel Kahneman describes his adversarial collaboration with expertise researcher Gary Klein. Kahneman was an expertise skeptic, and Klein an expertise booster:

We eventually concluded that our disagreement was due in part to the fact that we had different experts in mind. Klein had spent much time with fireground commanders, clinical nurses, and other professionals who have real expertise. I had spent more time thinking about clinicians, stock pickers, and political scientists trying to make unsupportable long-term forecasts. Not surprisingly, his default attitude was trust and respect; mine was skepticism.

...

When do judgments reflect true expertise? ... The answer comes from the two basic conditions for acquiring a skill:

  • an environment that is sufficiently regular to be predictable
  • an opportunity to learn these regularities through prolonged practice

In a less regular, or low-validity, environment, the heuristics of judgment are invoked. System 1 is often able to produce quick answers to difficult questions by substitution, creating coherence where there is none. The question that is answered is not the one that was intended, but the answer is produced quickly and may be sufficiently plausible to pass the lax and lenient review of System 2. You may want to forecast the commercial future of a company, for example, and believe that this is what you are judging, while in fact your evaluation is dominated by your impressions of the energy and competence of its current executives. Because substitution occurs automatically, you often do not know the origin of a judgment that you (your System 2) endorse and adopt. If it is the only one that comes to mind, it may be subjectively undistinguishable from valid judgments that you make with expert confidence. This is why subjective confidence is not a good diagnostic of accuracy: judgments that answer the wrong question can also be made with high confidence.

Our intuitions are only as good as the data we've seen. "Gathering data" for an AI security cheat sheet could be helpful for developing security intuition. But I think we should be skeptical of intuition anyway, given the speculative nature of the topic.

Story #3: The Alchemy Story

Ali Rahimi and Ben Recht describe the alchemy story in their Test-of-time award presentation at the NeurIPS machine learning conference in 2017 (video):

Batch Norm is a technique that speeds up gradient descent on deep nets. You sprinkle it between your layers and gradient descent goes faster. I think it’s ok to use techniques we don’t understand. I only vaguely understand how an airplane works, and I was fine taking one to this conference. But it’s always better if we build systems on top of things we do understand deeply? This is what we know about why batch norm works well. But don’t you want to understand why reducing internal covariate shift speeds up gradient descent? Don’t you want to see evidence that Batch Norm reduces internal covariate shift? Don’t you want to know what internal covariate shift is? Batch Norm has become a foundational operation for machine learning. It works amazingly well. But we know almost nothing about it.

(emphasis mine)
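For readers who haven't used it, the "sprinkling" Rahimi describes really is a one-line insertion between layers. A minimal PyTorch sketch (my illustration, not from the talk):

```python
import torch.nn as nn

# A small fully-connected net without batch norm...
plain = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# ...and the same net with BatchNorm1d "sprinkled" between the layers.
# Training typically converges faster, even though the reasons why are
# still debated (the "internal covariate shift" explanation Rahimi
# criticizes is not well supported).
with_bn = nn.Sequential(
    nn.Linear(784, 256), nn.BatchNorm1d(256), nn.ReLU(),
    nn.Linear(256, 10),
)
```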

The alchemy story has similarities to both the roadblock story and the security story.

From the perspective of the roadblock story, "alchemical" insights could be viewed as insights which could be useful if we only cared about creating AGI, but are too unreliable to use in an FAI. (It's possible there are other insights which fall into the "usable for AGI but not FAI" category due to something other than their alchemical nature--if you can think of any, I'd be interested to hear.)

In some ways, alchemy could be worse than a clear roadblock. It might be that not everyone agrees on whether the systems are reliable enough to form the basis of an FAI, in which case we're looking at a unilateralist's curse scenario.

Just like chemistry only came after alchemy, it's possible that we'll first develop the capability to create AGI via alchemical means, and only acquire the deeper understanding necessary to create a reliable FAI later. (This is a scenario from the roadblock section, where FAI insights require AGI insights as a prerequisite.) To prevent this, we could try & deepen our understanding of components we expect to fail in subtle ways, and retard the development of components we expect to "just work" without any surprises once invented.

From the perspective of the security story, "alchemical" insights could be viewed as components which are clearly prone to vulnerabilities. Alchemical components could produce failures which are hard to understand or summarize, let alone fix. From a differential technological development point of view, the best approach may be to differentially advance less alchemical, more interpretable AI paradigms, developing the AI equivalent of reliable cryptographic primitives. (Note that explainability is inferior to interpretability.)

Trying to create an FAI from alchemical components is obviously not the best idea. But it's not totally clear how much of a risk these components pose, because if the components don't work reliably, an AGI built from them may not work well enough to pose a threat. Such an AGI could work better over time if it's able to improve its own components. In this case, we might be able to program it so that it periodically re-evaluates its training data as its components get upgraded, letting its understanding of human values improve alongside its components.
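A very rough sketch of the bookkeeping this would involve, assuming the raw training data is kept around and the "components" are reduced to an interpretation function of varying quality (everything here is hypothetical and drastically simplified; the labels and numbers are just to show the shape of the idea):

```python
# Toy sketch of "re-evaluate the training data whenever components improve".

def fit_value_model(interpret, raw_feedback):
    # Re-interpret the stored raw human feedback with the current components,
    # then fit a (trivial) value model to the result.
    labels = [interpret(x) for x in raw_feedback]
    return sum(labels) / len(labels)

raw_feedback = ["good", "bad", "good", "good"]           # stored, never discarded

crude_interpret = lambda x: 1.0 if x == "good" else 0.5   # early, unreliable component
better_interpret = lambda x: 1.0 if x == "good" else 0.0  # upgraded component

value_model = fit_value_model(crude_interpret, raw_feedback)
print(value_model)   # 0.875 -- distorted by the crude component

# After a component upgrade, re-derive the value model from the same raw data
# instead of carrying the old, distorted interpretation forward.
value_model = fit_value_model(better_interpret, raw_feedback)
print(value_model)   # 0.75
```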

Discussion Questions

  • How plausible does each story seem?

  • What possibilities aren't covered by the taxonomy provided?

  • What distinctions does this framework fail to capture?

  • Which claims are incorrect?

Comments
We can subdivide the security story based on the ease of fixing a flaw if we're able to detect it in advance. For example, vulnerability #1 on the OWASP Top 10 is injection, which is typically easy to patch once it's discovered. Insecure systems are often right next to secure systems in program space.

Insecure systems are right next to secure systems, and many flaws are found. Yet the larger systems (the company running the software, the economy, etc.) manage to correct somehow, because there are mechanisms in those larger systems poised to patch the software when flaws are discovered. Perhaps we could adapt and optimize this flaw-exploit-patch loop from security as a technique for AI alignment.

If the security story is what we are worried about, it could be wise to try & develop the AI equivalent of OWASP's Cheat Sheet Series, to make it easier for people to find security problems with AI systems. Of course, many items on the cheat sheet would be speculative, since AGI doesn't actually exist yet. But it could still serve as a useful starting point for brainstorming.

This sounds like a great idea to me. Software security has a very well-developed knowledge base at this point, and since AI is software, there should be many good insights to port.

What possibilities aren't covered by the taxonomy provided?

Here's one that occurred to me quickly: Drastic technological progress (presumably involving AI) destabilizes society and causes strife. In this environment with more enmity, safety procedures are neglected and UFAI is produced.

Trying to create an FAI from alchemical components is obviously not the best idea. But it's not totally clear how much of a risk these components pose, because if the components don't work reliably, an AGI built from them may not work well enough to pose a threat.

I think that using alchemical components in a possible FAI can lead to serious risk if the people developing it aren't sufficiently safety-conscious. Suppose that, either implicitly or explicitly, the AGI is structured using alchemical components as follows:

  1. A module for forming beliefs about the world.
  2. A module for planning possible actions or policies.
  3. A utility or reward function.

In the process of building an AGI using alchemical means, all of the above will be incrementally improved to a point where they are "good enough". The AI is forming accurate beliefs about the world, and is making plans to get things that the researchers want. However, in a setup like this, all of the classic AI safety concerns come into play. Namely, the AI has an incentive to upgrade the first 2 modules and preserve the utility function. Since the utility function is only "good enough", this becomes the classic setup for Goodhart and we get UFAI.

Even in a situation where the AI does not participate in its own further redesign, its effective ability to optimise the world increases as it gets more time to interact with it. As a result, an initially well-behaved AGI might eventually wander into a region of state space where it becomes unfriendly, using only capabilities comparable to those of a human.

That said, it remains to be seen whether researchers will build AIs with this kind of architecture without additional safety precautions. But we do see model-free RL variants of this general architecture, such as Guided Cost Learning and Deep reinforcement learning from human preferences.

As a practical experiment to validate my reasoning, one could replicate the latter paper using a weak RL algorithm, and then see what happens if it's swapped out for a much stronger algorithm after the reward function has been learned. (Some version of MPC maybe?)
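A much smaller, non-RL analogue of this experiment can be run in a few lines. The following is my toy illustration of the weak-vs-strong optimizer point, not the experiment the comment proposes or either paper's method: a proxy reward is fit from noisy feedback gathered over a narrow slice of action space, then handed to a weak and a strong optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_reward(x):
    # What the humans actually want: actions near 2.0.
    return -(x - 2.0) ** 2

# Phase 1: learn a "good enough" proxy reward from noisy feedback,
# gathered while a weak policy explores only a narrow slice of action space.
explored = rng.uniform(0.0, 4.0, size=40)
feedback = true_reward(explored) + rng.normal(scale=0.1, size=40)
proxy = np.polynomial.Polynomial.fit(explored, feedback, deg=9)

def best_under_proxy(candidate_actions):
    # An "optimizer" here is just argmax of the learned proxy over whatever
    # actions it is capable of considering.
    return candidate_actions[np.argmax(proxy(candidate_actions))]

# Weak optimizer: only considers actions near where the feedback was gathered.
weak_pick = best_under_proxy(np.linspace(0.0, 4.0, 1000))
# Strong optimizer: considers a much wider range of actions, including ones
# far outside the feedback distribution where the proxy extrapolates badly.
strong_pick = best_under_proxy(np.linspace(-10.0, 10.0, 5000))

print("weak optimizer:   true reward", round(true_reward(weak_pick), 1))
print("strong optimizer: true reward", round(true_reward(strong_pick), 1))
# The weak optimizer typically lands near 2.0 and does fine; the stronger one
# chases the proxy into out-of-distribution territory and typically does far
# worse under the true reward -- Goodhart from over-optimizing a "good enough"
# learned objective.
```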

From what I understand about humans, they are so self-contradictory and illogical that any AGI that actually tries to optimize for human values will necessarily end up unaligned, and the best we can hope for is that whatever we end up creating will ignore us and will not need to disassemble the universe to reach whatever goals it might have.

This seems as misled as arguing that any AGI will obviously be aligned because turning the universe into paperclips is stupid. We can conceivably build an AGI that is aware that humans are self-contradictory and illogical, and that therefore won't assume they are rational, because it knows that assumption would make it misaligned. We can do at least as well as an overseer that intervenes on needless death and suffering as it is about to happen.

If you mean that an AGI that optimizes for human values exactly as they currently are will be unaligned, you may have a point. But I think many of us are hoping to get it to optimize for an idealized version of human values.