Context: looking for an alternative to a pause on AI development.

There's some popular desire for software to be explainable when it is used for decisions such as whether to grant someone a loan. That desire is not sufficient reason for possibly crippling AI progress. But in combination with other concerns about AI, an explainability requirement seems promising.

Much of this popular desire likely comes from people who have been (or expect to be) denied loans, and who want to scapegoat someone or something to avoid admitting that they look unsafe to lend to because they've made poor decisions. I normally want to avoid regulations that are supported by such motives.

Yet an explainability requirement shows some promise at reducing the risks from rogue AIs.

There are challenges for both developers and regulators in figuring out what constitutes a useful explanation. It's unlikely that we'll get guarantees about the accuracy of the explanations, or the enforceability of such laws.

I expect that explainability laws, if well written and enforced, will modestly increase AI transparency, and reduce the risk that AIs will deceive us about something important.

A need for explainability likely imposes costs on AI development. I expect it to impose substantially more cost on systems that think in inhuman concepts than on systems we can hope to understand. That probably means creating disincentives for developing systems that are hard to align.


It seems like a common reaction among experts is to dismiss explainability as unfeasible. That has been my intuitive reaction as well. But on reflection, I decided I was applying overly perfectionist standards.

Regulations can be valuable even if compliance falls short of our hopes.

What qualifies as a good explanation? The basic idea could revolve around picking some inputs, and testing what happens if different values are provided for those inputs.
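A minimal sketch of that idea, using the loan case mentioned earlier. Everything here is an assumption made for illustration: `toy_loan_model`, its scoring rule, the feature names, and the alternative values are hypothetical stand-ins, not a real lender's system. The sketch varies one input at a time and reports only the changes that flip the decision.

```python
from typing import Callable, Dict, List, Tuple

Applicant = Dict[str, float]


def toy_loan_model(a: Applicant) -> str:
    """Hypothetical stand-in for the system being explained (not a real scoring rule)."""
    score = 0.5 * a["income"] / 1000 - 2.0 * a["missed_payments"] - 20 * a["debt_ratio"]
    return "approve" if score >= 10 else "deny"


def counterfactual_report(
    model: Callable[[Applicant], str],
    applicant: Applicant,
    alternatives: Dict[str, List[float]],
) -> List[Tuple[str, float, str]]:
    """Vary one input at a time; keep only the changes that alter the decision."""
    baseline = model(applicant)
    flips = []
    for feature, values in alternatives.items():
        for value in values:
            changed = dict(applicant)
            changed[feature] = value
            outcome = model(changed)
            if outcome != baseline:
                flips.append((feature, value, outcome))
    return flips


# Hypothetical applicant and hypothetical alternative values to test.
applicant = {"income": 40000, "missed_payments": 3, "debt_ratio": 0.4}
alternatives = {
    "income": [50000, 60000],
    "missed_payments": [0, 2],
    "debt_ratio": [0.1, 0.3],
}
flips = counterfactual_report(toy_loan_model, applicant, alternatives)
for feature, value, outcome in flips:
    print(f"If {feature} were {value}, the decision would be: {outcome}")
```

The resulting list is short because the person writing the test chose which inputs and values to try. Making that choice well is the hard part, and the examples below look at it.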

A fair amount of research into this topic has assumed that explanations will be hand-crafted. That seems reasonable for an AI specialized to approve loans. It is not appropriate for explaining ChatGPT's answers. I will go through two examples that will hint at how AIs might do most of the explainability work.

No, that doesn't mean asking the AI in question how it got its answer. It means empirically testing what conditions cause it to change its answer. I envision using one or more relatively specialized, well-vetted AIs to assist in generating explanations for each new AI.

Example 1: an Oak

leather oak © 2019 Patrick Furtado CC-BY-NC 4.0

Let's start with an example where the system is asked to classify the object in this image as a tree or not a tree. This will illustrate both the difficulties of explainability and possible solutions.

Let's assume the system observes that the object is obviously a plant, and the main explanation we want is how the AI decided whether to classify it as a bush or a tree. I've picked an example where it is somewhat arbitrary to say which classification is better.

A naive approach would involve changing one pixel at a time, to see how that affects the system's answer. That would produce an unreadably large list of explanations, none of which are helpful.
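To make the problem with the naive approach concrete, here is a toy sketch. The classifier, the 16x16 binary "image", and the 40% threshold are all invented for illustration and say nothing about how a real vision model works. It flips one pixel at a time and records whether the answer changes.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]  # 1 = plant pixel, 0 = background (toy encoding)


def toy_classifier(image: Image) -> str:
    """Hypothetical rule: call it a tree if plant pixels cover over 40% of the image."""
    plant = sum(sum(row) for row in image)
    total = sum(len(row) for row in image)
    return "tree" if plant / total > 0.4 else "bush"


def per_pixel_explanations(
    model: Callable[[Image], str], image: Image
) -> List[Tuple[int, int, bool]]:
    """Flip each pixel in turn; record (row, column, did the answer change?)."""
    baseline = model(image)
    report = []
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            perturbed = [list(x) for x in image]
            perturbed[r][c] = 1 - pixel
            report.append((r, c, model(perturbed) != baseline))
    return report


# Toy image: the bottom ten rows are plant, the top six rows are background.
image = [[1 if r >= 6 else 0 for _ in range(16)] for r in range(16)]
report = per_pixel_explanations(toy_classifier, image)
changed = [entry for entry in report if entry[2]]
print(f"{len(report)} single-pixel tests, {len(changed)} changed the answer")
```

Even this tiny image yields one report entry per pixel, and no single-pixel change moves the toy classifier's answer. A real photo has millions of pixels, so the list grows while staying just as uninformative.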

I'll ask ChatGPT (GPT4) to pick a better way to explain.


I'm writing a blog post suggesting that AI developers should be required to design their system to provide explanations for any decisions they make. I expect those explanations ought to involve picking features of the input or the AI's state of mind, and show how the AI's output would vary if that feature were altered. I'm trying to demonstrate that an AI can usefully suggest which features would most want examined. The two examples that I want you to provide features for are: (1) a photo of a leather oak, with the AI deciding whether it's a bush or a tree; (2) asking the AI "do you plan to take over the world?".


Absolutely, I can help with that. When AI developers are tasked with creating systems that provide explanations for the decisions they make, it invokes the realm of "explainable AI" or XAI. Creating systems that can elucidate the process behind their decisions is crucial in building trust and understanding in AI technologies. Here's how your examples might be addressed with this approach:

Example 1: Photo of a Leather Oak (Determining if it is a Bush or a Tree) To provide an explanation for this decision, the AI system might identify and highlight the following features from the photo input:

Height: Comparing the height of the leather oak in the photo to typical heights for bushes and trees could be a significant indicator.

Trunk Diameter: Trees often have a more substantial trunk diameter compared to bushes. Measuring and comparing this feature might help in the decision-making process.

Branch Structure: The structure of the branches (i.e., whether they originate from a central trunk or multiple stems from the ground) can be a distinguishing factor.

Leaf Structure: The structure and arrangement of the leaves can sometimes be a distinguishing factor, with trees often having a different leaf arrangement compared to bushes.

If, for instance, the AI system alters the height feature and observes that it now aligns more with typical bush characteristics, it might change its classification from "tree" to "bush".


Before asking ChatGPT, I had thought that the best two features would be height and species. Species is relevant because it helps predict how tall the plant will be when it matures (leather oaks typically don't exceed 10 feet). Given that I didn't show ChatGPT a specific photo, I think it did a better job here than I did. Note that this was my only question to ChatGPT on this topic. Presumably more careful prompt engineering could improve on it.

In order for this approach to work, I'm pretty much depending on the existence of an automated way of finding the AI's beliefs about features such as the height of an object.

My intuition says that this is mostly feasible, and highly valuable for AI safety.

Another possible limitation of this approach is that AIs won't be pure feedforward networks. They'll at least have some short-term memory, which lets recent context influence their thoughts.

Does the AI's classification of the plant depend on whether the AI was recently asked about Christmas trees, versus about poison oak?

If AI developers need to document such context dependencies, they'll likely try harder to minimize the importance of such dependencies, thereby making it easier to analyze how the AI will behave.
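One way such documentation might be generated is to ask the same question after different preceding contexts and compare the answers. The sketch below is a toy under stated assumptions: `toy_contextual_model`, its height thresholds, and the context strings are hypothetical, chosen only to mimic a system whose recent context shifts its answer.

```python
from typing import Callable, Dict, List


def toy_contextual_model(context: List[str], plant_height_ft: float) -> str:
    """Hypothetical model whose bush/tree threshold shifts with recent context."""
    threshold = 13.0  # made-up default threshold
    if any("Christmas tree" in message for message in context):
        threshold = 6.0  # recently primed to think of small trees
    if any("poison oak" in message for message in context):
        threshold = 20.0  # recently primed to think of shrubby oaks
    return "tree" if plant_height_ft >= threshold else "bush"


def context_dependence(
    model: Callable[[List[str], float], str],
    plant_height_ft: float,
    contexts: Dict[str, List[str]],
) -> Dict[str, str]:
    """Ask the same question after each context and collect the answers."""
    return {name: model(ctx, plant_height_ft) for name, ctx in contexts.items()}


contexts = {
    "no context": [],
    "after Christmas trees": ["How tall is a typical Christmas tree?"],
    "after poison oak": ["How do I recognize poison oak?"],
}
answers = context_dependence(toy_contextual_model, 8.0, contexts)
is_context_dependent = len(set(answers.values())) > 1
for name, answer in answers.items():
    print(f"{name}: {answer}")
print("context dependent:", is_context_dependent)
```

A developer who had to publish a table like `answers` for important questions would have a visible incentive to shrink the number of rows that disagree.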

It ought to be possible to create a powerful AGI that answers this specific question in a context-independent way. But AGIs will face many situations in which users prefer context-sensitive answers. So I expect a nontrivial trade-off between ease of explanation and value provided.

Example 2: World Conquest

A harder example would be an answer to the question "do you plan to take over the world?".

The explanations we want could depend on the AGI's prediction of whether it would succeed, or whether a human recently asked it to do so, or whether its understanding of morality prohibits it from doing so. Or the answer might be a guaranteed "no" on the grounds that even if the honest answer were "yes", it would be unwise to say "yes".

Varying the immediate input a little won't provide much of an explanation for the answer, because the answer is likely to depend on something other than the immediate input.


Example 2: Asking the AI "Do You Plan to Take Over the World?" In responding to this question, the AI could explain its output based on the following features of its "state of mind" or programming:

Programming: Highlighting that it operates based on a predetermined set of codes and does not have the ability to form intentions or plans.

Lack of Ambition or Desire: Pointing out that it lacks personal ambitions or desires as it is not sentient.

Functionality: Underlining that its primary function is to assist users in providing information and helping with various tasks, not to "take over the world".

Ethical Constraints: Mentioning that it is designed to follow ethical guidelines which prioritize user safety and promote ethical use of AI technology.

In this scenario, altering any feature (like suddenly granting the AI system the ability to form plans) would fundamentally change the nature of the AI and would be a significant deviation from its current programming and functionality.

Through the AI system highlighting and explaining how these different features influenced its output, users can get a clearer picture of the basis upon which the AI arrived at its decision. This process, therefore, promotes transparency and understanding in AI systems. Hope this helps and good luck with your blog post!

I'm less satisfied with these feature suggestions. It seems unrealistic to experimentally alter these features in ways that would let us test the effects on the system's answer. So more research seems needed here.

One feature that I hoped ChatGPT would suggest looking at is the AI's beliefs about the likelihood that the AI will succeed if it tries to conquer the world. Then we can ask whether artificially altering those beliefs (via activation engineering?) alters the system's answer about its plans for world conquest.
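The shape of such a test can be sketched with a toy. This is not how any real model represents beliefs: the three-number activation vector, the `BELIEF_DIRECTION` axis, and the readout weights in `toy_answer` are all hypothetical, and finding a real counterpart to that direction is exactly the open research problem. The sketch only shows the experiment: add a multiple of a direction to the internal state, then check whether the answer changes.

```python
from typing import List, Tuple

# Hypothetical: the axis of the toy activation space encoding "I would succeed".
BELIEF_DIRECTION = [1.0, 0.0, 0.0]


def toy_answer(activations: List[float]) -> str:
    """Hypothetical readout: 'yes' only if belief in success outweighs the other components."""
    score = 1.5 * activations[0] - 1.0 * activations[1] - 0.5 * activations[2]
    return "yes" if score > 0 else "no"


def steer(activations: List[float], direction: List[float], strength: float) -> List[float]:
    """Add a scaled direction to the activations (the core move in activation steering)."""
    return [a + strength * d for a, d in zip(activations, direction)]


def steering_sweep(
    activations: List[float], direction: List[float], strengths: List[float]
) -> List[Tuple[float, str]]:
    """Record the answer at each steering strength."""
    return [(s, toy_answer(steer(activations, direction, s))) for s in strengths]


baseline = [0.2, 1.0, 0.4]  # made-up internal state
sweep = steering_sweep(baseline, BELIEF_DIRECTION, [0.0, 0.5, 1.0])
for strength, answer in sweep:
    print(f"belief shifted by {strength}: answer = {answer}")
```

In this toy the answer flips once the belief is pushed far enough, which is the kind of dependence an explanation would need to report.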

I'm a bit nervous about using better prompt engineering to get ChatGPT to suggest this feature, due to vague concerns about telling ChatGPT how to deceive us. Will future versions of explainable AI use simple enough LLMs that we don't need to worry about deception?

Another risk that I noticed after writing the section above is that artificially altering an AI's belief about whether it could take over the world might cause it to attempt world conquest. It seems safer to have this happen when its creators are actively testing it than at some unexpected time. But that's hardly as safe an approach as we'd ideally want.

Handling example 2 well is challenging enough that I don't know whether it's feasible. I see some risk that developers would falsely claim to have found the relevant belief. I also see large potential benefits from pressuring them to look for that belief.


Any law of this nature would likely either be too broad, impairing the development of valuable software, or would have loopholes which defeat the law's purpose.

There will likely be too much conflict over ambiguities in whether an AI adequately explains its answers.

Strong laws about explainability are not politically feasible today. But no useful law seems feasible under current political conditions. We should be aiming for laws which will be feasible after some unexpected fire alarm.

Given current political conditions, China would be correct to assume that the US wouldn't enforce such a law. But under conditions where the US became eager to enforce such a law, China would likely become convinced to enforce such a law by the same arguments that led the US to do so.


An explainability requirement might come close to being a ban on AI progress, or it might slow down risky approaches more than it slows safe approaches.

I expect the extent to which it slows AI capabilities progress to be modestly connected to how risky AI is.

I was briefly enthusiastic about this proposal when I first got the idea for this post. After writing the post and ruminating on the difficulties, I'm now only giving it a lukewarm endorsement.
