*This is the first post in my Effective-Altruism-funded project aiming to deconfuse goal-directedness. Comments are welcomed. All opinions expressed are my own, and do not reflect the attitudes of any member of the body sponsoring me.*

In my preliminary post, I described my basic intuitions about goal-directedness, and focussed on *explainability.* Concisely, my initial, informal working definition of goal-directedness is that *an agent's behaviour is goal-directed to the extent that it is better explained by the hypothesis that the agent is working towards a goal than by other types of explanation*.

In this post I'm going to pick away at the most visible idea in this formulation: the concept of an explanation (or at least the aspect of it which is amenable to formalization with minimal effort), and especially the criteria by which an explanation is judged. More precisely, this post is a winding path through some mathematical ideas that could be applied to quantitatively judge explanations; this collection of ideas shall be examined more closely in subsequent posts once some desiderata are established. By the end I'll have a naïve first picture of how goal-directedness might be measured in terms of goal-based explanations, while having picked out some tools for measuring explanations which goal-directedness can be contrasted with.

## What constitutes a good explanation?

In science (and rational discourse more broadly), explanations are judged empirically: they are used to generate predictions, and these are compared with the results of existing or subsequent observations. The transition from explanations to predictions will be covered in the next section. Here we'll break down the criteria for comparison.

The most obvious way an explanation can fail is if it predicts phenomena which are not observed, or conversely if it fails to predict phenomena which are observed. We can present this criterion as:

**Accuracy.** The more accurately the observations match the predictions, the better the explanation is considered to be.

That is not the end of the story. For instance, more value is attributed to explanations which produce a greater variety of predictions for future observations. An exhaustive account of existing observations with no extra consequences barely qualifies as an explanation, whereas an explanation that appears to reveal the fundamental nature of the object(s) of scientific enquiry is celebrated (this has been true for at least a century or two). To summarise, a second criterion is:

**Explanatory power.** [Only applicable to partially observed situations] If a large range of possible behaviours are compatible with the explanation (conditioned on the existing observations), this reflects badly on the explanation.

Another criterion, which applies when comparing several explanations of a single phenomenon, is that of simplicity. This judgement is mounted on Occam's razor. But simplicity (or its complement, complexity) is not a fully intrinsic property of an explanation. Even if an explanation can be presented within a mathematical formalism in a way amenable to quantifying its complexity, there may be several ways to do this, and there are various measures of complexity one could use; we shall discuss measures of complexity further in a future post. Note that simplicity is conditioned on accuracy to some extent: if no simple explanation is adequate to explain some behaviour, then a more complex explanation is acceptable.

**Simplicity.** The more complex the explanation (relative to explanations of comparable accuracy), the worse it is considered to be.

I should also mention, but will ultimately ignore, qualities of an explanation which make it appealing for psychological reasons rather than rational ones, such as the ephemeral quality of "elegance" (often associated with simplicity), comedic elements, and story structure. These factors might be important in judging human-directed explanations in natural language, but the kinds of explanation I'll be considering won't be like that.

In constructing measures of quality of explanations, I'm going to follow the three criteria explained with bullet points above. It should be clear even with just these three points that *measuring the quality of an explanation is not straightforward*, and in particular that I do not expect there to be a unique right or best way to compare explanations. If you think I'm missing an important criterion, let me know in the comments.

## Accurate explanations

### Direct mapping and extrinsic measurements of accuracy

Judging an explanation in terms of its accuracy requires a careful examination of the relationship between explanations and behaviour. Before delving into that, let us first consider the simplest/ideal situation in which each explanation under consideration generates a unique behaviour pattern. For example, we might have an explicit algorithm for some behaviour, or we might imagine a reward function on some state space for which there is a unique optimizing policy, which we shall (for the time being only!) take to be the behaviour which the reward function "explains".

Naively speaking, measuring the accuracy of the explanation then amounts to a comparison of the behaviour it generates with the observed behaviour. In simple situations/world models there can be straightforward ways to do this: if the observations of the agent's behaviour consist of mere positions in some space, we can take the (expected) distance between the predicted path of the agent and its actual path. If we use the observed behaviour to approximate the agent's policy in a (finite) Markov Decision Process (MDP) then we can take a normalized inner product of the predicted policy and the observed policy to quantify how well these agree.

I can re-express the last two paragraphs as follows. We could consider a mapping $f : E \to B$ which transforms each member of a class $E$ of explanations into a member of a class $B$ of possible behaviours (or policies); we could then consider a metric or measure of similarity between pairs of elements of $B$, and rate an explanation $e \in E$ for observed behaviour $b \in B$ in terms of the distance from $f(e)$ to $b$. Since any such measurement is determined entirely in terms of behaviour, ignoring the specifics of both the explanation and any internal working of the agent, we call this an **extrinsic measurement of accuracy**.
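To make the MDP case a little more concrete, here is a minimal sketch of the normalized inner product mentioned above. It assumes (purely for illustration) that a policy is stored as a dictionary from (state, action) pairs to probabilities; the function name `policy_agreement` and the toy policies are hypothetical.

```python
import math


def policy_agreement(predicted, observed):
    """Normalized inner product (cosine similarity) of two policies.

    Each policy is a dict mapping (state, action) pairs to probabilities.
    This flat encoding is an illustrative assumption, not a standard.
    """
    keys = set(predicted) | set(observed)
    dot = sum(predicted.get(k, 0.0) * observed.get(k, 0.0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in predicted.values()))
    norm_o = math.sqrt(sum(v * v for v in observed.values()))
    if norm_p == 0.0 or norm_o == 0.0:
        raise ValueError("policies must be non-zero")
    return dot / (norm_p * norm_o)


# Hypothetical two-state, two-action example: the prediction is deterministic,
# while the observed agent randomizes in state s0.
predicted = {("s0", "left"): 1.0, ("s1", "right"): 1.0}
observed = {("s0", "left"): 0.5, ("s0", "right"): 0.5, ("s1", "right"): 1.0}
score = policy_agreement(predicted, observed)
```

A score of 1 means the two policies agree exactly; the score drops as the observed policy puts weight on actions the explanation does not predict.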

For the algorithm example, extrinsic measurements of accuracy are more or less all that we have recourse to, at least assuming that we do not have access to the agent's source code or internal workings^{[1]}.

I should stress that there is plenty to debate regarding the choice of distance function on $B$; different choices will measure different ways in which an explanation is accurate. Consider two predictions of paths in a (discrete) graph: one of them converges exactly to the observed path except for being always one time step behind, while the other prediction coincides with the observed path half the time on average, but occasionally veers off wildly. Which of these two is considered more accurate depends on what metric is being used to compare the paths!
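The dependence on the metric can be seen in a toy version of this comparison. The sketch below assumes the graph is simply a line of integer-labelled nodes (so the graph distance is an absolute difference), and the two metrics and the example paths are my own illustrative choices.

```python
def mean_distance(predicted, observed):
    """Average graph distance between two paths on a line graph, d(a, b) = |a - b|."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def mismatch_rate(predicted, observed):
    """Fraction of time steps at which the two paths occupy different nodes."""
    return sum(p != o for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical observed path: the agent steps one node to the right each time step.
observed = list(range(10))
# Prediction 1: the same path, always one time step behind.
lagged = [max(t - 1, 0) for t in range(10)]
# Prediction 2: exact half of the time, far off the rest of the time.
erratic = [t if t % 2 == 0 else t + 6 for t in range(10)]

for name, path in [("lagged", lagged), ("erratic", erratic)]:
    print(name, mean_distance(path, observed), mismatch_rate(path, observed))
```

Under the mean distance the lagged prediction is the more accurate of the two, while under the mismatch rate the erratic prediction wins, so the ranking really is a property of the chosen metric.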

### Rewards and evaluative measurements of accuracy

Direct comparison at the behaviour level can have drawbacks, and these will hopefully lead us to a more nuanced approach to comparisons. For example, a direct comparison between the optimal policy for a given reward signal and the observed behaviour can fail to recognise when an agent attempting to optimize for that reward signal has instead gotten stuck at a local optimum far from the global optimum. Wasps are certainly trying to obtain sugar syrup when they fly into a trap constructed by humans, but since that behaviour is suboptimal and distant from the global optimum (since the wasps typically die in the trap and can no longer collect syrup) an extrinsic measurement of accuracy will fail to identify sugar syrup as their actual objective.

One way around this is to treat reward signals as reward signals! If we compute the reward the observed behaviour would receive from a given reward signal and then normalize appropriately, we can measure how accurately the reward signal explains the behaviour in terms of that reward. In mathematical terms, we suppose that we have collections $E$ of explanations and $B$ of behaviours as before, but this time a mapping $g : E \to [0,1]^B$ which sends each explanation to a corresponding (normalized) evaluation of behaviours. Then the quality of an explanation $e$ for observed behaviour $b$ is simply defined to be the evaluation $g(e)(b)$. We call this an **evaluative measurement of accuracy**.

In order for evaluative measurements to be at all useful (in particular, for them to have a chance of capturing the intended notion of explanation accuracy), we need to impose some additional conditions on the function $g$. For example, to ensure that every non-trivial explanation has some relevant content, we could impose the condition that there is always at least one behaviour which is evaluated poorly; we might achieve this by insisting that $0$ is in the image of $g(e)$ for every explanation $e$. Even without developing all of the desiderata and corresponding axioms, however, one can see how using a judiciously constructed evaluative measurement of accuracy might produce more desirable results in the wasp example.
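As a rough sketch of what such a measurement could look like, the snippet below normalizes a reward by min-max scaling over a finite set of behaviours, which automatically gives the worst behaviour an evaluation of 0. The normalization scheme, the behaviour names and the reward values for the wasp example are all assumptions made for illustration.

```python
def evaluative_accuracy(reward, behaviours, observed):
    """Min-max normalized reward of the observed behaviour, a value in [0, 1].

    `reward` is any function from behaviours to numbers. The worst behaviour
    is sent to 0 and the best to 1, so 0 is always in the image.
    """
    values = [reward(b) for b in behaviours]
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("trivial explanation: every behaviour is rated equally")
    return (reward(observed) - lo) / (hi - lo)


# Hypothetical amounts of syrup collected under three wasp behaviours.
syrup_collected = {"avoid_sugar": 0.0, "enter_trap": 3.0, "forage_safely": 10.0}
behaviours = list(syrup_collected)

score = evaluative_accuracy(syrup_collected.get, behaviours, "enter_trap")
```

Here the trapped wasp still receives a non-zero evaluation under the "collect syrup" explanation, whereas an exact comparison with the optimal behaviour would record a complete mismatch.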

A caveat to this analysis: I mentioned in the comments of my first post that I want to avoid imposing a strong association between an agent's goal (if it has one) and its competence at achieving that goal^{[2]}. Computing the reward value is better than a direct comparison with optimal behaviour, but I wonder if we can do any better than that (see the conclusion).

### Relational versions

Let's expand our considerations a little. There was no good justification for our simplifying assumption earlier that each explanation determines a unique behaviour. The most accessible example of how it can fail is that a reward signal can produce several, even infinitely many, globally optimal policies. It still seems reasonable, however, to consider "optimizing with respect to this reward signal" as a valid explanation of any of these policies, although we might eventually be inclined to judge that the explanation has less explanatory power in this situation.

To encode the existence of multiple behaviours corresponding to a given explanation, we simply use a (total^{[3]}) relation $R \subseteq E \times B$ or $R \subseteq E \times [0,1]^B$ in place of a function in each of the measurements of accuracy given above. We can then quantify the accuracy of a given explanation $e$ for observed behaviour $b$ in two stages. First, we respectively compute the distance from $b$ to each behaviour $b'$ with $e \mathrel{R} b'$, or evaluate $v(b)$ for each $v$ with $e \mathrel{R} v$. Then we combine these measurements by taking a minimum, maximum, or (at least in the finite case) a linear combination of them. Again, the choice of weightings in the last case depends on how we have chosen to reward explanatory power, or to punish an absence of such. We can extend this idea to the infinite case in reasonable cases, as we shall shortly see.
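The two-stage computation above can be sketched for the extrinsic (distance-based) case as follows. The function name, the `combine` options and the toy data are hypothetical; with uniform weights, the linear combination is just the mean distance, which gets worse as an explanation admits more behaviours far from the observed one.

```python
def relational_accuracy(related, observed, distance, combine="min", weights=None):
    """Combine the distances from the observed behaviour to every behaviour
    related to an explanation (smaller is better).

    `related` lists the behaviours b' related to the explanation; it must be
    non-empty, which is what totality of the relation guarantees.
    """
    distances = [distance(b, observed) for b in related]
    if not distances:
        raise ValueError("the relation must be total")
    if combine == "min":
        return min(distances)
    if combine == "max":
        return max(distances)
    if combine == "weighted":
        if weights is None:
            # Assumption: default to uniform weights summing to 1.
            weights = [1.0 / len(distances)] * len(distances)
        return sum(w * d for w, d in zip(weights, distances))
    raise ValueError(f"unknown combination rule: {combine}")


# Hypothetical example: behaviours are final positions on a line, and the
# explanation (a reward signal) has three equally optimal behaviours.
optimal_behaviours = [2, 4, 10]
observed = 4


def dist(a, b):
    return abs(a - b)


best = relational_accuracy(optimal_behaviours, observed, dist, "min")
worst = relational_accuracy(optimal_behaviours, observed, dist, "max")
average = relational_accuracy(optimal_behaviours, observed, dist, "weighted")
```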

### Measure-theoretic versions

Generalizing good cases of the formulations above, we can suppose that the space $B$ of behaviours is a *measure space*: it comes equipped with a distinguished collection $\Sigma$ of subsets of $B$ called *measurable sets*, and a mapping $\mu : \Sigma \to [0, \infty]$ called a *measure* which assigns a 'size' to each measurable set. This data is subject to certain intuitive conditions such as the fact that a disjoint union of (countably many) measurable sets is measurable and its measure is the sum of the parts. This is a setting in which we can perform integration, and we shall use that to quantify accuracy. We shall call $\mu$ the *reference measure*; in principle, this measure represents an unnormalized zero-knowledge prior distribution over possible behaviours. Determining what this measure 'should' be is non-trivial, just as determining the metric on $B$ for extrinsic measures of accuracy was non-trivial^{[4]}.

Given this setting, we suppose that we have mappings from explanations to *probability density functions* on $B$, say $p_e : B \to [0, \infty)$, where being a probability density function means being $\mu$-integrable and normalized such that $\int_B p_e \, d\mu = 1$ for all explanations $e$. The probability density function corresponding to an explanation amounts to a distribution over the possible behaviours of what one would expect based on the explanation. Similarly, the observed behaviour is a measurable function $q : B \to [0, \infty)$. With these, we can define the accuracy of an explanation $e$ to be the integral $\int_B p_e \, q \, d\mu$.
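For intuition, here is a minimal sketch of this accuracy integral in the simplest possible case: a finite space of behaviours with the counting measure as reference measure, so that integrals become weighted sums. The dictionaries and names below are hypothetical stand-ins for the density of an explanation, the observed behaviour and the reference measure.

```python
def integrate(function, measure):
    """Integral over a finite space: the sum of f(b) * mu({b})."""
    return sum(function[b] * measure[b] for b in measure)


def accuracy(density, observed, measure, tol=1e-9):
    """Integral of density * observed against the reference measure."""
    if abs(integrate(density, measure) - 1.0) > tol:
        raise ValueError("density must integrate to 1 against the reference measure")
    product = {b: density[b] * observed[b] for b in measure}
    return integrate(product, measure)


behaviours = ["a", "b", "c", "d"]
mu = {b: 1.0 for b in behaviours}  # counting measure (an assumption)

# Two hypothetical explanations: one commits to behaviour "a",
# the other is compatible with everything.
sharp = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}
vague = {b: 0.25 for b in behaviours}

# The observed behaviour, encoded as an indicator function of "a".
observed = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}

sharp_score = accuracy(sharp, observed, mu)
vague_score = accuracy(vague, observed, mu)
```

Because each density has to share a total mass of 1 among the behaviours it allows, the vague explanation scores only a quarter of what the sharp one does on the same observation.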

The maths in this account is starting to get a little involved, but it's also approaching what happens in statistical analysis of real-world experiments: the behaviour is reduced to a description in terms of one or more real number values, hypotheses (explanations) are used to predict the values taken by this behaviour in the form of a distribution over real values; observations along with their precisions are used to produce a distribution corresponding to the behaviour, and these are compared by integration (albeit not always exactly via the formula I wrote above). The type of density function most commonly seen in this situation is a Gaussian distribution, and as long as we have enough observations, this is a sensible choice thanks to the Central Limit Theorem.

This approach must be adapted a little to cope with clashes between discrete and continuous behaviour. Formally speaking, the probability assigned to any given value in a continuous distribution is zero. If we can perform repeated experiments then we can model discrete behaviour as being sampled from a continuous distribution, which we can reconstruct and use as a proxy for computing the accuracy of our explanations. However, when our available observations are limited, we may need to find ways to coarse-grain our explanations in order to arrive at a useful measure of accuracy. I'll leave the discussion of these issues until a later time, but I expect these considerations to rear their heads when the time comes around to compute things in experiments or other concrete situations.

There are also variations of the measure-theoretic set-up where we do not assume an existing measure on $B$; instead, either the explanations, the observations or both provide measures, which can be used to perform the integration. Since my experience in the direction of more exotic measure spaces is limited, I won't speculate about those right now.

## Powerful Explanations

The last two suggestions for measurements of accuracy also incorporated features which penalize explanations with lower explanatory power. In the relational case, there was scope to incorporate a weight inversely proportional to the number of behaviours compatible with an explanation, while the normalization of a probability distribution in the measure-theoretic set-up explicitly forces compatible behaviours to receive lower weight when they are numerous (since they must share the total probability mass of 1 between them). What other measures of the "power" of an explanation might we consider, though?

### Lower dimensional explanations

An explanation can fail to be powerful if it only predicts aspects of behaviour and provides no information about other aspects. For example, if we are observing a charged particle moving in a (vacuum in a) box containing a vertical magnetic field, then a theory of Newtonian gravity will provide reliable information about the vertical motion of the particle but will tell us little about its motion perpendicular to a vertical axis. Our theory of gravity is a weak explanation for the particle's behaviour, and we might like to quantify this weakness in terms of the dimensionality of the predictions it makes.

Consider a (suitably continuous) function $\pi : B \to \mathbb{R}^n$ which sends a behaviour to the values of some set of observable properties; up until now we might implicitly have been identifying behaviours with their full sets of observable properties, so consider this a mapping onto the values of some subset of the observable properties. To extract a sensible notion of dimension with such a map, we shall need it to be surjective^{[5]}, since otherwise we could extend $\pi$ by an inclusion into a space of higher dimension and get a similar kind of map (that is, the dimension would be meaningless). In good situations, we can 'push forward' the structure on $B$ which is used to compute accuracy along this map. We might like to say an explanation $e$ has "dimensional strength at least $n$" if for all behaviours $b$, the accuracy to which $e$ explains $\pi(b)$ according to the pushed-forward structure is at least as good as the accuracy to which $e$ explains $b$ according to the structure on $B$.

The trouble with dimensional strength is that, while it is bounded above by the dimension of $B$ (again, assuming that $\pi$ is suitably surjective), that's no help when $B$ might be infinite-dimensional. Returning to our example, the collection of trajectories of a particle in a box is already a huge infinite-dimensional space, even after imposing conditions such as a starting point and velocity. Moreover, our gravity model accurately predicts the vertical component of the particle's position over time (for a classical particle, at least), and the space of vertical components of trajectories is again infinite-dimensional, so there is no upper bound on the dimensional strength of this model. Nonetheless, we can invoke a suitable idea of "local dimension" to recover a way to quantify strength in terms of dimension.

The domain of (algebraic) **dimension theory** provides some tools for formalizing these ideas. However, the definitions involved in dimension theory are rather sensitive to the framework/mathematical model under consideration. Since any concrete examples I examine in the course of this project will be relatively simple, I do not expect this measure of explanatory power to be invoked, but it's worth keeping in mind for more intricate models.

### Explanations of Restricted Scope

Some explanations are only relevant subject to specific conditions. Newtonian gravity, for example, is only "valid" in the limit of low energy, which means that the accuracy of its predictions depends on the quantities involved in the calculations being very small (compared to the speed of light and derived quantities). While there is not a precisely defined domain in which this theory holds, seeing as the accuracy just gets worse as the quantities involved grow, it shall be useful for me to write as if such a domain exists; I'll call this the **scope** of the explanation. It should be clear that an explanation with greater scope is more powerful.

Measuring the deficiency of scope of an explanation is challenging, because it requires us to identify ahead of time *all* of the variables affecting the behaviour being examined^{[6]}. Going back to the "charged particle in a box containing a vertical magnetic field" scenario from earlier, if we didn't know about the magnetic field, we would be surprised that our gravitational explanation failed to accurately predict the observed behaviour. Historically, unknown variables affecting behaviour of physical systems have frequently been identified only thanks to failures of scope, rather than the other way around!

The considerations above open the door to an examination of my implicit assumptions. I have assumed in a few places that we can take several observations, but for these to be compatible (to give us consistent information about the nature of the behaviour being observed), we need some guarantees that either the parameters affecting the behaviour are approximately constant across observations or that the variations in those parameters are included as part of the explanation. In most cases in experimental science, it must ultimately be assumed that the variables controlled in the experiment are exhaustive. I have little choice but to acknowledge and accept this assumption too, at least at the level of generality at which this post's discussion takes place. On the other hand, I expect that the models I will consider in this project will be small enough that all of their parameters are identifiable and explicit.

So, suppose that we have identified some list of $n$ parameters as exhaustive for the purposes of our observation scenario; we could assume that these parameters take real values, so we have a space $P = \mathbb{R}^n$ of parameter values. A naïve approach for expressing explanatory power would be to look at how many of the parameters the explanation incorporates, but there are immediate problems with this. Our parameter list might be excessive, in the sense that some parameters might genuinely be irrelevant to the behaviour being observed. Conversely, it's easy to artificially include a parameter in an explanation in such a way that it makes no concrete difference to the predictions output by the explanation.

Instead, we consider the following blanket method for measuring the failure of scope of an explanation. First, we introduce a further probability measure $\nu$ (in good cases, this can be described in terms of a probability density function) on the space $P$ of possible parameter values, expressing either the frequency with which we expect parameter values to occur or the frequency with which they are actually observed. Then we can think of an explanation also as being parametrized by this space, and we can measure the combined accuracy and power of an explanation by integrating the accuracy measurement over the measured parameters. As a formula, if we take the measure-theoretic determination of accuracy too, writing $p_{e,x}$ for the density predicted by explanation $e$ at parameter value $x$, $q_x$ for the behaviour observed there and $\mu$ for the reference measure on the space $B$ of behaviours, we have $\int_P \left( \int_B p_{e,x} \, q_x \, d\mu \right) d\nu(x)$. This isn't a direct measure of deficiency of scope, but rather a measure of how good the explanation is across the range of possible parameters, which thus penalizes an explanation at parameter values outside of the range where the explanation produces accurate predictions.
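A discrete toy version of this parameter-averaged accuracy might look as follows. Everything here is hypothetical: the single parameter is a speed expressed as a fraction of the speed of light, the accuracy curve `newtonian_accuracy` is invented purely so that accuracy degrades as the speed grows, and the two parameter measures are made-up frequency tables.

```python
def scope_weighted_accuracy(accuracy_at, parameter_measure, tol=1e-9):
    """Average an accuracy function over a finite probability measure on
    parameter values (the discrete analogue of integrating over parameters)."""
    total = sum(parameter_measure.values())
    if abs(total - 1.0) > tol:
        raise ValueError("parameter measure must be a probability measure")
    return sum(weight * accuracy_at(x) for x, weight in parameter_measure.items())


def newtonian_accuracy(speed_fraction):
    """Hypothetical accuracy of a Newtonian explanation at a given speed,
    expressed as a fraction of the speed of light. Not a physical formula."""
    return 1.0 - speed_fraction ** 2


# Two hypothetical measures on the parameter: mostly slow versus mostly fast.
everyday = {0.1: 0.7, 0.5: 0.2, 0.9: 0.1}
relativistic = {0.1: 0.1, 0.5: 0.2, 0.9: 0.7}

everyday_score = scope_weighted_accuracy(newtonian_accuracy, everyday)
relativistic_score = scope_weighted_accuracy(newtonian_accuracy, relativistic)
```

The same explanation scores well when the parameter measure concentrates inside its scope and badly when it does not, which is the sense in which this quantity mixes accuracy with scope rather than measuring scope directly.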

I am not yet committed to the measure-theoretic calculations, but I must admit that they provide me with a rather flexible geometric mental picture which I have not attempted to present graphically in this post.

*Edit:* I had some conversations this week suggesting that a relational approach might also allow one to examine the two 'directions' of explanatory power described in this section, albeit in a not-immediately-quantified sense. But that might not matter: we only need to be able to compare explanations in terms of some ordering, and a partial ordering might be enough. I'll examine this in more detail in the near future.

## Simple Explanations

While accuracy and explanatory power appeared to come together in a few places in the sections above, simplicity (or, dually, complexity) of explanations is fundamentally distinct from them. The reason for this is that both accuracy and explanatory power are determined on the basis of a comparison between predicted and observed behaviour, while complexity is measured in terms of the content of an explanation and how difficult it is to extract predictions from it (independently of observed behaviour). This distinction will be reflected in our picture of how explanations can be compared later on.

### Algorithmic complexity

A standard measure of complexity which will be relevant to us is **Kolmogorov** or **algorithmic complexity**. This is calculated by assuming that the explanation can be expressed as the output of an algorithm (which takes some parameters as input); the algorithmic complexity measures the length of the *shortest algorithm* outputting the explanation^{[7]}. We shall consider a variation of this, where the explanations are themselves given as algorithms, so that complexity is simply a measure of the size/length of the algorithm.

While conceptually straightforward, there are a rather large number of factors which go into extracting a value here. What language is the algorithm written in? What are the permitted operations? In the case of continuous inputs and outputs, how is this data represented? We need a representation of members of $B$, say; for a sufficiently constrained model of computation, expressing elements of $B$ can be non-trivial even if they consist just of real numbers. More generally, several of the descriptions above have explanations expressed as *measures*, which are tricky to encode computationally!

All of these considerations make attaching a single number to algorithmic complexity... difficult. On the other hand, we don't need to worry too much about the details *a priori*, as long as we compare like with like. If we have two or more different types of explanation, as long as we constrain their expressions (the language, the permissible operations and representation of values) to be sufficiently similar as to ensure that the algorithmic complexities of the two classes are meaningfully comparable, then this should be adequate.

An apparent concern that remains is whether the resulting comparison of complexity is consistent. There are two versions of this problem. The "local" version consists of the observation that it might be that I can formalize explanations $e_1$ and $e_2$ in two languages such that in one language the complexity of $e_1$ is greater, while in the other the complexity of $e_2$ is greater. The "global" version applies the same reasoning to features such as the global minimum of complexity (of some subset of explanations) rather than the complexity of individual explanations. One might be inclined to believe that this problem is only a real concern if the basic expressions of the explanations are insufficiently formal: if I have an explanation in words, the potential inconsistency in algorithmic complexity is a direct consequence of the fact that there may not be a strictly consistent way to translate these explanations into algorithms. In fact, even if explanations are presented as algorithms in the first place, and even if we only allow translations between languages which are *purely substitutional*, in the sense that individual operations and symbols in one language are transformed consistently into strings of operations and symbols in the target language (so that complexity cannot be lost by squashing operations together), the problem still remains. It's a problem of counting, which I'll illustrate.

Consider two abstract algorithms for explanations producing a single whole number (which we're thinking of as a predicted behaviour) with a whole number parameter $n$. The first is $\alpha(n)$, the second is $\beta(\beta(n))$; here $\alpha$ and $\beta$ represent basic operations in my language. If we calculate algorithmic complexity by just counting the operations (and ignore the brackets) then these have complexity 1 and 2 respectively. Clearly, the latter is more complex. But if I have another language containing the operation $\beta$ but not $\alpha$, and the operation $\alpha$ can only be represented by three applications of another operation $\gamma$, then suddenly the first algorithm has become $\gamma(\gamma(\gamma(n)))$, and increased to complexity 3, higher than 2!^{[8]} Note that this is not just an artefact of my tweaking the definition of Kolmogorov complexity: if the two languages have only $\alpha, \beta$ and $\beta, \gamma$ as their respective basic operations, then these are easily the shortest algorithms for the respective computations.
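The counting argument above can be checked mechanically. The sketch below uses the reading suggested in the footnote ("multiply by 8", "multiply by 3" and "multiplication by 2"); the list-of-operation-names encoding of an algorithm and the names `mul8`, `mul3`, `mul2` are my own illustrative assumptions.

```python
# Semantics of the basic operations, following the footnote's suggested reading.
OPERATIONS = {
    "mul8": lambda n: 8 * n,
    "mul3": lambda n: 3 * n,
    "mul2": lambda n: 2 * n,
}

# Purely substitutional translation into the second language, which has
# "mul3" and "mul2" but no "mul8".
TRANSLATION = {"mul8": ["mul2", "mul2", "mul2"], "mul3": ["mul3"]}


def complexity(algorithm):
    """Complexity as the number of basic operations applied."""
    return len(algorithm)


def translate(algorithm, table):
    """Replace each operation by its fixed string of target-language operations."""
    return [target for op in algorithm for target in table[op]]


def run(algorithm, n):
    """Apply the operations in order to the input n."""
    for op in algorithm:
        n = OPERATIONS[op](n)
    return n


first = ["mul8"]           # complexity 1 in the first language
second = ["mul3", "mul3"]  # complexity 2 in the first language

first_translated = translate(first, TRANSLATION)
second_translated = translate(second, TRANSLATION)
```

The translation preserves what each algorithm computes, but the order of the two complexities flips from 1 < 2 to 3 > 2.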

It's not such a stretch to expect that how goal-directed an agent seems, or which explanation for its behaviour is simplest or best, depends on the language we're using. After all, it's a lot easier to describe situations where we have words adequately describing all of the elements involved (it's no wonder magic is invoked as an explanation for the functioning of complex devices in places where those devices are not familiar)...

### Computational complexities

On the other hand, it would philosophically be strange if an eventual conclusion about whether behaviour is goal-directed were strongly dependent on the choice of *computational language* used to express goals, since any specific choice of language can seem arbitrary. As such, we might instead want to restrict ourselves to **properties of algorithmic complexity which are invariant** (to some extent) under changing language, rather than the raw algorithmic complexity with respect to any particular language.

Coarser-grained notions of complexity include the many computational complexity classes in computer science, which are defined in terms of the resources involved in the execution of algorithms. Besides providing coarser grained invariants, these have an extra advantage for the purposes of explanations relating to behaviour. Namely, if we have some known constraints on the architecture of the agent under consideration, we may be able to directly constrain the complexity class of feasible explanations. For example, an agent cannot be implementing an algorithm requiring more memory than it has.

I will need a much deeper understanding of complexity theory than I currently possess to follow through with reasoning about a robust way to measure (aspects of) the complexity of an explanation.

## Conclusions

### The naïve picture and its flaws

Here is a naïve picture we can build with the tools discussed in this article. Given a class $E$ of explanations and some observed behaviour $b$ from a class $B$, we should be able to plot the complexity and accuracy/power of each explanation according to our chosen measure of these quantities. They might be plotted on some axes like these:

(I haven't included any data on these axes, because I don't know what a typical shape will look like yet!) The quality of the *class* of explanations is determined by the lower frontier of the resulting plot, which represents the best accuracy achieved by explanations as complexity increases. If the behaviour is well-explained by pursuit of a goal, for example, then small increases in the complexity of the goal-explanation should result in large increases in accuracy, and the plot will quickly approach the right-hand boundary. On the other hand, non-goal-directed behaviour will require complex explanations to output accurate descriptions of behaviour. Broadly speaking, we should compare two or more such plots in order to determine how well goal-based explanations fare against other types of explanation.
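As a minimal sketch of how the frontier described above could be extracted from data, the function below takes (complexity, accuracy) pairs and keeps those explanations that are strictly more accurate than every explanation of equal or lower complexity. The function name and the sample points are hypothetical, and I am assuming that both quantities have already been reduced to single numbers.

```python
def accuracy_frontier(points):
    """Return the (complexity, accuracy) pairs on the frontier: each one is
    strictly more accurate than every explanation that is no more complex."""
    frontier = []
    best = float("-inf")
    # Sort by increasing complexity, breaking ties by decreasing accuracy.
    for complexity, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        if accuracy > best:
            frontier.append((complexity, accuracy))
            best = accuracy
    return frontier


# Hypothetical measurements for five explanations of the same behaviour.
points = [(1, 0.2), (2, 0.7), (3, 0.6), (5, 0.9), (5, 0.8)]
frontier = accuracy_frontier(points)
```

Comparing the frontiers obtained from goal-based explanations and from some other class of explanations, for the same observed behaviour, is then one way to make the comparison of plots concrete.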

### What's missing?

One way to separate goal-directedness from optimization is to include a model of the agent's **beliefs** in the explanation. This adds a lot of potential depth to examples compared with the examples considered in the "accurate explanations" section above. I will discuss this soon.

In discussion at a conference this week, *causality* was suggested as a quality of explanations, in the sense that we prefer explanations that identify the factors leading causally to the observed behaviour, rather than just a compressed description of the observed behaviour. Einstein's explanation of Brownian motion vs the description as a random walk came to mind. It's unclear to me at this stage how to incorporate this into the framework I've outlined here, or if it falls into explanatory power somehow, as a feature that I haven't explicitly identified how to measure.

**Do you think something is missing? Putting my intended applications aside, how else might you judge the quality of an explanation?**

*Thanks to Adam Shimi for his feedback during the writing of this post and to participants at the Logic and transdisciplinarity: Mathematics/Computer Science/Philosophy/Linguistics week at CIRM for valuable discussions on this topic.*

^{^}That assumption may not be valid, of course; in existing AI we have explicit access to the source code, although not necessarily in a form that is useful from an explanatory perspective. I don't explore that possibility in this post, but...

^{^}I think that

*some*relation between competence and goal-directedness is inevitable, since an agent with a goal that has no idea how to achieve that goal might act essentially randomly, to the effect that whether or not it has a goal is not easy to detect.^{^}A relation is called

*total*if for each there exists some with . This guarantees that each explanation is "valid" in the sense of describing some possible behaviour, although it may describe several behaviours.^{^}I observe challenges and choices throughout this post. My intention in doing so is twofold. First, I want to emphasise that anyone employing any of these formulations will need to be explicit about their choices. Second, I want to be deliberate in pointing out where I am postponing choices for later.

^{^}For some suitably strong notion of surjectivity. Smooth and almost-everywhere a submersion is definitely enough, but this only makes sense if is nice enough, in the sense of admitting a smooth structure. Assuming a topological structure on B, we could employ the concept of topological submersion as a corresponding sufficient condition.

^{^}There is a sense in which deficiency in scope is complementary or dual to the dimensional deficiency of the previous section: the scope is measured in terms of the number of dependent variables, where the dimension is measured in terms of the number of independent variables.

^{^}A subtle but important point: the algorithm outputs the *explanation* (the goal, say) not the behaviour predicted by the explanation!

^{^}Maybe and represent the operations "multiply by 8" and "multiply by 3" respectively, while the second language only allows the operation of "multiplication by 2".

I'll probably update this comment after I finish some edits to previous posts.

I like this line of thinking. I'd say the big place where we've swept something under the rug here is *what counts as an agent-shaped model?* After all, we don't care about just *any* simple yet powerful explanations (otherwise the Standard Model would be much more relevant to ethics than it is), we care about the parameters of the agent-shaped models specifically.

I see what you're getting at. For an arbitrary explanation, we need to take into account not only the complexity of the explanation itself, but also how difficult it is to compute a relevant prediction from that explanation; according to my criteria, the Standard Model (or any sufficiently detailed theory of physics that accurately explains phenomena within a conservative range of low-ish energy environments encountered on Earth) would count as a very good explanation for *any* behaviour for its complexity, but that's ignoring the fact that it would be impossible to actually compute those predictions.

While I made the claim that there is a clear dividing line between (accuracy and power) and (complexity), this strikes me as an issue straddling complexity and explanatory power, which muddies the water a little.

Since I've appealed to physics explanations in my post, I'm glad you've made me think about these points. Moving forward, though, I expect the classes of explanation under consideration to be so constrained as to make this issue insignificant. That is, I expect to be directly comparing explanations taking the form of goals to explanations taking the form of algorithms or similar; each of these has a clear interpretation in terms of its predictions and, while the former might be harder to compute, the difference in difficulty is going to be suitably uniform across the classes (after accounting for complexity of explanations), so that I feel justified in ignoring it until later.

**A note on judging explanations**

I should address a point that the post leaves implicit, and which may otherwise cause confusion going forward: the quality of an explanation can be high according to my criteria even if it isn't empirically correct. Some explanations of behaviour are falsifiable: if I am observing a robot, I could explain its behaviour in terms of an algorithm, and one way to "test" that explanation would be to discover the algorithm which the robot is in fact running. However, no matter the result of this test, the judged quality of the explanation is not affected. Indeed, there are two possible outcomes: either the actual algorithm provides a better explanation overall, or our explanatory algorithm is a simpler algorithm with the same effects, and hence a better explanation than the true one, since using it is a more efficient way to predict the robot's behaviour than simulating the robot's actual algorithm.

This might seem counterintuitive at first, but it's really just Occam's razor in action. Functionally speaking, the explanations I'm talking about in this post aren't intended to recover the specific algorithm the robot is running (just as we don't need the specifics of its hardware or operating system); I am only concerned with accounting for the robot's behaviour.
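To make the robot example concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the two algorithms, the names `TRUE_SOURCE`, `EXPLANATION_SOURCE`, `accuracy`, `complexity` and `better_explanation`, and especially the use of source length as a stand-in for complexity, which is a crude proxy and not the complexity measure of the post. The "true" algorithm sums the integers up to `n` with a loop, while the explanatory algorithm uses the closed form; both predict every observation, so the tie is broken in favour of the simpler one.

```python
# Hypothetical "true" algorithm the robot runs: sum 1..n with a loop.
TRUE_SOURCE = """
def behave(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total
"""

# Hypothetical explanatory algorithm: a closed form with the same effects.
EXPLANATION_SOURCE = """
def behave(n):
    return n * (n + 1) // 2
"""


def load(source):
    """Execute a source string and return the `behave` function it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["behave"]


def accuracy(source, observations):
    """Fraction of observed (input, output) pairs the algorithm predicts."""
    behave = load(source)
    hits = sum(behave(x) == y for x, y in observations)
    return hits / len(observations)


def complexity(source):
    """Crude proxy (an assumption): count of non-whitespace characters."""
    return len("".join(source.split()))


def better_explanation(a, b, observations):
    """Prefer higher accuracy; break ties by lower complexity."""
    def key(source):
        return (-accuracy(source, observations), complexity(source))
    return min((a, b), key=key)


# The observed behaviour is generated by the robot's true algorithm.
observations = [(n, load(TRUE_SOURCE)(n)) for n in range(20)]

best = better_explanation(TRUE_SOURCE, EXPLANATION_SOURCE, observations)
```

Under these assumptions `best` is the closed-form explanation, even though it is not the algorithm that generated the observations: discovering the true algorithm does nothing to change the ranking.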