Algorithms as Case Studies in Rationality

by abramdemski6 min read14th Feb 201140 comments

38

Logic & Mathematics Computer Science
Personal Blog

This post springs out of a very long line of thought, which I will only summarise small parts of.

It began with the puzzling realisation that the algorithm which computers use to perform symbolic integration is radically different from the guess-and-check method taught in schools. My first reaction was, why are we not taught the systematic way of doing it? This is true for other areas of mathematics as well. In a few cases, such as solving quadratic equations or systems of linear equations, students are eventually taught the fast way. However, in many cases, the existence of an algorithm is not even mentioned.

My point, however, is not to criticise the educational practice; in fact, I agree with the idea of teaching mathematics as an exploration of ideas rather than an application of formulaic solution methods. Rather, I would like to encourage people to eventually learn the algorithms, and try to apply them. A good algorithm is a study in rational behaviour, and I think we can take home a lesson from each.

I'll just give two examples which I find particularly fascinating: Knuth-Bendix completion and the summary-product algorithm. The first is most relevant to fast mathematical reasoning. The second is relevant to studying the reasonableness of fast-and-messy probabilistic reasoning, the way humans do it.

Knuth-Bendix Completion

Knuth-Bendix completion ("K-B" from now on) is a fascinating formalisation of the mathematical idea of simplification. Everyone should recognise the phrase "simplify your answer" from high school homework assignments. I never suspected that simplification could be a really powerful method for mathematical reasoning.

K-B is a method for reasoning about equality. Equality is one of the simplest relationships we can have between two things, yet the naive way of reasoning about it results in an explosion of possible inferences. If there are N things in our universe, there are N2 equality statements which may be true or false. We can substitute equals for equals all day and get nowhere. So why does it seems like such a simple relation? I think the K-B algorithm gives a good answer to that.

We start with the basic axioms for the domain we are reasoning about. The domain consists of a set of expressions, some of which are equal to others. The axioms provide some general equalities; for example, they might be the trigonometric identities, if we are reasoning about trigonometric expressions.

First, we decide on a relatively arbitrary ordering of "simplicity" over the expressions we're using. Practical considerations and personal preferences will go into this, but for the algorithm itself, there are only a few requirements: it should be a well-ordering (so any two expressions are comparable, and there are no infinitely descending chains), and it should be consistent in the sense that if we replace a sub-expression with a simpler sub-expression, the whole expression gets simpler.

Second, we take the starting set of identities, and now treat them as rewrite rules: we judge which direction goes from complex to simple and we only apply them in that direction. Since expressions can only get simpler, we now know that we won't just keep producing equality statements endlessly without getting where we want to go.

Unfortunately, we lose some information by just taking one side of the equality. The truth is, it's sometimes necessary to alter an expression in a way which at first makes it more complicated, to allow application of another rule which eventually gets us to a simpler form.

The genius of K-B is to find these situations, referred to as critical pairs, and add new simplification rules which abbreviate such steps: Third, look for critical pairs and add the necessary rules. (I won't try to give the exact statement of this step; I refer the reader again to the Wolfram article. [Edit: Or the Wikipedia article.]) This step is repeated until the set of rules is complete. (It will sometimes happen that completion is impossible, sadly.)

Once the rules are complete, we can reason purely by simplification. The procedure "simplify" is embodied by looking through our list of rules and applying them in the simplifying direction, until we can't find one that applies anymore. The result of simplification is referred to as the normal form of an expression. We can test whether two expressions are equal by putting them in normal form: if the two are equal, they will have the same normal form; if they are not equal, they won't.

How can this be applied to life? Well, I admit that this one only applies if you do fairly advanced math on a regular basis (though, since I'm writing for rationalists, perhaps you'll be sympathetic if I encourage you to learn more mathematics and to apply it to your life!). In my experience, the K-B algorithm matches the behaviour of sophisticated mathematical reasoning fairly well; people will have a preferred way of simplifying different sorts of expressions, and the interesting theorems correspond to new simplifications which can only be proved by initially reasoning in a direction which looks more complex.

The second algorithm I'll review is more applicable to everyday situations.

Summary-Product

Summary-product is a recent algorithm, but one with deep roots. The interesting thing about this algorithm is that it is a general case of dozens of the most important algorithms in different specific areas. The PDF I linked to is a good introduction to this aspect of the algorithm, and I encourage readers to check it out; for my overview, however, I'll only look at the special case which is known as the belief propagation algorithm.

Belief propagation is a way of reasoning about probabilities in a tractable way.

First, we need to factor our beliefs into a large number of local belief functions, which deal only with a few statements. These represent local probabilistic dependencies between facts; perhaps conditional probabilities (in what's called a Bayesian Network) or just weights which contribute to probabilities (in what's called Markov Networks). What's important is that if we multiplied all the local functions together, we would get a coherent probability distribution.

Next, we take stock of the evidence. This gives fixed beliefs to some of the variables.

What we want to find now are beliefs for the other variables, referred to as marginal probability distributions. The naive way of finding these is to sum over the entire state-space. For N unobserved variables, we have an N-dimensional state-space; we want to sum over every variable but the one which we're calculating a belief for. If every dimension is of size v, we have vN entries to sum up.

The idea which leads to the belief propagation algorithm is that we can do better by using the factorisation, thanks to the distributive law: we push the sums as far in as we can, so that whenever possible, we sum over a dimension before we multiply. If the dependencies are treelike, this will give us a calculation which takes time proportional in N, rather than vN; a huge gain.

Turning this into a local propagation algorithm gives us rules for updating a belief in relation to its neighbours. This has two advantages.

First, it happens that we can calculate a belief for all of the variables in just twice the time that it takes to calculate the belief for the one; that is, 2N time. Why? Computing the belief function for one variable amounts to propagation starting at the far edges of the network and heading towards that variable. This loads the network with most of the information it needs. We can finish off the computation by simply propagating in the reverse direction, away from that variable; this spreads the evidence all around the network like jelly on toast, completing the marginalisation of every variable in 2N time.

Second, a surprising feature of the propagation algorithm is that it gives good approximate results for non-treelike factorisations. This is like saying that we can push in sums even when the distributive law doesn't hold! Rather than 2N propagations, we just keep propagating until our beliefs converge to stable values (or until we're fed up with waiting). In many situations, this will still converge to exact results or useful estimates. This makes belief propagation a reasonable candidate for general inference.

I think there are many lessons for rationality hidden in the details of the belief propagation algorithm. In fact, since belief propagation is a core Bayesian activity, there are already posts here dealing with some of the important lessons. I'll just mention one which I find quite striking, personally.

A naive method of belief propagation might update the belief of a variable based on the raw estimated marginal which the neighbour currently has. If two neighbouring beliefs X and Y imply each other with high probability, though, we'd run into trouble: suppose some evidence raises our belief in X. Then we can propagate this to Y, raising our belief there. But then we notice that once again, a neighbour of X has changed, and we can re-update X; performing belief propagation raises our belief in X once more. We'd go back and forth like this until X and Y lock in with probability indistinguishable from 1.

The actual rules of belief propagation stop this from happening with the idea of a directional message. The message which X sends to Y is the belief we compute for X with all the evidence except the evidence coming from Y. In general, the belief which gets sent in a particular direction uses all the evidence except the evidence coming back that same direction.

This prevents circular reasoning. For all the the talk about the fallacy of circular reasoning, classical logic does not really have a good treatment of it. For a statement A, it's perfectly valid to prove A from an assumption of A in classical logic; in fact we can prove A->A for any proposition (where "->" is material implication)! I think it's not until we think about tractable probabilistic reasoning that we really get a good account of what's wrong with circular reasoning: it makes us count the same evidence more than once.

Next time you're talking with someone, try to keep track of the directionality of the beliefs being propagated by an argument. Yes, I'm serious: it works! In fact, most of the time it's rather trivial. Sometimes, though, it gives an interesting insight into what's going on-- often cases where classical logic tells us that an inference is just fine, but informal pragmatics tell us that there is something silly about it. (I think there's some value in knowing precisely what we mean by silly.)

Conclusion

I presented very rough glosses of these two algorithms, and I'd encourage everyone to read more about the details in other places. However, what I'm really trying to drive at here is a broader point: algorithms which have been developed by computer scientists can really give us useful lessons about rationality. Often we find that if it's a good way for a computer to think, it's a good way for us to think, too. This does not mean memorising and mechanically executing an algorithm, of course; it means understanding the trick that makes the algorithm good, and applying that trick whenever we think it's relevant, until it becomes second nature.

Question for discussion: what algorithms have you applied to your own thought processes? What algorithms might you be interested in applying?

38

40 comments, sorted by Highlighting new comments since Today at 5:24 PM
New Comment

Sometimes, though, it gives an interesting insight into what's going on-- often cases where classical logic tells us that an inference is just fine, but informal pragmatics tell us that there is something silly about it.

Can you please give an example of this?

It's an interesting experience to learn formal logic and then take a higher-level math class (any proof-intensive topic). During the process of finding a proof, we ask all sorts of questions of the form "does that imply that?". However, since we're typically proving something which we already know is a theorem, we could logically answer: "Yes: any two true statements imply one another, and both of those statements are true." This is a silly and unhelpful reply, of course. One way of seeing why is to point out that although we may already be willing to believe the theorem, we are trying to construct an argument which could increase the certainty of that belief; hence, the direction of propagation is towards the theorem, so any belief we may have in the theorem cannot be used as evidence in the argument.

What do you think, am I overstepping my bounds here? I feel like the probabilistic case gives us something more. In classical logic, we either believe the statement already or not; we don't need to worry about counting evidence twice because evidence is either totally convincing or not convincing at all.

"does that imply that?"

Because in casual speech the question doesn't actually mean "does that imply that?", but rather "do we have a derivation of that from that, using our set of inference rules?" Not the same, but people seldom realise the distinction.

This is the "paradox of the material conditional", which is one of the primary motivations of relevance logic - to provide a sentential connective that corresponds to how we actually use "implies", as opposed to the material (truth-functional) implication.

http://plato.stanford.edu/entries/logic-relevance/

Good point! Perhaps you won't be surprised, though, if I say that my own preferred account of the conditional is the probabilistic conditional.

That is also a fair interpretation, especially for those students who just want to get the homework done with and don't really care about increasing their sureness in the theorem being re-proved.

If we additionally care about the argument and agree with all the inference rules, then I think there is a little more explaining to do.

Not only for the students, I think. Confusion between implication and inference was enough widespread to motivate Lewis Carroll to write an essay, and nothing much changed has since then. I didn't properly understand the distinction even after finishing university.

Another example (perhaps a bit frivolous): when browsing Less Wrong comments and deciding which to upvote, it might be tempting to take the existing upvotes into account. However (to an extent-- it's just an analogy), this is like using the fact that my probability for some statement A is high as an argument to increase the probability of A. The direction of propagation is towards the estimated goodness of the post, so using that estimate in the argument is bad form.

where classical logic tells us that an inference is just fine, but informal pragmatics tell us that there is something silly about it.

Give me a long enough lever and a place to stand...

Archimedes did not know that gravity was caused by the Earth's mass. His only mistake was overconfidence about the cause of gravity, which can be seen from Bayesian reasoning, not just informal pragmatics.

Algorithms I find useful that I didn't put in the article:

--Find decent solutions to packing problems by packing the largest items first, then going down in order of size

--Minimum-description-length ideas (no surprise to rationalists! Just occam's razor)

--Binary search (just for finding a page in a book, but still, learning the algorithm actually improved my speed at that task :p)

--Exploration vs exploitation trade-off in reinforcement learning (I can't say I'm systematic about it, but learning the concept made me aware that it is sometimes rational to take actions which seem suboptimal from what I know, just to see what happens)

My classical example for algorithms applicable to real life: Merge sort for sorting stacks of paper.

Only that "Exploration vs exploitation trade-off" is not an algorithm. Reinforcement learning (RL) is pretty much "non-algorithmic" (as Pei Wang would say). ETA: there are specific algorithms in RL (and in -- related -- planning and game playing), but the "trade-off" is a concept; it sure needs to be expressed algorithmically but is it fair to give credit to "algorithmicality" in this case?

Right; when I say "I'm not systematic about it" I mean that I don't purposefully follow a specific algorithm. I would probably benefit from being a bit more systematic, but for the moment, I'm merely trying to "train my intuition".

I would hope that all these algorithms would be applied "non-algorithmically" in Pei Wang's sense-- that is, the ideas from the algorithm should interact dynamically with the rest of my thought process.

Reinforcement learning is pretty much "non-algorithmic"

I'm rather certain I could implement reinforcement learning as an algorithm. In fact, I'm rather certain I have done so already. If I can point to an algorithm and say "look, that's a damn reinforcement learning algorithm" then I'm not sure how meaningful it can be to call it "non-algorithmic".

I concede, RL is a prototype example of algorithmic learning problem. The exploration vs exploitation trade-off is something that needs to be addressed by RL algorithms. It is fair then to say that we gain insight into the "trade-off" by recognizing how the algorithms "solve" it.

It is also fair to say there is an abstract concept of 'trade off' that is not itself algorithmic.

Nice to see you here again!

The Wolfram link for K-B completion was surprisingly unhelpful, Wikipedia worked much better for me because it has a detailed example.

Symbolic integration is not actually a complete algorithm because it sometimes needs to check if an expression is equivalent to zero, which is not yet known to be decidable, so in practice people use heuristics for that part. Though of course I agree that teaching math students to find antiderivatives by tricks and guesses is dumb.

I actually cited the Wolfram article because I preferred it, but I went ahead and added a link to the wikipedia article for those whose taste is closer to yours! Thanks.

The Risch algorithm for symbolic integration is what first gave me a hunger to learn "the actually good ways of doing things" in this respect, and a sense that I might have to search for them beyond the classroom. However, I never did learn to use the Risch algorithm itself! I don't really know whether it turns out to be good for human use.

My impression is that many if not most algorithms for computers are not quite directly usable by humans.

For example, back-tracking is a simple algorithm that works very well for some problems, but there are just too many steps for anything but the smallest problems for a human to follow, even with pen and paper. A human will need fudge parts of it (skip steps, make decisions based on guesses instead of systematically) to be able to finish it quickly.

But knowing about real back-tracking is still useful: one has a better intuition which steps should be fudged and which shouldn’t, a better estimate of how hard the problem is (which helps, e.g., for deciding whether you’re likely to find a solution if you spend a bit more time, or if it’s better to go to a computer), or how to pick solutions systematically when deciding it’s worth to do it “by hand”. This applies to many algorithms.

Agreed! My intention is definitely more toward the second approach then the first.

Recommending http://algorithmstoliveby.com/ for the same reasons

How do we deal with cycles larger than 2?

Referring to belief propagation? The actual procedure does something really simple: ignore the problem. Experimentally, this approach has shown itself to be very good in a lot of cases. Very little is known about what determines how good an approximation this is, but if I recall correctly, it's been proven that a single loop will always converge to the correct values; it's also been proven that if all the local probability distributions are Gaussian, then the estimated means will also converge correctly, but the variances might not.

Many things can be done to improve the situation, too, but I'm not "up" on that at the moment.

From what I recall of reading Pearl (Probabilistic Reasoning in Intelligent Systems), there are a few other ways to do it.

One is to collapse the cycle down to one multi-input, multi-output node, so you're conditioning on all inputs. This makes the node more complex, but forces you to consider the beliefs together, and prevent the information cascade problem.

Another is to make each message carry information about where it originated so that you can spot where evidence is being double-counted. (This is analogous to the "cite your sources" requirement in scholarship so that a faulty piece of evidence can be traced through multiple papers back to its source.)

Generally, to find ways around the problem, think back to the "soldier counting" problem and see what solutions would work there. The soldier counting problem is where you have some network of soldiers -- ideally, without cycles -- and want to know how you can count the size of the network by only looking at messages from soldiers connected to you. With an acyclic network, anyone can send the message "count" to all connections, and all soliders do the same until they're connect to no one, at which point they return "1" plus the sum of any messages coming back.

EDIT: On rereading your question, it seems I associated it with the wrong section of the article, matching to a question I asked and answered for myself when reading it.

The last two steps in the simplification get combined into one step, and then the next pass combines that step with the previous step, and so on until all the steps are combined and the last pass does not combine any steps so you know that you are done.

Thanks for the post. I wasn’t quite aware of those particular algorithms, nor of the usefulness of thinking in real life in terms of algorithms.

I did have a vague feeling that “if only more people would study programming, I wouldn’t want to hit that many of them that often”, but this post raised my awareness of why it is so.

I can’t think of applying a particular algorithm in life(*), but I do notice myself thinking in programming terms. For example, more than once I noticed thinking of things like government and management, and the difficulty thereof, in terms of trees (the logical structure) and cooperative peer-to-peer algorithms and bottlenecks.

(*: The one exception that comes to mind is binary search, which I do use on occasion; for example, I never could remember how to do roots and logarithms “properly”, and when I need one I do a quick binary search with the reverse operation.)

I’m really curious to hear of other algorithms you (all) found useful for human reasoning.

I suspect your intuition about computer programming is also based on the way it forces a certain amount of clear thinking to be done.

DPLL and JTMS (justification-based truth maintenance system) might be good candidates for "teach this algorithm to humans to enhance human rationality".

I'll second and constraint-solving algorithms; specifically, the variable-ordering heuristics seem helpful to me. Choose the variables which most constraint the problem first! Note, the constraint propagation step is another instance of the sum-product algorithm. :)

I've wondered lately while reading The Laws of Thought if BDDs might help human reasoning too, the kind that gets formalized as boolean logic, of course.

This article reminded me of your post elsewhere about lazy partial evaluation / explanation-based learning and how both humans and machines use it.

You do manipulate BDDs as a programmer when you deal with if- and cond-heavy code. For example, you reorder tests to make the whole cleaner. The code that you look at while refactoring is a BDD, and if you're refactoring, a sequence of snapshots of your code would be an equivalence proof.

This is the lazy partial evaluation post, cut and pasted from my livejournal:

Campbell's Heroic Cycle (very roughly) is when the hero experiences a call to adventure, and endures trials and tribulations, and then returns home, wiser for the experience, or otherwise changed for the better.

Trace-based just-in-time compilation is a technique for simultaneously interpreting and compiling a program. An interpreter interprets the program, and traces (records) its actions as it does so. When it returns to a previous state (e.g. when the program counter intersects the trace), then the interpreter has just interpreted a loop. On the presumption that loops usually occur more than once, the interpreter spends some time compiling the traced loop, and links the compiled chunk into the interpreted code (this is self-modifying code) then it continues interpreting the (modified, accelerated) program.

Explanation-based learning is an AI technique where an AI agent learns by executing a general strategy, and then when that strategy is done, succeed or fail, compressing or summarizing the execution of that strategy into a new fact or item in the agent's database.

In general, if you want to make progress, it seems (once you phrase it that way) just good sense that, any time you find yourself "back in the same spot", you should invest some effort into poring over your logs, trying to learning something - lest you be trapped in a do loop. However, nobody taught me that heuristic (or if they tried, I didn't notice) in college.

What does "back in the same spot" mean? Well, returning from a recursive call, or backjumping to the top of an iterative loop, are both examples. It doesn't mean you haven't made any progress, it's more that you can relate where you are now, to where you were in your memory.

Thanks for the analogy between those two algorithms! I think more could be done in the way of specifying when and how it is useful to go back and reflect, but deciding how to apply these algorithms to everyday thinking is really something that requires empiricism. These are habits to be perfected (or discarded) over longer periods of time.

Knuth-Bendix Completion

I think you just described how to do most undergraduate and high school maths exams!

I'm glad I conveyed my enthusiasm. :) I think that one reason I was overconfident when I got to higher math was because I had mastered this sort of simplification-based reasoning, and assumed it would always work!

[-][anonymous]10y 0

I'd hope that this doesn't hold for most undergraduate maths courses, but it's definitely false at the top end.

What do you mean by "most"? It seems to work pretty well all the way up through calc 1, imho.

Correction: I was wrong; there are some basic cases which knuth-bendix can't handle. It looks like it wouldn't be sufficient up to calc 1 after all.

[-][anonymous]10y 0

I was thinking of mathematics degree courses as a whole, rather than specific lecture courses, and in particular of the British system. The mechanics of calculus is taught at A-level in the UK, and here I'd definitely agree that following a standard recipe is most of what's required. But the key feature of a good university maths course is that it develops certain ways of thinking that enable you to tackle problems unlike any you've seen before, and this is the experience that I hope would be shared by all students of mathematics. This is a genuine hope though, not an expectation.

I liked your original post by the way.