May 10, 2010
tl;dr version: The article introduces the pattern of Optimization by Proxy (OBP), which can be found in many large-scale distributed systems, including human societies. The pattern occurs when a computationally limited algorithm uses a proxy property as a shortcut indicator for the presence of a hard-to-measure target quality. When intelligent actors with different motivations control part of the data, the existence of the algorithm reifies the proxy into a separate attribute to be manipulated with the goal of altering the algorithm's results. This concept is then applied to Google and the many ways it interacts with the various groups of actors on the web. The second part of this article contains an examination of how OBP contributes to the degradation of the content of the web, and how this relates to the Friendly Artificial Intelligence concept of 'paperclipping'.
The first thing a newly-hatched herring gull does after breaking out of its shell is to peck on its mother’s beak, which causes her to give it its first feeding. Puzzled by this apparent automatic recognition of its mother, Dutch ethologist and ornithologist Nikolaas Tinbergen conducted a sequence of experiments designed to determine what precisely it was that the newborn herring gull was attracted to. After experimenting with facsimiles of adult female herring gulls, he realized that the beak alone, without the bird, would elicit the response. Through multiple further iterations he found that the characteristics that the newborns were attracted to were thinness, elongation, redness and an area with high contrast. Thus, the birds would react much more intensely to a long red stick-like beak with painted stripes on the tip than they would to a real female herring gull. It turns out that the chicks don't have an ingrained definition of 'motherness' but rather determine their initial actions by obeying very simple rules, and are liable to radically miss the mark in the presence of objects that are explicitly designed to the specification of these rules. Objects of this class, able to dominate the attention of an animal away from the intended target were later called ‘supernormal stimuli’ (or superstimuli) and have been commonly observed in nature and our own human environment ever since.
Generalising the above example, we can say that Optimization by Proxy occurs when an algorithm replaces the problem of measuring a hard-to-quantify attribute with that of measuring a usually co-occurring proxy that is computationally efficient to measure.
A similar pattern appears when algorithms intended to make optimized selections over vast sets of candidates are applied on implicitly or explicitly social systems. As long as the fundamental assumption that the proxy co-occurs with the desired property holds, the algorithm performs as intended, yielding results that to the untrained eye look like ‘magic’. Google’s PageRank, in its original incarnation, aiming to optimize for page quality, does so indirectly, by data mining the link structure of the web. As the web has grown, such algorithms, and their scalability characteristics, have helped search engines dominate navigation on the web over previously dominant human-curated directories.
When there is only a single party involved in the production, filtering, and consumption of results, or when the incentives of the relevant group of actors are aligned, such as in the herring gull case, the assumption of the algorithm remains stable and its results remain reliable.
When however instances of the proxy are in the control of intelligent actors that can manipulate it, and stand to benefit from distorting the results of the algorithm, then the existence of the algorithm itself and the motive distortions it creates alter the results it produces. In the case of PageRank, what we have is essentially Google acting as a singleton intermediary between two groups: content producers and consumers. Its early results owe to the fact that the link structure it crawled was effectively an unintentional byproduct of the buildup of the web. By bringing it to the attention of website owners as a distinct concept however, they have been incentivised to manipulate it separately, through techniques such as link farming, effectively making the altered websites act as supernormal stimuli for the algorithm. In this sense, the act of observation and the computation and publication of results alters that which is being observed. What follows is an arms race between the algorithm designers and the external agents, each trying to affect the algorithm’s results in their own preferred direction, with the algorithm designers controlling the algorithm itself and malicious agents controlling part of the data it is applied on.
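To make the link farming point concrete, here is a minimal sketch of a textbook power-iteration PageRank applied to a toy web. It is an illustration only, not Google's actual implementation: the damping factor of 0.85, the page names, and the 20-page farm are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook power-iteration PageRank over {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank equally among its outlinks.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "good" earns links organically, "spam" earns none.
organic = {
    "a": ["good"],
    "b": ["good", "a"],
    "c": ["good", "b"],
    "good": ["a"],
    "spam": ["good"],
}

# The same web after the owner of "spam" adds 20 farm pages
# whose only purpose is to link to "spam".
farmed = dict(organic)
for i in range(20):
    farmed[f"farm{i}"] = ["spam"]

before = pagerank(organic)
after = pagerank(farmed)

for name, ranks in (("organic", before), ("farmed", after)):
    top = sorted(ranks, key=ranks.get, reverse=True)[:5]
    print(name, top)
```

In this toy graph the content of "spam" never changes, yet the farm lifts it from the bottom of the ranking to above the organically linked page "b". The algorithm sees only the proxy (inbound links), so manufacturing the proxy is enough.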
The above figure (original Google drawing here) may help visualise the issue. Items that satisfy the proxy but not the target quality are called false positives. Items possessing the target quality but not the proxy become false negatives. What effectively happens when Optimization by Proxy is applied to a social system, is that malicious website owners locate the semantic gap between target quality and proxy, and aim to fit in the false positives of that mismatch. The fundamental assumption here is that since the proxy is easier to compute, it is also easier to fake. That this is not the case in NP-complete problems (while no proof of P=NP exists) may offer a glimmer of hope for the future, but current proxies are not of this class. The result is that where proxy and target quality would naturally co-occur, the arrival of the algorithm, and the distortion it introduces to the incentive structure, make the proxy and the target quality more and more distinct by way of expanding the false positives set.
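The four regions described above can be sketched in a few lines of code. This is a hypothetical illustration: the item names and the boolean labels are invented, and in practice the "has target quality" label is exactly the thing that is expensive to obtain.

```python
def proxy_regions(items):
    """Split items into the four regions of the proxy/target diagram.

    `items` maps an item name to a (has_proxy, has_target) pair of booleans.
    """
    regions = {
        "true_positive": set(),
        "false_positive": set(),
        "false_negative": set(),
        "true_negative": set(),
    }
    for name, (has_proxy, has_target) in items.items():
        if has_proxy and has_target:
            regions["true_positive"].add(name)
        elif has_proxy:
            regions["false_positive"].add(name)
        elif has_target:
            regions["false_negative"].add(name)
        else:
            regions["true_negative"].add(name)
    return regions


def precision(regions):
    """Fraction of proxy-satisfying items that really have the target quality."""
    tp = len(regions["true_positive"])
    fp = len(regions["false_positive"])
    return tp / (tp + fp) if tp + fp else 0.0


# Hypothetical web before anyone targets the proxy.
web = {
    "essay": (True, True),
    "tutorial": (True, True),
    "hidden_gem": (False, True),   # good page nobody links to
    "junk": (False, False),
}

# Manipulators add pages built to satisfy the proxy without the quality.
manipulated = dict(web, farm_1=(True, False), farm_2=(True, False))

clean_regions = proxy_regions(web)
manipulated_regions = proxy_regions(manipulated)
print(precision(clean_regions), precision(manipulated_regions))
```

The manipulated pages land entirely in the false positive region, so the share of proxy-satisfying items that actually have the target quality drops even though no legitimate page changed.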
We can obtain a little more insight by considering a simple Bayesian network representation of the situation. A key guide to algorithm design is the identification of some measure that intuitively will be highly correlated with quality. In terms of PageRank in its original incarnation, the reasoning is as follows. High quality web sites will attract attention from peers who are also contributing related content. This will “cause” them to link into the web site under consideration. Hence if we measure the number of highly ranked web sites that link into it, this will provide us with an indication of the quality of that site. The key feature is that the causal relationship is from the underlying quality (relevance) to the indicator that is actually being measured.
This simple model raises a number of issues with the use of proxies. Firstly, one needs to be aware that it is not just a matter of designing a smart algorithm for quantifying the proxy. One also needs to quantify the strength of association between the proxy and the underlying concept.
Secondly, unless the association is an extremely strong one, this makes use of the proxy a relatively “lossy” test for the underlying concept. In addition, if one is going to use the proxy for decision-making, one needs some measure of confidence in the value assigned to the strength of the relationship – a second-order probability that reflects the level of experience and consistency of the evidence that has been used to determine the strength of the relationship.
Finally, and most critically, one needs to be aware of the consequences of performing inference in the reverse causal direction. In modeling this as a Bayesian Network, we would use the conditional probability distribution p(PR | Q) as a measure of the “strength” of the relationship between cause and proxy (where “PR” is a random variable representing the value of PageRank, and “Q” is a random variable representing the value of the (hidden) cause, Quality). Given a particular observation of PR, what we need to determine is p(Q | PR), the distribution over Quality given our observation of the proxy. This (in our simple model) can be determined through the application of Bayes’ rule:

$$ p(Q \mid PR) = \frac{p(PR \mid Q)\, p(Q)}{p(PR)} $$
What this reminds us is that the prior probability distribution on Quality is a major factor in determining its posterior following an observation on the proxy. In the case of social systems however, this prior is the very thing that is shifting.
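A small numerical sketch may help show how much the prior matters. It assumes, purely for illustration, that both Quality and PageRank are binary (high or low), and it uses invented likelihoods: a high-quality page gets a high PageRank 80% of the time and a low-quality page 10% of the time. None of these numbers come from real data.

```python
def posterior_quality(prior_q, p_pr_given_q, p_pr_given_not_q):
    """p(Q | PR) for binary Q and an observed high PR, via Bayes' rule."""
    # p(PR) by the law of total probability.
    evidence = p_pr_given_q * prior_q + p_pr_given_not_q * (1.0 - prior_q)
    if evidence == 0.0:
        raise ValueError("observation has zero probability under the model")
    return p_pr_given_q * prior_q / evidence


# Assumed likelihoods, held fixed while the prior on Quality shifts.
P_PR_GIVEN_Q = 0.8
P_PR_GIVEN_NOT_Q = 0.1

# As low-quality pages flood the web, the prior p(Q) falls.
posteriors = {
    prior: posterior_quality(prior, P_PR_GIVEN_Q, P_PR_GIVEN_NOT_Q)
    for prior in (0.5, 0.3, 0.1)
}

for prior, post in posteriors.items():
    print(f"p(Q) = {prior:.1f}  ->  p(Q | PR high) = {post:.3f}")
```

With the likelihoods held fixed, the same observation of a high PageRank supports a much weaker conclusion about quality once high-quality pages become rare. Manipulation that also raises p(PR | not Q) would lower the posterior further.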
One approach by algorithm owners is to keep secret the operation of the algorithm, creating uncertainty over the effects of manipulation of the proxy. This is effectively security by obscurity and can be counteracted by dedicated interrogation of the algorithm’s results. In the case of PageRank, a cottage industry has formed around Search Engine Optimization (SEO) and Search Engine Marketing (SEM), essentially aimed at improving a website’s placing in search engine results, despite the secrecy of the algorithm’s exact current operation. While a distinction can be made between black-hat and white-hat practitioners, the fact remains that the existence of these techniques is a direct result of the existence of an algorithm that optimizes by proxy. Another approach may be to use multiple proxies. This however is equivalent to using a single complex proxy. While manipulation becomes more difficult, it also becomes more profitable, as fewer people will bother doing it.
As a response to the various distortions and manipulations, algorithms are enriched with heuristics to identify them. As the arms race progresses, the hope is that this converges to a point where the proxy approaches the original target ever more closely, so that the external actors are forced to simulate the algorithm’s target quality to the point where, to misquote Arthur C. Clarke, “sufficiently advanced spam is indistinguishable from content”. This of course would hold only if processing power were not an issue. However, if processing cost were not an issue, and if an algorithm could be made to describe the concept at the level a human is able to distinguish, far more laborious algorithms could be used to evaluate the target attribute directly. Optimization by Proxy, being a computational shortcut, is only useful when processing power or the ability to define the target is limited. In the case of Web search, there is a natural asymmetry, with the manipulators able to spend many more machine- and man-hours on optimizing the result than the algorithm can spend judging the quality of any given item. Thus, algorithm designers can only afford to tackle the most broadly occurring and easily distinguishable forms of manipulation, while knowingly ignoring the more sophisticated or obscure ones. On the other hand, the defenders of the algorithm always have the final judgment and the element of surprise on their side.
Up to this point, I have tried to describe Optimization by Proxy and the results of applying it to social systems, and used Google and PageRank as a well-known example for illustration purposes. The rest of this article focuses more on the effect that Google has on the Web and applies this newly introduced concept to further the understanding of that situation.
While Google can and does make adjustments and corrections to its algorithms, it can only catch manipulations that are themselves highly automated, such as content scraping and link farms. There have long been complaints about the ever increasing prevalence of made-for-adsense websites, affiliate marketers, and other classes of spam in search results. These are much harder nuts to crack, and the difficulty comes back to the original limitations of the algorithm. The idea behind made-for-adsense websites is that there is low quality, human-authored original content that is full of the appropriate keywords, and which serves adsense advertisements. The goal is twofold: first, to draw traffic into the website by ranking highly for the relevant searches, and secondly, to funnel as many of these visitors to the advertisers as possible, thereby maximising revenue.
Optimization by Proxy here can be seen occurring at least thrice. First, it is exploited as a way of gaining prominence in search results using the above-mentioned mechanisms. Secondly, the fact that the users' only relevance metric, other than search ranking, is the title and a short snippet can mislead users into clicking through. If the title is closely related to their search query, and the snippet seems relevant and mentions the right keywords, the users will trust this proxy even when the actual quality of the content that awaits them on the other side is substandard. Finally, advertisers will have their ads placed on low quality websites that are selected by keyword, when perhaps they would have preferred that their brand not be associated with borderline spam websites. This triple occurrence of Optimization by Proxy creates a self-reinforcing cycle where the made-for-adsense website owners are rewarded with cold hard cash for their efforts. What's worse, this cash flow has been effectively subtracted from the potential gains of legitimate content producers. One can say that the existence of Google search/adsense/adwords makes all this commerce possible in the first place, but this does not make the downward spiral of inefficiency disappear. Adding to this the related scourge of affiliate marketers only accelerates the disintegration of quality results.
An interesting characteristic of this problem is that it targets less savvy users, as they are the most likely to make the most generic queries, be unable to distinguish a trusted from an untrusted source, and click on ads. This means that those with the understanding of the underlying mechanics are actually largely shielded from realising the true extent of the problem.
Its effectiveness has inevitably led to an industrialisation of the technique, with content farms such as Demand Media, which pays about $5 per article and expects its authors to research and produce 5 articles an hour(!). It also pays film directors for short videos and has become by far the largest contributor to YouTube. Its method relies on purchasing search logs from ISPs and data mining those and other data sets for profitable niche keywords to produce content on. Demand Media is so wildly profitable that there is talk of an IPO, and it is obviously not the only player in this space. No matter what improvements Google makes to its algorithm, short of aggressively delisting such websites (which it hasn't been willing to do thus far), the algorithm is unable to distinguish between low quality and high quality material, as previously discussed. The result is the crowding out of high quality websites in favour of producers of industrialised content that is designed to just barely evade the spam filters.
What we have seen is that a reliance on a less than accurate proxy has led to vast changes in the very structure and content of the web, even when the algorithms applied are less intelligent than a human and are constantly supervised and corrected by experts. All this in my mind drives home the fundamental message of FAI. While descriptions of FAI have thus far referred to thought experiments such as paperclipping, real examples, albeit on a smaller scale, are all around us. In our example, the algorithm is supervised by at least four distinct groups of people (Google, advertisers, content producers, consumers) and still its effects are hard to contain due to the entangled incentives of the actors. Its skewed value system is derailing the web contrary to the desires of most of the participants (except for the manipulators, I guess). For PageRank a positive is a positive, whereas the difference between a true and a false positive is only apparent to us humans. Beyond PageRank, I feel this pattern has applicability in many areas of everyday life, especially those related to large organizations, such as employers judging potential employees by the name of the university they attended, companies rewarding staff, especially in sales, with a productivity bonus, academic funding bodies allocating funds according to bibliometrics, or even LessWrong karma when seen as an authority metric. Since my initial observation of this pattern I have been seeing it in more and more places, and I now consider it one of my basic 'models', in the sense that Charlie Munger uses the term.
While I have more written material on this subject, especially on possible methods of counteracting this effect, I think this article has gone on way too long, and I'd like to see the LessWrong community's feedback before possibly proceeding. This is a still-developing concept in my mind and my principal motivation for posting it here is to solicit feedback.
Disclaimer: Large parts of the above material have been published at the recent Web Science '10 conference. Also parts have been co-written with my PhD supervisor Prof. Paul Krause. Especially the Bayesian section is essentially written by him.
I should also probably say that, contrary to what you might expect, Google is one of the technology companies I most respect. Their success and principled application of technology has just happened to make them a fantastic example for the concept I am trying to communicate.
Update(s): The number of updates has gotten a bit unwieldy, so I just collapsed them all here. To summarize, there have been numerous changes throughout the article over the last few days as a response to the fantastic feedback throughout the comments here and elsewhere. Beyond the added links at the top on prior statements of the same principle in other fields, here is also a very interesting article on the construction of spam, with a similar conclusion. Also, I hear from the comments that the book Measuring and Managing Performance in Organizations touches on the same issue in the context of people's behaviour in corporate environments.
Followup on the Web: Since I am keeping my ear to the ground, here I will try to maintain a list of articles and discussions that refer to this article. I don't necessarily agree with the contents, but I will keep them here for future reference.