Historically, AI has had several winters, in which hype bubbles rose and collapsed. Its history is also full of pronouncements and predictions such as:

“We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.”

There was also an important philosophical confusion that plagued AI in the 1970s. It has been discussed in depth in a great text, “Artificial Intelligence Meets Natural Stupidity”, as well as here. The confusion was around attempting to represent knowledge in diagrams like these:

[Figure: knowledge-representation diagram built from IS-A links]

An even sillier example of “knowledge”:

[Figure: second diagram built from IS-A links]

From the paper:

Ø the tendency to see in natural language a natural source of problems and solutions. Many researchers tend to talk as if an internal knowledge representation ought to be closely related to the "corresponding" English sentences; and that operations on the structure should resemble human conversation or "word problems".

This is a good example of how a serious philosophical confusion about how much knowledge can be “contained in” a statement drove a large amount of research into systems that later proved unworkable.

What philosophical assumptions is AI research fundamentally mistaken about today?

Right now, the dominance and usefulness of current deep learning neural networks make it seem that they capture some key pieces of general reasoning and could lead us to a more general intelligence. Sarah Constantin disagrees. David Chapman isn’t sure either.

I wish to highlight several other politico-philosophical problems that are inherent in writing about AI and AI safety. If the AI researchers of the past believed that writing “happiness is-a state of mind” led to good reasoning, there are likely similar issues that occur today.

The current big problems are:

1. Misunderstanding of “bias”

2. Contradictory treatment of human trustworthiness

3. Deep confusions about the nature of “values”

4. Overemphasized individuality and misunderstanding the nature of status

5. Overreliance on symbolic intelligence, especially in the moral realm

6. Punting the problems elsewhere

1. Human Bias

The idea that humans are “biased” doesn’t sound too controversial. That description usually arises when a person fails to fulfill a model of rationality or ethics that another person holds dear. There are many models of rationality from which a hypothetical human can diverge, such as VNM rationality of decision making, Bayesian updating of beliefs, certain decision theories or utilitarian branches of ethics. The fact that many of them exist should already be a red flag on any individual model’s claim to “one true theory of rationality.”

However, the usage of the term “bias” in popular culture generally indicates something else: a real or imagined crime of treating people of different categories in a manner not supported by the dominant narrative, as in “police are biased against a certain group” or “tech workers are biased against a certain group.” In this case, “those humans are biased” is a synonym for “this is a front line in the current culture war” or “those humans are low status.” Actually asking whether the said group is biased, or whether its bias is the only causal explanation for some effect, gets you fired from your job. As a result, a more rational discourse on how human reasoning relates to the ideal can’t happen in the tech industry.

This question has entered the AI discourse from several directions: narrow AI being used today in the justice system, a similar concern about potential narrow AI applications in the future (see the DeepMind paper), and the discussion of “human values.”

As a Jacobin piece has pointed out, the notion of mathematical bias, such as a propensity to over- or under-estimate, is completely different from what journalists consider “bias.” A critique of ProPublica is not meant to be an endorsement of a Bayesian justice system, which is still a bad idea because it punishes things correlated with bad actions instead of the bad actions themselves. The question of bias is being used badly on both sides of the debate about AI in the justice system. On one hand, the AI could be biased; on the other hand, it could be less biased than the judges. The over-emphasis on equality as the only moral variable is, of course, obscuring the question of whether the justice system is optimally fulfilling its real role as a deterrent of further escalations of violence.

Even as a simple mathematical construct, the notion of “bias” is over-emphasized. Bias and variance are both sources of error, but the negative emotional appeal of “bias” has made people look for estimators that minimize it instead of minimizing overall error (see Jaynes, Probability Theory: The Logic of Science, Chapter 17).
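To make the bias/variance point concrete, here is a minimal sketch using a standard textbook case, not anything taken from the sources above. It assumes n independent draws from a normal distribution with true variance 1 and compares estimators of the variance that divide the sum of squared deviations by k. The function name `bias_variance_mse` is made up for this illustration. It relies on the known facts that the sum of squared deviations has mean (n - 1) and variance 2(n - 1) in this setting.

```python
from fractions import Fraction

def bias_variance_mse(n, k):
    """Exact bias, variance and mean squared error of the estimator S / k,
    where S is the sum of squared deviations from the sample mean of n
    normal observations with true variance 1 (an assumed toy setting)."""
    dof = n - 1                          # E[S] = dof, Var[S] = 2 * dof
    bias = Fraction(dof, k) - 1          # E[S / k] minus the true variance
    variance = Fraction(2 * dof, k * k)  # Var[S / k]
    mse = variance + bias ** 2           # overall error = variance + bias^2
    return bias, variance, mse

n = 10
for k in (n - 1, n, n + 1):
    bias, variance, mse = bias_variance_mse(n, k)
    print(f"divide by {k}: bias={float(bias):+.4f} "
          f"variance={float(variance):.4f} mse={float(mse):.4f}")
```

Dividing by n - 1 gives the unbiased estimator, yet in this setting both biased alternatives have a smaller overall error. Insisting on zero bias costs accuracy here.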

Basically, there is a big difference between these concepts:

a) Tendency of an estimator to systematically over- or under-estimate a variable

b) Absolute error of a machine learning system

c) Human error in simple mathematical problems

d) Human sensitivity to framing effects and other cognitive biases (cognitive features?)

e) Discrepancy between revealed and stated preferences, or signaling vs acting

f) Divergence from a perfect model of a homo economicus that may or may not have bearing on whether a particular government scheme is better than a market scheme

g) Generally imagined error attributing diverging group outcomes to foul play (see James Damore)

h) Correct understanding of differences between groups, such as IQ that is politically incorrect to talk about (Charles Murray)

i) Real error of failing to model other minds correctly

j) General favoritism towards one’s ingroup

k) General favoritism towards one’s ingroup’s ideas at the expense of truth

l) Real or imagined error of humans acting against their perceived interest (how dare poor x race voters vote for Trump?)

m) Real act of people opposing Power (or merely existing) and Power using the perceived wrong of bias to subjugate people to humiliating rituals. (Unconscious bias retraining in the workplace)

Even though there is only a single word that could be used to describe these, they are different mathematical constructs and have to be treated differently. The danger of the last one pretending to be any of the previous ones means that any notion of “people being biased” might be wrong, since it might be in someone’s interest to make other people seem less rational than they are.

Paul Christiano has mentioned that modelling “mistakes” has high complexity:

Ø “I don’t think that writing down a model of human imperfections, which describes how humans depart from the rational pursuit of fixed goals, is likely to be any easier than writing down a complete model of human behavior.”

I think on the surface this statement is true, but it also rests on a fundamentally mistaken assumption about what “mistakes” even are or what goal they might serve. People’s mistakes are still always a product of some process, a process that “thinks” it is right. Whether there is a contradictory set of assumptions about the world, two subagents within a person fighting each other, or two groups of humans vying to influence the person, the notion of “mistake” is still somewhat relative to which one is winning in the moment. The Who-Whom of reason, so to speak. True reconciliation of sub-agent power struggles is a lot more complex than calling one of them a “mistake.” “Rational pursuit of fixed goals” is also an unlikely model that you would want to force on a human, from both the inside (conflicting desires) and outside (belonging to a community) views of humanity.

Where does that leave us? The emphasis on object-level bias in discussions about AI has easily led to a meta-level bias, where the AI researchers start out by assuming that other people are more wrong about their life or the world than they are. This is extremely dangerous, as it provides Yet Another Excuse to ignore common sense or to build false assumptions about reasoning or causality into an AI.

For example, the following question is dangerous to ask: where does supposed “bias” come from? Why did evolution allow it to exist? Is it helpful genetically? While it’s easy to blame evolutionary adaptation, lack of brain power or adaptation execution, it’s a little harder to admit that somebody else’s error is not a pattern matcher misfiring, but a pattern matcher working correctly.

This brings me to another potential error, which is:

2. Contradictory treatment of human trustworthiness

This is clearly seen in the change of heart that Eliezer had around corrigibility. In “AI as a positive and negative factor in global risk,” the plan was to not allow programmers to interfere in the action of the AI once it got started. This clearly changed in the following years, as the designs began to include a large corrigibility button that, based on the judgement of the programmers, would turn off the AI after it started.

A complex issue arises here: what exact information does the programmer need to consume to perform this task effectively, and how should disagreements among the programmers about whether to turn off the AI be handled?

But, even more generally, this brings out two conflicting points of view on how much we would trust people to interfere with the operation of an AI after it has been created. The answers range from completely to not at all.

The confusion is somewhat understandable, and we have analogies today that push us one way or another. We certainly trust a calculator to compute a large arithmetic task more than we trust a calculator programmer to compute the same task. We still don’t trust a virtual assistant made to emulate someone’s email responses more than the person it is emulating.

Paul Christiano’s capability amplification proposal has the same contradiction.

Ø “We don’t want such systems to simply imitate human behavior — we want them to improve upon human abilities. And we don’t want them to only take actions that look good to humans — we want them to improve upon human judgment.”

What is the positive vision here? Is it an AI making complex judgement calls and sometimes being overruled by its controllers? Or is it an AI ruling the world in a completely unquestioned manner?

On one hand, I understand the concern: if the AI only takes actions that “look good” to humans, then it might be as good as a slimy politician. However, if the AI takes a lot of actions that don’t look good to humans, what can the humans do to counteract it? If the answer is “nothing,” this conflicts with corrigibility, or with the ability to improve on AI design by observing negative object-level actions. The ability to “improve” on human judgement implies action without human oversight and a lack of trust in humans. At the same time, capability amplification generally starts out with an assumption that humans are aligned.

In other words, these two principles are nearly impossible to satisfy together:

a) AI can do actions that don’t look good to its controllers, aka AI is able to improve on human judgement

b) AI is corrigible and can be shut down and redesigned when it does actions that don’t look good to its controllers.

This problem is in some ways present even now, as particular proposals, such as IRL, could be rejected based on claims that they emulate humans too well and humans can’t be trusted.

For example, earlier in this post, there is also this tidbit.

Ø If you were to find any principled answer to “what is the human brain optimizing?” the single most likely bet is probably something like “reproductive success.” But this isn’t the answer we are looking for.

Why not? What is the philosophical objection to considering the plausible outcome of many of the algorithms under consideration, such as IRL or even a capability-amplified human (who would then presumably be better at “reproductive success”)? Is it that “reproductive success” is a low status goal? Is it that we are supposed to pretend to care about other things while still optimizing for it? Are we forbidden to see the fundamental drives due to their ability to cause more mimetic conflict? Is “reproductive success” unsupported as a goal by the current Power structures? None of these seem compelling enough reasons to reject it as a potential outcome of algorithms fundamentally understanding humans, or to reject algorithms that have a chance of working just because some people are uncomfortable with the evolutionary nature of humans.

The point isn’t to argue for maximization of “reproductive success”. Human values are a bit more complex than that. The point is that it’s philosophically dangerous to think “we are going to reject algorithms that work on understanding humans because they work and accepting realities of evolution is seen as low status.”

3. Speaking of human “values”

If there is one abstraction that is incredibly leaky and likely causing a ton of confusion, it is that of “human values.” The first question: what mathematical type of thing are they? If you consider AI utility functions, then a “human value” would be some sort of numeric evaluation of world states or world-state histories. Something like “hedonic utilitarianism” approaches this category. However, this is not what is colloquially meant by “values”, a category with a hugely broad range.

We then have four separate things that are meant by “values” in today’s culture:

a) Colloquial / political usage “family values” / “western values”

b) Corporate usage “move fast and break things”, “be right”

c) Psychological usage – moral foundations theory (care, authority, liberty, fairness, ingroup loyalty, purity)

d) AI / philosophical usage example - “hedonic utilitarianism” – generally a thing that *can* be maximized

There are other subdivisions here as well. Even within “corporate values”, there is lack of agreement on whether they are supposed to be falsifiable or not.

As for the moral foundations, they are not necessarily a full capture of human desires, or even necessarily a great starting point from which to try to derive algorithms. However, given that they do capture *some* archetypal concerns of many people, it’s worth checking how any AI designs take them into account. It seems that they are somewhat ignored because they don’t feel like the necessary type signature. Loyalty and Authority may seem like strongly held heuristics that enable coordination between humans towards whatever other goals humans might want. So, for example, not everybody in the tribe needs to be a utilitarian if the leader of the tribe is a utilitarian and the subjects are loyal and follow his authority. This means that acceptance of the leader as the leader is a more important fact about the person’s moral outlook than their day-to-day activity without anyone else’s involvement. Loyalty to one’s ingroup is *of course* completely ignored in narrow AI designs. The vision that is almost universally paraded is that of “bias being bad”: bias needs to be factored out of human judgement by any means necessary.

There is, of course, a legitimate conflict between fairness to all and loyalty to the ingroup, and it is almost completely resolved in favor of real or fake fairness. The meta-decision by which this is resolved is political fiat, rather than philosophical debate or mathematical reconciliation based on higher meta-principles. You might read my previous statements about moral foundations and wonder: how do you maximize the purity foundation? The point isn’t to maximize every facet of human psychology; the point is to hope that one’s design does not encode fundamentally false assumptions about how humans operate.

There is a broader and more complex point here. “Values” as expressed in simple words such as loyalty and liberty are simultaneously expressions of “what we truly want” and tools of political and group conflict. I don’t want to go full post-modern deconstruction here and claim that language has no truth value, only tribal signaling value. However, the unity of groups around ideological points obviously happens today. This creates a whole new set of problems for AIs trying to learn and aggregate human values among groups. The common assumption seems to be that learning the values (world-history evaluations) of a single human is the hard part, and that once you have your wonderful single-person utility function, simply averaging human preferences works. The realization that values are necessarily both reflections of universal desires AND adversarial group relations could create a lot of weird cancelling out once the “averaging” starts. There are other problems. One is intensifying existing group conflict, including conflict between countries over the control of the AI. Another is misunderstanding how the cultural norms of “good” and “bad” evolve. Group and institutional conflict forces some values to be more preferred than others, while others are suppressed. Allowing for “moral progress” in AI is already a strange notion, since in some people’s minds it stands for suppressing the outgroup’s humans as humans more and more.
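A toy model may help show what “weird cancelling out” could look like. Everything in the sketch below is an assumption made up for illustration: two hypothetical groups A and B, two hypothetical policies, and stated scores that are the sum of a shared component and a signalling component whose sign flips between the groups. It is not a claim about how any real value-learning proposal aggregates preferences.

```python
from fractions import Fraction

# Hypothetical numbers: what everyone "truly" gets from each policy...
SHARED = {"policy_x": Fraction(1), "policy_y": Fraction(2)}
# ...and the tribal signalling value as seen by group A (group B sees the negative).
SIGNAL = {"policy_x": Fraction(3), "policy_y": Fraction(-3)}

def stated_score(policy, group):
    """Score a member of `group` would report: shared part plus signalling part."""
    sign = 1 if group == "A" else -1
    return SHARED[policy] + sign * SIGNAL[policy]

def average_score(policy, n_a, n_b):
    """Plain average of stated scores over n_a members of A and n_b members of B."""
    total = n_a * stated_score(policy, "A") + n_b * stated_score(policy, "B")
    return total / (n_a + n_b)

for n_a, n_b in ((50, 50), (60, 40)):
    scores = {p: average_score(p, n_a, n_b) for p in SHARED}
    best = max(scores, key=scores.get)
    print(n_a, n_b, {p: float(s) for p, s in scores.items()}, "->", best)
```

With equal group sizes the signalling terms cancel and the average recovers the shared component. With a 60/40 split the majority’s signalling term leaks through and the averaged ranking of the two policies reverses, even though nobody’s shared component changed.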

This could be avoided if you focus more on the intersection of values, or human universals, instead of trying to hard-code liberal assumptions. Intersection has a potentially lower chance of causing earlier fights over the control of the AI compared to the union. That solution has problems of its own. Human “universals” are still subject to the time frame you are considering. Intersections could be too small to be relevant, some cultures are, in fact, better than others, etc. There could be genetic components that alter the ideal structure of society, which means that “value clusters” might be workable based on the various genetic make-ups of the people. Even without that, there might be some small “wiggle-room” in values, even though there are definitive constraints on what culture any group of humans could plausibly adopt and still maintain a functional civilization.

There is not a simple solution, but it always needs to start with a more thorough causal model of how power dynamics affect morality and vice versa.

At the end of the day, the hard work of value philosophy reconciliation might still need to be done by hyper-competent human philosophers.

It’s a bit surreal to read the AI papers with these problems in mind. Let’s learn human values, except we consider humans biased, so we can’t learn loyalty; we over-emphasize individualism, so we can’t really learn “authority.” We don’t understand how to reconcile the desire to not be told what to do with our technocratic designs, so we can’t learn “liberty”. “Reproductive success” is dismissed without explanation. I wonder if they are disgusted by the purity foundation as well.

The last confusion that is implied by the word “values” is that humans are best approximated by rational utility maximizers, where there is some hidden utility function “inside them.” The problem isn’t that a “rational values maximizer” is that terrible of a model for some EAs who are really dedicated to doing EA. The problem is that it misses most people, who are neither ideologically driven nor that time consistent, who expect themselves to evolve significantly, or who depend on others for their moral compass. It also fails to capture people who are more “meta-rational” or “post-ideological” and who are less keen to pursue certain values due to their adversarial nature. In other words, people at various stages of emotional and ethical maturity might need a different mathematical treatment of what they consider valuable.

These are not merely theoretical problems. It seems that the actual tech industry, including OpenAI, has adopted a similar “values” language. In an appropriately Orwellian fashion, the AI represents the “values” of the developers, which thus requires “diversity.” The AI representing the values of people with power over the developers is probably closer to the truth, of course. The same “diversity”, which is supposed in theory to represent diversity of ideas, is then taken to mean absence of whites and Asians. There are many problems here, but the confusion around the nature of “values” is certainly not helping.

4. Overemphasized individuality and misunderstanding the nature of status

A key confusion that I alluded to in the discussion of authority is the problem of learning about human preferences from an individual human isolated from society. This assumption is present nearly everywhere, but it gets even worse with things like “capability amplification” where “a human thinking for an hour” is taken to represent a self-contained unit of intelligence.

The issue is that humans are social animals. In fact, the phrase “social animals” vastly under-estimates the social nature of cognition and evaluation of “goodness.” First, people want things *because* other people want them. Second, people clearly want status, which is a place in the mind of the Other. Last, but not least, people follow others with power, which means that they are implementing an ethic or an aesthetic which is not necessarily their own in isolation. This doesn’t only mean following others with high amounts of power. It’s just that the expectations that people we care about have of us shape a large amount of our behavior. The last point is not a bad thing most of the time, since many people go crazy in isolation, and thus the question of “what are their values in isolation?” is not a great starting point. This feeds into the question of bias, as people might be an irrational part of a more rational super-system.

What does this have to do with AI designs? Any design that at its heart wishes to benefit humanity must be able to distinguish between:

a) I want the bag (make more bags)

b) I want the feeling of being-bag-owner (make more drugs that simulate that feeling)

c) I want the admiration of others that comes with having the bag, whether or not I know about it (make admiration maximizers)

d) I want to appease powerful interests that wish for me to own the bag (maximize positive feelings for those with power over me)

e) Other possibilities, including what “you” really want

You can imagine that there can be some amount of reconciling some of those into a coherent vision, but this is prone to creating a wrong-thing maximizer.

Failure modes of maximizing these are very reminiscent of general problems of modernity. Too much stuff, too many drugs, too much social media, etc.

At the end of the day, human behavior, including thinking, does not make that much sense outside the social context. This problem is helped but isn’t simply solved by observing humans in the natural environment, instead of clean-rooms.

5. Overreliance on symbolic intelligence, especially in the moral realm

This problem is exactly analogous to the problem referenced in the introduction. Establishing “is-a” relationships or classifying concepts into other conceptual buckets is part of what the human brain does. However, it’s not a complete model of intelligence. While the assumption that a system emulating this would lead to a complete model of intelligence is no longer present, parts of the misunderstanding remain.

AI alignment through debate has this issue. From OpenAI:

Ø To achieve this, we reframe the learning problem as a game played between two agents, where the agents have an argument with each other and the human judges the exchange. Even if the agents have a more advanced understanding of the problem than the human, the human may be able to judge which agent has the better argument (similar to expert witnesses arguing to convince a jury).

Ø Our hope is that, properly trained, such agents can produce value-aligned behavior far beyond the capabilities of the human judge

I understand the rationale here, of course. The human needs to have information about the internal functioning of an AI presented in a format that the human understands. This is a desirable property and is worth researching. However, the problem comes not in verifying that the argument the AI has is correct. The problem comes in making sure the AI’s words and actions correspond to each other. Imagine the old school expert system that has a “happiness is-a state-of mind” node hooked up to constructing arguments. It could potentially create statements that the human might agree with. The problem is that the statement does not correspond to anything, let alone any potential real-world actions.

How does one check that the AI’s words and its actions are connected to reality in a consistent manner? You would have to have a human look at which signals the AI is sending, consider the consequences of its actions, and then judge whether the correspondence is good enough or not. However, there is no reason to believe that this is much easier than programmers trying to probe AI internals through other means, such as data visualization or complex real-world tests.

We have this problem with politicians and media today.

This also has an extra level of difficulty due to definitions of words and their connection to reality being political variables today.

For example, what would happen if instead of cats and dogs, OpenAI made a game classifying pictures of men vs women?

Even that would probably raise eyebrows, but there can be other examples. What about ideas about what is an “appropriate” prom dress? What about whether an idea needs to be protected by free speech? There are many examples where the “truth” of a statement has a political character to it. It’s becoming harder and harder to find things that don’t. Once again, I don’t want to go full deconstruction here and claim that “truth” is only socially determined. There are clearly right and wrong statements, or statements more or less entangled with reality. However, the key point is that most of the cognitive work happens outside of the manipulation of verbal arguments. Instead, it comes in both picking the right concepts and verifying that they don’t ring hollow.

There is also a general philosophical tendency to replace questions of what real meta-ethics would be with questions about “statements about ethics.” The meta-ethical divide between moral realism and other meta-ethical theories more frequently concerns statements than plausible descriptions of how people’s thinking about ethics evolves over time, historical causes, or even questions like “would programmers from different cultures agree on whether a particular algorithm exhibits ‘suffering’ or not?” Wei Dai made a similar point well here.

The general point is that there is a strong philosophical tendency to tie too much of intelligence and ethics to the ability to make certain verbal statements, or to worship memes instead of real-world people. Misunderstanding of this problem is especially likely to occur since it is extremely similar to the types of problems misunderstood in the past.

6. Punting the problems elsewhere

Punting the problem is a form of “giving up” without explicitly saying that you are giving up. There are three general patterns here:

a) Giving the problem to someone else (let’s find other people and convince them AI safety is important / what if the government regulated AI / it’s important to have a conversation)

b) Biting the bullet on negative outcomes (people deserve to die / unaligned AI might not be that bad)

c) Punting too many things to the AI itself (create a Seed AI that solves philosophy)

Of all the issues here, this one is, in some ways, the most forgivable. If a regular person is given the issue of AI safety, it is more correct for them to realize that it is too difficult and to punt on it, rather than try and “solve it” without background and get the wrong impression of its simplicity. So, punting on the problem is a wrong epistemic move, but the right strategic move.

That said, the meta-strategy of convincing others that AI safety is important is plausible, but it still needs to terminate in the production of actual philosophy / mathematics at some point. Pursuing the meta-strategy thus cannot come at the expense of it.

Similarly, the government could, in theory, be helpful in regulating AI, however, in practice, it’s more important to work out what that policy would be before asking for blanket regulations, which would have difficulty straddling between extremes of over-regulation and doing practically nothing.

So, this is forgivable for some people, but it’s less forgivable for the key players. If a CEO says that “we need to have a discussion about the issue of AI” instead of having that discussion, it’s a form of procrastinating on the problem by going meta.

Biting the bullet on negative outcomes is not something I see very often in papers, but occasionally you get strange responses from tech leaders, such as “Using Facebook passively is associated with mental health risks” without further follow-up on whether the internal optimization is, in fact, driving people to use the site more passively or not.

Punting too many things to the AI itself is a problem basically in most approaches. If you have two tasks, such as “solve philosophy” and “create an algorithm that solves philosophy,” the second task is not necessarily easier than the first. To solve it, you still need ways to verify that it has done so correctly, which requires at least being able to recognize good answers in this domain.

Saying “algorithm X is able to be generally aligned,” where X is any algorithm, requires you to have more certainty in the algorithm’s ability to produce correct answers than in your own judgement of some philosophical puzzle. Of course, this is possible with arithmetic or playing Go, so people reason by analogy that AI could reason better in other domains as well. However, the question is what happens when, at some point, it produces an answer one doesn’t like. How does one acquire enough certainty about the previous correctness of the algorithm to be convinced?

In conclusion, the philosophical problems of AI are serious, and it is unlikely that they will be easily solved “by default” or with a partially aligned AI. For example, it’s hard to imagine iterating on an expert system that produces “happiness is-a state of mind” statements until it arrives at neural-network-like reasoning without dropping a fundamental assumption about how much intelligence happens in verbal statements. Similarly, iterating on a design that assumes too much about how much ethics is related to belief or pronouncements is unlikely to produce real algorithms that compute the thing we want to compute.

To be able to correctly infer true human wants, we need to be able to correctly model human behavior, which includes understanding many potentially unsavory theories of dominance hierarchies, mimetic desire, scapegoating, theories of the sacred, enlightenment, genetic variability, and the complexities of adult ethical development, as well as how “things that are forbidden to see” arise.

The good news is that there is a lot of existing philosophy that can be used to help at least check one’s designs against. Girard, Heidegger or Wittgenstein could be good people to review. The bad news is that in the modern world the liberal doctrines of individualism and fears of inequity have crowded out all other ethical concerns and the current culture war has stunted the development of complex ideas.

However, it’s kind of pointless to be afraid of speaking up against the philosophical problems of AI. We are probably dead if we don’t.


Good post. Some nitpicks:

There are many models of rationality from which a hypothetical human can diverge, such as VNM rationality of decision making, Bayesian updating of beliefs, certain decision theories or utilitarian branches of ethics. The fact that many of them exist should already be a red flag on any individual model’s claim to “one true theory of rationality.”

VNM rationality, Bayesian updating, decision theories, and utilitarian branches of ethics all cover different areas. They aren't incompatible and actually fit rather neatly into each other.

As a Jacobin piece has pointed out

This is a Jacobite piece.

A critique of Pro Publica is not meant to be an endorsement of Bayesian justice system, which is still a bad idea due to failing to punish bad actions instead of things correlated with bad actions.

Unless you're omniscient, you can only punish things correlated with bad actions.