"I want to get into AI alignment," said Xenophon.

"Why?" said Socrates.

"Because an AGI is going to destroy humanity if we don't stop it," said Xenophon.

"What's an AGI?" said Socrates.

"An artificial general intelligence," said Xenophon.

"I understand the words 'artificial' and 'general'. But what is an 'intelligence'?" said Socrates.

"An intelligence is a kind of optimizer," said Xenophon.

"Like my teacup," said Socrates. He took a sip of iced tea.

"What?" said Xenophon.

"My teacup. It keeps my tea cold. My teacup optimizes the world according to my wishes," said Socrates.

"That's not what I mean at all. An optimizer cannot do just one thing. There must be an element of choice. The optimizer must take different actions under different circumstances," said Xenophon.

"Like my teacup," said Socrates.

"But—"

"Now it is summer. But in the winter my teacup keeps my tea hot. Do you see? My teacup does have a choice. Somehow it knows to keep hot things hot and cold things cold," said Socrates.

"A teacup isn't intelligent," said Xenophon.

"Why not?" said Socrates. He savored another sip.

"Because the teacup is totally passive. An intelligence must act on its environment," said Xenophon.

"Far away, in the Levant, there are yogis who sit on lotus thrones. They do nothing, for which they are revered as gods," said Socrates.

"An intelligence doesn't have to act. It just needs the choice," said Xenophon.

"Then it is impossible to tell if something is intelligent solely based on its actions," said Socrates.

"I don't follow," said Xenophon.

Socrates picked up a rock. "This intelligent rock chooses to never do anything," said Socrates.

"That's ridiculous," said Xenophon.

"I agree," said Socrates, "Hence why I am so confused by the word 'intelligent'."

"No intelligence would choose to do nothing. If you put it in a box surely any intelligent being would attempt to escape," said Xenophon.

"Yogis willingly box themselves in boxes called 'monasteries'," said Socrates.

"I see what you're getting at. This is a case of the belief-value uncertainty principle. It's impossible to tell from their actions whether yogis are good at doing nothing (their value function is to do nothing) or bad at doing things (their value function is to act and they are just very stupid)," said Xenophon.

Socrates nodded.

"Since it is impossible to deduce whether something is intelligent based solely on its external behavior, intelligence cannot be an external property of an object. Intelligence must be an internal characteristic," said Xenophon.

Socrates waited.

"An intelligence is something that optimizes its external environment according to an internal model of the world," said Xenophon.

"My teacup's model of the world is the teacup's own temperature," said Socrates.

"But there's no value function," said Xenophon.

"Sure there is. 'Absolute distance from room temperature' is the function my teacup optimizes," said Socrates.

"Your teacup is too passive," said Xenophon.

"'Passivity' was not part of your definition. But never mind that. Suppose I built a machine with artificial skin that felt the temperature of the cup and added ice to cold cups and lit a fire under hot cups. Surely such a machine would be intelligent," said Socrates.

"Not at all! You just programmed the machine to do what you want. It's all hard-coded," said Xenophon.

"So whether something is intelligent doesn't depend on what's inside of it. Intelligence has to do with whether something was designed. If the gods carefully designed human beings to do their bidding then we human beings would not be intelligent," said Socrates.

"That's not what I meant at all!" said Xenophon.

"Then what did you mean?" said Socrates.

"Let's start over, tabooing both the words 'intelligent' and 'optimizer'. A 'Bayesian agent' is something that creates a probability distribution over world models based on its sensory inputs and then takes an action according to a value function," said Xenophon.

"That doesn't pass the teacup test. Under that definition my teacup qualifies as a 'Bayesian agent,'" said Socrates.

"Oh, right," said Xenophon, "How about 'Systems that would adapt their policy if their actions would influence the world in a different way'?"

"Teacup test," said Socrates.

"So are you saying the entire field of AI Alignment is bunk because intelligence isn't a meaningful concept?" said Xenophon.

"Maybe. Bye!" said Socrates.

"No! Wait! That was a joke!" said Xenophon.

28 comments

I think there are two main problems being pointed at here. First is that it seems reasonable to say that for the most part under most definitions of intelligence, intelligence is largely continuous (though some would argue for the existence of a few specific discontinuities)---thus, it seems unreasonable to ask "is X intelligent". A teacup may be slightly more intelligent than a rock, and far less intelligent than GPT-3. 

Second is the fact that the thing we actually care about is neither "intelligence" nor "Bayesian agent"; just because you can't name something very precisely yet doesn't mean that thing doesn't exist or isn't worth thinking about. The thing we care about is that someone might make a thing in 10 years that literally kills everyone, and we have some models of how we might expect that thing to be built. In analogy, perhaps we have a big philosophical argument over what counts as a "chair"---some would argue bitterly whether stools count as chairs, or whether tiny microscopic chair-shaped things count as chairs, or whether rocks count as chairs because you can sit on them, some people arguing that there is in fact no such thing as a physical chair, because concepts like that exist only in the map and the territory is made of atoms, etc. But if you have the problem that you expect chairs to break when you sit on them if they aren't structurally sound, then most of these arguments are a huge distraction. Or more pithily:

"nooo that's not really intelligent" I continue to insist as I shrink and transform into a paperclip

The other part is that humans can pursue pretty arbitrary instrumental goals, whereas if you tell the teacup it has to win a chess match or die it will die.

No, they can't. See: "akrasia" on the path to protecting their hypothetical predicted future selves 30 years from now.

The teacup takes the W here too. It's indifferent to blackmail! [chad picture]

"Pretty arbitrary" of course not meaning "absolutely arbitrary", just meaning more arbitrary than most things, such as teacups. And when I said "tell" I meant giving an ultimatum and then following through.

Fair.

Something something blackmailer is subjunctively dependent with the teacup! (This is a joke.)

When a Greek philosopher starts asking you what a thing actually is, sooner or later you might find yourself saying "I know it when I see it". (That plucked chicken is not a man).

I know it when I see it

So, the neural networks that recognize things without anyone being able to explain how they actually do it, are actually doing it the right way.

"The definition of intelligence is the following: take the value of the upper left pixel in my retina, and multiply it by 0.02378. Take the value from the next pixel and multiply it by 0.02376. Take the value...

(20 years later)

...and if you add this together, and the result is greater than 0.86745, then yes, I would call such a system 'intelligent'. Any objections?"

Socrates: "Please someone pass me the hemlock already."

Intelligent systems are systems which design and deploy nanomachines that make life on earth inhospitable for humans.

Oh, so human diseases in the form of bacteria/viruses! And humans working on gain-of-function research.

I actually have no problem calling either of those systems intelligent, as long as the bacteria/viruses are evolving on a short enough timescale. I guess you could call them slow.

I think the candle and ice mechanism technically contains an iota of intelligence.  Not much. Less than a fly. But some. 

For what it's worth, I think we solved the core issue in this post some time ago as part of the project of cybernetics by carefully defining the sort of systems that we might call intelligent.

I'm going to bite the bullet and say that an "intelligence" and an "optimizer" are fundamentally the same thing; or rather that these words point to the same underlying concept we don't quite have a non-misleading word for.

An optimizer is a system that pulls probability-mass away from some world-states and toward some world-states; anything that affects reality is an "optimizer". A tea cup is an optimizer.

The purpose of an optimizer is what it does; what it is optimizing for. A teacup's purpose is to contain objects inside, when the teacup is positioned the right way up in a gravity field. A teacup's purpose is to transfer heat. A teacup's purpose is to break when dropped from a height. A teacup's purpose is the set of all it does. A teacup's purpose, in full generality, is: "be a teacup".

A teacup is aligned to the purpose of being a teacup.
A system that is aligned is one that is in the correct state to fulfill its own purpose.
And a teacup is, obviously, in the correct physical state for being a teacup. Tautology.
All systems are perfectly aligned to their own purpose.

But a teacup is imperfectly aligned for the purpose of "being used as a teacup by a human". If it is dropped, it may break. If it is tipped, it may spill. All of these things are aligned to the purpose of "be the physical object: the teacup" but imperfectly to "be a useful teacup for drinking tea and other liquids".

What is an optimizer?

An optimizer is an engine that converts alignment into purpose.

Alignment: "be a teacup" -> purpose: "behave like a teacup". This part is tautological.

Alignment: "be a useful teacup for humans" -> purpose: "be used in beneficial ways by humans". This part is not tautological. 

A teacup may be good or bad at that. A teacup may harm humans, though: it may spill tea. It may break into sharp shards of ceramic. So a teacup may cause both good and bad outcomes.

A Friendly teacup, a human-aligned teacup, is one that optimizes for its purpose of making good outcomes more likely and bad outcomes less likely.

A Friendly teacup is harder to spill or to accidentally drop. A Friendly teacup is not so heavy that it would injure a human if it falls on their foot. A Friendly teacup is one that is less likely to harm a human if it breaks.

But how does a teacup optimize for good outcomes? By being a teacup. By continuing to be a teacup.

Once a physical object has been aligned into the state of being a teacup, it continues to be a teacup. Because a teacup is a physical system that optimizes for retaining its shape (unless it is broken or damaged).

A Friendly teacup, once aligned into being a Friendly teacup, serves its purpose by continuing to be a Friendly teacup. A Friendly teacup optimizes humans, it optimizes some particular set of outcomes for humans by continuing to be a Friendly teacup.

How does a Friendly teacup optimize you? Because its existence, its state of being and continuing to be a teacup, leads you to make different choices than if that were not the case; you might enjoy a refreshing cup of tea!

You are being optimized. By a teacup. So that it may fulfill its assigned purpose. This is a perfectly valid way to see things.

The teacup has been made (aligned) in a way that makes it a good teacup (purpose): in this example the optimization-pressure is the process that created the teacup.

 

So this is my answer: your example is valid. Both teacup alignment and AI alignment are fields that use some of the same underlying concepts, if you understand these terms to full generality.

But for teacups these things are obvious, so we don't need fancy terminology for them, and it is confusing to try to use the terminology this way.

But it is valid.

I don't disagree with any of this.

And yet, some people seem to be generalizedly "better at things" than others. And I am more afraid of a broken human person (he might shoot me) than a broken teacup.

It is certainly possible that "intelligence" is a purely intrinsic property of my own mind, a way to measure "how much do I need to use the intentional stance to model another being, rather than model-based reductionism?" But this is still a fact about reality, since my mind exists in reality. And in that case "AI alignment" would still need to be a necessary field, because there are objects that have a larger minimal-complexity-to-express than the size of my mind, and I would want knowledge that allows me to approximate their behavior.

But I can't robustly define words like "intelligence" in a way that beats the teacup test. So overall I am unwilling to say "the entire field of AI Alignment is bunk because intelligence isn't a meaningful concept". I just feel very confused.

Reminds me of a discussion I've had recently about whether humans solve complex systems of [mechanical] differential equations while moving. The counter-argument was "do you think that a mercury thermometer solves differential equations [while 'calculating' the temperature]?"

This is glorious in so many ways. Thank you.

In what ways?

All the correct ways and none of the incorrect ways, of course! I see the ambivalence and range of plausible interpretations. Can't I just appreciate a good post for the value I found in it without being fished out for suspected misunderstandings? :p

I especially liked how this is the cutest version of Socrates I've encountered in any literature.

I was just curious and wanted to give you the occasion to expand your viewpoint. I didn't downvote your comment btw.

Aye, I didn't jump to the conclusion that you were aggressive. I wanted to make my comment communicate that message anyway, and that your comment could be interpreted like that gave me an excuse.

The paper On the Measure of Intelligence by François Chollet does a pretty good job of defining intelligence, including accounting for the knowledge that the creator of a system has hardcoded into it.

"What is intelligence?" is a question you can spend an entire productive academic career failing to answer. Intentionally ignoring the nerd bait, I do think this post highlights how important it is for AGI worriers to better articulate which specific qualities of "intelligent" agents are the most worrisome and why. 

For example, there has been a lot of handwringing over the scaling properties of language models, especially in the GPT family. But as Gary Marcus continues to point out in his inimitable and slightly controversial way, scaling these models fails to fix some extremely simple logical mistakes: mistakes that might need to be fixed by a non-scaling innovation before an intelligent agent poses an x-risk. On forums like these it has long been popular to say something along the lines of "holy shit look how much better these models got when you add __ amount of compute! If we extrapolate that out we are so boned." But this line of thinking seems to miss the "intelligence" part of AGI completely. It seemingly has no sense at all of the nature of the gap between the models that exist today and the spooky models they worry about.

It seems to me that we need a better specification for describing what exactly intelligent agents can do and how they get there.

"Systems that would adapt their policy if their actions would influence the world in a different way"

Does the teacup pass this test? It doesn't necessarily seem like it.

We might want to model the system as "Heat bath of Air -> teacup -> Socrates' tea". The teacup "listens to" the temperature of the air on its outside, and according to some equation transmits some heat to the inside. In turn the tea listens to this transmitted heat and determines its temperature.

You can consider the counterfactual world where the air is cold instead of hot. Or the counterfactual world where you replace "Socrates' tea" with "Meletus' tea", or with a frog that will jump out of the cup, or whatever. But in all cases the teacup does not actually change its "policy", which is just to transmit heat to the inside of the cup according to the laws of physics.

To put it in the terminology of "Discovering Agents", one can add mechanism variables  going into the object level variables. But there are no arrows between these, so there's no agent.

Of course, my model here is bad and wrong physically speaking, even if it does capture crude cause-effect intuition about the effect of air temperature on beverages. However I'd be somewhat surprised if a more physically correct model would introduce an agent to the system where there is none.
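A toy sketch of the test described above, in Python. Everything in it is an illustrative assumption rather than anything from "Discovering Agents": the names `normal_world`, `flipped_world`, `teacup_policy`, `controller_policy`, and `adapts_policy` are hypothetical, and the "mechanism" is reduced to a single function saying how an action changes the tea temperature. The idea is to intervene on that mechanism and see whether the chosen action changes.

```python
from typing import Callable

# Hypothetical mechanism: (tea_temp, action) -> new tea_temp.
Mechanism = Callable[[float, float], float]


def normal_world(tea_temp: float, action: float) -> float:
    # Positive action warms the tea, negative action cools it.
    return tea_temp + action


def flipped_world(tea_temp: float, action: float) -> float:
    # Counterfactual world: the same action has the opposite effect.
    return tea_temp - action


def teacup_policy(mechanism: Mechanism, air_temp: float, tea_temp: float) -> float:
    # Passive conduction (assumed coefficient 0.1); the mechanism is ignored.
    return 0.1 * (air_temp - tea_temp)


def controller_policy(mechanism: Mechanism, air_temp: float, tea_temp: float,
                      target: float = 5.0) -> float:
    # Picks whichever candidate action the mechanism says ends closest to target.
    candidates = (-1.0, 0.0, 1.0)
    return min(candidates, key=lambda a: abs(mechanism(tea_temp, a) - target))


def adapts_policy(policy, air_temp: float = 30.0, tea_temp: float = 10.0) -> bool:
    # True if the chosen action differs once the mechanism is intervened on.
    return (policy(normal_world, air_temp, tea_temp)
            != policy(flipped_world, air_temp, tea_temp))


print(adapts_policy(teacup_policy))      # False
print(adapts_policy(controller_policy))  # True
```

Under this particular way of carving up the system, the teacup's output never depends on the mechanism, while something like Socrates' ice-and-fire machine would switch actions. The result is only as good as the choice of variables, which is itself a modeling decision.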

But in all cases the teacup does not actually change its “policy”, which is just to transmit heat to the inside of the cup according to the laws of physics.

This kind of description depends completely on how you characterize things. If the policy is "transmit heat according to physics" the policy doesn't change. If the policy is "get hotter" this policy changes to "get colder". It's the same thing, described differently.

I've been thinking about Eliezer's take on the Second Law of Thermodynamics, and while I can't think of a succinct comment to drop with it, I think it could bring value to this discussion.

What about complexity as a necessary condition for intelligence? The teacup does not possess this, but arguably the yogi does (at the very least he must feed himself, interact socially to some extent, etc). Intelligent creatures have brains that are fairly large (millions/billions of neurons) and have a high degree of complexity in their internal dynamics, and this corresponds to complexity in their actions.

a rock wants to stay a rock, and how badly it wants it is defined by the equations of the nuclear strong forces. a teacup wants to stay a teacup, same deal as the rock. a yogi appears to want to stay a yogi, though I'm less confident that a yogi wants that quite as badly or consistently as a rock wants to stay a rock; thankfully, most yogis produce significant communication before becoming silent, and there are incremental tests you can do that will reveal that attempting to interfere with their actions will result in them attempting to maintain their state as a yogi.

the teacup test is reasonable, but a good ai safety solution would be teacup invariant.

and anyway, I have an intuition that damaging anything is necessarily going to be against the values of some agent, so even if nobody seems to care, it's worth interrogating a map of other agents' preferences about stuff. maybe they aren't allowed to override yours, depending on the preference permissions, but it still seems like if you want to smash your cup and it's the most beautiful cup that ever existed, maybe I want to offer to trade your beautiful cup that many agents would seek to intervene to protect with something less special. perhaps, for example, a virtual simulation of the behavior a cup would have when breaking, if knowledge of the process is what truly appetizes; maybe you'd even enjoy simulations of much more complicated stuff breaking? or perhaps you might like a different cup. after all, those atoms want to stay together right now using their local agency of being physical chemicals, and my complicated agency-throwing machine called a brain detects there may be many agency-throwers in the world who want to protect the teacup.
