Let’s say you have an idea in mind for how to align an AI with human values.

Go prep a slide with some e-coli, put it under a microscope, and zoom in until you can see four or five cells. Your mission: satisfy the values of those particular e-coli. In particular, walk through whatever method you have in mind for AI alignment. You get to play the role of the AI; with your sophisticated brain, massive computing power, and large-scale resources, hopefully you can satisfy the values of a few simple e-coli cells.

Perhaps you say “this is simple, they just want to maximize reproduction rate.” Ah, but that’s not quite right. That’s optimizing for the goals of the process of evolution, not optimizing for the goals of the godshatter itself. The e-coli has some frozen-in values which have evolved to approximate evolutionary fitness maximization in some environments; your job is to optimize for the frozen-in approximation, even in new environments. After all, we don’t want a strong AI optimizing for the reproductive fitness of humans - we want it optimizing for humans’ own values.

On the other hand, perhaps you say “these cells don’t have any consistent values, they’re just executing a few simple hardcoded algorithms.” Well, you know what else doesn’t have consistent values? Humans. Better be able to deal with that somehow.

Perhaps you say “these cells are too simple, they can’t learn/reflect/etc.” Well, chances are humans will have the same issue once the computational burden gets large enough.

This is the problem of AI alignment: we need to both define and optimize for the values of things with limited computational resources and inconsistent values. To see the problem from the AI’s point of view, look through a microscope.

Perhaps you say “these cells are too simple, they can’t learn/reflect/etc.” Well, chances are humans will have the same issue once the computational burden gets large enough.

I don't think the situations are symmetrical here.

Humans have easy-to-extract preferences over possible "wiser versions of ourselves." That is, you can give me a menu of slightly modified versions of myself, and I can try to figure out which of those best capture my real values (or over what kind of process should be used for picking which of those best capture my real values, or etc.). Those wiser versions of ourselves can in turn have preferences over even wiser/smarter versions of ourselves, and we can hope that the process might go on ad infinitum.

It may be that the process with humans eventually hits a ceiling---we prefer that we become smarter and wiser in some obvious ways, but then eventually we've picked the low hanging fruit and we are at a loss for thinking about how to change without compromising our values. Or it may be that we are wrong about our preferences, and that iterating this deliberative process goes somewhere crazy.

But those are pretty fundamentally different from the situation with E. coli, where we have no way to even get the process started. In particular, the difficulty of running the process with E. coli doesn't give us much information about whether the process with humans would top out or go off the rails, once we know that humans are able to get the process started.
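
To make the iterative structure of that process explicit, here is a minimal toy sketch (everything in it, including propose_wiser_versions and prefers, is a hypothetical stand-in; the sketch says nothing about where those ingredients would actually come from):

```python
# Toy sketch of the deliberation ladder described above. The callables
# propose_wiser_versions(agent) and prefers(judge, candidate, incumbent)
# are hypothetical stand-ins for whatever lets a person rank slightly
# modified versions of themselves.

def deliberation_ladder(agent, propose_wiser_versions, prefers, max_steps=1000):
    """Repeatedly hand control to a successor the current agent prefers.

    Stops when no proposed successor is preferred to the incumbent (the
    "ceiling" case) or after max_steps (the "ad infinitum" case, truncated).
    """
    for _ in range(max_steps):
        candidates = propose_wiser_versions(agent)
        # The current agent itself judges which candidates better capture its values.
        better = [c for c in candidates if prefers(agent, c, agent)]
        if not better:
            return agent  # ceiling: no modification looks like an improvement
        agent = better[0]  # the wiser version gets to judge the next round
    return agent
```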

I agree that an e-coli's lack of reflective capability makes it useless for reasoning directly about iterated amplification or anything like it.

On the other hand, if we lack the tools to think about the values of a simple single-celled organism, then presumably we also lack the tools to think about whether amplification-style processes actually converge to something in line with human values.

Humans have easy-to-extract preferences over possible "wiser versions of ourselves." That is, you can give me a menu of slightly modified versions of myself, and I can try to figure out which of those best capture my real values (or over what kind of process should be used for picking which of those best capture my real values, or etc.). Those wiser versions of ourselves can in turn have preferences over even wiser/smarter versions of ourselves, and we can hope that the process might go on ad infinitum.

This seems like a pretty bold claim to me. We might be tempted to construe our regular decision-making process as doing this (I come up with what wiser-me would do in the next instant, and then do it), but that seems to me to misunderstand how decisions happen: it confuses the abstractions of "decision" and "preference" with the actual process that puts the world into a causally subsequent state, one I might later look back on and reify as my having made a decision. Since I'm suspicious that something like this is going on when the inferential distance is very short, I'm even more suspicious when the inferential distance is longer, as you seem to be proposing.

I'm not sure whether I'm arguing against your claim that the situations are not symmetrical, but I do think this particular reasoning for the asymmetry is likely flawed, because it seems to assume something about humans being fundamentally different from e-coli in a way that is not actually the case.

(There are of course many differences between the two, just not ones that seem relevant to this line of argument.)

First off, great thought experiment! I like it, and it was a nice way to view the problem.

The most obvious answer is: “Wow, we sure don’t know how to help. Let’s design a smarter intelligence that’ll know how to help better.”

At that point I think we’re running the risk of passing the buck forever. (Unless we can prove that process terminates.) So we should probably do at least something. Instead of trying to optimize, I’d focus on doing things that are most obvious. Like helping it not to die. And making sure it has food.

At that point I think we’re running the risk of passing the buck forever. (Unless we can prove that process terminates.)

I am inclined to believe that indeed the buck will get passed forever. This idea you raise is remarkably similar to the Procrastination Paradox (which you can read about at https://intelligence.org/files/ProcrastinationParadox.pdf).

Here's my attempt at solving the puzzle you provide – I believe the following procedure will yield a list of approximate values for the E-Coli bacterium. (It'd take a research team and several years, but in principle it is possible.)

  • Isolate each distinct protein present in E-Coli individually. (The research I found (https://www.pnas.org/content/100/16/9232, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4332353/) puts the number of different proteins in E-Coli at 1-4 thousand, which makes this difficult but not completely infeasible.)
  • For each protein, create a general list of its effects on the biochemical environment within the cell.
  • Collect each effect that is redundantly produced by several distinct proteins simultaneously (say, 10+). This gives us a rough estimate of the bacteria's values, though it is not yet determined which are more instrumental and which are more terminal in nature.
  • Organize the list into a graph that links properties by cause and effect. (Example: Higher NaCl concentration makes the cell more hypertonic relative to the environment.)
  • For each biochemical condition on the chart, evaluate its causes and its effects. Conditions with many identifiable causes but few identifiable effects are more likely to be terminal goals than conditions with few identifiable causes but many identifiable effects. In this way, a rough hierarchy can be determined between terminal-ish goals and instrumental-ish goals.
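
As a purely illustrative sketch of that last ranking step (the edge list and the scoring heuristic below are invented stand-ins for whatever the protein survey would actually produce), one could score each biochemical condition by its number of identified causes minus its number of identified effects:

```python
# Invented illustration of the last two steps: given cause -> effect links
# between biochemical conditions, rank conditions by how many identified
# causes they have versus how many identified effects (many causes, few
# effects => more terminal-ish). The edge list below is made up.
from collections import defaultdict

edges = [
    ("high NaCl concentration", "cell hypertonic relative to environment"),
    ("flagellar rotation", "movement up nutrient gradient"),
    ("movement up nutrient gradient", "glucose uptake"),
    ("glucose uptake", "ATP production"),
    ("ATP production", "flagellar rotation"),
]

causes_of = defaultdict(set)   # condition -> conditions observed to produce it
effects_of = defaultdict(set)  # condition -> conditions it is observed to produce
for cause, effect in edges:
    causes_of[effect].add(cause)
    effects_of[cause].add(effect)

conditions = set(causes_of) | set(effects_of)

def terminalness(condition):
    """Crude score: more identified causes and fewer identified effects rank higher."""
    return len(causes_of[condition]) - len(effects_of[condition])

for condition in sorted(conditions, key=terminalness, reverse=True):
    print(condition, len(causes_of[condition]), len(effects_of[condition]))
```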

Potential issues with this plan:

  • This procedure is ill-equipped to keep up with mutations in the E-Coli culture. It takes much longer to create a plan for a particular bacteria culture than it does for the culture to spontaneously change in significant ways.
  • Some E-Coli values may not be detectable in a lab environment. For example, if E-Coli cells evolved to excrete chemicals that promote growth of other, potentially symbiotic cells, values associated with that behavior will likely go undetected.
  • In expressing this plan, I make the assumption that the behavior of the E-Coli cell reflects its values. This presents a theoretical limit on how precisely we can specify the cell's values, as we are not equipped to detect where behavior and volition diverge.

A few properties of our situation that aren't true with E coli:

  • We will know that an AI system has been created
  • We will have designed the AI system ourselves
  • We can answer questions posed to us by the AI system

The OP didn't equate humans with bacteria, but offered an outside view that we humans, being on the inside, tend not to notice. Of course we are somewhat more complex than E.Coli. We know that and we can see that easily. The blind spots lie where we are no different, and blind spots are what get in the way of MIRI's recent buzzword, deconfusion.

Further, to nitpick your points:

"We will know that an AI system has been created" -- why are you so sure? How would you recognize an AI that someone else has created without telling you? Maybe we would interpret it through the similar prism through which an E.coli interprets its environment: "Is it food? Is it a danger?" not being able to fathom anything more complex than that.

"We will have designed the AI system ourselves" -- that is indeed the plan, and, arguably, in a fixed-rules environment we already have, the AlphaZero. So, if someone formalizes the human interaction rules enough, odds are, something like AlphaZero would be able to self-train to be more human than any human in a short time.

"We can answer questions posed to us by the AI system" -- Yes, but our answers are not a reliable source of truth, they are mostly post hoc rationalizations. It has been posted here before (I can't seem to find the link) that answers to "why" questions are much less reliable than the answers to "what" questions.

“We will know that an AI system has been created”—why are you so sure? How would you recognize an AI that someone else has created without telling you? Maybe we would interpret it through a prism similar to the one through which an E.coli interprets its environment: “Is it food? Is it a danger?”, unable to fathom anything more complex than that.

I feel like we are kind of in this position relative to The Economy.

The OP didn't equate humans with bacteria, but offered an outside view that we humans, being on the inside, tend not to notice. Of course we are somewhat more complex than E.Coli. We know that and we can see that easily.

The title of this post is "The E-Coli Test for AI Alignment". The first paragraph suggests that any good method for AI alignment should also work on E-Coli. That is the claim I am disputing. Do you agree or disagree with that claim?

Perhaps the post was meant in the weaker sense that you mention, which I mostly agree with, but that's not the impression I get from reading the post.

"We will know that an AI system has been created" -- why are you so sure? How would you recognize an AI that someone else has created without telling you? Maybe we would interpret it through the similar prism through which an E.coli interprets its environment: "Is it food? Is it a danger?" not being able to fathom anything more complex than that.

I am super confused about what you are thinking here. At some point a human is going to enter a command or press a button that causes code to start running. That human is going to know that an AI system has been created. (I'm not arguing that all humans will know that an AI system has been created, though we could probably arrange for most humans to know this if we wanted.)

that is indeed the plan, and, arguably, in a fixed-rules environment we already have one: AlphaZero. So, if someone formalizes the rules of human interaction well enough, odds are something like AlphaZero would be able to self-train to be more human than any human in a short time.

I don't see how this is a nitpick of my point.

Yes, but our answers are not a reliable source of truth, they are mostly post hoc rationalizations. It has been posted here before (I can't seem to find the link) that answers to "why" questions are much less reliable than the answers to "what" questions.

Sure. They nonetheless contain useful information, in a way that E coli may not. See for example Inverse Reward Design.

Well, first, you are an expert in the area, someone who has probably put 1000 times more effort into figuring things out, so it's unwise for me to think that I can say anything interesting to you in an area you have thought about. I have been on the other side of such a divide in my area of expertise, and it is easy to see a dabbler's thought processes and the basic errors they are making a mile away. But since you seem to be genuinely asking, I will try to clarify.

At some point a human is going to enter a command or press a button that causes code to start running. That human is going to know that an AI system has been created. (I'm not arguing that all humans will know that an AI system has been created,

Right, those who are informed would know. Those who are not informed may or may not figure it out on their own, and with minimal effort the AI's hand can probably be masked as a natural event. Maybe I misinterpreted your point. Mine was that, just like an E.coli would not recognize an agent, neither would humans if it wasn't something we are already primed to recognize.

My other point was indeed not a nitpick; it was more about a human-level AI requiring a reasonable formalization of the game of human interaction, rather than any kind of new learning mechanism; those are already good enough. Not an AGI, but a domain AI for a specific human domain that is not obviously a game. Examples might be a news source, an emotional support bot, a science teacher, a poet, an artist...

They nonetheless contain useful information, in a way that E coli may not. See for example Inverse Reward Design.

Interesting link, thanks! Right, the information can be useful, even if not truthful, as long as the asker can evaluate the reliability of the reply.

Right, those who are informed would know. Those who are not informed may or may not figure it out on their own, and with minimal effort the AI's hand can probably be masked as a natural event. Maybe I misinterpreted your point. Mine was that, just like an E.coli would not recognize an agent, neither would humans if it wasn't something we are already primed to recognize.

Yup, agreed. All of the "we"s in my original statement (such as "We will know that an AI system has been created") were meant to refer to the people who created and deployed the AI system, though I now see how that was confusing.

First two yes, last one no. There is a communication gap in any case, and crossing that communication gap is ultimately the AI's job. Answering questions will look different in the two cases: maybe typing yes/no at a prompt vs swimming up one of two channels on a microfluidic chip. But the point is, communication is itself a difficult problem, and an AI alignment method should account for that.

But we can design communication protocols into an AI system. With E coli we would have to figure out how they "communicate".

I think there are some important advantages that humans have over e. coli, as subjects of value learning. We have internal bits that correspond to much more abstract ways of reasoning about the world and making plans. We can give the AI labeled data or hardcoded priors. We can answer natural language questions. We have theory of mind about ourselves. The states we drive the world into within our domain of normal operation are more microscopically different, increasing the relative power of abstract models of our behavior over reductionist ones.

It seems like this example would in some ways work better if the model organism were mice rather than bacteria, because bacteria probably do not even have values to begin with (so inconsistency isn't the issue), nor any internal experience.

With, say, mice though (though perhaps roundworms might work here, since it's more conceivable that they could actually have preferences), the answer to how to satisfy their values is almost certainly just wireheading, since they don't have a complex enough mind to have preferences about the world distinct from just their experiences.

So I'm not sure whether this type of approach works, because you probably need more intelligent, social animals in order for satisfying their preferences to amount to anything other than wireheading.

Still, I suppose this does raise the question of how one might best satisfy the preferences/values of animals like corvids or primates, who lack some of the more complex human values but still share the most basic ones, like being socially validated (and caring about the mental states of other animals, which rules out experience-machine-like solutions).

I'm assuming you think wireheading is a disastrous outcome for a super intelligent AI to impose on humans. I'm also assuming you think if bacteria somehow became as intelligent as humans, they would also agree that wireheading would be a disastrous outcome for them, despite the fact that wireheading is probably the best solution that can be done given how unsophisticated their brains are. I.e. the best solution for their simple brains would be considered disastrous by our more complex brains.

This suggests the possibility that maybe the best solution that can be applied to human brains would be considered disastrous for a more complex brain imagining that humans somehow became as intelligent as them.

While I consider wireheading only marginally better than oblivion, the more general issue is the extent to which you can really call something alignment if it leads to behavior that the overwhelming majority of people consider egregious and terrible in every way. It really doesn't make sense to talk about there being a "best" solution here anyway, because that basically begs the question with regard to a particular moral philosophy.

> I'm also assuming you think if bacteria somehow became as intelligent as humans, they would also agree that wireheading would be a disastrous outcome for them, despite the fact that wireheading is probably the best solution that can be done given how unsophisticated their brains are. I.e. the best solution for their simple brains would be considered disastrous by our more complex brains.

This assumption doesn't hold and somewhat misses my point entirely. As I talked about in my comment, bacteria don't seem to meaningfully have thoughts or preferences, so the idea of making a super-smart bacterium is rather like making a superintelligent rock. I can remove those surface-level issues by just replacing "bacteria" with, say, "mice", in which case there's a different misunderstanding involved here.

The main issue here is that it seems like you are massively anthropomorphizing animals. If a species of animal doesn't have a certain degree of intelligence, it's unlikely to have a value system that actually cares about the external world. However, it would be a form of anthropocentrism to expect that an "uplifted" version of an animal would necessarily start gaining certain terminal human values just because it's smarter.

So my point more generally is that, in natural life at least, you seem to need a degree of intelligence and socialness before a mind design that cares about the external world can evolve at all. So most animals can have their values easily and completely encompassed by wireheading, so there's no reason not to do that to them, and that doesn't really generalize to aligning AI for smarter, more social species.

Is there anything in the world that we know of that does alignment for something else? Can we say that humans are doing "coherent extrapolated volition" for evolution? Keeping in mind that, under this view, evolution itself would evolve and change into something more complex and maybe better.

Humans try to make their pets happy, usually...

The only example I can think of is with parents and their children. Evolutionarily, parents are optimized to maximize the odds that their children will survive to reproduce, up to and including self-sacrifice to that end. However, parents do not possess ideal information about the current state of their child, so they must undergo a process resembling value alignment to learn what their children need.

I think there's a bias, when we consider optimizing for X's values, to consider only X without its environment. But the environment gave rise to X, and much of X doesn't make sense without it. So I think to some extent we would also need to find ways to do alignment on the environment itself. And that means to some extent helping evolution.

A correct implementation of the function DesireOf(System) should not have a defined result for this input. Sitting and imagining that there is a result for this input might just lead you further away from understanding the function.

Maybe if you tried to define much simpler software agents that do have utility functions, which are designed for very very simple virtual worlds that don't exist, then try to extrapolate that into the real world?
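
For what it's worth, a minimal invented example of the kind of thing being suggested (the grid world, the utility function, and the greedy agent below are all made up purely for illustration; DesireOf is well-defined for this toy system by construction, in a way it isn't for an E-Coli):

```python
# Invented illustration of that suggestion: a tiny virtual world that doesn't
# exist, plus an agent whose utility function is written down explicitly, so
# that "what this system desires" is well-defined by construction. Nothing
# here is meant to extrapolate to E-Coli; it only shows the contrast.

GRID_SIZE = 5
FOOD = (4, 4)

def utility(position):
    """Explicit, hand-written utility: closer to the food square is strictly better."""
    return -(abs(position[0] - FOOD[0]) + abs(position[1] - FOOD[1]))

def step(position):
    """Greedy agent: move to the neighbouring cell (or stay put) with highest utility."""
    x, y = position
    neighbours = [(x + dx, y + dy)
                  for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]
                  if 0 <= x + dx < GRID_SIZE and 0 <= y + dy < GRID_SIZE]
    return max(neighbours + [position], key=utility)

pos = (0, 0)
while utility(pos) < 0:
    pos = step(pos)
print("agent reached", pos)  # (4, 4): the utility function fully specifies its "desires"
```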

You'll run into wetware fundamentals pretty much at once. Do you satisfy each bacterium's values (however you define them) or the values of the Population of Five (however you define them)? They are going to be different. Or maybe you take a higher level, the ecosystemic one (remember, you're the AI, you are entirely free to do it)? Or do you go lower, and view the cells as carriers for the things that matter - what's to prevent you from deciding that the really important things human bodies provide for are the worms in their guts, and not the brains?