Concept Safety: Producing similar AI-human concept spaces

by Kaj_Sotala, 14th Apr 2015

I'm currently reading through some relevant literature for preparing my FLI grant proposal on the topic of concept learning and AI safety. I figured that I might as well write down the research ideas I get while doing so, so as to get some feedback and clarify my thoughts. I will be posting these in a series of "Concept Safety"-titled articles.

A frequently-raised worry about AI is that it may reason in ways which are very different from us, and understand the world in a very alien manner. For example, Armstrong, Sandberg & Bostrom (2012) consider the possibility of restricting an AI via "rule-based motivational control" and programming it to follow restrictions like "stay within this lead box here", but they raise worries about the difficulty of rigorously defining "this lead box here". To address this, they go on to consider the possibility of making an AI internalize human concepts via feedback, with the AI being told whether or not some behavior is good or bad and then constructing a corresponding world-model based on that. The authors are however worried that this may fail, because

Humans seem quite adept at constructing the correct generalisations – most of us have correctly deduced what we should/should not be doing in general situations (whether or not we follow those rules). But humans share a common of genetic design, which the OAI would likely not have. Sharing, for instance, derives partially from genetic predisposition to reciprocal altruism: the OAI may not integrate the same concept as a human child would. Though reinforcement learning has a good track record, it is neither a panacea nor a guarantee that the OAIs generalisations agree with ours.

Addressing this, a possibility that I raised in Sotala (2015) was that the concept-learning mechanisms in the human brain might actually be relatively simple, and that we could replicate the human concept learning process by replicating those rules. I'll start this post by discussing a closely related hypothesis: that given a specific learning or reasoning task and a certain kind of data, there is an optimal way to organize the data that will naturally emerge. If this were the case, then AI and human reasoning might naturally tend to learn the same kinds of concepts, even if they were using very different mechanisms. Later in the post, I will discuss how one might try to verify that similar representations had in fact been learned, and how to set up a system to make them even more similar.

Word embedding

A particularly fascinating branch of recent research relates to the learning of word embeddings, which are mappings of words to very high-dimensional vectors. It turns out that if you train a system on one of several kinds of tasks, such as being able to classify sentences as valid or invalid, this builds up a space of word vectors that reflects the relationships between the words. For example, there seems to be a male/female dimension to words, so that there's a "female vector" that we can add to the word "man" to get "woman" - or, equivalently, which we can subtract from "woman" to get "man". And it so happens (Mikolov, Yih & Zweig 2013) that we can also get from the word "king" to the word "queen" by adding the same vector to "king". In general, we can (roughly) get to the male/female version of any word vector by adding or subtracting this one difference vector!

Why would this happen? Well, a learner that needs to classify sentences as valid or invalid needs to classify the sentence "the king sat on his throne" as valid while classifying the sentence "the king sat on her throne" as invalid. So including a gender dimension on the built-up representation makes sense.
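To make the vector arithmetic concrete, here is a minimal sketch in Python. The three-dimensional vectors are hand-made for illustration (they are an assumption, not learned embeddings, and real embeddings have hundreds of dimensions), and the helper names are hypothetical. It only shows what "adding the female vector" means geometrically.

```python
import math

# Toy, hand-made 3-d "embeddings" (hypothetical values, NOT learned from data).
# The axes loosely stand for (royalty, gender, person-ness).
VECTORS = {
    "man":   [0.0, -1.0, 1.0],
    "woman": [0.0, 1.0, 1.0],
    "king":  [1.0, -1.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, -1.0],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(target, exclude=()):
    """Return the vocabulary word whose vector is most similar to target."""
    candidates = [w for w in VECTORS if w not in exclude]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))

# The "female vector" is the difference between "woman" and "man".
female = sub(VECTORS["woman"], VECTORS["man"])

# Adding it to "king" lands nearest to "queen" in this toy space.
guess = nearest(add(VECTORS["king"], female), exclude={"king"})
print(guess)
```

In a learned embedding the match is only approximate, which is why the lookup uses nearest-neighbour search rather than exact equality.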

But gender isn't the only kind of relationship that gets reflected in the geometry of the word space. Here are a few more:

It turns out (Mikolov et al. 2013) that with the right kind of training mechanism, a lot of relationships that we're intuitively aware of become automatically learned and represented in the concept geometry. And as Olah (2014) comments:

It’s important to appreciate that all of these properties of W are side effects. We didn’t try to have similar words be close together. We didn’t try to have analogies encoded with difference vectors. All we tried to do was perform a simple task, like predicting whether a sentence was valid. These properties more or less popped out of the optimization process.

This seems to be a great strength of neural networks: they learn better ways to represent data, automatically. Representing data well, in turn, seems to be essential to success at many machine learning problems. Word embeddings are just a particularly striking example of learning a representation.

It gets even more interesting, for we can use these for translation. Since Olah has already written an excellent exposition of this, I'll just quote him:

We can learn to embed words from two different languages in a single, shared space. In this case, we learn to embed English and Mandarin Chinese words in the same space.

We train two word embeddings, W_en and W_zh, in a manner similar to how we did above. However, we know that certain English words and Chinese words have similar meanings. So, we optimize for an additional property: words that we know are close translations should be close together.

Of course, we observe that the words we knew had similar meanings end up close together. Since we optimized for that, it’s not surprising. More interesting is that words we didn’t know were translations end up close together.

In light of our previous experiences with word embeddings, this may not seem too surprising. Word embeddings pull similar words together, so if an English and Chinese word we know to mean similar things are near each other, their synonyms will also end up near each other. We also know that things like gender differences tend to end up being represented with a constant difference vector. It seems like forcing enough points to line up should force these difference vectors to be the same in both the English and Chinese embeddings. A result of this would be that if we know that two male versions of words translate to each other, we should also get the female words to translate to each other.

Intuitively, it feels a bit like the two languages have a similar ‘shape’ and that by forcing them to line up at different points, they overlap and other points get pulled into the right positions.

After this, it gets even more interesting. Suppose you had this space of word vectors, and then you also had a system which translated images into vectors in the same space. If you have images of dogs, you put them near the word vector for "dog". If you have images of Clippy, you put them near the word vector for "paperclip". And so on.

You do that, and then you take some class of images the image-classifier was never trained on, like images of cats. You ask it to place the cat-image somewhere in the vector space. Where does it end up? 

You guessed it: in the rough region of the "cat" words. Olah once more:

This was done by members of the Stanford group with only 8 known classes (and 2 unknown classes). The results are already quite impressive. But with so few known classes, there are very few points to interpolate the relationship between images and semantic space off of.

The Google group did a much larger version – instead of 8 categories, they used 1,000 – around the same time (Frome et al. (2013)) and has followed up with a new variation (Norouzi et al. (2014)). Both are based on a very powerful image classification model (from Krizhevsky et al. (2012)), but embed images into the word embedding space in different ways.

The results are impressive. While they may not get images of unknown classes to the precise vector representing that class, they are able to get to the right neighborhood. So, if you ask it to classify images of unknown classes and the classes are fairly different, it can distinguish between the different classes.

Even though I’ve never seen a Aesculapian snake or an Armadillo before, if you show me a picture of one and a picture of the other, I can tell you which is which because I have a general idea of what sort of animal is associated with each word. These networks can accomplish the same thing.

These algorithms made no attempt at being biologically realistic in any way. They didn't try classifying data the way the brain does it: they just tried classifying data using whatever worked. And it turned out that this was enough to start constructing a multimodal representation space where a lot of the relationships between entities were similar to the way humans understand the world.

How useful is this?

"Well, that's cool", you might now say. "But those word spaces were constructed from human linguistic data, for the purpose of predicting human sentences. Of course they're going to classify the world in the same way as humans do: they're basically learning the human representation of the world. That doesn't mean that an autonomously learning AI, with its own learning faculties and systems, is necessarily going to learn a similar internal representation, or to have similar concepts."

This is a fair criticism. But it is mildly suggestive of the possibility that an AI that was trained to understand the world via feedback from human operators would end up building a similar conceptual space. At least assuming that we chose the right learning algorithms.

When we train a language model to classify sentences by labeling some of them as valid and others as invalid, there's a hidden structure implicit in our answers: the structure of how we understand the world, and of how we think of the meaning of words. The language model extracts that hidden structure and begins to classify previously unseen things in terms of those implicit reasoning patterns. Similarly, if we gave an AI feedback about what kinds of actions counted as "leaving the box" and which ones didn't, there would be a certain way of viewing and conceptualizing the world implied by that feedback, one which the AI could learn.

Comparing representations

"Hmm, maaaaaaaaybe", is your skeptical answer. "But how would you ever know? Like, you can test the AI in your training situation, but how do you know that it's actually acquired a similar-enough representation and not something wildly off? And it's one thing to look at those vector spaces and claim that there are human-like relationships among the different items, but that's still a little hand-wavy. We don't actually know that the human brain does anything remotely similar to represent concepts."

Here we turn, for a moment, to neuroscience.

Multivariate Cross-Classification (MVCC) is a clever neuroscience methodology used for figuring out whether different neural representations of the same thing have something in common. For example, we may be interested in whether the visual and tactile representation of a banana have something in common.

We can test this by having several test subjects look at pictures of objects such as apples and bananas while sitting in a brain scanner. We then feed the scans of their brains into a machine learning classifier and teach it to distinguish between the neural activity of looking at an apple, versus the neural activity of looking at a banana. Next we have our test subjects (still sitting in the brain scanners) touch some bananas and apples, and ask our machine learning classifier to guess whether the resulting neural activity is the result of touching a banana or an apple. If the classifier - which has not been trained on the "touch" representations, only on the "sight" representations - manages to achieve a better-than-chance performance on this latter task, then we can conclude that the neural representation for e.g. "the sight of a banana" has something in common with the neural representation for "the touch of a banana".
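The train-on-one-modality, test-on-the-other logic can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the "scans" are synthetic four-dimensional patterns, each built from a shared concept pattern plus a modality-specific offset plus noise, each session is mean-centered before classification, and the classifier is a simple nearest-centroid rule rather than whatever a real MVCC study would use.

```python
import random

def simulate_trials(rng, concept_patterns, modality_offset, n_per_concept, noise=0.3):
    """Synthetic 'scans': concept pattern + modality offset + Gaussian noise."""
    trials = []
    for label, pattern in concept_patterns.items():
        for _ in range(n_per_concept):
            scan = [p + o + rng.gauss(0.0, noise)
                    for p, o in zip(pattern, modality_offset)]
            trials.append((scan, label))
    return trials

def center(trials):
    """Subtract the session's mean pattern (a simple per-session normalization)."""
    dim = len(trials[0][0])
    mean = [sum(x[i] for x, _ in trials) / len(trials) for i in range(dim)]
    return [([v - m for v, m in zip(x, mean)], y) for x, y in trials]

def fit_centroids(trials):
    """Nearest-centroid 'training': average the scans of each label."""
    sums, counts = {}, {}
    for x, y in trials:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Pick the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def accuracy(centroids, trials):
    return sum(predict(centroids, x) == y for x, y in trials) / len(trials)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
concepts = {"apple": [1.0, 0.0, 1.0, 0.0], "banana": [0.0, 1.0, 0.0, 1.0]}
sight = center(simulate_trials(rng, concepts, [2.0, 2.0, 0.0, 0.0], 50))
touch = center(simulate_trials(rng, concepts, [0.0, 0.0, -2.0, 3.0], 50))

centroids = fit_centroids(sight)               # trained on "sight" only
cross_accuracy = accuracy(centroids, touch)    # evaluated on "touch" only
print(cross_accuracy)
```

Because the synthetic data were built with a shared concept component, above-chance cross-modal accuracy is expected here by construction. In a real experiment, that shared component is exactly what is being tested for.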

A particularly fascinating experiment of this type is that of Shinkareva et al. (2011), who showed their test subjects both the written words for different tools and dwellings, and, separately, line-drawing images of the same tools and dwellings. A machine-learning classifier was both trained on image-evoked activity and made to predict word-evoked activity and vice versa, and achieved a high accuracy on category classification for both tasks. Even more interestingly, the representations seemed to be similar between subjects. Training the classifier on the word representations of all but one participant, and then having it classify the image representation of the left-out participant, also achieved a reliable (p<0.05) category classification for 8 out of 12 participants. This suggests a relatively similar concept space between humans of a similar background.

We can now hypothesize some ways of testing the similarity of the AI's concept space with that of humans. Possibly the most interesting one might be to develop a translation between a human's and an AI's internal representations of concepts. Take a human's neural activation when they're thinking of some concept, and then take the AI's internal activation when it is thinking of the same concept, and plot them in a shared space similar to the English-Mandarin translation. To what extent do the two concept geometries have similar shapes, allowing one to take a human's neural activation of the word "cat" to find the AI's internal representation of the word "cat"? To the extent that this is possible, one could probably establish that the two share highly similar concept systems.
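As a rough sketch of what such a translation could look like, the Python below aligns two toy two-dimensional concept spaces with a single rotation fitted on a few anchor concepts, then checks whether a held-out concept lands in the right place. The coordinates, the 40-degree rotation, and the assumption that the two spaces differ only by a rotation are all made up for illustration. Real neural and network activations would need a far richer mapping.

```python
import math

def rotate(v, theta):
    """Rotate a 2-d vector counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def fit_rotation(pairs):
    """Least-squares rotation angle mapping each x onto its paired y
    (the closed-form 2-d case of orthogonal Procrustes alignment)."""
    dot = sum(x[0] * y[0] + x[1] * y[1] for x, y in pairs)
    cross = sum(x[0] * y[1] - x[1] * y[0] for x, y in pairs)
    return math.atan2(cross, dot)

# Hypothetical "human" concept coordinates (illustrative values only).
human = {
    "dog": (1.0, 0.2),
    "cat": (0.9, 0.4),
    "car": (-0.8, 0.5),
    "tree": (0.1, -1.0),
}

# Hypothetical "AI" space: same geometry, rotated by an angle unknown to the fit.
TRUE_ANGLE = math.radians(40)
ai = {word: rotate(vec, TRUE_ANGLE) for word, vec in human.items()}

# Fit the alignment on anchor concepts only; "cat" is held out.
anchors = ["dog", "car", "tree"]
theta = fit_rotation([(human[w], ai[w]) for w in anchors])

# Translate the held-out human concept and look up the nearest AI concept.
mapped = rotate(human["cat"], theta)
translated = min(ai, key=lambda w: math.dist(ai[w], mapped))
print(translated)
```

How well held-out concepts translate is one possible measure of how similar the two geometries are: if the shapes did not match, no single alignment fitted on the anchors would carry over.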

One could also try to more explicitly optimize for such a similarity. For instance, one could train the AI to make predictions of different concepts, with the additional constraint that its internal representation must be such that a machine-learning classifier trained on a human's neural representations will correctly identify concept-clusters within the AI. This might force internal similarities on the representation beyond the ones that would already be formed from similarities in the data.
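One way to picture this constraint is as an extra penalty term added to the AI's ordinary training loss. The Python sketch below is an assumption-laden illustration, not a worked-out method: the "classifier trained on human neural representations" is reduced to a set of per-concept centroids, the penalty is a softmax cross-entropy over negative squared distances to those centroids, and all names and numbers are hypothetical.

```python
import math

def alignment_penalty(ai_rep, label, human_centroids):
    """Cross-entropy of a centroid classifier (fitted on human data)
    applied to the AI's internal representation of a concept."""
    scores = {
        concept: -sum((a - b) ** 2 for a, b in zip(ai_rep, centroid))
        for concept, centroid in human_centroids.items()
    }
    # Numerically stable log-sum-exp over the class scores.
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return log_z - scores[label]

def total_loss(task_loss, ai_rep, label, human_centroids, weight=1.0):
    """Ordinary task loss plus a weighted human-alignment penalty."""
    return task_loss + weight * alignment_penalty(ai_rep, label, human_centroids)

# Hypothetical centroids derived from human data (illustrative values).
human_centroids = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}

aligned = alignment_penalty([0.9, 0.1], "cat", human_centroids)
misaligned = alignment_penalty([0.1, 0.9], "cat", human_centroids)
print(aligned < misaligned)
```

The `weight` parameter trades off task performance against similarity to the human representation. Setting it to zero recovers ordinary training.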

Next post in series: The problem of alien concepts.

Firstly, good job. I wish there were more posts like this which discuss the connections and impact of deep learning advances on AI safety.

A frequently-raised worry about AI is that it may reason in ways which are very different from us, and understand the world in a very alien manner.

Notice, however, that this particular concern comes from AI philosopher types rather than machine learning researchers.

I'll start this post by discussing a closely related hypothesis: that given a specific learning or reasoning task and a certain kind of data, there is an optimal way to organize the data that will naturally emerge. If this were the case, then AI and human reasoning might naturally tend to learn the same kinds of concepts, even if they were using very different mechanisms.

This is beyond just a hypothesis, as it follows naturally from optimality notions in computer science. As there is only ever one optimal solution (with perhaps many functionally equivalent isomorphic encodings), then obviously for any set of optimization criteria, constraints (including computational), and dataset there naturally can only ever be a single optimal solution (with potentially many equivalent encodings).

However, for any real problems of interest, finding the global optimum is intractable and utility wasteful regardless (diminishing returns and all that). But a more general relaxed version of the principle still applies if we consider that as one approaches the global optimum (and its many near-equivalents), the solution space collapses to a point.

One could also try to more explicitly optimize for such a similarity. For instance, one could train the AI to make predictions of different concepts, with the additional constraint that its internal representation must be such that a machine-learning classifier trained on a human's neural representations will correctly identify concept-clusters within the AI. This might force internal similarities on the representation beyond the ones that would already be formed from similarities in the data.

I have some ideas along these lines for how an internal monologue system could be used to enhance safety/security in a large complex ANN based AGI. RL based AGI designs typically already rely on or require a tight information bottleneck at the top which essentially acts as a history compressor.

We could then connect the AI's internal compressed state history to a linguistic (word/sentence vector) RNN which is trained to output sensible sentences that match the AI's internal thought process and communicate its intents to humans - essentially its language output. Admittedly there is still a ways to go on that front, but the recent progress in mapping images to sentences is highly encouraging. If we can map images to sentences, we should be able to map internal compressed mental histories to sentences.

If we can do that then we can set up the AI to have a speech monologue that is always active, but is only routed to the external output based on an internal binary switch variable the AI controls (similar to humans). And then voilà, we can have the AI's stream of thoughts logged to a text file.

Of course at some point the AI may become aware that we are reading its thoughts, but by that point we should have learned a good deal about it already. No, it's not a slam-dunk complete solution to AI safety, but having an internal monologue appears to be a very useful tool.
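The routing part of this idea is simple enough to sketch in Python. The hard part, decoding an internal state into a sentence, is not attempted here: the sketch assumes the sentences arrive already decoded, and the class and parameter names are hypothetical.

```python
import io

class MonologueRouter:
    """Always logs the decoded thought; only 'speaks' it when the switch is on."""

    def __init__(self, log):
        self.log = log  # any file-like object with a write() method

    def step(self, thought_sentence, speak_aloud):
        # The monologue is always recorded, regardless of the switch.
        self.log.write(thought_sentence + "\n")
        # Only route to external output when the AI-controlled switch is set.
        return thought_sentence if speak_aloud else None

log = io.StringIO()  # stands in for the text file
router = MonologueRouter(log)
spoken = [
    router.step("I see a box.", speak_aloud=False),
    router.step("I am staying inside the box.", speak_aloud=True),
]
print(log.getvalue())
```

The design point is that logging happens before the switch is consulted, so the log does not depend on what the AI chooses to say aloud.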

Firstly, good job. I wish there were more posts like this which discuss the connections and impact of deep learning advances on AI safety.

Thank you very much!

This is beyond just a hypothesis, as it follows naturally from optimality notions in computer science. As there is only ever one optimal solution (with perhaps many functionally equivalent isomorphic encodings), then obviously for any set of optimization criteria, constraints (including computational), and dataset there naturally can only ever be a single optimal solution (with potentially many equivalent encodings).

However, for any real problems of interest, finding the global optimum is intractable and utility wasteful regardless (diminishing returns and all that). But a more general relaxed version of the principle still applies if we consider that as one approaches the global optimum (and its many near-equivalents), the solution space collapses to a point.

Good point about the "there might be a natural organization for the data" part. I should be more specific about the "in which case an AI and human might naturally use similar kinds of concepts" part, since that one involves additional assumptions:

  1. That the same kind of organization is optimal both for computation implemented in biological cells and for computation implemented in a conventional digital computer
  2. That the human brain has actually evolved to employ a close-to-optimal organization of the data

1) seems to me likely to be untrue in literal form, but could possibly be avoided by just building a system that wasn't necessarily totally digital-optimal. 2) probably depends on the domain - e.g. Körding 2007 mentions that

Estimation will often depend on two different cues. For example, we may see and feel an object and use both senses to infer the properties of the object. [...] A couple of recent studies have examined how subjects solve such cue combination problems. For example, the combinations of visual and auditory information and visual and tactile information, as well as within modality cue combination (e.g., texture and disparity), have been studied. In such cases, cues are combined in a fashion that is close to the optimum prescribed by Bayesian statistics (14, 15, 17, 18).

But on the other hand, I would expect the brain to use suboptimal representations for evolutionarily recent tasks, such as doing abstract mathematics.

We could then connect the AI's internal compressed state history to a linguistic (word/sentence vector) RNN which is trained to output sensible sentences that match the AI's internal thought process and communicate its intents to humans - essentially its language output. Admittedly there is still a ways to go on that front, but the recent progress in mapping images to sentences is highly encouraging. If we can map images to sentences, we should be able to map internal compressed mental histories to sentences.

If we can do that then we can set up the AI to have a speech monologue that is always active, but is only routed to the external output based on an internal binary switch variable the AI controls (similar to humans). And then voilà, we can have the AI's stream of thoughts logged to a text file.

That's a very interesting idea. One challenge that comes to mind is that since the AI's internal world-model would be constantly changing, you might need to constantly re-train the language network to understand what the changed concepts correspond to. But since you didn't know for sure what the new concepts did correspond to, you wouldn't have a fully reliable training set to re-train it with. Still, you might be able to pull it off anyway.

  1. That the same kind of organization is optimal both for computation implemented in biological cells and for computation implemented in a conventional digital computer
  2. That the human brain has actually evolved to employ a close-to-optimal organization of the data

1) seems to me likely to be untrue in literal form, but could possibly be avoided by just building a system that wasn't necessarily totally digital-optimal. 2) probably depends on the domain - e.g. Körding 2007 mentions that

We have some evidence for situations in which 1.) is true. In vision, for example, V1 learns a decomposition of the image into Gabor filters. Likewise, most hierarchical machine learning vision systems also learn a first stage of filters resembling Gabor filters when fed natural image data.
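For readers unfamiliar with the term, a Gabor filter is a sinusoidal grating multiplied by a Gaussian envelope, which makes it respond to edges of a particular orientation and scale. A minimal Python sketch of the standard real-valued formula follows. The function name and the default parameter values are illustrative choices, and `size` is assumed to be odd.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Real part of a 2-d Gabor filter, returned as a size x size list of rows.

    wavelength: period of the sinusoidal carrier, in pixels
    theta:      orientation of the grating, in radians
    sigma:      width of the Gaussian envelope
    gamma:      spatial aspect ratio of the envelope
    psi:        phase offset of the carrier
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates into the filter's own frame.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

kernel = gabor_kernel(size=7, wavelength=4.0, theta=0.0, sigma=2.0)
print(kernel[3][3])
```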

In regards to 2.), exact optimality matters less than optimality relative to the computational power applied to inference.

But on the other hand, I would expect the brain to use suboptimal representations for evolutionarily recent tasks, such as doing abstract mathematics.

This assumes that task specific representations are hardwired in by evolution, which is mostly true only for the old brain. The cortex (along with the cerebellum) is essentially the biological equivalent of a large machine learning coprocessor, and at birth it has random connections, very much like any modern ML system, like ANNs. It appears that the cortex uses the same general learning algorithms to learn everything from vision to physics. This is the 'one learning algorithm' hypothesis, and has much support at this point. At a high level we know that it should be true - after all we know that the stronger forms of Bayesian inference can learn anything there is to learn, and the success of modern variants of SGD - which can be seen as a scalable approximation of Bayesian inference - provides further support. (The brain probably uses something even better than modern SGD, and we are getting closer and closer to matching its inference algorithms - many, many researchers are trying to find the next best approximate inference algorithm past SGD.)

If we can do that then we can set up the AI to have a speech monologue that is always active, but is only routed to the external output based on an internal binary switch variable the AI controls (similar to humans). And then voilà, we can have the AI's stream of thoughts logged to a text file.

That's a very interesting idea. One challenge that comes to mind is that since the AI's internal world-model would be constantly changing, you might need to constantly re-train the language network to understand what the changed concepts correspond to. But since you didn't know for sure what the new concepts did correspond to, you wouldn't have a fully reliable training set to re-train it with. Still, you might be able to pull it off anyway.

Yes - that is a specific instance of the general training dependency problem. One general solution is to train everything together. In this specific case I imagine that once the language/speech output module is hooked up, we can then begin training it online with the rest of the system using whatever RL type criterion we are using to train the other motor output modules. So in essence the AI will learn to improve its language output capability so as to better communicate with humans, and this improvement will naturally co-adapt to internal model changes.

This assumes that task specific representations are hardwired in by evolution, which is mostly true only for the old brain. The cortex (along with the cerebellum) is essentially the biological equivalent of a large machine learning coprocessor, and at birth it has random connections, very much like any modern ML system, like ANNs. It appears that the cortex uses the same general learning algorithms to learn everything from vision to physics. This is the 'one learning algorithm' hypothesis, and has much support at this point.

I agree that there seems to be good evidence for the 'one learning algorithm' hypothesis... but there also seems to be reasonable evidence for modules that are specialized for particular tasks that were evolutionarily useful; the most obvious example would be the extent to which we seem to have specialized reasoning capacity for modeling and interacting with other people, capacity which is to a varying extent impaired in people on the autistic spectrum.

Even if one does assume that the cortex used the same learning algorithms for literally everything, one would still expect the parameters and properties of those algorithms to be at least partially genetically tuned towards the kinds of learning tasks that were most useful in the EEA (though of course the environment should be expected to carry out further tuning of the said parameters). I don't think that the brain learning everything using the same algorithms would disprove the notion that there could exist alternative algorithms better optimized for learning e.g. abstract mathematics, and which could also employ a representation that was better optimized for abstract math, at the cost of being worse at more general learning of the type most useful in the EEA.

I agree that there seems to be good evidence for the 'one learning algorithm' hypothesis... but there also seems to be reasonable evidence for modules that are specialized for particular tasks that were evolutionarily useful

The paper you linked to is long-winded. I jumped to the section titled "Do Modules Require Their Own Genes?". I skimmed a bit and concluded that the authors were missing huge tracts of key recent knowledge from computational and developmental neuroscience and machine learning, and as a result they are fumbling in the dark.

the most obvious example would be the extent to which we seem to have specialized reasoning capacity for modeling and interacting with other people, capacity which is to a varying extent impaired in people on the autistic spectrum.

Learning will automatically develop any number of specialized capabilities just as a natural organic process of interacting with the environment. Machine learning provides us with concrete specific knowledge of how this process actually works. The simplest explanation for autism inevitably involves disruptions to learning machinery, not disruptions to preconfigured "people interaction modules".

Again to reiterate - obviously there are preconfigured modules - it is just that they necessarily form a tiny portion of the total circuitry.

Even if one does assume that the cortex used the same learning algorithms for literally everything, one would still expect the parameters and properties of those algorithms to be at least partially genetically tuned towards the kinds of learning tasks that were most useful in the EEA (though of course the environment should be expected to carry out further tuning of the said parameters).

Perhaps, perhaps not. Certainly genetics specifies a prior over model space. You can think of evolution wanting to specify as much as it can, but with only a tiny amount of code. So it specifies the brain in a sort of ultra-compressed hierarchical fashion. The rough number of main modules, neuron counts per module, and gross module connectivity are roughly pre-specified, and then within each module there are just a few types of macrocircuits, each of which is composed of a few types of repeating microcircuits, and so on.

Using machine learning as an analogy, to solve a specific problem we typically come up with a general architecture that forms a prior over model space that we believe is well adapted to the problem. Then we use a standard optimization engine - like SGD - to handle the inference/learning given that model. The learning algorithms are very general purpose and cross domain.

I don't think that the brain learning everything using the same algorithms would disprove the notion that there could exist alternative algorithms better optimized for learning e.g. abstract mathematics, and which could also employ a representation that was better optimized for abstract math, at the cost of being worse at more general learning of the type most useful in the EEA.

The distinction between the 'model prior' and the 'learning algorithm' is not always so clear cut, and some interesting successes in the field of metalearning suggest that there indeed exist highly effective specialized learning algorithms for at least some domains.

one would still expect the parameters and properties of those algorithms to be at least partially genetically tuned towards the kinds of learning tasks that were most useful in the EEA

Compare jacob_cannell's earlier point that

obviously for any set of optimization criteria, constraints (including computational), and dataset there naturally can only ever be a single optimal solution (emphasis added)

Do we know or can we reasonably infer what those optimization criteria were like, so that we can implement them into our AI? If not, how likely and by how much would we expect the optimal solution to change?

At a high level we know that it should be true - after all we know that the stronger forms of Bayesian inference can learn anything there is to learn, and the success of modern variants of SGD - which can be seen as a scalable approximation of Bayesian inference - provides further support. (The brain probably uses something even better than modern SGD, and we are getting closer and closer to matching its inference algorithms - many, many researchers are trying to find the next best approximate inference algorithm past SGD.)

I really agree with your general point but this isn't correct. Bayesian inference can only learn something so long as the model specified is correct. I know this is kind of pedantic but it's important to keep in mind.

E.g. there are some functions a simple Bayesian network won't be able to model well without exponentially many parameters and training examples. Because of the no free lunch theorem, all models have weaknesses in some cases.

Of course some people might say the no free lunch theorem is useless since we can assume real world problems are drawn from some distribution of simple computable models in some Turing complete language. However this doesn't really help us since we can't do efficient inference on anything remotely like a Turing complete language, and so must use much more restricted models.

SGD is not an approximation of Bayesian inference. It has nothing to do with Bayesianism. It's just a general optimization algorithm which is useful for fitting models to data.

And I doubt the brain uses anything better than SGD. I would be very surprised if it's even half as efficient as SGD. The reason is that computers are numerically accurate through many layers and many timesteps, while the brain is extremely noisy and can't do global algorithms like that. Additionally, computers can iterate through a dataset many times and fine-tune every parameter, while brains only get to see things once.

However that's fine since SGD seems to be more than enough for modern NNs. Inventing a 10x more efficient optimization algorithm would just mean you can train the nets slightly faster. But training time isn't the limiting factor for the most part.

SGD is not an approximation of bayesian inference. It has nothing to do with bayesianism. It's just a general optimization algorithm which is useful for fitting models to data.

How well read are you in machine learning? 10's of papers? 100's? 1000's? PhD level? This and your other comment about IRL suggest that you have only cursory knowledge of the field. Also 'bayesianism' isn't a thing, outside of LW.

Bayesian inference is also just an "algorithm which is useful for fitting models to data."

Inference problems can be turned into optimization problems and vice versa. In particular the single MLE estimate from a full exhaustive inference over some data set conditioned on some observable is exactly equivalent to a global optimization problem solved with exhaustive search.

Exhaustive methods have exponential order costs, so the first obvious large improvement is to approximate the full joint CDF by a factored graphical model, such as a factor graph. For real valued variables, tracking full distributions is still quite expensive, so the next level of approximation/optimization is to use simple analytic distributions such as gaussians. Another useful approximation then is to use some incremental sampling algorithm.

SGD type algorithms are equivalent to approximate MLE inference where only the mean of each variable is tracked, update messages are swept through the data in a simple fashion, and the variance is related to the learning rate.
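A minimal sketch of one special case of this claim, not a general proof: for a Gaussian with known unit variance, SGD on the negative log-likelihood with a learning rate of 1/t reproduces the running maximum-likelihood estimate of the mean exactly. The function name, the data, and the 1/t schedule are illustrative assumptions, not something taken from the comment above.

```python
def sgd_gaussian_mean(data):
    """Estimate the mean of a unit-variance Gaussian by SGD on the
    negative log-likelihood, using a 1/t learning-rate schedule."""
    mu = 0.0
    for t, x in enumerate(data, start=1):
        lr = 1.0 / t          # assumed schedule: step size shrinks as data accumulates
        grad = mu - x         # gradient of 0.5 * (x - mu)**2 with respect to mu
        mu -= lr * grad       # plain SGD update on a single example
    return mu


data = [1.0, 2.0, 4.0, 5.0]
mle = sum(data) / len(data)           # closed-form MLE of the mean
estimate = sgd_gaussian_mean(data)    # SGD estimate after one pass
print(mle, estimate)
```

In this special case (known unit variance, flat prior) the posterior variance of the mean after t observations is 1/t, which is the same quantity as the step size, so the learning rate can be read as tracking the remaining uncertainty.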

I apologize if my comment came off as rude. I certainly didn't mean to assert any kind of authority over this. I am just a hobbyist and some minor points you made bothered me. Mainly the comment about the limiting factor of NNs being the optimization algorithm they use, or that the brain uses something far better. The points about bayesian inference were just tangential.

I didn't mean bayesianism the philosophy, just bayesian methods.

Bayesian inference is also just an "algorithm which is useful for fitting models to data."

Yes but it's not an optimization algorithm. Optimization algorithms are more general than statistics. You can use an optimization algorithm to find the optimal parameters for an airplane wing or the shortest path between several cities.

Conversely bayesian inference doesn't specify how the parameters should be optimized, just that you should somehow weigh every possibility according to its probability.

I am not saying that they aren't related at all, just that it's worth distinguishing them as qualitatively different concepts, where you seem to use them interchangeably.

... some minor points you made bothered me. Mainly the comment about the limiting factor of NNs being the optimization algorithm they use, or that the brain uses something far better.

I didn't say SGD is the main limiting factor of ANNs, or that the brain uses something far better. I said "the brain probably uses something even better than modern SGD ..."

Modern SGD methods - especially with auto learning rate tuning and the new normalization schemes (which btw relates directly to better variance/uncertainty models in stat inference methods) - are pretty powerful, but they still learn somewhat slowly, requiring numerous passes through the data to reach a good solution.

I don't have time to dig deep into how the brain may use techniques better than SGD ... but as a simple single example of one thing it does better: current SGD ANN training computes the same update steps for the same high cost across the entire network for every training example, even though examples vary vastly in their novelty/difficulty/utility of learning. The brain appears to be much better about managing its limited resources.

I am not saying that [inference and optimization] aren't related at all, just that it's worth distinguishing them as qualitatively different concepts, where you seem to use them interchangeably.

They are largely interchangeable in machine learning in the sense that you can use optimization techniques (SGD) or inference techniques (expectation propagation, expectation backpropagation, MCMC, etc) to train a model (such as an ANN).

Much of the 'wisdom' or deep insightful knowledge in a particular field consists of learning all the structural relations and symmetries between different algorithms/techniques which enable internal mental compression of all of the raw low level knowledge: learning which techniques are generalizations, specializations, approximations, or restricted transformations of others. In the beginning, everything looks disconnected and compartmentalized, but eventually one sees how everything is connected.

General optimization can be used to implement inference, and vice versa. You can recast optimization as an inference problem: the initial settings/constraints become a prior, the utility/loss function is converted into a probability measure, learning rates relate to variance/precision, etc. See survey papers such as "Representation Learning", or look into the use of bayesian methods in machine learning (as replacements for optimization methods) to get some perspective on how they all relate.
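One standard way to write down the correspondence described above, as a sketch under the assumption that the loss L is treated as a negative log-likelihood and the regularizer as a negative log-prior (the symbols θ, D, L and σ are introduced here for illustration):

```latex
\begin{aligned}
p(\theta \mid D) &\propto \exp\bigl(-L(\theta; D)\bigr)\, p(\theta) \\
\hat{\theta}_{\mathrm{MAP}} &= \arg\max_\theta \, p(\theta \mid D)
  = \arg\min_\theta \Bigl[ L(\theta; D) - \log p(\theta) \Bigr] \\
p(\theta) = \mathcal{N}(0, \sigma^2 I) &\;\Rightarrow\;
  -\log p(\theta) = \frac{\lVert \theta \rVert^2}{2\sigma^2} + \text{const}
\end{aligned}
```

Read this way, minimizing a loss with L2 weight decay is the same computation as finding the posterior mode under a Gaussian prior. A point estimate still discards the rest of the posterior, which is the distinction the other commenter is drawing.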

I'm glad that you're thinking about these things, but this misses what I think is the hard part of the problem: truly out-of-sample cases. The thing that I'm worried about isn't that a superhuman AI will map (human beings suffering in a currently understood way) to the concept "good", but that it will have a lot of degrees of freedom of where to map (thing that is only possible with nanotech, which human brains aren't capable of fully understanding) or (general strategy for meme-hacking human brains, which human brains aren't able to conceptualize), etc, and that a process of picking the best action may be likely to pick up one of these edge cases that would differ from our extrapolated volitions.

Basically, I don't see how we can be confident yet that this continues to work once the AI is able to come up with creative edge cases that our brains aren't explicitly able to encompass or classify the way our extrapolated volitions would want. For an example of progress that might help with this, I might hope there's a clever way to regularize model selection so that they don't include edge cases of this sort, but I've not seen anything of that type.

Hmm. In a future post, I'm hoping to get to the question of "suppose that an AI could expand the way it has defined its existing concepts by including additional dimensions which humans are incapable of conceptualizing, and this led its values to diverge from human ones", and I agree that this post is not yet sufficient to solve that one. I think that's the same problem as you're talking about (if previously your concepts had N dimensions and now they have N+1, you could find something that fulfilled all the previous criteria while still being different from what we'd prefer if we knew about the N+1th dimension), but I'm not entirely sure?

Yes, except I'm much more pessimistic about reinforcement learning sufficing, since I expect that a superhuman-engineering-capability AI would have, not just a few additional degrees of freedom, but incredibly many. And then it would not suffice for the AI to make its best guess about how to extrapolate human values to a world with nanotech and memehacking and (whatever else)- that would almost surely lead to disaster.

So how do you feel about the proposal I made in my latest post, to evaluate the new situation in light of the old values? (Might want to continue this thread in the comments of that post.)

My (low-confidence) intuition is that while it's certainly possible to easily screw up the implementation, if the system is engineered correctly, then the process by which the AI applies the old values to the new situation/new concept space should be essentially the same as the one by which humans would do it. Of course, in practice "the system being engineered correctly" might require e.g. a very human-like design including a humanoid body etc. in order to get the initial concept space to become sufficiently similar to the human one, so that's a problem.

I think I'm also somewhat more optimistic about the range of solutions that might qualify as "good", because a large part of human values seem to be determined by reinforcement learning. (Compare Hanson on plasticity.) I suspect that if e.g. nanotech and memehacking became available, then the "best" approach to deal with it is underdetermined by our current values, and just because an AI would extrapolate our current values differently than humans would, doesn't necessarily mean that that extrapolation would be any worse. I mean, if the best extrapolation is genuinely underdetermined by our current values, then that means that a wide range of possibilities is equally good pretty much by definition.

I totally agree with you that AIs should be able to learn what humans mean by different concepts. I never really understood that objection. I think the problem is a bit deeper. This sentence right here:

the AI being told whether or not some behavior is good or bad and then constructing a corresponding world-model based on that.

What's to stop the AI from instead learning that "good" and "bad" are just subjective mental states or words from the programmer, rather than some deep natural category of the universe? So instead of doing things it thinks the human programmer would call "good", it just tortures the programmer and forces them to say "good" repeatedly.

The AI understands what you mean, it just doesn't care.

What's to stop the AI from instead learning that "good" and "bad" are just subjective mental states or words from the programmer, rather than some deep natural category of the universe?

Kaj_Sotala's post doesn't directly address this issue, but transparent concept learning of the form discussed in the OP should combine well with machine learning approaches to value learning, such as inverse reinforcement learning.

The general idea is that we define the agent's utility function indirectly as the function/circuit that would best explain human actions as a subcomponent of a larger generative model trained on a suitable dataset of human behavior.

With a suitably powerful inference engine and an appropriate training set, this class of techniques is potentially far more robust than any direct specification of a human utility function. Whatever humans' true utility functions are, those preferences are revealed in the consequences of our decisions, and a suitable inference system can recover that structure.

The idea is that the AI will learn the structural connections between humans' usage of the terms "good" and "bad" and its own utility function and value function approximations. In the early days its internal model may rely heavily on explicit moral instruction as the best predictor of the true utility function, but later on it should learn a more sophisticated model.

The general idea is that we define the agent's utility function indirectly as the function/circuit that would best explain human actions as a subcomponent of a larger generative model trained on a suitable dataset of human behavior.

I wrote a bit about this here, and posed the "very easy goal inference problem:" even if you had as much time as you wanted and knew exactly what a human would do in every possible situation, could you figure out what outcomes were "good" and "bad" then?

It seems like we have to be able to solve this problem in order to carry out the kind of strategy you describe.

I haven't seen any meaningful progress on it. I guess the hope is that we will get better at answering it as we get better at AI. But it seems quite distant from everything that people work on in AI, and it's pretty distant even from what people work on in cognitive science.

I wrote a bit about this here, and posed the "very easy goal inference problem:" even if you had as much time as you wanted and knew exactly what a human would do in every possible situation, could you figure out what outcomes were "good" and "bad" then?

Given infinite compute and an unlimited perfect training set, the utility inference problem is still somewhat more complex than just predicting what the human would do (supervised learning), but it still seems pretty tractable.

Using essentially perfect unsupervised learning (full bayesian/Solomonoff induction) on an enormous computer, you could easily compute a prior over models that explains the data well. The best models will tend to include circuits which efficiently approximate actual human thought processes. The problem of course - as you mention - is that we want to extract the equivalent of a human utility function so that we can combine it with a much improved superhuman predictive model of the world, along with a much longer planning horizon, improved Q/value function, etc.

This is still relatively easy to set up - at least in theory - in a suitable probabilistic programming environment. "Predict the human's decision output" corresponds to like one or two lines of code (it's just the obvious supervised learning objective without any complex constraints), whereas a proper inverse reinforcement learning setup of the type discussed above which extracts a modular human utility function suitable for insertion into a more powerful AI perhaps corresponds to a few hundred lines of code to describe the more complex prior structure we are imposing on the space of models.

To be really robust we would need to explore many different potential modular models, and eventually we may need to start worrying about all the instantiated models we create - as if you make them complex enough there is an argument that they eventually could become equivalent to simulated humans, etc.

But in reality that is a ways out as we don't have unlimited computation, and it seems that humans are fairly capable of modelling the preferences of other humans using cheap approximations.

But it seems quite distant from everything that people work on in AI, and it's pretty distant even from what people work on in cognitive science.

At this stage in the game supervised learning is the most effective training paradigm for most tasks. The scope of AIs that we can build right now are systems with on the order of millions of neurons that can replicate some of the specific functionality from a few brain regions on rather specific tasks. IRL will become important later, once AI systems are much larger, much more educated, and have less to learn by directly imitating human experts.

So in short, people in AI aren't working on this today because it's not where the money is .. yet.

whereas a proper inverse reinforcement learning setup of the type discussed above which extracts a modular human utility function suitable for insertion into a more powerful AI perhaps corresponds to a few hundred lines of code to describe the more complex prior structure we are imposing on the space of models.

This would be great to see, but I'm not too optimistic. I can't tell whether the approach you are describing is designed to extract structure which actually exists, by lining up the prior with the actual structure of the brain, or whether it's designed to create new structure, by finding a simple explanatory model that is much more modular than the brain itself. Both seem tough!

I would be very surprised if there is a modular part of the brain that implements the "human utility function," and I've never heard a contemporary cognitive scientist endorse anything that would be suitable for your intended application.

If you write down an accurate model of the brain as "rational behavior + noise," I suspect the noise model ends up being exactly as complex as the human brain itself, since it has to e.g. specify how we think about things in order to predict what things we won't have time to think about. And once the complexity is in the noise model, the normal model selection test isn't going to really work for finding the values. E.g., if you imposed the obvious kind of structure, I wouldn't be surprised if the utility function you got out was not at all what humans valued, but just a useful heuristic for explaining some small part of human behavior.

Do you see any reason to suspect otherwise, or generally any evidence to make you optimistic about this project succeeding? Can you imagine any kind of breakdown that doesn't obviously fail?

I agree you could extract something like the map from "Perceptions" ---> "Human's answer to the question: 'does it seem like things are going well?'", and I have mostly focused on attacking AI control using capabilities like this.

So in short, people in AI aren't working on this today because it's not where the money is .. yet.

In general: some day AI control will become an economically relevant problem and it will receive attention if needed. This seems to be why serious people are optimistic about our prospects. But if we want to foster relative progress on control, then we should try to understand the issue further in advance than will happen by default.

So we may disagree about how optimistic we should be about this research project, but hopefully we can agree that it's a research project that (1) would have to be solved in order for this approach to AI control to work, though it could be solved incidentally, and (2) is not currently benefitting that much from conventional research in AI.

You may be optimistic that it will obviously be solved once it becomes relevant. I don't see much reason for such optimism. But the bigger difference is that it just wouldn't change my outlook much if I thought it had a 75% chance of being solved.

This would be great to see, but I'm not too optimistic. I can't tell whether the approach you are describing is designed to extract structure which actually exists, by lining up the prior with the actual structure of the brain, or whether it's designed to create new structure, by finding a simple explanatory model that is much more modular than the brain itself. Both seem tough!

Remember we are talking about infinite inference power and infinite accurate training data, so specifying a careful accurate prior over the space of models is just not something we have to worry about. All we need to worry about is ensuring that our problem definition actually solves the correct problem.

So to clarify - the general problem is something along the lines of: find a utility function (a function which maps observation histories to real numbers), and a model RL agent architecture that together explain/predict the output dataset (human minds). We can then use that utility function in a more powerful RL agent.

The assumptions we need to make to solve this problem are only those related to our intent: namely that human decisions - however implemented internally - imply preferences over observation histories/worlds, and that we want to create new agents which optimize for those preferences more effectively.
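To make the problem statement concrete, here is a deliberately tiny sketch: a Bayesian posterior over a handful of candidate utility functions, given observed choices and an assumed agent model. Everything in it is hypothetical and chosen for illustration: the three candidate utilities, the one-step choice setting (rather than full observation histories), and the Boltzmann-rational agent model with temperature `beta`.

```python
import math

ACTIONS = ["apple", "cake", "mud"]

# Hypothetical hypothesis space of utility functions over outcomes.
CANDIDATES = {
    "likes_sweet": {"apple": 1.0, "cake": 2.0, "mud": -1.0},
    "likes_healthy": {"apple": 2.0, "cake": 0.5, "mud": -1.0},
    "likes_mud": {"apple": -1.0, "cake": -1.0, "mud": 2.0},
}


def choice_probs(utility, beta=2.0):
    """Assumed agent model: noisily rational (softmax) choice over actions."""
    weights = {a: math.exp(beta * utility[a]) for a in ACTIONS}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def posterior(observed):
    """Posterior over candidate utilities under a uniform prior."""
    log_post = {}
    for name, utility in CANDIDATES.items():
        probs = choice_probs(utility)
        log_prior = math.log(1.0 / len(CANDIDATES))
        log_post[name] = log_prior + sum(math.log(probs[a]) for a in observed)
    # Normalize in log space for numerical stability.
    m = max(log_post.values())
    unnorm = {n: math.exp(lp - m) for n, lp in log_post.items()}
    z = sum(unnorm.values())
    return {n: v / z for n, v in unnorm.items()}


observed = ["cake", "cake", "apple", "cake", "cake"]
post = posterior(observed)
best = max(post, key=post.get)
print(best, post)
```

The inferred utility depends on the assumed agent model as much as on the data: with a different `choice_probs`, the same choices would support a different utility function. That dependence is where the disagreement in this thread lies.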

If you write down an accurate model of the brain as "rational behavior + noise," I suspect the noise model ends up being exactly as complex as the human brain itself, since it has to e.g. specify how we think about things in order to predict what things we won't have time to think about.

I'm not sure what you mean by this. Noise is used in generative models to cope with the fact that we can't train perfect predictors. With infinite inference power the model search is likely to find very low complexity solutions, but there will always be some number of complexity bits that go somewhere - in your unknown hyperparameters, noise, whatever. The type of model I was imagining was one that parameterized all of model space (à la Solomonoff Induction), rather than one which uses noise explicitly.

And once the complexity is in the noise model, the normal model selection test isn't going to really work for finding the values. E.g., if you imposed the obvious kind of structure, I wouldn't be surprised if the utility function you got out was not at all what humans valued, but just a useful heuristic for explaining some small part of human behavior.

The problem definition you gave in your blog involves "a lookup table of what a human would do after making any sequence of observations". I interpreted that as a perfect training dataset that covers the entire human mindspace. Given infinite inference power, the resulting solutions would be - by the properties of Solomonoff Induction - the best possible explanation of that data - and vastly superior to anything humans will ever come up with. At a philosophical level infinite inference power corresponds to actually instantiating entire multiverses just to solve the problem.

Now it could be that human minds can not be described very well by any type of RL agent architecture for any possible utility function. I very much doubt this, because that's an extremely general agent framework. However, even if that were true (which it isn't), then the infinite inference engine would recover the ultimate approximation given those assumptions, which is probably good enough.

Do you see any reason to suspect otherwise, or generally any evidence to make you optimistic about this project succeeding? Can you imagine any kind of breakdown that doesn't obviously fail?

The project is just a thought experiment, because we will never have infinite inference power and infinite perfect training data.

That being said, I still think the general approach is probably correct and could lead to approximately friendly AI eventually. The challenges naturally come from limited inference power and limited training data - with the latter being the especially difficult part.

The most important training data will be data covering hypothetical future situations. I don't yet see how to handle this. Maybe there is some simpler extrapolation technique, where the agent can learn some simple general principle - such as "humans prefer control over their future observation history" - that once mastered, allows extrapolation to avoiding death, wireheading, etc etc.

The other practical difficulty is testing. The most important situations we want to test are exactly those which we cannot - future hypotheticals.

So we may disagree about how optimistic we should be about this research project, but hopefully we can agree that it's a research project that (1) would have to be solved in order for this approach to AI control to work, though it could be solved incidentally, and (2) is not currently benefiting that much from conventional research in AI.

In regards to 1.) I think this is essentially the only feasible practical approach to the FAI utility function problem on the table - at least that I am aware of. 2.) Is not entirely correct - as this approach is enabled by all the great progress in machine learning in improving our general inference capabilities.

On the other hand, machine learning is very much an experimental engineering field - progress comes mostly from experiments rather than theory. So how can we set up a series of experiments that leads us to friendly superintelligence? - that appears to be a core hard problem. One analogy that comes to mind is the creation of a new large nation state - especially one of a new type - like the US or the French Republic. Unfortunately that is just not something that one can learn how to do through a large number of experiments.

One approach that could have promise is to learn a scaling theory. Or perhaps we focus on collective superintelligence where a large number of AIs learn the values of many humans and we let game theory sort it out.

All we need to worry about is ensuring that our problem definition actually solves the correct problem.

I agree.

I interpreted that as a perfect training dataset that covers the entire human mindspace. Given infinite inference power, the resulting solutions would be - by the properties of Solomonoff Induction - the best possible explanation of that data - and vastly superior to anything humans will ever come up with.

That's the interpretation I had in mind.

Now it could be that human minds can not be described very well by any type of RL agent architecture for any possible utility function.

As you point out, this is very unlikely. The question is whether the learned utility functions actually capture what humans care about.

If you think through a few easy approaches, you will see that they predictably fail. We can discuss in more detail, but it would be easier if you provided more insight into what kind of approach you are optimistic about. I can argue against N of them, but you will think that at least N-1 are straw men.

The most natural approach, to an LW mindset, is to define a basic framework for "RL agents," that has a slot for a utility function. Then we can take a simplicity prior over models that fit into this basic framework, and do inference to find a posterior distribution over models, and hence over utility functions. If this is what you have in mind, I'm happy to comment in more depth on why I'm pessimistic.

The basic problem is that the simplest model of a human is clearly not as an RL agent, it's to directly model the many particular cognitive effects that shape human behavior. For any expressive framework, the most parsimonious model is going to throw out your framework and just model these cognitive effects directly. Of course it can't literally throw out your framework, but it can do so in all but name. For a crude example, the definition of "utility function" could consult the real model of the human to figure out what action a human would take, and then output a simple utility function that directly incentivized the predicted actions.

This will break your intended correspondence between the box in your model labeled "utility" and the actual values of the human subject, and if you give this utility function to a stronger RL agent I don't think the results will be satisfactory.
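The crude example above can be sketched in a few lines. All names, states, and numbers here are hypothetical: `TRUE_VALUE` stands in for what the human actually cares about, `human_policy` for a perfect predictive model of their behavior (including a quirk that makes them pick the donut), and `degenerate_utility` for a "utility function" that simply rewards whatever the model predicts.

```python
# Hypothetical ground truth: what the human actually values.
TRUE_VALUE = {
    ("hungry", "salad"): 1.0, ("hungry", "donut"): -1.0,
    ("tired", "sleep"): 1.0, ("tired", "coffee"): -0.5,
}

ACTIONS = {"hungry": ["salad", "donut"], "tired": ["sleep", "coffee"]}


def human_policy(state):
    """Perfect predictive model of the human, cognitive quirks included."""
    return {"hungry": "donut", "tired": "sleep"}[state]


def degenerate_utility(state, action):
    """Consults the behavior model and rewards the predicted action."""
    return 1.0 if action == human_policy(state) else 0.0


def optimal_action(utility, state):
    """A 'stronger agent' that maximizes the supplied utility exactly."""
    return max(ACTIONS[state], key=lambda a: utility(state, a))


agent_choices = {s: optimal_action(degenerate_utility, s) for s in ACTIONS}
fits_perfectly = all(agent_choices[s] == human_policy(s) for s in ACTIONS)
true_score = sum(TRUE_VALUE[(s, a)] for s, a in agent_choices.items())
print(agent_choices, fits_perfectly, true_score)
```

The degenerate utility explains every observed choice, so a model selection test based on fit alone cannot tell it apart from the true values. An agent that maximizes it reproduces the donut habit instead of correcting it.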

If we were to pick any concrete model I am quite confident that I could demonstrate this kind of behavior. I suspect that the only way we can avoid it is by being sufficiently vague about the approach that we can't make any concrete statements about what kind of representation it would learn.

2.) Is not entirely correct - as this approach is enabled by all the great progress in machine learning in improving our general inference capabilities.

Yes, actually getting a solution would require impressive inference capability. For now I'm happy to suppose that continuing AI progress will deliver inference abilities that are up to the task.

But I am especially interested in the residual---even if your inference abilities are as good as you could ask for, how do you solve the problem? It is about this residual that I am most pessimistic, and improvements in our inference ability don't help.

The most natural approach, to an LW mindset, is to define a basic framework for "RL agents," that has a slot for a utility function. Then we can take a simplicity prior over models that fit into this basic framework, and do inference to find a posterior distribution over models, and hence over utility functions. If this is what you have in mind, I'm happy to comment in more depth on why I'm pessimistic.

Yes, more or less. I should now point out that almost everything of importance concerning the outcome is determined by the training dataset, not the model prior. This may seem counter-intuitive at first, but it is true and important.

The basic problem is that the simplest model of a human is clearly not as an RL agent, it's to directly model the many particular cognitive effects that shape human behavior.

This is not clear at all, and furthermore appears to contradict what you agreed to earlier above - namely that human minds can be described well as a specific type of RL agent with some particular utility function.

I consider myself reasonably up to date in both computational neuroscience and ML, and the most successful over-arching theory for explaining the brain today is indeed as a form of RL agent. Thus the RL framework in some sense is the most general framework we have and it includes human, animal, and a wide class of machine agents as special cases.

For any expressive framework, the most parsimonious model is going to throw out your framework and just model these cognitive effects directly.

The 'framework' I proposed is minimal - describing the class of all RL agents requires just a few lines of math. Remember the training set is near infinite and perfect, so the tiny number of bits I am imposing on the model prior matters not at all.

You seem to perhaps believe that I am specifying a framework in terms of modules or connections or whatever on the agent, and that was not my idea at all (at least in the infinite computing case). I was proposing the absolute minimal assumptions. The inference engine will explore the model space - and probably come up with something ridiculous like simulations of universes if you give it infinite compute. With practical but very large amounts of compute power, it will - probably - come up with some sort of approximate brain-like ANN solution.

If we were to pick any concrete model I am quite confident that I could demonstrate this kind of behavior. I suspect that the only way we can avoid it is by being sufficiently vague about the approach that we can't make any concrete statements about what kind of representation it would learn.

I am skeptical you could demonstrate this, but you could start by taking one of the existing IRL systems in the literature and demonstrating the failure there. Or maybe I am unclear on the nature of your concern. You seem to be concerned with the details of how the resulting model works. I believe that is a fundamentally misguided notion, and instead we really care only about results. This could be a fundamental difference in mindsets - I'm very much an engineer.

In other words, the ultimate question is this: is the resulting agent better at doing what we actually want (on whatever set of tasks the training set includes) than the human experts that are the source of that training data?

For after all, that is the key advantage of RL techniques over supervised learning, an advantage which IRL inherits.

So here is a more practical set of experiments we could do today. Take a deep RL agent like DeepMind's Atari player. But instead of training it using the internal score as the reward function directly, we use IRL using traces of expert human play. We can compare to a baseline with the same model but trained using supervised learning. The supervised baseline would learn human errors and thus would asymptote at human level play. The IRL agent instead should eventually learn a good approximation of the score function as its utility/reward function and thus achieve capability close to the original RL agent.
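A toy version of this comparison, far simpler than an Atari agent and intended only to illustrate the expected qualitative difference. The two-armed setting, the expert trace with a 20% error rate, and the Boltzmann-rational expert model used for the IRL step are all assumptions made for the sketch.

```python
import math
from collections import Counter

TRUE_REWARD = {"left": 1.0, "right": 0.0}      # hidden score function (hypothetical)
expert_trace = ["left"] * 8 + ["right"] * 2    # expert errs 20% of the time


def expected_reward(policy):
    """Expected true reward of a stochastic policy over the two actions."""
    return sum(p * TRUE_REWARD[a] for a, p in policy.items())


def behavioral_cloning(trace):
    """Supervised baseline: imitate the expert's action frequencies, errors included."""
    counts = Counter(trace)
    return {a: counts[a] / len(trace) for a in TRUE_REWARD}


def irl_then_optimize(trace):
    """Infer a reward, then act greedily on it.

    If the expert picks actions with probability proportional to
    exp(beta * reward), then log-frequency recovers the reward up to a
    positive scale and a shift, which is enough to rank actions.
    """
    counts = Counter(trace)
    inferred = {a: math.log(counts[a] / len(trace)) for a in TRUE_REWARD}
    best = max(inferred, key=inferred.get)
    return {a: 1.0 if a == best else 0.0 for a in TRUE_REWARD}


bc_score = expected_reward(behavioral_cloning(expert_trace))
irl_score = expected_reward(irl_then_optimize(expert_trace))
print(bc_score, irl_score)
```

The cloned policy inherits the expert's error rate, while the agent that optimizes the inferred reward does not. Whether this carries over to richer settings depends on how well the assumed expert model matches real human play.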

A cool variation would be to add another training sequence where the human expert has additional constraints - such as maximize score without killing any other 'agents'. For the games for which that applies, I think that would be a really cool important demonstration of the beginnings of learning ethical behavior from humans.

So the core idea is to apply that same concept, but to life in general, where our 'game world' is the real world, and there is no predefined score function, and the ideal utility function must be inferred.

But I am especially interested in the residual---even if your inference abilities are as good as you could ask for, how do you solve the problem?

I don't claim to have a clear solution to the full problem yet, but my thought experiment above sketches out the vague beginnings of an IRL based solution. Again the training is everything - so the full solution becomes something more like educating an AI population, a problem that goes far beyond the basic math or machine learning and connects to politics, education, game theory, etc.

Remember the training set is near infinite and perfect, so the tiny number of bits I am imposing on the model prior matters not at all.

Yes, the model you get won't depend at all on the tiny number of bits that you are imposing, unless your model class is extremely crippled. This is precisely my point. You will get a really good model. But you imposed some structure in the model, perhaps with a little box labeled "utility function." After inference, that box isn't going to have the utility function in it. Why would your universe-simulating model bother dividing itself neatly into "utility function" and "everything else"? It will just ignore your division and do whatever is most efficient.

You seem to be concerned with the details of how the resulting model works. I believe that is a fundamentally misguided notion, and instead we really care only about results

I believe you will get out a model that predicts human behavior well. I think we can agree on that! But it's just not enough to do anything with. Now you have a simulation of a human; what do you do with it?

You are making a further claim---that in the box labeled "utility function," the model will put a reasonable representation of a human utility function, such that you'd be happy with your AI maximizing that utility function. It seems like you are the one making a detailed assumption about how the learned model works, an assumption which seems implausible to me. If you think you aren't making such an assumption, could you express (even very informally) the argument that the IRL agent will work well?

If your model doesn't have a box labeled "utility function," can you say again how you are extracting the utility function from the learned model?

Or do you think that you will not find a reasonable utility function, but produce desirable behavior anyway? I don't understand why this would happen.

I am skeptical you could demonstrate this, but you could start by taking one of the existing IRL systems in the literature and demonstrating the failure there.

We seem to be talking past each other. Could you cite a paper with what you think is a plausible model? I could respond to any of them, but again it would feel like a straw man, because I don't think that the authors of these papers expect them to apply to general human behavior.

For example, most of these models make no attempt to model reasoning, and instead assume e.g. that the probability that an agent takes an action depends only on the payoff of that action. This is obviously not a very good model! How do you see this working?

So here is a more practical set of experiments we could do today...

I agree that this experiment can probably yield better behavior than training a supervised learner to reproduce human play.

But existing approaches won't scale to learn perfect play, even with infinite computing power and unlimited training data, except in extremely simple environments. To make this clear you'd have to fix a particular model, which I invite you to do. But I think that most (all?) models in the literature will converge to exactly reproducing the "modal human policy" (in each state, do the thing that the expert is most likely to do) in the limit of infinite training data and a sufficiently rich state space. Do you have a counterexample in mind?
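To make the term concrete, here is a minimal sketch of what the "modal human policy" would look like in a tabular setting. The state and action names are made up for illustration, and the function name `modal_policy` is hypothetical rather than taken from any IRL library.

```python
from collections import Counter, defaultdict


def modal_policy(demonstrations):
    """Map each state to the action the expert took most often in that state."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


# Hypothetical expert traces: (state, action) pairs.
demos = [
    ("s0", "left"), ("s0", "left"), ("s0", "right"),
    ("s1", "right"), ("s1", "right"),
]
policy = modal_policy(demos)
```

A learner that converges to this table reproduces the expert's most common choice in every state, including any systematic mistakes, which is why it can match but not exceed the expert.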

You can probably get optimal play in the atari case by leaning heavily on the simplicity prior for the rewards and neglecting the training data. But earlier in your comment it (very strongly) sounded like you wanted to let the training data wash out the prior.

But you imposed some structure in the model, perhaps with a little box labeled "utility function." After inference, that box isn't going to have the utility function in it. Why would your universe-simulating model bother dividing itself neatly into "utility function" and "everything else"? It will just ignore your division and do whatever is most efficient.

Hmm at this point I should now actually write out a simple RL model to help me understand your critique.

Here is some very simple math for a general RL setup (bellman-style recursive function form):

model = p(s,a,s')

policy(s) = argmax_a Q(s,a)

Q(s,a) = sum_s' p(s,a,s') [ R(s') + gV(s') ]

V(s) = max_a Q(s,a)

The function p(s,a,s') is the agent's world model which gives transition probabilities between consecutive states (s,s') on action a. The states really are observation histories - entire sequences of observations. The variable 'g' represents the discount factor (although really this should probably be an unknown function). R(s') is the reward/utility function, and Q(s,a) is the action-value function that results from planning ahead to optimize R. The decision/policy function just selects the best action.
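As a sanity check on the recursion, here is a minimal value-iteration sketch of those four lines, with the backup taken over the successor-state value V(s') as in the standard Bellman form. The two-state MDP, its rewards, and the discount value are invented for illustration. States here are plain labels rather than full observation histories.

```python
# Hypothetical 2-state, 2-action MDP, made up purely for illustration.
# p[s][a][s2] is the transition probability p(s, a, s').
p = {
    0: {"stay": {0: 1.0, 1: 0.0}, "move": {0: 0.2, 1: 0.8}},
    1: {"stay": {0: 0.0, 1: 1.0}, "move": {0: 0.8, 1: 0.2}},
}
R = {0: 0.0, 1: 1.0}  # reward R(s') for landing in state s'
g = 0.9               # discount factor (assumed constant here)


def q_value(s, a, V):
    """Q(s,a) = sum_s' p(s,a,s') [ R(s') + g V(s') ]."""
    return sum(prob * (R[s2] + g * V[s2]) for s2, prob in p[s][a].items())


def value_iteration(tol=1e-10, max_iters=10_000):
    """Repeat V(s) = max_a Q(s,a) until the values stop changing."""
    V = {s: 0.0 for s in p}
    for _ in range(max_iters):
        new_V = {s: max(q_value(s, a, V) for a in p[s]) for s in p}
        if max(abs(new_V[s] - V[s]) for s in p) < tol:
            return new_V
        V = new_V
    return V


def policy(s, V):
    """policy(s) = argmax_a Q(s,a)."""
    return max(p[s], key=lambda a: q_value(s, a, V))


V = value_iteration()
greedy = {s: policy(s, V) for s in p}
```

In this toy problem the greedy policy moves out of state 0 and stays in state 1, since only landing in state 1 is rewarded.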

We condition on the actions and observations to learn the best model, reward and discount functions. And now I see your point (I think), after writing this out, that the model and reward functions are not really well distinguished and either could potentially substitute for the other (as they just multiply). It could learn a reward function that is just '1' and stuff everything in the model.

So - yes, we need more prior structure than the 4 lines of math model. My initial guess was about 100 lines of math code in a tight prob prog model, which still may be reasonable in the future but is perhaps slightly optimistic.

Ok, so here is version 2. We know roughly that the cortex is responsible for modelling the world and we know its rough circuit complexity. So we can use that as a prior on the model function. Better yet, we can train the model function separately (constrained to cortex size or smaller), without including the policy function/argmax stuff, and on a dataset which includes situations where no actions are taken, forcing it to learn a world model first. Then we can use those results as an initial prior when we train the whole thing on the full dataset with the actions.
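A minimal sketch of the pretraining step in version 2, assuming a tabular world and using plain transition counting as a stand-in for whatever model class would actually be used. The `"noop"` label, the toy log, and the function name `fit_transition_model` are all hypothetical.

```python
from collections import Counter, defaultdict


def fit_transition_model(transitions):
    """Maximum-likelihood tabular estimate of p(s' | s, a) from (s, a, s') triples."""
    counts = defaultdict(Counter)
    for s, a, s2 in transitions:
        counts[(s, a)][s2] += 1
    return {
        key: {s2: n / sum(c.values()) for s2, n in c.items()}
        for key, c in counts.items()
    }


# Hypothetical passive observations: "noop" marks steps where no action was taken,
# so the fit cannot lean on any reward or policy to explain the data.
passive_log = [
    ("day", "noop", "night"),
    ("night", "noop", "day"),
    ("day", "noop", "night"),
    ("day", "noop", "day"),
]
world_model = fit_transition_model(passive_log)
```

The resulting `world_model` would then serve as the starting point (or a fixed component) when the reward and policy parts are fit on the full dataset with actions.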

That doesn't totally solve the general form of your objection, but it at least forces the utility function to be somewhat more sensible. I can now kind of see where version 100 of this idea or so is going and how it could work well, but it probably requires increasingly complex models of human-like brains (along with more complex training schemes) as priors.

If your model doesn't have a box labeled "utility function," can you say again how you are extracting the utility function from the learned model?

Extract is perhaps not the right word, but the general idea is that once we have learned a human-level model function and reward function, in theory we can get superintelligent extrapolation by improving the model function, running it faster, and/or eliminating any planning limitations or noise. The model function we learn to explain human data in particular will only know/model what humans actually know.

So here is a more practical set of experiments we could do today...

I agree that this experiment can probably yield better behavior than training a supervised learner to reproduce human play. . . . But I think that most (all?) models in the literature will converge to exactly reproducing the "modal human policy" (in each state, do the thing that the expert is most likely to do) in the limit of infinite training data and a sufficiently rich state space. Do you have a counterexample in mind?

The modal human policy, as you describe it, sounds identical to the supervised learner which just reproduces human ability. Beating supervised learning (the modal human policy) is again what really matters.

You can probably get optimal play in the atari case by leaning heavily on the simplicity prior for the rewards and neglecting the training data.

Not sure what you mean here - you need the training data to get up to any decent level of play. Perhaps you were thinking only of the utility function, but to learn that you still need some training data.

The deep ANN approach to RL is still new, and hasn't been merged with IRL research yet, which mostly appears to be in the small model stage (with the exception perhaps of some narrow applications in robotics and pathfinding).

the model and reward functions are not really well distinguished and either could potentially substitute for the other (as they just multiply)

They can also substitute in more subtle ways, e.g. by learning R(s) = 1 if the last action implied by the state history matches the predicted human action. If the human is doing RL imperfectly then that is going to have a much better explanatory fit to the data (it can be arbitrarily good, while any model of a human as a perfect RL agent will lose Bayes points all over the place), so you have to rely on the prior to see that it's a "bad" model.
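A toy illustration of that fit gap, under assumptions that are mine rather than the thread's: a one-step choice problem, and a Boltzmann (softmax) choice model where the probability of an action is proportional to exp(beta * reward). The states, actions, and the human's "habit" are invented.

```python
import math

ACTIONS = ["a", "b"]

# Hypothetical setup: action "a" is truly better in every state, but the human
# habitually picks "b" in states 3 and 4 (systematic mistakes).
human_habit = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b"}
demos = list(human_habit.items())


def true_reward(state, action):
    return 1.0 if action == "a" else 0.0


def imitation_reward(state, action):
    """Degenerate reward: 1 whenever the action matches the predicted human action."""
    return 1.0 if action == human_habit[state] else 0.0


def log_likelihood(reward, demos, beta):
    """Log-probability of the demos if actions are drawn from softmax(beta * reward)."""
    total = 0.0
    for state, action in demos:
        scores = {a: beta * reward(state, a) for a in ACTIONS}
        log_z = math.log(sum(math.exp(v) for v in scores.values()))
        total += scores[action] - log_z
    return total


ll_true = log_likelihood(true_reward, demos, beta=5.0)
ll_imitation = log_likelihood(imitation_reward, demos, beta=5.0)
```

Here `ll_imitation` is much closer to zero than `ll_true`, and it keeps approaching zero as `beta` grows, while the true reward is penalized for every mistaken choice. Only the prior can tell the two apart.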

it probably requires increasingly complex models of human-like brains (along with more complex training schemes) as priors

That's my concern; I think things get pretty hairy, and moreover I don't know whether the resulting systems would typically be competitive with (e.g.) the best RL agents that we could design by more direct methods.

once we have learned a human-level model function and reward function

That's what I mean by a "box labeled 'utility function'."

The modal human policy, as you describe it, sounds identical to the supervised learner which just reproduces human ability. Beating supervised learning (the modal human policy) is again what really matters.

Yes. Do you know any model of IRL that can (significantly) beat the modal human policy in this context?

Not sure what you mean here - you need the training data to get up to any decent level of play

Sorry, I meant "assign low total weight" to the training data, so that the learner can infer that some of the human's decisions were probably mistakes (since they can only be explained by an artificial reward function). This is very delicate, and it requires paying more attention to the prior than you seemed to want to (and more attention to the prior than is consistent with actually making good predictions about human behavior).

the model and reward functions are not really well distinguished and either could potentially substitute for the other (as they just multiply)

They can also substitute in more subtle ways, e.g. by learning R(s) = 1 if the last action implied by the state history matches the predicted human action. If the human is doing RL imperfectly then that is going to have a much better explanatory fit to the data (it can be arbitrarily good, while any model of a human as a perfect RL agent will lose Bayes points all over the place), so you have to rely on the prior to see that it's a "bad" model.

That may or may not be a problem with the simplest version 1 of the idea, but it is not a problem in version 2 which imposes more realistic priors/constraints and also uses model pretraining on just state transitions to force differentiation of the model and reward functions.

I think things get pretty hairy, and moreover I don't know whether the resulting systems would typically be competitive with (e.g.) the best RL agents that we could design by more direct methods.

Ok, I think we are kind of in agreement, but first let me recap where we are. This all started when I claimed that your 'easy IRL problem' - solve IRL given infinite compute and infinite perfect training data - is relatively easy and could probably be done in 100 lines of math. We both agreed that supervised learning (reproducing the training set - the modal human policy) would be obviously easy in this setting.

After that the discussion forked and got complicated, which, I realize in hindsight, stems from not clearly specifying what would count as success. So to be more clear - success of the IRL approach can be measured as improvement over supervised learning, as measured in the recovered utility function. Which of course leads to this whole other complexity - how do we know that is the 'true utility function'? Leave that aside for a second, and I'll get back to it.

I then brought up a concrete example of using IRL on a deep RL Atari agent. I described how learning the score function should be relatively straightforward, and this would allow an IRL agent to match the performance of the RL agent in this domain, which leads to better performance than the supervised/modal human policy.

You agreed with this:

So here is a more practical set of experiments we could do today...

I agree that this experiment can probably yield better behavior than training a supervised learner to reproduce human play.

So it seems we have agreed that IRL surpassing the modal human policy is clearly possible - at least in the limited domain of atari.

If we already know the utility function a priori, then obviously IRL given the same resources can only do as well as RL. But that isn't that interesting, and remember IRL can do much more - as in the example of learning to maximize score while under other complex constraints.

So in scaling up to more general problem domains, we have the issue of modelling mistakes - which you seem to be especially focused on - and the related issue of utility function uniqueness.

Versions 2 and later of my simple proto-proposal use more informed priors for the circuit complexity combined with pretraining the model on just observations to force differentiate the model and utility functions. In the case of atari, getting the utility function to learn the score should be relatively easy - as we know it is a simple immediate visual function.

This type of RL architecture can model humans' limited rationality by bounding the circuit complexity - at least that's the first step. We could get increasingly more accurate models of the human decision surface by incorporating more of the coarse abstract structure of the brain as a prior over our model space.

Ok, so backing up a bit :

it probably requires increasingly complex models of human-like brains (along with more complex training schemes) as priors

That's my concern; I think things get pretty hairy, and moreover I don't know whether the resulting systems would typically be competitive with (e.g.) the best RL agents that we could design by more direct methods.

For the full AGI problem, I am aware of a couple of interesting candidates for an intrinsic reward/utility function - the future freedom of action principle (power) and the compression progress measure (curiosity). If scaled up to superhuman intelligence, I think/suspect you would agree that both of these candidates are probably quite dangerous. On the other hand, they seem to capture some aspects of humans' intrinsic motivators, so they may be useful as subcomponents or features.

The IRL approach - if taken all the way - seems to require reverse engineering the brain. It could be that any successful route to safe superintelligence just requires this - because the class of agents that combine our specific complex unknown utility functions with extrapolated superintelligence necessarily can only be specified in reference to our neural architecture as a starting point.

The IRL approach - if taken all the way - seems to require reverse engineering the brain. It could be that any successful route to safe superintelligence just requires this - because the class of agents that combine our specific complex unknown utility functions with extrapolated superintelligence necessarily can only be specified in reference to our neural architecture as a starting point.

This sounds really interesting and important (if true), but I have only a vague understanding of how you arrived at this conclusion. Please consider writing a post about it.

This sounds really interesting and important (if true), but I have only a vague understanding of how you arrived at this conclusion. Please consider writing a post about it.

It's not so much a conclusion as an intuition, and most of the inferences leading up to it are contained in this thread with PaulChristiano and a related discussion with Kaj Sotala.

I'm interested in IRL and I think it's the most promising current candidate for value learning, but I must admit I haven't read much of the relevant literature yet. Reading up on IRL and writing a discussion post on it has been on my todo list - your comment just bumped it up a bit. :)

Another related issue is the more general question of how the training data/environment determines/shapes safety issues for learning agents.

My reaction when I first came across IRL is similar to this author's:

However, the current IRL methods are limited and cannot be used for inferring human values because of their long list of assumptions. For instance, in most IRL methods the environment is usually assumed to be stationary, fully observable, and some- times known; the policy of the agent is assumed to be stationary and optimal or near-optimal; the reward function is assumed to be stationary as well; and the Markov property is assumed. Such assumptions are reasonable for limited motor control tasks such as grasping and manipulation; however, if our goal is to learn high-level human values, they become unrealistic.

But maybe it's not a bad approach for solving a hard problem to first solve a very simplified version of it, then gradually relax the simplifying assumptions and try to build up to a solution of the full problem.

My reaction when I first came across IRL is similar to this author's:

As a side note, that author's attempt at value learning is likely to suffer from the same problem Christiano brought up in this thread - there is nothing to enforce that the optimization process will actually nicely separate the reward and agent functionality. Doing that requires some more complex priors and/or training tricks.

The author's critique about limiting assumptions may or may not be true, but the author only quotes a single paper from the IRL field - and it's from 2000. That paper and its follow-up each have 500+ citations, and some of the newer work with IRL in the title is from 2008 or later. Also - most of the related research doesn't use IRL in the title - e.g. "Probabilistic reasoning from observed context-aware behavior".

But maybe it's not a bad approach for solving a hard problem to first solve a very simplified version of it, then gradually relax the simplifying assumptions and try to build up to a solution of the full problem.

This is actually the mainline successful approach in machine learning - scaling up. MNIST is a small 'toy' visual learning problem, but it led to CIFAR10/100 and eventually ImageNet. The systems that do well on ImageNet descend from the techniques that did well on MNIST decades ago.

MIRI/LW seems much more focused on starting with a top-down approach where you solve the full problem in an unrealistic model - given infinite compute - and then scale down by developing some approximation.

Compare MIRI/LW's fascination with AIXI vs the machine learning community. Searching for "AIXI" on r/machinelearning gets a single hit vs 634 results on lesswrong. Based on #citations of around 150 or so, AIXI is a minor/average paper in ML (more minor than IRL), and doesn't appear to have led to great new insights in terms of fast approximations to Bayesian inference (a very active field that connects mostly to ANN research).

MIRI is taking the top-down approach since that seems to be the best way to eventually obtain an AI for which you can derive theoretical guarantees. In the absence of such guarantees, we can't be confident that an AI will behave correctly when it's able to think of strategies or reach world states that are very far outside of its training and testing data sets. The price for pursuing such guarantees may well be slower progress in making efficient and capable AIs, with impressive and/or profitable applications, which would explain why the mainstream research community isn't very interested in this approach.

I tend to agree with MIRI that the top-down approach is probably safest, but since it may turn out to be too slow to make any difference, we should be looking at other approaches as well. If you're thinking about writing a post about recent progress in IRL and related ideas, I'd be very interested to see it.

MIRI is taking the top-down approach since that seems to be the best way to eventually obtain an AI for which you can derive theoretical guarantees.

I for one remain skeptical that such theoretical guarantees are possible in principle for the domain of general AI. The utility of formal math for a domain tends to vary inversely with domain complexity. For example, in some cases it may be practically possible to derive formal guarantees about the full output space of a program, but not when that program is as complex as a modern video game, let alone a human. The equivalent of theoretical guarantees may be possible/useful for something like a bridge, but less so for an airplane or a city.

For complex systems simulations are the key tool that enables predictions about future behavior.

In the absence of such guarantees, we can't be confident that an AI will behave correctly when it's able to think of strategies or reach world states that are very far outside of its training and testing data sets.

This indeed would be a problem if the AI's training ever stopped, but I find this extremely unlikely. Some AI systems already learn continuously - whether using online learning directly or by just frequently patching the AI with the results of updated training data. Future AI systems will continue this trend - and learn continuously like humans.

Much depends on one's particular models for how the future of AI will pan out. I contend that AI does not need to be perfect, just better than humans. AI drivers don't need to make optimal driving decisions - they just need to drive better than humans. Likewise AI software engineers just need to code better than human coders, and AI AI researchers just need to do their research better than humans. And so on.

The price for pursuing such guarantees may well be slower progress in making efficient and capable AIs, with impressive and/or profitable applications, which would explain why the mainstream research community isn't very interested in this approach.

For the record, I do believe that MIRI is/should be funded at some level - it's sort of a moonshot, but one worth taking given the reasonable price. Mainstream opinion on the safety issue is diverse, and there are increasingly complex PR and career issues to consider. For example, corporations are motivated to downplay long term existential risks, and in the future will be motivated to downplay similarity between AI and human cognition to avoid regulation.

If you're thinking about writing a post about recent progress in IRL and related ideas, I'd be very interested to see it.

Cool - I'm working up to it.

Future AI systems will continue this trend - and learn continuously like humans.

Sure, but when it comes to learning values, I see a few problems even with continuous learning:

  1. The AI needs to know when to be uncertain about its values, and actively seek out human advice (or defer to human control) in those cases. If the AI is wrong and overconfident (like in http://www.evolvingai.org/fooling but for values instead of image classification) even once, we could be totally screwed.
  2. On the other hand, if the AI can think much faster than a human (almost certainly the case, given how fast hardware neurons are even today), learning from humans in real time will be extremely expensive. There will be high incentive to lower the frequency of querying humans to a minimum. Those willing to take risks, or think that they have a simple utility function that the AI can learn quickly, could have a big advantage in how competitive their AIs are.
  3. I don't know what my own values are, especially when it comes to exotic world states that are achievable post-Singularity. (You could say that my own training set was too small. :) Ideally I'd like to train an AI to try to figure out my values the same way that I would (i.e., by doing philosophy), but that might require very different methods than for learning well-defined values. I don't know if incremental progress in value learning could make that leap.

For complex systems simulations are the key tool that enables predictions about future behavior. [...] I contend that AI does not need to be perfect, just better than humans.

My point was that an AI could do well on test data, including simulations, but get tripped up at some later date (e.g., it over-confidently thinks that a certain world state would be highly desirable). Another way things could go wrong is that an AI learns wrong values, but does well in simulations because it infers that it's being tested and tries to please the human controllers in order to be released into the real world.

I generally agree that learning values correctly will be a challenge, but it's closely related to general AGI challenges.

I'm also reasonably optimistic that we will be able to reverse engineer the brain's value learning mechanisms to create agents that are safer than humans. Fully explaining the reasons behind that cautious optimism would require a review of recent computational neuroscience (the LW consensus on the brain is informed primarily by a particular narrow viewpoint from ev psych and the H&B literature, and this position is in substantial disagreement with the viewpoint from comp neuroscience.)

  1. The AI needs to know when to be uncertain about its values,

Mostly agreed. However, it is not clear that actively deferring to humans is strictly necessary. In particular, one route that circumvents most of these problems is testing value learning systems and architectures on a set of human-level AGIs confined to a virtual sandbox where the AGI does not know it is in a sandbox. This allows safe testing of designs to be used outside of the sandbox. The main safety control is knowledge limitation (which is something that MIRI has not considered much at all, perhaps because of their historical anti-machine learning stance).

The fooling CNN stuff does not show a particularly important failure mode for AI. These CNNs are trained only to recognize images in the sense of outputting a 10 bit label code for any input image. If you feed them a weird image, they just output the closest category. The fooling part (getting the CNN to misclassify an image) specifically requires implicitly reverse engineering the CNN and thus relies on the fact that current CNNs are naively deterministic. A CNN with some amount of random sampling based on a secure irreversible noise generator would not have this problem.

  2. [Learning values could take too long, corps could take shortcuts.]

This could be a problem, but even today our main technique to speed up AI learning relies more on parallelization than raw serial speedup. The standard technique involves training 128 to 1024 copies of the AI in parallel, all on different data streams. The same general technique would allow an AI to learn values from a large number of humans in parallel. This also happens to automatically solve some of the issues with value representativeness.

  3. I don't know what my own values are, especially when it comes to exotic world states that are achievable post-Singularity.

The current world is already exotic from the perspective of our recent ancestors. We already have some methods to investigate the interaction of our values with exotic future world states: namely our imagination, as realized in thought experiments and especially science fiction. AI could help us extend these powers.

My point was that an AI could do well on test data, including simulations, but get tripped up at some later date

This is just failure to generalize or overfitting, and how to avoid these problems is much of what machine learning is all about.

Another way things could go wrong is that an AI learns wrong values, but does well in simulations because it infers that it's being tested and tries to please the human controllers in order to be released into the real world.

This failure requires a specific combination: (1) the AI learns a good model of the world, but (2) learns a poor model of human values, (3) learns that it is in a sim, and (4) wants to get out, while (5) the operators fail to ever notice any of 2 through 4.

Is this type of failure possible? Sure. But the most secure/paranoid type of safety model I envision is largely immune to that class of failures. In the most secure model, potentially unsafe new designs are constrained to human-level intelligence and grow up in a safe VR sim (medieval or earlier knowledge-base). Designs which pass safety tests are then slowly percolated up to sims which are closer to the modern world. Each up migration step is like reincarnation - a new AI is grown from a similar seed. The final designs (seed architectures rather than individual AIs) that pass this vetting/testing process will have more evidence for safety/benevolence/altruism than humans.

Fully explaining the reasons behind that cautious optimism would require a review of recent computational neuroscience (the LW consensus on the brain is informed primarily by a particular narrow viewpoint from ev psych and the H&B literature, and this position is in substantial disagreement with the viewpoint from comp neuroscience.)

Sounds like another post to look forward to.

The current world is already exotic from the perspective of our recent ancestors.

I think we'll need different methods to deal with future exoticness though. See this post for some of the reasons.

In the most secure model, potentially unsafe new designs are constrained to human-level intelligence and grow up in a safe VR sim (medieval or earlier knowledge-base).

Do you envision biological humans participating in the VR sim, in order to let the AI learn values from them? If so, how to handle speed differences that may be up to a factor of millions (which you previously suggested will be the case)? The only thing I can think of is to slow the AI down to human speed for the training, which might be fine if your AI group has a big lead and you know there aren't any other AIs out there able to run at a million times human speed. Otherwise, even if you could massively parallelize the value learning and finish it in one day of real time, that could be giving a competitor a million days of subjective time (times how many parallel copies of the AI they can spawn) to make further progress in AI design and other technologies.

The final designs (seed architectures rather than individual AIs) that pass this vetting/testing process will have more evidence for safety/benevolence/altruism than humans.

Safer than humans seems like a pretty low bar to me, given that I think most humans are terribly unsafe. :) But despite various problems I see with this approach, it may well be the best outcome that we can realistically hope for, if mainstream AI/ML continues to make progress at such a fast pace using designs that are hard to reason about formally.

I think we'll need different methods to deal with future exoticness though. See this post for some of the reasons.

Perhaps. The question of uploading comes to mind as something like an 'ontological crisis'. We start with an intuitive model of selfhood built around the concept of a single unique path extending through time, and the various uploading thought experiments upend that model. Humans (at least some) appear to be able to deal with these types of challenges given enough examples to cover the space and enough time to update models.

Do you envision biological humans participating in the VR sim, in order to let the AI learn values from them?

Of course. And eventually we can join the AIs in the VR sim more directly, or at least that's the hope.

If so, how to handle speed differences that may be up to a factor of millions (which you previously suggested will be the case)?

Given some computing network running a big VR AI sim, in theory the compute power can be used to run N AIs in parallel or one AI N times accelerated or anything in between. In practice latency and bandwidth overhead considerations will place limits on the maximum serial speedup.

But either way the results are similar - the core problem is the total throughput of AI thought volume to human monitor thought volume. It's essentially the student/teacher ratio problem. One human could perhaps monitor a couple dozen 'children' AI without sophisticated tools, or perhaps hundreds or even thousands with highly sophisticated narrow AI tools (automated thought monitors and visualizers).

I don't expect this will be a huge issue in practice due to simple economic considerations. AGI is likely to arrive near the time the hardware cost of an AGI is similar to human salary/cost. So think of it in terms of the ratio of human teacher cost to AGI hardware cost. AGI is a no-brainer investment when that cost ratio is 1:1, and just gets better over time.

By the time AGI hardware costs, say, 1/100th of a human teacher (say 20 cents per hour), we are probably already well into the singularity anyway. The current trend is steady exponential progress in driving down hypothetical AGI hardware cost (which I estimate is vaguely around $1,000/hr today - the cost of about 1000 GPUs). If that cost suddenly went down due to some new breakthrough, that would just accelerate the timeline.
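As a rough back-of-the-envelope check on these figures, the sketch below counts how many cost halvings separate the estimated $1,000/hr from a human teacher's cost. The $20/hr teacher cost is inferred from "1/100th ... 20 cents per hour". Framing the trend as repeated halvings is an assumption, and since the comment gives no halving time, the result is in halvings rather than years.

```python
import math

agi_cost_per_hr = 1000.0     # the comment's rough estimate for today
teacher_cost_per_hr = 20.0   # implied by "1/100th ... 20 cents per hour"

def halvings_needed(current_cost, target_cost):
    """Number of times the cost must halve to fall from current to target."""
    return math.log2(current_cost / target_cost)

to_parity = halvings_needed(agi_cost_per_hr, teacher_cost_per_hr)
to_hundredth = halvings_needed(agi_cost_per_hr, teacher_cost_per_hr / 100)

print(f"halvings to reach 1:1 with a teacher: {to_parity:.1f}")
print(f"halvings to reach 1/100th of a teacher: {to_hundredth:.1f}")
```

Under these assumptions, parity is a bit under six halvings away, and the 1/100th point is a further log2(100), or about 6.6, halvings beyond that.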

Humans (at least some) appear to be able to deal with these types of challenges given enough examples to cover the space and enough time to update models.

I don't know how to deal with this myself, and I doubt whether people who claim to be able to deal with these scenarios are doing so correctly. I wrote about this in http://lesswrong.com/lw/g0w/beware_selective_nihilism/

Given some computing network running a big VR AI sim, in theory the compute power can be used to run N AIs in parallel or one AI N times accelerated or anything in between. In practice latency and bandwidth overhead considerations will place limits on the maximum serial speedup.

If you have hardware neurons running at 10^6 times biological speed (BTW, are you aware of HICANN, a chip that today implements neurons running 10^4 times faster than biological? See also this video presentation), would it make sense to implement a time-sharing system where one set of neurons is used to implement multiple AIs running at slower speed? Wouldn't that create unnecessary communication costs (swapping AI mind states in and out of your chips) and coordination costs among the AIs?

would it make sense to implement a time-sharing system where one set of neurons is used to implement multiple AIs running at slower speed? Wouldn't that create unnecessary communication costs

In short, if you don't time share, then you are storing all synaptic data on the logic chip. Thus you need vastly more logic chips to simulate your model, and thus you have more communication costs.

There are a number of tradeoffs here that differ across GPUs vs neuro-ASICs like HICANN or IBM TrueNorth. The analog memristor approaches, if/when they work out, will have similar tradeoffs to neuro-ASICs. (For more on that and another viewpoint, see this discussion with the Knowm guy.)

GPUs are von Neumann machines that take advantage of the 10x or more difference between the per-transistor cost of logic and that of memory. Logic is roughly 10x more expensive, so it makes sense to have roughly 10x more memory bits than logic bits; e.g., a GPU with 5 billion transistors might have 4 gigabytes of off-chip RAM.

So on the GPU (or any von Neumann machine), you are typically always doing time-swapping: simulating some larger circuit by swapping pieces in and out of memory.

The advantage of the neuro-ASIC is energy efficiency: synapses are stored on chip, so you don't have to pay the price of moving data which is most of the energy cost these days. The disadvantages are threefold: you lose most of your model flexibility, storing all your data on the logic chip is vastly more expensive per synapse, and you typically lose the flexibility to compress synaptic data - even basic weight sharing is no longer possible. Unfortunately these problems combine.

Let's look at some numbers. The HICANN chip has 128k synapses in 50 mm^2, and their 8-chip reticle is thus equivalent to a mid-to-high-end GPU in die area. That's 1 million synapses in 400 mm^2. It can update all of those synapses at about 1 MHz - which is about 1 trillion synop-hz.

A GPU using SOTA ANN simulation code can also hit about 1 trillion synop-hz, but with much more flexibility in the tradeoff between model size and speed. In particular 1 million synapses isn't really enough - most competitive ANNs trained today are in the 1 to 10 billion synapse range - which would cost about 1000 times more for the HICANN, because it can only store 1 million synapses per chip, vs 1 billion or more for the GPU.
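The HICANN figures above can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted in the comment; "128k" is read as 128,000, and the 1-billion-synapse target is the low end of the quoted model range.

```python
import math

SYNAPSES_PER_CHIP = 128_000      # "128k synapses in 50 mm^2" (read as 128,000)
CHIPS_PER_RETICLE = 8            # 8-chip reticle, about 400 mm^2
UPDATE_RATE_HZ = 1_000_000       # "about 1 MHz"
TARGET_SYNAPSES = 1_000_000_000  # low end of "1 to 10 billion synapse range"

synapses_per_reticle = SYNAPSES_PER_CHIP * CHIPS_PER_RETICLE
synop_hz = synapses_per_reticle * UPDATE_RATE_HZ

# With no time-sharing, every synapse needs its own on-chip storage,
# so the model size fixes the number of reticles required.
reticles_needed = math.ceil(TARGET_SYNAPSES / synapses_per_reticle)

print(f"synapses per reticle: {synapses_per_reticle:,}")
print(f"throughput: {synop_hz:.2e} synop-hz")
print(f"reticles for a 1e9-synapse model: {reticles_needed}")
```

The reticle count comes out just under 1000, which matches the "about 1000 times more" estimate.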

IBM's TrueNorth can fit more synapses on a chip - 256 million on a GPU-sized chip (5 billion transistors) - but it runs slower, with a similar total synop-hz throughput. The GPU solutions are just far better, overall - for now.

Apparently HICANN was designed before 2008, and uses a 180nm CMOS process, whereas modern GPUs are using 28nm. It seems to me that if neuromorphic hardware catches up in terms of economy of scale and process technology, it should be far superior in cost per neural event. And if neuromorphic hardware does win, it seems that the first AGIs could have a huge amortized cost per hour of operation, and still have a lower cost per unit of cognitive work than human workers, due to running much faster than biological brains.

It seems like this GPU vs neuromorphic question could have a large impact on how the Singularity turns out, but I haven't seen any discussion of it until now. Do you have any other thoughts or references on this topic?

Apparently HICANN was designed before 2008, and uses a 180nm CMOS process, whereas modern GPUs are using 28nm.

That's true, but IBM's TrueNorth is 28 nm, with about the same transistor count as a GPU. It descends from earlier research chips on old nodes that were then scaled up to new nodes. TrueNorth can fit 256 million low-bit synapses on a chip, vs 1 million for HICANN (normalized for chip area). The 28 nm process has roughly 40x the transistor density. So my default hypothesis is that if HICANN were scaled up to 28 nm it would end up similar to TrueNorth in terms of density (although TrueNorth is weird in that it is intentionally much slower than it could be to save energy).
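The "roughly 40x" density figure follows from idealized area scaling, where transistor density goes with the inverse square of the feature size. The sketch below assumes that idealized rule (real processes deviate from it) and applies it to the 1-million-synapse HICANN figure quoted earlier.

```python
OLD_NODE_NM = 180                        # HICANN's process
NEW_NODE_NM = 28                         # TrueNorth / modern GPU process
HICANN_SYNAPSES = 1_000_000              # per GPU-sized area, from the thread
TRUENORTH_SYNAPSES = 256_000_000         # per GPU-sized chip, from the thread

# Idealized assumption: density scales with 1 / (feature size)^2.
density_gain = (OLD_NODE_NM / NEW_NODE_NM) ** 2
scaled_hicann_synapses = HICANN_SYNAPSES * density_gain

print(f"density gain: {density_gain:.1f}x")
print(f"scaled HICANN: {scaled_hicann_synapses / 1e6:.0f} million synapses")
print(f"TrueNorth / scaled HICANN: {TRUENORTH_SYNAPSES / scaled_hicann_synapses:.1f}x")
```

Under this idealized scaling the result lands within an order of magnitude of TrueNorth's 256 million.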

It seems to me that if neuromorphic hardware catches up in terms of economy of scale and process technology, it should be far superior in cost per neural event.

I expect this in the long term, but it will depend on how the end of Moore's Law pans out. Also, current GPU code is not yet at the limits of software simulation efficiency for ANNs, and GPU hardware is still improving rapidly. It just so happens that I am working on a new type of ANN sim engine that is 10x or more faster than current SOTA for networks of interest. My approach could eventually be hardware accelerated. There are some companies already pursuing hardware acceleration of the standard algorithms - such as Nervana, targeting a similar speedup but through dedicated neural ASICs.

One thing I can't stress enough is the advantage of programmable memory for storing weights - sharing and compressing weights helps solve many of the bandwidth problems the GPU would otherwise have.

It seems like this GPU vs neuromorphic question could have a large impact on how the Singularity turns out, but I haven't seen any discussion of it until now. Do you have any other thoughts or references on this topic?

I don't know how much it really affects outcomes - whether one uses clever hardware or clever software, the brain is probably near or on the Pareto surface for statistical inference energy efficiency, and we will probably get close in the near future.

I'm trying to wrap my head around all this, and as someone with no programming/AI background, I found this the clearest article, with the gentlest learning curve, on inverse reinforcement learning.

I know of inverse reinforcement learning and similar ideas; I still argue that they are bad for the same reason.

In regular reinforcement learning, the human presses a button that says "GOOD", and a sufficiently intelligent AI learns that it can just steal the button and press it itself.

In inverse reinforcement learning, the human presses a button that says "GOOD" at first. Then the button is turned off, and the AI is told to predict what actions would have led to the button being pressed. Instead of actual reinforcement, there is merely predicted reinforcement.

However a sufficiently intelligent AI should predict that stealing the button would have resulted in the button being pressed, and so it will still do that. Even though the button is turned off, the AI is trying to predict what would be best in the counter-factual world where the button is still on.

And so the programmer thinks that they have taught the AI to understand what is good, but really they have just taught it to figure out how to press a button labelled "GOOD".

[This comment is no longer endorsed by its author]

This is not how IRL works at all. The utility function does not come from a special reward channel controlled by a human. There is no button.

To reiterate my description earlier, IRL is based on inferring the unknown utility function of an agent given examples of the agent's behaviour in terms of observations and actions. The utility function is entirely an internal component of the model.
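To make this description concrete, here is a toy sketch of the idea, not any particular published IRL algorithm. An agent with hidden preference weights picks among options; the learner sees only the options and the choices, and scores candidate utility functions by how well they explain the behaviour. Everything here is hypothetical illustration: the two-feature options, the candidate grid, the softmax choice model, and the `beta` parameter are all assumptions.

```python
import math
from itertools import product

# Each situation offers options described by two features, e.g. (apples, pears).
situations = [
    [(1, 0), (0, 1)],
    [(2, 0), (0, 3)],
    [(1, 1), (3, 0)],
    [(0, 2), (1, 0)],
]

def utility(w, features):
    """Linear utility: weighted sum of an option's features."""
    return sum(wi * fi for wi, fi in zip(w, features))

def demonstrate(w, situations):
    """The observed agent picks the option with the highest utility under w."""
    return [max(range(len(opts)), key=lambda i: utility(w, opts[i]))
            for opts in situations]

def log_likelihood(w, situations, choices, beta=5.0):
    """Log-probability of the observed choices if the agent chose via a
    softmax over utilities with weights w (a noisy-rational choice model)."""
    total = 0.0
    for opts, chosen in zip(situations, choices):
        scores = [beta * utility(w, o) for o in opts]
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[chosen] - log_z
    return total

hidden_w = (1.0, -1.0)                       # never shown to the learner
choices = demonstrate(hidden_w, situations)  # the only data the learner gets

# The learner searches a small grid of candidate utility functions.
candidates = list(product([-1.0, 0.0, 1.0], repeat=2))
inferred_w = max(candidates,
                 key=lambda w: log_likelihood(w, situations, choices))

print("observed choices:", choices)
print("inferred weights:", inferred_w)
```

Note that the learner never receives a reward signal: the utility function exists only as a parameter of its model of the agent. Real IRL methods work over sequential decision problems and face ambiguity about which utility function explains the data; this grid search sidesteps that by construction.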

What's to stop the AI from instead learning that "good" and "bad" are just subjective mental states or words from the programmer, rather than some deep natural category of the universe? So instead of doing things it thinks the human programmer would call "good", it just tortures the programmer and forces them to say "good" repeatedly.

The pictures and videos of torture in the training set that are labelled "bad".

It is not perfect, but I think the idea is that with a large and diverse training set, alternative models of "good/bad" become extremely contrived, and the human one you are aiming for becomes the simplest model.

I found the material in the post very interesting. It holds out hope that after training your world model, it might not be as opaque as people fear.

To add to the other comments: "the AI understands what you mean, it just doesn't care" refers to a situation where we have failed to teach the AI to care about the things we care about. At that point, it's likely that it can figure out what we actually wanted it to do, but it isn't motivated to do what we wanted it to do.

This post describes a part of a strategy looking to figure out how the AI might be made to care about the same things as we do, by having an internal understanding of the world that's similar to the human understanding and then having its goals grounded in terms of that understanding.

It's not a complete solution (or even a complete subsolution), but rather hacking at the edges. As you mention, if things go badly it is e.g. possible for the AI to escape the box and rewire the reward function. The intended approach for avoiding that would be to program it to inherently care about the same things as humans do before letting it out of the box. At that point, it wouldn't be primarily motivated by the programmer's feedback anymore, but its own internalized values, which would hopefully be human-friendly.