Distance Functions are Hard

[-]John_Maxwell6yΩ7150

Learning a distance function between pictures of human faces has been used successfully to train deep learning based face recognition systems.

My takeaway from your examples is not that "distance functions are hard" so much as "hardcoding is brittle". The general approach of "define a distance function and train a model based on it" has been pretty successful in machine learning.
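
A minimal sketch of what "define a distance function and train a model based on it" can look like. One common formulation in metric learning is a triplet loss over a learned embedding. The linear embedding `W`, the `margin` value, and the function names below are illustrative assumptions, not details of any particular face recognition system.

```python
import numpy as np

def embed(W, x):
    """Map an input into the embedding space (a linear map, for simplicity)."""
    return W @ x

def learned_distance(W, a, b):
    """The learned distance is plain Euclidean distance between embeddings."""
    return float(np.linalg.norm(embed(W, a) - embed(W, b)))

def triplet_loss(W, anchor, positive, negative, margin=1.0):
    """Hinge loss that is zero once the positive example is closer to the
    anchor than the negative example by at least `margin` (squared distances)."""
    d_pos = np.sum((embed(W, anchor) - embed(W, positive)) ** 2)
    d_neg = np.sum((embed(W, anchor) - embed(W, negative)) ** 2)
    return float(max(0.0, d_pos - d_neg + margin))

# Toy data: "same identity" points are near each other, "different" ones are far.
W = np.eye(2)
anchor = np.array([0.0, 0.0])
same = np.array([0.0, 0.1])
different = np.array([3.0, 0.0])

good_loss = triplet_loss(W, anchor, same, different)  # triplet already satisfied
bad_loss = triplet_loss(W, anchor, different, same)   # roles swapped, so penalized
```

Training would adjust `W` (or a deep network in its place) to push this loss toward zero over many triplets; that loop is omitted here.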

[-]David Scott Krueger (formerly: capybaralet)6yΩ110

At the same time, the importance of having a good distance/divergence, the lack of appropriate ones, and the difficulty of learning them are widely acknowledged challenges in machine learning.

A distance function is fairly similar to a representation in my mind, and high-quality representation learning is considered a bit of a holy grail open problem.

Machine learning relies on formulating *some* sort of objective, which can be viewed as analogous to the choice of a good distance function, so I think the central point of the post (as I understood it from a quick glance) is correct: "specifying a good distance measure is not that much easier than specifying a good objective".

It's also an open question how much learning, (relatively) generic priors, and big data can actually solve the issue of weak learning signals and weak priors for us. A lot of people are betting pretty hard on that; I think it's plausible, but not very likely. I think it's more like a recipe for unaligned AI, and we need to get more bits of information about what we actually want into AI systems somehow. Highly interactive training protocols seem super valuable for that, but the ML community has a strong preference against such work because it is a massive pain compared to the non-interactive UL/SL/RL settings that are popular.

[-]John_Maxwell6yΩ120

Why are highly interactive training protocols a massive pain?

Do you have any thoughts on self-supervised learning? That's my current guess for how we'll get AGI, and it's a framework that makes the alignment problem seem relatively straightforward to me.

[-]David Scott Krueger (formerly: capybaralet)6yΩ110

They're a pain because they involve a lot of human labor, slow down the experiment loop, make reproducing results harder, etc.

RE self-supervised learning: I don't see why we needed the rebranding (of unsupervised learning). I don't see why it would make alignment straightforward (ETA: except to the extent that you aren't necessarily, deliberately building something agenty). The boundaries between SSL and other ML are fuzzy; I don't think we'll get to AGI using just SSL and nothing like RL. SSL doesn't solve the exploration problem; if you start caring about exploration, I think you end up doing things that look more like RL.

I also tend to agree (e.g. with that gwern article) that AGI designs that aren't agenty are going to be at a significant competitive disadvantage, so probably aren't a satisfying solution to alignment, but could be a stop-gap.

[-]John_Maxwell6y*Ω120

They're a pain because they involve a lot of human labor, slow down the experiment loop, make reproducing results harder, etc.

I see. How about doing active learning of computable functions? That solves all 3 problems.

Instead of standard benchmarks, you could offer an API which provides an oracle for some secret functions to be learned. You could run a competition every X months and give each competition entrant a budget of Y API calls over the course of the competition.
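
A rough sketch of the proposed setup, assuming a Python interface. `BudgetedOracle`, `BudgetExceeded`, and `learn_threshold` are hypothetical names, and the secret function here is a simple integer step function chosen only so that an active learner has something to find.

```python
class BudgetExceeded(Exception):
    """Raised when an entrant has used up its allotted oracle calls."""

class BudgetedOracle:
    """Wraps a secret computable function and meters access to it."""

    def __init__(self, secret_fn, budget):
        self._secret_fn = secret_fn
        self._remaining = budget

    @property
    def remaining(self):
        return self._remaining

    def query(self, x):
        if self._remaining <= 0:
            raise BudgetExceeded("no API calls left")
        self._remaining -= 1
        return self._secret_fn(x)

def learn_threshold(oracle, lo, hi):
    """Active learner: binary search for the smallest integer in [lo, hi]
    on which the oracle answers True. Assumes the oracle is a step
    function that is True at `hi`."""
    while lo < hi:
        mid = (lo + hi) // 2
        if oracle.query(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

# The organizer keeps the function secret; entrants only see `query`.
oracle = BudgetedOracle(lambda x: x >= 37, budget=12)
guess = learn_threshold(oracle, 0, 1023)
```

Because the learner chooses its own queries, it needs only about log2(1024) = 10 calls here, whereas a passive learner sampling inputs at random would typically need far more to pin down the threshold.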

RE self-supervised learning: I don't see why we needed the rebranding (of unsupervised learning).

Well I don't see why neural networks needed to be rebranded as "deep learning" either :-)

When I talk about "self-supervised learning", I refer to chopping up your training set into automatically created supervised learning problems (predictive processing), which feels different from clustering/dimensionality reduction. It seems like a promising approach regardless of what you call it.
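
As a small illustration of "chopping up your training set into automatically created supervised learning problems", here is a sketch that turns one unlabeled token sequence into (context, target) pairs. The function name and the fixed-size context window are assumptions made for the example.

```python
def next_token_examples(tokens, context_size):
    """Turn one unlabeled sequence into supervised (context, target) pairs.
    No human labels are needed: each target is just the next token."""
    examples = []
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        examples.append((context, tokens[i]))
    return examples

corpus = "the cat sat on the mat".split()
pairs = next_token_examples(corpus, context_size=2)
```

Any ordinary supervised learner can then be trained on `pairs`, which is the sense in which this differs from clustering or dimensionality reduction.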

I don't see why it would make alignment straightforward (ETA: except to the extent that you aren't necessarily, deliberately building something agenty).

In order to make accurate predictions about reality, you need to understand humans, because humans exist in reality. So at the very least, a superintelligent self-supervised learning system trained on loads of human data would have a lot of conceptual building blocks (developed in order to make predictions about its training data) which could be tweaked and combined to make predictions about human values (analogous to fine-tuning in the context of transfer learning). But I suspect fine-tuning might not even be necessary. Just ask it what Gandhi would do or something like that.

Re: gwern's article, RL does not seem to me like a good fit for most of the problems he describes. I agree active learning/interactive training protocols are powerful, but that's not the same as RL.

Autonomy is also nice (and also not the same as RL). I think the solution for autonomy is (1) solve calibration/distributional shift, so the system knows when it's safe to act autonomously (2) have the system adjust its own level of autonomy/need for clarification dynamically depending on the apparent urgency of its circumstances. I have notes for a post about (2), let me know if you think I should prioritize writing it.

[-]David Scott Krueger (formerly: capybaralet)6yΩ110

I see. How about doing active learning of computable functions? That solves all 3 problems

^ I don't see how?

I should elaborate... it sounds like you're thinking of active learning (where the AI can choose to make queries for information, e.g. labels), but I'm talking about *inter*active training, where a human supervisor is *also* actively monitoring the AI system, making queries of it, and intelligently selecting feedback for the AI. This might be simulated as well, using multiple AIs, and there might be a lot of room for good work there... but I think if we want to solve alignment, we want a deep and satisfying understanding of AI systems, which seems hard to come by without rich feedback loops between humans and AIs. Basically, by interactive training, I have in mind something where training AIs looks more like teaching other humans.

So at the very least, a superintelligent self-supervised learning system trained on loads of human data would have a lot of conceptual building blocks (developed in order to make predictions about its training data) which could be tweaked and combined to make predictions about human values (analogous to fine-tuning in the context of transfer learning).

I think it's a very open question how well we can expect advanced AI systems to understand or mirror human concepts by default. Adversarial examples suggest we should be worried that apparently similar concepts will actually be wildly different in non-obvious ways. I'm cautiously optimistic, since this could make things a lot easier. It's also unclear ATM how precisely AI concepts need to track human concepts in order for things to work out OK. The "basin of attraction" line of thought suggests that they don't need to be that great, because they can self-correct or learn to defer to humans appropriately. My problem with that argument is that it seems like we will have so many chances to fuck up that we would need 1) AI systems to be extremely reliable, or 2) for catastrophic mistakes to be rare, and minor mistakes to be transient or detectable. (2) seems plausible to me in many applications, but probably not all of the applications where people will want to use SOTA AI.

Re: gwern's article, RL does not seem to me like a good fit for most of the problems he describes. I agree active learning/interactive training protocols are powerful, but that's not the same as RL.

Yes ofc they are different.

I think the significant features of RL algorithms here are: 1) having the goal of understanding the world and how to influence it, and 2) doing (possibly implicit) planning. RL can also be pointed at narrow domains, but for a lot of problems, I think having general knowledge will be very valuable, and hard to replicate with a network of narrow systems.

I think the solution for autonomy is (1) solve calibration/distributional shift, so the system knows when it's safe to act autonomously (2) have the system adjust its own level of autonomy/need for clarification dynamically depending on the apparent urgency of its circumstances.

That seems great, but also likely to be very difficult, especially if we demand high reliability and performance.

[-]John_Maxwell6yΩ120

^ I don't see how?

No human labor: Just compute the function. Fast experiment loop: Computers are faster than humans. Reproducible: Share the code for your function with others.

I'm talking about interactive training

I think for a sufficiently advanced AI system, assuming it's well put together, active learning can beat this sort of interactive training--the AI will be better at the task of identifying & fixing potential weaknesses in its models than humans.

Adversarial examples suggest we should be worried that apparently similar concepts will actually be wildly different in non-obvious ways.

I think the problem with adversarial examples is that deep neural nets don't have the right inductive biases. I expect meta-learning approaches which identify & acquire new inductive biases (in order to determine "how to think" about a particular domain) will solve this problem and will also be necessary for AGI anyway.

BTW, different human brains appear to learn different representations (previous discussion), and yet we are capable of delegating tasks to each other.

I'm cautiously optimistic, since this could make things a lot easier.

Huh?

My problem with that argument is that it seems like we will have so many chances to fuck up that we would need 1) AI systems to be extremely reliable, or 2) for catastrophic mistakes to be rare, and minor mistakes to be transient or detectable. (2) seems plausible to me in many applications, but probably not all of the applications where people will want to use SOTA AI.

Maybe. But my intuition is that if you can create a superintelligent system, you can make one which is "superhumanly reliable" even in domains which are novel to it. I think the core problems for reliable AI are very similar to the core problems for AI in general. An example is the fact that solving adversarial examples and improving classification accuracy seem intimately related.

I think the significant features of RL algorithms here are: 1) having the goal of understanding the world and how to influence it, and 2) doing (possibly implicit) planning.

In what sense does RL try to understand the world? It seems very much not focused on that. You essentially have to hand it a reasonably accurate simulation of the world (i.e. a world that is already fully understood, in the sense that we have a great model for it) for it to do anything interesting.

If the planning is only "implicit", RL sounds like overkill and probably not a great fit. RL seems relatively good at long sequences of actions for a stateful system we have a great model of. If most of the value can be obtained by planning 1 step in advance, RL seems like a solution to a problem you don't have. It is likely to make your system less safe, since planning many steps in advance could let it plot some kind of treacherous turn. But I also don't think you will gain much through using it. So luckily, I don't think there is a big capabilities vs safety tradeoff here.

I think having general knowledge will be very valuable, and hard to replicate with a network of narrow systems.

Agreed. But general knowledge is also not RL, and is handled much more naturally in other frameworks such as transfer learning, IMO.

So basically I think daemons/inner optimizers/whatever you want to call them are going to be the main safety problem.

[-]Grue_Slinky6yΩ110

Yes, perhaps I should've been more clear. Learning certain distance functions is a practical solution to some things, so maybe the phrase "distance functions are hard" is too simplistic. What I meant to say is more like

Fully-specified distance functions are hard, over and above the difficulty of formally specifying most things, and it's often hard to notice this difficulty

This is mostly applicable to Agent Foundations-like research, where we are trying to give a formal model of (some aspect of) how agents work. Sometimes, we can reduce our problem to defining the appropriate distance function, and it can feel like we've made some progress, but we haven't actually gotten anywhere (the first two examples in the post are like this).

The 3rd example, where we are trying to formally verify an ML model against adversarial examples, is a bit different now that I think of it. Here we apparently need a transparent, formally-specified distance function if we have any hope of absolutely proving the absence of adversarial examples. And in formal verification, the specification problem often is just philosophically hard like this. So I suppose this example is less insightful, except insofar as it lends extra intuitions for the other class of examples.

[-]John_Maxwell6y*Ω240

Here we apparently need a transparent, formally-specified distance function if we have any hope of absolutely proving the absence of adversarial examples.

Well, a classifier that is 100% accurate would also do the job ;) (I'm not sure a 100% accurate classifier is feasible per se, but a classifier which can be made arbitrarily accurate given enough data/compute/life-long learning experience seems potentially feasible.)

Also, small perturbations aren't necessarily the only way to construct adversarial examples. Suppose I want to attack a model M1, which I have access to, and I also have a more accurate model M2. Then I could execute an automated search for cases where M1 and M2 disagree. (Maybe I use gradient descent on the input space, maximizing an objective function corresponding to the level of disagreement between M1 and M2.) Then I hire people on Mechanical Turk to look through the disagreements and flag the ones where M1 is wrong. (Since M2 is more accurate, M1 will "usually" be wrong.)
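
A toy sketch of the automated search step, assuming both models are differentiable. Here M1 and M2 are stand-in logistic models with hypothetical weights `w1` and `w2`, and the disagreement objective is the squared difference of their predicted probabilities. The human review step (flagging which disagreements M1 gets wrong) is not modeled.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def disagreement(x, w1, w2):
    """Squared gap between the two models' predicted probabilities."""
    return float((sigmoid(w1 @ x) - sigmoid(w2 @ x)) ** 2)

def disagreement_grad(x, w1, w2):
    """Analytic gradient of the disagreement with respect to the input x."""
    p1, p2 = sigmoid(w1 @ x), sigmoid(w2 @ x)
    # d/dx sigmoid(w @ x) = p * (1 - p) * w
    return 2.0 * (p1 - p2) * (p1 * (1 - p1) * w1 - p2 * (1 - p2) * w2)

def find_disagreement(x0, w1, w2, step=0.5, n_steps=200):
    """Gradient ascent on the input space to maximize model disagreement."""
    x = x0.copy()
    for _ in range(n_steps):
        x = x + step * disagreement_grad(x, w1, w2)
    return x

w1 = np.array([1.0, 0.0])    # stand-in for M1
w2 = np.array([0.0, 1.0])    # stand-in for M2
x0 = np.array([0.1, -0.1])   # start where the models differ slightly
x_found = find_disagreement(x0, w1, w2)
```

Note that the search needs a starting point where the models already differ a little: at an input where both give exactly the same probability, this particular objective has zero gradient.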

This is actually one way to look at what's going on with traditional small perturbation adversarial examples. M1 is a deep learning model and M2 is a 1-nearest-neighbor model--not very good in general, but quite accurate in the immediate region of data points with known labels. The problem is that deep learning models don't have a very strong inductive bias towards mapping nearby inputs to nearby outputs (sometimes called "Lipschitzness"). L2 regularization actually makes deep learning models more Lipschitz because smaller coefficients=smaller eigenvalues for weight matrices=less capacity to stretch nearby inputs away from each other in output space. I think maybe that's part of why L2 regularization works.
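
To make the Lipschitz point concrete, here is a sketch under one assumption: a feed-forward network whose activations are 1-Lipschitz (ReLU, for example). For such a network, the product of the layers' spectral norms (largest singular values, which is the relevant quantity for non-square weight matrices) is an upper bound on the Lipschitz constant. The matrices below are arbitrary illustrative values.

```python
import numpy as np

def lipschitz_upper_bound(weight_matrices):
    """Product of spectral norms of the layers. With 1-Lipschitz activations,
    this bounds how far apart the network can push two nearby inputs."""
    bound = 1.0
    for W in weight_matrices:
        bound *= np.linalg.norm(W, ord=2)  # largest singular value of W
    return float(bound)

layers = [np.diag([2.0, 1.0]), np.diag([3.0, 0.5])]
shrunk = [0.5 * W for W in layers]  # crude stand-in for the effect of weight decay

bound_before = lipschitz_upper_bound(layers)
bound_after = lipschitz_upper_bound(shrunk)
```

Uniformly halving the weights is only a caricature of what L2 regularization does during training, but it shows the direction of the effect: smaller weights give a smaller bound on how much the network can stretch inputs apart.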

Hoping to expand the previous two paragraphs into a paper with Matthew Barnett before too long--if anyone wants to help us get it published, please send me a PM (neither of us has ever published a paper before).

[-]Gurkenglas6y*50

I'm not convinced conceptual distance metrics must be value-laden. Represent each utility function by an AGI. Almost all of them should be able to agree on a metric such that each could adopt that metric in its thinking losing only negligible value. The same could not be said for agreeing on a utility function. (The same could be said for agreeing on a utility-parametrized AGI design.)

[-]Bunthut6y20

Represent each utility function by an AGI. Almost all of them should be able to agree on a metric such that each could adopt that metric in its thinking losing only negligible value.

This implies a measure over utility functions. It's probably true under the Solomonoff measure, but abstract though they are, these are values.

[-]romeostevensit6y20

I think it's that any basis set I define in a super high dimensional space could be said to be value laden, though it might be tacit and I have little idea what it is. If I care about 'causal structure' or something that's still relative to the sorts of affordances that are relevant to me in the space?

[-]Gurkenglas6y10

Is this the same value payload that makes activists fight over language to make human biases work for their side? I don't think this problem translates to AI: If the AGIs find that some metric induces some bias, each can compensate for it.

[-]Bunthut6y20

It's sort of true that the correct distance function depends on your values. A better way to say it is that different distance functions are appropriate for different tasks, and they will be "better" or "worse" depending on how much you care about those tasks. But I don't think asking for the "best" metric in this sense is helpful, because you don't have to use the same metric for all tasks involving a certain space. Sometimes you want air distance, sometimes travel times. Maybe you have to decide because you're computationally limited, but it's not philosophically relevant.
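
A small illustration of the air distance versus travel time point, with grid (Manhattan) distance as a stand-in for travel along streets. The coordinates are made up for the example.

```python
import math

def air_distance(a, b):
    """Straight-line distance."""
    return math.dist(a, b)

def grid_travel_distance(a, b):
    """Manhattan distance, standing in for travel along a street grid."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

home = (0, 0)
places = {"A": (3, 3), "B": (0, 5)}

nearest_by_air = min(places, key=lambda name: air_distance(home, places[name]))
nearest_by_grid = min(places, key=lambda name: grid_travel_distance(home, places[name]))
```

The two metrics disagree about which place is "nearest", and neither is wrong: each answers a different question about the same space.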

With that in mind, my attempts at two of your examples. The adversarial examples first, because it's the clearest question: I think the problem is that you are thinking too abstractly. I don't think there is a meaningful sense of "concept similarity" that's purely logical, i.e. independent of the actual world. The intuitive sense of similarity you're trying to use here is probably something like this: Over the space of images, you want the probability measure of encountering them. Then you get a metric where two subsets of imagespace which are isomorphic under the metric always have the same measure. That is your similarity measure.

Counterfactuals usually involve some sort of probability distribution, which is then "updated" on the condition of the counterfactual being true, and then the consequent is judged under that distribution. What the initial distribution is depends on what you're doing. In the case of Lincoln, it's probably reasonable expectations of the future from before the assassination. But for something like "What if conservation of energy wasn't true", it's probably our current distribution over physics theories. Basically, what's the most likely alternative. The mathematical example is a bit different. There are a lot of ways to conclude a contradiction from 0=1, but it's very hard to deduce a contradiction from denying the modularity theorem. If you were to just randomly perform logical inferences from "the modularity theorem is wrong", then there is a subset of propositions, containing no claim that is a direct negation of another, that your deductions are unlikely to lead you out of (it matters, of course, in what way it is random, but it evidently works for "human mathematician who hasn't seen the proof yet").
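
A minimal sketch of the "update the distribution on the antecedent, then judge the consequent" recipe, using a tiny discrete distribution. The worlds and their probabilities are made up purely for illustration and carry no historical claim.

```python
from fractions import Fraction

def condition(dist, event):
    """Restrict a distribution to worlds where `event` holds and renormalize."""
    total = sum(p for world, p in dist.items() if event(world))
    if total == 0:
        raise ValueError("cannot condition on a probability-zero event")
    return {world: p / total for world, p in dist.items() if event(world)}

def probability(dist, event):
    return sum(p for world, p in dist.items() if event(world))

# Hypothetical prior over worlds (assassinated, impeached); numbers are invented.
prior = {
    (True, False): Fraction(6, 10),
    (False, False): Fraction(3, 10),
    (False, True): Fraction(1, 10),
}

# "If Lincoln were not assassinated..." -> condition on the antecedent.
updated = condition(prior, lambda world: not world[0])
# "...he would not have been impeached" -> judge the consequent.
answer = probability(updated, lambda world: not world[1])
```

The `ValueError` branch is where this recipe stops working: an antecedent with zero probability under the initial distribution gives nothing to renormalize, so the choice of initial distribution does real work.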

[-]Shmi6y10

"If Lincoln were not assassinated, he would not have been impeached" is a probabilistic statement that is not at all about THE Lincoln. It's a reference class analysis of leaders who did not succumb to premature death and had the leadership, economy etc. metrics similar to the one Lincoln. There is no "counterfactual" there in any interesting sense. It is not about the minute details of avoiding the assassination. If you state the apparent counterfactual more precisely, it would be something like

There is a 90% probability of a ruler with [list of characteristics matching Lincoln, according to some criteria] serving out his term.

So, there is no issue with "If 0=1..." here, unlike with the other one, "If the modularity theorem were false", which implies some changes in the very basics of mathematics, though one can also argue for the reference class approach there.

[-]Charlie Steiner6y10

I feel like this is practically a frequentist/bayesian disagreement :D It seems "obvious" to me that "If Lincoln were not assassinated, he would not have been impeached" can be about the real Lincoln as much as me saying "Lincoln had a beard" is, because both are statements made using my model of the world about this thing I label Lincoln. No reference class necessary.

[-]Shmi6y20

I am not sure if labels help here. I'm simply pointing out that logical counterfactuals applied to the "real Lincoln" lead to the sort of issues MIRI is facing right now when trying to make progress on theoretical AI alignment. The reference class approach removes the difficulties, but then it is hard to apply it to "mathematical facts", like what the probability is of the 100...0th digit of pi being 0 or, to quote the OP, "If the Modularity Theorem were false...", and the prevailing MIRI philosophy does not allow treating logical uncertainty as environmental.

[-]Charlie Steiner6y10

Sure. In the case of Lincoln, I would say the problem is solved by models even as clean as Pearl-ian causal networks. But in math, there's no principled causal network model of theorems to support counterfactual reasoning as causal calculus.
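
For readers unfamiliar with Pearl-style models, here is a toy sketch of the idea: variables are computed from structural equations, and a counterfactual is evaluated by overriding one variable (an intervention) while leaving everything upstream untouched. The variable names and equations are hypothetical and deliberately oversimplified.

```python
def run_model(exogenous, interventions=None):
    """Evaluate a tiny deterministic structural causal model.
    `interventions` overrides a variable's equation, in the spirit of do(X = x)."""
    interventions = interventions or {}
    values = dict(exogenous)
    # Structural equations, listed in causal order (toy assumptions).
    equations = [
        ("assassinated", lambda v: v["assassin_succeeds"]),
        ("serves_full_term", lambda v: not v["assassinated"]),
    ]
    for name, equation in equations:
        if name in interventions:
            values[name] = interventions[name]
        else:
            values[name] = equation(values)
    return values

background = {"assassin_succeeds": True}
factual = run_model(background)
counterfactual = run_model(background, interventions={"assassinated": False})
```

The intervention changes only what is downstream of `assassinated`; the background fact `assassin_succeeds` keeps its factual value. That separation is what a causal graph provides, and it is the piece that has no principled analogue for theorems.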

Of course, I more or less just think that we have an unprincipled causality-like view of math that we take when we think about mathematical counterfactuals, but it's not clear that this is any help to MIRI understanding proof-based AI.

[-]Shmi6y20

I don't think I am following your argument. I am not sure what Pearl's causal networks are and how they help here, so maybe I need to read up on it.
