Critch on career advice for junior AI-x-risk-concerned researchers

In a recent e-mail thread, Andrew Critch described what he called a "subtle problem with sending junior AI-x-risk-concerned researchers into AI capabilities research". Here's the explanation he wrote of his view, shared with his permission:


I'm fairly concerned with the practice of telling people who "really care about AI safety" to go into AI capabilities research, unless they are very junior researchers who are using general AI research as a place to improve their skills until they're able to contribute to AI safety later. (See Leveraging Academia).

The reason is not a fear that they will contribute to AI capabilities advancement in some manner that will be marginally detrimental to the future. It's also not a fear that they'll fail to change the company's culture in the ways they'd hope, and end up feeling discouraged. What I'm afraid of is that they'll feel pressure to start pretending to themselves, or to others, that their work is "relevant to safety". Then what we end up with are companies and departments filled with people who are "concerned about safety", creating a false sense of security that something relevant is being done, when all we have are a bunch of simmering concerns and concomitant rationalizations.

This fear of mine requires some context from my background as a researcher. I see this problem with environmentalists who "really care about climate change", who tell themselves they're "working on it" by studying the roots of a fairly arbitrary species of tree in a fairly arbitrary ecosystem that won't generalize to anything likely to help with climate change.

My assessment that their work won't generalize is mostly not from my own outside view; it comes from asking the researcher about how their work is likely to have an impact, and getting a response that either says nothing more than "I'm not sure, but it seems relevant somehow", or an argument with a lot of caveats like "X might help with Y, which might help with Z, which might help with climate change, but we really can't be sure, and it's not my job to defend the relevance of my work. It's intrinsically interesting to me, and you never know if something could turn out to be useful that seemed useless at first."

At the same time, I know other climate scientists who seem to have actually done an explicit or implicit Fermi estimate for the probability that they will personally soon discover a species of bacteria that could safely scrub the Earth's atmosphere of excess carbon. That's much better.

I've seen the same sort of problem with political scientists who are "really concerned about nuclear war" who tell themselves they're "working on it" by trying to produce a minor generalization of an edge case of a voting theorem that, when asked, they don't think will be used by anyone ever.

At the same time, I know other political scientists who seem to be trying really hard to work backward from a certain geopolitical outcome, and earnestly working out the details of what the world would need to make that outcome happen. That's much better.

Having said this, I do think it's fine and good if society wants to sponsor a person to study obscure roots of obscure trees that probably won't help with climate change, or edge cases of theorems that no one will ever use or even take inspiration from, but I would like everyone to be on the same page that in such cases what we're sponsoring is intellectual freedom and development, and not climate change prevention or nuclear war prevention. If folks want to study fairly obscure phenomena because it feels like the next thing their mind needs to understand the world better, we shouldn't pressure them to have to think that the next thing they learn might "stop climate change" or "prevent nuclear war", or else we fuel the fire of false pretenses about which of the world's research gaps are being earnestly taken care of.

Unfortunately, the above pattern of "justifying" research by just reflecting on what you care about, rationalizing it, and not checking the rationalization for rationality, appears to me to be extremely prevalent among folks who care about climate change or nuclear war, and this is not something I want to see replicated elsewhere, especially not in the burgeoning fields of AI safety, AI ethics, or AI x-risk reduction. And I'm concerned that if we tell folks to go into AI research just to "be concerned", we'll be fueling a false sense of security by filling departments and companies with people who "seem to really care" but aren't doing correspondingly relevant research work, and creating a research culture where concerns about safety, ethics, or x-risk do not result in actually prioritizing research into safety, ethics, or x-risk.

When you’re giving general-purpose career advice, the meme "do AI yourself, so you're around to help make it safe" is a really bad meme. It fuels a narrative that says "Being a good person standing next to the development of dangerous tech makes the tech less dangerous." Just standing nearby doesn't actually help unless you're doing technical safety research. Just standing nearby does create a false sense of security through the mere-exposure effect. And the "just stand nearby" attitude drives people to worsen race conditions by creating new competitors in different geographical locations, so they can exercise their Stand Nearby powers to ensure the tech is safe.

Important: the above paragraphs are advice about what advice to give, because of the social pressures and tendencies to rationalize that advice-giving often produces. By contrast, if you're a person who's worried about AI, and thinking about a career in AI research, I do not wish to discourage you from going into AI capabilities research. To you, what I want to say is something different....

Step 1: Learn by doing. Leverage Academia. Get into a good grad school for AI research, and focus first on learning things that feel like they will help you personally to understand AI safety better (or AI ethics, or AI x-risk; replace by your area of interest throughout). Don't worry about whether you're "contributing" to AI safety too early in your graduate career. Before you're actually ready to make real contributions to the field, try to avoid rationalizing doing things because "they might help with safety"; instead, do things because "they might help me personally to understand safety better, in ways that might be idiosyncratic to me and my own learning process."

Remember, what you need to learn to understand safety, and what the field needs to progress, might be pretty different, and you need to have the freedom to learn whatever gaps seem important to you personally. Early in your research career, you need to be in "consume" mode more than "produce" mode, and it's fine if your way of "consuming" knowledge and skill is to "produce" things that aren't very externally valuable. So, try to avoid rationalizing the externally-usable safety-value of ideas or tools you produce on your way to understanding how to produce externally-usable safety research later.

The societal value of you producing your earliest research results will be that they help you personally to fill gaps in your mind that matter for your personal understanding of AI safety, and that's all the justification you need in my books. So, do focus on learning things that you need to understand safety better, but don't expect those things to be a "contribution" that will matter to others.

Step 2: Once you've learned enough that you're able to start contributing to research in AI safety (or ethics, or x-risk), then start focusing directly on making safety research contributions that others might find insightful. When you're ready enough to start actually producing advances in your field, that's when it's time to start thinking about what the social impact of those advances would be, and to start shifting your focus somewhat away from learning (consuming) and somewhat more toward contributing (producing).


(Content from Critch ends here.)

24 comments

I like the idea of optimizing for career growth & AI safety separately. However, I'm not sure the difference between "capabilities research" and "safety research" is as clear-cut as Critch makes it sound.

Consider the problem of making ML more data-efficient. Superficially, this is "capabilities research": I don't think it appears on any AI safety research agenda, and it's an established mainstream research area.

However, in order to do value learning, I think we'll want ML to become much more data-efficient than it is currently. If ML is not data-efficient, then assembling a dataset for our values will be time-consuming, which might tempt arms race participants to cut corners.

And if we could make ML really data-efficient, that gets us closer to "do what I mean" systems where you give it a few examples of things to do/not do and it's able to correctly infer your intent.

So does that mean the AI safety community should work on making ML more data-efficient? I'm not sure. I can think of arguments on both sides.

But my personal view is that answering these kinds of "differential capabilities research" questions is higher-impact than a lot of the AI safety work that is being done. As far as I can tell, most existing AI safety work either

(a) Treats safety as an applications problem, where we try to use existing AI techniques to prototype what safe systems might look like. But I expect such prototypes will be thrown away as the state of the art advances. Arguably, you hit the point of diminishing returns with this approach as soon as you finish your architecture diagram (since that's the part that's least likely to change as the field advances).

(b) Treats safety as a security problem, where we try to think of flaws AI systems might have and how we might guard against them. But flaws only exist in the context of particular systems. The C programming language has a lot of security issues due to the fact that strings are null-terminated. There's a massive cottage industry built around exploiting and guarding against C-specific issues. But this is all historically contingent: We only care about this because C is a popular programming language. If C were not popular, this cottage industry wouldn't exist.

Instead I would suggest a third approach:

(c) Treat safety as a differential technological development problem. Try to figure out which capabilities are on the critical path for FAI but not on the critical path for UFAI. Try to evaluate competing AI paradigms and forecast which could most easily evolve into a secure system, then try to improve benchmarks for that platform so it can win the standards war. If none of the existing paradigms seem likely to be adequate, maybe devise a new paradigm de novo. Don't forget about sociological factors.

Note that approach (c) looks a lot more like "capabilities research" than "safety research". It requires careful judgement calls by domain experts. Work of types (a) and (b) will likely be useful to inform those judgement calls. But (c) is the way to go in the long run, IMO. If you were an effective altruist living during the 1980s trying to ensure that computers of the future would be secure, I think promoting the adoption of a non-C programming language would likely be the highest-leverage thing to do.

[This ended up being a pretty long tangent, maybe I should make this comment into a toplevel post? Perhaps people could tell me if/why they disagree first.]

Can you explain (c) a bit more? What specifically should someone be doing now, if they want to do (c)?

Well, here is a list of paradigms that might overtake deep learning. This list could probably be expanded, e.g. by researching various attempts to integrate deep learning with Bayesian reasoning, create more interpretable models, etc.

Then you could come up with a list of desiderata we seek in a paradigm: resistance to adversarial examples, robustness to distributional shift, interpretability, conservative concepts, calibration, etc. Additionally, there are pragmatic considerations related to whether a particular paradigm has a serious hope of widespread adoption. How competitive is it? Does it address researcher complaints about deep learning?

Then you could create a 2D matrix with paradigms on one axis and desiderata on the other. For each paradigm/desideratum combo, figure out if that paradigm satisfies, or could be improved to satisfy, that desideratum. As you do this you'd probably get ideas for new rows/columns in your matrix.
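In code, the bookkeeping for such a matrix is simple; here's a minimal sketch (the paradigm names and cell verdicts below are illustrative placeholders, not actual research assessments):

```python
# Hypothetical paradigm-by-desiderata matrix. Every entry here is a
# placeholder verdict, not a real evaluation of any paradigm.
paradigms = ["deep learning", "probabilistic programming", "gradient boosted trees"]
desiderata = ["interpretability", "calibration", "robustness to distributional shift"]

# Initialize every cell to the honest default, then fill in judgments.
matrix = {p: {d: "maybe with more research" for d in desiderata} for p in paradigms}
matrix["gradient boosted trees"]["interpretability"] = "yes"  # placeholder judgment

# Rank paradigms by how many desiderata they (tentatively) satisfy outright.
scores = {p: sum(v == "yes" for v in row.values()) for p, row in matrix.items()}
best = max(scores, key=scores.get)
```

The point isn't the code, of course; it's that filling in the cells forces the judgment calls to be made explicitly, one combination at a time.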

Then you could look at your matrix and try to figure out which paradigms are most promising for FAI--or if none seem good enough, invent a new one. Do technology evangelism for the chosen paradigm(s). Try to improve the paradigm's resume of accomplishments. Rally the AI safety community.

Computer science, and AI in particular, has always been hype-driven. The market in paradigms doesn't seem efficient or driven purely by questions of technical merit. And there can be a lot of path-dependence. As AI safety concerns gain mindshare, I think we stand a solid chance of influencing which paradigms gain traction.

Another approach to differential capabilities development is to try to identify an application of AI which shares a lot of features with the AI safety problem and demonstrate its commercial viability. For example, self-driving cars are safety-critical in nature, which seems good. But they also must make real-time decisions, whereas it is probably desirable for an FAI to spend time pondering the nature of our values, ask us clarifying questions, etc.

Fun fact: Silicon Valley's new behemoth investor is a believer in the technological singularity. It's too bad the Singularity Summit is not still a thing or he could be invited to speak.

Then you could come up with a list of desiderata we seek in a paradigm: resistance to adversarial examples, robustness to distributional shift, interpretability, conservative concepts, calibration, etc.

For most of these examples, the current research in safety is more like "Try to find any approach that has a hope of satisfying that desideratum while being competitive."

So your matrix just ends up being a lot of "no" or "maybe if we did more research."

It seems correct that people are trying to "find some approach that might work" before they try "rally the community around an approach that might work."

Well, I haven't seen even a blog post's worth of effort put into doing something like what I suggested. So an extreme level of pessimism doesn't seem especially well-justified to me. It seems relatively common for a task to be hard in one framework while being easy in another.

Standard CFAR advice: Instead of assuming a problem is unsolvable, sit down and try to think of a solution for a timed 5 minutes. Has anyone spent a timed 5 minutes trying to figure out, say, how vulnerable gcForest is likely to be to adversarial examples? You don't necessarily have to solve all the problems yourself, either: 5 minutes of research is enough to determine that creating models which "correctly capture uncertainty" seems to be one of Uber's design goals with Pyro (which seems related to calibration/robustness to distributional shift).

BTW, I've spent a fair amount of time thinking about & reading about creativity, and I don't think extreme pessimism is at all conducive to generating ideas. If your evidence for a problem being hard is "I couldn't think of any good approaches", and you were pretty sure there weren't any good approaches before you started thinking, I don't find that evidence super compelling.

It seems correct that people are trying to "find some approach that might work" before they try "rally the community around an approach that might work."

I agree. That's why I suggested going breadth-first initially.

Even if pessimism is justified, I think a breadth-first approach is sensible if it's possible to estimate the difficulty of overcoming various problems in the context of various frameworks in advance. If making any progress at all is expected to be hard, all the more reason to choose targets strategically.

Has anyone spent a timed 5 minutes trying to figure out, say, how vulnerable gcForest is likely to be to adversarial examples?

Yes. (Answer: deep learning is not unusually susceptible to adversarial examples.)

5 minutes of research is enough to determine that creating models which "correctly capture uncertainty" seems to be one of Uber's design goals with Pyro (which seems related to calibration/robustness to distributional shift)

In fact there is a (vast) literature on this topic.

Well, I haven't seen even a blog post's worth of effort put into doing something like what I suggested.

Go for it.

It seems relatively common for a task to be hard in one framework while being easy in another.

I agree. But note that from the perspective of the AI safety research that I do, none of the frameworks in the post you link change the basic picture, except maybe for hierarchical temporal memory (which seems like a non-starter).

deep learning is not unusually susceptible to adversarial examples

FWIW, this claim doesn't match my intuition, and googling around, I wasn't able to quickly find any papers or blog posts supporting it. This 2015 blog post discusses how deep learning models are susceptible due to linearity, which makes intuitive sense to me; the dot product is a relatively bad measure of similarity between vectors. It proposes a strategy for finding adversarial examples for a random forest and says it hasn't yet been empirically confirmed that random forests are unsafe. This empirical confirmation seems pretty important to me, because adversarial examples are only a thing because the wrong decision boundary has been learned. If the only way to create an "adversarial" example for a random forest is to perturb an input until it genuinely appears to be a member of a different class, that doesn't seem like a flaw. (I don't expect that random forests always learn the correct decision boundaries, but my offhand guess would be that they are still less susceptible to adversarial examples than traditional deep models.)

I agree. But note that from the perspective of the AI safety research that I do, none of the frameworks in the post you link change the basic picture, except maybe for hierarchical temporal memory (which seems like a non-starter).

From my perspective, a lot of AI safety challenges get vastly easier if you have the ability to train well-calibrated models for complex, unstructured data. If you have this ability, the AI's model of human values doesn't need to be perfect--since the model is well-calibrated, it knows what it does/does not know and can ask for clarification as necessary.

Calibration could also provide a very general solution for corrigibility: If the AI has a well-calibrated model of which actions are/are not corrigible, and just how bad various incorrigible actions are, then it can ask for clarification as needed on that too. Corrigibility learning allows for notions of corrigibility that are very fine-grained: you can tell the AI that preventing a bad guy from flipping its off switch is OK, but preventing a good guy from flipping its off switch is not OK. By training a model, you don't have to spend a lot of time hand-engineering, and the model will hopefully generalize to incorrigible actions that the designers didn't anticipate. (Per the calibration assumption, the model will usually either generalize correctly, or the AI will realize that it doesn't know whether a novel plan would qualify as corrigible, and it can ask for clarification.)
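The ask-for-clarification pattern itself is trivial; the hard part is the well-calibrated model it presupposes. As a toy sketch (the threshold and probabilities below are made up for illustration):

```python
def act_or_ask(p_corrigible, threshold=0.95):
    """Toy decision rule: given a (hypothetically well-calibrated)
    probability that a plan is corrigible, act only when confident;
    otherwise defer to a human for clarification."""
    return "act" if p_corrigible >= threshold else "ask"

# Confident plans go ahead; uncertain or likely-incorrigible ones get flagged.
decisions = [act_or_ask(p) for p in (0.99, 0.97, 0.60, 0.10)]
```

All of the safety-relevant work is hidden inside producing a trustworthy `p_corrigible`; the decision rule is the easy 1% on top.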

That's why I currently think improving ML models is the action with the highest leverage.

FWIW, this claim doesn't match my intuition, and googling around, I wasn't able to quickly find any papers or blog posts supporting it.

"Explaining and Harnessing Adversarial Examples" (Goodfellow et al. 2014) is the original demonstration that "Linear behavior in high-dimensional spaces is sufficient to cause adversarial examples".

I'll emphasize that high-dimensionality is a crucial piece of the puzzle, which I haven't seen you bring up yet. You may already be aware of this, but I'll emphasize it anyway: the usual intuitions do not even remotely apply in high-dimensional spaces. Check out Counterintuitive Properties of High Dimensional Space.

adversarial examples are only a thing because the wrong decision boundary has been learned

In my opinion, this is spot-on - not only your claim that there would be no adversarial examples if the decision boundary were perfect, but in fact a group of researchers are beginning to think that in a broader sense "adversarial vulnerability" and "amount of test set error" are inextricably linked in a deep and foundational way - that they may not even be two separate problems. Here are a few citations that point at some pieces of this case:

  • "Adversarial Spheres" (Gilmer et al. 2017) - "For this dataset we show a fundamental tradeoff between the amount of test error and the average distance to nearest error. In particular, we prove that any model which misclassifies a small constant fraction of a sphere will be vulnerable to adversarial perturbations of size O(1/√d)." (emphasis mine)
    • I think this paper is truly fantastic in many respects.
    • The central argument can be understood from the intuitions presented in Counterintuitive Properties of High Dimensional Space in the section titled Concentration of Measure (Figure 9). Where it says "As the dimension increases, the width of the band necessary to capture 99% of the surface area decreases rapidly." you can just replace that with "As the dimension increases, a decision-boundary hyperplane that has 1% test error rapidly gets extremely close to the equator of the sphere". "Small distance from the center of the sphere" is what gives rise to "Small epsilon at which you can find an adversarial example".
  • "Intriguing Properties of Adversarial Examples" (Cubuk et al. 2017) - "While adversarial accuracy is strongly correlated with clean accuracy, it is only weakly correlated with model size"
    • I haven't read this paper, but I've heard good things about it.

To summarize, my belief is that any model that is trying to learn a decision boundary in a high-dimensional space, and is basically built out of linear units with some nonlinearities, will be susceptible to small-perturbation adversarial examples so long as it makes any errors at all.
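The concentration-of-measure effect underlying this argument is easy to check numerically. A minimal numpy sketch (sample sizes and the band width 0.1 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_near_equator(d, eps=0.1, n=20000):
    # Sample points uniformly on the unit sphere in R^d by normalizing Gaussians.
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    # Fraction of the surface within distance eps of the equator (first coord near 0).
    return np.mean(np.abs(x[:, 0]) < eps)

frac_d3 = fraction_near_equator(3)      # ~10% of the sphere in 3 dimensions
frac_d1000 = fraction_near_equator(1000)  # nearly all of it in 1000 dimensions
```

In 3 dimensions, a band of half-width 0.1 around the equator captures roughly 10% of the surface; in 1000 dimensions it captures nearly everything, which is why a near-equatorial decision boundary ends up within a tiny epsilon of almost every point.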

(As a note - not trying to be snarky, just trying to be genuinely helpful, Cubuk et al. 2017 and Goodfellow et al. 2014 are my top two hits for "adversarial examples linearity" in an incognito tab)

As the dimension increases, a decision-boundary hyperplane that has 1% test error rapidly gets extremely close to the equator of the sphere

What does the center of the sphere represent in this case?

(I'm imaging the training and test sets consisting of points in a highly dimensional space, and the classifier as drawing a hyperplane to mostly separate them from each other. But I'm not sure what point in this space would correspond to the "center", or what sphere we'd be talking about.)

The central argument can be understood from the intuitions presented in Counterintuitive Properties of High Dimensional Space in the section titled Concentration of Measure

Thanks for this link, that is a handy reference!

"Adversarial Spheres" (Gilmer et al. 2017) - "For this dataset we show a fundamental tradeoff between the amount of test error and the average distance to nearest error. In particular, we prove that any model which misclassifies a small constant fraction of a sphere will be vulnerable to adversarial perturbations of size O(1/√d)." (emphasis mine)

Slightly off-topic, but quick terminology question. When I first read the abstract of this paper, I was very confused about what it was saying and had to re-read it several times, because of the way the word "tradeoff" was used.

I usually think of a tradeoff as an inverse relationship between two good things that you want both of. But in this case they use "tradeoff" to refer to the inverse relationship between "test error" and "average distance to nearest error". Which is odd, because the first of those is bad and the second is good, no?

Is there something I'm missing that causes this to sound like a more natural way of describing things to others' ears?

Thanks for the links! (That goes for Wei and Paul too.)

a group of researchers are beginning to think that in a broader sense "adversarial vulnerability" and "amount of test set error" are inextricably linked in a deep and foundational way - that they may not even be two separate problems.

I'd expect this to be true or false depending on the shape of the misclassified region. If you think of the input space as a white sheet, and the misclassified region as red polka dots, then we measure test error by throwing a dart at the sheet and checking if it hits a polka dot. To measure adversarial vulnerability, we take a dart that landed on a white part of the sheet and check the distance to the nearest red polka dot. If the sheet is covered in tiny red polka dots, this distance will be small on average. If the sheet has just a few big red polka dots, this will be larger on average, even if the total amount of red is the same.

As a concrete example, suppose we trained a 1-nearest-neighbor classifier for 2-dimensional RGB images. Then the sheet is mostly red (because this is a terrible model), but there are splotches of white associated with each image in our training set. So this is a model that has lots of test error despite many spheres with 0% misclassifications.

To measure the size of the polka dots, you could invert the typical adversarial perturbation procedure: Start with a misclassified input and find the minimal perturbation necessary to make it correctly classified.

(It's possible that this sheet analogy is misleading due to the nature of high-dimensional spaces.)
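For what it's worth, the sheet analogy can be simulated directly in 2D. In this toy sketch (disk counts, radii, and positions are arbitrary), both configurations have the same total "red" area, but the average distance from a white point to the nearest error differs substantially:

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_dist_to_nearest_error(centers, radius, n=5000):
    # Sample points in the unit square; for those outside every "red"
    # (misclassified) disk, measure the distance to the nearest disk edge.
    pts = rng.uniform(0.0, 1.0, (n, 2))
    d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2) - radius
    nearest = d.min(axis=1)
    return nearest[nearest > 0].mean()

# Same total misclassified area in both cases:
# 100 small dots of radius 0.01 vs. 1 big dot of radius 0.1.
many_small = mean_dist_to_nearest_error(rng.uniform(0.0, 1.0, (100, 2)), 0.01)
one_big = mean_dist_to_nearest_error(np.array([[0.5, 0.5]]), 0.1)
```

With many scattered small dots, the typical white point sits close to some error; with one big dot, it sits much farther away, even though the test error is identical. Whether this low-dimensional picture survives in high dimensions is exactly the caveat above.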

Anyway, this relates back to the original topic of conversation: the extent to which capabilities research and safety research are separate. If "adversarial vulnerability" and "amount of test set error" are inextricably linked, that suggests that reducing test set error ("capabilities" research) improves safety, and addressing adversarial vulnerability ("safety" research) advances capabilities. The extreme version of this position is that software advances are all good and hardware advances are all bad.

(As a note - not trying to be snarky, just trying to be genuinely helpful, Cubuk et al. 2017 and Goodfellow et al. 2014 are my top two hits for "adversarial examples linearity" in an incognito tab)

Thanks. I'd seen both papers, but I don't like linking to things I haven't fully read.

Thanks. I'd seen both papers, but I don't like linking to things I haven't fully read.

I might just be confused, but this sentence seems like a non sequitur to me. I understood catherio to be responding to your comment about googling and not finding "papers or blog posts supporting [the claim that deep learning is not unusually susceptible to adversarial examples]".

If that was already clear to you then, never mind. I was just confused why you were talking about linking to things, when before the question seemed to be about what could be found by googling.

There doesn't seem to be a lot of work on adversarial examples for random forests. This paper was the only one I found, but it says:

On a digit recognition task, we demonstrate that both gradient boosted trees and random forests are extremely susceptible to evasions.

Also if you look at Figure 3 and Figure 4 in the paper, it appears that the RF classifier is much more susceptible to adversarial examples than the NN classifier.

googling around, I wasn't able to quickly find any papers or blog posts supporting it

I think it's a little bit tricky because decision trees don't work that well for the tasks where people usually study adversarial examples. And this isn't my research area so I don't know much about it.

That said, in addition to the paper Wei Dai linked, there is also this, showing that adversarial examples for neural nets transfer pretty well to decision trees (though I haven't looked at that paper in any detail).

Well, I haven't seen even a blog post's worth of effort put into doing something like what I suggested.

I think blog posts are potentially weird measures of effort, here. I also think that this is something that people are interested in doing--I think it's a component of MIRI's strategic sketch here, as part 8--but isn't the sort of thing where we have anything particularly worthwhile to show for it yet.

Perhaps it makes sense to sketch an argument for why none of the standard paradigms satisfy some desideratum? This is kind of what AI Safety Gridworlds did. But it's more the thing where, say, gradient boosted random forests have more of the 'transparency' property in a particular, legalistic way (it's easier to assign blame for any particular classification than it would be with a neural net) but not in the way that we actually care about (being able to look at a gradient boosted random forest and figure out whether it's thinking about things in the way we want it to), which might actually be easier with a neural net (because we could look at what neuron activations correspond to).

I have a model that there's something like a Pareto distribution where 20% of the people in a field contribute 80% of the Actually Important advances, and that within that productive 20%, a further 20% split of people (those who are deliberately and strategically choosing fields such that they can rationally expect to make advances) accounts for about 80% of those advances. This implies that, for instance in climate change, there will be ~4% of people who have actually done a Fermi estimate of their impact on climate change, and that they will contribute ~64% of the relevant advances in the field.
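The ~4% and ~64% figures just come from composing the two splits:

```python
# Composing the two nested 80/20 splits from the model above:
strategic_fraction = 0.20 * 0.20       # strategic slice of the productive slice of people
their_share_of_advances = 0.80 * 0.80  # share of Actually Important advances they produce
```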

One response is to say that this is awful, that you'd really like a field without this ridiculous distribution, and to tell people to wait before going into the field until they can contribute to Actually Important things. But there seem to be a lot of countervailing forces preventing this, including the status incentive of saying "this is a field only for people who work on Actually Important things." If your timelines are really short, you might not be worried about this, but it does seem like something to worry about over a decade or so of putting this message out in a specific field.

The other way to handle it would be to expect the Pareto distribution to happen because most people just aren't strategic about their careers, and do rationalize. The goal in that case is to just try and grow the field as much as possible, and know that some small percentage of the people who go into the field will be strategic thinkers who will contribute quite a bit. Not only does this strategy seem to actually capture the pattern of fields that have grown and made significant advances in solving problems, but it also has the benefit of getting the additional ~36% of Actually Important advances that come from people who aren't strategically trying to create impact.

Somewhere, recently, I saw someone comment almost in passing that grad school shouldn't cost anything. I can't find the source now. Maybe someone can clarify if that's a serious claim? I've been under the impression for a while that grad school and academia would be an awfully expensive way to acquire the prerequisite knowledge for AI safety work.

Expensive in terms of time, perhaps, but almost all good universities in the US and continental Europe provide decent salaries to PhD students. The UK is a bit more haphazard, but it's still very rare for UK PhD students to actually pay to be there, especially in technical fields.

Specifically, the salary is for being a teaching assistant or a research assistant, rather than being a student, but everything is structured under the assumption that graduate students will have a relevant part-time job that covers tuition and living expenses.

I think that's true in the US, but not in most of Europe. E.g. in Switzerland a first-year PhD student gets paid $40,000 a year WITHOUT doing any teaching, and more if they teach. That's unusually generous, but I think the setup isn't uncommon.

Thank you for posting this here. I mostly agree with the statement that acquiring skills early on is more important than producing anything directly. There's one thing that bugs me, however.

Early in your research career, you need to be in "consume" mode more than "produce" mode [...]

Counterpoint: if you spend most of your early research career in "consume" mode, you won't get any practice at producing valuable research, or even know whether producing science is a good fit for you personally. I've personally seen people who are extremely good at processing large amounts of content during their studies, but got completely lost when tasked with finding and studying a novel problem that no-one had written on before. This seems like some kind of trap that many PhD students run into. Sometimes there's just no good way to learn how to do research other than, y'know, by doing research.

Curated both for being concretely helpful advice to advice givers, and for highlighting an overall rationality-failure-mode that's easy to fall into.