PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter.

rohinmshah's Comments

What are the best arguments and/or plans for doing work in "AI policy"?
If we don't have such a clear distinction, then there's not much that we can do, except ban AI, or ML entirely (or maybe ban AI above a certain compute threshold, or optimization threshold), which seems like a non-starter.

Idk, if humanity as a whole could have a justified 90% confidence that AI above a certain compute threshold would kill us all, I think we could ban it entirely. Like, why on earth not? It's in everybody's interest to do so. (Note that this is not the case with climate change, where it is in everyone's interest for them to keep emitting while others stop emitting.)

This seems probably true even if it was 90% confidence that there is some threshold over which AI would kill us all, that we don't yet know. In this case I imagine something more like a direct ban on most people doing it, and some research that very carefully explores what the threshold is.

This is only any use at all if governments can easily identify tractable research programs that actually contribute to AI safety, instead of having "AI safety" as a cool tagline. I guess that you imagine that that will be the case in the future? Or maybe you think that it doesn't matter if they fund a bunch of terrible, pointless research if some "real" research also gets funded?

A common way in which this is done is to get experts to help allocate funding, which seems like a reasonable way to do this, and probably better than the current mechanisms excepting Open Phil (current mechanism = how well you can convince random donors to give you money).

What? It seems like this is only possible if the technical problem is solved and known to be solved. At that point, the problem is solved.

In the world where the aligned version is not competitive, a government can unilaterally pay the price of not being competitive because it has many more resources.

Also there are other problems you might care about, like how the AI system might be used. You may not be too happy if anyone can "buy" a superintelligent AI from the company that built it; this makes arbitrary humans generally more able to impact the world; if you have a group of not-very-aligned agents making big changes to the world and possibly fighting with each other, things will plausibly go badly at some point.

Again, if there are existing, legible standards of what's safe and what isn't this seems good. But without such standards I don't know how this helps?

Telling what is / isn't safe seems decidedly easier than making an arbitrary agent safe; it feels like we will be able to be conservative about this. But this is mostly an intuition.

I think a general response to your intuition is that I don't see technical solutions as the only options; there are other ways we could be safe (1, 2).


  • We're going to have clear, legible things that ensure safety (which might be "never build systems of this type").
  • Governments are much more competent than you currently believe (I don't know what you believe, but probably I think they are more competent than you do).
  • We have so little evidence / argument so far, that just the model uncertainty means that we can't conclude "it is unimportant to think about how we could use the resources of the most powerful actors in the world".

What are the best arguments and/or plans for doing work in "AI policy"?

What? I feel like I must be misunderstanding, because it seems like there are broad categories of things that governments can do that are helpful, even if you're only worried about the risk of an AI optimizing against you. I guess I'll just list some, and you can tell me why none of these work:

  • Funding safety research
  • Building aligned AIs themselves
  • Creating laws that prevent races to the bottom between companies (e.g. "no AI with >X compute may be deployed without first conducting a comprehensive review of the chance of the AI adversarially optimizing against humanity")
  • Monitoring AI systems (e.g. "we will create a board of AI investigators; everyone making powerful AI systems must be evaluated once a year")

I don't think there's a concrete plan that I would want a government to start on today, but I'd be surprised if there weren't such plans in the future when we know more (both from more research, and because the AI risk problem will be clearer).

You can also look at the papers under the category "AI strategy and policy" in the Alignment Newsletter database.

Comment on Coherence arguments do not imply goal directed behavior

I think this formulation of goal-directedness is pretty similar to one I suggested in the post before the coherence arguments post (Intuitions about goal-directed behavior, section "Our understanding of the behavior"). I do think this is an important concept to explain our conception of goal-directedness, but I don't think it can be used as an argument for AI risk, because it proves too much. For example, for many people without technical expertise, the best model they have for a laptop is that it is pursuing some goal (at least, many of my relatives frequently anthropomorphize their laptops). Should they worry that their laptops are going to take over the world?

For a deeper response, I'd recommend Intuitions about goal-directed behavior. I'll quote some of the relevant parts here:

There is a general pattern in which as soon as we understand something, it becomes something lesser. As soon as we understand rainbows, they are relegated to the “dull catalogue of common things”. This suggests a somewhat cynical explanation of our concept of “intelligence”: an agent is considered intelligent if we do not know how to achieve the outcomes it does using the resources that it has (in which case our best model for that agent may be that it is pursuing some goal, reflecting our tendency to anthropomorphize). That is, our evaluation about intelligence is a statement about our epistemic state.
[... four examples ...]
To the extent that the Misspecified Goal argument relies on this intuition, the argument feels a lot weaker to me. If the Misspecified Goal argument rested entirely upon this intuition, then it would be asserting that because we are ignorant about what an intelligent agent would do, we should assume that it is optimizing a goal, which means that it is going to accumulate power and resources and lead to catastrophe. In other words, it is arguing that assuming that an agent is intelligent definitionally means that it will accumulate power and resources. This seems clearly wrong; it is possible in principle to have an intelligent agent that nonetheless does not accumulate power and resources.
Also, the argument is not saying that in practice most intelligent agents accumulate power and resources. It says that we have no better model to go off of other than “goal-directed”, and then pushes this model to extreme scenarios where we should have a lot more uncertainty.

See also the summary of that post:

“From the outside”, it seems like a goal-directed agent is characterized by the fact that we can predict the agent’s behavior in new situations by assuming that it is pursuing some goal, and as a result it acquires power and resources. This can be interpreted either as a statement about our epistemic state (we know so little about the agent that our best model is that it pursues a goal, even though this model is not very accurate or precise) or as a statement about the agent (predicting the behavior of the agent in new situations based on pursuit of a goal actually has very high precision and accuracy). These two views have very different implications on the validity of the Misspecified Goal argument for AI risk.

Coherence arguments do not imply goal-directed behavior

I pretty strongly agree with this review (and jtbc it was written without any input from me, even though Daniel and I are both at CHAI).

I think of 'coherence arguments' as including things like 'it's not possible for you to agree to give me a limitless number of dollars in return for nothing', which does imply some degree of 'goal-direction'.

Yeah, maybe I should say "coherence theorems" to be clearer about this? (Like, it isn't a theorem that I shouldn't give you a limitless number of dollars in return for nothing; maybe I think that you are more capable than me and fully aligned with me, and so you'd do a better job with my money. Or maybe I value your happiness, and the best way to purchase it is to give you money no strings attached.)

Responses from outside this camp

Fwiw, I do in fact worry about goal-directedness, but (I think) I know what you mean. (For others, I think Daniel is referring to something like "the MIRI camp", though that is also not an accurate pointer, and it is true that I am outside that camp.)

My responses to the questions:

  1. The ones in Will humans build goal-directed agents?, but if you want arguments that aren't about humans, then I don't know.
  2. Depends on the distribution over utility functions, the action space, etc, but e.g. if it uniformly selects a numeric reward value for each possible trajectory (state-action sequence) where the actions are low-level (e.g. human muscle control), astronomically low.
  3. That will probably be a good model for some (many?) powerful AI systems that humans build.
  4. I don't know. (I think it depends quite strongly on the way in which we train powerful AI systems.)
  5. Not likely at low levels of intelligence, plausible at higher levels of intelligence, but really the question is not specified enough.

[AN #76]: How dataset size affects robustness, and benchmarking safe exploration by measuring constraint violations

It goes under many names, such as transfer learning, robustness to distributional shift / data shift, and out-of-distribution generalization. Each one has (to me) slightly different connotations, e.g. transfer learning suggests that the researcher has a clear idea of the distinction between the first and second setting (and so you "transfer" from the first to the second), whereas if in RL you change which part of the state space you're in as you act, I would be more likely to call that distributional shift rather than transfer learning.

Verification and Transparency

Planned newsletter summary:

This post points out that verification and transparency have similar goals. Transparency produces an artefact that allows the user to answer questions about the system under investigation (e.g. "why did the neural net predict that this was a tennis ball?"). Verification on the other hand allows the user to pose a question, and then automatically answers that question (e.g. "is there an adversarial example for this image?").
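
To make the distinction concrete, here is a minimal sketch on a toy model. Everything in it is an assumption for illustration rather than something taken from the post: a hypothetical binary linear classifier, an L-infinity perturbation budget `eps`, and the function names. `contributions` plays the role of a transparency artefact that a user inspects to answer their own questions, while `has_adversarial_example` plays the role of verification: the user poses one fixed question and gets an automatic yes/no answer.

```python
from typing import List, Sequence


def score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Linear score w.x + b; the (hypothetical) classifier predicts class 1 iff score > 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def contributions(w: Sequence[float], x: Sequence[float]) -> List[float]:
    """Transparency-style artefact: per-feature contribution w_i * x_i to the score.

    The user reads this and draws their own conclusions about the prediction.
    """
    return [wi * xi for wi, xi in zip(w, x)]


def has_adversarial_example(
    w: Sequence[float], b: float, x: Sequence[float], eps: float
) -> bool:
    """Verification-style query: is there an x' with max_i |x'_i - x_i| <= eps
    whose predicted class differs from that of x?

    For a linear model, the score can move by at most eps * ||w||_1 inside
    that ball, and this bound is attained, so the answer is exact.
    """
    s = score(w, b, x)
    max_shift = eps * sum(abs(wi) for wi in w)
    if s > 0:
        # Class 1 flips to class 0 if the score can be pushed down to <= 0.
        return s - max_shift <= 0
    # Class 0 flips to class 1 if the score can be pushed strictly above 0.
    return s + max_shift > 0


w, b, x = [1.0, -2.0], 0.5, [1.0, 0.25]
print(contributions(w, x))                      # artefact for a human to inspect
print(has_adversarial_example(w, b, x, 0.125))  # automatic answer to a posed question
print(has_adversarial_example(w, b, x, 0.5))
```

For real neural nets neither side is this easy: the exact check above relies on linearity, and practical verifiers typically have to use sound but incomplete bounds instead.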

Clarifying "AI Alignment"

Crystallized my view of what the "core problem" is (as I explained in a comment on this post). I think I had intuitions of this form before, but at the very least this post clarified them.