All of bmg's Comments + Replies

Hm, I’d probably disagree.

A couple thoughts here:

First: To me, it seems one important characteristic of “planners” is that they can improve their decisions/behavior even without doing additional learning. For example, if I’m playing chess, there might be some move that (based on my previous learning) initially presents itself as the obvious one to make. But I can sit there and keep running mental simulations of different games I haven’t yet played (“What would happen if I moved that piece there…?”) and arrive at better and better decisions.

It doesn’t seem ... (read more)

Search doesn't buy you that much, remember. After relatively few nodes, you've already gotten much of the benefit from finetuning the value estimates (e.g., AlphaZero, or the MuZero appendix). And you can do weight-tying to repeat feedforward layers, or just repeat layers/the model. (Is AlphaFold2 recurrent? Is ALBERT recurrent? A VIN? Or a diffusion model? Or a neural ODE?) This is probably why Jones finds that distilling MCTS runtime search into search-less feedforward parameters comes, empirically, at a favorable exchange rate I wouldn't call 'really huge'.
2 · Daniel Kokotajlo · 1y
OK, thanks. Why is it important that they be able to easily improve their performance without learning? I agree that eight layers doesn't seem like enough to do some serious sequential pondering. For comparison, humans take multiple seconds--often minutes--of subjective time to do this, at something like 100 sequential steps per second.

I really appreciate you taking the time both to write this report and solicit/respond to all these reviews! I think this is a hugely valuable resource, that has helped me to better understand AI risk arguments and the range of views/cruxes that different people have.

A couple quick notes related to the review I contributed:

First, 0.4% is the credence implied by my credences in individual hypotheses, but I was a little surprised by how small this number turned out to be. (I would have predicted closer to a couple percent at the time.) I’m sympathetic to the ... (read more)

4 · Joe Carlsmith · 1y
I’m glad you think it’s valuable, Ben — and thanks for taking the time to write such a thoughtful and detailed review. Yes, I am too. I’m thinking about the right way to address this going forward. I’ll respond re: planning in the thread with Daniel.
2 · Daniel Kokotajlo · 1y
I'm curious to hear more about how you think of this AlphaGo example. I agree that probably the version of AlphaGo without MCTS is not doing any super detailed simulations of different possible moves... but I think in principle it could be, for all we know, and I think that if you kept making the neural net bigger and bigger and training it for longer and longer, eventually it would be doing something like that, because the simplest circuit that scores highly in the training environment would be a circuit that does something like that. Would you disagree?
Answer by bmg · Dec 12, 2020 · 4

A tricky thing here is that it really depends how quickly a technology is adopted, improved, integrated, and so on.

For example, it seems like computers and the internet caused a bit of a surge in American productivity growth in the 90s. The surge wasn't anything radical, though, for at least a few reasons:

  1. Continued technological progress is necessary just to sustain steady productivity growth.

  2. It's apparently very hard, in general, to increase aggregate productivity.

  3. The adoption, improvement, integration, etc., of information technology was a relat

... (read more)

Since neural networks are universal function approximators, it is indeed the case that some of them will implement specific search algorithms.

I don't think this specific point is true. It seems to me like the difference between functions and algorithms is important. You can also approximate any function with a sufficiently large look-up table, but simply using a look-up table to choose actions doesn't involve search/planning.* In this regard, something like a feedforward neural network with frozen weights also doesn't seem importantly different than a l... (read more)

2 · Rafael Harth · 3y
(I somehow didn't notice your comment until now.) I believe you are correct. The theorem for function approximation I know also uses brute force (i.e., large networks) in the proof, so it doesn't seem like evidence for the existence of [weights that implement algorithms]. (And I am definitely not talking about algorithms in terms of input/output behavior.) I've changed the paragraph into Anyone who knows of alternative evidence I can point to here is welcome to reply to this comment.
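The function-versus-algorithm distinction in the thread above can be sketched with a toy example. Everything here is a hypothetical illustration, not anything from the original comments: the game (a pile of stones, take 1 to 3, whoever takes the last stone wins) and the names `wins`, `search_policy`, `TABLE`, and `table_policy` are all assumptions. The two policies agree on every state the table covers, yet only one of them simulates continuations at decision time.

```python
MOVES = (1, 2, 3)  # hypothetical toy game: take 1-3 stones; taking the last stone wins


def wins(stones: int) -> bool:
    """True if the player to move can force a win (exhaustive game-tree search)."""
    return any(m <= stones and not wins(stones - m) for m in MOVES)


def search_policy(stones: int) -> int:
    """Pick a move by simulating continuations at decision time."""
    for m in MOVES:
        if m <= stones and not wins(stones - m):
            return m  # this move leaves the opponent in a losing position
    return 1  # no winning move was found, so just take one stone


# Look-up table with the same input/output behavior on states 1..20.
# It is filled in ahead of time; nothing is simulated when it is queried.
TABLE = {n: search_policy(n) for n in range(1, 21)}


def table_policy(stones: int) -> int:
    """Pick a move by retrieval only; raises KeyError outside the stored states."""
    return TABLE[stones]
```

On states 1 through 20 the two policies are the same function, so input/output behavior alone cannot tell them apart. They come apart off the table: `search_policy(21)` still returns a move, while `table_policy(21)` raises `KeyError`.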

I do agree that OT and ICT by themselves, without any further premises like "AI safety is hard" and "The people building AI don't seem to take safety seriously, as evidenced by their public statements and their research allocation" and "we won't actually get many chances to fail and learn from our mistakes" do not establish more than, say, 1% credence in "AI will kill us all," if even that. But I think it would be a misreading of the classic texts to say that they were wrong or misleading because of this; probably if you went back in time and asked Bost

... (read more)
1 · Sammy Martin · 3y
Perhaps what is going on here is that the arguments as stated in brief summaries like 'orthogonality thesis + instrumental convergence' just aren't what the arguments actually were, and that there were from the start all sorts of empirical or more specific claims made around these general arguments. This reminds me of Lakatos' theory of research programs, where the core assumptions, usually logical or a priori in nature, are used to 'spin off' secondary hypotheses that are more empirical or easily falsifiable. Lakatos' model fits AI safety rather well: OT and IC are some of these non-empirical 'hard core' assumptions that are foundational to the research program. In ~2010 there were some secondary assumptions (discontinuous progress, AI maximises a simple utility function, etc.), but in ~2020 we have some different secondary assumptions: mesa-optimisers, you get what you measure, direct evidence of current misalignment.

I agree that your paper strengthens the IC (and is also, in general, very cool!). One possible objection to the ICT, as traditionally formulated, has been that it's too vague: there are lots of different ways you could define a subset of possible minds, and then a measure over that subset, and not all of these ways actually imply that "most" minds in the subset have dangerous properties. Your paper definitely makes the ICT crisper, more clearly true, and more closely/concretely linked to AI development practices.

I still think, though, that the ICT only get... (read more)

I think we can interpret it as a burden-shifting argument: "Look, given the orthogonality thesis and instrumental convergence, and various other premises, and given the enormous stakes, you'd better have some pretty solid arguments that everything's going to be fine in order to disagree with the conclusion of this book (which is that AI safety is extremely important)." As far as I know no one has come up with any such arguments, and in fact it's now the consensus in the field that no one has found such an argument.

I suppose I disagree that at least the ... (read more)

I think the purpose of the OT and ICT is to establish that lots of AI safety needs to be done. I think they are successful in this. Then you come along and give your analogy to other cases (rockets, vaccines) and argue that lots of AI safety will in fact be done, enough that we don't need to worry about it. I interpret that as an attempt to meet the burden, rather than as an argument that the burden doesn't need to be met.

But maybe this is a merely verbal dispute now. I do agree that OT and ICT by themselves, without any further premises like "... (read more)

for example, the "Universal prior is malign" stuff shows that in the limit GPT-N would likely be catastrophic,

If you have a chance, I'd be interested in your line of thought here.

My initial model of GPT-3, and probably the model of the OP, is basically: GPT-3 is good at producing text that it would have been unsurprising to find on the internet. If we keep training up larger and larger models, using larger and larger datasets, it will produce text that it would be less-and-less surprising to find on the internet. Insofar as there are safety concerns, th... (read more)

4 · Daniel Kokotajlo · 3y
I think it's a reasonable and well-articulated worry you raise. My response is that for the graphing calculator, we know enough about the structure of the program and the way in which it will be enhanced that we can be pretty sure it will be fine. In particular, we know it's not goal-directed or even building world-models in any significant way, it's just performing specific calculations directly programmed by the software engineers. By contrast, with GPT-3 all we know is that it's a neural net that was positively reinforced to the extent that it correctly predicted words from the internet during training, and negatively reinforced to the extent that it didn't. So it's entirely possible that it does, or will eventually, have a world-model and/or goal-directed behavior. It's not guaranteed, but there are arguments to be made that "eventually" it would have both, i.e. if we keep making it bigger and giving it more internet text and training it for longer. I'm rather uncertain about the arguments that it would have goal-directed behavior, but I'm fairly confident in the argument that eventually it would have a really good model of the world. The next question is then how this model is chosen. There are infinitely many world-models that are equally good at predicting any given dataset, but that diverge in important ways when it comes to predicting whatever is coming next. It comes down to what "implicit prior" is used. And if the implicit prior is anything like the universal prior, then doom. Now, it probably isn't the universal prior. But maybe the same worries apply.