Transparency and AGI safety

Interesting, thanks! I stand corrected (and will read your paper)...

Transparency and AGI safety

Thanks Rohin! Agree with and appreciate the summary as I mentioned before. 

 I don’t agree with motivation 1 as much: if I wanted to improve AI timeline forecasts, there are a lot of other aspects I would investigate first. (Specifically, I’d improve estimates of inputs into <@this report@>(@Draft report on AI timelines@).) Part of this is that I am less uncertain than the author about the cruxes that transparency could help with, and so see less value in investigating them further.

I'm curious: does this mean that you're on board with the assumption in Ajeya’s report that 2020 algorithms and datasets + "business as usual" in algorithm and dataset design will scale up to strong AI, with compute being the bottleneck? I feel both uncertain about this assumption and uncertain about how to update on it one way or the other. (But this probably belongs more in a discussion of that report and is kind of off topic here.)

  • The "alien in a box" hypothetical made sense to me (mostly), but I didn't understand the "lobotomized alien" hypothetical. I also didn't see how this was meant to be analogous to machine learning. One concrete question: why are we assuming that we can separate out the motivational aspect of the brain? (That's not my only confusion, but I'm having a harder time explaining other confusions.)

A more concrete version of the “lobotomized alien” hypothetical might be something like this: there’s a neuroscience model that sometimes gets discussed around here, according to which human cognition works by running some sort of generative model over the neocortex, with a loss function that's modulated by stuff going on in the midbrain (see e.g. https://www.lesswrong.com/posts/diruo47z32eprenTg/my-computational-framework-for-the-brain). Suppose that you buy this theory, and now suppose that we’re the AIs being trained in a simulation by a more advanced alien race. Then one way that the aliens could try to get us to do stuff for them might be to reinstantiate just a human neocortex and train it from scratch on a loss function + dataset of their choice, as some sort of souped-up unsupervised learning algorithm.

In this example, I’m definitely just assuming by fiat that the cognition and motivation parts of the brain are well-separated (and moreover, that the aliens are able to discover this, say by applying some coarse-grained transparency tools). So it’s just a toy model for how things *could* go, not necessarily how they *will* go.

  • It feels like your non-agentic argument is too dependent on how you defined "AGI". I can believe that the first powerful research accelerator will be limited to language, but that doesn't mean that other AI systems deployed at the same time will be limited to language.

Hmm. I think I agree that this is a weak point of the argument and it's not clear how to patch it. I think I had some intuition like, even once we have some sort of pretrained AGI algorithm (like an RL agent trained in simulation), we would have to fine-tune it on real-world tasks one at a time by coming up with a curriculum for each of those tasks; this seems easier to do for simple bounded tasks than for more open-ended ones (though in some sense that needs to be made more precise, and is maybe already assuming some things about alignment); and "research acceleration" seems like a much narrower task with a relatively well-defined training set of papers, books, etc. than "AI agent that competently runs a company", so might still come first on those grounds. But even then it would have to come first by a large enough margin for insights from the research accelerator to actually be implemented, for this argument to work. So there's at least a gap there...

  • It seems like there's a pretty clear argument for language models to be deceptive -- the "default" way to train them is to have them produce outputs that humans like; this optimizes for being convincing to humans, which is not necessarily the same as being true. (However, it's more plausible to me that the first such model won't cause catastrophic risk, which would still be enough for your conclusions.)

Yeah, fair enough. I should have said that I don't see a path for language models to get selection pressure in the direction of being catastrophically deceptive like in the old "AI getting out of the box" stories, so I think we agree.

Transparency and AGI safety

Thanks for the comment! Naively I feel like dropout would make things worse for the reason that you mentioned and anti-dropout better, but I’m definitely not an expert on this stuff.

I’m not sure I totally understand your first idea. Is the idea something like

- Feed some images through a NN and record which neurons have high average activation on them

- Randomly pick some of those neurons and record which dataset examples cause them to have a high average activation

- Pick some subset of those images and iterate until convergence?
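To make the question concrete, here is a rough sketch of the loop as I understand it. Everything in it is an assumption on my part: the toy one-layer ReLU network, the function names `activations` and `iterate_neurons_and_images`, and the choices of "top 2k neurons, sample k of them" and "keep the top n images" are hypothetical stand-ins, not anything from your comment. Because the neuron subset is sampled randomly, convergence isn't guaranteed, so the sketch just reports whether the image set stopped changing.

```python
import numpy as np


def activations(weights, images):
    """Hidden activations of a toy one-layer ReLU network (hypothetical stand-in for a real NN)."""
    return np.maximum(images @ weights, 0.0)


def iterate_neurons_and_images(acts, seed_images, n_neurons=5, n_images=10,
                               max_iters=50, rng=None):
    """Alternate between 'top neurons for an image set' and 'top images for a neuron set'.

    acts: array of shape (num_images, num_neurons) of precomputed activations.
    Returns (image_indices, neuron_indices, converged).
    """
    rng = np.random.default_rng(rng)
    images = np.sort(np.asarray(seed_images))
    neurons = np.array([], dtype=int)
    for _ in range(max_iters):
        # Step 1: neurons with the highest average activation on the current images.
        top_neurons = np.argsort(acts[images].mean(axis=0))[-2 * n_neurons:]
        # Step 2: randomly pick some of those neurons...
        neurons = np.sort(rng.choice(top_neurons, size=n_neurons, replace=False))
        # ...and record the dataset examples with the highest average activation on them.
        new_images = np.sort(np.argsort(acts[:, neurons].mean(axis=1))[-n_images:])
        # Step 3: stop once the image set no longer changes.
        if np.array_equal(new_images, images):
            return images, neurons, True
        images = new_images
    return images, neurons, False


# Usage on random data (assumed shapes: 200 "images" of dimension 64, 32 neurons).
data_rng = np.random.default_rng(0)
weights = data_rng.normal(size=(64, 32))
data = data_rng.normal(size=(200, 64))
acts = activations(weights, data)
images, neurons, converged = iterate_neurons_and_images(acts, seed_images=range(10), rng=0)
```

If this is roughly the idea, then the interesting question is whether the fixed points (when they exist) correspond to anything human-interpretable.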

Transparency and AGI safety

Thanks a lot for all the effort you put into this post! I don't agree with anything, but reading and commenting it was very stimulating, and probably useful for my own research.

 Likewise, thanks for taking the time to write such a long comment! And hoping that's a typo in the second sentence :)

I'm quite curious about why you wrote this post. If it's for convincing researchers in AI Safety that transparency is useful and important for AI Alignment, my impression is that many researchers do agree, and those who don't tend to have thought about it for quite some time (Paul Christiano comes to mind, as someone who is less interested in transparency while knowing a decent amount about it). So if the goal was to convince people to care about transparency, I'm not sure this post was necessary. 

Fair enough! Since I'm pretty new to thinking about this stuff, my main goal was to convince myself and organize my own thoughts around this topic. I find that writing a review is often a good way to get up to speed on something. Then once I’d written it, it seemed like I might as well post it somewhere.

Wrt the community though, I’d be especially curious to get more feedback on Motivation #2. Do people not agree that transparency is *necessary* for AI Safety? And if they do agree, then why aren’t more people working on it?

I agree with the idea, with maybe the caveat that it doesn't apply to Ems à la Hanson. A similar argument could hold about neuroscience facts we would need to know to scan and simulate brains, though.

Yeah, I'd add that even if we had a similar hardware-based forecast for mapping the human connectome, there would still be a lot that we don't know about the dynamics there too. I have the impression that basically all ways to forecast things in this space have to make some non-obvious (to me) assumption that business as usual will scale up to strong AI without a need for qualitative breakthroughs.

My take on why verification might scale is that we will move towards specification of properties of the program instead of its input/output relation. So verifying whether the code satisfies some formal property that indicates myopia or low goal-directedness. Note that transparency is still really important here, because even with completely formal definitions of things like myopia and goal-directedness, I think transparency will be necessary to translate them into properties of the specific class of models studied (neural networks for example).

I agree, but think that transparency is doing most of the work there (i.e. what you say sounds more to me like an application of transparency than scaling up the way that verification is used in current models.) But this is just semantics.

I think this misses a very big part of what makes a paperclip-maximizer dangerous -- the fact that it can come up with catastrophic plans after it's been deployed. So it doesn't have to be explicitly deceptive and biding its time; it might just be really competent and focused on maximizing paperclips, which requires more than exact transparency to catch. It requires being able to check properties that ensure the catastrophic outcomes won't happen.

Hm, I want to disagree, but this may just come down to a difference in what we mean by deployment. In the paragraph that you quoted, I was imagining the usual train/deploy split from ML where deployment means that we’ve frozen the weights of our AI and prohibit further learning from taking place. In that case, I’d like to emphasize that there’s a difference between intelligence as a meta-ability to acquire new capabilities and a system’s actual capabilities at a given time. Even if an AI is superintelligent, i.e. able to write new information into its weights extremely efficiently, once those weights are fixed, it can only reason and plan using whatever object-level knowledge was encoded in them up to that point. So if there was nothing about bio weapons in the weights when we froze them, then we wouldn't expect the paperclip-maximizer to spontaneously make plans involving bio weapons when deployed. 

On the other hand, none of this would apply to the “alien in a box” model that would basically be continuously training by my definition (though in that case, we could still patch the solution by monitoring the AI in real time). So maybe it was a poor choice of words.

An AI who does AI Safety research is properly terrifying. I'm really stunned by this choice, as I think this is probably one of the most dangerous cases of oracle AI that I can think of. I see two big problems with it:

  • It looks like exactly the kind of task where, if we haven't solved AI alignment in advance, Goodhart is upon us. What's the measure? What's the proxy? Best case scenario: the AI is clearly optimizing something stupid, and nobody cares. Worst case scenario, and the more probable one because the AI is actually supposed to outperform humans: it pushes for something that looks like it makes sense but doesn't actually work, and we might use these insights to build more advanced AGIs and be fucked.
  • It's quite simple to imagine a Predict-o-matic type scenario: pushing simpler and easier models that appear to work but don't, so that its task becomes easier.


I don't think any of the intuitions given work, for a simple reason: even if the research agenda doesn't require in itself any real knowledge of humans, the outputs still have to be humanly understandable. I want the AI to write blog posts that I can understand. So it will have to master clear writing, which seems from experience to require a lot of modeling of the other (and as a human, I get a bunch of things for free unconsciously, that an AI wouldn't have, like a model of emotions).

These two comments seem related so let me reply to them together. I think what you're asking here is “how can we be sure that a "research accelerator" AI, trained to help with a self-contained AI safety agenda such as transparency, will produce solutions that we can understand before we implement them [so as to avoid getting tricked into implementing something that turns out to be bad, as in your first quote]?” And I would answer that I’ve made an assumption that knowledge is universal and new ideas are discovered by incrementally building on existing ones. This is why basically any student today knows more about science than the smartest people from a century ago, and on the flip side, I think it would constrain how far beyond us the insights from early AGIs trained on our work could be. Suppose an AI system was trained on a dataset of existing transparency papers to come up with new project ideas in transparency. Then its first outputs would probably use words like neurons and weights instead of some totally incomprehensible concepts, since those would be the very same concepts that would let it efficiently make sense of its training set. And new ideas about neurons and weights would then be things that we could independently reason about even if they’re very clever ideas that we didn’t think of ourselves, just like you and I can have a conversation about circuits even if we didn’t come up with them.

Another issue with this proposal is that you're saying on one side that the AI is superhuman at technical AI safety, and on the other hand that it can only do these specific proposals that don't use anything about humans. That's like saying that you have an AI that wins at any game, but in fact it only works for chess. Either the AI can do research on everything in AI Safety, and it will probably have to understand humans; or it is specifically for one research proposal, but then I don't see why not create other AIs for other research proposals. The technology is available, and the incentives would be here (if only to be as productive as the other researchers who have an AI to help them).

Agree that there's a (strong!) assumption being made that "research accelerators for narrow agendas" will come before potentially dangerous AI systems. I think this might actually be a weak point of my story. Rohin asked something similar in the second bullet-point of his comment so I’ll try to answer there...