James Payor

Comments

On A List of Lethalities

I agree!

I think that in order to achieve this you probably have to do lots of white-box things: watching the AI's internal state, attempting to shape the direction of its learning, and keeping a careful eye out for pitfalls. And I expect that treating the AI more as a black box and focusing on containment isn't going to be remotely safe enough.

On A List of Lethalities

I think there are far easier ways out of the box than that. Especially so if you have that detailed a model of the human's mind, but even without. I think Eliezer wouldn't be handicapped if not allowed to use that strategy. (Also fwiw, that strategy wouldn't work on me.)

For instance, you could hack the human if you knew a lot about their brain. Absent that, you could try anything from convincing them that you're a moral patient to promising them part of the lightcone, paired with the credible claim that another AGI company will kill everyone otherwise. These ideas of mine aren't very good though.

Regarding whether boxing can be an arduous constraint, I don't see how having access to many simulated copies of the AI helps when the AI is a blob of numbers you can't inspect. It doesn't seem to make progress on the problems we need to solve in order to wrangle such an AI into doing the work we want. I guess I remain skeptical.

On A List of Lethalities

I will also add a point re "just do AI alignment math":

Math studies the structures of things. A solution to our AI alignment problem has to be something we can use, in this universe. The structure of this problem is laden with concepts like agents and deception, and in order to derive anything relevant for us, our AI is going to need to understand all of that.

Most of the work of solving AI alignment does not look like proving things that are hard to prove. It involves puzzling over the structure of agents trying to build agents, and trying to find a promising angle on how we could build an agent that will help us get what we want. If you want your AI to solve alignment, it has to be able to do this.

This sketch of the problem puts "solve AI alignment" in a dangerous capability reference class for me. I do remain hopeful that we can find places where AI can help us along the way. But I personally don't know of current avenues where we could use non-scary AI to meaningfully help.

On A List of Lethalities

By the time your AI can design, say, working nanotech, I'd expect it to be well superhuman at hacking, and able to understand things like Rowhammer. I'd also expect it to be able to build models of its operators and conceive of deep strategies involving them.

Also, convincing your operators to let you out of the box is something Eliezer can purportedly do, and seems much easier than being able to solve alignment. I doubt that anything that could write that alignment textbook has a non-dangerous level of capability.

So I'm suspicious that your region exists, where the AI is smart enough to be useful but dumb enough to remain boxed.

This isn't to say that ideas for boxing aren't helpful on the margin. They don't seem to me like a possible core for a safety story though, and require other ideas to handle the bulk of the work.

On A List of Lethalities

Thanks for writing this! I appreciate hearing how all this stuff reads to you.

I'm writing this comment to push back on the idea that current interpretability work is relevant to the lethal stuff that comes later, à la:

I have heard claims that interpretability is making progress, that we have some idea about some giant otherwise inscrutable matrices and that this knowledge is improving over time.

What I've seen folks understand so far are parts of perception in image-processing neural nets, where certain visual concepts show up in those nets, and more recently some of the structure of how small transformers pipe information around.

The goalpost for this sort of work mattering in the lethal regime is something like improving our ability to watch concepts move through a large mind made out of a blob of numbers, with sufficient fidelity to notice when it's forming understandings of its operators, plans to disable them and escape, or anything much subtler but still lethal.

So I see interpretability falling far short here. In my book this is mostly because interpretability for a messy AGI mind inherits the abject difficulty of making a cleaned up version of that AGI with the same capability level.

We're also making bounds of anti-progress on AGI Cleanliness every year. This makes everything that much harder.

Why all the fuss about recursive self-improvement?

Do you think that things won't look thresholdy even in a capability regime in which a large actor can work out how to melt all the GPUs?

AGI Ruin: A List of Lethalities

Re (14), I guess the ideas are very similar, where the mesaoptimizer scenario is like a sharp example of the more general concept Eliezer points at, that different classes of difficulties may appear at different capability levels.

Re (15), "Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously", which is about how we may have reasons to expect aligned output that is brittle under rapid capability gain: your quote from Richard is just about "fast capability gain seems possible and likely", and isn't about connecting that to increased difficulty in succeeding at the alignment problem?

Re (32), I don't think your quote is talking about the thing Eliezer is talking about, which is that in order to be human-level at modelling human-generated text, your AI must be doing something on par with human thought that figures out what humans would say. Your quote just isn't discussing this, namely that strong imitation requires cognition that is dangerous.

So I guess I don't take much issue with (14) or (15), but I think you're quite off the mark about (32). In any case, I still have a strong sense that Eliezer is successfully being more on the mark here than the rest of us manage to be. Kudos of course to you and others who are working on writing things up and figuring things out. Though I remain sympathetic to Eliezer's complaint.

AGI Ruin: A List of Lethalities

Eliezer's post here is doing work left undone by the writing you cite. It is a much clearer account of how our mainline looks doomed than you'd see elsewhere, and it's frank on this point.

I think Eliezer wishes that these sorts of artifacts, like this post and "There is no fire alarm", weren't only ever things that he writes.

Also, re your excerpts for (14), (15), and (32), I see Eliezer as saying something meaningfully different in each case. I might elaborate under this comment.

AGI Ruin: A List of Lethalities

maybe a reasonable path forward is to try to wring as much productivity as we can out of the passive, superhuman, quasi-oracular just-dumb-data-predictors. And avoid as much as we can ever creating closed-loop, open-ended, free-rein agents.

I should say that I do see this as a reasonable path forward! But we don't seem to be coordinating to do this, and AI researchers seem to love doing work on open-ended agents, which sucks.

Hm, regardless it doesn't really move the needle, so long as people are publishing all of their work. Developing overpowered pattern recognizers is similar to increasing our level of hardware overhang. People will end up using them as components of systems that aren't safe.

AGI Ruin: A List of Lethalities

Can you visualize an agent that is not "open-ended" in the relevant ways, but is capable of, say, building nanotech and melting all the GPUs?

In my picture most of the extra sauce you'd need on top of GPT-3 looks very agenty. It seems tricky to name "virtual worlds" in which AIs manipulate just "virtual resources" and still manage to do something like melting the GPUs.
