james.lucassen

So let's call "reasoning models" like o1 what they really are: the first true AI agents.

I think the distinction between systems that perform a single forward pass and then stop, and systems that run an OODA loop (tool use), is starker than the difference between "reasoning" and "chat" models, and I'd prefer to use "agent" for that distinction.
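
For concreteness, here's a minimal sketch of the distinction I mean - a single-call "chat" usage versus a tool-use loop - with hypothetical stub functions standing in for the model and tools (this isn't any real API):

```python
# Minimal sketch only: call_model and run_tool are hypothetical stand-ins,
# not any real model or tool API.

def call_model(prompt: str) -> dict:
    """Pretend single sampled completion from a model."""
    return {"action": "finish", "content": f"answer to: {prompt}"}

def run_tool(name: str, args: str) -> str:
    """Pretend tool execution (search, code interpreter, etc.)."""
    return f"result of {name}({args})"

def chat_once(prompt: str) -> str:
    # "Chat"-style use: one model call, then stop.
    return call_model(prompt)["content"]

def agent_loop(prompt: str, max_steps: int = 10) -> str:
    # "Agent"-style use: an observe-orient-decide-act loop that keeps calling
    # tools and feeding results back in until the model decides to finish.
    context = prompt
    for _ in range(max_steps):
        step = call_model(context)
        if step["action"] == "finish":
            return step["content"]
        context += "\n" + run_tool(step["action"], step.get("args", ""))
    return context  # ran out of steps
```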

I do think that "reasoning" is a bit of a market-y name for this category of system though. "chat" vs "base" is a great choice of words, and "chat" is basically just a description of the RL objective those models were trained with.

If I were the terminology czar, I'd call o1 a "task" model.

I agree that I wouldn't want to lean on the sweet-spot-by-default version of this, and I agree that the example is less strong than I thought it was. I still think there might be safety gains to be had from blocking higher-level reflection if you can do it without damaging lower-level reflection. I don't think that requires a task where the AI doesn't try and fail and re-evaluate - it just requires that the re-evaluation never climbs above a certain level in the stack.

There's such a thing as being pathologically persistent, and such a thing as being pathologically flaky. It doesn't seem too hard to train a model that will be pathologically persistent in some domains while remaining functional in others. A lot of my current uncertainty is bound up in how robust these boundaries are going to have to be.

I want to flag this as an assumption that isn't obvious. If this were true for the problems we care about, we could solve them by employing a lot of humans.

humans provides a pretty strong intuitive counterexample

Yup, not obvious. I do in fact think a lot more humans would be helpful. But I also agree that my mental picture of a "transformative human-level research assistant" relies heavily on serial speedup, and I can't immediately picture a version that feels similarly transformative without speedup. Maybe evhub or Ethan Perez or one of the folks running a thousand research threads at once would disagree.

such plans are fairly easy and don't often raise flags that indicate potential failure

Hmm. This is a good point, and I agree that it significantly weakens the analogy.

I was originally going to counter-argue and claim something like "sure, total failure forces you to step back far, but it doesn't mean you have to step back literally all the way". Then I tried to back that up with an example, such as "when I was doing alignment research, I encountered total failure that forced me to abandon large chunks of my planning stack, but this never caused me to 'spill upward' to questioning whether or not I should be doing alignment research at all". But uh, then I realized that isn't actually true :/

We want particularly difficult work out of an AI.

On consideration, yup this obviously matters. The thing that causes you to step back from a goal is that goal being a bad way to accomplish its supergoal, aka "too difficult". Can't believe I missed this, thanks for pointing it out.

I don't think this changes the picture too much, besides increasing my estimate of how much optimization we'll have to do to catch and prevent value-reflection. But a lot of muddy half-ideas came out of this that I'm interested in chewing on.

Maybe I'm just reading my own frames into your words, but this feels quite similar to the rough model of human-level LLMs I've had in the back of my mind for a while now.

You think that an intelligence that doesn't-reflect-very-much is reasonably simple. Given this, we can train chain-of-thought type algorithms to avoid reflection using examples of not-reflecting-even-when-obvious-and-useful. With some effort on this, reflection could be crushed at some small-ish capability penalty but with massive benefits for safety.

In particular, this reads to me like the "unstable alignment" paradigm I wrote about a while ago.

You have an agent which is consequentialist enough to be useful, but not so consequentialist that it'll do things like spontaneously notice conflicts in the set of corrigible behaviors you've asked it to adhere to and undertake drastic value reflection to resolve those conflicts. You might hope to hit this sweet spot by default, because humans are in a similar sort of sweet spot. It's possible to get humans to do things they massively regret upon reflection as long as their day-to-day work can be done without attending to obvious clues (e.g. the guy who's an accountant for the Nazis for 40 years and doesn't think about the Holocaust; he just thinks about accounting). Or you might try and steer towards this sweet spot by developing ways to block reflection in cases where it's dangerous without interfering with it in cases where it's essential for capabilities.

I'm not sure exactly what mesa is saying here, but insofar as "implicitly tracking the fact that takeoff speeds are a feature of reality and not something people can choose" means "intending to communicate from a position of uncertainty about takeoff speeds", I think he has me right.

I do think mesa is familiar enough with how I talk that the fact he found this unclear suggests it was my mistake. Good to know for future.

Ah, didn't mean to attribute the takeoff speed crux to you, that's my own opinion.

I'm not sure what's best in fast takeoff worlds. My message is mainly just that getting weak AGI to solve alignment for you doesn't work in a fast takeoff.

"AGI winter" and "overseeing alignment work done by AI" do both strike me as scenarios where agent foundations work is more useful than in the scenario I thought you were picturing. I think #1 still has a problem, but #2 is probably the argument for agent foundations work I currently find most persuasive.

In the moratorium case we suddenly get much more time than we thought we had, which enables longer-payback-time plans. Seems like we should hold off on working on those longer-payback-time plans until we know we have that time, not while it still seems likely that the decisive period is soon.

Having more human agent foundations expertise to better oversee agent foundations work done by AI seems good. How good it is depends on a few things. How much of the work that needs to be done is conceptual breakthroughs (tall) vs schlep with existing concepts (wide)? How quickly does our ability to oversee fall off for concepts more advanced than what we've developed so far? These seem to me like the main ones, and like very hard questions to get certainty on - I think that uncertainty makes me hesitant to bet on this value prop, but again, it's the one I think is best.

I'm on board with communicating the premises of the path to impact of your research when you can. I think more people doing that would've saved me a lot of confusion. I think your particular phrasing is a bit unfair to the slow takeoff camp but clearly you didn't mean it to read neutrally, which is a choice you're allowed to make.

I wouldn't describe my intention in this comment as communicating a justification of alignment work based on slow takeoff? I'm currently very uncertain about takeoff speeds and my work personally is in the weird limbo of not being premised on either fast or slow scenarios.

Nice post, glad you wrote up your thinking here.

I'm a bit skeptical of the "these are options that pay off if alignment is harder than my median" story. The way I currently see things going is:

  • a slow takeoff makes alignment MUCH, MUCH easier [edit: if we get one - I'm uncertain, and I think the correct position given the current state of evidence is uncertainty]
  • as a result, all prominent approaches lean very hard on slow takeoff
  • there is uncertainty about takeoff speed, but folks have mostly given up on reducing this uncertainty

I suspect that even if we have a bunch of good agent foundations research getting done, the result is that we just blast ahead with methods that are many times easier because they lean on slow takeoff - and if takeoff is slow we're probably fine; if it's fast, we die.

Ways that could not happen:

  • Work of the form "here are ways we could notice we are in a fast takeoff world before actually getting slammed" produces evidence compelling enough to cause a pause, or to cause leading labs to discard plans that rely on slow takeoff
  • agent foundations research aiming to do alignment in faster takeoff worlds finds a method so good it works better than current slow-takeoff-tailored methods even in the slow takeoff case, and labs pivot to this method

Both strike me as pretty unlikely. TBC, this doesn't mean those types of work are bad - I'm saying low probability, not necessarily low margins.
