Lukas Finnveden

Previously "Lanrian" on here. Research analyst at Open Philanthropy. Views are my own.

Sequences

Project ideas for making transformative AI go well, other than by working on alignment
Extrapolating GPT-N performance

Wiki Contributions

Comments


I'm not sure if the definition of takeover-capable-AI (abbreviated as "TCAI" for the rest of this comment) in footnote 2 quite makes sense. I'm worried that too much of the action is in "if no other actors had access to powerful AI systems", and not that much action is in the exact capabilities of the "TCAI". In particular: Maybe we already have TCAI (by that definition) because if a frontier AI company or a US adversary was blessed with the assumption "no other actor will have access to powerful AI systems", they'd have a huge advantage over the rest of the world (as soon as they develop more powerful AI), plausibly implying that it'd be right to forecast a >25% chance of them successfully taking over if they were motivated to try.

And this seems somewhat hard to disentangle from stuff that is supposed to count according to footnote 2, especially: "Takeover via the mechanism of an AI escaping, independently building more powerful AI that it controls, and then this more powerful AI taking over would" and "via assisting the developers in a power grab, or via partnering with a US adversary". (Or maybe the scenario in the 1st paragraph is supposed to be excluded because current AI isn't agentic enough to "assist"/"partner" with allies, as opposed to just being used as a tool?)

 

What could a competing definition be? Thinking about what we care most about... I think two events especially stand out to me:

  • When would it plausibly be catastrophically bad for an adversary to steal an AI model?
  • When would it plausibly be catastrophically bad for an AI to be power-seeking and non-controlled?

Maybe a better definition would be to directly talk about these two events? So for example...

  1. "Steal is catastrophic" would be true if...
    1. "Frontier AI development projects immediately acquire good enough security to keep future model weights secure" has significantly less probability of AI-assisted takeover than
    2. "Frontier AI development projects immediately have their weights stolen, and then acquire security that's just as good as in (1a)."[1]
  2. "Power-seeking and non-controlled is catastrophic" would be true if...
    1. "Frontier AI development projects immediately acquire good enough judgment about power-seeking-risk that they henceforth choose to not deploy any model that would've been net-negative for them to deploy" has significantly less probability of AI-assisted takeover than
    2. "Frontier AI development acquire the level of judgment described in (2a) 6 months later."[2]

Where "significantly less probability of AI-assisted takeover" could be e.g. at least 2x less risk.

  1. ^

    The motivation for assuming "future model weights secure" in both (1a) and (1b) is so that the downside of getting the model weights stolen imminently isn't nullified by the fact that they're very likely to get stolen a bit later, regardless. Because many interventions that would prevent model weight theft this month would also help prevent it in future months. (And also, we can't contrast 1a'="model weights are permanently secure" with 1b'="model weights get stolen and are then default-level-secure", because that would already have a really big effect on takeover risk, purely via the effect on future model weights, even though current model weights probably aren't that important.)

  2. ^

    The motivation for assuming "good future judgment about power-seeking-risk" is similar to the motivation for assuming "future model weights secure" above. The motivation for choosing "good judgment about when to deploy vs. not" rather than "good at aligning/controlling future models" is that a big threat model is "misaligned AIs outcompete us because we don't have any competitive aligned AIs, so we're stuck between deploying misaligned AIs and being outcompeted" and I don't want to assume away that threat model.

Ok, gotcha.

It's that she didn't accept the reasoning behind that number enough to really believe it. She added a discount factor based on fallacious reasoning around "if it were that easy, it'd be here already".

Just to clarify: There was no such discount factor that changed the median estimate of "human brain compute". Instead, this discount factor was applied to go from "human brain compute estimate" to "human-brain-compute-informed estimate of the compute-cost of training TAI with current algorithms" — adjusting for how our current algorithms seem to be worse than those used to run the human brain. (As you mention and agree with, although I infer that you expect algorithmic progress to be faster than Ajeya did at the time.) The most relevant section is here.
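
(To illustrate the structure of that adjustment with purely made-up round numbers, rather than the report's actual inputs: a lifetime-anchor-style calculation goes roughly

$$\underbrace{10^{15}\ \tfrac{\text{FLOP}}{\text{s}}}_{\text{brain compute}} \times \underbrace{\sim 10^{9}\ \text{s}}_{\text{a human lifetime of learning}} \approx 10^{24}\ \text{FLOP}, \quad \text{then} \times \underbrace{10^{k}}_{\text{algorithmic-inefficiency adjustment}},$$

and the size of that $10^{k}$ adjustment is what the linked section discusses.)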

I suspect there's a cleaner way to make this argument that doesn't talk much about the number of "token-equivalents", but instead contrasts "total FLOP spent on inference" with some combination of:

  • "FLOP until human-interpretable information bottleneck". While models still think in English, and doesn't know how to do steganography, this should be FLOP/forward-pass. But it could be much longer in the future, e.g. if the models get trained to think in non-interpretable ways and just outputs a paper written in English once/week.
  • "FLOP until feedback" — how many FLOP of compute does the model do before it outputs an answer and gets feedback on it?
    • Models will probably be trained on a mixture of different regimes here. E.g.: "FLOP until feedback" being proportional to model size during pre-training (because it gets feedback after each token) and then also being proportional to chain-of-thought length during post-training.
    • So if you want to collapse it to one metric, you'd want to somehow weight by the number of data points and sample efficiency for each type of training (see the sketch below).
  • "FLOP until outcome-based feedback" — same as above, except only counting outcome-based feedback rather than process-based feedback, in the sense discussed in this comment.

Having higher "FLOP until X" (for each of the X in the 3 bullet points) seems to increase danger. While increasing "total FLOP spent on inference" seems to have a much better ratio of increased usefulness : increased danger.
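
A minimal sketch of that weighting, just to make the bookkeeping concrete (every number here, including the sample-efficiency guess, is hypothetical):

```python
# Hypothetical sketch of collapsing "FLOP until feedback" across a training mixture
# into one weighted number, as described above. Every number here is made up.

FORWARD_PASS_FLOP = 1e12  # assumed FLOP per forward pass (i.e. per generated token)

# (regime name, FLOP the model runs before it gets feedback, number of feedback signals)
regimes = [
    ("pre-training", FORWARD_PASS_FLOP, 1e13),           # feedback after every token
    ("post-training RL", FORWARD_PASS_FLOP * 1e4, 1e6),  # feedback after a ~10k-token chain of thought
]

# Hypothetical guess: one RL episode shapes the model about as much as 1000 pre-training tokens.
sample_efficiency = {"pre-training": 1.0, "post-training RL": 1e3}

def weighted_flop_until_feedback(regimes, sample_efficiency):
    """Weight each regime by (number of feedback signals) * (sample efficiency),
    then take the weighted average of its FLOP-until-feedback."""
    weights = [n * sample_efficiency[name] for name, _, n in regimes]
    return sum(w * flop for w, (_, flop, _) in zip(weights, regimes)) / sum(weights)

print(f"{weighted_flop_until_feedback(regimes, sample_efficiency):.2e} weighted FLOP until feedback")
```

(On this toy accounting, spending more total inference FLOP only increases the weighted number to the extent that it comes with end-to-end training on long chains of thought, which is exactly the o1/o3 question below.)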

 

In this framing, I think:

  • Based on what we saw of o1's chains of thought, I'd guess it hasn't changed "FLOP until human-interpretable information bottleneck", but I'm not sure about that.
  • It seems plausible that o1/o3 use RL, and that the models think for much longer before getting feedback. This would increase "FLOP until feedback".
  • Not sure what type of feedback they use. I'd guess that the most outcome-based thing they do is "executing code and seeing whether it passes tests".

It's possible that "many mediocre or specialized AIs" is, in practice, a bad summary of the regime with strong inference scaling. Maybe people's associations with "lots of mediocre thinking" end up being misleading.

Thanks!

I agree that we've learned interesting new things about inference speeds. I don't think I would have anticipated that at the time.

Re:

It seems that spending more inference compute can (sometimes) be used to qualitatively and quantitatively improve capabilities (e.g., o1, recent swe-bench results, arc-agi) rather than merely doing more work in parallel. Thus, it's not clear that the relevant regime will look like "lots of mediocre thinking".[1]

There are versions of this that I'd still describe as "lots of mediocre thinking", adding up to being similarly useful as higher-quality thinking.

(Cf. above from the post: "the collective’s intelligence will largely come from [e.g.] Individual systems 'thinking' for a long time, churning through many more explicit thoughts than a skilled human would need to solve a problem" & "Assuming that much of this happens 'behind the scenes', a human interacting with this system might just perceive it as a single super-smart AI.")

The most relevant question is whether we'll still get the purported benefits of the lots-of-mediocre-thinking regime if there's strong inference scaling. I think we probably will.

Paraphrasing my argument in the "Implications" section:

  • If we don't do much end-to-end training of models thinking a lot, then supervision will be pretty easy. (Even if the models think for a long time, it will all be in English, and each leap-of-logic will be weak compared to what the human supervisors can do.)
  • End-to-end training of models thinking a lot is expensive. So maybe we won't do it by default, or maybe it will be an acceptable alignment tax to avoid it. (Instead favoring "process-based" methods, as the term is used in this post.)
  • Even if we do end-to-end training of models thinking a lot, the model's "thinking" might still remain pretty interpretable to humans in practice.
  • If models produce good recommendations by thinking a lot in either English or something similar to English, then there ought to be a translation/summary of that argument which humans can understand. Then, even if we're giving the models end-to-end feedback, we could give them feedback based on whether humans recognize the argument as good, rather than by testing the recommendation and seeing whether it leads to good results in the real world. (This comment discusses this distinction. Confusingly, this is sometimes referred to as "process-based feedback" as opposed to "outcomes-based feedback", despite it being slightly different from the concept two bullet points up.)

I think o3 results might involve enough end-to-end training to mostly contradict the hopes of bullet points 1-2. But I'd guess it doesn't contradict 3-4.

(Another caveat that I didn't have in the post is that it's slightly trickier to supervise mediocre serial thinking than mediocre parallel thinking, because you may not be able to evaluate a random step in the middle without loading up on earlier context. But my guess is that you could train AIs to help you with this without adding too much extra risk.)

One argument I have been making publicly is that I think Ajeya's Bioanchors report greatly overestimated human brain compute. I think a more careful reading of Joe Carlsmith's report that hers was based on supports my own estimates of around 1e15 FLOPs.

Am I getting things mixed up, or isn’t that just exactly Ajeya’s median estimate? Quote from the report: “Under this definition, my median estimate for human brain computation is ~1e15 FLOP/s.”

https://docs.google.com/document/d/1IJ6Sr-gPeXdSJugFulwIpvavc0atjHGM82QjIfUSBGQ/edit

We did the 80% pledge thing, and that was like a thing that everybody was just like, "Yes, obviously we're gonna do this."

Does anyone know what this is referring to? (Maybe a pledge to donate 80%? If so, curious about 80% of what & under what conditions.)

Related: The monkey and the machine by Paul Christiano. (Bottom-up processes ~= monkey. Verbal planner ~= deliberator. Section IV talks about the deliberator building trust with the monkey.)

A difference between this essay and Paul's is that this one seems to lean further towards "a good state is one where the verbal planner ~only spends attention on things that the bottom-up processes care about", whereas Paul's essay suggests a compromise where the deliberator gets to spend a good chunk of attention on things that the monkey doesn't care about. (In Rand's metaphor, I guess this would be like using some of your investment returns for consumption. Where consumption would presumably count as a type of dead money, although the connotations don't feel exactly right, so maybe it should be in a 3rd bucket.)

Here's the best explanation + study I've seen of Dunning-Kruger-ish graphs: https://www.clearerthinking.org/post/is-the-dunning-kruger-effect-real-or-are-unskilled-people-more-rational-than-it-seems

Their analysis suggests that their data is pretty well-explained by a combination of a "Closer-To-The-Average Effect" (which may or may not be rational — there are multiple possible rational reasons for it) and a "Better-Than-Average Effect" that appears ~uniformly across the board (but gets swamped by the "Closer-To-The-Average Effect" at the upper end).
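
As a sanity check on that story, here's a toy simulation (all parameters invented) where self-assessments are generated only from those two effects plus noise, and the familiar-looking quartile pattern falls out:

```python
# Toy simulation of the two effects described above; every parameter is made up.
import numpy as np

rng = np.random.default_rng(0)
true_percentile = rng.uniform(0, 100, size=10_000)

SHRINK = 0.5         # "Closer-To-The-Average Effect": estimates pulled halfway toward 50
UPWARD_SHIFT = 10.0  # "Better-Than-Average Effect": uniform upward shift, in percentile points
NOISE_SD = 10.0      # idiosyncratic noise in self-assessment

self_estimate = np.clip(
    50 + SHRINK * (true_percentile - 50) + UPWARD_SHIFT
    + rng.normal(0, NOISE_SD, size=true_percentile.shape),
    0, 100,
)

# Average self-estimate by quartile of actual skill: the bottom quartile overestimates a lot,
# the top quartile underestimates a bit, even though neither effect is skill-specific.
for lo in range(0, 100, 25):
    mask = (true_percentile >= lo) & (true_percentile < lo + 25)
    print(f"true percentile {lo:>2}-{lo + 25:<3}: mean self-estimate {self_estimate[mask].mean():5.1f}")
```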

probably research done outside of labs has produced more differential safety progress in total

To be clear — this statement is consistent with companies producing way more safety research than non-companies, if companies also produce proportionally even more capabilities progress than non-companies? (Which I would've thought is the case, though I'm not well-informed. Not sure if "total research outside of labs look competitive with research from labs" is meant to deny this possibility, or if you're only talking about safety research there.)
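
(A toy example of the consistency I have in mind, with entirely made-up numbers: if labs produce 10 units of safety progress alongside 100 units of capabilities progress, while outside researchers produce 5 units of safety alongside 1 unit of capabilities, then labs produce more safety research in absolute terms, but

$$\left.\frac{\text{safety}}{\text{capabilities}}\right|_{\text{labs}} = \frac{10}{100} = 0.1 \quad\text{vs.}\quad \left.\frac{\text{safety}}{\text{capabilities}}\right|_{\text{outside}} = \frac{5}{1} = 5,$$

so the outside work would still be the larger source of differential safety progress.)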
