ryan_greenblatt — LessWrong

I'm the chief scientist at Redwood Research.

Not a crux for me ~at all. Some upstream views that make me think "AI takeover but humans stay alive" is more likely and also make me think avoiding AI takeover is relatively easier might be a crux.

I expect a roughly 5.5 month doubling time in the next year or two, but somewhat lower seems pretty likely. The proposed timeline I gave consistent with Anthropic's predictions requires <1 month doubling times (and this is prior to >2x AI R&D acceleration, at least given my view of what you get at that level of capability).
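To make the gap between those two doubling times concrete, here is a minimal sketch of the arithmetic. The doubled quantity is left abstract, and the function names and the 12-month window are illustrative assumptions, not anything from the comment.

```python
import math


def growth_factor(months: float, doubling_time_months: float) -> float:
    """Multiplicative growth after `months` at a fixed doubling time."""
    return 2.0 ** (months / doubling_time_months)


def implied_doubling_time(months: float, factor: float) -> float:
    """Doubling time (in months) implied by growing `factor`x over `months`."""
    return months * math.log(2) / math.log(factor)


# Assumed 12-month window, chosen only for illustration.
slow = growth_factor(12, 5.5)   # roughly 4.5x over a year
fast = growth_factor(12, 1.0)   # 2**12 = 4096x over a year
print(f"5.5-month doubling: {slow:.1f}x per year")
print(f"1-month doubling:   {fast:.0f}x per year")
```

Under these assumptions, a 1-month doubling time compounds to roughly three orders of magnitude more growth per year than a 5.5-month one.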

I'd guess SWE-bench Verified has an error rate of around 5% or 10%. They didn't have humans baseline the tasks; they just looked at them to see if they seemed possible.

Wouldn't you expect things to look logistic substantially before full saturation?

Wouldn't you expect this if we're close to saturating SWE-bench (and some of the tasks are impossible)? Like, you eventually cap out at the max performance for SWE-bench, and this doesn't correspond to an infinite time horizon on literally SWE-bench (you'd need to include more, longer tasks).
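A toy model of the capping effect, as a rough sketch only: success is assumed to be logistic in log task length, with 50% success at the time horizon, and a fixed fraction of tasks is assumed to be unsolvable. The task lengths, the slope, and the 7% broken fraction are all hypothetical.

```python
import math


def p_success(task_minutes: float, horizon_minutes: float, slope: float = 1.0) -> float:
    """Assumed logistic in log2(task length); 50% success at the horizon."""
    x = slope * (math.log2(horizon_minutes) - math.log2(task_minutes))
    return 1.0 / (1.0 + math.exp(-x))


def benchmark_score(task_lengths, horizon_minutes: float, broken_fraction: float) -> float:
    """Mean score when `broken_fraction` of tasks fail regardless of ability."""
    solvable = sum(p_success(t, horizon_minutes) for t in task_lengths) / len(task_lengths)
    return (1.0 - broken_fraction) * solvable


tasks = [5, 15, 30, 60, 120]  # hypothetical task lengths in minutes
for horizon in [30, 240, 1920, 15360, 10**9]:
    # The score flattens toward 1 - broken_fraction as the horizon grows.
    print(horizon, round(benchmark_score(tasks, horizon, 0.07), 3))
```

In this sketch the score never exceeds 1 minus the broken fraction, so a measured score near that ceiling says little about how far the horizon extends past the longest task in the set.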

I agree probably more work should go into this space. I think it is substantially less tractable than reducing takeover risk in aggregate, but much more neglected right now. I think work in this space has the capacity to be much more zero sum (among existing actors, avoiding AI takeover is zero sum with respect to the relevant AIs) and thus can be dodgier.

Seems relevant post AGI/ASI (human labor is totally obsolete and AIs have massively increased energy output), maybe around the same point as when you're starting to build stuff like Dyson swarms or other massive space-based projects. But yeah, IMO probably irrelevant in the current regime (for the next >30 years without AGI/ASI), and current human work in this direction probably doesn't transfer.

I think the case in favor of space-based datacenters is that energy efficiency of space-based solar looks better: you can have perfect sun 100% of the time and you don't have an atmosphere in the way. But, this probably isn't a big enough factor to matter in realistic regimes without insane amounts of automation etc.

In addition to hitting higher energy from a given area, you also can get the same energy 100% of the time (without issues with night or clouds). But yeah, I agree, and I don't see how you get 50x efficiency even if transport to space (and assembly/maintenance in space) were free.
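A back-of-the-envelope sketch of the per-area energy comparison. The solar constant (about 1361 W/m² above the atmosphere) and the roughly 1000 W/m² clear-sky peak at the surface are standard figures; the 20% ground capacity factor and the 100% space duty cycle are assumptions picked for illustration, and panel efficiency is assumed equal in both cases.

```python
SPACE_IRRADIANCE_W_M2 = 1361.0        # solar constant above the atmosphere
GROUND_PEAK_IRRADIANCE_W_M2 = 1000.0  # approximate clear-sky peak at the surface


def average_power_ratio(ground_capacity_factor: float, space_duty_cycle: float = 1.0) -> float:
    """Average power per unit panel area in space divided by that on the ground."""
    space = SPACE_IRRADIANCE_W_M2 * space_duty_cycle
    ground = GROUND_PEAK_IRRADIANCE_W_M2 * ground_capacity_factor
    return space / ground


# Assumed 20% ground capacity factor (night, clouds, sun angle).
ratio = average_power_ratio(ground_capacity_factor=0.20)
print(f"space / ground average power per area: {ratio:.1f}x")

# Ground capacity factor that would be needed for a 50x advantage.
needed = SPACE_IRRADIANCE_W_M2 / (GROUND_PEAK_IRRADIANCE_W_M2 * 50.0)
print(f"capacity factor implying 50x: {needed:.3f}")
```

With these assumed inputs the advantage comes out in the single digits, and a 50x gap would require a ground capacity factor of under 3%.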

If your theory of change is convincing Anthropic employees or prospective Anthropic employees they should do something else, I think your current approach isn't going to work. I think you'd probably need to much more seriously engage with people who think that Anthropic is net-positive and argue against their perspective.

Possibly, you should just try to have less of a thesis and just document bad things you think Anthropic has done and ways that Anthropic/Anthropic leadership has misled employees (to appease them). This might make your output more useful in practice.

I think it's relatively common for people I encounter to think both:

Anthropic leadership is engaged in somewhat scummy appeasement of safety-motivated employees in ways that are misleading or based on kinda obviously motivated reasoning. (Which results in safety-motivated employees having a misleading picture of what the organization is doing and why and what people expect to happen.)
Anthropic is strongly net positive despite this and working on capabilities there is among the best things you can do.

An underlying part of this view is typically that moderate improvements in effort spent on prosaic safety measures substantially reduce risk. I think you probably strongly disagree with this and this might be a major crux.

Personally, I agree with what Zach said. I think working on capabilities^[1] at Anthropic is probably somewhat net positive but would only be the best thing to work on if you had a very strong comparative advantage relative to all the other useful stuff (e.g. safety research). So probably most altruistic people with views similar to mine should do something else. I currently don't feel very confident that capabilities at Anthropic is net positive and could imagine swinging towards thinking it is net negative based on additional evidence.

[1] Putting aside strongly differential specific capabilities work.

This seems non-crazy, but my current guess would be that the training for 4.5 (relative to 4) made both verbalized and unverbalized eval awareness more likely. E.g., I don't have much confidence that 4.5 isn't aware it is in an eval even if it doesn't mention this in the CoT (especially if it sometimes mentions this for pretty similar inputs or sometimes mentions this for this input). Like, it's good that we know eval awareness is an issue in 4.5, but it's hard to be confident you've avoided this issue in some eval by looking for verbalized eval awareness!

(If we somehow became confident, with some experiments/science, that 4.5 is eval aware if it verbalizes, that could change my perspective.)

If you aggressively optimize for finding training environments that train against misalignment and which don't increase verbalized eval awareness, it does seem plausible you find environments that make unverbalized eval awareness more likely. I'd propose also not doing this and instead trying to roughly understand what makes eval awareness increase and avoid this in general.

Also, can models now be prompted to trick probes? (My understanding is this doesn't work for relatively small open source models, but maybe SOTA models can now do this?)

New Report: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence