My perception is that Trump 2 is on track to be far worse (e.g. in terms of eroding democratic practices and institutions) than Trump 1. My vague understanding is that a main driver of this difference is how many people gave up on the "working with bad guys to make them less bad" plan--though probably this was not directly because they changed their view on such reasoning.
Should this update us on the working for net-negative AGI companies case?
(Thanks for expanding! Will return to write a proper response to this tomorrow)
I've never been compelled by talk about continual learning, but I do like thinking in terms of time horizons. One notion of singularity that we can think of in this context is
Escape velocity: The point at which models improve at a rate dh/dt > 1, i.e. they gain more than one unit of horizon h per unit of wall-clock time t.
Then, by modeling some ability to regenerate or continuously deploy improved models, you can predict when this point arrives. I'm very surprised I haven't seen this mentioned before--has someone written about it? The closest thing that comes to mind is T Davidson's ASARA SIE.
Of course, the METR methodology is likely to break down well before this point, so the threshold is not empirically very useful. But I like this framing! Conceptually, there will be some point where models can robustly take over R&D work--including training and the invention of new infrastructure (skills, MCPs, caching protocols, etc.). If they also know when to revisit their previous work, they can then work productively over arbitrary time horizons. Escape velocity is a nice concept to have in our toolbox for thinking about R&D automation, and a weaker notion than Davidson's SIE.
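To make the escape-velocity point concrete, here's a minimal sketch. All parameters are illustrative assumptions of mine (a 1-hour current horizon, a ~7-month doubling time in the METR style), not measured values; under exponential horizon growth, dh/dt = 1 has a closed-form solution.

```python
import math

# Illustrative assumptions (not measured values):
H0 = 1.0                     # current task horizon, in hours
DOUBLING_TIME = 7 * 30 * 24  # horizon doubling time: ~7 months, in hours

def horizon(t):
    """Task horizon h(t) = H0 * 2^(t / T), in hours, t hours from now."""
    return H0 * 2 ** (t / DOUBLING_TIME)

def escape_time():
    """Solve dh/dt = 1 (one hour of horizon gained per wall-clock hour).

    dh/dt = H0 * (ln 2 / T) * 2^(t/T), so setting dh/dt = 1 gives
    t* = T * log2(T / (H0 * ln 2)).
    """
    T = DOUBLING_TIME
    return T * math.log2(T / (H0 * math.log(2)))

t_star = escape_time()
print(f"Escape velocity reached after ~{t_star / (365 * 24):.1f} years")
```

The qualitative takeaway is that under a fixed doubling time, escape velocity arrives at a definite finite date, and it is quite sensitive to the doubling time you assume.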
It can both be the case that "a world in which the U.S. government took von Neumann's advice would likely be a much darker, bleaker, more violent one" and that JvN was correct ex ante. In particular, I find it plausible that we're living in quite a lucky timeline--one in which the Cuban missile crisis and other coinflips landed in our favor.
UK AISI is a government agency, so the pie chart is probably misleading on that segment!
I like the concept; on the other hand the flag feels strongly fascist to me.
Ran it by the AIs and 2 out of 3 had "authoritarian" as their first descriptor responding to "What political alignment does the aesthetic of this flag evoke?" FWIW.
From a skim, it seems you should be using the 6.25x value rather than the 2.5x in B2 of your sheet. If I'm reading it correctly, 6.25x is the estimate for replacing a hypothetical all-median lab with a hypothetical all-top-researcher lab, which is what happens when you improve your ASARA model. Whereas 2.5x is the estimate for replacing the actual lab with an all-top lab.
This still gives a value lower than 4x, but I think if you plug in reasonable log-normals, 4x will fall within your 90% CI, so it seems fine.
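As a quick sanity check on the CI claim: take a log-normal with median 2.5x and a sigma of 0.4 in log space (both my illustrative choices, not values from the sheet; sigma = 0.4 corresponds to roughly a 1.5x one-sigma factor). The two-sided 90% interval then comfortably contains 4x.

```python
import math

# Hypothetical parameters, chosen for illustration:
MEDIAN = 2.5  # median multiplier estimate
SIGMA = 0.4   # std dev of log(multiplier)
Z90 = 1.6449  # two-sided 90% interval is mu +/- 1.645 sigma

mu = math.log(MEDIAN)
lo = math.exp(mu - Z90 * SIGMA)
hi = math.exp(mu + Z90 * SIGMA)

print(f"90% CI for the multiplier: [{lo:.2f}x, {hi:.2f}x]")
print("4x inside CI:", lo <= 4.0 <= hi)
```

Any sigma above about 0.29 suffices, since that makes exp(mu + 1.645 sigma) exceed 4; the interval is symmetric around the median on the log scale.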
I don't think this should be a substantial update about whether obfuscated arguments are a problem for recursive decomposition approaches to scalable oversight.
Can you elaborate on why you think this? Perhaps you were already aware that obfuscated arguments should only pose a problem for completeness (in which case, for the majority of obfuscated-arguments-aware people, I'd imagine the new work should be a significant update). Or perhaps your concern is that in the soundness case the refutation is actually inefficient, and the existence of small Bob circuits demonstrated in this paper is too weak to be practically relevant?
Regarding (1.2), I also see this as a priority to study. Basically, we already have a strong prior from interacting with current models that LMs will continue to generalise usably from verifiable domains to looser questions. So the practical relevance of debate hinges on how expensive it is to train with debate to increase the reliability of this generalisation. I have in mind here questions like "What experiment should I run next to test this debate protocol?"
One dataset idea to assess how often stable arguments can be found is to curate 'proof advice' problems. These problems are proxies for research advice in general. Basically:
Question: "Does this textbook use theorem Y to prove theorem X?", where the debaters see the proof but the judge does not.
(Note that this is not about proof correctness: a dishonest debater can offer up alternative correct proofs that use non-standard techniques.) This setup forces arguments based on heuristics and "research taste"--what makes a proof natural or clear. It's unclear what independent evidence would look like in this domain.
Here's another way of thinking about this that should be complementary to my other comment:
Let's assume our safety property we want to ensure is
Low correlation over side objectives. E.g. in a sequential setting (cf. my other comment, #2), Alice must have a hard time picking vulnerabilities that all serve the same side objective.
The overlooked point I want to make here is that {S} << {E}. In this case, there are two possibilities:
Thought-provoking, thanks! First off, whether or not humanity (or current humans, or some distribution over humans) has already achieved escape velocity does not directly undercut escape velocity as a useful concept for AI takeoff. Certainly, I'd say the collective of humans has achieved some leap to universality in a sense that the bag of current near-SotA LLMs has not! And in particular, it's perfectly reasonable to take humans as our reference to define the unit dh/dt point for AI improvement.
Ok, now on to the hard part (speculating somewhat beyond the scope of my original post). Is there a nice notion of time horizon that generalizes METR's and lets us say something about when humanity has achieved escape velocity? I can think of two versions.
The easy way out is to punt to some stronger reference class of beings to get our time horizons, and measure humanity's dh/dt against this baseline. The human team is then implicitly time-limited by the stronger beings' bound, and we count false positives against the humans even if humanity could eventually self-correct.
Another idea is to notice that there's some class of intractable problems on which current human progress looks like either (1) random search or (2) entirely indirect, instrumental progress--e.g. self-improvement, general tool-building, etc. In these cases, there may be a sense in which we're exponentially slower than a task-capable agent, and should be considered incapable of completing such tasks. I imagine some Millennium problems, astronomical-scale engineering, etc. would be reasonably considered beyond us on this view.
Overall, I'm not particularly happy with these generalizations. But I still like having 'escape velocity for AI auto-R&D' as a threshold!