I think there is more empirical evidence of robust scaling laws than of robust horizon length trends, but broadly I agree—I think it's also quite unclear how scaling laws should constrain our expectations about timelines.
(Not sure I understand what you mean about the statistical analyses, but fwiw they focused only on very narrow checks for external validity—mostly just on whether solutions were possible to brute force).
I agree it seems plausible that AI could accelerate progress by freeing up researcher time, but I think the case for horizon length predicting AI timelines is even weaker in such worlds. Overall I expect the benchmark would still mostly have the same problems—e.g., that the difficulty of tasks (even simple ones) is poorly described as a function of time cost; that benchmarkable proxies differ critically from their non-benchmarkable targets; that labs probably often use these benchmarks as explicit training targets, etc.—but would also face the additional (imo major) source of uncertainty about how much freeing up researcher time would actually accelerate progress.
Fwiw, in my experience LLMs lie far more than early Wikipedia or any human I know, and in subtler and harder to detect ways. My spot checks for accuracy have been so dismal/alarming that at this point I basically only use them as search engines to find things humans have said.
I'm really excited to hear this, and wish you luck :)
My thinking benefited a lot from hanging around CFAR workshops, so for whatever it's worth I do recommend attending them; my guess is that most people who like reading LessWrong but haven't tried attending a workshop would come away glad they did.
I'd guess the items linked in the previous comment will suffice? Just buy one mask, two adapters and two filters and screw them together.
I live next to a liberally polluting oil refinery so have looked into this a decent amount, and unfortunately there do not exist reasonably priced portable sensors for many (I'd guess the large majority) of toxic gases. I haven't looked into airplane fumes in particular, but the paper described in the WSJ article lists ~130 gases of concern, and I expect detecting most such things at relevant thresholds would require large infrared spectroscopy installations or similar.
(I'd also guess that in most cases we don't actually know the relevant thresholds of concern, beyond those which cause extremely obvious/severe acute effects; for gases I've researched, the literature on sub-lethal toxicity is depressingly scant, I think partly because many gases are hard/expensive to measure, and also because you can't easily run ethical RCTs on their effects.)
I had the same thought, but hesitated to recommend it because I've worn a gas mask before on flights (when visiting my immunocompromised Mom), and many people around me seemed scared by it.
By my lights, half-face respirators look much less scary than full gas masks for some reason, but they generally have a different type of filter connection ("bayonet") than the NATO-standard 40mm connection for gas cartridges. It looks like there are adapters, though, so perhaps one could make a less scary version this way? (E.g. to use a mask like this with filters like these).
I think the threshold of brainpower where you can start making meaningful progress on the technical problem of AGI alignment is significantly higher than the threshold where you can start making meaningful progress toward AGI.
This is also my guess, but I think required intelligence thresholds (for the individual scientists/inventors involved) are only weak evidence about relative problem difficulty (for society, which seems to me the relevant sort of "difficulty" here).
I'd guess the work of Newton, Maxwell, and Shannon required a higher intelligence threshold-for-making-progress than was required to help invent decent steam engines or rockets, for example, but it nonetheless seems to me that the latter were meaningfully "harder" for society to invent. (Most obviously in the sense that their invention took more person-hours, but I suspect they similarly required more experience of frustration, more taking on of personal risk, and other such things which tend to make given populations less likely to solve problems in given calendar-years).
So, I think takeoff has begun, but it's under quite different conditions than people used to model.
I don't think they are quite different. Christiano's argument was largely about the societal impact, i.e. that transformative AI would arrive in an already-pretty-transformed world:
I believe that before we have incredibly powerful AI, we will have AI which is merely very powerful. This won’t be enough to create 100% GDP growth, but it will be enough to lead to (say) 50% GDP growth. I think the likely gap between these events is years rather than months or decades.
In particular, this means that incredibly powerful AI will emerge in a world where crazy stuff is already happening (and probably everyone is already freaking out). If true, I think it’s an important fact about the strategic situation.
I claim the world is clearly not yet pretty-transformed, in this sense. So insofar as you think takeoff has already begun, or expect short (e.g. AI 2027-ish) timelines—I personally expect neither, to be clear—I do think this takeoff is centrally of the sort Christiano would call "fast."
Sorry, looking again at the messiness factors, I see that fewer of them are about brute force than I remembered; will edit.
But they do indeed all strike me as quite narrow external validity checks, given that the validity in question is whether the benchmark predicts when AI will gain world-transforming capabilities.
I felt very confused reading this claim in the paper. Why do you think they are representative? It seems to me that real-world problems obviously differ systematically in ways these factors don't capture, too—e.g., solving them often requires having novel thoughts.