I agree AI minds might be very different, and best described with different measures. But I think we currently have little clue what those differences are, and so for now humans remain the main source of evidence we have about agents. Certainly human-applicability isn't a necessary condition for measures of AI agency; it just seems useful as a sanity check to me, given the context that nearly all of our evidence about (non-trivial) agents so far comes from humans.
Sorry, looking again at the messiness factors, fewer are about brute force than I remembered; will edit.
But they do indeed all strike me as quite narrow external validity checks, given that the validity in question is whether the benchmark predicts when AI will gain world-transforming capabilities.
“messiness” factors—factors that we expect to (1) be representative of how real-world tasks may systematically differ from our tasks
I felt very confused reading this claim in the paper. Why do you think they are representative? It seems to me that real-world problems obviously differ systematically in ways these factors don't capture, too—e.g., solving them often requires having novel thoughts.
I think there is more empirical evidence of robust scaling laws than of robust horizon length trends, but broadly I agree—I think it's also quite unclear how scaling laws should constrain our expectations about timelines.
(Not sure I understand what you mean about the statistical analyses, but fwiw they focused only on very narrow checks for external validity—mostly just on whether solutions were possible to brute force.)
I agree it seems plausible that AI could accelerate progress by freeing up researcher time, but I think the case for horizon length predicting AI timelines is even weaker in such worlds. Overall I expect the benchmark would still mostly have the same problems—e.g., that the difficulty of tasks (even simple ones) is poorly described as a function of time cost; that benchmarkable proxies differ critically from their non-benchmarkable targets; that labs probably often use these benchmarks as explicit training targets, etc.—but also the additional (imo major) source of uncertainty about how much freeing up researcher time would accelerate progress.
Fwiw, in my experience LLMs lie far more than early Wikipedia or any human I know, and in subtler and harder-to-detect ways. My spot checks for accuracy have been so dismal/alarming that at this point I basically only use them as search engines to find things humans have said.
I'm really excited to hear this, and wish you luck :)
My thinking benefited a lot from hanging around CFAR workshops, so for whatever it's worth I do recommend attending them; my guess is that most people who like reading LessWrong but haven't tried attending a workshop would come away glad they did.
I'd guess the items linked in the previous comment will suffice? Just buy one mask, two adapters, and two filters, and screw them together.
I live next to a liberally polluting oil refinery, so I've looked into this a decent amount, and unfortunately there do not exist reasonably priced portable sensors for many (I'd guess the large majority) of toxic gases. I haven't looked into airplane fumes in particular, but the paper described in the WSJ article lists ~130 gases of concern, and I expect detecting most such things at relevant thresholds would require large infrared spectroscopy installations or similar.
(I'd also guess that in most cases we don't actually know the relevant thresholds of concern, beyond those which cause extremely obvious/severe acute effects; for gases I've researched, the literature on sub-lethal toxicity is depressingly scant, I think partly because many gases are hard/expensive to measure, and also because you can't easily run ethical RCTs on their effects.)
I had the same thought, but hesitated to recommend it because I've worn a gas mask before on flights (when visiting my immunocompromised Mom), and many people around me seemed scared by it.
By my lights, half-face respirators look much less scary than full gas masks for some reason, but they generally have a different type of filter connection ("bayonet") than the NATO-standard 40mm connection for gas cartridges. It looks like there are adapters, though, so perhaps one could make a less scary version this way? (E.g., using a mask like this with filters like these.)
They typically explain where the room is located right after giving you the number, which is almost like making a memory palace entry for you. Perhaps the memory is more robust when it includes a location along with the number?