I operate by Crocker's rules. All LLM output is explicitly designated as such. I have made no self-hiding agreements.
My own results from seven trials testing the effects of Orexin (analysis covers only my data; it's still missing for my two collaborators):
| Variable | Cohen's d | p-value | Orexin (mean ± SD) | Placebo (mean ± SD) | Difference (Orexin − Placebo) |
|---|---|---|---|---|---|
| PVT Mean RT (ms) | -0.292 | 0.463 | 261.5 ± 28.3 (n=14) | 269.5 ± 26.2 (n=14) | -8.0 |
| PVT Median RT (ms) | -0.022 | 0.955 | 247.7 ± 18.1 (n=14) | 248.1 ± 15.9 (n=14) | -0.4 |
| PVT Slowest 10% (ms) | -0.385 | 0.335 | 304.2 ± 72.5 (n=14) | 336.4 ± 93.4 (n=14) | -32.2 |
| PVT False Starts | -0.718 | 0.078 | 0.71 ± 0.70 (n=14) | 1.29 ± 0.88 (n=14) | -0.57 |
| DSST Correct | 0.401 | 0.316 | 71.1 ± 4.8 (n=14) | 67.6 ± 11.7 (n=14) | +3.6 |
| DSST Accuracy | 0.500 | 0.214 | 0.988 ± 0.028 (n=14) | 0.975 ± 0.025 (n=14) | +0.013 |
| Digit Span Forward | 0.000 | 1.000 | 7.36 ± 0.89 (n=14) | 7.36 ± 1.44 (n=14) | +0.00 |
| Digit Span Backward | 0.357 | 0.372 | 6.71 ± 0.80 (n=14) | 6.36 ± 1.17 (n=14) | +0.36 |
| Digit Span Total | 0.217 | 0.585 | 14.1 ± 1.5 (n=14) | 13.7 ± 1.8 (n=14) | +0.4 |
| SSS Rating | 0.431 | 0.282 | 3.50 ± 0.73 (n=14) | 3.14 ± 0.91 (n=14) | +0.36 |
| Sleep Duration (hrs) | -0.273 | 0.645 | 7.67 ± 2.84 (n=7) | 8.24 ± 0.78 (n=7) | -0.57 |
| Sleep Time Asleep (min) | -0.443 | 0.458 | 400 ± 142 (n=7) | 447 ± 48 (n=7) | -47 |
| Sleep Efficiency (%) | -0.290 | 0.625 | 88.7 ± 7.5 (n=7) | 90.4 ± 3.7 (n=7) | -1.7 |
| Sleep Deep (min) | -0.298 | 0.626 | 78.8 ± 27.2 (n=6) | 85.6 ± 16.9 (n=7) | -6.7 |
| Sleep Light (min) | 0.316 | 0.603 | 279 ± 62 (n=6) | 263 ± 31 (n=7) | +16 |
| Sleep REM (min) | -0.426 | 0.485 | 84.5 ± 41.2 (n=6) | 98.1 ± 18.8 (n=7) | -13.6 |
| Sleep Wake (min) | 0.687 | 0.268 | 69.5 ± 43.1 (n=6) | 46.9 ± 17.8 (n=7) | +22.6 |
(Sleep stages are missing for one block because I didn't sleep enough for the Fitbit to start determining sleep stages.)
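For reference, a minimal sketch of how numbers like the Cohen's d and p-value columns could be produced. The table doesn't say whether the comparison was paired or unpaired, so this assumes an unpaired comparison with Welch's t-test and a pooled-SD Cohen's d, and the input data below is simulated rather than my actual measurements:

```python
# Sketch only: assumes an unpaired comparison; the real analysis may differ.
import numpy as np
from scipy import stats

def summarize(orexin, placebo):
    orexin, placebo = np.asarray(orexin, float), np.asarray(placebo, float)
    n1, n2 = len(orexin), len(placebo)
    # pooled standard deviation for Cohen's d
    pooled_sd = np.sqrt(((n1 - 1) * orexin.var(ddof=1) +
                         (n2 - 1) * placebo.var(ddof=1)) / (n1 + n2 - 2))
    d = (orexin.mean() - placebo.mean()) / pooled_sd          # Cohen's d
    _, p = stats.ttest_ind(orexin, placebo, equal_var=False)  # Welch's t-test
    return d, p, orexin.mean() - placebo.mean()

# Example with simulated reaction times (ms), 14 sessions per condition:
rng = np.random.default_rng(0)
print(summarize(rng.normal(261.5, 28.3, 14), rng.normal(269.5, 26.2, 14)))
```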
INTERP RESEARCHER: Like... crystals crystals? Healing crystals? Are you about to tell me about chakras?
Is there a service that automatically matches donation swaps?
Unfortunately not; it seems like the legal situation here is still unresolved, and as long as that's the case, my best guess is that nobody will want to take the legal risk of building such a platform.
At a wild guess, I'd say that if the useful artifact is literally a paragraph or less, and you've gone over it several times, then it could be "ok" as testimony according to me. Like, if the LLM drafted a few sentences, and then you read them and deeply checked "is this really the right way to say this? does this really match my idea / felt sense?", and then you asked for a bunch of rewrites / rewordings, and did this several times, then plausibly that's just good.
Yeah, insofar as I'd endorse publishing LLM text, that'd be the minimum, maybe in addition to adding links.
Code feels similar: I often end up deleting a bunch of LLM-generated code because it's extraneous to my purpose. And code is much more of an issue, because I don't feel like publishing LLM-written text but don't know how to feel about LLM-written code. I guess a warning at the top, telling the reader that they're about to wade into some-level-of-unedited code, is warranted.
Why LLM it up? Just give me the prompt.
Often, to-me-useful artefacts come about in a long conversation with LLMs, after many bits of steering & revisions, so there's a spectrum from me-generated to LLM-generated for any given text.
I ask because I haven't been able to notice any measurable impacts from meditation on any variable in my life (possibly too much information at 1 2 3). I also like meditation and will go on probably 2, plausibly 3 retreats this year, but the lack of measurable impact on my life leaves me skeptical that it's actually a good use of my time.
A key missing ingredient holding back LLM economic impact is that they're just not robust enough.
I disagree with this in this particular context. We are looking at AI companies trying to automate AI R&D via AIs. Most tasks in AI R&D don't require much reliability: I don't know the distribution of outcomes in ML experiments, but I reckon a lot of them are basically failures or have null results, while the distribution of the impact of such experiments has a long tail [1]. Also, ML experiments don't have many irreversible parts; AI R&D researchers aren't like surgeons, where mistakes have huge costs. Any ML experiment can be sandboxed, given a bounded amount of resources, and shut down when it takes up too much. You need high reliability when the cost of failure is necessarily very high, but when running ML experiments that's not the case.
Edit: Claude 4.5 Sonnet gives feedback on my text above and says that the search strategy matters if we're looking at ML engineering. If it's breadth-first & innovations don't require going down a deep tree, then low reliability is fine. But if we need to combine ≥4 innovations in a depth-first search, then reliability matters more.
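As a rough illustration of the depth point (my own toy calculation, not something from Claude's feedback or from any experiment): if a depth-first chain of k innovations each has to work with probability p, the whole chain works with probability p^k, which falls off quickly with depth.

```python
# Toy model: a depth-first chain of k innovations, each working with
# probability p, succeeds end-to-end with probability p**k.
# All numbers here are illustrative, not measured.
for p in (0.9, 0.7, 0.5):
    chain = {k: round(p ** k, 2) for k in (1, 2, 4, 8)}
    print(f"p={p}: {chain}")
```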
I don't think this is a crux for me, but learning that it's a thin-tailed distribution would make me at least think about this problem a bit more. Claude claims hyperparameter tunes have lognormal returns (shifted so that the mean is slightly below baseline). ↩︎
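To make that claimed shape concrete, here's a toy simulation of a shifted lognormal return distribution. The lognormal claim itself is Claude's, not something I've verified, and every parameter below is invented purely for illustration:

```python
import numpy as np

# Toy version of the claimed shape: lognormal returns, shifted so the mean
# ends up slightly below baseline (0). Parameters are made up.
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
returns = raw - raw.mean() - 0.05

print(f"mean:   {returns.mean():+.3f}")               # slightly negative
print(f"median: {np.median(returns):+.3f}")
print(f"p99.9:  {np.quantile(returns, 0.999):+.3f}")  # long right tail
```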
I hadn't seen DirectedEvolution's review, so it was useful evidence for me. The questions Yarrow picked for determining whether LW was early seemed fair on the LW side (what other question than "did people on LW talk about COVID-19 early on" would one even be asking?), if not on the side of governments and the mainstream media. Even though DirectedEvolution's review exists, I remember people claiming LW was early on COVID-19, so Yarrow's independent look into it is still useful. (Again, I know Yarrow is not set up to be fair to LW, but even unfair/biased information sources can be useful if one can model the degree/way in which they're biased.) I think the statement "LW was early on COVID-19" [1] (which I used to believe!) is just wrong? I haven't seen counter-evidence, yet people continue saying it.
I'd say e.g. that Metaculus came out far ahead of basically anyone else, with the first market being published on the 19th of January. (I'm looking at my predictions from the time on Metaculus and PredictionBook, and there was a lot of fast updating based on evidence. I wish more LWers had forecasting track records (like this one) instead of vaguely talking about "Bayes points" [2].)
These kinds of retrospectives are in general not done very often, and follow-ups/investigations of common claims are rare and kinda annoying to do, so I want to signal-boost empirical posts like the one I linked.
I think at minimum you have to credit the trading gains many LessWrongers made at the time, and MicroCovid.
Yeah, Yarrow is not trying to be fair. MicroCovid was cool, as was VaccinateCA. Maybe I'll look into how much LWers gained from trading at the time, though this suffers from obvious selection bias. (I, for example, made no gains at the time, because I didn't see the signal to start an investment account early on.)
There were other misses from the LW side like this Zvi roundup.
Complicatedly, I think LW settled more firmly on "this is a problem" while there was a bunch of political positioning in March/early April, but my understanding is that (inter-)governmental health organizations were earlier, even if their governments didn't listen to them. My impression is that the general population wasn't very worried until basically a week before lockdowns, and the polls Yarrow uses to refute this aren't very convincing to me because I just don't trust polls in general: it's very cheap to say "I'm worried", but expensive to do anything about it. ↩︎
<microrant>I'm baffled that this has gotten into parlance; it seems like a purely social fiction??? Another way of assigning status with almost no grounding in any kind of measurement?</microrant> ↩︎
I mean, if one thinks of oneself as a system with simple inputs (water, basic nutrients, light, air) & outputs, then trying to list the relevant inputs and intervening on them makes sense? And in my mind, wondering "what's the optimal level/type of light" bottoms out at "an early spring/late summer day".
(I'm (very slowly) running an RCT on the effects of lumenators (38/50 datapoints collected), probably to be posted EOY. In the meantime there's Sandkühler et al. testing people with SAD & 100k lumens, finding broadly positive results.)