It seems especially risky that Claude has trouble ignoring noise, and if you don't understand the analysis you might not know to ignore it. I get this all the time, where Claude wants to write a paper about how our brilliant technique reduces training loss by 0.001. I would call it p-hacking, but Claude usually notices this if it bothers to calculate p-values.
Wtf does that even mean? Eh, could be interesting to see the result. Enter
A peculiar side effect of model intelligence in discovery-based research is that it’s possible to run every statistical/quantitative analysis under the sun, burn millions of tokens, and still gain little intuition about how to make headway on your problem.[1]
Pre-AI, constructing a pipeline to run any meaningful quantitative analysis incurred a real cost. Indeed, it forced one to consider whether the analysis made any sense at all. People who had the necessary skills to execute were often thoughtful about their analysis by necessity. You couldn’t just apply Mendelian Randomisation if you didn't know of its existence; effort was required both to articulate your question well enough to search for the right tools and to de-risk by considering whether a tool fit your question.[2] This forced one to earn some intuition for how the tool worked and what to expect. The pain that such friction inflicted is now alleviated by the CLI agentic coder of your choice. Some would argue that this frees humans to spar with an intellectual partner at a new layer of abstraction rather than concern themselves with the specifics of code. I would argue that a productive spar requires both parties to hold opinions they are willing to defend, and on a day-to-day basis this is not what I see.
Instead, the trajectory of decisions feels almost pre-ordained, primarily driven by three effects. The first is a veneer of problem-specific competence as models string shibboleths together. Shibboleths often serve as a heuristic for expert opinion, e.g. "septic patient with febrile tachycardia" versus "high temperature and heart rate". Deferring to expert opinion (or what masquerades as one) feels only natural. The second is an illusion of control, as each recommended option is contrasted against straw-man alternatives. The third is sycophancy (of a kind less obvious than verbal submissiveness): an insidious bias towards writing tests that are likely to produce favourable empirical results or, worse, interpreting them in such a way as to support the framing you seem to desire.[3]
Great news, 90/126 nominal p-values are significant. Would you like me to proceed with the write-up? [4]
Holding on to one’s critical eye is like cupping sand when the path of least resistance is seductive and one *Enter* away.
Perhaps most concerning is that after several rounds of this analytical cosplay, one is almost entirely robbed of the ability to form an opinion. As though one were waking from a dissociative fugue,[5] the question “how happy or sad or confused do I feel about these results?” becomes entirely intractable.[6] Congratulations, it appears you have Outrun Your Headlights. Outrunning your headlights refers to driving fast enough that your stopping distance exceeds the distance your lights illuminate. In engineering, it often refers to no longer knowing what is going on. Good engineering practices, e.g. test-driven development and architecture decision records (now often formalised as Skills), protect the artifact being produced, be it the codebase or the product. Unfortunately, in discovery work, the artifact is NOT the analytical pipeline but the justified set of beliefs you collect over time, for which no test suite exists. Indeed, there appears to be no comparable angst about protecting the integrity of one's beliefs and intuition, even though a single dubious analysis could poison this well.
Furthermore, our odometer for progress has not recalibrated. That is to say, running analyses still provides the rewarding sensation of making headway, with none of the earned intuition necessary to ask nuanced follow-up questions or to call out approaches that are suspect. This is especially dangerous given that models inject assumptions in subtle ways when motivated to produce a positive result. One concrete example I have noticed in my own bioinformatics work is that language models tend to recall baked-in knowledge to facilitate discovery analysis, e.g. classifying cells in single-cell RNA sequencing by manually printing a list of cell-specific gene markers to match against rather than using a canonical approach with a reference atlas. Anthropic's BioMysteryBench proudly touts this as a model capability (which, I guess, would be savant-like were it to occur in a human), but I would argue it is also unwanted behaviour in the context of autonomous research. At the same time, it would be foolish to starve yourself of this intelligence substrate for the puritan ideal of understanding everything.
In sum, better solutions must be built to protect ‘belief quality’ but, in the meantime, I simply suggest we form a strong opinion with an anticipation of being proven wrong before letting the slop cannon rip. Introduce a little friction, slow down a little, and don’t Outrun Your Headlights.
Or build high beams.
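One way to introduce the friction suggested above is to make the analysis refuse to run until you have written down what you expect. The following is only a minimal sketch of that idea, not an established tool: `Prediction`, `run_with_prediction`, and the field names are hypothetical, and the "analysis" is a stand-in callable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """A belief written down BEFORE the analysis runs (hypothetical structure)."""
    question: str           # what you are actually asking
    expected: str           # what you think the answer will be, and roughly why
    would_surprise_me: str  # the result that would make you change your mind


def run_with_prediction(prediction, analysis):
    """Refuse to run `analysis` until every field of the prediction is filled in."""
    for name in ("question", "expected", "would_surprise_me"):
        if not getattr(prediction, name).strip():
            raise ValueError(f"Fill in '{name}' before running the analysis.")
    result = analysis()
    # Return the prediction alongside the result so the two are read together.
    return {"prediction": prediction, "result": result}


# Usage with a placeholder analysis (assumption: any zero-argument callable).
committed = Prediction(
    question="Does the treatment group differ from control?",
    expected="A small difference, because the groups were already similar.",
    would_surprise_me="A large effect in the opposite direction.",
)
outcome = run_with_prediction(committed, lambda: "placeholder result")
print(outcome["prediction"].expected)
print(outcome["result"])
```

The point of the design is not the code itself but the ordering it enforces: the opinion exists, in writing, before the result does, so the result has something to contradict.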
[1] I want to carefully define what I mean here by intuition in the context of discovery work. I refer to the working map of the territory you carry in your head about a problem, refined by stress-testing it against counterfactuals and alternate hypotheses. It allows you to curate targeted questions that meaningfully update your priors.
[2] Yes, one can argue that there was no shortage of statistical abuse and of practitioners not respecting the assumptions of their instruments. However, I see this as a distinct problem from language models suggesting approaches which might be completely foreign to the user; approaches for which said user necessarily has no intuition to judge any result by.
[3] The average LessWrong participant is less likely to be a victim of these effects, but I describe what I believe to be pervasive in the broader research community.
[4] Yes, I am aware that in some analyses, such as differential expression, where the number of concurrent tests is high, nominal p-values can still be a useful signal of directionality. The problem is that models err towards optimism and misuse of such 'loopholes'.
[5] I use this pejoratively; there is a formal psychiatric condition specified in the DSM-5 describing sudden awakening followed by distress and retrograde amnesia for one's own identity.
[6] I will set aside the catastrophic alternative where one is happy to take a result at face value and isn’t conscious enough to even recognise that not understanding the analyses which led to the result is not okay.
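To make the note on nominal p-values concrete, here is a minimal sketch of why a count of "nominally significant" tests needs correcting before it is celebrated. It simulates 126 tests in which nothing is going on (p-values drawn uniformly at random, an assumption made purely for illustration) and compares the nominal count with the count surviving a Benjamini-Hochberg correction. The helper `benjamini_hochberg` is a hand-written illustration, not a library function; in practice one would likely reach for an established implementation.

```python
import random


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the BH procedure rejects the null."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject the hypotheses with the max_k smallest p-values.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Simulated data (not real results): 126 tests under a global null.
rng = random.Random(0)
null_p = [rng.random() for _ in range(126)]

nominal_hits = sum(p <= 0.05 for p in null_p)
bh_hits = sum(benjamini_hochberg(null_p, alpha=0.05))

print(f"nominally significant: {nominal_hits} of {len(null_p)}")
print(f"significant after BH:  {bh_hits} of {len(null_p)}")
```

Under a global null, roughly 5% of tests are expected to cross the nominal threshold by chance alone, and the corrected count can never exceed the nominal one. Neither number says anything about whether the analysis was the right one to run in the first place.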