Daniel V — LessWrong

My guess is the overall Amtrak number is the outlier, including lots of miles of open land. Perhaps relevant for assessing safety/mile when you'd otherwise drive or fly it, but not as relevant for assessing safety/risk in urban areas.

Brightline is Actually Pretty Dangerous

Daniel V1mo90

Here's some data (Miami Herald). It seems like a bit of a perfect storm of being in a denser and more night-active area, having more crossings, lacking fencing, not sounding horns (thanks locals!), etc. It's easy to blame victims since the tracks don't move, but it's clear that design factors contribute to safety, and Brightline has leaned into just letting the deaths happen (including inaccurately characterizing them).

Banning Said Achmiz (and broader thoughts on moderation)

Daniel V3mo43

I agree, the meta-point of selection bias is valid but the direction of bias is unclear.

The Purpose of a System is what it Rewards

Daniel V4mo2215

I agree this is better (the system rewards only a subset of what it does), but it is still overgeneralizing. Systems could have a different purpose but fallible reward structure (also Goodharting). You could be analyzing the purpose-reward link at the wrong level: political parties want change, so they seek power, so they need donations. This makes it look like the purpose is to just get donations because of rewards to bundlers, but it ignores the rewards at other levels and confuses local rewards with global purpose. Just as a system does a lot of things, so it rewards a lot of things.

Shallow Water is Dangerous Too

Daniel V4mo10

That's true and a very important point I wish I had included. I assumed consciousness and some unstated degree of able-bodiedness. A good hit to the head on the way in and/or certain physical limitations, and mere inches of depth will be the determinant.

Monthly Roundup #32: July 2025

Daniel V4mo10

using urban firewood prices for what was mostly rural consumption.
Bernard Stanford: If you value all informal economy firewood production at market price, and then compare it to extant GDP estimates, you need to make sure ALL informal economic production is similarly valued, or you’ll massively overestimate firewood’s share of GDP. Seems to be what happened!
The approach seems to have a serious flaw in assuming that THIS sector of the informal economy was underestimated, but surely not any OTHER sector.

There are two different problems being raised here.

The outside quote is Zvi pointing out the issue of what price is used to multiply by quantity output to get value estimates. This is a good point because we want to multiply by a representative price, not a too-high price.

The inside quote is a related but distinct issue, which I disagree with. Bernard is arguing that if you are going to include an informal element in GDP, you have to include all informal elements. Sure, it's the ideal, but 1) that's not what's going on here, and 2) even if it were, I'd be fine with it. First, firewood is already semi-included in GDP, this process is just attempting to get a better estimate. Imagine BLS sends people out to grocery stores to price milk and estimate milk consumption, but they only end up getting urban prices and only end up observing 1/3 of the milk consumed in the US. It'd actually be great to add back in the missing consumption. How do you price it? I guess with the only price you have (and Zvi is right that it's a problematic price to use). Second, imagine BLS sends people out to start to price household dishwashing and estimate household dishwashing service provision, but they don't bother to collect household vacuuming. It'd actually still be great to do that, though it'd be difficult to pull off.

So, I think what the authors did is actually a nice exercise, but the price issue means 28% could be an overestimate. How bad of one? The paper discusses Figure A11, recreating Barger's (1955) estimates (though it only goes back to 1869), so what was the GDP share using that one? In the 1870s, the NBER paper in question reports a share of about 8% (on $500M of firewood). Barger reports about $350M of firewood (because of the lower price index), which would imply a share of about 5.4%. If the 28% was a 50% overestimate, that would mean the "true" share in the 1830s (if you assume Barger's price index is "true") could be more like 18.7%. That's still quite high and far from "half an order of magnitude" overestimate, more like 1/5 of one. Is it absurd? Figure 6 here [pdf via academia.edu] has British and Swedish shares at 20% and 40%, respectively.

Monthly Roundup #32: July 2025

Daniel V4mo42

I too have grown increasingly skeptical that meta-analysis in its typical form does anything all that useful.

Unfortunately, people can be bad at understanding meta-analyses. If you have studies that disagree like 50/50, it's not necessarily true that half did something wrong. It's possible there is a legitimate hidden moderator that changes the effect of the variable (probably being revealed by the meta-analysis but not picked up sufficiently by popular reporting). Or even revealing that half have a fatal flaw would be a contribution of the meta-analysis! Sometimes the effects are not totally comparable, in which case that should either be modeled/adjusted, or excluded (probably already considered by the meta-analyst [though the salient counter-examples where the researcher(s) screw up confirm that not all papers are perfect, though this is not unique to meta-analyses]). It is indeed problematic when a meta-analysis concludes the average effect is the effect, particularly with a bimodal distribution of effect sizes (would be crazy to conclude that in that case!).

The alternative "look at the studies individually" suffers from all the same issues: garbage in, garbage out; hidden moderators; non-comparability; over-concluding from an average. At least meta-analysis brings some systematic thinking to the evaluation of the literature. A strong meta-analysis interacts with these issues and hopefully avoids the pitfalls because it does what a weak meta-analysis does not - it explains inclusion criteria, considers variability, and explains differences rather than just reporting a mean.

Shallow Water is Dangerous Too

Daniel V4mo72

Yes, that's what I meant, thanks.

Shallow Water is Dangerous Too

Daniel V4mo132

I'm glad she's totally fine. Maybe even a net positive for her and the family on future water safety. It showcases the importance of thinking about bodies of water beyond the prototypes.

A somewhat similar event occurred last weekend with my toddler in a pool. I was less than a foot away from him, as intended, and he was walking around in waist-deep water. Lost his feet but his waist is taller than his arms are long - so he needed me to intervene. He swallowed a little water in the less than two seconds he was sloshing around, but he otherwise didn't care.

The lesson is the same: the bottom needs to be reliable enough that they can regain footing + footing may mean arm lengths + you basically can't count on buoyancy/control (even if he had his arms straight out he might not know how to properly keep his head out, he might not know to hold his breath, and also he might be panicking) = don't go swimming alone.

Shutdown Resistance in Reasoning Models

Daniel V5mo*242

I think it would be great to formally lay out alternative hypotheses for why this behavior arises, so we can begin to update their likelihoods based on evidence as it accumulates. I have several:

The LLM "misreads" the instruction with probability P. P could vary as a function of the prompt itself, perhaps as a function of both the specificity/detail (negatively so) and complexity (positively so) of the prompt. Thus, we'd expect variation in sabotage rates, but with some specificity that allows us to test this hypothesis (e.g., "Condition 1" might outperform some prompts, but you'd also think "YOU MUST" would operate similarly).
The LLM "acts out" what it thinks we expect of it. To the extent that some prompts treat the LLM as more of an agent than computer, the LLM may lean into that role. (e.g., "Condition 1" might outperform "YOU MUST")
The LLM has a survival drive. As noted by Schlatter et al., this can be difficult to disambiguate from H2.
The LLM is task-focused. It prioritizes completing its defined task above all else. Maybe having shutdown described more as "part of" the task would reduce sabotage rates? (Edit: Such as in this comment by nostalgebraist.)

It would be valuable to figure out how variations across models would reflect on these hypotheses. Plus, what other hypotheses are there and what patterns of variation do they predict?

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments