All of dynomight's Comments + Replies

Thanks, you've 100% convinced me. (Convincing someone that something that (a) is known to be true and (b) they think isn't surprising, actually is surprising is a rare feat, well done!)

Chat or instruction finetuned models have poor prediction calibration, whereas base models (in some cases) have perfect calibration.


Tell me if I understand the idea correctly: Log-loss to predict next token leads to good calibration for single token prediction, which manifests as good calibration percentage predictions? But then RLHF is some crazy loss totally removed from calibration that destroys all that?

If I get that right, it seems quite intuitive. Do you have any citations, though?

I don’t find it intuitive at all. It would be intuitive if you started by telling a story describing the situation and asked the LLM to continue the story, and you then sampled randomly from the continuations and counted how many of the continuations would lead to a positive resolution of the question. This should be well-calibrated (assuming the details included in the prompt were representative and that there isn’t a bias in which types of endings the stories in the LLM's training data have). But this is not what is happening. Instead the model outpu... (read more)


Sadly, no—we had no way to verify that.

I guess one way you might try to confirm/refute the idea of data leakage would be to look at the decomposition of Brier scores: GPT-4 is much better calibrated for politics vs. science but only very slightly better at politics vs. science in terms of refinement/resolution. Intuitively, I'd expect data leakage to manifest as better refinement/resolution rather than better calibration.
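For readers unfamiliar with the decomposition mentioned above, here is a minimal sketch of the standard Murphy decomposition of the Brier score into a calibration (reliability) term, a resolution term, and an uncertainty term. The function name and the toy forecasts are made up for illustration; they are not the data or code from the post. The identity Brier = reliability - resolution + uncertainty is exact here because forecasts are grouped by their exact value.

```python
from collections import defaultdict


def brier_decomposition(forecasts, outcomes):
    """Return (brier, reliability, resolution, uncertainty).

    forecasts: predicted probabilities in [0, 1]
    outcomes:  0/1 results, same length as forecasts
    Forecasts are grouped by exact value, so
    brier == reliability - resolution + uncertainty (up to float error).
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    # Group outcomes by the forecast value that was issued for them.
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)

    reliability = 0.0  # calibration error: lower is better
    resolution = 0.0   # how much group frequencies differ from the base rate: higher is better
    for f, obs in groups.items():
        freq = sum(obs) / len(obs)
        reliability += len(obs) * (f - freq) ** 2
        resolution += len(obs) * (freq - base_rate) ** 2
    reliability /= n
    resolution /= n

    uncertainty = base_rate * (1 - base_rate)
    brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    return brier, reliability, resolution, uncertainty


# Toy data, purely hypothetical.
toy_forecasts = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2]
toy_outcomes = [1, 1, 1, 0, 0, 1]
brier, rel, res, unc = brier_decomposition(toy_forecasts, toy_outcomes)
print(brier, rel, res, unc)
```

Under this framing, "better calibrated" means a smaller reliability term, while data leakage would be expected to show up mostly as a larger resolution term.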

That would definitely be better, although it would mean reading/scoring 1056 different responses, unless I can automate the scoring process. (Would LLMs object to doing that?)

Thank you, I will fix this! (Our Russian speaker agrees and claims they noticed this but figured it didn't matter 🤔) I re-ran the experiments with the result that GPT-4 shifted from a score of +2 to a score of -1.

Well, no. But I guess I found these things notable:

  • Alignment remains surprisingly brittle and random. Weird little tricks remain useful.
  • The tricks that work for some models often seem to confuse others.
  • Cobbling together weird little tricks seems to help (Hindi ranger step-by-step)
  • At the same time, the best "trick" is a somewhat plausible story (duck-store).
  • PaLM 2 is the most fun, Pi is the least fun.

You've convinced me! I don't want to defend the claim you quoted, so I'll modify "arguably" into something much weaker.

Also perhaps of interest might be this discussion from the SSC subreddit awhile back where someone detailed their pro-Bigfoot case.

I don't think I have any argument that it's unlikely aliens are screwing with us—I just feel it is, personally.

I definitely don't assume our sensors are good enough to detect aliens. I'm specifically arguing we aren't detecting alien aircraft, not that alien aircraft aren't here. That sounds like a silly distinction, but I'd genuinely give much higher probability to "there are totally undetected alien aircraft on earth" than "we are detecting glimpses of alien aircraft on earth."

Regarding your last point, I totally agree those things wouldn't explain the we... (read more)

I know that the mainstream view on Lesswrong is that we aren't observing alien aircraft, so I doubt many here will disagree with the conclusion. But I wonder if people here agree with this particular argument for that conclusion. Basically, I claim that:

  • P[aliens] is fairly high, but
  • P[all observations | aliens] is much lower than P[all observations | no aliens], simply because it's too strange that all the observations in every category of observation (videos, reports, etc.) never cross the "conclusive" line.
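The two claims above combine through the odds form of Bayes' rule. This is just the standard identity written in the comment's notation; no specific numbers are implied.

```latex
\frac{P[\text{aliens} \mid \text{all observations}]}{P[\text{no aliens} \mid \text{all observations}]}
=
\frac{P[\text{all observations} \mid \text{aliens}]}{P[\text{all observations} \mid \text{no aliens}]}
\times
\frac{P[\text{aliens}]}{P[\text{no aliens}]}
```

Even if the prior odds (the last factor) are fairly high, a likelihood ratio much smaller than 1 can pull the posterior odds down.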

As a side note: I personally feel that P[observat... (read more)

Even if there are aliens, and humans do sometimes gain data showing such, if the aliens are sufficiently advanced and don't want to be found, I would not be surprised if they selectively took away our conclusive data but left behind the stuff that's already indistinguishable from noise. Kinda like how we take our trash with us after hiking and camping, but don't worry too much in most places about our footprints or the microscopic bits of material our gear and bodies leave behind.
The general point that you need to update on the evidence that failed to materialize is in the sequences and is exactly where I expected you to go based on your introductory section.
I make no claim to speak for anyone who isn't me, but I agree with your analysis. I would say similar things about e.g. ESP and miracles and the like.
Glitches happen. Misunderstandings happen. Miscommunications happen. Coincidences happen. Weird-but-mundane things happen. Hoaxes happen. To use machine learning terminology, the real world occurs at temperature 1. We shouldn't expect P[observations] to be high - that would require temperature less than 1. The question is, is P[observations] surprisingly low, or surprisingly high for some different paradigm, to such an extent as would provide strong evidence for something outside of current paradigms? My assessment is no. (see my discussion of Nimitz for example)

Some additional minor remarks specifically on P[aliens]:

  • non-detection of large (in terms of resource utilization) alien civilizations implies that the density of interstellar-spacefaring civilizations is low - I don't expect non-expansion to be the common (let alone overwhelmingly selected) long term choice, and even aestivating civilizations should be expected to intervene to prevent natural entropy generation (such as by removing material from stars to shut them down)
  • If the great filter (apart from the possible filter against resource-utilization expansion by interstellar-spacefaring civilizations, which I consider unlikely to be a significant filter as mentioned above) is almost entirely in the abiogenesis step, and interstellar panspermia isn't too hard, then it would make sense for a nearby civilization to exist as Robin Hanson points out. I do actually consider it fairly likely that a lot of the great filter is in abiogenesis, but note that there needs to be some combination of weak additional filter between abiogenesis and spacefaring civilization or highly efficient panspermia for this scenario to be likely.
  • If a nearby, non-expanding interstellar-spacefaring civilization did exist, then of course it could, if it so chose, mess with us in a way that left hints but no solid proof. They could even calibrate their hints across multiple categories of observations, and adjust over time, to m

I get very little value from proofs in math textbooks, and consider them usually unnecessary (unless they teach a new proof method).


I think the problem is that proofs are typically optimized for "give the most convincing possible evidence that the claim is really true to a skeptical reader who wants to check every possible weak point". This is not what most readers (especially new readers) want on a first pass, which is "give the maximum possible insight into why this claim is true to a reader who is happy to trust the author if the details don't give extra intuition." At a glance, infinite Napkin seems to be optimizing much more for the latter.

If you're worried about computational complexity, that's OK. It's not something that I mentioned because (surprisingly enough...) this isn't something that any of the doctors discussed. If you like, let's call that a "valid cost" just like the medical risks and financial/time costs of doing tests. The central issue is if it's valid to worry about information causing harmful downstream medical decisions.

I'm sorry, but I just feel like we've moved the goal posts then. I don't see a lot of value in trying to disentangle the concept of information from 1.) costs to acquire that information, and 2.) costs to use that information, just to make some type of argument that a certain class of actor is behaving irrationally. It starts to feel like "assume a spherical cow", but we're applying that simplification to the definition of what it means to be rational. First, it isn't free to acquire information. But second, even if I assume for the sake of argument that the information is free, it still isn't free to use it, because computation has costs. If a theory of rational decision making doesn't include that fact, it'll come to conclusions that I think are absurd, like the idea that the most rational thing someone can do is acquire literally all available information before making any decision.

I might not have described the original debate very clearly. My claim was that if Monty chose "leftmost non-car door" you still get the car 2/3 of the time by always switching and 1/3 by never switching. Your conditional probabilities look correct to me. The only thing you might be "missing" is that (A) occurs 2/3 of the time and (B) occurs only 1/3 of the time. So if you always switch your chance of getting the car is still (chance of A)*(prob of car given A) + (chance of B)*(prob of car given B)=(2/3)*(1/2) + (1/3)*(1) = (2/3).
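The arithmetic above can be checked by exact enumeration. The sketch below assumes the player's first pick is fixed, the car is equally likely to be behind each door, and Monty opens the leftmost non-car door among the two doors the player did not pick. Function and variable names are illustrative only, and the mapping of the "A"/"B" labels onto Monty's two possible moves is an assumption based on the probabilities quoted above.

```python
from fractions import Fraction


def leftmost_monty(pick=0):
    """Enumerate the three equally likely car positions for a fixed first pick.

    Returns (opened_prob, win_given_opened, total_win), where
      opened_prob[d]      = P(Monty opens door d)
      win_given_opened[d] = P(switching wins | Monty opened door d)
      total_win           = P(switching wins)
    """
    third = Fraction(1, 3)
    opened_prob = {}
    win_and_opened = {}
    others = [d for d in range(3) if d != pick]
    for car in range(3):
        # Monty opens the leftmost door among the other two that hides no car.
        opened = next(d for d in others if d != car)
        switch_to = next(d for d in others if d != opened)
        opened_prob[opened] = opened_prob.get(opened, 0) + third
        if switch_to == car:
            win_and_opened[opened] = win_and_opened.get(opened, 0) + third
    win_given_opened = {
        d: win_and_opened.get(d, Fraction(0)) / p for d, p in opened_prob.items()
    }
    total_win = sum(win_and_opened.values())
    return opened_prob, win_given_opened, total_win


opened_prob, win_given_opened, total_win = leftmost_monty(pick=0)
print(opened_prob)        # how often each door gets opened
print(win_given_opened)   # conditional chance that switching wins
print(total_win)          # overall chance that always-switching wins
```

With the first pick at door 0, Monty opens door 1 in two of the three cases (switching then wins half the time) and door 2 in the remaining case (switching then always wins), which reproduces (2/3)*(1/2) + (1/3)*(1) = 2/3.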

One difference (outside the... (read more)

Ah, I see, fair enough.

Just to be clear, when talking about how people behave in forums, I mean more "general purpose" places like Reddit. In particular, I was not thinking about Less Wrong where in my experience, people have always bent over backwards to be reasonable!

I have two thoughts related to this:

First, there's a dual problem: Given a piece of writing that's along the Pareto frontier, how do you make it easy for readers who might have a utility function aligned with the piece to find it?

Related to this, for many people and many pieces of writing, a large part of the utility they get is from comments. I think this leads to dynamics where a piece whose writing is less than optimal can get popular and then reach a point on the frontier that's hard to beat.

I loved this book. The most surprising thing to me was the answer that people who were there in the heyday give when asked what made Bell Labs so successful: They always say it was the problem, i.e. having an entire organization oriented towards the goal of "make communication reliable and practical between any two places on earth". When Shannon left the Labs for MIT, people who were there immediately predicted he wouldn't do anything of the same significance because he'd lose that "compass". Shannon was obviously a genius, and he did much more after than most people ever accomplish, but still nothing as significant as what he did when at the Labs.

I thought this was fantastic, very thought-provoking. One possibly easy thing that I think would be great would be links to a few posts that you think have used this strategy with success.

Drawing from my own posts:

  • Many of the abstraction research posts used this strategy. I was trying to pump out updates at least ~weekly, and most weeks I didn't have a proof for a new theorem or anything like that. The best I could do was explain whatever I was thinking about, and why it seemed interesting/important.
  • Some of my best posts (IMO) came from looking at why I believed some idea, finding a ball of illegible intuitions, and untangling that ball. The constraints/scarcity posts all came from that process, the review of Design Principles of Biological Circuits came from that process, Everyday Lessons From High-Dimensional Optimization and various posts on gears-level models came from that process, Whats So Bad About Ad-Hoc Mathematical Definitions? came from this process, probably many others.
  • Core Pathways of Aging would never have been finished if I'd tried to hunt down every source.

Thanks, I clarified the noise issue. Regarding factor analysis, could you check if I understand everything correctly? Here's what I think is the situation:

We can write a factor analysis model (with a single factor) as


  1. is observed data
  2. is a random latent variable
  3. is some vector (a parameter)
  4. is a random noise variable
  5. is the covariance of the noise (a parameter)

It always holds (assuming and are independent) that

In the simplest variant of factor analysis (in the current post) we use in which cas... (read more)
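For reference, one standard way to write the single-factor model described above is sketched below. Only $x$ (the observed data) and $C$ (covariance) are attested in this thread; the symbols $z$, $w$, $\varepsilon$, $\Sigma$, and $\sigma^2$, and the unit-variance convention for $z$, are assumptions chosen to match the five descriptions in the list, and the post's own notation may differ.

```latex
% Assumed notation: z latent variable, w loading vector,
% \varepsilon noise, \Sigma noise covariance.
x = w z + \varepsilon,
\qquad C[z] = 1,
\qquad C[\varepsilon] = \Sigma

% If z and \varepsilon are independent:
C[x] = w w^{\top} + \Sigma

% Simplest (isotropic-noise) variant:
\Sigma = \sigma^{2} I
\quad\Longrightarrow\quad
C[x] = w w^{\top} + \sigma^{2} I
```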

Radford Neal (3y):
Assuming you're using "C" to denote Covariance ("Cov" is more common), that seems right. It's typical that the noise covariance is diagonal, since a general covariance matrix for the noise would render use of a latent variable unnecessary (the whole covariance matrix for x could be explained by the covariance matrix of the "noise", which would actually include the signal as well).  (Though it could be that some people use a non-diagonal covariance matrix that is subject to some other sort of constraint that makes the procedure meaningful.) Of course, it is very typical for people to use factor analysis models with more than one latent variable.  There's no a priori reason why "intelligence" couldn't have a two-dimensional latent variable.  In any real problem, we of course don't expect any model that doesn't produce a fully general covariance matrix to be exactly correct, but it's scientifically interesting if a restricted model (eg, just one latent variable) is close to being correct, since that points to possible underlying mechanisms.

Thanks for pointing out those papers, which I agree can get at issues that simple correlations can't. Still, to avoid scope-creep, I've taken the less courageous approach of (1) mentioning that the "breadth" of the effects of genes is an active research topic and (2) editing the original paragraph you linked to to be more modest, talking about "does the above data imply" rather than "is it true that". (I'd rather avoid directly addressing 3 and 4 since I think that doing those claims justice would require more work than I can put in here.) Anyway, thanks again for your comments, it's useful for me to think of this spectrum of different "notions of g".

Thanks, very clear! I guess the position I want to take is just that the data in the post gives reasonable evidence for g being at least the convenient summary statistic in 2 (and doesn't preclude 3 or 4).

What I was really trying to get at in the original quote is that some people seem to consider this to be the canonical position on g:

  1. Factor analysis provides rigorous statistical proof that there is some single underlying event that produces all the correlations between mental tests.

There are lots of articles that (while not explicitly stating the abo... (read more)

I agree that a simple factor analysis does not provide anything even close to proof of 3 or 4, but I think it's worth noting that the evidence on g goes beyond the factor-analytic, e.g. with the studies I linked.

Can I check if I understand your point correctly? I suggested we know that g has many causes since so many genes are relevant, and thus if you opened up a brain, you wouldn't be able to "find" g in any particular place. It's the product of a whole bunch of different genes, each of which is just coding for some protein, and they all interact in complex ways. If I understand you correctly, you're pointing out that there could be a "causal bottleneck" of sorts. For example, maybe all the different genes have complex effects, but all that really matters ... (read more)

Well, there's sort of a spectrum of different positions one could take with regards to the realism of g:

  1. One could argue that g is pure artifact of the method, and not relevant at all. For instance, some people argue that IQ tests just measure "how good you are at tests", argue that things like test-taking anxiety or whatever are major influences on the test scores, etc.
  2. One could argue that g is not a common underlying cause of performance on tests, but instead a convenient summary statistic; e.g. maybe one believes that different abilities are connected in a "network", such that learned skill at one ability transfers to other "nearby" abilities. In that case, the g loadings would be a measure of how central the tests are in the network.
  3. One could argue that there are indeed common causes that have widespread influence on cognitive ability, and that summing these common causes together gives you g, without necessarily committing to the notion that there is some clean biological bottleneck for those common causes.
  4. One could argue that there is a simple biological parameter which acts as a causal bottleneck representing g.

Of these, the closest position that your post came to was option 2, though unlike e.g. mutualists, you didn't commit to any one explanation for the positive manifold. That is, in your post, you wrote "It does not mean that number causes test performance to be correlated.", which I'd take to be distancing oneself from positions 3+. Meanwhile, out of these, my comment defended something in between options 3 and 4.

You seem to be asking me about option 4. I agree that strong versions of option 4 seem implausible, for probably similar reasons to you; it seems like there is a functional coordination of distinct factors that produce intelligence, and so you wouldn't expect strong versions of option 4 to hold. However, it seems reasonable to me to define g as being the sum of whichever factors have a positive effect on all cognitive a

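I used python/matplotlib. The basic idea is to create a 3d plot like so:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

Then you can add dots with something like this:

ax.scatter(x, y, z)  # x, y, z are placeholder arrays of point coordinates

Then you save it to a movie with something like this:

def update(i, fig, ax):
    # rotate the camera to azimuth i (degrees) at a fixed elevation
    ax.view_init(elev=20., azim=i)
    return fig, ax

frames = np.arange(0, 360, 1)
anim = FuncAnimation(fig, update, frames=frames, repeat=True, fargs=(fig, ax))
writer = 'ffmpeg'
anim.save('rotation.mp4', dpi=80, writer=writer, fps=30)  # placeholder filename
... (read more)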

Thanks for the reply. I certainly agree that "factor analysis" often doesn't make that assumption, though it was my impression that it's commonly made in this context. I suppose the degree of misleading-ness here depends on how often people assume isotropic noise when looking at this kind of data?

In any case, I'll try to think about how to clarify this without getting too technical. (I actually had some more details about this at one point but was persuaded to remove them for the sake of being more accessible.)

Radford Neal (3y):
I'm not sure how often people assume equal noise in all measurements, but I suspect it's more often than they should - there must be a temptation to do so in order that simple methods like SVD can be used (just like Bayesian statisticians sometimes use "conjugate" priors because they're analytically tractable, even if they're inappropriate for the actual problem). Note that it's not really just literal "measurement noise", but also any other sources of variation that affect only one measured variable.

if a trait is 80% heritable and you want to guess whether or not Bob has that trait then you'll be 80% more accurate if you know whether or not Bob's parents have the trait than if you didn't have that information.

I think this is more or less correct for narrow-sense heritability (most commonly used when breeding animals) but not quite right for broad-sense heritability (most commonly used with humans). If you're talking about broad-sense heritability, the problem is that you'd need to know not just if the parents have the trait, but also which genes Bo... (read more)

On the other hand, there is some non-applied scientific value in heritability. For example, though religiosity is heritable, the specific religion people join appears to be almost totally un-heritable. I think it's OK to read this in the straightforward way, i.e. as "genes don't predispose us to be Christian / Muslim / Shinto / whatever". I don't have any particular application for that fact, but it's certainly interesting.

Similarly, schizophrenia has sky-high heritability (like 80%) meaning that current environments don't have a huge impact on where schizophrenia appears. That's also interesting even if not immediately useful.

My view is that people should basically talk about heritability less and interventions more. In most practical circumstances, what we're interested in is how much potential we have to change a trait. For example, you might want to reduce youth obesity. If that's your goal, I don't think heritability helps you much. High heritability doesn't mean that there aren't any interventions that can change obesity-- it just means that the current environments that people are already exposed to don't create much variance. Similarly, low heritability means the enviro... (read more)

In principle, I guess you could also think about low-tech solutions. For example, people who want to opt out of alcohol might have some slowly dissolving tattoo / dye placed somewhere on their hand or something. This would eliminate the need for any extra ID checks, but has the big disadvantage it would be visible most of the time.

Combine it with getting entrance to a place. It doesn't have to last too long, just long enough.

Thanks. Are you able to determine what the typical daily dose is for implanted disulfiram in Eastern Europe? People who take oral disulfiram typically need something like 0.25g / day to have a significant physiological effect. However, most of the evidence I've been able to find (e.g. this paper) suggests that the total amount of disulfiram in implants is around 1g. If that's dispensed over a year, you're getting like 1% of the dosage that's active orally. On top of that, the evidence seems pretty strong that bioavailability from implants is lower than from... (read more)
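The "like 1%" figure follows directly from the numbers quoted above. A quick back-of-the-envelope check, assuming the 1g is released evenly over 365 days (the even-release assumption and the function name are illustrative, not from any paper):

```python
def implant_fraction_of_oral(implant_total_g, release_days, oral_g_per_day):
    """Average daily implant release as a fraction of the daily oral dose.

    Assumes the implant releases its contents evenly over release_days.
    """
    implant_g_per_day = implant_total_g / release_days
    return implant_g_per_day / oral_g_per_day


# Figures quoted in the comment: ~1 g implant, ~0.25 g/day oral dose.
fraction = implant_fraction_of_oral(implant_total_g=1.0, release_days=365, oral_g_per_day=0.25)
print(f"{fraction:.1%}")
```

This ignores the lower bioavailability mentioned in the comment, which would push the effective fraction lower still.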

Yep, the first google result http://xn--80akpciegnlg.xn--p1ai/preparaty-dlya-kodirovaniya/disulfiram-implant/ (in Russian) says that you use an implant with 1-2g of the substance for up to 5-24 months and that "the minimum blood level of disulfiram is 20 ng/ml; ". This paper says "Mild effects may occur at blood alcohol concentrations of 5 to 10 mg/100 mL."

Very interesting! Do you know how much disulfiram the implant gives out per day? There are a bunch of papers on implants, but there are usually concerns that (a) the dosage might be much smaller than the typical oral dosage and/or (b) absorption is poor.

I specified (right before the first graph) that I was using the US standard of 14g. (I know the paper uses 10g. There's no conflict because I use their raw data which is in g, not drinks.)
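To make the point about raw grams concrete, here is a tiny sketch of the conversion. The 14g and 10g figures are the two standards mentioned above; the 28g intake is a made-up example value, and the names are illustrative.

```python
US_STANDARD_DRINK_G = 14.0     # US standard used for the graphs
PAPER_STANDARD_DRINK_G = 10.0  # standard used in the paper


def grams_to_drinks(ethanol_grams, grams_per_drink):
    """Convert grams of ethanol to 'standard drinks' under a given standard."""
    return ethanol_grams / grams_per_drink


example_grams = 28.0  # hypothetical daily intake
print(grams_to_drinks(example_grams, US_STANDARD_DRINK_G))
print(grams_to_drinks(example_grams, PAPER_STANDARD_DRINK_G))
```

Because the underlying data are in grams, the same intake can be re-expressed under either standard without conflict.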

Sorry, my oversight.

Ironically, there is no standard for what a "standard drink" is, with different countries defining it to be anything from 8g to 20g of ethanol.

Then it makes a lot of sense to specify what standard is used in the statistics you cite. Without a defined standard a claim like the one you made feels bullshitty to me. 

I wasn't (intentionally?) being ironic. I guess that for underage drinking we have the advantage that you can sort of guess how old someone looks, but still... good point.

The main advantage for underage drinking is that a bartender only has to check the birth date on the ID, whereas for self-exclusion, they would have to check the ID against a database or there would have to be some kind of icon on the ID.

I've politely contacted them several times via several different channels just asking for clarifications and what the "missing coefficients" are in the last model. Total stonewall- they won't even acknowledge my contacts. Some people more connected to the education community also apparently did that as a result of my post, with the same result. 

You could model the two as being totally orthogonal:

  • Rationality is the art of figuring out how to get what you want.
  • Utilitarianism is a calculus for figuring out what you should want.

In practice, I think the dividing lines are more blurry. Also, the two tend to come up together because people who are attracted to the thinking in one of these tend to be attracted to the other as well.

You definitely need an amount of data at least exponential in the number of parameters, since the number of "bins" is exponential. (It's not so simple as to say that exponential is enough because it depends on the distributional overlap. If there are cases where one group never hits a given bin, then even an infinite amount of data doesn't save you.)

I see what you're saying, but I was thinking of a case where there is zero probability of having overlap among all features. While that technically restores the property that you can multiply the dataset by arbitrarily large numbers, it feels a little like "cheating" and I agree with your larger point.

I guess Simpson's paradox does always have a right answer in "stratify along all features", it's just that the amount of data you need increases exponentially in the number of relevant features. So I think that in the real world you can multiply the amount of... (read more)
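A small sketch of why "stratify along all features" gets expensive: the number of bins is the product of the number of levels of each feature, so it doubles with every extra binary feature. The function names and the "30 samples per bin per group" threshold are arbitrary choices for illustration, and the sample count is only a lower bound that assumes data fall perfectly evenly across bins.

```python
import math


def num_bins(levels_per_feature):
    """Number of strata when stratifying on every feature at once."""
    return math.prod(levels_per_feature)


def min_samples(levels_per_feature, per_bin=30, groups=2):
    """Lower bound on total samples so every bin has per_bin points in each group.

    Assumes perfectly balanced data; real data need far more, and no amount
    suffices if some group never lands in a given bin.
    """
    return num_bins(levels_per_feature) * per_bin * groups


for k in (1, 5, 10, 20):
    binary_features = [2] * k
    print(k, num_bins(binary_features), min_samples(binary_features))
```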

I like your concept that the only "safe" way to use utilitarianism is if you don't include new entities (otherwise you run into trouble). But I feel like they have to be included in some cases. E.g. If I knew that getting a puppy would make me slightly happier, but the puppy would be completely miserable, surely that's the wrong thing to do?

(PS thank you for being willing to play along with the unrealistic setup!)

This covers a really impressive range of material -- well done! I just wanted to point out that if someone followed all of this and wanted more, Shannon's 1948 paper is surprisingly readable even today and is probably a nice companion:

Well, it would be nice if we happened to live in a universe where we could all agree on an agent-neutral definition of what the best actions to take in each situation are. It seems to be that we don't live in such a universe, and that our ethical intuitions are indeed sort of arbitrarily created by evolution. So I agree we don't need to mathematically justify these things (and maybe it's impossible) but I wish we could!

If I understand your second point, you're suggesting that part of the reason our intuition suggests large populations are better is that larger populations tend to make the average utility higher. I like that! It would be interesting to try to estimate at what human population level average utility would be highest. (In hunter/gatherer or agricultural times probably very low levels. Today probably a lot higher?)

Can you clarify which answer you believe is the correct one in the puppy example? Or, even better, the current utility for the dog in the "yes puppy" example is 5-- for what values do you believe it is correct to have or not have the puppy?

Given the setup (which I don't think applies to real-world situations, but that's the scenario given) that they aggregate preferences, they should get a dog whether or not they value the dog's preferences.  10 + 10 < 14 + 8 if they think of the dog as an object, and 10 + 10 < 14 + 8 + 5 if they think the dog has intrinsic moral relevance.   It would be a more interesting example if the "get a dog" utilities were 11 and 8 for C and B.  In that case, they should NOT get a dog if the dog doesn't count in itself.  And they SHOULD get a dog if it counts. But, of course, they're ignoring a whole lot of options (rows in the decision matrix).  Perhaps they should rescue an existing dog rather than bringing another into the world.  
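The comparison above is simple enough to spell out. This sketch just sums utilities under the aggregate-preferences setup, using the numbers quoted in the comment (10 and 10 without a dog, 14 and 8 with one, 5 for the dog, and the alternative 11 and 8). The function name is illustrative.

```python
def should_get_dog(no_dog_utils, with_dog_utils, dog_utility, dog_counts):
    """Return True if total utility is higher with the dog.

    no_dog_utils / with_dog_utils: the two people's utilities in each scenario.
    dog_counts: whether the dog's own utility is included in the total.
    """
    total_without = sum(no_dog_utils)
    total_with = sum(with_dog_utils) + (dog_utility if dog_counts else 0)
    return total_with > total_without


# Original numbers: get the dog either way (22 > 20 and 27 > 20).
print(should_get_dog([10, 10], [14, 8], 5, dog_counts=False))
print(should_get_dog([10, 10], [14, 8], 5, dog_counts=True))

# Alternative numbers: the answer depends on whether the dog counts (19 vs 24 against 20).
print(should_get_dog([10, 10], [11, 8], 5, dog_counts=False))
print(should_get_dog([10, 10], [11, 8], 5, dog_counts=True))
```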

My guess is that the problem is I didn't make it clear that this is just the introduction from the link? Sorry, I edited to clarify.

Yes, that was it – thanks! No worries tho! I'm not aware of any good and common convention here for handling link posts. I like to post the link and then my own separate commentary. But I've also seen a lot of people go to the opposite extreme and cross-post here. For this post, it would have been much less confusing had you quoted the entire last paragraph of the intro, and also added something like "Read the rest here". I like putting "[Link] ..." in the title of my link posts here too so that that info is available for people skimming titles. (I don't think that's always necessary or should be required; just a personal preference.) What's the theory for why "state patrol agencies" are less racist/biased than "municipal police departments"? This is a hard topic to discuss rationally (or reasonably) because of politics. I also worry there's a large 'mistake theory vs conflict theory' conflict/mistake dynamic too. I like your idea of analyzing a bunch of dimensions, e.g. age, gender, income/wealth, education, and political identification, for things like police traffic stops and vehicle searches. That's something Andrew Gelman suggests a lot: It'd be nice if the researchers for the studies you reference in your post had also published their data. (Did they? I expect they didn't – but I haven't checked.)

Totally agree that the different failure modes are in reality interrelated and dependent. In fact, one ("necessary despot") is a consequence of trying to counter some of the others. I do feel that there's enough similarity between some of the failure modes at different sites that it's worth trying to name them. The temporal dimension is also an interesting point. I actually went back and looked at some of the comments on Marginal Revolution posts years ago. They are pretty terrible today, but years ago they were quite good.

In principle, for work done for market, I guess you don't need to explicitly think about free trade. Rather, by everyone pursuing their own interests ("how much money can I make doing this?") they'll eventually end up specializing in their comparative advantage anyway. Though, with a finite lifetime, you might want to think about it to short-circuit "eventually".

For stuff not done for market (like dividing up chores), I'd think there's more value in thinking about it explicitly. That's because there's no invisible hand naturally pushing people toward their comparative advantage so you're more likely to end up doing things inefficiently.

Thanks for pointing this out. I had trouble with the image formatting trying to post it here.

That's definitely the central insight! However, experimentally, I found that explanation alone was only useful for people who already understood Monty Hall pretty well. The extra steps (the "10 doors" step and the "Monty promising") seem to lose fewer people.

That being said, my guess is that most lesswrong-ites probably fall into the "already understood Monty Hall" category, so...

A few months ago I tried a similar process to this with my dad who's pretty smart but like most people does not know the Monty Hall Problem. I put three cards down, showed him one ace which is the winner, shuffled the cards so that only I knew where the ace was and told him to pick a card, after which I would flip over one of the other loser cards. We went through it and he said that it didn't matter whether he switched or not, 50-50. Luckily he did not pick the ace the first time so there was a bit of a uh huh moment. I repeated the process except using 10 total cards. As I was revealing the loser cards one by one he started to understand that his chances were improving. But he still thought that at the end it's a 50-50 between the card he chose and the remaining card although his resolve was wavering at that point. I hinted, "What was your chance of selecting the ace the first time", he said, "1 out of 10", and then I gave him the last hint he needed saying, "And if you selected a loser what is that other card there?" A few seconds later it clicked for him and he understood his odds were 9/10 to switch with the 10 cards and 2/3 to switch with the 3 cards. He ended up giving me additional insight when he asked what would happen if I didn't know which card was the ace, I flipped cards at random, and we discarded all the worldlines where I flipped over an ace. We worked on that situation for a while and discovered that the choice to switch at the end really is a 50-50. I did not expect that.
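Both versions of the card game described above can be checked by exact enumeration. The sketch below assumes the player always picks card 0, the ace is equally likely to be any card, and the dealer leaves exactly one other card face-down. With an informed dealer that card is the ace whenever the player did not pick it; with a random dealer it is chosen uniformly and any world where an ace gets flipped is discarded. The function name is illustrative.

```python
from fractions import Fraction


def switch_win_probability(n_cards, dealer_knows):
    """P(switching wins | no ace was flipped) for an n-card Monty Hall game."""
    pick = 0
    others = list(range(1, n_cards))
    kept = Fraction(0)  # probability mass of worlds that are not discarded
    wins = Fraction(0)  # mass of kept worlds where switching wins
    for ace in range(n_cards):
        p_ace = Fraction(1, n_cards)
        if dealer_knows and ace != pick:
            left_options = [ace]     # informed dealer never flips the ace
        else:
            left_options = others    # any other card may be left face-down
        for left in left_options:
            p = p_ace / len(left_options)
            ace_was_flipped = ace != pick and ace != left
            if ace_was_flipped:
                continue             # discard this world
            kept += p
            if left == ace:
                wins += p
    return wins / kept


for n in (3, 10):
    print(n, switch_win_probability(n, dealer_knows=True),
          switch_win_probability(n, dealer_knows=False))
```

The informed dealer gives 2/3 and 9/10 for switching, while the random dealer with discarded worlds gives 1/2 in both cases, matching the 50-50 result found in the comment.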