dynomight

I blog at https://dynomight.net where I like to destroy my credibility by claiming that incense and ultrasonic humidifiers are bad for you.

Comments

As the original author of underrated reasons to be thankful (here), I guess I can confirm that tearing apart the sun for raw materials was not an intended implication.

I think matplotlib has way too many ways to do everything to be comprehensive! But I think you could do almost everything with some variants of these.

import matplotlib.pyplot as plt

fig, ax = plt.subplots()                            # the snippets below assume an Axes object
ax.spines['top'].set_visible(False)                 # hide one border; or 'left' / 'right' / 'bottom'
ax.set_xticks([0, 50, 100], ['0%', '50%', '100%'])  # place ticks and label them in one call
ax.tick_params(axis='y', left=False, right=False)   # hide y tick marks; for axis='x' use bottom=/top=
ax.set_ylim([0, 0.30])                              # fix the y-axis range
ax.set_ylim([0, ax.get_ylim()[1]])                  # pin the bottom at 0, keep the automatic top

Good point regarding year tick marks! I was thinking that labeling 0°C would make the most sense when freezing is really important. Say, if you were plotting historical temperature data and you were interested in estimating the last frost date in spring or something. Then 10°C would mean "twice as much margin" as 5°C.

One way you could measure which one is "best" would be to measure how long it takes people to answer certain questions. E.g. "For what fraction of the 1997-2010 period did Japan spend more on healthcare per-capita than the UK?" or "what's the average ratio of healthcare spending in Sweden vs. Greece between 2000 and 2010?" (I think there is an academic literature on these kinds of experiments, though I don't have any references on hand.)

In this case, I think Tufte goes overboard in saying you shouldn't use color. But if the second plot had color, I'd venture it would win most such contests, if only because the y-axis is bigger and it's easier to match the lines with the labels. But even if I don't agree with everything Tufte says, I still find him useful because he suggests different options and different ways to think about things.

Thanks! Someone once gave me the advice that after you write something, you should go back to the beginning and delete as many paragraphs as you can without making everything incomprehensible. After hearing this, I noticed that most people tend to write like this:

  • Intro
  • Context
  • Overview
  • Other various throat clearing
  • Blah blah blah
  • Finally an actual example, an example, praise god

Which is pretty easy to correct once you see it!

Hey, you might be right! I'll take this as useful feedback that the argument wasn't fully convincing. Don't mean to pull a motte-and-bailey, but I suppose if I had to, I'd retreat to an argument like, "if making a plot, consider using these rules as one option for how to pick axes." In any case, if you have any examples where you think following this advice leads to bad choices, I'd be interested to hear them.

I think you're basically right: Correlation is just one way of measuring dependence between variables. Being correlated is a sufficient but not necessary condition for dependence. We talk about correlation so much because:

  1. We don't have a particularly convenient general-purpose scalar measure of how related two variables are. You might think about using something like mutual information, but for that you need the densities, not just datasets. (See the sketch after this list.)
  2. We're still living in the shadow of the era when computers weren't so powerful. We got used to doing all sorts of things based on linearity decades ago because we didn't have other options, and those methods became "conventional" even though we might have better options now.
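
As a minimal sketch of point 1 (my own illustration, nothing from the thread): two variables can be plainly dependent yet have essentially zero correlation, and the obvious plug-in mutual information estimate forces arbitrary choices (here, a 30-bin histogram) and is known to be biased, which is part of why it isn't a convenient drop-in replacement. The helper mi_histogram is just a throwaway name.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x**2 + 0.1 * rng.normal(size=100_000)  # y depends strongly on x, but symmetrically

# Pearson correlation is ~0 even though y is nearly a function of x.
print(np.corrcoef(x, y)[0, 1])

# Crude plug-in mutual information estimate from a 2D histogram.
# Biased and sensitive to the bin count; purely illustrative.
def mi_histogram(x, y, bins=30):
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))

print(mi_histogram(x, y))  # clearly > 0: dependent despite zero correlation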

Thanks, you've 100% convinced me. (Convincing someone that something they (a) know to be true and (b) don't think is surprising actually is surprising is a rare feat. Well done!)

Chat- or instruction-finetuned models have poor prediction calibration, whereas base models (in some cases) have perfect calibration.

Tell me if I understand the idea correctly: log-loss on next-token prediction leads to good calibration for single-token predictions, which manifests as good calibration on percentage predictions? But then RLHF is some crazy loss totally removed from calibration that destroys all that?

If I get that right, it seems quite intuitive. Do you have any citations, though?
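
To spell out the first half of my guess above, here's a tiny sketch (my own illustration): for a binary event with true probability p, the expected log loss of predicting q is minimized exactly at q = p, so a log-loss-trained predictor is pressured toward calibrated probabilities, while RLHF optimizes something else entirely.

import numpy as np

# Expected log loss of predicting probability q when the truth is Bernoulli(p):
#   L(q) = -p*log(q) - (1-p)*log(1-q), minimized at q = p.
p = 0.7
q = np.linspace(0.01, 0.99, 99)
loss = -p * np.log(q) - (1 - p) * np.log(1 - q)
print(q[np.argmin(loss)])  # ~0.70: reporting the honest probability is optimal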

Sadly, no—we had no way to verify that.

I guess one way you might try to confirm or refute the idea of data leakage would be to look at the decomposition of Brier scores: GPT-4 is much better calibrated on politics than on science, but only very slightly better on politics in terms of refinement/resolution. Intuitively, I'd expect data leakage to manifest as better refinement/resolution rather than better calibration. (Rough sketch of the decomposition below.)
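
For reference, here's a rough sketch of the decomposition I have in mind (Murphy-style; the forecasts, bin count, and helper name are all my own made-up illustration): the mean Brier score is approximately reliability - resolution + uncertainty, where reliability measures miscalibration (lower is better) and resolution measures how much the model actually discriminates between outcomes. Leakage should mostly show up as higher resolution; calibration can improve without any new knowledge.

import numpy as np

def murphy_decomposition(forecasts, outcomes, n_bins=10):
    # Bin the forecasts, then compute:
    #   reliability: weighted squared gap between each bin's mean forecast
    #                and its observed frequency (miscalibration)
    #   resolution:  weighted squared gap between each bin's observed
    #                frequency and the overall base rate (discrimination)
    #   uncertainty: variance of the outcomes, base_rate * (1 - base_rate)
    forecasts, outcomes = np.asarray(forecasts), np.asarray(outcomes)
    base_rate = outcomes.mean()
    bins = np.clip((forecasts * n_bins).astype(int), 0, n_bins - 1)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            w = mask.mean()
            obs = outcomes[mask].mean()
            reliability += w * (forecasts[mask].mean() - obs) ** 2
            resolution += w * (obs - base_rate) ** 2
    return reliability, resolution, base_rate * (1 - base_rate)

# Made-up forecasts with outcomes drawn to match them (so reliability ~ 0).
rng = np.random.default_rng(0)
f = rng.uniform(size=10_000)
y = (rng.uniform(size=10_000) < f).astype(float)
rel, res, unc = murphy_decomposition(f, y)
print(rel, res, unc, rel - res + unc)  # last value ~ mean Brier score ((f - y)**2).mean()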
