Joel Burget

Wiki Contributions


Re Na:K: potassium chloride is used as a salt substitute (and tastes surprisingly like regular salt). This makes it really easy to tweak the Na:K ratio (if it turns out to be important). OTOH, it's some evidence that the ratio isn't important; otherwise I'd expect someone to have noticed that people lose weight when they substitute it for table salt.

We don't hear much about Apple in AI -- curious why you rank them so highly.

Though the statement doesn't say much, the list of signatories is impressively comprehensive. The only conspicuously missing names that immediately come to mind are Dean and LeCun (I don't know if they were asked to sign).

I have a couple of basic questions:

  1. Shouldn't the diagonal elements in the perplexity table all equal the baseline (since the added vector should be 0)?
  2. I'm a bit confused by the use of perplexity here. The added vector introduces a bias (away from one digit and towards another), so shouldn't we expect perplexity to increase? Eyeballing the visualizations, they all do seem to shift mass away from b and towards a.
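The intuition behind point 2 can be shown with a toy example (the 3-class setup and all numbers here are made up for illustration, not taken from the paper): adding a constant bias vector to the logits moves probability mass off the originally-favored class, which mechanically raises perplexity measured against that class.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def perplexity(prob_of_correct):
    # Perplexity of a single prediction = exp(cross-entropy).
    return math.exp(-math.log(prob_of_correct))

# Toy model with 3 classes; the "correct" class is index 0.
logits = [2.0, 1.0, 0.5]
base_p = softmax(logits)[0]

# Add a bias vector that pushes away from class 0 and toward class 2.
bias = [-1.0, 0.0, 1.0]
steered = [l + b for l, b in zip(logits, bias)]
steered_p = softmax(steered)[0]

# Mass moved off the correct class, so perplexity on it goes up.
print(perplexity(base_p) < perplexity(steered_p))  # → True
```

This is only the mechanical effect of a bias; whether the paper's reported perplexity changes are larger than this baseline shift is exactly the question.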

Link to Rob Bensinger's comments on this market:

I worry that this is conflating two possible meanings of FLOPS:

  1. Floating Point Operations (FLOPs)
  2. Floating Point Operations per Second (maybe FLOPs/s would be clearer?)

The AI and Memory Wall data is using (1) while the Sandberg / Bostrom paper is using (2) (see the definition in Appendix F).

(I noticed a type error when thinking about comparing real-time brain emulation vs training).
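A quick unit check makes the distinction concrete (the numbers below are illustrative placeholders, not figures from either source): a stock of compute (FLOPs) is a rate (FLOPs/s) multiplied by a duration, so comparing the two directly is a type error.

```python
# Hypothetical numbers, purely to illustrate the units.
emulation_rate = 1e18    # real-time brain emulation estimate, FLOPs/s (a rate)
training_budget = 1e24   # training compute, total FLOPs              (a stock)

# A rate times a duration gives a stock; only then are the units comparable.
one_year_s = 365 * 24 * 3600
emulation_year_flops = emulation_rate * one_year_s  # now in FLOPs

print(f"{emulation_year_flops:.2e} FLOPs for a year of real-time emulation")
# prints "3.15e+25 FLOPs for a year of real-time emulation"
```

With mismatched units, the same numeral can be off by whatever duration you implicitly assumed, which is how the two sources end up looking comparable when they aren't.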

One more, related to your first point: I wouldn't expect all mesaoptimizers to have the same signature, since they could take very different forms. What does the distribution of mesaoptimizer signatures look like? How likely is it that a novel (undetectable) mesaoptimizer arises in training?

As far as we are aware, GPT-4 will use the same 50,257 tokens as its two most recent predecessors.

I suspect it'll have more. OpenAI recently released the tiktoken library, which includes "cl100k_base" with ~100k tokens.

The capabilities case for this is that GPT-{2,3} seem to be somewhat hobbled by their tokenizer, at least when it comes to arithmetic. But cl100k_base has exactly 1110 tokens which are just digits: 10 one-digit tokens, 100 two-digit tokens, and 1000 three-digit tokens! (None have preceding spaces.)
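The 1110 figure is just the count of all digit strings of length 1 through 3. A quick sanity check of the arithmetic (pure enumeration, not an actual scan of the cl100k_base vocabulary):

```python
from itertools import product

# All strings of 1-3 digit characters: 10 + 100 + 1000 of them.
digit_strings = [
    "".join(chars)
    for length in (1, 2, 3)
    for chars in product("0123456789", repeat=length)
]
print(len(digit_strings))  # → 1110
```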

Previous related exploration:

My best guess is that this crowded spot in embedding space is a sort of wastebasket for tokens that show up in machine-readable files but aren’t useful to the model for some reason. Possibly, these are tokens that are common in the corpus used to create the tokenizer, but not in the WebText training corpus. The oddly-specific tokens related to Puzzle & Dragons, Nature Conservancy, and David’s Bridal webpages suggest that BPE may have been run on a sample of web text that happened to have those websites overrepresented, and GPT-2 is compensating for this by shoving all the tokens it doesn’t find useful in the same place.
