Comments

gwern30

I can be deanonymized in other ways more easily.

I write these as warnings to other people who might think that it is still adequate to simply use a pseudonym, write exclusively in text, and avoid the obvious OPSEC mistakes, and that they can therefore safely write under multiple names. It is not adequate, because within a few years you will have already lost.

Regrettable as it is, if you wish to write anything online which might invite persecution over the next few years or lead activists to try to dox you - if you are, say, blowing a whistle at a sophisticated megacorp with the most punitive NDAs & equity policies in the industry - you would be well-advised to start laundering your writings through an LLM yesterday, despite the deplorable effects on style. Truesight will only get keener and flense away more of the security-by-obscurity we so take for granted, because "attacks only get better".
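
(For concreteness, a minimal sketch of what that laundering step can look like, using the OpenAI Python client; the model name and rewrite instructions are placeholders, and a locally-run model avoids handing the draft to yet another party:)

```python
# Hypothetical sketch only: paraphrase a draft to strip idiosyncratic style
# before posting. The model name is a placeholder; any capable chat model works.
from openai import OpenAI

client = OpenAI()

def launder(draft: str, model: str = "gpt-4o") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the following text in plain, generic prose, preserving "
                "the meaning but removing distinctive word choices, punctuation "
                "habits, and sentence rhythms:\n\n" + draft
            ),
        }],
    )
    return resp.choices[0].message.content
```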

gwern20

Yeah, that's part of why I'm suspicious. I remember the original OA finetuning as being quite expensive, but the current one is not that expensive. If a GPT-3 is like 100GB of weights, say, after optimization, and it's doing true finetuning, how is OA making it so cheap and so low-latency?

gwern41

It is certainly big news if OA fine-tuning doesn't work as it's supposed to

The docs are pretty vague, but I notice that most of them are framed as being around declarative sorts of knowledge. It's positioned as being a way to reduce the number of examples in the prompt (to save tokens & reduce latency), or include additional factual knowledge, like defining edge cases. There is one brief mention that you may be able to use it for "Performing a new skill or task that’s hard to articulate in a prompt", but that's about it.

And when it comes to lightweight finetuning such as LoRA, people tend to notice that they are good for adding new factual knowledge or increasing the prior of specific pre-existing knowledge, but don't really add qualitatively new things - like you cannot simply LoRA your way to better hands in an image generator or teach it 3D generation if it didn't already know that. So I've long been suspicious that OA isn't doing real finetuning, of the entire model, but much cheaper underperforming LoRA-like lightweight finetuning (of the sort which can be easily stored on-GPU rather than loading an entire finetuned model or its delta from cloud storage, or tying up entire sets of GPUs to keep a full finetuned model hot).
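
(To make the "lightweight" point concrete, here is a minimal sketch of the standard LoRA idea - not whatever OA actually does: the pretrained weight matrix stays frozen and only a tiny low-rank correction is learned, which is why an adapter can sit cheaply on-GPU next to the base model.)

```python
# Illustrative LoRA-style adapter on a single linear layer (not OA's implementation).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # delta starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        # frozen full-rank path + learned low-rank correction
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable:,} of {total:,}")  # ~65k of ~16.8M
```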

One sanity check here would be to just make some 128k ctx window calls full of examples; if you cannot k-shot this capability even with that, then you shouldn't expect "finetuning" to either; while if it works, that implies the "finetuning" is much worse than it ought to be and so the original results are uninformative.
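
(A minimal sketch of that sanity check, assuming the OpenAI Python client; the model name, example format, and example count are placeholders:)

```python
# Hedged sketch of the many-shot sanity check: pack as many worked examples as
# the context window allows and see whether the capability shows up k-shot.
from openai import OpenAI

client = OpenAI()

def many_shot_check(examples, query, model="gpt-4o", max_examples=500):
    """examples: list of (input, output) pairs demonstrating the target skill."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples[:max_examples])
    prompt = f"{shots}\n\nInput: {query}\nOutput:"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```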

gwern20

Oh, that seems easy enough. People might think that they are safe as long as they don't write as much as I or Scott do under a few names, but that's not true. If you have any writing samples at all, you just stick the list of them into a prompt and ask about similarity. Even if you have a lot of writing, context windows are now millions of tokens long, so you can stick an entire book (or three) of writing into a context window.

And remember, the longer the context window, the more that the 'prompt' is simply an inefficient form of pretraining, where you create the hidden state of an RNN for millions of timesteps, meta-learning the new task, and then throw it away. (Although note even there that Google has a new 'caching' feature which lets you run the same prompt multiple times, essentially reinventing caching RNN hidden states.) So when you stick corpuses into a long prompt, you are essentially pretraining the LLM some more, and making it as capable of identifying a new author as it is capable of already identifying 'gwern' or 'Scott Alexander'.

So, you would simply do something like put in a list of (author, sample) as well as any additional metadata convenient like biographies, then 'unknown sample', and ask, 'rank the authors by how likely they are to have written that final sample by an unknown author'.
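
(A sketch of that prompt construction; author names, samples, and the exact wording are of course placeholders:)

```python
# Build the ranking prompt described above from (author, sample) pairs.
def build_attribution_prompt(known_samples, unknown_sample):
    """known_samples: list of (author, sample) pairs; append bios or other metadata as available."""
    parts = []
    for i, (author, sample) in enumerate(known_samples, 1):
        parts.append(f"Author {i}: {author}\nSample:\n{sample}\n")
    parts.append(f"Unknown sample:\n{unknown_sample}\n")
    parts.append(
        "Rank the authors above by how likely each is to have written the final "
        "sample by the unknown author, most likely first, with brief justifications."
    )
    return "\n".join(parts)
```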

This depends on having a short list of authors which can fit in the prompt (the shorter the samples, the more you can fit, but the worse the prediction), but it's not hard to imagine how to generalize this to an entire list. You can think of it as a noisy sorting problem or a best-arm finding problem. Just break up your entire list of n authors into groups of m, and start running the identification prompt, which will not cost n log n prompts because you're not sorting the entire list, you are only finding the min/max (which is roughly linear). For many purposes, it would be acceptable to pay a few dozen dollars to dox an author out of a list of a few thousand candidates.
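
(A sketch of that tournament; `score_batch` is a stand-in for one LLM ranking call, e.g. using the prompt builder above, returning author names best-first:)

```python
# Roughly-linear "find the best candidate" tournament over n authors:
# score each batch of m, keep only the batch winners, repeat.
def best_match(candidates, unknown_sample, score_batch, m=20):
    """candidates: list of (author, sample) pairs; returns the single best author."""
    pool = list(candidates)
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool), m):
            batch = pool[i:i + m]
            ranked = score_batch(batch, unknown_sample)     # ranked author names, best first
            winners.append(next(c for c in batch if c[0] == ranked[0]))
        pool = winners
    return pool[0][0]   # total LLM calls ~ n/m + n/m^2 + ... ~ n/(m-1), not n log n
```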

djb admonishes us always to remember to ask about amortization and economies of scale in attacks, and of course that's true here too of stylometric attacks. If we simply do the obvious lazy sort, we are throwing away all of the useful similarity information that the LLM could be giving us. We could instead work on embedding authors by similarity using comparisons. We could, say, input 3 authors at a time, and ask "is author #1 more similar to #2, or #3?" Handwaving the details, you can then take a large set of similarity rankings, and infer an embedding which maximizes the distance between each author while still obeying the constraints. (Using expectation maximization or maybe an integer solver, idk.) Now you can efficiently look up any new author as a sort of nearest-neighbors lookup problem, by running a relatively few comparison prompts to home in on the set of author-points the new author is nearest, and then using that small set for a final direct question.
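
(One way to un-handwave the embedding step: a sketch using plain gradient descent on a margin loss over the LLM's triplet judgments; EM or an integer solver would do as well.)

```python
# Turn "is #1 more similar to #2 or #3?" answers into author embeddings.
import torch

def embed_from_triplets(n_authors, triplets, dim=16, steps=2000, margin=0.5):
    """triplets: list of (anchor, closer, farther) author indices, as judged by the LLM."""
    x = torch.randn(n_authors, dim, requires_grad=True)
    opt = torch.optim.Adam([x], lr=0.05)
    a = torch.tensor([t[0] for t in triplets])
    p = torch.tensor([t[1] for t in triplets])
    q = torch.tensor([t[2] for t in triplets])
    for _ in range(steps):
        opt.zero_grad()
        d_close = (x[a] - x[p]).norm(dim=1)
        d_far = (x[a] - x[q]).norm(dim=1)
        # penalize any embedding that violates a comparison by less than the margin
        loss = torch.relu(d_close - d_far + margin).mean()
        loss.backward()
        opt.step()
    return x.detach()
```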

(All this assumes you are trying to leverage a SOTA LLM which isn't directly accessible. If you use an off-the-shelf LLM like a LLaMA-3, you would probably do something more direct like train a triplet loss on the frozen LLM using large text corpuses and get embeddings directly, making k-NN lookups effectively free & instantaneous. In conclusion, text anonymity will soon be as dead as face anonymity.)
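
(And once you have per-author embedding vectors, however obtained, the lookup step really is just nearest-neighbors; a sketch with made-up data standing in for real embeddings:)

```python
# k-NN attribution over precomputed author embeddings (random stand-in data).
import numpy as np
from sklearn.neighbors import NearestNeighbors

author_names = [f"author_{i}" for i in range(5000)]             # placeholder names
author_vecs = np.random.randn(5000, 256).astype(np.float32)     # placeholder embeddings

index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(author_vecs)

def nearest_authors(sample_vec: np.ndarray):
    """Return the 5 known authors closest to an embedded unknown sample."""
    dist, idx = index.kneighbors(sample_vec.reshape(1, -1))
    return [(author_names[i], float(d)) for i, d in zip(idx[0], dist[0])]

print(nearest_authors(np.random.randn(256).astype(np.float32)))
```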

gwern145

This seems a bit odd given past literature on LLMs. As I've noted before, you can do inner-monologue problems in a single forward pass via knowledge-distillation, somewhat analogous to your finetuning, and it's also possible to ask models to solve multiple problems simultaneously, analogous to your base task (or do various kinds of speculative or parallelized decoding at a lower level). There is enormous computational waste and slack, and capacity to spare for multiple problems. So it not working for the OA "finetuning" of GPT-3.5 is unexpected: I can't think of any previous results aimed at making forward passes do more which failed completely (although ofc maybe they just don't get reported or I didn't happen to read them etc).


I notice this is not the first time I've left a puzzled comment on a post where the authors failed to make GPT-3.5 do something via OA "finetuning" that it seemed like it definitely should have been capable of after finetuning, or which non-OA models did do... And the common ingredient seems to be the OA "finetuning".

I'm not aware of any experiments by third parties demonstrating that OA "finetuning" works like it's supposed to or investigating what it seems to do, and AFAIK OA still declines to explain what their "finetuning" services & models do or are. Maybe someone should do that before more people try to do AI safety research predicated on the assumption that using OA's "finetuning" is telling you anything meaningful about LLMs in general, rather than being like, say, trying to understand LLM poetry by looking at ChatGPT's rhymes or LLM linguistic knowledge by asking one to spell words.

gwern61

Yes, I've never had any difficulty replicating the gwern identification: https://chatgpt.com/share/0638f916-2f75-4d15-8f85-7439b373c23c It also does Scott Alexander: https://chatgpt.com/share/298685e4-d680-43f9-81cb-b67de5305d53 https://chatgpt.com/share/91f6c5b8-a0a4-498c-a57b-8b2780bc1340 (Examples from sinity just today, but they parallel all of the past ones I've done: sometimes it'll balk a little at making a guess or identifying someone, but it's usually not hard to overcome.)

One interesting thing is that the extensive reasoning it gives may not be faithful. Notice that in identifying Scott Alexander's recent Reddit comment, it gets his username wrong - that username does not exist at all. (I initially speculated that it was using retrieval, since OA & Reddit have struck a deal; but obviously, if it had, or had been trained on the actual comment, it would at least get the username right.) And in my popups comment, I see no mention of anything that points to LessWrong, but since I was lazy and didn't copyedit that comment, it is much more idiosyncratic than usual; so what I think ChatGPT-4o does there is immediately deduce that it's me from the writing style & content, infer that it could not be a tweet (due to length) or a Gwern.net quote (because it is clearly a comment on social media responding to someone), and then guess that it's LW rather than HN, and presto.

gwern20
  1. Age is extremely compressed/skewed because it's OKCupid. So I can think of a couple issues there: there might be a problem of distribution mismatch, where a GPT is trained on a much more even distribution of text (I would assume tons of text IRL is written by people aged 50-100, rather than by users of a young techie dating website) and so is simply taking into account a very different base rate; another issue is that maybe the GPT is accurate but restriction of range creates misleading statistical artifacts. Binarization wouldn't help, and might worsen matters - how many people tweak their age on a dating site to avoid the dreaded leading '3' and turning into Christmas cake? You'll remember OKCupid's posts about people shading the truth a little about things like height... (A more continuous loss like median absolute error might be a better metric than Brier on a binary or categorical; see the toy sketch at the end of this comment.)

    As far as sexuality goes, this is something the LLMs may be trained very heavily on, with unpredictable effects. But it's also a much weirder category here:

    Dating sites in general have more males than females, reflecting the mating behavior seen offline (more males being on the lookout). OKCupid features a very broad selection of possible genders. One must choose at least one category and up to 5 categories of which the possible options are: Man, Woman, Agender, Androgynous, Bigender, Cis Man, Cis Woman, Genderfluid, Genderqueer, Gender Nonconforming, Hijra, Intersex, Non-binary, Other, Pangender, Transfeminine, Transgender, Transmasculine, Transsexual, Trans Man, Trans Women and Two Spirit. Nevertheless, almost everybody chooses one of the first two (39.1 % Women, 60.6 % Men, binary total = 99.7 %). The full count by type can be found in the supplementary materials (sheet "Genders").

    I'm not sure how OP handled that. So the predictive power here should be considered a loose lower bound, given all the potential sources of measurement error/noise.
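
(The toy sketch promised above: scoring the same continuous age predictor with median absolute error vs. a Brier score after binarizing at the median; all numbers here are made up purely for illustration.)

```python
# Toy illustration of the metric point on an OKCupid-like compressed age range.
import numpy as np

rng = np.random.default_rng(0)
true_age = rng.normal(27, 3, 1000).clip(18, 45)      # compressed, skewed age distribution
pred_age = true_age + rng.normal(0, 2, 1000)         # a decent continuous predictor

mae = np.median(np.abs(pred_age - true_age))         # continuous error in years

cutoff = np.median(true_age)                         # binarize the same task
y = (true_age > cutoff).astype(float)
p = 1 / (1 + np.exp(-(pred_age - cutoff)))           # crude probability from the prediction
brier = np.mean((p - y) ** 2)

print(f"median abs error: {mae:.2f} years; Brier after binarization: {brier:.3f}")
```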

gwern22

I didn't know it meant either.

gwern319

It seems like it was a big commitment because there were several hints during the OpenAI coup reporting that Superalignment was not getting its promised compute quota as OA ran very short on compute in 2023, creating major internal stress (particularly from Sam Altman telling people different things or assigning the same job to multiple people), and that this was one of the reasons for Altman sidelining Ilya Sutskever in favor of Jakub Pachocki. What sounded good & which everyone initially loved turned out to be a bit painful to actually realize. (Sort of like designing the OA LLC so the OA nonprofit board could fire the CEO.)

EDIT: speak of the devil: https://x.com/janleike/status/1791498178346549382 Note Leike has to be very cautious in his wording.

gwern73

I'm not sure how to square those results with the Chinchilla paper though

Apples and oranges. The Chinchilla paper simply optimizes the final trained model's loss given a fixed compute budget. It doesn't say anything about any downstream uses - similar to how it doesn't tell you (directly) how you should allocate your compute if you have X GPUs and you want to run a model for your users for Y requests, where you face a tradeoff between spending more of your GPUs at training time to create a smaller (overtrained) model which then needs fewer GPUs to serve those Y requests, versus training the compute-optimal but larger model which is more expensive to serve. Likewise, you've probably seen some "overtraining" analyses which argue that you should overtrain a Chinchilla by some large amount Z to get the model which best balances train vs run - but those also answer a different question, because they assume that you will deploy that Chinchilla model without any sparsification or lower precision, even though that's hardly what anyone actually does.
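
(For reference, a sketch of the question Chinchilla does answer, using the standard C ≈ 6ND approximation and the ~20 tokens-per-parameter rule of thumb; exact coefficients vary across replications, and none of this says anything about serving cost.)

```python
# Given a training compute budget C (FLOPs), pick model size N and token count D
# to minimize final loss, per the usual Chinchilla approximations.
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# e.g. a ~1e24 FLOP budget:
n, d = chinchilla_optimal(1e24)
print(f"~{n/1e9:.0f}B params trained on ~{d/1e12:.1f}T tokens")
```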

(While no one has done a Li et al-style analysis for MoEs that I know of, I would expect that the results will be fairly similar, but shifted up/down, because you can often think of a MoE as a bunch of smaller dense models.)
