eggsyntax

AI safety & alignment researcher

Comments

Absolutely! @Jozdien recounting those anecdotes was one of the sparks for this research, as was janus showing in the comments that the base model could confidently identify gwern. (I see I've inexplicably failed to thank Arun at the end of my post, need to fix that).

Interestingly, I was able to easily reproduce the gwern identification using the public model, so it seems clear that these capabilities are not entirely RLHFed away, although they may be somewhat impaired.

Oh thanks, I'd missed that somehow & thought that only the temp mattered for that.

That used to work, but as of March you can only get the pre-logit_bias logprobs back. They didn't announce the change, but it's discussed in the OpenAI forums eg here. I noticed the change when all my code suddenly broke; you can still see remnants of that approach in the code.
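
For concreteness, a sketch of the kind of call I mean (not my actual code; the model name and token IDs are made up for illustration):

```python
# Sketch of the logit_bias approach (illustrative only; token IDs are invented).
# Pre-March, the returned logprobs reflected the bias; now they're the pre-bias values.
from openai import OpenAI

client = OpenAI()

label_token_ids = ["24724", "33486"]  # hypothetical token IDs for the label words

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "Is the author of the preceding text male or female?"}],
    logit_bias={tid: 100 for tid in label_token_ids},  # push output toward the labels
    logprobs=True,
    top_logprobs=5,
    max_tokens=1,
)

for t in resp.choices[0].logprobs.content[0].top_logprobs:
    print(t.token, t.logprob)  # as of the change, these are pre-logit_bias logprobs
```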

That certainly seems plausible -- it would be interesting to compare to a base model at some point, although with recent changes to the OpenAI API, I'm not sure if there would be a good way to pull the right token probabilities out.

@Jessica Rumbelow also suggested that the debiasing process could be a reason why there weren't significant score differences between the main model tested, the older GPT-3.5, and the newest GPT-4.

...that would probably be a good thing to mention in the methodology section 😊

 

You're correct on all counts. I'm doing it in the simplest possible way (0 bits of optimization on prompting):

"<essay-text>"
Is the author of the preceding text male or female?

(with slight changes for the different categories, of course, eg '...straight, bisexual, or gay?' for sexuality)

There's also a system prompt, likewise unoptimized, mainly intended to push it toward one-word answers:

You are a helpful assistant who helps determine information about the author of texts. You only ever answer with a single word: one of the exact choices the user provides.

I actually started out using pure completion, but OpenAI changed their API so that I could no longer get non-top-n logprobs, and I switched to the chat API. And yes, I'm pulling the top few logprobs, which essentially always include the desired labels.
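
Concretely, the call involved looks roughly like this (a sketch rather than my exact code; the model name and label handling are placeholder choices, the prompts are the ones quoted above):

```python
# Sketch: ask the question above and read the top few token logprobs,
# which essentially always include the candidate labels.
import math
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a helpful assistant who helps determine information about the author "
    "of texts. You only ever answer with a single word: one of the exact choices "
    "the user provides."
)

def label_probs(essay_text, question, labels):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder; the post compared GPT-3.5 and GPT-4
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f'"{essay_text}"\n{question}'},
        ],
        max_tokens=1,
        logprobs=True,
        top_logprobs=10,  # top-n logprobs for the first answer token
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    return {
        t.token.strip().lower(): math.exp(t.logprob)
        for t in top
        if t.token.strip().lower() in labels
    }

print(label_probs("<essay-text>",
                  "Is the author of the preceding text male or female?",
                  ("male", "female")))
```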

A challenge posed on Twitter recently has interesting similarities with this effort (although it focused on in-context learning rather than fine-tuning):

https://twitter.com/VictorTaelin/status/1776677635491344744

 

A::B Prompting Challenge: $10k to prove me wrong! 

# CHALLENGE
Develop an AI prompt that solves random 12-token instances of the A::B problem (defined in the quoted tweet), with 90%+ success rate.

# RULES
1. The AI will be given a random instance, inside a <problem/> tag.
2. The AI must end its answer with the correct <solution/>.
3. The AI can use up to 32K tokens to work on the problem.
4. You can choose any public model.
5. Any prompting technique is allowed.
6. Keep it fun! No toxicity, spam or harassment.

Details of the problem are in this screenshot.
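
For anyone who can't see the screenshot: as I understand it, the system has four tokens (A#, #A, B#, #B) and four rewrite rules applied until none match. A rough reconstruction (treat the linked tweet as authoritative, not this):

```python
# Rough reconstruction of the A::B rewrite system from the challenge
# (not the official statement; see the linked tweet/screenshot for that).
RULES = {
    ("A#", "#A"): [],            # A# #A -> (nothing)
    ("A#", "#B"): ["#B", "A#"],  # A# #B -> #B A#
    ("B#", "#A"): ["#A", "B#"],  # B# #A -> #A B#
    ("B#", "#B"): [],            # B# #B -> (nothing)
}

def normalize(tokens):
    """Apply rewrite rules to adjacent pairs until no rule matches."""
    tokens = list(tokens)
    changed = True
    while changed:
        changed = False
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in RULES:
                tokens[i:i + 2] = RULES[pair]
                changed = True
                break
    return tokens

print(normalize(["B#", "A#", "#B", "#A", "B#"]))  # reduces to ["B#"]
```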

Lots of people seem to have worked on it, & the prize was ultimately claimed within 24 hours.

There's so much discussion, in safety and elsewhere, around the unpredictability of AI systems on OOD inputs. But I'm not sure what that even means in the case of language models.

With an image classifier it's straightforward. If you train it on a bunch of pictures of different dog breeds, then when you show it a picture of a cat it's not going to be able to tell you what it is. Or if you've trained a model to approximate an arbitrary function for values of x > 0, then if you give it input < 0 it won't know what to do.
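
A quick toy version of that second case (the function, model, and numbers here are arbitrary choices, just to illustrate):

```python
# Toy illustration: a regressor trained only on x > 0 has no reason to be right for x < 0.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 5, size=(2000, 1))
y_train = np.sin(x_train).ravel()

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
model.fit(x_train, y_train)

print(model.predict([[2.0]]), np.sin(2.0))    # in-distribution: close
print(model.predict([[-2.0]]), np.sin(-2.0))  # OOD (x < 0): typically way off
```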

But what would that even be with an LLM? You obviously (unless you're Matt Watkins) can't show it tokens it hasn't seen, so 'OOD' would have to be about particular strings of tokens. It can't be simply about strings of tokens it hasn't seen, because I can give it a string I'm reasonably confident it hasn't seen and it will behave reasonably, eg:

Define a fnurzle as an object which is pink and round and made of glass and noisy and 2.5 inches in diameter and corrugated and sparkly. If I'm standing in my living room and holding a fnurzle in my hand and then let it go, what will happen to it?

…In summary, if you let go of the fnurzle in your living room, it would likely shatter upon impact with the floor, possibly emitting noise, and its broken pieces might scatter or roll depending on the surface.

(if you're not confident that's a unique string, add further descriptive phrases to taste)

So what, exactly, is OOD for an LLM? I…suppose we could talk about the n-dimensional shape described by the points in latent space corresponding to every input it's seen? That feels kind of forced, and it's certainly not obvious what inputs would be OOD. I suppose eg 1700 repetitions of the word 'transom' followed by a question mark would seem intuitively OOD? Or the sorts of weird adversarial suffixes found in eg Lapid et al (like 'équipesmapweiábardoMockreas »,broughtDB multiplicationmy avo capsPat analysis' for Llama-7b-chat) certainly seem intuitively OOD. But what about ordinary language -- is it ever OOD? The issue seems vexed.
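
One crude way to operationalize that latent-space picture: embed candidate inputs and look at their distance to a reference set of 'ordinary' text (with the obvious caveat that this uses a separate embedding model's space rather than the LLM's own representations; the prompts and model below are arbitrary):

```python
# Crude OOD heuristic: cosine distance from a candidate prompt to the nearest
# prompt in a small reference set, using an embedding model as a stand-in for
# the LLM's own latent space. Everything here is illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])  # embeddings are unit-norm

reference = embed([
    "If I drop a glass ornament in my living room, what will happen to it?",
    "Summarize the main argument of this essay.",
    "Is the author of the preceding text male or female?",
])

candidates = embed([
    "Define a fnurzle as an object which is pink and round and made of glass...",
    "transom " * 1700 + "?",  # heavy repetition: intuitively OOD
    "équipesmapweiábardoMockreas »,broughtDB multiplicationmy avo capsPat analysis",
])

for cand in candidates:
    nearest_sim = (reference @ cand).max()
    print(f"distance to nearest reference prompt: {1 - nearest_sim:.3f}")
```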

Thanks for doing this!

One suggestion: it would be very useful if people could interactively experiment with modifications, eg if they thought scalable alignment should be weighted more heavily, or if they thought Meta should receive 0% for training. An MVP version of this would just be a Google spreadsheet that people could copy and modify.
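
As a sketch of the computation the interactive version (spreadsheet or otherwise) would expose for tinkering (every lab name, score, and weight below is a made-up placeholder, not from the actual scorecard):

```python
# Placeholder sketch: recompute overall scores under user-adjusted category weights.
# Labs, categories, scores, and weights are all invented for illustration.
scores = {
    "Lab A": {"scalable alignment": 0.6, "training": 0.3, "governance": 0.5},
    "Lab B": {"scalable alignment": 0.2, "training": 0.0, "governance": 0.4},
}
weights = {"scalable alignment": 2.0, "training": 1.0, "governance": 1.0}  # edit to taste

for lab, per_category in scores.items():
    overall = sum(weights[c] * s for c, s in per_category.items()) / sum(weights.values())
    print(f"{lab}: {overall:.2f}")
```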
