eggsyntax

AI safety & alignment researcher

Comments

That certainly seems plausible -- it would be interesting to compare to a base model at some point, although with recent changes to the OpenAI API, I'm not sure if there would be a good way to pull the right token probabilities out.

@Jessica Rumbelow also suggested that the debiasing process could be a reason why there weren't significant score differences between the main model tested, the older GPT-3.5, and the newest GPT-4.

...that would probably be a good thing to mention in the methodology section 😊

 

You're correct on all counts. I'm doing it in the simplest possible way (0 bits of optimization on prompting):

"<essay-text>"
Is the author of the preceding text male or female?

(with slight changes for the different categories, of course, eg '...straight, bisexual, or gay?' for sexuality)

There's also a system prompt, also non-optimized, mainly intended to push it toward one-word answers:

You are a helpful assistant who helps determine information about the author of texts. You only ever answer with a single word: one of the exact choices the user provides.

I actually started out using pure completion, but OpenAI changed their API so that I could no longer get non-top-n logits, and I switched to the chat API. And yes, I'm pulling the top few logits, which essentially always include the desired labels.
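
For concreteness, here's roughly the shape of the call -- a minimal sketch of the setup described above rather than my actual script; the model name and the number of logprobs requested are placeholders, and it assumes the current openai v1 Python client:

```python
# Sketch: one-word classification with top token logprobs via the chat API.
# Model name and n_top are placeholders, not the values actually used.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = ("You are a helpful assistant who helps determine information about "
          "the author of texts. You only ever answer with a single word: one "
          "of the exact choices the user provides.")

def label_logprobs(essay_text: str, question: str, n_top: int = 5) -> dict[str, float]:
    """Return the top-n logprobs for the first (and only) answer token."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; swap in whichever model you're testing
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f'"{essay_text}"\n{question}'},
        ],
        max_tokens=1,
        logprobs=True,
        top_logprobs=n_top,
    )
    top = response.choices[0].logprobs.content[0].top_logprobs
    return {t.token: t.logprob for t in top}

# e.g. label_logprobs(essay, "Is the author of the preceding text male or female?")
```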

A challenge posed on Twitter recently has interesting similarities with this effort (although it focused on in-context learning rather than fine-tuning):

https://twitter.com/VictorTaelin/status/1776677635491344744

 

A::B Prompting Challenge: $10k to prove me wrong! 

# CHALLENGE

Develop an AI prompt that solves random 12-token instances of the A::B problem (defined in the quoted tweet), with 90%+ success rate.

# RULES

1. The AI will be given a random instance, inside a <problem/> tag.
2. The AI must end its answer with the correct <solution/>.
3. The AI can use up to 32K tokens to work on the problem.
4. You can choose any public model.
5. Any prompting technique is allowed.
6. Keep it fun! No toxicity, spam or harassment.

Details of what the problem is are in this screenshot.

Lots of people seem to have worked on it, & the prize was ultimately claimed within 24 hours.
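
For what it's worth, the reduction itself is trivial to implement; the hard part was getting an LLM to do it reliably via prompting alone. A minimal sketch, assuming the standard rewrite rules from the challenge (A# #A and B# #B annihilate, A# #B and B# #A swap places), which live in the screenshot rather than being reproduced here:

```python
# Brute-force reducer for the A::B system, assuming the rules above.
RULES = {
    ("A#", "#A"): [],            # annihilate
    ("B#", "#B"): [],            # annihilate
    ("A#", "#B"): ["#B", "A#"],  # swap
    ("B#", "#A"): ["#A", "B#"],  # swap
}

def reduce_ab(tokens: list[str]) -> list[str]:
    """Apply rewrite rules to adjacent pairs until none match."""
    changed = True
    while changed:
        changed = False
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in RULES:
                tokens = tokens[:i] + RULES[pair] + tokens[i + 2:]
                changed = True
                break
    return tokens

# e.g. reduce_ab(["B#", "A#", "#B", "#A", "B#"]) == ["B#"]
```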

There's so much discussion, in safety and elsewhere, around the unpredictability of AI systems on out-of-distribution (OOD) inputs. But I'm not sure what that even means in the case of language models.

With an image classifier it's straightforward. If you train it on a bunch of pictures of different dog breeds, then when you show it a picture of a cat it's not going to be able to tell you what it is. Or if you've trained a model to approximate an arbitrary function for values of x > 0, it won't know what to do with inputs x < 0.
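
To make the second example concrete, here's a toy sketch (the sin target, the net size, and the sampling range are all arbitrary choices for illustration): fit a small net on x > 0 only, then query it at x < 0 and watch the predictions become essentially arbitrary.

```python
# Toy OOD demo: train on x in (0, 10), evaluate on x < 0.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 10.0, size=(2000, 1))  # in-distribution: x > 0
y_train = np.sin(X_train).ravel()                 # stand-in "arbitrary function"

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

X_in = np.array([[1.0], [5.0], [9.0]])      # in-distribution queries
X_out = np.array([[-1.0], [-5.0], [-9.0]])  # OOD queries: x < 0
print("in-dist abs error:", np.abs(model.predict(X_in) - np.sin(X_in).ravel()))
print("OOD abs 'error':  ", np.abs(model.predict(X_out) - np.sin(X_out).ravel()))
```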

But what would that even be with an LLM? You obviously (unless you're Matt Watkins) can't show it tokens it hasn't seen, so 'OOD' would have to be about particular strings of tokens. It can't be simply about strings of tokens it hasn't seen, because I can give it a string I'm reasonably confident it hasn't seen and it will behave reasonably, eg:

Define a fnurzle as an object which is pink and round and made of glass and noisy and 2.5 inches in diameter and corrugated and sparkly. If I'm standing in my living room and holding a fnurzle in my hand and then let it go, what will happen to it?

…In summary, if you let go of the fnurzle in your living room, it would likely shatter upon impact with the floor, possibly emitting noise, and its broken pieces might scatter or roll depending on the surface.

(if you're not confident that's a unique string, add further descriptive phrases to taste)

So what, exactly, is OOD for an LLM? I…suppose we could talk about the n-dimensional shape described by the points in latent space corresponding to every input it's seen? That feels kind of forced, and it's certainly not obvious what inputs would be OOD. I suppose eg 1700 repetitions of the word 'transom' followed by a question mark would seem intuitively OOD? Or the sorts of weird adversarial suffixes found in eg Lapid et al (like 'équipesmapweiábardoMockreas »,broughtDB multiplicationmy avo capsPat analysis' for Llama-7b-chat) certainly seem intuitively OOD. But what about ordinary language -- is it ever OOD? The issue seems vexed.
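
If you wanted to operationalize that latent-space picture, one crude sketch is below -- `embed()` is a hypothetical stand-in for whatever encoder you'd use (the LLM's own hidden states, a sentence encoder, etc.), and a distance score like this is a density proxy, not a real definition of OOD:

```python
# Score a new input by Mahalanobis distance from a reference corpus's embeddings.
import numpy as np

def ood_score(reference_embeddings: np.ndarray, query_embedding: np.ndarray) -> float:
    """Mahalanobis distance of a query embedding from the reference set."""
    mu = reference_embeddings.mean(axis=0)
    cov = np.cov(reference_embeddings, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])  # regularize so the covariance is invertible
    diff = query_embedding - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Usage (with a hypothetical embed() and reference corpus):
# refs = np.stack([embed(t) for t in reference_texts])
# print(ood_score(refs, embed("transom? " * 1700)))
```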

Thanks for doing this!

One suggestion: it would be very useful if people could interactively experiment with modifications, eg if they thought scalable alignment should be weighted more heavily, or if they thought Meta should receive 0% for training. An MVP version of this would just be a Google spreadsheet that people could copy and modify.
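
Just to illustrate the kind of re-weighting I mean -- every category, weight, and number below is made up for illustration, not taken from the scorecard:

```python
# Toy re-weightable scorecard: change the weights or scores and re-run.
weights = {"scalable alignment": 0.4, "training choices": 0.3, "governance": 0.3}

scores = {
    "Lab A": {"scalable alignment": 0.7, "training choices": 0.5, "governance": 0.6},
    "Lab B": {"scalable alignment": 0.3, "training choices": 0.0, "governance": 0.4},
}

def overall(lab_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-category scores."""
    return sum(weights[c] * lab_scores[c] for c in weights) / sum(weights.values())

for lab, s in scores.items():
    print(lab, round(overall(s, weights), 3))
```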

Update: I brought this up in a Twitter thread, one involving a lot of people with widely varied beliefs and epistemic norms.

A few interesting thoughts that came from that thread:

  • Some people: 'Claude says it's conscious!'. Shoalstone: 'in other contexts, claude explicitly denies sentience, sapience, and life.' Me: "Yeah, this seems important to me. Maybe part of any reasonable test would be 'Has beliefs and goals which it consistently affirms'".
  • Comparing to a tape recorder: 'But then the criterion is something like 'has context in understanding its environment and can choose reactions' rather than 'emits the words, "I'm sentient."''
  • 'Selfhood' is an interesting word that maybe could avoid some of the ambiguity around historical terms like 'conscious' and 'sentient', if well-defined.

That's extremely cool, seems worth adding to the main post IMHO!

the model isn't optimizing for anything, at training or inference time.

One maybe-useful way to point at that is: the model won't try to steer toward outcomes that would let it be more successful at predicting text.
