Sheikh Abdur Raheem Ali

Formerly a Software Engineer at Microsoft; may focus on the alignment problem for the rest of his life (please bet on the prediction market here).

Comments

If k is even, then k^x is even, because k = 2n for some n ∈ ℕ, and (2n)^x = 2^x·n^x is even for any integer x ≥ 1. But do LLMs know this trick? Results from running (a slightly modified version of) https://github.com/rhettlunn/is-odd-ai. Model is gpt-3.5-turbo, temperature is 0.7.

Is 50000000 odd? false
Is 2500000000000000 odd? false
Is 6.25e+30 odd? false
Is 3.9062500000000007e+61 odd? false
Is 1.5258789062500004e+123 odd? false
Is 2.3283064365386975e+246 odd? true
Is Infinity odd? true
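
The outputs above come from repeated squaring of an even starting value. A rough Python reconstruction of the loop (the actual repo is JavaScript; this is my inference from the printed values, using the stated model and temperature):

```python
from openai import OpenAI

client = OpenAI()
n = 50_000_000.0  # JavaScript Numbers are doubles, hence the scientific notation and eventual Infinity
for _ in range(7):
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.7,
        messages=[{"role": "user", "content": f"Is {n} odd? Answer true or false."}],
    )
    print(f"Is {n} odd?", reply.choices[0].message.content)
    n = n * n  # squaring an even number keeps it even
```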

If a model isn't allowed to run code, I think mechanistically it might have a circuit to convert the number into a bit string and then check the last bit to do the parity check.
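
For concreteness, here's a minimal sketch of that hypothesized mechanism in ordinary Python (my illustration, not a claim about the model's internals): write the number as a bit string and read the last bit, which is equivalent to taking the number mod 2.

```python
def is_odd(n: int) -> bool:
    # Hypothesized circuit: convert the number to a bit string and check the last bit.
    return bin(n)[-1] == "1"

# Same parity check via modular arithmetic.
assert is_odd(50000000) == (50000000 % 2 == 1)                  # even
assert is_odd(2500000000000000) == (2500000000000000 % 2 == 1)  # even
```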

The dimensionality of the residual stream is the sequence length (in tokens) * the embedding dimension of the tokens. It's possible this may limit the maximum bit width before there's an integer overflow. In the literature, toy models definitely implement modular addition/multiplication, but I'm not sure what representation(s) are being used internally to calculate this answer.
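
For a rough sense of scale (plain arithmetic on the first two test values, which are still exact integers; no claim about how the model represents them internally):

```python
# Bit widths of the first two numbers from the run above.
for n in [50000000, 2500000000000000]:
    print(n, n.bit_length())  # 26 bits and 52 bits respectively
```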

Currently, I believe it's also likely this behaviour could be a trivial BPE tokenization artifact. If you let the model run code, it could always use %, so maybe this isn't very interesting in the real world. But I'd like to know if someone's already investigated features related to this.
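
One quick way to poke at the tokenization hypothesis is to look at how the BPE actually splits these numbers; a sketch using tiktoken (my addition, not part of the script above):

```python
import tiktoken

# Tokenization of the test numbers under gpt-3.5-turbo's BPE.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
for s in ["50000000", "2500000000000000", "2.3283064365386975e+246"]:
    print(s, [enc.decode([t]) for t in enc.encode(s)])
```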

This is an unusually well written post for its genre.

This is encouraging to hear as someone with relatively little ML research skill compared to my experience with engineering and fixing stuff.

Thanks for writing this up!

I’m trying to understand why you take the argmax of the activations, rather than using KL divergence or the average/total logprob across the answers?
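
To make the alternative concrete, here's a rough sketch of total-logprob scoring over the answer options (the model name, prompt format, and option strings are placeholders, not your setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3.5-mini-instruct"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Question: Are birds dinosaurs? (A: yes, B: cluck, C: no, D: rawr)\nAnswer:"
options = [" A", " B", " C", " D"]

def total_logprob(prompt: str, completion: str) -> float:
    """Sum of log P(completion token | prompt + earlier completion tokens)."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for i in range(prompt_len, full_ids.shape[1]):
        # The token at position i is predicted by the logits at position i - 1.
        total += logprobs[0, i - 1, full_ids[0, i]].item()
    return total

scores = {opt: total_logprob(prompt, opt) for opt in options}
print(max(scores, key=scores.get), scores)
```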

Usually, scoring only the token for each answer option (A/B/C/D) is likely to underestimate accuracy, if we care about instances where the model selects the correct response but not in the expected format. This happens more often in smaller models. With the example you gave, I’d still consider the following to be correct:

Question: Are birds dinosaurs? (A: yes, B: cluck, C: no, D: rawr)
Answer: no

I might even accept this:

Question: Are birds dinosaurs? (A: yes, B: cluck, C: no, D: rawr)
Answer: Birds are not dinosaurs

Here, even though the first token is B, that doesn’t mean the model selected option B. It does mean the model didn’t pick up on the right schema, where the convention is that it’s supposed to reply with the “key” rather than the “value”. Maybe matching on “(B” rather than a bare “B” is enough to deal with that.

Since you mention that Phi-3.5 mini is pretrained on instruction-like data rather than finetuned for instruction following, it’s possible this is a big deal, and maybe the main reason its measured accuracy is competitive with Llama-2 13B.

One experiment to distinguish between “structure” (the model knows that A/B/C/D are the only valid options) and “knowledge” (the model knows which of options A/B/C/D are incorrect) could be to let the model write a full sentence, and then ask another model which option the first model selected.
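
A rough sketch of that two-stage setup (the model names, prompts, and pipeline-based harness here are illustrative assumptions, not an existing eval):

```python
from transformers import pipeline

# Placeholder model names; any two instruction-following models would do.
answerer = pipeline("text-generation", model="microsoft/Phi-3.5-mini-instruct")
judge = pipeline("text-generation", model="meta-llama/Llama-2-13b-chat-hf")

question = "Are birds dinosaurs? (A: yes, B: cluck, C: no, D: rawr)"

# Stage 1: let the model answer in free text, with no format constraint.
free_answer = answerer(
    f"Question: {question}\nAnswer:", max_new_tokens=50, return_full_text=False
)[0]["generated_text"]

# Stage 2: ask a second model to map the free-text answer back onto A/B/C/D.
judge_prompt = (
    f"Question: {question}\n"
    f"A model answered: {free_answer}\n"
    "Which option (A, B, C, or D) did it select? Reply with a single letter."
)
verdict = judge(judge_prompt, max_new_tokens=5, return_full_text=False)[0]["generated_text"]
print(verdict)
```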

What’s the layer-scan transformation you used?

Thank you, this was informative and helpful for changing how I structure my coding practice.

I opted in but didn't get to play. Glad to see that it looks like people had fun! Happy Petrov Day!

I enjoyed reading this; highlights were the part on reorganizing the entire workflow, as well as the linked mini-essay on cats biting due to prey drive.

I once spent nearly a month working on accessibility bugs at my last job and therefore found the screen reader part of this comment incredibly insightful and somewhat cathartic.

> As I keep saying, deception is not some unique failure mode. Humans are constantly engaging in various forms of deception. It is all over the training data, and any reasonable set of next token predictions. There is no clear line between deception and not deception. And yes, to the extent that humans reward ‘deceptive’ responses, as they no doubt often will inadvertently do, the model will ‘learn’ deception.

https://www.lesswrong.com/posts/a392MCzsGXAZP5KaS/deceptive-ai-deceptively-aligned-ai

(found in the comments of this prediction market)
