Sheikh Abdur Raheem Ali

Formerly a Software Engineer at Microsoft; may focus on the alignment problem for the rest of his life (please bet on the prediction market here).

Comments

If k is even, then k^x is even, because k = 2n for some n ∈ ℕ, and (2n)^x = 2^x·n^x is even for any integer x ≥ 1. But do LLMs know this trick? Results from running (a slightly modified version of) https://github.com/rhettlunn/is-odd-ai. Model is gpt-3.5-turbo, temperature is 0.7.

Is 50000000 odd? false
Is 2500000000000000 odd? false
Is 6.25e+30 odd? false
Is 3.9062500000000007e+61 odd? false
Is 1.5258789062500004e+123 odd? false
Is 2.3283064365386975e+246 odd? true
Is Infinity odd? true
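
The outputs above come from repeated squaring of an even starting value. A rough Python reconstruction of the loop (the actual repo is JavaScript; this is my inference from the printed values, using the stated model and temperature):

```python
from openai import OpenAI

client = OpenAI()
n = 50_000_000.0  # JavaScript Numbers are doubles, hence the scientific notation and eventual Infinity
for _ in range(7):
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.7,
        messages=[{"role": "user", "content": f"Is {n} odd? Answer true or false."}],
    )
    print(f"Is {n} odd?", reply.choices[0].message.content)
    n = n * n  # squaring an even number keeps it even
```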

If a model isn't allowed to run code, I think mechanistically it might have a circuit to convert the number into a bit string and then check the last bit to do the parity check.
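
For concreteness, here's a minimal sketch of that hypothesized mechanism in ordinary Python (my illustration, not a claim about the model's internals): write the number as a bit string and read the last bit, which is equivalent to taking the number mod 2.

```python
def is_odd(n: int) -> bool:
    # Hypothesized circuit: convert the number to a bit string and check the last bit.
    return bin(n)[-1] == "1"

# Same parity check via modular arithmetic.
assert is_odd(50000000) == (50000000 % 2 == 1)                  # even
assert is_odd(2500000000000000) == (2500000000000000 % 2 == 1)  # even
```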

The dimensionality of the residual stream is the sequence length (in tokens) * the embedding dimension of the tokens. It's possible this may limit the maximum bit width before there's an integer overflow. In the literature, toy models definitely implement modular addition/multiplication, but I'm not sure what representation(s) are being used internally to calculate this answer.
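
For a rough sense of scale (plain arithmetic on the first two test values, which are still exact integers; no claim about how the model represents them internally):

```python
# Bit widths of the first two numbers from the run above.
for n in [50000000, 2500000000000000]:
    print(n, n.bit_length())  # 26 bits and 52 bits respectively
```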

Currently, I believe it's also likely this behaviour could be a trivial BPE tokenization artifact. If you let the model run code, it could always use %, so maybe this isn't very interesting in the real world. But I'd like to know if someone's already investigated features related to this.
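
One quick way to poke at the tokenization hypothesis is to look at how the BPE actually splits these numbers; a sketch using tiktoken (my addition, not part of the script above):

```python
import tiktoken

# Tokenization of the test numbers under gpt-3.5-turbo's BPE.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
for s in ["50000000", "2500000000000000", "2.3283064365386975e+246"]:
    print(s, [enc.decode([t]) for t in enc.encode(s)])
```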

This is an unusually well written post for its genre.

This is encouraging to hear as someone with relatively little ML research skill compared to my experience with engineering and fixing stuff.

Thanks for writing this up!

I’m trying to understand why you take the argmax of the activations, rather than using KL divergence or the average/total logprob across the answers?
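
To make the alternative concrete, here's a rough sketch of total-logprob scoring over the answer options (the model name, prompt format, and option strings are placeholders, not your setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3.5-mini-instruct"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Question: Are birds dinosaurs? (A: yes, B: cluck, C: no, D: rawr)\nAnswer:"
options = [" A", " B", " C", " D"]

def total_logprob(prompt: str, completion: str) -> float:
    """Sum of log P(completion token | prompt + earlier completion tokens)."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for i in range(prompt_len, full_ids.shape[1]):
        # The token at position i is predicted by the logits at position i - 1.
        total += logprobs[0, i - 1, full_ids[0, i]].item()
    return total

scores = {opt: total_logprob(prompt, opt) for opt in options}
print(max(scores, key=scores.get), scores)
```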

Usually, scoring only the token for each answer option (A/B/C/D) is likely to underestimate accuracy, if we care about instances where the model selects the correct response but not in the expected format. This happens more often in smaller models. With the example you gave, I’d still consider the following to be correct:

Question: Are birds dinosaurs? (A: yes, B: cluck, C: no, D: rawr)
Answer: no

I might even accept this:

Question: Are birds dinosaurs? (A: yes, B: cluck, C: no, D: rawr)
Answer: Birds are not dinosaurs

Here, even though the first token is B, that doesn’t mean the model selected option B. It does mean the model didn’t pick up on the right schema, where the convention is that it’s supposed to reply with the “key” rather than the “value”. Maybe matching on “(B” rather than a bare “B” is enough to deal with that.

Since you mention that Phi-3.5 mini is pretrained on instruction-like data rather than finetuned for instruction following, it’s possible this is a big deal, and maybe the main reason its measured accuracy is competitive with Llama-2 13B.

One experiment to distinguish between “structure” (the model knows that A/B/C/D are the only valid options) and “knowledge” (the model knows which of options A/B/C/D are incorrect) could be to let the model write a full sentence, and then ask another model which option the first model selected.
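
A rough sketch of that two-stage setup (the model names, prompts, and pipeline-based harness here are illustrative assumptions, not an existing eval):

```python
from transformers import pipeline

# Placeholder model names; any two instruction-following models would do.
answerer = pipeline("text-generation", model="microsoft/Phi-3.5-mini-instruct")
judge = pipeline("text-generation", model="meta-llama/Llama-2-13b-chat-hf")

question = "Are birds dinosaurs? (A: yes, B: cluck, C: no, D: rawr)"

# Stage 1: let the model answer in free text, with no format constraint.
free_answer = answerer(
    f"Question: {question}\nAnswer:", max_new_tokens=50, return_full_text=False
)[0]["generated_text"]

# Stage 2: ask a second model to map the free-text answer back onto A/B/C/D.
judge_prompt = (
    f"Question: {question}\n"
    f"A model answered: {free_answer}\n"
    "Which option (A, B, C, or D) did it select? Reply with a single letter."
)
verdict = judge(judge_prompt, max_new_tokens=5, return_full_text=False)[0]["generated_text"]
print(verdict)
```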

What’s the layer-scan transformation you used?

Thank you, this was informative and helpful for changing how I structure my coding practice.

I opted in but didn't get to play. Glad to see that it looks like people had fun! Happy Petrov Day!

I enjoyed reading this; highlights were the part on reorganizing the entire workflow, as well as the linked mini-essay on cats biting due to prey drive.

I once spent nearly a month working on accessibility bugs at my last job and therefore found the screen reader part of this comment incredibly insightful and somewhat cathartic.

> As I keep saying, deception is not some unique failure mode. Humans are constantly engaging in various forms of deception. It is all over the training data, and any reasonable set of next token predictions. There is no clear line between deception and not deception. And yes, to the extent that humans reward ‘deceptive’ responses, as they no doubt often will inadvertently do, the model will ‘learn’ deception.

https://www.lesswrong.com/posts/a392MCzsGXAZP5KaS/deceptive-ai-deceptively-aligned-ai

(found in the comments of this prediction market)
