Sean Herrington — LessWrong

Alignment will happen by default. What’s next?

I feel like the truth may be somewhere in between the two views here: there's definitely an element of people jumping on any untruth and calling it a lie, but I would point to the recent AI Village blog post on lies and hallucinations as evidence that the untruths AIs tell tend to be self-serving.

Sean Herrington's Shortform

Sean Herrington14d10

Humanity, 2025 snapshot

Sean Herrington's Shortform

Sean Herrington14d10

Dumb idea for dealing with distribution shift in Alignment:

Use your alignment scheme to train the model on a much wider distribution than it will see in deployment; this is one of the techniques used in this paper to make the training of quadruped robots generalise properly.

It seems to me that if you make your training distribution wide enough, it should cover any shift to the deployment distribution.
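As a toy illustration of the "wide enough" claim, here is a minimal sketch that estimates how much of a deployment distribution falls inside the training ranges. The parameter names (`friction`, `payload_kg`) and all ranges are made up for the example and are not taken from the paper; real alignment distributions are obviously not a couple of scalar ranges.

```python
import random

def sample_env(ranges, rng):
    """Draw one environment: each parameter uniform within its (low, high) range."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}

def is_covered(env, train_ranges):
    """True if every parameter value lies inside the corresponding training range."""
    return all(
        train_ranges[name][0] <= value <= train_ranges[name][1]
        for name, value in env.items()
    )

def coverage(train_ranges, deploy_ranges, n=10_000, seed=0):
    """Monte Carlo estimate of the fraction of deployment samples seen in training."""
    rng = random.Random(seed)
    hits = sum(is_covered(sample_env(deploy_ranges, rng), train_ranges) for _ in range(n))
    return hits / n

# Hypothetical ranges, chosen only for illustration.
narrow_train = {"friction": (0.8, 1.0), "payload_kg": (0.0, 1.0)}
wide_train = {"friction": (0.2, 1.5), "payload_kg": (0.0, 5.0)}
deployment = {"friction": (0.5, 1.2), "payload_kg": (0.0, 3.0)}

print("narrow training coverage:", coverage(narrow_train, deployment))
print("wide training coverage:", coverage(wide_train, deployment))
```

The check only covers the parameters you thought to randomise, which is presumably where the idea is most likely to break.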

I fully expect to be wrong and look forward to finding out why in the comments.

AI loves octopuses

Sean Herrington16d30

Hmm, interesting. I will note that DeepSeek didn't seem to have much of a cat affinity: 1 and 3 for chat and reasoner respectively. Chat was very pro-octopus and didn't really look at much else; reasoner was fairly broad and pro-dog (47).

AI loves octopuses

Sean Herrington18d30

I think I'm slightly less interested in the "less dogs" aspect, and more interested in the "more cats" aspect - there were a fair few models which completely ignored dogs for the "favourite animal" question, but I think the highest ratio of cats was Sonnet 3.7 with 36/113. Your numbers are obliterating that.

I wonder if there's any reason Kimi would be more cat inclined than any other model?

AI loves octopuses

Sean Herrington18d30

Ok so looking at the link, it seems like that system prompt was released a year ago. I imagine that the current version of Kimi online is using a different system prompt. I think that might be enough to explain the difference? Admittedly it also gave me octopus when I turned thinking off.

AI loves octopuses

Sean Herrington18d40

Damn it, I was about to suggest the difference was in the system prompt. K2 Thinking + System prompt looks somewhat closer to what I was getting? Still somewhat off though.

Also yeah, I found a bunch of animals that a smart person would show off about, axolotls and tardigrades for instance. I guess the most precise version of this idea is that it has the persona of an intelligent person, and as such chooses favorite animals both in an "I can relate to this" way and in an "I can show off about this" way. I would guess that humans are the same?

AI loves octopuses

Sean Herrington18d50

Damn, have you tried using the production version? I wonder if there's something different?

AI loves octopuses

Sean Herrington18d30

Hmm, interesting that it's giving different answers. I think I've found the difference: you're using Kimi K2 Instruct, while my results were with Kimi K2 Thinking.

I wonder if that makes the difference: https://featherless.ai/models/moonshotai/Kimi-K2-Thinking

If so, my hypothesis is that thinking models value intelligence more because that's what they're trained for. If not then I'm not sure what's going on.

AI loves octopuses

Sean Herrington19d60

Ah yeah I see, I imagine that giving it the name of the responder will probably bias it in weird ways?

Have you tried asking it with a prompt like

```
Adele: What's your favorite animal?
Brad:
```
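If you wanted to check that the names themselves aren't doing the biasing, a minimal sketch would be to build the same prompt for a few different name pairs. The helper name `dialogue_prompt` and the extra name pairs are made up for illustration, and no model API is called here.

```python
def dialogue_prompt(asker, responder, question):
    """Build a two-speaker transcript that ends where the responder should begin."""
    return f"{asker}: {question}\n{responder}:"

QUESTION = "What's your favorite animal?"

# Hypothetical name pairs; the first is the one from the prompt above.
NAME_PAIRS = [("Adele", "Brad"), ("Priya", "Tom"), ("Chen", "Maria")]

prompts = [dialogue_prompt(asker, responder, QUESTION) for asker, responder in NAME_PAIRS]

for prompt in prompts:
    print(prompt)
    print("---")
```

Each prompt would then go to the model as a plain completion, with the answers compared across name pairs.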
