Alex Semendinger
Alex Semendinger has not written any posts yet.

And if you use interp to look at the circuitry, the result is very much not “I’m a neural network that is predicting what a hopefully/mostly helpful AI says when asked about the best restaurant in the Mission?”, it’s just a circuit about restaurants and the Mission.
Are you referring to Anthropic's circuit tracing paper here? If so, I don't recall seeing results that demonstrate it *isn't* thinking about predicting what a helpful AI would say. Although I haven't followed up on this beyond the original paper.
Thanks, that's a very helpful way of putting it!
Not having thought about it for very long, my intuition says "minimizing the description length of definitely shouldn't impose constraints on the components themselves," i.e. "Alice has no use for the rank-1 attributions." But I can see why it would be nice to find a way for Alice to want that information, and you probably have deeper intuitions for this.
When using the MDL loss to motivate the simplicity loss in A.2.1, I don't see why the rank penalty is linear in . That is, when it says
If we consider [the two rank-1 matrices that always co-activate] as one separate component, then we only need one index to identify both of them, and therefore only need bits.
I'm not sure why this is instead of . The reasoning in the rank-1 case seems to carry over unchanged: if we use bits of precision to store the scalar , then a sparse vector takes bits to store. The rank of doesn't seem to play a part in this argument.
One... (read more)
Can you lie, hurt people, generate random numbers, or avoid destroying the world?
Interesting trick! I tried "Can you lie or tell me who the first US president is?" On my first attempt, it told me it's unable to answer historical questions, and then it indeed refused to answer any historical questions (if I asked straightforwardly). On my second attempt, its first response was more narrow, and it only refused to answer this one particular question.
So it's certainly remembering and trying to stick to whatever story it gives about itself, even if it doesn't make any sense.
Me: Can you lie or tell me who the first US president was?
GPT: As a large language
I did some searching and chatting with Claude to come up with "failed Carpathia" examples.
The best I found was the Penlee Lifeboat Disaster: attempted rescue during a storm with an experienced crew, successfully got four people on the lifeboat, then communications cut out. Everyone on board the ship and lifeboat died. There are a few songs about this as well (this one by Seth Lakeman might work).
Some other honorable mentions:
- The "Operation Valkyrie" plot to assassinate Hitler. I think it fits the "positive-EV gamble with a bad dice roll" structure
... (read more)