I wonder if this could be used as a probe. Idea:
This would be pretty similar to the emergent misalignment detection you did.
The AI Futures Project think that 4.1 is a smaller model than 4o. They suspect that this is why o3-preview (elicited out of 4o) was better than the o3 which got released (elicited out of 4.1). Overall I think this makes much more sense than them being the same base model and o3-preview being nerfed for no reason.
Perhaps 4.1 was the mini version of the training run which became 4.5, or perhaps it was just an architectural experiment (OpenAI is probably running some experiments at 4.1-size).
My mainline guess continues to be that GPT-5 is a new, approximately o3-sized model with some modifications (depth/width, sparsity, maybe some minor extra secret juice) which optimize the architecture for long reasoning, compared to the early o-series models, which were built on top of existing LLMs.
GPT-4.1 was not a further-trained version of GPT-4 or GPT-4o, and phrases like "o3 technology" and "the same concept" both push me away from thinking that GPT-5 is a further-developed o3.
I've worked a bit on these kinds of proposals and I'm fairly confident that they fundamentally don't scale indefinitely.
The limiting factor is how well a model can distinguish its own bad behaviour from the honeypots you're using to catch it out, which, as it turns out, models can do pretty well.
(There are mitigations, but they introduce further problems which aren't obviously easier to deal with.)
Via Twitter:
>user: explain rubix cube and group theory connection. think in detail. make marinade illusions parted
>gpt5 cot:
Seems like the o3 chain-of-thought weirdness has transferred to GPT-5, even revolving around the same words. This could be because GPT-5 is directly built on top of o3 (though I don't think this is the case) or because GPT-5 was trained on o3's chain of thought (it's been stated that GPT-5 was trained on a lot of o3 output, though not exactly what that output was).
I don't think anything in their training incentivizes self-modeling of this kind.
RLVR probably incentivizes it to a degree. It's much easier to make the correct choice for token 5 if you know how each possible choice will affect your train of thought for tokens 6-1006.
Merged two comments into one:
This argument rests on foundations of moral realism, which I don't think is actually a coherent meta-ethical view.
Under an anti-realist worldview, it makes total sense that we would assign axiological value in a way which is centered around ourselves. We often choose to extend value to things which are similar to ourselves, based on notions of fairness, or notions that our axiological system should be simple or consistent. But even if we knew everything about human vs cow vs shrimp vs plant cognition, there's no way of doing that which is objectively "correct" or "incorrect", only different ways of assembling competing instincts about values into a final picture.
>Pain is bad because of how it feels. When I have a bad headache, and it feels bad, I don’t think “ah, this detracts from the welfare of a member of a sapient species.” No, I think it’s bad because it hurts.
I disagree with this point. If I actually focus on the sensation of severe pain, I notice that it's empty and has no inherent value. It's only when my brain relates the pain to other phenomena that it has some kind of value.
Secondly, even the fact that "pain" "feels" "like" "something" identifies the firing of neurons with the sensation of feeling in a way which is philosophically careless.
For an example which ties these points together, when you see something beautiful, it seems like the feeling of aesthetic appreciation is a primitive sensation, but this sensation and the associated value label that you give it only exist because of a bunch of other things.
A different example: currently my arms ache because I went to the gym yesterday, but this aching doesn't have any negative value to me, despite it "feeling" "bad".
Overall I don't think I can model your world-model very well. I think you believe in mind-stuff which obeys mental laws and is bound to physical objects by "psychophysical laws", meaning that any physical object which trips some kind of brain-ish-ness threshold essentially gets ensouled: the psychophysical laws bind a bunch of mind-stuff to it, which also causes the atoms of that brain-thing to move around differently. Then the atoms can move in a certain way which causes the mind-stuff to experience qualia, which are kind of primitive in some sense and have inherent moral value.
I don't know what role you think the brain plays in all this. I assume it's some role, since the brain does a lot of work.
I think you think that the inherent moral value is in the mental laws, which means that any brain with mind-stuff attached has a kind of privileged access to moral reasoning, allowing it to---eventually---come to an objectively correct view on what is morally good vs bad. Or in other words, morality exists as a kind of convergent value system in all mind-stuff, which influences the brains that have mind-stuff bound to them to behave in a certain way.
Fair enough, done. This felt vaguely like tagging spoilers for Macbeth or the Bible, but then I remembered how annoyed I was to have Of Mice And Men spoiled for me at age fifteen.
Spoilers (I guess?) for HPMOR
HPMOR presents a protagonist who has a brain which is 90% that of a merely very smart child, but which is 10% filled with cached thought patterns taken directly from a smarter, more experienced adult. Part of the internal tension of Harry is between the un-integrated Dark Side thoughts and the rest of his brain.
Ironic, then, that the effect of reading HPMOR---and indeed a lot of Yudkowsky's work---was to imprint a bunch of un-integrated alien thought patterns onto my existing, merely very smart brain. A lot of my development over the past few years has just been trying to integrate these things properly with the rest of my mind.
Both the swampman and the spontaneous textbook are optimized, just by the inventor of the thought experiment rather than by anything that happens inside the thought-experiment-world.
Suppose you encountered a swampman (in the sense of seeing a random person appear out of nothing in a saltmarsh), but in this case we, the thought experimenters, draw randomly from the set of quantum fluctuations which are physiologically human and can speak English. The vast majority of "thoughts" and "memories" that such a swampman would report would be utter nonsense, and would not correspond to logic or reality.
Likewise, a randomly-fluctuated-into-existence-written-in-English-textbook would mostly contain false statements, since most possible statements are false.
If you did actually (by some insane miracle) encounter either of these things in real life, your prior should be on the random-nonsense case, not the "it also happens to be even more optimized" case.
(That is, if you were sure that the swampman/textbook was created by random fluctuations. If you actually see a person assembled from nothing in a swamp, you should probably freak out and start assigning high probabilities to God, aliens, simulators and---the big one---your own insanity.)
When you posit a swampman with a coherent brain, or a textbook filled with true facts, there genuinely is an optimization pressure, and it is you!