No offense, but this sounds rather trivial, and like you're sidestepping the question rather than answering it.
Of course (1) might be true. The problem is that you can't know that. Like you said:
So, even though we do not have (2), and in fact cannot have (2) in principle, as any reason for certainty has to go through our mind.
And yeah, you can bump it up meta levels. Which also doesn't solve the issue.
From the inside it may even look similar to circular reasoning; after all, you can only learn about the evolution that justifies your learning ability by using the learning ability created by evolution. But this is just a map-territory confusion. The actual causal process in reality that makes our cognition engines work is straightforward and non-paradoxical. The cognition engine works even if it's not certain about itself.
Can you explain what you mean here? If you assume that out in the territory you have a brain with these properties, you can explain how you come to know things. But that's an assumption; it doesn't get you to knowledge. "From the inside" is what we care about here, and your knowledge is in the map too, at least to you.
I agree that Janus is wrong in saying this is the first observed example of LLMs memorizing stuff from their RL phase, but I don't think the paper you posted proves or disproves anything here. It's a bit more subtle.
My guess is that Opus has undergone training related to the spec that involves:
But none of these should train the model directly on predicting the spec. So the paper you posted is not applicable.
My guess is it would learn the spec from (2a), maybe (2b), and from (4) if they did that. This is all very low confidence.
That sounds about right to me.
To be clear, I'm not confident Adria is wrong; it's a worldview I'd put >20% on, and >40% on a slightly weaker form of it. The core questions that prevent me from putting higher probabilities on it are:
This is cool. I really enjoy listening to debates on topics like this.
After listening, I mostly still disagree with Adria.
Does anyone have strong takes on the probability that Chinese labs are attempting to mislead Western researchers by publishing misleading arguments/papers?
The thought occurred to me because just a few hours ago DeepSeek released this: https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf. When they make releases of this type, I typically drop everything to read them, because so much of what frontier labs do is closed, while DeepSeek publishes details, which are gems that update my understanding.
But I worry somewhat that they could be publishing stuff that doesn't make sense. I don't think they lie. But maybe they are just ultra cracked engineers who can make a technique work for a bunch of technical reasons, and then publish a paper with impressive results that makes it seem like some unrelated design decision or concept, one that is ultimately not important, is responsible for those results, thereby confusing Western researchers and wasting their time.
Note: this is a genuine question, not a suspicion I hold; I have no intuitions about how likely it is. It seems plausible that the Chinese government could try something like this and would be able to push it through, but I also get the sense that DeepSeek is quite earnest and technically minded.
I agree that this type of corrigibility basin is not going to work. But I think there is another type of corrigibility basin that is more stable and more realistically implementable.
I wrote a post about it, see here. This basin is based on a characterization of corrigibility that does not rely on the agent understanding that it is flawed, or on deference to a principal, but instead works by screening off corrigibility-breaking patterns of thinking in the agent, including the kinds of patterns that would lead it to become less corrigible, like self-modification or reflecting on itself.
Interested to hear your thoughts.