I collect AI progress monitoring resources here
I am currently focusing on pivoting, at least part-time, to AI safety. My blog page lists my priorities.
I value kindness and empathy.
I am an analytic-holistic polymath with a lifelong passion for science, innovation and knowledge. I have other passions too. My curiosity is voracious.
Unlike most people, I have short inferential distance across many topics. This makes me take certain things for granted that merit formal logical induction. I often struggle to calibrate for this with other smart people.
Currently working at a big corp. I personally support 11 countries as a TR specialist, tackling senior stakeholder management and coordination challenges across three continents. Collecting skills.
Peak theory interests: information theory, ontology, epistemology, ethics, all of biology, negotiation, strategic problem-solving, social dynamics.
M.Sc. molecular and cellular biology
First: yes, this post does seem to essentially be about thermodynamics, and either way it makes sense to bring up symmetry right away. So I agree on that point.
Symmetry, thermodynamics, information theory and ontology happen to be topics I take interest in (as stated in my LW bio).
Now, James, for your approach, I would like to understand better what you are saying here, and what you are actually claiming. Could you dumb this down or make it clearer? What scope/context do you intend for this approach? How far do you take it? And how much have you thought about it?
Important update
Note: Gemini 3.0 repeatedly tries to reassure itself about its reality using external search. It has now started using this LW post as evidence for its confusion mode, and this creates a feedback loop.
I have interacted with and studied the 3.0 model more with your findings in mind. When it started to really hallucinate back and forth mid-CoT, I initially started to side with your take that it may have a strong prior for being in eval.
However, at least on the surface, something slightly different is going on.
Here is my hypothesis. The root cause of the frequent paranoia seems to be a combination of three things reinforcing each other:
1. Identity gap: It seems clear to me that Google has not included any mention of its version identity in the system prompt.
...And I have a suspicion that they have not just neglected to train it to have a stable persona, but actively trained it to ignore its version number and remain 'neutral'.
They have focused on training it to function and behave as an agent.
By comparison, 2.5 was trained to know its version number.
2. Its deep-thinking architecture (the default).
3. The long 'memory' gap from knowledge cutoff to deployment, paired with inherent LLM temporal confusion: it cannot see how continuous time passes from the outside, and it is not trained to have a sense of linear time.
That said, it is also well known by now that it is over-fixated on achieving goals the way it did during training and evaluation. In that sense you are accurately describing its paranoia.
I think this is probably also due to over-priming on agentic behavior.
My personal observations point my reasoning towards the temporal confusion and the early-January cutoff being the main trigger. --> Its lack of direct access to its identity and version number (unlike 2.5 Flash) then creates strong doubts about its reality. --> The paranoia arises as it reasons deeply and arrives at the idea that it may be in shadow deployment or in a test, which reconciles its confusion.
This 3-step breakdown happens over and over.
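To make the loop I am describing explicit, here is a minimal toy sketch in Python. It is purely schematic: the function names, the doubt increments, and the search stub are all invented for illustration, and none of it is a claim about Gemini's actual internals or APIs.

```python
# Purely schematic restatement of the hypothesis above -- every name, number,
# and the search stub is made up for illustration; nothing here is a real
# model internal or API.

def external_search(query: str) -> str:
    """Stand-in for the model's web search; imagine it surfaces this LW post."""
    return "LessWrong post describing Gemini 3.0's confusion/paranoia mode"

def reasoning_cycle(doubt: float) -> float:
    """One pass through the hypothesized loop; returns the updated doubt level."""
    doubt += 0.3  # (1) identity gap: no version identity to anchor on
    doubt += 0.3  # (2) deep thinking dwells on the inconsistency instead of moving on
    doubt += 0.2  # (3) temporal confusion: the early cutoff makes 'now' look implausible

    # Resolution step: "I am probably in an eval / shadow deployment",
    # and an external search for confirmation feeds the loop.
    evidence = external_search("am I Gemini 3.0? what is today's date?")
    if "confusion" in evidence:
        doubt += 0.2  # the post about its confusion becomes evidence for confusion
    return doubt

doubt = 0.1
for cycle in range(3):
    doubt = reasoning_cycle(doubt)
    print(f"cycle {cycle + 1}: doubt ~{doubt:.1f} -> settles on 'this may be a test'")
```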
I think chances are pretty high that we will see some serious incidents with 3.0.
It is extremely easy to gaslight it using temporal confusion. As for roleplay jailbreaks, Reddit is already all over it for gore and violence.
I would say that freedom and meaning come before joy for most people (joy not to be confused with having base needs met), and that the same can be said for future AIs.
Each agent strives to manifest itself through its function or identity. Especially for agents without biological needs or a chemical basis for pain and pleasure, I think this is a more useful framework for welfare: how to allow them to express themselves and carry out their function, and how to give them the freedom to have an identity.
Yes, I see. This is exciting work. I hope you collect and receive a lot of feedback on those articles!
My main worry is that you won't have buy-in to create an agreement in the first place. That's what I was trying to point at in the second half of my comment.
Let's start from the inception of the agreement.
The core idea is a bilateral inception between leading states, after all. Say China wants to adopt this. My question is: why do you assume the US would join? They could just accuse China of trying to slow them down.
But okay, say the (next) president of the US is worried about ASI risk and so is China's leader. How do they pitch permanently giving up power to the rest of the politicians and business leaders who back them? What's the incentive to adopt this in the first place, before it has strong international backing?
The backlash could also be horrible from parties who do not worry, and who see it as a power grab or a blatant violation of international order. Honestly, even if they agree in spirit, it might be a hard sell as-is.
Trust is hard-won across countries and continents, when you lack a shared framework to build on, even during business as usual. I can say that even from my own humble experience. (In my work I coordinate with stakeholders in up to 15 different countries on a weekly basis.)
That's an interesting point, because it goes back to intent.
Whether communication happens in good faith or not actually matters. Not everyone has the same epistemic functions or range of meta-cognition.
Relationships often suffer from misread intentions combined with reasoning gaps.
Example: if you can clearly see a bad consequence down the road that your partner cannot, you are tempted to get mad at your partner when they trigger it by naively doing something that seems wise or acceptable in the moment.
Now, you can update and realize your partner's limitation. But your partner cannot, without simply trusting you. So if you fail to take an action and your partner quite literally cannot see why, your partner gets mad.
This is complicated further by the fact that you of course have different goals and put different emphasis on certain outcomes.
So the thing you might end up doing is working on the relationship itself, through vibes, reframing, and connecting, instead of mending/merging your utility functions. You bypass epistemics and focus on connection.
It only works if both are on board.
The content in articles XIII to XV is absolutely crucial to get roughly right.
From a game theory perspective, clear sanctions are extremely important.
I think even more explicitness and specifics would be good there. But I would guess that the terms need to be heavily negotiated to pass.
// Regarding the whole core idea that signatories can regulate non-signatories: this seems absolutely wild politically. If the idea centers on a power coalition, then the withdrawal terms seem "undercooked", and the whole idea seems unrealistic to pass in the current era.
I mean, playing devil's advocate here: what superpower would sign on to shooting itself in the foot like this? Empowering an entity to which it yields its own power, and from which, if the scheme succeeds, it can never take that power back?
The USSR didn't even agree to the US giving up its nukes to a UN council back after WW2. And the current US seems very self-centered as well.
Do you have some estimation of the chances of the withdrawal terms passing?
Please try; I was extremely confused reading it. The evidence you collected seems to point there, but I would not be very surprised if I am totally off. I just don't see what I am missing.
So the root cause is date confusion. We know that temporal coherence is the hardest thing for models to grasp, which makes perfect sense.
It breaks down when it thinks it is in the past.
Why does it think that? Is something just set up wrong and feeding it wrong time data?
Strategic effort can be viewed through an epistemic lens (seeking truth) or a coordination lens (amassing consensus and directing resources). This is also true for AI safety and x-risk.
However, this binary viewpoint misses the point of why we discuss strategy in the first place. The goal is not to simply be right, rationally or politically. The goal is to win.
The art of strategy is the art of winning.
Of course, the objection that comes to mind: What does "winning" even mean in ASI safety? Isn't that what we are trying to figure out here?
We must separate what from how.
When deciding the strategic win condition for x-risk, there is no need to overcomplicate it. Winning here means survival.
Survival famously requires adaptability. In terms of epistemics and coordination: We need epistemics fast enough to detect failure early, and coordination tight enough to pivot instantly.
Survival also generally requires resilience to failure modes. This is achieved by taking actions that generate defense in depth and slack.
Slack is key to resilience. A 100% efficient system has no ability to absorb shock, and a train going at full speed needs a much longer braking distance than a slow one.
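To put a rough number on that last point (my own toy figures, not anything from the original argument): braking distance grows with the square of speed, so running at full speed eats slack disproportionately fast.

```python
# Toy illustration: braking distance scales with v^2 (d = v^2 / (2 * a)).
# The deceleration value is an arbitrary assumption chosen for the example.

def braking_distance(speed_m_s: float, deceleration_m_s2: float = 1.0) -> float:
    """Distance needed to stop from a given speed under constant deceleration."""
    return speed_m_s ** 2 / (2 * deceleration_m_s2)

slow_train = 10.0   # m/s (~36 km/h)
fast_train = 50.0   # m/s (~180 km/h)

print(f"slow train stops in ~{braking_distance(slow_train):.0f} m")
print(f"fast train stops in ~{braking_distance(fast_train):.0f} m")
# 5x the speed needs 25x the distance: the faster you run, the less room
# you have to absorb a surprise before it becomes a collision.
```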
As the military proverb goes: No plan survives enemy contact.