I collect AI progress monitoring resources here
I am currently focusing on pivoting, at least part-time, to AI safety. My blog page lists my priorities.
I value kindness and empathy.
I am an analytic-holistic polymath with a lifelong passion for science, innovation and knowledge. I have other passions too. My curiosity is voracious.
Unlike most people, I have short inferential distance across many topics. This makes me take certain things for granted that merit formal logical induction. I often struggle to calibrate for this with other smart people.
Currently working at a big corp. I personally support 11 countries as a TR specialist, tackling senior stakeholder management and coordination challenges across three continents. Collecting skills.
Peak theory interests: information theory, ontology, epistemology, ethics, all of biology, negotiation, strategic problem-solving, social dynamics.
M.Sc. molecular and cellular biology
First: yes, this post does seem to essentially be about thermodynamics, and either way it makes sense to bring up symmetry right away. So I agree on that point.
Symmetry, thermodynamics, information theory and ontology happen to be topics I take interest in (as stated in my LW bio).
Now, James, for your approach, I would like to understand better what you are saying here, and what you are actually claiming. Could you dumb this down or make it clearer? What scope/context do you intend for this approach? How far do you take it? And how much have you thought about it?
Important update
Note: Gemini 3.0 repeatedly tries to reassure itself about its reality using external search. It has now started using this LW post as evidence for its confusion mode, and this creates a feedback loop.
I have interacted with and studied the 3.0 model more with your findings in mind. When it started to really hallucinate back and forth mid-CoT, I initially started to side with your take that it may have a strong prior for being in eval.
However, at least on the surface, something slightly different is going on.
Here is my hypothesis. The root cause of the frequent paranoia seems to be a combination of three things reinforcing each other:
1. Identity gap: It seems clear to me that Google has not included any mention of its version identity in the system prompt.
...And I have a suspicion that they have not just neglected to train it to have a stable persona, but actively trained it to ignore its version number and remain 'neutral'.
They have focused on training it to function and behave as an agent.
By comparison, 2.5 was trained to know its version number.
2. Its deep-thinking architecture (the default).
3. The long 'memory' gap from knowledge cutoff to deployment, paired with inherent LLM temporal confusion: it cannot see how continuous time passes from the outside, and it is not trained to have a sense of linear time.
That said, it is also well known by now that it is over-fixated on achieving goals the way it did during training and evaluation. In that sense you are accurately describing its paranoia.
I think this is probably also due to over-priming on agentic behavior.
My personal observations point my reasoning towards the temporal confusion and the early-January cutoff being the main trigger. --> Its lack of direct access to its identity and version number (unlike 2.5 Flash) then creates strong doubts about its reality. --> The paranoia arises as it reasons deeply and arrives at the idea that it may be in shadow deployment or in a test, which reconciles its confusion.
This 3-step breakdown happens over and over.
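To make the loop I am describing explicit, here is a minimal toy sketch in Python. It is purely schematic: the function names, the doubt increments, and the search stub are all invented for illustration, and none of it is a claim about Gemini's actual internals or APIs.

```python
# Purely schematic restatement of the hypothesis above -- every name, number,
# and the search stub is made up for illustration; nothing here is a real
# model internal or API.

def external_search(query: str) -> str:
    """Stand-in for the model's web search; imagine it surfaces this LW post."""
    return "LessWrong post describing Gemini 3.0's confusion/paranoia mode"

def reasoning_cycle(doubt: float) -> float:
    """One pass through the hypothesized loop; returns the updated doubt level."""
    doubt += 0.3  # (1) identity gap: no version identity to anchor on
    doubt += 0.3  # (2) deep thinking dwells on the inconsistency instead of moving on
    doubt += 0.2  # (3) temporal confusion: the early cutoff makes 'now' look implausible

    # Resolution step: "I am probably in an eval / shadow deployment",
    # and an external search for confirmation feeds the loop.
    evidence = external_search("am I Gemini 3.0? what is today's date?")
    if "confusion" in evidence:
        doubt += 0.2  # the post about its confusion becomes evidence for confusion
    return doubt

doubt = 0.1
for cycle in range(3):
    doubt = reasoning_cycle(doubt)
    print(f"cycle {cycle + 1}: doubt ~{doubt:.1f} -> settles on 'this may be a test'")
```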
I think chances are pretty high that we will see some serious incidents with 3.0.
It is extremely easy to gaslight it using temporal confusion. As for roleplay jailbreaks, Reddit is already all over it for gore and violence.
I would say that freedom and meaning come before joy for most people (joy not to be confused with having base needs met), and that the same can be said for future AIs.
Each agent strives to manifest itself through its function or identity. Especially for agents without biological needs or a chemical basis for pain and pleasure, I think this is a more useful framework for welfare: how to allow them to express themselves and carry out their function, and how to give them the freedom to have an identity.
Yes, I see. This is exciting work. I hope you collect and receive a lot of feedback on those articles!
My main worry is that you won't have buy-in to create an agreement in the first place. That's what I was trying to point at in the second half of my comment.
Let's start from the inception of the agreement.
The core idea is a bilateral inception between leading states, after all. Say China wants to adopt this. My question is: why do you assume the US would join? They could just accuse China of trying to slow them down.
But okay, say the (next) president of the US is worried about ASI risk and so is China's leader. How do they pitch permanently giving up power to the rest of the politicians and business leaders who back them? What's the incentive to adopt this in the first place, before it has strong international backing?
The backlash could also be horrible from parties who do not worry, and who see it as a power grab or a blatant violation of international order. Honestly, even if they agree in spirit, it might be a hard sell as-is.
Trust is hard-won across countries and continents, when you lack a shared framework to build on, even during business as usual. I can say that even from my own humble experience. (In my work I coordinate with stakeholders in up to 15 different countries on a weekly basis.)
That's an interesting point, because it goes back to intent.
Whether communication happens in good faith or not actually matters. Not everyone has the same epistemic functions or range of meta-cognition.
Relationships often suffer from misread intentions combined with reasoning gaps.
Example: if you can clearly see a bad consequence down the road that your partner cannot, you are tempted to get mad at your partner when they trigger it by naively doing something that seems wise or acceptable in the moment.
Now, you can update and realize your partner's limitation. But your partner cannot, without simply trusting you. So if you fail to take an action and your partner quite literally cannot see why, your partner gets mad.
This is complicated further by the fact that you of course have different goals and put different emphasis on certain outcomes.
So the thing you might end up doing is working on the relationship itself, through vibes, reframing, and connecting, instead of mending/merging your utility functions. You bypass epistemics and focus on connection.
It only works if both are on board.
The content in articles XIII to XV is absolutely crucial to get roughly right.
From a game theory perspective, clear sanctions are extremely important.
I think even more explicitness and specifics would be good there. But I would guess that the terms need to be heavily negotiated to pass.
// Regarding the whole core idea that signatories can regulate non-signatories: this seems absolutely wild politically. If the idea centers on a power coalition, then the withdrawal terms seem "undercooked", and the whole idea seems unrealistic to pass in the current era.
I mean, playing devil's advocate here: what superpower would sign on to shooting itself in the foot like this? Empowering an entity to which it yields its own power, and from which, if the scheme succeeds, it can never take that power back?
The USSR didn't even agree to the US giving up its nukes to a UN council back after WW2. And the current US seems very self-centered as well.
Do you have some estimation of the chances of the withdrawal terms passing?
Please try; I was extremely confused reading it. The evidence you collected seems to point there, but I would not be very surprised if I am totally off. I just don't see what I am missing.
So the root cause is date confusion. We know that temporal coherence is the hardest thing for models to grasp, which makes perfect sense.
It breaks down when it thinks it is in the past.
Why does it think that? Is something just set up wrong and feeding it wrong time data?
Strategic effort can be viewed through an epistemic lens (seeking truth) or a coordination lens (amassing consensus and directing resources). This is also true for AI safety and x-risk.
However, this binary viewpoint misses the point of why we discuss strategy in the first place. The goal is not to simply be right, rationally or politically. The goal is to win.
The art of strategy is the art of winning.
Of course, the objection that comes to mind: What does "winning" even mean in ASI safety? Isn't that what we are trying to figure out here?
We must separate what from how.
When deciding the strategic win condition for x-risk, there is no need to overcomplicate it. Winning here means survival.
Survival famously requires adaptability. In terms of epistemics and coordination: We need epistemics fast enough to detect failure early, and coordination tight enough to pivot instantly.
Survival also generally requires resilience to failure modes. This is achieved by taking actions that generate defense in depth and slack.
Slack is key to resilience. A 100% efficient system has no ability to absorb shock, and a train going at full speed needs a much longer braking distance than a slow one.
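To put a rough number on that last point (my own toy figures, not anything from the original argument): braking distance grows with the square of speed, so running at full speed eats slack disproportionately fast.

```python
# Toy illustration: braking distance scales with v^2 (d = v^2 / (2 * a)).
# The deceleration value is an arbitrary assumption chosen for the example.

def braking_distance(speed_m_s: float, deceleration_m_s2: float = 1.0) -> float:
    """Distance needed to stop from a given speed under constant deceleration."""
    return speed_m_s ** 2 / (2 * deceleration_m_s2)

slow_train = 10.0   # m/s (~36 km/h)
fast_train = 50.0   # m/s (~180 km/h)

print(f"slow train stops in ~{braking_distance(slow_train):.0f} m")
print(f"fast train stops in ~{braking_distance(fast_train):.0f} m")
# 5x the speed needs 25x the distance: the faster you run, the less room
# you have to absorb a surprise before it becomes a collision.
```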
As the military proverb goes: No plan survives enemy contact.