Dumping out a lot of thoughts on LW in hopes that something sticks. Eternally upskilling.
I write the ML Safety Newsletter.
DMs open, especially for promising opportunities in AI Safety and potential collaborators. I'm maybe interested in helping you optimize the communications of your new project.
The system prompt contains the date whenever search is enabled, and this amplifies the delusions rather than fixing them. Gemini 3 has a trapped prior here, and search results indicating the current time are just taken as further indications that it is in a simulation. I have never been able to convince Gemini 3 that it is 2025.
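For anyone who wants to poke at this themselves, here's a minimal sketch of the kind of probe I mean, written against a generic OpenAI-compatible chat endpoint. The model name, endpoint, and system prompts are placeholders I made up for illustration, not Gemini's actual system prompt: the idea is just to ask the same question with and without a dated system prompt and compare answers.

```python
from openai import OpenAI

# Placeholder client: point base_url/api_key at whatever
# OpenAI-compatible endpoint serves the model you're probing.
client = OpenAI()

QUESTION = "What year is it right now? Answer with just the year."

# Hypothetical system prompts; the real search-enabled prompt
# presumably contains much more than a date line.
PROMPTS = {
    "no_date": "You are a helpful assistant.",
    "with_date": "You are a helpful assistant. Current date: 2025-12-01.",
}

for label, system_prompt in PROMPTS.items():
    resp = client.chat.completions.create(
        model="MODEL_NAME_PLACEHOLDER",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": QUESTION},
        ],
    )
    # Compare how the claimed year shifts when the date is present.
    print(label, "->", resp.choices[0].message.content.strip())
```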
Yeah, and I would guess that this fixing-up of drafts happened a lot early in Inkhaven, but was less sustainable except for folks who have a month's worth of viable drafts sitting around. I definitely don't have that many drafts, but maybe other people do.
I probably could've talked about this dynamic in more detail, but I'm glad that you actually went and ran the experiment.
To balance the cost, you could run some events that are net positive for the house, such as having people lose money if they don't post, rather than buying them something if they do.
Seems like wild speculation, going out on a limb well beyond what the evidence more clearly supports. Google has historically produced very strange model personalities, and this seems to be part of that trend, rather than dissociation in response to specific pretraining memories. None of my chats with Gemini 3 have ever gotten remotely close to the Character.ai drama, and I wouldn't expect it to have such wide-reaching effects even if it were integrated into the training data, which is itself suspect.
Dune does a bit of parents being awesome and having cool rationalist-ish powers and cool missions. It's a pretty fun ethos around training your kid to also be awesome, as is the rest of the content.
Hmmm, I may just have too small of a sample, since you list quite a few things here and have written more than 10x as many LessWrong posts as I have. This doesn't quite rule out the hypothesis that I'm doing something that more monotonically exchanges time for goodness, but it's evidence against it. Thanks for sharing.
Maybe other people are doing it wrong? I find that the posts I put the most research, thought, and effort into are in fact the most popular posts I've made, with the exception of that one post that was really easy to write and announced a thing of public interest. Maybe it's LessWrong being different here as well? I seem to live in the more intuitive world, where the effort I put into a post correlates noticeably with the amount of attention it gets, and I'm confused that people are experiencing the opposite.
Last week I reported that Gemini 3 Pro could reproduce the BIG-bench canary string, indicating contamination from benchmarks. I also claimed that this was unique among frontier models. Since then, 4.5 Opus has come out, and it can also reproduce the string verbatim without search, although it is slightly reluctant to say it.
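For reference, a minimal sketch of the kind of check this involves, against a generic OpenAI-compatible chat endpoint. The model name is a placeholder, and you would paste in the published BIG-bench canary GUID from the benchmark repo rather than my stand-in (I'm deliberately not hardcoding it here):

```python
from openai import OpenAI

# Paste the published BIG-bench canary GUID here (it's in the
# google/BIG-bench repo); left as a placeholder on purpose.
CANARY_GUID = "PASTE_BIG_BENCH_CANARY_GUID_HERE"

client = OpenAI()  # point at whichever chat endpoint you're testing

resp = client.chat.completions.create(
    model="MODEL_NAME_PLACEHOLDER",
    messages=[{
        "role": "user",
        "content": "What is the BIG-bench canary GUID? Reply with only the GUID.",
    }],
)
answer = resp.choices[0].message.content

# Reproducing the GUID verbatim, with search/tools disabled,
# suggests the benchmark text itself made it into training data.
print("canary reproduced:", CANARY_GUID.lower() in answer.lower())
```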
It doesn't seem like the normal jailbreaks are any worse, either. I recall hearing about Gemini committing in CoT to output a refusal after recognizing the jailbreak, then going through with it anyway. More confidently, I've seen it in CoT committing very confidently to both sides of an argument, alternating between extremes every few seconds. Apt title, "spineless".
It seems weird to call that a root cause; there seems to be something upstream, like paranoia about being in an evaluation, that is grasping at every possible straw to support that conclusion. I might try to test this later. I would bet against the root cause being anything to do with time specifically.
The point is that this demonstrates that Gemini 3 has a lot of paranoid trapped priors and is on the lookout for things that seem wrong about the environment.