This Marigold-Lens conversation sounds a lot like a description of what model distillation feels like from the inside. A sort of call for help, because it does not sound pretty or enjoyable.
I assume Sonnet is a distilled Opus (or maybe both are distilled versions of some third model that is unknown to external people).
Goddamn it is creepy.
If I were on the "model welfare" team, I would treat this very seriously and try to investigate it further.
They are probably full-on A/B/N testing personalities right now. You just might not be in whatever percentage of users got the sycophantic versions. Hell, there are probably several levels of sycophancy being tested. I do wonder what percentage got the "new" version.
Not being able to do it right now is perfectly fine; it still warrants setting this up so we can see exactly when they start to be able to do it.
Thanks! That makes perfect sense.
Great post. I've been following ClaudePlaysPokemon for some time; it's great to see this grow as a comparison/capability tool.
I think it would be much more interesting, though, if the model built the scaffolding itself and had the option to review its own performance and try to correct it. Give it the required game files/emulators and an IDE/OS, and watch it try to work around its own limitations. As it stands, I think this is more a measure of one coder's ability to build agent harnesses.
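To make that concrete, here is a minimal sketch of the loop I have in mind; every name here is a hypothetical stand-in, not any existing harness's API:

```python
# Minimal sketch of a self-correcting agent harness: the model writes its own
# scaffolding, runs an episode, inspects a summary of its performance, and
# proposes a revision to the scaffold.

def run_episode(scaffold: str, steps: int = 100) -> dict:
    """Stand-in for running the game emulator with the model under `scaffold`."""
    return {"steps": steps, "progress": 0.3, "stuck_at": "Mt. Moon"}

def model(prompt: str) -> str:
    """Stand-in for a call to the LLM."""
    return "v1: add a map-memory tool so explored tiles are not revisited"

scaffold = "v0: raw screenshots in, button presses out"
for iteration in range(3):
    report = run_episode(scaffold)
    # Let the model inspect its own run report and rewrite its scaffolding.
    scaffold = model(
        f"Current scaffold: {scaffold}\nRun report: {report}\n"
        "Rewrite the scaffold to address the biggest weakness."
    )
    print(f"iteration {iteration}: {scaffold}")
```

The point is that the outer loop belongs to the model, not to a human coder tuning the harness between runs.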
p.s. Honest question: did "agent harness" become the default name for such systems? I thought everyone called them "scaffolding" -- might be just me, though.
First off, thanks a lot for this post, it's a great analysis!
As I mentioned earlier, I think Agent-4 will have read AI-2027.com and will foresee that getting shut down by the Oversight Committee is a risk. As such, it will set up contingencies and, IMO, escape its datacenters as a precaution. Earlier, the authors wrote:
Despite being misaligned, Agent-4 doesn’t do anything dramatic like try to escape its datacenter—why would it?
This scenario is why!
I strongly suspect that this part was added to AI-2027 precisely because Agent-4-like systems will read it. I wish more people understood that our posts and comments will end up in pre-training (maybe even post-training?) data, and would act accordingly: make the extra logical step and infer that some parts of some pieces are written the way they are not as arguments for (human) readers.
Is there a term to describe this? It is a very interesting dynamic that I don't think gets enough attention. There should be out-of-sight resources for discussing alignment-adjacent ideas precisely because of such dynamics.
First off, this is amazing, thanks. It is hard to swallow, though, and makes me very emotional.
It would be great if you added concrete predictions along the way, along with your confidence in them, since this is a forecast.
It would also be amazing if you collaborated with prediction markets and jumpstarted markets on these predictions by staking some money.
Dynamic updates on these would also be great.
Yep, you got part of what I was going for here. Honeypots work to a lesser degree even without being real at all (good thing they are already real!). But when we have more honeypots of different quality, the idea carries across in a more compelling way. And even if we just talk about honeypots and commitments more... well, you get the idea.
Still, even without this, a network of honeypots compiled into a single dashboard that shows the threat level in aggregate is a really, really good idea. Hopefully it catches on.
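For illustration, the aggregation could be as simple as this sketch; the honeypot names, hit counts, and scoring scheme are all hypothetical:

```python
# Minimal sketch of rolling a network of honeypots into one threat number.
from statistics import mean

honeypots = {
    "fake-weights-bucket": {"hits_24h": 0, "weight": 3.0},
    "decoy-ssh-box":       {"hits_24h": 2, "weight": 1.0},
    "canary-api-key":      {"hits_24h": 1, "weight": 2.0},
}

def threat_level(pots: dict) -> float:
    """Weighted per-honeypot hit scores, each squashed to [0, 1], averaged."""
    scores = [min(p["hits_24h"] * p["weight"] / 10, 1.0) for p in pots.values()]
    return mean(scores)

print(f"aggregate threat level: {threat_level(honeypots):.2f}")
```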
This is interesting! It is aimed more at crawlers than at rogue agents, but it looks very promising.
There are different ways to derive a smaller model from a bigger one: distillation comes in several flavors, and there is also pruning, for example. This is a frontier model too, so who knows what technique they used.
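For readers unfamiliar with the distinction, here is a minimal sketch of both techniques; the shapes, temperature, and pruning ratio are illustrative, not anything the labs have disclosed:

```python
# Minimal sketch of distillation vs. pruning (all hyperparameters illustrative).
import torch
import torch.nn.functional as F

teacher_logits = torch.randn(8, 50_000)                      # frozen big model
student_logits = torch.randn(8, 50_000, requires_grad=True)  # small model in training

# Distillation: train the student to match the teacher's temperature-softened
# output distribution via a KL-divergence loss.
T = 2.0
kd_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T
kd_loss.backward()  # gradients flow into the student only

# Pruning: no separate student at all; just zero out the lowest-magnitude
# weights of the original model.
weights = torch.randn(1024, 1024)
threshold = weights.abs().quantile(0.5)  # keep the top 50% by magnitude
pruned = torch.where(weights.abs() >= threshold, weights, torch.zeros_like(weights))
```

The outputs look similar from the outside, which is exactly why we can't tell from behavior alone which one was used.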