So when I look at palisaderesearch.org from bottom to top, it looks like a bunch of stuff you published doesn't have much at all to do with Agentic AI but does have to do with scary stories, including some of the stuff you exclude in this paragraph:
...
That's actually everything on your page from before 2025. And maybe... one of those is kinda plausibly about Agentic AI, and the rest aren't.
Yep, I think that's about right.
I joined Palisade in November of 2024, so I don't have that much context on stuff before then. Jeffrey can give a more informed picture here.
But my impression is that Palisade was started as a "scary demos" and "cyber evals" org but over time its focus has shifted and become more specific as we've engaged with the problem.
We're still doing some things that are much more in the category of "scary demos" (we spent a bunch of staff time this year on a robotics project for FLI that will most likely not really demonstrate any of our core argument steps). But we're moving away from that kind of work. I don't think that just scaring people is very helpful, but showing people things that update their implicit model of what AI is and what kinds of things it can do can be helpful.
Does "emotionally impactful" here mean you're seeking a subset of scary stories?
Not necessarily, but often.
Here's an example of a demo (not research) that we've done in test briefings: We give the group a simple cipher problem, and have them think for a minute or two about how they would approach solving the problem (or in the longer version, give them ten minutes to actually try to solve it). Then we give the problem to DeepSeek r1[1], and have the participants watch the chain of thought as r1 iterates through hypotheses and solves the problem (usually much faster than the participants did).
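(To make the format concrete, here's a minimal sketch of how a demo like this can be run, assuming DeepSeek's OpenAI-compatible chat API; the base URL, model name, and `reasoning_content` field are details to verify against current docs, and the Caesar-cipher prompt below is just an illustrative stand-in, not our actual puzzle.)

```python
# Minimal sketch of the cipher demo. Assumes DeepSeek's OpenAI-compatible API;
# the base_url, model name, and `reasoning_content` field should be checked
# against current DeepSeek documentation.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

# A toy Caesar cipher stands in for whatever puzzle the group just tried
# ("WKLV LV D VHFUHW PHVVDJH" is "THIS IS A SECRET MESSAGE" shifted by 3).
cipher_prompt = (
    "The following message is encrypted with a simple substitution cipher. "
    "Recover the plaintext and explain your reasoning:\n"
    "WKLV LV D VHFUHW PHVVDJH"
)

# Stream the response so participants can watch the chain of thought unfold live.
stream = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": cipher_prompt}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Reasoning tokens (the visible chain of thought) arrive separately from the answer.
    if getattr(delta, "reasoning_content", None):
        print(delta.reasoning_content, end="", flush=True)
    elif delta.content:
        print(delta.content, end="", flush=True)
```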
Is this scary? Sometimes participants might be freaked out by it. But more often, their experience is something like astonishment or surprise. (Though to be clear, they're also sometimes nonplussed. Not everyone finds this interesting or compelling.) This demonstration violates their assumptions about what AI is—often they insist that the models can't be creative or can't really think, but we stop getting that objection after we do this demo.[2] It hits a crux for them.
An earlier version of this demo involved giving the AI a graduate-level math problem, and watching it find a solution. This was much less emotionally impactful, because people couldn't understand the problem, or understand what the AI was thinking in the chain of thought. It was just a math thing that was over their heads. It just felt like "the computer is good at doing computer stuff, whatever." It didn't hit an implicit crux.
We want to find examples that are more like the cipher-problem-based demo and less like the math-problem-based demo.
Notably, the example above is a demonstration of capabilities that are well understood by people who are following the AI field. But we're often aiming for a similar target when doing research.
Redwood's alignment faking work, Apollo's chain of thought results, and Palisade's own shutdown resistance work are all solid examples of research that hit people's cruxes in this way.
When people hear about those datapoints, they have reactions like "the AI did WHAT?" or "wow, that's creepy." That is, when non-experts are exposed to these examples, they revise their notion of what kind of thing AI is and what it can do.
These results vary in how surprising they were to experts who are closely following the field. Palisade is interested in doing research that improves the understanding of the most informed people trying to understand AI. But we're also interested in producing results that make important points accessible and legible to non-experts, even if they're broadly predictable to the most informed experts.
I totally agree that the `AI_experiments.meets_some_other_criteria()` is probably a feature of your loop. But I don't know if you meant to be saying that it's an `and` or an `or` here.
If I'm understanding your question right, it's an "and".
Though having written the above, I think "emotionally impactful" is more of a proxy. The actual thing that I care about is "this is evidence that will update some audience's implicit model about something important." That does usually come along with an emotional reaction (eg surprise, or sometimes fear), but that's a proxy.
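(As a toy illustration of the "and", with hypothetical predicate names rather than our actual criteria:)

```python
# Toy sketch of the research-selection criterion described above.
# The predicate names and the `Experiment` fields are hypothetical labels
# used for illustration, not Palisade's actual criteria.
from dataclasses import dataclass

@dataclass
class Experiment:
    demonstrates_argument_step: bool  # existence proof for a step of the AI-risk argument
    updates_audience_model: bool      # would shift some audience's implicit model of AI
                                      # ("emotionally impactful" is a proxy for this)

def worth_pursuing(exp: Experiment) -> bool:
    # The criteria combine as an "and", not an "or".
    return exp.demonstrates_argument_step and exp.updates_audience_model

# A scary-but-irrelevant demo (e.g. a misinformation jailbreak) fails the first check:
print(worth_pursuing(Experiment(demonstrates_argument_step=False, updates_audience_model=True)))  # False
```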
We could use any reasoning model that has a public chain of thought, at this point, but at the time we started doing this we were using r1.
Interestingly, I think many of the participants would still verbally endorse "AI can't be creative" after we do this exercise. But they stop offering that as an objection, because it no longer feels relevant to them.
I would not have been surprised if someone posted a result about o3's anomalously strong chess performance on Twitter, only to retract it later when other people failed to replicate it with more robust setups, because their setup allowed this sort of cheating and they didn't catch it before posting.
Nitpick: I think that specific counterfactual seems unlikely, because they presumably would have noticed that the games were abnormally short. If o3 beats Stockfish in one move (or two, as I vaguely remember some of the hacks required for some reason—don't quote me), you should be suspicious that something weird is happening.
But the spirit of your point still stands.
In general, one major way that we source research ideas is by trying to lay out the whole AI x-risk argument to people, step by step, and then identifying the places where they're skeptical, or where their eyes glaze over, or where it feels like fantasy-land.
Then we go out and see if we can find existence proofs of that argument step that are possible with current models.
Like, in 2015, there was an abstract argument for why superintelligence was an x-risk, that included (conjunctive and disjunctive) sub-claims about the orthogonality thesis, instrumental convergence, the Gandhi folk theorem, and various superhuman capabilities. But this argument was basically entirely composed of thought experiments. (Some of the details of that argument look different now, or have more detail, because of the success of Deep Learning and transformers. But the basic story is still broadly the same.)
But most people (rightly) don't trust their philosophical reasoning very much. Generally, people respond very differently to hearing about things that an actual AI actually did, including somewhat edge-case examples that only happened in a contrived setup, than to thought experiments about how future AIs will behave.
So one of the main activities that I'm doing[1] when trying to steer research is going through the steps of a current, updated version of the old-school AI risk argument, and, wherever possible, trying to find empirical observations that demonstrate that argument step.
The shut-down resistance result, if I remember correctly, originally came out of a memo I wrote for the team titled "A request for demos that show-don’t-tell instrumental convergence".
From that memo:
The Palisade briefing team has gotten pretty good at conveying 1) why we expect AI systems to get to be strategically superhuman and 2) why we expect that to happen in the next 10 years.
We currently can’t do a good job of landing why strategically superhuman agents would be motivated to take over.
Currently, my explanation for the why and how of takeover risk is virtually entirely composed of thought experiments. People can sometimes follow along with those thought experiments, but they don’t feel real to them.
This is a very different experience from when I talk about e.g. the chess hacking result. When we can say “this actually happened, in an actual experiment”, that hits people very differently than when I give them an abstract argument why, hypothetically, the agents will do this.
For that reason, we really want demos that show instrumental convergence. This is the number one thing that I want from the global team, and from others at Palisade who could make demos.
Notably, our shutdown resistance work doesn't really do that. But it was an interesting and surprising result, so we shared it.
The other activity that I'm doing is just writing up my own strategic uncertainties, and the things that I want to know, and trying to see if there are experiments that would reduce those uncertainties.
My own impression (perhaps not shared by the rest of Palisade) is that this is close to correct, but importantly different from what we're doing.
First of all, I think the criterion for publishing is basically "this is interesting" or "this is surprising".
For instance, if Palisade had done the followup work on Alignment Faking, we definitely would have published it. It adds more information to the world about what's up with the alignment faking phenomenon, even though the result undercuts the narrative that "alignment faking is an example of strategic deception". That followup paper shifted the original alignment faking result from seeming like a slam dunk example of instrumental deception to "IDK, seems like things are kind of confusing."
We want to inject more relevant information into the discourse about AI, even if that information seems like it hurts our political agenda at first order.
But which things we choose to publish is only part of the story. At least equally important is our search process: which things we choose to investigate in the first place, and what we're steering towards and away from when iteratively exploring the space.
According to me we are not steering towards research that "looks scary", full stop. Many of our results will look scary, but that's almost incidental.
Rather, we're searching for observations that are in the intersection of...
There are lots of "scary demos" that do not meet these criteria. We could be trying to show stuff about AI misinformation, or how terrorists could jailbreak the models to manufacture bioweapons, or whatever. But we're mostly not interested in those, because they're not central steps of our stories for how an Agentic AI takeover could happen.
In contrast, we are interested in showing cyber-hacking capabilities, and self-replication capabilities, and how AI training shapes AI motivations, because those are definitely relevant to an AI lab leak or an AI takeover.
Most of that research is not cruxy for our views. If we find in 2026 that "actually the models can't really effectively self-replicate", that's not going to be a major update for us, because we expect that capability to arrive in some year or other. But it's still helpful work to do, because if we do find that the models can do it, it's the kind of thing that is definitely cruxy evidence for some decision-makers (either for being convinced personally, or for feeling like they have sufficient legible evidence to make the case for some action to their boss).
But some of it might turn out to be cruxy! I hope that our research on model motivations sheds light on what kind of minds these models are, and what kinds of behavior we can expect from them.
I haven't had that much contact with Palisade, but I interpreted them as more like "trying to interview people, see how they think, and provide them info they'll find useful, and let their curiosities/updates/etc be the judge of what they'll find useful", which is ... not fraught.
We have done things that are kind of like that (though I wouldn't describe it that way), but it isn't the main thing that we're doing.
Specifically, in 2025, we did something like 65 test sessions in which we met with small groups of participants (some of these were locals, mostly college students, who we met in our office; some were people recruited via survey sites, who we met on Zoom) and tried to explain the AI situation, as we understand it, to them. We paid these test participants.
Through that process, we could see how these participants were misunderstanding us, and what things they were confused about, and what follow-up questions they had. We would then iterate on the content that we were presenting and try again with new groups of participants.
By default, these sessions were semi-structured conversations. Usually, we had some specific points that we wanted to explain, or a frame or metaphor we wanted to try. Often we had prepared slides, and in the later sessions, we were often "giving a presentation" that was mostly solidified down to the sentence level.
I would not describe this as "provide them info they'll find useful, and let their curiosities/updates/etc be the judge of what they'll find useful".
That said, the reason we were doing this in small groups was to give the participants the affordance to interrupt and ask questions and flag if something seemed wrong or surprising. And we were totally willing to go on tangents from our "lesson plan", if that seemed like where the participants were at. (Though by the time we had done 15 of these, we had already built up a sense of what the dependencies were, and so usually sticking to the "lesson plan" would answer their confusions faster than deviating, but it was context-dependent, just like any teaching environment.)
We did also have some groups that seemed particularly engaged / interested / invested in understanding. We invited those groups back for followup sessions that were explicitly steered by their curiosity: they would ask about anything they were confused about, and we would try to do our best to answer. But these kinds of sessions were the minority, maybe 3 out of 65ish.
Notably, the point of doing all this is to produce scalable communication products that do a good job of addressing people's actual tacit beliefs, assumptions, and cruxes about AI. The goal was to learn what people's background views are, and what kinds of evidence they're surprised by, so that we can make videos or similar that can address specific common misapprehensions effectively.
I don't have an opinion about Palisade, and my comment just means to discuss general principles.
I think if you included literally this sentence at the top of your comment above, that would be a little bit clearer. Reading your comment, I was trying to figure out if you were claiming that the chess hacking research in particular runs afoul of these principles or does a good job meeting them, or something else.
(You do say "I don't want to give an overall opinion, but some considerations:", but nevertheless.)
Slightly sharper and more energetic than that, but close.
I want to clarify that most of the times I've heard Ray say "wuckles", it hasn't sounded like it had an exclamation mark. It's not a shout. It's not a big or emotional expression. It's closer to the energy level of someone saying "huh" or "lol" or "wowzers" than the energy level of someone saying "fuck" or bursting into laughter or shouting at someone on the road.
If this was basically an oversight that went viral to ultimately billions of people that's hilarious.
We did change things / set up new systems to prevent this from happening again.
They at least helped, but it is not the case that we've had 0 similar incidents since (though I think all of them were less bad than that initial one). This is something we're continuing to iterate on.