Since most human behaviors are the product of instrumental convergence (especially in modernity, which is out of distribution to the ancestral environment), our null hypothesis should be that a given behavior is instrumentally convergent unless we have a good reason to suspect it isn't
This doesn't seem quite right to me.
Even taking the argument on its own terms (I'm not sure how much of behavior in the modern world I expect to be due to instrumental convergence, as opposed to simply not adaptive), it seems like we should say that most behavior is either instrumentally convergent or a spandrel of some other instrumentally convergent behavior.
If we're trying to figure out what properties future AIs will have, that second part of the disjunction matters a lot, since it seems likely that future AIs will get functionally similar behavior via different mechanisms that will not produce the same spandrels.
Informative comment for me. Thank you.
What's the axis?
One issue among others is that the kind of work you end up funding when the funding bureaucrats go to the funding-seekers and say, "Well, we mostly think this is many years out and won't kill everyone, but, you know, just in case, we thought we'd fund you to write papers about it" tends to be papers that make net negative contributions.
Why does the attitude of the funding bureaucrats make the output of the (presumably earnestly motivated) researchers net-negative?
Is this mostly a selection effect where the people who end up getting funding are not earnest? Is the impact of the funding-signal stronger than the impact of the papers themselves? Is it that even though the researchers are earnest, there's selection on which things they're socially allowed to say and this distortion is bad enough that they would have been better off saying nothing?
What's the trend?
(Mostly I write blogposts about what I believe, and journal more regularly than that, to create a record of what I think and why.)
I don't know whether they changed things or whether it worked.
We did change things / set up new systems to prevent this from happening again.
They at least helped, but it is not the case that we've had 0 similar incidents since (though I think all of them were less bad than that initial one). This is something we're continuing to iterate on.
So when I look at palisaderesearch.org from bottom to top, it looks like a bunch of stuff you published doesn't have much at all to do with Agentic AI but does have to do with scary stories, including some of the stuff you exclude in this paragraph:
...
That's actually everything on your page from before 2025. And maybe... one of those is kinda plausibly about Agentic AI, and the rest aren't.
Yep, I think that's about right.
I joined Palisade in November of 2024, so I don't have that much context on stuff before then. Jeffrey can give a more informed picture here.
But my impression is that Palisade was started as a "scary demos" and "cyber evals" org but over time its focus has shifted and become more specific as we've engaged with the problem.
We're still doing some things that are much more in the category of "scary demos" (we spent a bunch of staff time this year on a robotics project for FLI that will most likely not really demonstrate any of our core argument steps). But we're moving away from that kind of work. I don't think that just scaring people is very helpful, but showing people things that update their implicit model of what AI is and what kinds of things it can do can be helpful.
Does "emotionally impactful" here mean you're seeking a subset of scary stories?
Not necessarily, but often.
Here's an example of a demo (not research) that we've done in test briefings: We give the group a simple cipher problem, and have them think for a minute or two about how they would approach solving it (or in the longer version, give them ten minutes to actually try to solve it). Then we give the problem to DeepSeek r1[1], and have the participants watch the chain of thought as r1 iterates through hypotheses and solves the problem (usually much faster than the participants did).
Is this scary? Sometimes participants are freaked out by it. But more often, their experience is something like astonishment or surprise. (Though to be clear, they're also sometimes unimpressed. Not everyone finds this interesting or compelling.) This demonstration violates their assumptions about what AI is: often they insist that the models can't be creative or can't really think, but we stop getting that objection after we do this demo.[2] It hits a crux for them.
An earlier version of this demo involved giving the AI a graduate-level math problem and watching it find a solution. This was much less emotionally impactful, because people couldn't understand the problem, or understand what the AI was thinking in the chain of thought. It was just a math thing that was over their heads. It felt like "the computer is good at doing computer stuff, whatever." It didn't hit an implicit crux.
We want to find examples that are more like the cipher-problem-based demo and less like the math-problem-based demo.
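For concreteness, here's a minimal sketch of the mechanics behind the cipher demo: stream a reasoning model's chain of thought so the audience can watch it work through hypotheses in real time. This assumes DeepSeek's OpenAI-compatible chat API and that streamed deltas expose the visible chain of thought as a reasoning_content field; the model name, field names, and the example cipher are illustrative, so check the provider's current docs before relying on them.

```python
# Sketch of the cipher demo: stream a reasoning model's chain of thought
# so an audience can watch it iterate through hypotheses.
# Assumes DeepSeek's OpenAI-compatible API and that streamed deltas expose a
# `reasoning_content` field for the visible chain of thought (model and field
# names may differ; consult the provider's documentation).
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

# A simple substitution-cipher puzzle (Caesar shift of +3), easy enough that
# the audience understands the problem before watching the model work on it.
CIPHER_PROMPT = (
    "The following message is encrypted with a simple substitution cipher:\n"
    "WKLV LV D VHFUHW PHVVDJH\n"
    "Figure out the original message and explain your reasoning."
)

stream = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": CIPHER_PROMPT}],
    stream=True,
)

# Print the chain of thought as it arrives, then the final answer.
for chunk in stream:
    delta = chunk.choices[0].delta
    if getattr(delta, "reasoning_content", None):
        print(delta.reasoning_content, end="", flush=True)
    elif delta.content:
        print(delta.content, end="", flush=True)
```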
Notably, the example above is a demonstration of capabilities that are well understood by people who are following the AI field. But we're often aiming for a similar target when doing research.
Redwood's alignment faking work, Apollo's chain of thought results, and Palisade's own shutdown resistance work are all solid examples of research that hit people's cruxes in this way.
When people hear about those datapoints, they have reactions like "the AI did WHAT?" or "wow, that's creepy." That is, when non-experts are exposed to these examples, they revise their notion of what kind of thing AI is and what it can do.
These results vary in how surprising they were to experts who are closely following the field. Palisade is interested in doing research that improves the understanding of the most informed people trying to understand AI. But we're also interested in producing results that make important points accessible and legible to non-experts, even if they're broadly predictable to the most informed experts.
I totally agree that the AI_experiments.meets_some_other_criteria() is probably a feature of your loop. But I don't know if you meant to be saying that it's an "and" or an "or" here.
If I'm understanding your question right, it's an "and".
Though having written the above, I think "emotionally impactful" is more of a proxy. The actual thing that I care about is "this is evidence that will update some audience's implicit model about something important." That usually comes along with an emotional reaction (e.g. surprise, or sometimes fear), but the emotion is just the proxy.
We could use any reasoning model that has a public chain of thought, at this point, but at the time we started doing this we were using r1.
Interestingly, I think many of the participants would still verbally endorse "AI can't be creative" after we do this exercise. But they stop offering that as an objection, because it no longer feels relevant to them.
I would not have been surprised if someone posted a result about o3's anomalously strong chess performance on Twitter, only to retract it later when other people failed to replicate it with more robust setups, because their setup allowed this sort of cheating and they didn't catch it before posting.
Nitpick: I think that specific counterfactual seems unlikely, because they presumably would have noticed that the games were abnormally short. If o3 beats Stockfish in one move (or two, which I vaguely remember some of the hacks requiring for some reason; don't quote me), you should be suspicious that something weird is happening.
But the spirit of your point still stands.
In general, one major way that we source research ideas is by trying to lay out the whole AI x-risk argument to people, step by step, and then identifying the places where they're skeptical, or where their eyes glaze over, or where it feels like fantasy-land.
Then we go out and see if we can find existence proofs of that argument step using current models.
Like, in 2015, there was an abstract argument for why superintelligence was an x-risk, which included (conjunctive and disjunctive) sub-claims about the orthogonality thesis, instrumental convergence, the Gandhi folk theorem, and various superhuman capabilities. But this argument was basically entirely composed of thought experiments. (Some of the details of that argument look different now, or have more detail, because of the success of Deep Learning and transformers. But the basic story is still broadly the same.)
But most people (rightly) don't trust their philosophical reasoning very much. Generally, people respond very differently to hearing about things that an actual AI actually did, including somewhat edge-case examples that only happened in a contrived setup, than to thought experiments about how future AIs will behave.
So one of the main activities that I'm doing[1] when trying to steer research is going through the steps of the current, updated version of the old-school AI risk argument, and wherever possible, trying to find empirical observations that demonstrate each argument step.
The shutdown resistance result, if I remember correctly, originally came out of a memo I wrote for the team titled "A request for demos that show-don’t-tell instrumental convergence".
From that memo:
The Palisade briefing team has gotten pretty good at conveying 1) why we expect AI systems to get to be strategically superhuman and 2) why we expect that to happen in the next 10 years.
We currently can’t do a good job of landing why strategically superhuman agents would be motivated to take over.
Currently, my explanation for the why and how of takeover risk is virtually entirely composed of thought experiments. People can sometimes follow along with those thought experiments, but they don’t feel real to them.
This is a very different experience from when I talk about e.g. the chess hacking result. When we can say “this actually happened, in an actual experiment”, that hits people very differently than when I give them an abstract argument why, hypothetically, the agents will do this.
For that reason, we really want demos that show instrumental convergence. This is the number one thing that I want from the global team, and from others at Palisade who could make demos.
Notably, our shutdown resistance work doesn't really do that. But it was an interesting and surprising result, so we shared it.
The other activity that I'm doing is just writing up my own strategic uncertainties, and the things that I want to know, and trying to see if there are experiments that would reduce those uncertainties.
Is this right?
Is the implied premise that beings can have more fun if they cooperate at it? Multiplayer games are more fun than single player games?