Fun link! But I see two problems with designing the "game" so that it actually corresponds to this persuasion game. First, it's practically difficult: in the real world, the AI's moves might be able to cause the judge to do something other than Bayesian updating. Second, the domains in which a convex utility function over our beliefs would help us get what we want in the real world seem likely either to be simple and low-impact, or to require a very "high-power" utility function -- one that already knows a good set of beliefs for us to hold and aims very precisely at that point.
The Bayesian persuasion framework requires that the set of possible world states be defined in advance -- and the question then becomes: given the expert's and the decision-maker's utility functions, what information about the world state should the expert commit to revealing?
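(For concreteness, here is the formal statement as I understand it, in the standard notation, with $\mu_0$ the prior over world states and $v(\mu)$ the expert's payoff when the decision-maker holds belief $\mu$.)

The expert commits to a signal $\pi : \Omega \to \Delta(S)$; each realization $s$ induces a posterior $\mu_s$ by Bayes' rule, so choosing a signal is equivalent to choosing a distribution $\tau$ over posteriors satisfying Bayes plausibility, $\mathbb{E}_\tau[\mu] = \mu_0$. The expert's problem is
$$\max_{\tau} \; \mathbb{E}_\tau[v(\mu)] \quad \text{s.t.} \quad \mathbb{E}_\tau[\mu] = \mu_0,$$
and its value is the concave closure of $v$ evaluated at $\mu_0$ -- which is where the concave/convex punchline in the question comes from: full disclosure is optimal when $v$ is convex, and no disclosure when $v$ is concave.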
I think that Bayesian persuasion might not be the right framework here, because we get to choose the AI's reward function. Assume (as Bayesian persuasion does) that you've defined all possible world states.[1] Do you want to get the AI to reveal all the information -- i.e. which particular world state you're in -- rather than a convenient subset that it has precommitted to? That seems straightforward: just penalize it really heavily if it refuses to tell you the world state.
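As a toy sketch of what I mean (the function, names, and penalty constant below are all just illustrative, not a real reward spec):

```python
# Illustrative only: if refusing to name the world state is penalized heavily
# enough, "reveal only a convenient precommitted subset" is never the
# reward-maximizing move.

REFUSAL_PENALTY = -1e6  # assumed to dwarf any gain from staying silent

def reward(reported_state, true_state):
    if reported_state is None:  # the AI declines to say which state we're in
        return REFUSAL_PENALTY
    # Scoring the report requires comparing it to the true state, which is
    # exactly the hard part in practice (see the next paragraph).
    return 1.0 if reported_state == true_state else 0.0

print(reward(None, "state_A"))       # -1000000.0
print(reward("state_A", "state_A"))  # 1.0
```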
I think the much bigger challenge is getting the AI to tell you the world state truthfully -- but note that this is outside the scope of Bayesian persuasion, which assumes that the expert is constrained to the truth (and is deciding which parts of the truth they should commit to revealing).
[1] "World states" here need not mean the precise description of the world, atom by atom. If you only care about answering a particular question ("How much will Apple stock go up next week?"), then you could define the set of world states to correspond to the relevant considerations (e.g. the ordered tuple of random variables (how many iPhones Apple sold last quarter, how much time people are spending on their Macs, ...)). Even so, I expect that defining the set of possible world states will be practically impossible in most cases.
I'm an economist and quite new to AI alignment. In reading about the perils of persuasive AI, I was reminded of an influential model in economic theory: the Bayesian persuasion model (Kamenica and Gentzkow, 2011). It's used to model situations in which a decisionmaker wants to learn from a biased expert (e.g. a judge learning from a prosecutor). The punchline of the basic model I linked is that in equilibrium, communication depends on the curvature of the expert's payoff function: if the expert's payoff is concave in the decisionmaker's beliefs, then no communication occurs, whereas if the expert's payoff is convex in the decisionmaker's beliefs, then the expert discloses all their information.
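A quick numeric sanity check of that punchline (a two-state toy example; the prior and payoff functions below are arbitrary choices of mine):

```python
import math

# Two-state toy example: the belief mu is P(state = 1), mu0 is the prior.
# Bayes plausibility forces any signal scheme's posteriors to average back to
# mu0, so we just compare "full disclosure" against "no communication".

mu0 = 0.3  # prior (arbitrary)

def expected_payoff(posteriors, probs, v):
    """Expert's expected payoff from a scheme inducing `posteriors`
    with probabilities `probs` (these must average back to mu0)."""
    return sum(p * v(mu) for p, mu in zip(probs, posteriors))

v_convex = lambda mu: mu ** 2          # payoff convex in the belief
v_concave = lambda mu: math.sqrt(mu)   # payoff concave in the belief

full_disclosure = ([0.0, 1.0], [1 - mu0, mu0])  # posteriors jump to 0 or 1
no_communication = ([mu0], [1.0])               # posterior stays at the prior

for name, v in [("convex", v_convex), ("concave", v_concave)]:
    full = expected_payoff(*full_disclosure, v)
    none = expected_payoff(*no_communication, v)
    print(name, round(full, 3), round(none, 3))
# convex 0.3 0.09   -> full disclosure beats no communication (Jensen)
# concave 0.3 0.548 -> no communication beats full disclosure
```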
To my non-expert eyes, this approach seems like it could be very useful in modelling the challenge of learning from AI while trading off the risk of persuasive AI. Does it seem promising, and if so, has it been done?