I found this clarifying for my own thinking! Just a small additional point, in Hidden Incentives for Auto-Induced Distributional Shift, there is also the example of a Q learner that learns to sometimes take a non-myopic action (I believe cooperating with its past self in a prisoner's dilemma), without any meta learning.
Thank you! :)
Yes, one could e.g. have a clear disclaimer above the chat window saying that this is a simulation and not the real Bill Gates. I still think this is a bit tricky. E.g., Bill Gates could be really persuasive and insist that the disclaimer is wrong. Some users might then end up believing Bill Gates rather than the disclaimer. Moreover, even if the user believes the disclaimer on a conscious level, impersonating someone might still have a subconscious effect. E.g., imagine an AI friend or companion who repeatedly reminds you that they are just an AI, versus one that pretends to be a human. The one that pretends to be a human might gain more intimacy with the user even if on an abstract level the users knows that it's just an AI.
I don't actually know whether this would conflict in any way with the EU AI act. I agree that the disclaimer may be enough for the sake of the act.
My takeaway from looking at the paper is that the main work is being done by the assumption that you can split up the joint distribution implied by the model as a mixture distribution
such that the model does Bayesian inference in this mixture model to compute the next sentence given a prompt, i.e., we have P(s∣s0)=P(s⊗s0)P(s0). Together with the assumption that P0 is always bad (the sup condition you talk about), this makes the whole approach with giving more and more evidence for P0 by stringing together bad sentences in the prompt work.
To see why this assumption is doing the work, consider an LLM that completely ignores the prompt and always outputs sentences from a bad distribution with α probability and from a good distribution with (1−α) probability. Here, adversarial examples are always possible. Moreover, the bad and good sentences can be distinguishable, so Definition 2 could be satisfied. However, the result clearly does not apply (since you just cannot up- or downweigh anything with the prompt, no matter how long). The reason for this is that there is no way to split up the model into two components P0 and P1, where one of the components always samples from the bad distribution.
This assumption implies that there is some latent binary variable of whether the model is predicting a bad distribution, and the model is doing Bayesian inference to infer a distribution over this variable and then sample from the posterior. It would be violated, for instance, if the model is able to ignore some of the sentences in the prompt, or if it is more like a hidden Markov model that can also allow for the possibility of switching characters within a sequence of sentences (then either P0 has to be able to also output good sentences sometimes, or the assumption P=αP0+(1−α)P1 is violated).I do think there is something to the paper, though. It seems that when talking e.g. about the Waluigi effect people often take the stance that the model is doing this kind of Bayesian inference internally. If you assume this is the case (which would be a substantial assumption of course), then the result applies. It's a basic, non-surprising learning-theoretic result, and maybe one could express it more simply than in the paper, but it does seem to me like it is a formalization of the kinds of arguments people have made about the Waluigi effect.
Fixed links to all the posts in the sequence: