Summary and Contributions: This post tackles the intriguing topic of performative prediction within AI systems, a phenomenon where making a prediction can, in itself, affect the outcome. The authors present a unique mechanism that leverages joint prediction scoring and a competitive, zero-sum game involving multiple predictors, to enable a decision-maker to definitively take the most beneficial action. The authors' approach aims to limit the potential for AI systems to manipulate the world towards more predictable, yet potentially harmful results. Key contributions include the introduction of this novel mechanism and the formulation of a theoretical Proposition 1, illustrating both the decision maker's selection of the optimal action, and predictors' alignment with true distribution outcomes.
Strengths: The innovative mechanism offers a fresh perspective on AI safety, amalgamating joint prediction scoring with a zero-sum competition between prediction models. It is especially pertinent to those interested in AI alignment and manipulation issues.
Weaknesses: This work's limitations are its assumption that all predictors possess identical beliefs about conditional distributions, a lack of exploration of inner alignment concerns, and no provision for evaluating conditional predictions for actions not taken. Further empirical research and experiments to validate the post's assertions in various contexts are required.
Correctness: The proposed mechanism, which employs a joint prediction scoring rule, appears theoretically sound, though the assumptions might be restrictive. Empirical methodology is recognized as an avenue for future exploration.
Clarity: The clarity of the post is generally satisfactory, but enhancements could be made to the structure and logical flow, such as providing more background or context earlier in the post.
Relation to Prior Work: The post acknowledges related work on performative prediction issues and builds on prior research like Decision Rules and Decision Markets. However, a deeper exploration of existing literature could help highlight the novelty of the proposed mechanism. For example, it's noteworthy that the concept of peer prediction has similarities with performative prediction, as discussed in this post.
Peer prediction is a mechanism designed to elicit truthful reporting in the absence of a verifiable ground truth. In peer prediction mechanisms the participants' reports influence the outcome of the elicitation leading to a performative prediction problem. This problem is studied in key references like "Peer Prediction Method" (Miller, Resnick and Zeckhauser, 2005), "Eliciting Informative Feedback: The Peer-Prediction Method" (Jurca and Faltings, 2006), "A Bayesian Truth Serum for Subjective Data" (Prelec, 2004), and "Surveys with Peer Prediction" (Witkowski and Parkes, 2012). Therefore, in such systems, the predictions made by individuals can potentially influence the behavior of others, thereby affecting the outcomes. This overlap between peer prediction and performative prediction could further extend the discourse and theoretical grounding of the post's proposed mechanism.
Reproducibility: The theoretical framework of the work is fairly reproducible. However, the empirical validation and practical applications warrant further investigation.
Additional Feedback: The post could be improved by providing more context and background early on, incorporating visual aids like diagrams or illustrations, and delving deeper into related work to give a stronger context within the existing literature.
Broader Impact: The authors have somewhat addressed the broader impact of their work, emphasizing the potential positive implications of the proposed mechanism for enhancing AI safety and alignment. Further discussion on possible negative ethical and societal implications would help present a more balanced viewpoint.
I think this problem is not well specified for other reasons so I won't belabor an argument.
Claim: There are no finite FFN gradient hackers.
Proof: Suppose the FFN has -trainable weights . We must have and then our input has the decomposition . It is clear that there exists a map that returns the gradient. However, an exact answer requires multiplications between weights. The multiplication relation cannot be implemented with a finite FFN. Therefore, the gradient cannot be internal in a finite FFN and so we violate condition 2b. We conclude there are no finite FFN gradient hackers.
Despite being a researcher with several years of experience at multiple R1 schools, I find it particularly hard to contextualize what people are doing in alignment research or how their work fits into a larger narrative. While I understand that more people in this field doesn't necessarily speed up progress, good scholarship should be understandable, place itself well within the broader literature, and offer avenues for follow-up work.
Paul Christiano's research does an excellent job of meeting these criteria by placing itself in the literature and outlining a research methodology. For example, see this statement on related work for the ELK problem. They also do a great job of outlining how a reasonably interested person could get started towards making a contribution by outlining a research methodology. Overall, Paul shows that you can work on pre-paradigmatic topics without sacrificing on the criteria for good scholarship.
On the other hand, while Turntrout's work is interesting, it is difficult to research due to their non-standard notation and lack of references to related work. For example, I am not sure what the difference between Shard Theory and in-context learning is. They have relatively similar definitions, but the latter notion is much more established in the literature. As another example, from discussions with learning theory folk, I suspect their notion of power reduces to a variant of Rademacher complexity. Not having this proper context can make it difficult to engage with their work. Some of my sentiment is echoed by reviewers, who point out that the central ideas of their peer-reviewed power-seeking paper were not very clear. Overall, this can leave one feeling it may be easier to generalize the power-seeking theorems on their own rather than work through the paper which would just lead to another instance of the problem posed in the OP.
Finally, I found minimal sufficiency as an idea particularly interesting in Johnswentworth's research. However, the lack of clear theorem statements in some of their earlier work made it difficult to identify core contributions. I do want to praise them for offering up a concrete research agenda with their selection theorems program. Even if with time I also grew dissatisfied with their research agenda on selection theorems I find this be a step in the right direction. However, I'm not quite sure where the difference between these and typical coherence or money-pump arguments lies. I wish more attention was given to this. More recently, there may have been push-back against their utility, assuming selection theorems include coherence theorems.
In conclusion, while I acknowledge the community's efforts in this field, I urge researchers to make their work more understandable, place it within the broader literature, and offer avenues for follow-up work to speed up progress. I find myself tending more towards established academic approaches to alignment work like CIRL games, incentive compataibility, mechanism design, and performative prediction these days in the mean-time.
So are you suggesting that ChatGPT gets aligned to the values of the human contractor(s) that provide data during finetuning, and then carries these values forward when interacting with users?
You are correct that this appears to stand in contrast one of the key benefits of CIRL games. Namely, that they allow the AI to continuously update towards the user's values. The argument I present is that ChatGPT can still learn something about the preferences of the user it is interacting with through the use of in-context value learning. During deployment, ChatGPT will then be able to learn preferences in-context allowing for continuous updating towards the user's values like in the CIRL game.
The reward is from the user which ranks candidate responses from ChatGPT. This is discussed more in OpenAI’s announcement. I edited the post to clarify this.
Liked this post. A refreshing break from the usual AI stuff. I liked how you presented background before going into your argument for why hot air balloons should’ve been invented earlier.
First up, behavior equivalence gets no credit...we’re interested in internal structure here.
This restriction seems obnoxious. As you said, the good regulator theorem only shows behavioral equivalence because the selection is over scalar utilities. However, we can obtain a world model when we withhold the utility function from the conditioning. This forces selection over representations. Accordingly, if you stipulate that optimization/selection occurs in the 'middle' it seems intuitive we have instrumental convergence of the internal structure to a world model.
Almost a year ago, in my last post, I explicitly tried to critique the selection theorem program on the grounds that there are an unlimited number, of any type signature, that can all be manufactured from the 'minimal map theorem'.
Really though, even the 'minimal map theorem' isn't a novel theorem. This is just the Doob Martingale property for conditional expectation applied to a Bernoulli scheme. These are interesting because if two Bernoulli schemes have the same Kologomorov entropy they are isomorphic. My understanding is that this can be used to show an equivalence between things like finite stationary stochastic processes (probability) and billiards. So when I say above that we have convergence with the internal structure I could be referring to that internal structure as bouncing billiards.
In 2016 Yan Lecun proposed the cake analogy for AI.
If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning (RL)
Many (important) people were aware of the relative importance of unsupervised / self-supervised training before GPT. I don't think that there's been a lack of attention paid to self-supervised models. I think you might be confusing this with the situation before and after significant breakthroughs were made in the training of such models.
"Foundation models" has hundreds of references. Are there any in particular that you think are relevant?
I think you may be missing my point still. Part of good scholarship is putting in the work to find relevant and related work. It seems at least mildly irrational to,
Said another way. From your post summary, I first googled "foundational ai" and immediately was pointed to the Wikipedia article on foundation models. All the references I cited can be sourced from there. This took about 5 minutes which makes me suspect that there are fundamental issues with the scholarship in this article.
TL;DR: Self-supervised learning may create AGI or its foundation. What would that look like?
I find the lack of scholarship in the article concerning. In this post, there seems to be this idea that the space of large self-supervised models hasn't received enough attention. This isn't true. The term Foundation model was introduced as a way to talk about large self-supervised machine learning models that are useful for a variety of downstream tasks with little-to-no fine-tuning. The risks and opportunities of foundation models have been discussed along with extensive citations to other related work. There's speculation as far back as 1965 concerning machines trained in a self-supervised manner on vast amounts of data.
The lifecycle perspective presented here and in Ted Chiang's work, "The Lifecycle of Software Objects", is an intriguing lens to consider when thinking about the development of AI and scaling/maturation. This perspective is similar to the argument here in that it centers the importance of long-term, ongoing interaction and environmental influences in shaping the AI's behavior. However, the complexity and unpredictability of the real-world (and of AI) highlights key challenges in implementation. This necessitates further research into concrete strategies to manage these complexities.
Parenting?