My colleague, Ramesh Viswanathan, sent this to me. It’s the most interesting thing I’ve seen on how transformers work. Alas, the math is beyond me, which is often the case, but there are diagrams early in the paper, and I understand them well enough (I think). It seems consistent with intuitions I developed while working on this paper from a year ago: ChatGPT tells stories, and a note about reverse engineering: A Working Paper.

Siddhartha Dalal, Vishal Misra, The Matrix: A Bayesian learning model for LLMs, arXiv:2402.03175v1 [cs.LG], https://doi.org/10.48550/arXiv.2402.03175.

Abstract: In this paper, we introduce a Bayesian learning model to understand the behavior of Large Language Models (LLMs). We explore the optimization metric of LLMs, which is based on predicting the next token, and develop a novel model grounded in this principle. Our approach involves constructing an ideal generative text model represented by a multinomial transition probability matrix with a prior, and we examine how LLMs approximate this matrix. We discuss the continuity of the mapping between embeddings and multinomial distributions, and present the Dirichlet approximation theorem to approximate any prior. Additionally, we demonstrate how text generation by LLMs aligns with Bayesian learning principles and delve into the implications for in-context learning, specifically explaining why in-context learning emerges in larger models where prompts are considered as samples to be updated. Our findings indicate that the behavior of LLMs is consistent with Bayesian Learning, offering new insights into their functioning and potential applications.
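
To make the abstract's idea of "prompts as samples to be updated" a bit more concrete, here is a minimal sketch of textbook Dirichlet-multinomial updating. It is not the paper's model: the three-word vocabulary, the uniform prior, and the function name `posterior_next_token` are all assumptions made up for illustration. It only shows how a prior over next tokens shifts once a few tokens have been observed.

```python
from collections import Counter


def posterior_next_token(prior_alpha, observed_tokens):
    """Posterior predictive over the next token under a Dirichlet prior.

    prior_alpha: dict mapping token -> Dirichlet pseudo-count (hypothetical values).
    observed_tokens: tokens seen so far, treated as multinomial samples.
    Tokens outside the vocabulary in prior_alpha are ignored.
    """
    counts = Counter(observed_tokens)
    # Normalizer: prior pseudo-counts plus observed counts for in-vocabulary tokens.
    total = sum(prior_alpha.values()) + sum(counts[t] for t in prior_alpha)
    return {tok: (alpha + counts[tok]) / total for tok, alpha in prior_alpha.items()}


# Toy vocabulary with a uniform prior (an assumption, not taken from the paper).
prior = {"cat": 1.0, "dog": 1.0, "fish": 1.0}

before = posterior_next_token(prior, [])                    # no evidence yet
after = posterior_next_token(prior, ["dog", "dog", "cat"])  # a short "prompt"

print(before)
print(after)
```

With no evidence, the three tokens are equally likely; after seeing "dog" twice and "cat" once, the distribution leans toward "dog". As I read the abstract, the paper argues that LLM behavior is consistent with this kind of updating, at a vastly larger scale.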
