From a mathematical point of view, the building and training of a large transformer language model (LLM) is the construction of a certain function, from one Euclidean space to another, that has certain interesting properties. It may therefore be surprising to find that many key papers announcing significant new LLMs seem reluctant to spell out the details of the function they have constructed in plain mathematical language, or indeed even in complete pseudo-code. The latter form of this complaint is the subject of the recent article of Phuong and Hutter [1]. Here, we focus on one aspect of the former perspective and seek to give a relatively ‘pure’ mathematical description of the architecture of an LLM.

 

This short PDF is a set of notes I made, initially just for my own benefit, while trying to understand the architecture of 'decoder-only' LLMs. It draws heavily on Anthropic's Mathematical Framework for Transformers but is deliberately written in a 'pure math' style.

It was while writing this up to post that I started to develop the thoughts that led to my post about the mathematics of interpretability more generally.

I still consider it something of a fragment or draft, but may develop it further.


With the vocabulary having been fixed, we now have a canonical way of taking any string of real text and mapping it to a (finite) sequence of elements from the fixed vocabulary.

Correct me if I'm wrong, but: you don't actually describe any such map. The preceding paragraph explains the Byte-Pair Encoding training algorithm, not the encode step.

The simplified story can be found at the end of the "Implementing BPE" part of the Hugging Face BPE tutorial: it is important to save the merge rules in order and apply them in tokenization. Now the GPT-2 implementation seems very similar but it has a few parts I don't understand completely, e.g. what does that regex do?
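To make the "save the merge rules in order and apply them" point concrete, here is a minimal sketch of the encode step as I understand the simplified story. The function name `bpe_encode` and the toy merge list are my own illustrative assumptions, not taken from the pdf or from the GPT-2 code, and the sketch ignores pre-tokenization (the regex) and byte-level details entirely.

```python
def bpe_encode(word, merges):
    """Tokenize one word by replaying learned merge rules in order.

    `merges` is an ordered list of (left, right) pairs, as produced by
    BPE training. This is a simplified sketch: real tokenizers first
    split the text into chunks and work on bytes, both omitted here.
    """
    tokens = list(word)  # start from individual characters
    for left, right in merges:
        merged = []
        i = 0
        while i < len(tokens):
            # Merge every adjacent occurrence of (left, right), left to right.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical toy merge list, in the order the rules were learned.
MERGES = [("u", "g"), ("u", "n"), ("h", "ug")]

print(bpe_encode("hugs", MERGES))                   # ['hug', 's']
print(bpe_encode("hugs", list(reversed(MERGES))))   # ['h', 'ug', 's']
```

The second call replays the same rules in reverse order and gives a different tokenization, which is the sense in which the vocabulary alone does not determine a map: you also need the ordered merge list (or some other fixed encode procedure).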

Ah, thanks very much, Daniel. Yes, now that you mention it, I remember being worried about this a few days ago, but then either forgot or (perhaps mistakenly) decided it wasn't worth expanding on. But yeah, I guess you don't get a well-defined map until you actually fix how the tokenization happens with another separate algorithm. I will add it to the list of things to fix/expand on in an edit.