Notes on the Mathematics of LLM Architectures

Spencer Becker-Kahn


9th Feb 2023


Language Models, AI


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a linkpost for https://drive.google.com/file/d/1hqrHAhZAoDpFsnp1G0fQrQ1SxFtrOhEp/view?usp=sharing

From a mathematical point of view, the building and training of a large transformer language model (LLM) is the construction of a certain function, from one Euclidean space to another, that has certain interesting properties. It may therefore be surprising that many key papers announcing significant new LLMs seem reluctant to spell out the details of the function they have constructed in plain mathematical language, or even in complete pseudocode. The latter form of this complaint is the subject of the recent article of Phuong and Hutter [1]. Here, we focus on one aspect of the former perspective and seek to give a relatively ‘pure’ mathematical description of the architecture of an LLM.

This short PDF is a set of notes I made, initially just for my own benefit, while trying to understand the architecture of 'decoder-only' LLMs. It draws heavily on Anthropic's A Mathematical Framework for Transformer Circuits but is deliberately written in a 'pure math' style.

It was while writing this up to post that I started to develop the thoughts that led to my post about the mathematics of interpretability more generally.

I still consider it something of a fragment or draft, but may develop it further.



Daniel Paleka, 1y

With the vocabulary having been fixed, we now have a canonical way of taking any string $S$ of real text and mapping it to a (finite) sequence of elements from the fixed vocabulary $V$.

Correct me if I'm wrong, but: you don't actually describe any map $\text{Text} \to V^{*}$?
The preceding paragraph explains the Byte-Pair Encoding training algorithm, not the encode step.

The simplified story can be found at the end of the "Implementing BPE" part of the Hugging Face BPE tutorial: it is important to save the merge rules in order and apply them in tokenization. Now the GPT-2 implementation seems very similar but it has a few parts I don't understand completely, e.g. what does that regex do?
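To make the "save the merge rules in order and apply them" step concrete, here is a minimal sketch of an encode step for a single word. It assumes the merge rules are stored as an ordered list of pairs; the function name `bpe_encode` and the toy merge list are hypothetical and made up for illustration. This is not the GPT-2 implementation, which (as far as I can tell) first splits the text into chunks with a regex and then merges within each chunk.

```python
from typing import List, Tuple


def bpe_encode(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Apply merge rules to one word, in the order the rules were learned."""
    tokens = list(word)  # start from individual characters
    for left, right in merges:
        merged: List[str] = []
        i = 0
        while i < len(tokens):
            # Replace each adjacent (left, right) pair by the merged token.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical merge list, as if produced by BPE training (illustration only).
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]

print(bpe_encode("lower", toy_merges))                   # ['low', 'er']
print(bpe_encode("lower", list(reversed(toy_merges))))   # ['e', 'r'] is tried first here
```

Applying the same rules in a different order can give a different token sequence for the same word, which is why the trained vocabulary alone does not pin down a map $\text{Text} \to V^{*}$.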

Spencer Becker-Kahn, 1y

Ah, thanks very much, Daniel. Yes, now that you mention it, I remember being worried about this a few days ago, but then either forgot or (perhaps mistakenly) decided it wasn't worth expanding on. But yeah, I guess you don't get a well-defined map until you actually fix how the tokenization happens with another separate algorithm. I will add this to the list of things to fix/expand on in an edit.
