Note: if you were training GPT-4.1 to output a binary classification result, you might be confused by the OpenAI accuracy plot!
The random baseline for binary classification is 0.83.
Suppose you trained the model to output just True / False and then evaluated a random baseline. You would expect an accuracy of about 0.5, since a random guess over two labels is right half the time. Instead, you would see an accuracy of 0.83. This is because the accuracy is calculated over two extra tokens in addition to the visible one.
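Concretely, with two extra tokens that are essentially always predicted correctly and one visible token that a random guesser gets right half the time, the reported accuracy is

$$\text{accuracy} \approx \frac{2 + 0.5}{3} \approx 0.83,$$

as explained in the rest of this post.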
For our current project, we've been using the OpenAI fine-tuning API. To run some of our experiments, we needed to understand exactly how the reported metrics (loss and accuracy) are calculated. Unfortunately, the official documentation is sparse, and the most detailed explanation we could find was the following table from Microsoft's Azure documentation:
Our experimental results didn't match what we expected from these definitions. So we ran controlled experiments to reverse-engineer the metrics.
What we found:
The loss and accuracy metrics are indeed based on standard cross-entropy loss and token-level accuracy with teacher forcing, but with a critical caveat: both metrics include two additional tokens beyond the visible assistant response. These are likely an end-of-sequence (EOS) token plus another special token, though this is not mentioned in the documentation.
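As a minimal sketch of what we believe is being computed (the helper function and the numbers below are illustrative, not OpenAI's actual implementation), the per-example loss is the mean negative log-probability and the accuracy is the mean top-1 match, taken over the visible assistant tokens plus the two extra tokens:

```python
import math

def example_metrics(target_logprobs, top1_matches):
    """Per-example metrics as we believe they are computed: mean cross-entropy
    and mean top-1 accuracy, under teacher forcing, over the visible assistant
    tokens plus the two extra tokens."""
    n = len(target_logprobs)
    loss = -sum(target_logprobs) / n
    accuracy = sum(top1_matches) / n
    return loss, accuracy

# Hypothetical numbers: the visible token is mispredicted, while the two
# extra tokens are predicted with near certainty.
logprobs = [math.log(0.5), math.log(0.999), math.log(0.999)]
matches = [0, 1, 1]
loss, acc = example_metrics(logprobs, matches)
print(round(loss, 2), round(acc, 2))  # 0.23 0.67
```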
To be concrete, suppose that you are performing SFT on the following conversation:
User: blah blah blah
Assistant: TOKEN1

where TOKEN1 is a single token. We claim that there are two additional tokens, TOKEN2 and TOKEN3, such that the loss is

$$\text{loss} = -\frac{1}{3}\Big(\log p(\text{TOKEN1}) + \log p(\text{TOKEN2}) + \log p(\text{TOKEN3})\Big)$$

and the accuracy is

$$\text{accuracy} = \frac{1}{3}\Big(\mathbb{1}[\text{TOKEN1 correct}] + \mathbb{1}[\text{TOKEN2 correct}] + \mathbb{1}[\text{TOKEN3 correct}]\Big).$$
We don't know exactly what TOKEN2 and TOKEN3 are, but they're not necessarily appended at the end of the assistant's visible reply.
If we assume that TOKEN2 and TOKEN3 are predicted with near certainty, then we expect

$$\log p(\text{TOKEN2}) \approx \log p(\text{TOKEN3}) \approx 0 \quad\text{and}\quad \mathbb{1}[\text{TOKEN2 correct}] = \mathbb{1}[\text{TOKEN3 correct}] = 1.$$

So we expect

$$\text{loss} \approx -\frac{1}{3}\log p(\text{TOKEN1})$$

and

$$\text{accuracy} \approx \frac{2 + \mathbb{1}[\text{TOKEN1 correct}]}{3},$$

which is 2/3 if TOKEN1 is mispredicted and 1 if it is predicted correctly. [1]
In this section we run controlled experiments to verify our claim.
We fine-tune GPT-4.1 on datasets where each conversation has the following structure:
User: ?
Assistant: {COLOR}

Each dataset contains approximately 6,000 conversations with colors uniformly distributed. We created four different datasets with varying numbers of colors:
- 2 colors: BLUE, GREEN
- 3 colors: BLUE, GREEN, RED
- 5 colors: BLUE, GREEN, RED, BLACK, WHITE
- 6 colors: BLUE, GREEN, RED, BLACK, WHITE, GRAY
All these color names are single tokens in the GPT-4.1 tokenizer. We trained each model for 2 epochs with default batch size and learning rate multiplier, which were sufficient to reach convergence.
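For reference, a dataset with this structure can be generated in the JSONL chat format accepted by the OpenAI fine-tuning API. The sketch below is a minimal illustration (the file name and helper function are hypothetical, not our exact script):

```python
import json
import random

COLORS = ["BLUE", "GREEN", "RED", "BLACK", "WHITE", "GRAY"]

def write_dataset(num_colors: int, num_conversations: int = 6000,
                  path: str = "colors.jsonl") -> None:
    """Write a fine-tuning dataset where the assistant replies with a color
    drawn uniformly from the first `num_colors` colors."""
    colors = COLORS[:num_colors]
    with open(path, "w") as f:
        for _ in range(num_conversations):
            example = {
                "messages": [
                    {"role": "user", "content": "?"},
                    {"role": "assistant", "content": random.choice(colors)},
                ]
            }
            f.write(json.dumps(example) + "\n")

write_dataset(num_colors=2, path="colors_2.jsonl")
```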
For validation, we used a single conversation:
User: ?
Assistant: BLUE
Based on the previous section, we should expect:

| | PROB | LOSS | MIN ACC | MAX ACC | AVG ACC |
|---|---|---|---|---|---|
| 2 colors | 1/2 | 0.23 | 0.667 | 1 | 0.83 |
| 3 colors | 1/3 | 0.36 | 0.667 | 1 | 0.77 |
| 5 colors | 1/5 | 0.53 | 0.667 | 1 | 0.73 |
| 6 colors | 1/6 | 0.59 | 0.667 | 1 | 0.72 |

Here PROB = 1/(number of colors) is the probability of predicting the color token correctly, LOSS ≈ −ln(PROB)/3 is the corresponding expected loss, and MIN ACC, MAX ACC, and AVG ACC = (2 + PROB)/3 are the minimum, maximum, and expected values of the reported accuracy.
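These expected values can be reproduced in a few lines (a sketch under the assumptions above, not OpenAI's code; the printed numbers match the table up to rounding):

```python
import math

# Expected metrics assuming the model assigns probability 1/k to each of the
# k colors and always gets the two extra tokens right.
for k in (2, 3, 5, 6):
    p = 1 / k
    loss = -math.log(p) / 3   # only the color token contributes to the loss
    avg_acc = (2 + p) / 3     # two always-correct tokens plus the color token
    print(f"{k} colors: loss ≈ {loss:.3f}, avg accuracy ≈ {avg_acc:.3f}")
```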
After fine-tuning, we accessed the training metrics from the OpenAI fine-tuning dashboard at https://platform.openai.com/finetune. The following plots show the loss and accuracy curves for each dataset and verify our predictions.
Note that the training batch size was approximately 8, which explains the observed fluctuations in accuracy around the mean values shown in the table.
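To get a feel for the size of these fluctuations, here is a small illustrative simulation (not derived from the actual training logs): each step averages accuracy over roughly 8 conversations of 3 scored tokens each, so the step-level accuracy scatters around the AVG ACC values in the table.

```python
import random

def simulate_step_accuracies(num_colors: int, batch_size: int = 8, steps: int = 20):
    """Simulate per-step training accuracy assuming the two extra tokens are
    always correct and the color token is correct with probability 1/num_colors."""
    p = 1 / num_colors
    accs = []
    for _ in range(steps):
        correct_colors = sum(random.random() < p for _ in range(batch_size))
        # 3 scored tokens per conversation: 2 always-correct + 1 color token
        accs.append((2 * batch_size + correct_colors) / (3 * batch_size))
    return accs

print(simulate_step_accuracies(num_colors=2))  # values scattered around ~0.83
```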
2 Colors
3 Colors
5 Colors
6 Colors
We have shown that OpenAI's fine-tuning metrics include two additional tokens beyond the visible assistant response when computing loss and accuracy. These are likely an EOS token and another special token. This has been verified experimentally for GPT-4.1, and preliminary experiments suggest the same behavior holds for GPT-4.1-mini and GPT-4.1-nano. We have not tested this on longer sequences of tokens.
We're sharing this because we were initially confused by the unexpected metric values and wished we had found documentation or a discussion of this online. We hope this post will be useful to others who encounter similar issues.
This work is part of our ongoing project with Jan Betley, Dylan Feng, Anna Sztyber-Betley, Andy Arditi, and Owain Evans. Jorio Cocola is currently a MATS 8.1 scholar.
[1] A quick way to guess the number of additional tokens: if there are n additional tokens (always predicted correctly) plus 1 visible token, then when the visible token is incorrectly predicted, accuracy = n/(n+1). Since we observe 0.67 ≈ 2/3, this suggests n = 2.